New BI sample benchmark tests #1995
Conversation
Closes #515. Non-breaking addition to the `Library.finalize_staged_data` API to allow persisting metadata along with the symbol.
#### Checklist

<details>
<summary>Checklist for code changes...</summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

</details>

Co-authored-by: Vasil Pashov <[email protected]>
- Separates non-batch and batch BasicFunctions; uses multiple symbols only where required
- Use a single symbol for ModificationFunctions
- Use fewer rows for batch benchmarks
- Decrease version-chain length for IteratingVersionChain
#### Reference Issues/PRs

Implements request #1003.
#### Reference Issues/PRs

Implements #1391 and fixes #1532.

- Added `read_index` in `LibraryTool` (as required in #1391)
- Allowed `Library` to get a `LibraryTool` and call `read_index` (to fix #1532)

Also updated the docs accordingly.
* Storage access docs moved to their own section at the bottom
* Code formatting now copy/paste friendly
* Rewrite of getting started
* Section on transactions added

Co-authored-by: Nick Clarke <[email protected]>
The short_wide ModificationFunction benchmarks were failing because of an error in the #1530 restructure.
… columns (#1538)

#### What does this implement or fix?

This fixes two things:

1. When using dynamic schema without a direct read, the function copying the source column to the destination column now checks whether the source is of empty type. If it is, it calls `default_initialize`. Previously it tried to call `memcpy` and crashed because it dereferenced NULL.
2. During decoding (both v1 and v2), the number of rows in a segment is now determined by taking the maximum number of rows across the columns. Previously it was determined by the number of rows in the first column; if the first column was of empty type it would report 0 rows, which is wrong and led to creating sparse maps for dense columns.

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs

Fixes #944. The error message now includes the field index, which gives a clearer error when two columns are simply swapped. To do this, a formatter is defined for `FieldCollection`: the index is not stored in `Field` (previously the `Field` formatter was used to compose the error message) and is only known at the `FieldCollection` level.

The pattern-matching regular expression is also updated, as the output now becomes `FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>, FD<name=col2, type=TD<type=type2, dim=0>, idx=1>`. After `idx` was added, the previous regular expression separated this into

```python
["FD<name=col1, type=TD<type=typ1, dim=0>", "idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>", "idx=1>"]
```

but we want

```python
["FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>"]
```

With the fix, the output now explains the error much better:

```
The columns (names and types) in the argument are not identical to that of the existing version: APPEND (Showing only the mismatch. Full col list saved in the `last_mismatch_msg` attribute of the lib instance. '-' marks columns missing from the argument, '+' for unexpected.)
-FD<name=col1, type=TD<type=INT64, dim=0>, idx=0>
-FD<name=col2, type=TD<type=INT64, dim=0>, idx=1>
+FD<name=col2, type=TD<type=INT64, dim=0>, idx=0>
+FD<name=col1, type=TD<type=INT64, dim=0>, idx=1>
```
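The splitting behaviour described above can be demonstrated with a small sketch. The pattern below is illustrative, not the project's actual regular expression:

```python
import re

msg = ("FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>, "
       "FD<name=col2, type=TD<type=typ2, dim=0>, idx=1>")

# Anchor each match on the trailing "idx=N>" so the '>' closing the
# nested TD<...> does not terminate the field descriptor early.
fields = re.findall(r"FD<.*?idx=\d+>", msg)
```

Each element of `fields` is now a complete `FD<..., idx=N>` descriptor, including its nested `TD<...>`.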
We noticed a speed regression in the version_chain benchmarks arising from #1152. Profiling showed that the majority of the slowdown was because the `~SegmentInMemory` destructor became a lot slower. And all of the slowdown was because of the recurring `ConfigsMap::get_int` inside `maybe_trim`. We fix this by again using a `static` variable for getting the config value. I've verified that there aren't other similar regressions from that PR.
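A Python analogue of the caching fix may help illustrate it (the actual C++ change uses a function-local `static`; the config map and names here are illustrative, not ArcticDB's API):

```python
from functools import lru_cache

CONFIG = {"VersionMap.ReloadInterval": 1000}

# Illustrative stand-in for the config lookup: caching the result mirrors
# hoisting the repeated read out of a hot destructor path, so it runs once
# per process instead of once per destroyed object.
@lru_cache(maxsize=None)
def get_int_cached(name: str, default: int) -> int:
    return CONFIG.get(name, default)

values = [get_int_cached("VersionMap.ReloadInterval", 0) for _ in range(3)]
```

Only the first call performs the lookup; subsequent calls hit the cache.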
We need to take the GIL when Python needs to allocate special characters, otherwise we will deadlock. At the moment read() releases the GIL but update() and append() don't
Docs-only change.

Co-authored-by: Nick Clarke <[email protected]>
#### Reference Issues/PRs

Closes: #1226

#### What does this implement or fix?

* Remove the environment variable which was used to enable the consolidation phase
* Make our custom block manager cast blocks for empty columns to float64 (to be removed once empty types are enabled)

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs

Fixes the 10th item of the folly replacement plan, issue #1412.

#### What does this implement or fix?

This removes the single use of `folly/ThreadCachedInt`. It is replaced by a partial vendoring of the `folly` code plus use of `boost::thread_specific_ptr`.

`ThreadCachedInt` is used to count the number of freed memory blocks. It is (presumably) not implemented as a single atomic integer count because the contention would be too slow; instead each thread has its own count, and when a single thread's count exceeds some threshold it is added to the overall count. The original `folly` implementation has two ways of reading the count: slow (fully accurate) and fast (not fully accurate). ArcticDB only uses the fast option, so the implementation is much simpler than `folly`'s, requiring fewer `atomic`s.

The new class `ThreadCachedInt` in `thread_cached_int.hpp` is derived from https://github.com/facebook/folly/blob/main/folly/ThreadCachedInt.h but contains only the subset of functionality required. Instead of `folly`'s own `ThreadLocalPtr`, it uses `boost::thread_specific_ptr`. The `folly` source code claims their implementation is 4x faster than the `boost` one (https://github.com/facebook/folly/blob/dbc9e565f54eabb40ad6600656ad9dea919f51c0/folly/ThreadLocal.h#L18-L20), but that claim dates from 12 years ago and this solution is simpler than theirs. This does need to be benchmarked against `master` to confirm that it is not measurably slower.

#### Any other comments?

The only place this is ultimately used is to control when `malloc_trim` is called (https://github.com/man-group/ArcticDB/blob/e3fab24b653439f9894495a2657bb2dcfc1fbb42/cpp/arcticdb/util/allocator.cpp#L286-L288) to release memory back to the system. This only occurs on Linux. Other OSes could have all of this code removed, but that would be a bigger change with many `#ifdef` blocks, etc.

Signed-off-by: Ian Thomas <[email protected]>
The SlabAlloc tests were very slow on the build servers. Investigation showed this was due to poor performance of the slab allocator with many threads. More details: https://manwiki.maninvestments.com/display/AlphaTech/Slab+Allocator+poor+multi-threaded+performance+with+aggressive+allocations

This commit does the following:
- Places an upper limit of 8 on `num_threads` (which drastically speeds up the tests on the build servers)
- Improves multi-threaded performance by ~20% by removing unneeded `atomic::load()` calls (`atomic::compare_exchange_strong` already does this for us)
- Adds compiled-away logging for lock contention; can be enabled with `#define LOG_SLAB_ALLOC_INTERNALS`
#### What does this implement or fix?

We should only write the version ref key once when we write with `prune_previous_versions=True`. Currently we write it twice: once after writing the tombstone-all key and once when writing the new version. This means there is a period of time during which the symbol is unreadable. This was fixed a while ago in PR #1104 but regressed with PR #1355.
#### What does this implement or fix?

`SymbolDescription.last_update_time` is a `pandas._libs.tslibs.timestamps.Timestamp`, which is a `datetime.datetime` subtype, not a `numpy.datetime64`.

#### Any other comments?

Should this be `Union[datetime.datetime, numpy.datetime64]`?
…ilder (#1557)

#### Reference Issues/PRs

Fixes #1148. Previously, if a type was mismatched in the QueryBuilder as follows:

```python
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4], "col3": [4, 5, 6], "col_str": ["1", "2", "3"]})
sym = "symbol"
lib.write(sym, df1)

q = QueryBuilder()
q = q[q["col1"] + 1 == "3"]
lib.read(sym, query_builder=q)
```

it would show an unclear message, with the `UTF_DYNAMIC64` type shown for strings:

```
arcticdb_ext.exceptions.InternalException: E_ASSERTION_FAILURE Cannot compare TD<type=UTF_DYNAMIC64, dim=0> to TD<type=UINT32, dim=0> (possible categorical?)
```

Now the error is much clearer: column names are generated according to the query, and the STRING type is shown for strings:

```
arcticdb_ext.exceptions.UserInputException: E_INVALID_USER_ARGUMENT Invalid comparison (col1 + 1) (TD<type=INT64, dim=0>) == "3" (TD<type=STRING, dim=0>)
```

For a more complex query like `q = q[1 + q["col1"] * q["col2"] - q["col3"] == q["col_str"]]`, the error message will show the column name `((1 + (col1 * col2)) - col3)`, which allows the user to better understand the error.
#### What does this implement or fix?

This PR aims to provide a way to get the name/identifier of a given store. [This is what the identifier](https://github.com/man-group/ArcticDB/blob/f208433e44d17f03011a96db64ef25dba3fccaa4/cpp/arcticdb/storage/s3/s3_storage.cpp#L57) looks like now:

- `s3_storage-us-east-1/test_bucket_1/local/lib-kOGPh-3-True-target-1`
- For S3: `{storage_type}-{region}/{bucket}/{prefix or lib_name}`
- For the other stores: `{storage_type}-{prefix or lib_name}`
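The identifier formats described above can be sketched as simple format strings (helper names here are illustrative, not the project's actual functions):

```python
# Sketch of the two identifier shapes: S3 includes region and bucket,
# the other stores only the storage type and prefix/library name.
def s3_identifier(region: str, bucket: str, prefix_or_lib: str) -> str:
    return f"s3_storage-{region}/{bucket}/{prefix_or_lib}"

def generic_identifier(storage_type: str, prefix_or_lib: str) -> str:
    return f"{storage_type}-{prefix_or_lib}"

ident = s3_identifier("us-east-1", "test_bucket_1", "local/lib-kOGPh-3-True-target-1")
```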
Signed-off-by: Julien Jerphanion <[email protected]>
- Revoke removing assert
- Remove useless CA cert path in non-SSL-enabled testing environment
- Address PR comments
- Better test
- Update docs/mkdocs/docs/api/arctic_uri.md

Co-authored-by: Alex Seaton <[email protected]>
I've added a few new points to the FAQs. Please find the link below; I would appreciate any feedback. https://1a1ca458.arcticdb-docs.pages.dev/dev/faq/
Closes #1010
…uct is too big (#1981)

#### Reference Issues/PRs

Fixes man-group/arcticdb-man#127

#### What does this implement or fix?

Changes the error message shown when the metastruct for recursively normalized data is too large, so that it no longer references user-defined metadata.
#### Reference Issues/PRs

Fixes #1841

#### What does this implement or fix?

Before this change, if a `Series` had an empty string as its name, it would be round-tripped as `None`. This introduces a `has_name` bool to the normalization metadata protobuf as a backwards-compatible way of effectively making the `name` field optional. The behaviour (which has been verified) can be summarised as follows:

```
Writer version | Series name | Protobuf name field | Protobuf has_name field | Series name read by <=5.0.0 | Series name read by this branch
---------------|-------------|---------------------|-------------------------|-----------------------------|--------------------------------
<=5.0.0        | "hello"     | "hello"             | Not present             | "hello"                     | "hello"
<=5.0.0        | ""          | ""                  | Not present             | None                        | None
<=5.0.0        | None        | ""                  | Not present             | None                        | None
This branch    | "hello"     | "hello"             | True                    | "hello"                     | "hello"
This branch    | ""          | ""                  | True                    | None                        | ""
This branch    | None        | ""                  | False                   | None                        | None
```
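The root cause can be seen with plain pandas: an empty-string name and a missing name are distinct in pandas, but both serialize to the same protobuf string, which is why a separate `has_name` flag is needed:

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name="hello")
s_empty = pd.Series([1, 2, 3], name="")   # named, but with the empty string
s_unnamed = pd.Series([1, 2, 3])          # name is None

# pandas keeps "" and None distinct...
assert s_empty.name == "" and s_unnamed.name is None

# ...but a protobuf string field serializes both as "", losing the
# distinction unless an explicit has_name flag is carried alongside.
serialized = [s.name or "" for s in (s_empty, s_unnamed)]
```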
#### Reference Issues/PRs

BitMagic is failing ArcticDB builds. The issue is reported [here](tlk00/BitMagic#76) and is fixed in master. A ticket to port the fix into vcpkg was made [here](microsoft/vcpkg#41935). Use a custom overlay while waiting for a proper port.
Also, since 4.5.1 is now available, we replace 4.5.0 with 4.5.1, which should re-enable the lmdb and mongo tests with their segfaults fixed.
The issue with the conda feedstock runners is that they use a variable for the abspath which doesn't get correctly expanded by `os.path.expandvars`. This requires using `shell=True` for Linux as well. This was tested to properly fix the compat tests in #1931 combined with the [feedstock PR](conda-forge/arcticdb-feedstock#322). However, it still doesn't work for some of the macOS builds, which are not skipped correctly. This is still a good change for local runs as well, and we'll fix the macOS issue in a separate run.
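The failure mode can be seen with the standard library directly (variable names here are arbitrary): `os.path.expandvars` silently leaves unknown variables untouched, whereas a shell (`shell=True`) expands them at execution time.

```python
import os

os.environ["DEMO_DIR"] = "/tmp/data"
os.environ.pop("SOME_UNSET_RUNNER_VAR", None)

# expandvars substitutes variables present in the current environment...
expanded = os.path.expandvars("$DEMO_DIR/file.parquet")

# ...but leaves unknown variables untouched rather than failing, which is
# the silent failure mode described above.
unexpanded = os.path.expandvars("$SOME_UNSET_RUNNER_VAR/file.parquet")
```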
The destruction bug on mongo that I thought I fixed in #1862 can still be seen. We skip the mongo test for now.
After we fix the segfaults we can re-enable it.
#### Reference Issues/PRs

Fixes #1937

#### What does this implement or fix?

See the ticket for details. Being able to create the append-ref linked-list structure is useful for testing, so this has been moved to the `LibraryTool`.
#### Reference Issues/PRs Closes #1895 Fixes man-group/arcticdb-man#171 Fixes #1936 Fixes #1939 Fixes #1940 #### What does this implement or fix? Schedules all work asynchronously in batch reads when processing is involved, as well as when all symbols are being read directly. Previously, symbols were processed sequentially, leading to idle CPUs when processing lots of smaller symbols. This works by making `read_frame_for_version` schedule work and return futures, rather than actually performing the processing. This implementation can then be used for all 4 combinations of batch/non-batch and direct/with processing reads, significantly simplifying the code and removing the now redundant `async_read_direct` (the fact that there were two different implementations to achieve effectively the same thing is what led to 2 of the bugs in the first place). Several bugs that were discovered during the implementation (flagged above) have also been fixed. Further work in this area covered in #1968
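A toy illustration of the scheduling change, with standard-library futures standing in for the internal task scheduler (names and workload are illustrative, not ArcticDB's API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_symbol(symbol: str) -> str:
    # Stand-in for per-symbol read + processing work.
    return symbol.upper()

symbols = ["sym_a", "sym_b", "sym_c"]

# Schedule all work up front and gather futures at the end, instead of
# processing each symbol to completion before starting the next one —
# the sequential pattern that left CPUs idle on many small symbols.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(process_symbol, s) for s in symbols]
    results = [f.result() for f in futures]
```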
…ent symbol (#1991)

#### Reference Issues/PRs

Closes man-group/arcticdb-ursus#8

#### What does this implement or fix?

The actual bug was fixed, probably in #1950; this just adds a test of the correct behaviour.
#### Reference Issues/PRs

Fixes #1655
#### What does this implement or fix?

Add a Coverity scan. The current implementation does not get PR comments and does not block the build.

Co-authored-by: Vasil Pashov <[email protected]>
#### What does this implement or fix?

Make static analysis a cron job running at 3 A.M. Disable the on-PR run, as there is currently a limit of 10 branches.
Force-pushed 363c7e5 to 6076d15
Not necessary for this PR, but it might be nice to run these benchmarks for the `read_batch` method as well.
But this will probably be more reasonable if we add more of these data sets in the future.
We will need to set up Git LFS for this file before merging the PR.
python/benchmarks/bi_benchmarks.py
Outdated
```python
        df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
        return df.data

    def get_query_groupaby_city_count_isin_filter(self, q):
```
Typo: should be `groupby`. Applies to all the methods with the same typo.
done!
python/benchmarks/bi_benchmarks.py
Outdated
```python
    def time_query_readall(self, times_bigger):
        self.lib.read(f"{self.symbol}{times_bigger}")

    def get_query_groupaby_city_count_all(self, q):
```
nit: it looks like there is no need for these to be methods of the class; they can be defined outside the class, which will make it a bit more readable.
done!
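A sketch of the suggestion (function and column names are illustrative): the pandas reference queries can live at module level, keeping the benchmark class itself focused on the timed operations.

```python
import pandas as pd

# Module-level reference query rather than a method on the benchmark class.
def groupby_city_count_all(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("City").agg({"Keyword": "count"})

df = pd.DataFrame({"City": ["NY", "NY", "LA"], "Keyword": ["a", "b", "c"]})
counts = groupby_city_count_all(df)
```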
python/benchmarks/bi_benchmarks.py
Outdated
```python
        _df = self.df.copy(deep=True)
        arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
        _df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
        arctic_df.sort_index(inplace=True)
```
Looks like the sorting of the index is already done in the `assert_frame_equal` function, so it is not needed here.
Indeed! A late refactoring omission.
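A sketch of a comparison helper that does the sorting itself, making an extra `sort_index()` at the call site redundant (the helper name is illustrative):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_frames_match(expected: pd.DataFrame, actual: pd.DataFrame) -> None:
    # Sort both sides here, so callers never need to pre-sort.
    assert_frame_equal(expected.sort_index(), actual.sort_index())

a = pd.DataFrame({"x": [1, 2]}, index=[1, 0])
b = pd.DataFrame({"x": [2, 1]}, index=[0, 1])
assert_frames_match(a, b)  # passes despite differing row order
```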
python/benchmarks/bi_benchmarks.py
Outdated
```python
        del self.ac

    def assert_frame_equal(self, pandas_df: pd.DataFrame, arctic_df: pd.DataFrame):
```
nit: Looks like this doesn't need to be a method of the class
agree!
python/benchmarks/bi_benchmarks.py
Outdated
```python
        df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
        return df.data

    def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
```
Might be a good idea to add a `peakmem_...` variant for the other `time_...` benchmarks as well.
I was considering this too. I will add them later; we can remove them easily.
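For reference, asv pairs benchmarks by method prefix: `time_...` methods are timed and `peakmem_...` methods have their peak resident memory recorded. A minimal sketch, with pandas standing in for the ArcticDB calls (method bodies are illustrative):

```python
import pandas as pd

class GroupBySuite:
    def setup(self):
        self.df = pd.DataFrame({"City": ["NY", "LA"] * 1000,
                                "Favorites": range(2000)})

    def time_groupby_city_count(self):
        return self.df.groupby("City")["Favorites"].count()

    # Same body, different prefix: asv reports peak memory instead of time.
    def peakmem_groupby_city_count(self):
        return self.df.groupby("City")["Favorites"].count()

suite = GroupBySuite()
suite.setup()
counts = suite.time_groupby_city_count()
```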
I have now also added printing of info for the large dataframes produced by this test, so our first test runs with a 1 GB DF while the last runs with a 10 GB DF.

Output:

```
Parquet file exists!
The procedure is creating N times larger dataframes
by concatenating original DF N times
DF for iterration xSize original ready: 10
<class 'pandas.core.frame.DataFrame'>
Index: 9126570 entries, 0 to 912656
Data columns (total 31 columns):
 #   Column               Dtype
---  ------               -----
 0   City/Admin           object
 1   City/State           object
 2   City                 object
 3   Created Date/Time    float64
 4   Date Joined          float64
 5   FF Ratio             float64
 6   Favorites            int32
 7   First Link in Tweet  object
 8   Followers            int32
 9   Following            int32
 10  Gender               object
 11  Influencer?          int32
 12  Keyword              object
 13  LPF                  float64
 14  Language             object
 15  Lat                  float64
 16  Listed Number        int32
 17  Long Domain          object
 18  Long                 float64
 19  Number of Records    int32
 20  Region               object
 21  Short Domain         object
 22  State/Country        object
 23  State                object
 24  Tweet Text           object
 25  Tweets               int32
 26  Twitter Client       object
 27  User Bio             object
 28  User Loc             object
 29  Username 1           object
 30  Username             object
dtypes: float64(6), int32(7), object(18)
memory usage: 10.8 GB
```
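The scaling step described in the output ("creating N times larger dataframes by concatenating original DF N times") can be sketched as follows (function name is illustrative):

```python
import pandas as pd

def scale_df(df: pd.DataFrame, times_bigger: int) -> pd.DataFrame:
    # Repeat the original frame; keeping the original index reproduces the
    # "Index: ... entries, 0 to N" pattern seen in the info output.
    return pd.concat([df] * times_bigger)

base = pd.DataFrame({"City": ["NY", "LA"], "Favorites": [1, 2]})
big = scale_df(base, 10)
```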
Force-pushed 5c21cbe to 600bd5a
A new PR was created to resolve the issues: #2019
#### Reference Issues/PRs

From PR #1995 (all comments addressed). A sample benchmark test using one open-source BI CSV source. The logic of the test is:

- download the source in .bz2 format if the parquet file does not exist
- convert it to parquet format
- prepare a library containing several symbols constructed from this DF
- for each query we want to benchmark, pre-check that the query produces the same result in Pandas and ArcticDB
- run the benchmark tests

Co-authored-by: Georgi Rusev <Georgi Rusev>