
new bi sample benchmark tests #1995

Closed · wants to merge 1,187 commits into from

Conversation

@grusev (Collaborator) commented Nov 12, 2024

Reference Issues/PRs

What does this implement or fix?

NEW PR CREATED: #2019

    Sample test benchmark using one open-source BI CSV source.
    The logic of the test is (see the sketch below):
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DataFrame
        - for each query we want to benchmark, pre-check that the query produces the SAME result on pandas and ArcticDB
        - run the benchmark tests
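A minimal sketch of that flow in ASV style (the URL, library name, column, and query are illustrative placeholders, not the PR's exact code):

```python
import os

import pandas as pd
from arcticdb import Arctic, QueryBuilder

PARQUET = "bi_data.parquet"
SOURCE_BZ2 = "bi_data.csv.bz2"  # hypothetical download target

class BIBenchmarks:
    params = [1, 10]  # how many times larger than the original DF
    param_names = ["times_bigger"]

    def setup_cache(self):
        # 1./2. download the .bz2 source and convert it only if the parquet file is missing
        if not os.path.exists(PARQUET):
            pd.read_csv(SOURCE_BZ2, compression="bz2").to_parquet(PARQUET)
        # 3. prepare a library with one symbol per size multiplier
        df = pd.read_parquet(PARQUET)
        lib = Arctic("lmdb://bi_bench").get_library("bi", create_if_missing=True)
        for n in self.params:
            lib.write(f"sym{n}", pd.concat([df] * n, ignore_index=True))

    def setup(self, times_bigger):
        self.lib = Arctic("lmdb://bi_bench")["bi"]
        self.df = pd.read_parquet(PARQUET)
        # 4. pre-check: the query must produce the SAME result on pandas and ArcticDB
        q = QueryBuilder()
        q = q[q["Followers"] > 1000]
        arctic_df = self.lib.read("sym1", query_builder=q).data
        pandas_df = self.df[self.df["Followers"] > 1000]
        pd.testing.assert_frame_equal(
            pandas_df.reset_index(drop=True), arctic_df.reset_index(drop=True)
        )

    # 5. the benchmarks themselves
    def time_query_readall(self, times_bigger):
        self.lib.read(f"sym{times_bigger}")
```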

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

alexowens90 and others added 30 commits April 25, 2024 09:23
Closes #515 

Non-breaking addition to `Library.finalize_staged_data` API to allow
persisting of metadata along with the symbol.
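A quick sketch of the new capability (library URI, symbol, and metadata are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)

# Stage two chunks; they are not readable until finalized
lib.write("sym", pd.DataFrame({"a": [1, 2]}, index=pd.date_range("2024-01-01", periods=2)), staged=True)
lib.write("sym", pd.DataFrame({"a": [3, 4]}, index=pd.date_range("2024-01-03", periods=2)), staged=True)

# New in this change: metadata can be persisted along with the finalized symbol
lib.finalize_staged_data("sym", metadata={"source": "nightly-load"})
assert lib.read_metadata("sym").metadata == {"source": "nightly-load"}
```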

Co-authored-by: Vasil Pashov <[email protected]>
Closes #1507 
Closes #1509

While fixing, noticed that the added test
`test_append_range_index_from_zero` would also not have passed, so fixed
this too.
- Separates non-batch and batch BasicFunctions. Uses multiple symbols
  only where it is required
- Use single symbol for ModificationFunctions
- Use fewer rows for batch benchmarks
- Decrease version chain length for IteratingVersionChain
#### Reference Issues/PRs
Implements request #1003 

#### Reference Issues/PRs
Implements #1391 and fixes #1532.
- Added read_index in LibraryTool (as required in #1391)
- Allowed Library to get LibraryTool and call read_index (to fix #1532).
Also updated the docs accordingly.
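A hedged sketch of how this might be used; the exact accessor on `Library` may differ, and the symbol name is illustrative:

```python
from arcticdb import Arctic

lib = Arctic("lmdb://example")["demo"]

# Reach the LibraryTool from the Library and read the index structure of a symbol
lt = lib._nvs.library_tool()      # assumes the underlying NativeVersionStore accessor
index_df = lt.read_index("sym")   # the new call added for #1391
print(index_df)
```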

* storage access docs moved to their own section at the bottom
* code formatting now copy/paste friendly
* rewrite of getting started
* section on transactions added

Co-authored-by: Nick Clarke <[email protected]>
The short_wide ModificationFunction benchmarks were failing because of
an error in the #1530 restructure.
… columns (#1538)


#### What does this implement or fix?
This fixes two things:
1. When using dynamic schema without direct read, the function copying
the source column to the destination column now checks whether the
source is of empty type. If it is, it calls default_initialize.
Previously it tried to call memcpy and crashed because it dereferenced
NULL.
2. During decoding (both v1 and v2), the number of rows in a segment is
now determined by taking the max number of rows across the columns.
Previously it was determined by the number of rows in the first column;
however, if the first column was of empty type it would report 0 rows,
which is wrong. This led to creating sparse maps for dense columns.
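A hedged repro sketch of the failing shape (assumes a dynamic-schema library; whether the empty type is active depends on build configuration):

```python
import pandas as pd
from arcticdb import Arctic, LibraryOptions

ac = Arctic("lmdb://example")
lib = ac.get_library("dyn", create_if_missing=True,
                     library_options=LibraryOptions(dynamic_schema=True))

# "col_a" holds only None, so it can be stored as the empty type; appending
# rows that later give it a real type used to crash in the column-copy path
lib.write("sym", pd.DataFrame({"col_a": [None, None], "col_b": [1, 2]},
                              index=pd.date_range("2024-01-01", periods=2)))
lib.append("sym", pd.DataFrame({"col_a": [1.0, 2.0], "col_b": [3, 4]},
                               index=pd.date_range("2024-01-03", periods=2)))
print(lib.read("sym").data)
```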

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs
Fixes #944
The error message now includes the field index, which gives a clearer
error when two columns are simply swapped.
To do this, a formatter is defined for `FieldCollection`, because the
index is not saved in `Field` (previously the formatter of `Field` was
used to compose the error message); it is only known at the
`FieldCollection` level.

Furthermore, the pattern matching regular expression is also updated as
the output now becomes `FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>,
FD<name=col2, type=TD<type=type2, dim=0>, idx=1>`. After `idx` was
added, the previous regular expression was separating this into
```python
["FD<name=col1, type=TD<type=typ1, dim=0>", "idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>", "idx=1>"]
```
but we want
```python
["FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>"]
```
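For illustration, a non-greedy pattern that anchors on the new `idx` field yields the desired grouping (a sketch, not necessarily the exact expression used in the test):

```python
import re

s = ("FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>, "
     "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>")

# Match lazily from each "FD<" up to its own "idx=N>" terminator
fields = re.findall(r"FD<.*?idx=\d+>", s)
assert fields == [
    "FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>",
    "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>",
]
```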

Now the fix shows the following output which explains the error better:
```
The columns (names and types) in the argument are not identical to that of the existing version: APPEND 
(Showing only the mismatch. Full col list saved in the `last_mismatch_msg` attribute of the lib instance.
'-' marks columns missing from the argument, '+' for unexpected.)
-FD<name=col1, type=TD<type=INT64, dim=0>, idx=0>
-FD<name=col2, type=TD<type=INT64, dim=0>, idx=1>
+FD<name=col2, type=TD<type=INT64, dim=0>, idx=0>
+FD<name=col1, type=TD<type=INT64, dim=0>, idx=1>
```

We noticed a speed regression in the version_chain benchmarks arising
from #1152.

Profiling showed that the majority of the slowdown was because the
`~SegmentInMemory` destructor became a lot slower. And all of the
slowdown was because of the recurring `ConfigsMap::get_int` inside
`maybe_trim`.

We fix this by again using a `static` variable for getting the config
value.

I've verified that there aren't other similar regressions from that PR.
We need to take the GIL when Python needs to allocate special characters, otherwise we will deadlock. At the moment read() releases the GIL, but update() and append() don't.
Docs-only change

---------

Co-authored-by: Nick Clarke <[email protected]>
#### Reference Issues/PRs
Closes: #1226
#### What does this implement or fix?
* Remove the environment variable which was used to enable the
consolidation phase
* Make our custom block manager cast blocks for empty columns to
float64. To be removed after empty types become enabled.

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs

Fixes the 10th item of the folly replacement plan, issue #1412.

#### What does this implement or fix?

This removes the single use of `folly/ThreadCachedInt`. It is replaced
by a partial vendoring of the `folly` code plus use of
`boost::thread_specific_ptr`.

`ThreadCachedInt` is used to count the number of freed memory blocks. It
is (presumably) not just implemented as an atomic integer count as
thread locking would be too slow, so instead each thread has its own
count and when a single thread's count exceeds some threshold it is
added to the overall count. The original `folly` implementation has two
ways of reading the count which are slow (fully accurate) and fast (not
fully accurate). ArcticDB only uses the fast option, so the
implementation is much simpler than `folly`'s, requiring fewer
`atomic`s.
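To illustrate the counting scheme, here is a Python sketch of the idea (not the C++ implementation; the threshold is made up):

```python
import threading

class ThreadCachedIntSketch:
    """Per-thread counters that flush into a shared total only when they
    exceed a threshold, so the shared state is touched rarely."""

    def __init__(self, flush_threshold=1000):
        self._flush_threshold = flush_threshold
        self._total = 0
        self._lock = threading.Lock()       # stands in for the C++ atomic
        self._local = threading.local()     # stands in for thread_specific_ptr

    def increment(self, n=1):
        cached = getattr(self._local, "count", 0) + n
        if cached >= self._flush_threshold:
            with self._lock:                # contended path taken rarely
                self._total += cached
            cached = 0
        self._local.count = cached

    def read_fast(self):
        # Like folly's fast read: may under-count by up to
        # flush_threshold per thread, but needs no synchronisation
        return self._total
```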

New class `ThreadCachedInt` in `thread_cached_int.hpp` is derived from
https://github.com/facebook/folly/blob/main/folly/ThreadCachedInt.h but
containing only the subset of functionality required. Instead of using
`folly'`s own `ThreadLocalPtr` this uses `boost::thread_specific_ptr`.
The `folly` source code claims that their implementation is 4x faster
than the `boost` one:


https://github.com/facebook/folly/blob/dbc9e565f54eabb40ad6600656ad9dea919f51c0/folly/ThreadLocal.h#L18-L20

but that claim dates from 12 years ago and this solution is simpler than
theirs. This does need to be benchmarked against `master` to confirm
that it is not measurably slower.

#### Any other comments?

The only place this is ultimately used is to control when `malloc_trim`
is called here


https://github.com/man-group/ArcticDB/blob/e3fab24b653439f9894495a2657bb2dcfc1fbb42/cpp/arcticdb/util/allocator.cpp#L286-L288

to release memory back to the system. This only occurs on Linux. Other
OSes could have all of this code removed but this would be a bigger
change with many `#ifdef` blocks, etc.

---------

Signed-off-by: Ian Thomas <[email protected]>
The SlabAlloc tests were super slow on the build servers.
Investigation showed this was due to poor performance of the slab allocator with many threads. More details:
https://manwiki.maninvestments.com/display/AlphaTech/Slab+Allocator+poor+multi-threaded+performance+with+aggressive+allocations

This commit does the following:
 - Places an upper limit on num_threads of 8 (which drastically speeds up the tests on the build servers)
 - Improves multi-threaded performance by ~20% by removing unneeded atomic::load() calls (atomic::compare_exchange_strong already does this for us)
 - Adds compiled-away logging for lock contention, which can be enabled with #define LOG_SLAB_ALLOC_INTERNALS

#### What does this implement or fix?

We should only write the version ref key once when we write with
`prune_previous_versions=True`. Currently we are writing it twice - once
after we write the tombstone all and once when we write the new version.
This means that there is a period of time where the symbol is
unreadable.

This was fixed a while ago with PR #1104 but regressed with PR #1355.
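For reference, the affected call shape (a sketch; symbol and data are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)
df = pd.DataFrame({"a": [1.0]}, index=pd.date_range("2024-01-01", periods=1))

# The ref key should be written exactly once during this call, so a concurrent
# reader never observes a window where the symbol is unreadable
lib.write("sym", df, prune_previous_versions=True)
```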
#### What does this implement or fix?

`SymbolDescription.last_update_time` is a
`pandas._libs.tslibs.timestamps.Timestamp` which is a
`datetime.datetime` subtype, not a `numpy.datetime64`.

#### Any other comments?

Should this be `Union[datetime.datetime, numpy.datetime64]`?
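For context, a quick way to observe the runtime type (symbol name illustrative):

```python
import datetime

from arcticdb import Arctic

lib = Arctic("lmdb://example")["demo"]
info = lib.get_description("sym")

# A pandas Timestamp is a datetime.datetime subtype, not a numpy.datetime64
assert isinstance(info.last_update_time, datetime.datetime)
print(type(info.last_update_time))  # pandas._libs.tslibs.timestamps.Timestamp
```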

…ilder (#1557)

#### Reference Issues/PRs
Fixes #1148.
Previously, if a type was mismatched in the QueryBuilder as follows
```python
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4], "col3": [4, 5, 6], "col_str": ["1", "2", "3"]})
sym = "symbol"
lib.write(sym, df1)
q = QueryBuilder()
q = q[q["col1"] + 1 == "3"]
lib.read(sym, query_builder=q)
```
it would show an unclear message with `UTF_DYNAMIC64` type shown for
strings
```
arcticdb_ext.exceptions.InternalException: E_ASSERTION_FAILURE Cannot compare TD<type=UTF_DYNAMIC64, dim=0> to TD<type=UINT32, dim=0> (possible categorical?)
```
Now the error is much clearer: column names are generated according
to the query, and the STRING type is shown for strings
```
arcticdb_ext.exceptions.UserInputException: E_INVALID_USER_ARGUMENT Invalid comparison (col1 + 1) (TD<type=INT64, dim=0>) == "3" (TD<type=STRING, dim=0>)
```
For a more complex query like `q = q[1 + q["col1"] * q["col2"] -
q["col3"] == q["col_str"]]` it will show column name `((1 + (col1 *
col2)) - col3)` in the error message which allows the user to better
understand the error.

#### What does this implement or fix?
This PR aims to provide a way to get the name/identifier of a given
store.
[This is what the
identifier](https://github.com/man-group/ArcticDB/blob/f208433e44d17f03011a96db64ef25dba3fccaa4/cpp/arcticdb/storage/s3/s3_storage.cpp#L57)
looks like now:
s3_storage-us-east-1/test_bucket_1/local/lib-kOGPh-3-True-target-1
- this really only applies to S3: {storage_type}-{region}/{bucket}/{prefix or lib_name}
- for the other stores: {storage_type}-{prefix or lib_name}
Signed-off-by: Julien Jerphanion <[email protected]>
Revoke removing assert

Remove useless CA cert path in non-SSL-enabled testing environment

Address PR comment

Better test

Address PR comments

Update docs/mkdocs/docs/api/arctic_uri.md

Co-authored-by: Alex Seaton <[email protected]>
I've added a few new points to the FAQs. Please find the link below; I
would appreciate any feedback.

https://1a1ca458.arcticdb-docs.pages.dev/dev/faq/
alexowens90 and others added 15 commits November 5, 2024 13:12
…rovided in QueryBuilder operations (#1976)

#### Reference Issues/PRs
Fixes #1970
…uct is too big (#1981)

#### Reference Issues/PRs
Fixes man-group/arcticdb-man#127

#### What does this implement or fix?
Changes error message when the metastruct for recursively normalized
data is too large to no longer reference user-defined metadata.
#### Reference Issues/PRs
Fixes #1841 

#### What does this implement or fix?
Before this change, if a `Series` had an empty-string as a name, this
would be roundtripped as a `None`.
This introduces a `has_name` bool to the normalization metadata
protobuf, as a backwards-compatible way of effectively making the `name`
field optional.

The behaviour (which has been verified) can be summarised as follows:
```
Writer version | Series name | Protobuf name field | Protobuf has_name field | Series name read by <=5.0.0 | Series name read by this branch
---------------|-------------|---------------------|-------------------------|-----------------------------|--------------------------------
       <=5.0.0 |     "hello" |             "hello" |             Not present |                     "hello" |                         "hello"
       <=5.0.0 |          "" |                  "" |             Not present |                        None |                            None
       <=5.0.0 |        None |                  "" |             Not present |                        None |                            None
   This branch |     "hello" |             "hello" |                    True |                     "hello" |                         "hello"
   This branch |          "" |                  "" |                    True |                        None |                              ""
   This branch |        None |                  "" |                   False |                        None |                            None
```
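A sketch of the round-trip behaviour this enables (library URI and symbols are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)

# An empty-string Series name used to round-trip as None; with has_name it survives
lib.write("named", pd.Series([1.0, 2.0], name=""))
assert lib.read("named").data.name == ""

lib.write("unnamed", pd.Series([1.0, 2.0]))  # name=None
assert lib.read("unnamed").data.name is None
```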
#### Reference Issues/PRs
Bitmagic is failing ArcticDB builds. The issue is reported
[here](tlk00/BitMagic#76) and is fixed in
master. A ticket to port the fix into vcpkg has been filed
[here](microsoft/vcpkg#41935). Use a custom
overlay while waiting for a proper port.

Also, since 4.5.1 is now available, we replace 4.5.0 with 4.5.1, which
should re-enable the lmdb and mongo tests with their segfaults fixed.
The issue with conda feedstock runners is that they use a variable for
abspath, which doesn't get correctly expanded by `os.path.expandvars`.

This requires using `shell=True` for linux as well.

This was tested to properly fix compat tests in #1931 combined with the
[feedstock pr](conda-forge/arcticdb-feedstock#322).
However, it still doesn't work for some of the macOS builds, which are not skipped correctly.

Still, this is a good change for local runs as well; we'll fix the
macOS issue in a separate run.
The destruction bug on mongo that I thought I fixed in #1862 can still be
seen.

We skip the mongo test for now.
After we fix the segfaults we can re-enable it.

#### Reference Issues/PRs
Fixes #1937 

#### What does this implement or fix?
See ticket for details. Being able to create the append ref linked-list
structure is useful for testing, so this was moved to the `LibraryTool`.
…ferent timestamps (#1978)

#### Reference Issues/PRs
Closes #1306 

The actual bug was fixed in #1227; this just adds a comprehensive
non-regression test to ensure we never reintroduce anything similar.
#### Reference Issues/PRs
Closes #1895 
Fixes man-group/arcticdb-man#171
Fixes #1936 
Fixes #1939 
Fixes #1940 

#### What does this implement or fix?
Schedules all work asynchronously in batch reads when processing is
involved, as well as when all symbols are being read directly.
Previously, symbols were processed sequentially, leading to idle CPUs
when processing lots of smaller symbols.

This works by making `read_frame_for_version` schedule work and return
futures, rather than actually performing the processing. This
implementation can then be used for all 4 combinations of
batch/non-batch and direct/with processing reads, significantly
simplifying the code and removing the now redundant `async_read_direct`
(the fact that there were two different implementations to achieve
effectively the same thing is what led to 2 of the bugs in the first
place).

Several bugs that were discovered during the implementation (flagged
above) have also been fixed.

Further work in this area covered in #1968
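For reference, the call shape this change affects: batch reads both direct and with processing (a sketch; symbols and query are illustrative, and `ReadRequest` is assumed importable from `arcticdb.version_store.library`):

```python
from arcticdb import Arctic, QueryBuilder
from arcticdb.version_store.library import ReadRequest

lib = Arctic("lmdb://example")["demo"]

q = QueryBuilder()
q = q[q["price"] > 100]

# Direct reads and reads with processing now all schedule work asynchronously
results = lib.read_batch(["sym1", "sym2", ReadRequest("sym3", query_builder=q)])
dfs = [r.data for r in results]
```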
…ent symbol (#1991)

#### Reference Issues/PRs
Closes man-group/arcticdb-ursus#8

#### What does this implement or fix?
The actual bug was fixed, probably in #1950; this just adds a test of the
correct behaviour.
#### Reference Issues/PRs

#### What does this implement or fix?
Add Coverity scan. The current implementation does not get PR comments
and does not block the build.

Co-authored-by: Vasil Pashov <[email protected]>
phoebusm and others added 3 commits November 15, 2024 12:59
#### Reference Issues/PRs

#### What does this implement or fix?
Make static analysis a cron job running at 3 A.M.
Disable the on-PR run, as currently there is a branch limit of 10
branches.
@grusev force-pushed the perf_first_bi_benchmark branch from 363c7e5 to 6076d15 on November 18, 2024 07:06
@G-D-Petrov (Collaborator) left a comment

Not necessary for this PR, but it might be nice to do these benchmarks also for the read_batch method.
But this will probably be more reasonable if we add more of these data sets in the future.

Collaborator:

We will need to set up git lfs for this file before merging the PR.

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def get_query_groupaby_city_count_isin_filter(self, q):
Collaborator:

Typo: should be groupby
Applies to all the methods with the same typo

Collaborator Author:

done!

def time_query_readall(self, times_bigger):
self.lib.read(f"{self.symbol}{times_bigger}")

def get_query_groupaby_city_count_all(self, q):
Collaborator:

nit: looks like there is no need for these to be methods of the class, can be defined outside the class which will make it a bit more readable

Collaborator Author:

done!

_df = self.df.copy(deep=True)
arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
_df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
arctic_df.sort_index(inplace=True)
Collaborator:

Looks like the sorting of the index is already done in the assert_frame_equal function, so it is not needed here.

Collaborator Author:

Indeed! A late refactoring omission.


del self.ac

def assert_frame_equal(self, pandas_df:pd.DataFrame, arctic_df:pd.DataFrame):
Collaborator:

nit: Looks like this doesn't need to be a method of the class

Collaborator Author:

agree!

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
Collaborator:

Might be a good idea to add a peakmem_... for the other time_... variants as well

Collaborator Author:

I was considering this also. I will add them later; we can remove them easily.

Collaborator Author:

I now also added printing of info for the large dataframes produced by this test. So our first test is with a 1 GB DF, while the last is with a 10 GB DF.

         OUTPUT -------->
         Parquet file exists!
         The procedure is creating N times larger dataframes
         by concatenating original DF N times
         DF for iterration xSize original ready:  10
         <class 'pandas.core.frame.DataFrame'>
         Index: 9126570 entries, 0 to 912656
         Data columns (total 31 columns):
          #   Column               Dtype
         ---  ------               -----
          0   City/Admin           object
          1   City/State           object
          2   City                 object
          3   Created Date/Time    float64
          4   Date Joined          float64
          5   FF Ratio             float64
          6   Favorites            int32
          7   First Link in Tweet  object
          8   Followers            int32
          9   Following            int32
          10  Gender               object
          11  Influencer?          int32
          12  Keyword              object
          13  LPF                  float64
          14  Language             object
          15  Lat                  float64
          16  Listed Number        int32
          17  Long Domain          object
          18  Long                 float64
          19  Number of Records    int32
          20  Region               object
          21  Short Domain         object
          22  State/Country        object
          23  State                object
          24  Tweet Text           object
          25  Tweets               int32
          26  Twitter Client       object
          27  User Bio             object
          28  User Loc             object
          29  Username 1           object
          30  Username             object
         dtypes: float64(6), int32(7), object(18)
         memory usage: 10.8 GB

@grusev force-pushed the perf_first_bi_benchmark branch from 5c21cbe to 600bd5a on November 25, 2024 17:12
@grusev closed this Nov 25, 2024
@grusev (Collaborator Author) commented Nov 25, 2024

A new PR was created to resolve the issues: #2019

grusev added a commit that referenced this pull request Dec 9, 2024
#### Reference Issues/PRs

From PR #1995 (all comments addressed)

    Sample test benchmark using one open-source BI CSV source.
    The logic of the test is:
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DataFrame
        - for each query we want to benchmark, pre-check that the query produces the SAME result on pandas and ArcticDB
        - run the benchmark tests


Co-authored-by: Georgi Rusev <Georgi Rusev>