
new bi sample benchmark tests #1995

Closed · wants to merge 1,187 commits into from

Conversation

@grusev (Collaborator) commented Nov 12, 2024

Reference Issues/PRs

What does this implement or fix?

NEW PR CREATED: #2019

    Sample test benchmark using one open-source BI CSV source.
    The logic of the test is (see the sketch below):
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DataFrame
        - for each query we want to benchmark, pre-check that the query produces the SAME result on pandas and ArcticDB
        - run the benchmark tests
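A minimal sketch of that flow in ASV style (the URL, library name, column, and query are illustrative placeholders, not the PR's exact code):

```python
import os

import pandas as pd
from arcticdb import Arctic, QueryBuilder

PARQUET = "bi_data.parquet"
SOURCE_BZ2 = "bi_data.csv.bz2"  # hypothetical download target

class BIBenchmarks:
    params = [1, 10]  # how many times larger than the original DF
    param_names = ["times_bigger"]

    def setup_cache(self):
        # 1./2. download the .bz2 source and convert it only if the parquet file is missing
        if not os.path.exists(PARQUET):
            pd.read_csv(SOURCE_BZ2, compression="bz2").to_parquet(PARQUET)
        # 3. prepare a library with one symbol per size multiplier
        df = pd.read_parquet(PARQUET)
        lib = Arctic("lmdb://bi_bench").get_library("bi", create_if_missing=True)
        for n in self.params:
            lib.write(f"sym{n}", pd.concat([df] * n, ignore_index=True))

    def setup(self, times_bigger):
        self.lib = Arctic("lmdb://bi_bench")["bi"]
        self.df = pd.read_parquet(PARQUET)
        # 4. pre-check: the query must produce the SAME result on pandas and ArcticDB
        q = QueryBuilder()
        q = q[q["Followers"] > 1000]
        arctic_df = self.lib.read("sym1", query_builder=q).data
        pandas_df = self.df[self.df["Followers"] > 1000]
        pd.testing.assert_frame_equal(
            pandas_df.reset_index(drop=True), arctic_df.reset_index(drop=True)
        )

    # 5. the benchmarks themselves
    def time_query_readall(self, times_bigger):
        self.lib.read(f"sym{times_bigger}")
```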

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

alexowens90 and others added 30 commits April 25, 2024 09:23
Closes #515 

Non-breaking addition to `Library.finalize_staged_data` API to allow
persisting of metadata along with the symbol.
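A quick sketch of the new capability (library URI, symbol, and metadata are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)

# Stage two chunks; they are not readable until finalized
lib.write("sym", pd.DataFrame({"a": [1, 2]}, index=pd.date_range("2024-01-01", periods=2)), staged=True)
lib.write("sym", pd.DataFrame({"a": [3, 4]}, index=pd.date_range("2024-01-03", periods=2)), staged=True)

# New in this change: metadata can be persisted along with the finalized symbol
lib.finalize_staged_data("sym", metadata={"source": "nightly-load"})
assert lib.read_metadata("sym").metadata == {"source": "nightly-load"}
```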

Co-authored-by: Vasil Pashov <[email protected]>
Closes #1507 
Closes #1509

While fixing, noticed that the added test
`test_append_range_index_from_zero` would also not have passed, so fixed
this too.
- Separates non-batch and batch BasicFunctions. Uses multiple symbols
  only where it is required
- Use single symbol for ModificationFunctions
- Use fewer rows for batch benchmarks
- Decrease version chain length for IteratingVersionChain
#### Reference Issues/PRs
Implements request #1003 

#### Reference Issues/PRs
Implements #1391 and fixes #1532.
- Added read_index in LibraryTool (as required in #1391)
- Allowed Library to get LibraryTool and call read_index (to fix #1532).
Also updated the docs accordingly.
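A hedged sketch of how this might be used; the exact accessor on `Library` may differ, and the symbol name is illustrative:

```python
from arcticdb import Arctic

lib = Arctic("lmdb://example")["demo"]

# Reach the LibraryTool from the Library and read the index structure of a symbol
lt = lib._nvs.library_tool()      # assumes the underlying NativeVersionStore accessor
index_df = lt.read_index("sym")   # the new call added for #1391
print(index_df)
```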

* storage access docs moved to their own section at the bottom
* code formatting now copy/paste friendly
* rewrite of getting started
* section on transactions added

Co-authored-by: Nick Clarke <[email protected]>
The short_wide ModificationFunction benchmarks were failing because of
an error in the #1530 restructure.
… columns (#1538)


#### What does this implement or fix?
This fixes two things:
1. When using dynamic schema without direct read, the function copying
the source column to the destination column now checks whether the
source is of empty type. If it is, it calls default_initialize.
Previously it tried to call memcpy and crashed because it dereferenced
NULL.
2. During decoding (both v1 and v2), the number of rows in a segment is
now determined by taking the max number of rows across the columns.
Previously it was determined by the number of rows in the first column;
however, if the first column was of empty type it would report 0 rows,
which is wrong. This led to creating sparse maps for dense columns.
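A hedged repro sketch of the failing shape (assumes a dynamic-schema library; whether the empty type is active depends on build configuration):

```python
import pandas as pd
from arcticdb import Arctic, LibraryOptions

ac = Arctic("lmdb://example")
lib = ac.get_library("dyn", create_if_missing=True,
                     library_options=LibraryOptions(dynamic_schema=True))

# "col_a" holds only None, so it can be stored as the empty type; appending
# rows that later give it a real type used to crash in the column-copy path
lib.write("sym", pd.DataFrame({"col_a": [None, None], "col_b": [1, 2]},
                              index=pd.date_range("2024-01-01", periods=2)))
lib.append("sym", pd.DataFrame({"col_a": [1.0, 2.0], "col_b": [3, 4]},
                               index=pd.date_range("2024-01-03", periods=2)))
print(lib.read("sym").data)
```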

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs
Fixes #944
The error message now includes the field index, which gives a clearer
error when two columns are simply swapped.
To do this, a formatter is defined for `FieldCollection`, because the
index is not saved in `Field` (previously the formatter of `Field` was
used to compose the error message); it is only known at the
`FieldCollection` level.

Furthermore, the pattern matching regular expression is also updated as
the output now becomes `FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>,
FD<name=col2, type=TD<type=type2, dim=0>, idx=1>`. After `idx` was
added, the previous regular expression was separating this into
```python
["FD<name=col1, type=TD<type=typ1, dim=0>", "idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>", "idx=1>"]
```
but we want
```python
["FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>", "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>"]
```
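For illustration, a non-greedy pattern that anchors on the new `idx` field yields the desired grouping (a sketch, not necessarily the exact expression used in the test):

```python
import re

s = ("FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>, "
     "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>")

# Match lazily from each "FD<" up to its own "idx=N>" terminator
fields = re.findall(r"FD<.*?idx=\d+>", s)
assert fields == [
    "FD<name=col1, type=TD<type=typ1, dim=0>, idx=0>",
    "FD<name=col2, type=TD<type=type2, dim=0>, idx=1>",
]
```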

Now the fix shows the following output which explains the error better:
```
The columns (names and types) in the argument are not identical to that of the existing version: APPEND 
(Showing only the mismatch. Full col list saved in the `last_mismatch_msg` attribute of the lib instance.
'-' marks columns missing from the argument, '+' for unexpected.)
-FD<name=col1, type=TD<type=INT64, dim=0>, idx=0>
-FD<name=col2, type=TD<type=INT64, dim=0>, idx=1>
+FD<name=col2, type=TD<type=INT64, dim=0>, idx=0>
+FD<name=col1, type=TD<type=INT64, dim=0>, idx=1>
```

We noticed a speed regression in the version_chain benchmarks arising
from #1152.

Profiling showed that the majority of the slowdown was because the
`~SegmentInMemory` destructor became a lot slower. And all of the
slowdown was because of the recurring `ConfigsMap::get_int` inside
`maybe_trim`.

We fix this by again using a `static` variable for getting the config
value.

I've verified that there aren't other similar regressions from that PR.
We need to take the GIL when Python needs to allocate special characters, otherwise we will deadlock. At the moment read() releases the GIL, but update() and append() don't.
Docs-only change

---------

Co-authored-by: Nick Clarke <[email protected]>
#### Reference Issues/PRs
Closes: #1226
#### What does this implement or fix?
* Remove the environment variable which was used to enable the
consolidation phase
* Make our custom block manager cast blocks for empty columns to
float64. To be removed after empty types become enabled.

Co-authored-by: Vasil Pashov <[email protected]>
#### Reference Issues/PRs

Fixes the 10th item of the folly replacement plan, issue #1412.

#### What does this implement or fix?

This removes the single use of `folly/ThreadCachedInt`. It is replaced
by a partial vendoring of the `folly` code plus use of
`boost::thread_specific_ptr`.

`ThreadCachedInt` is used to count the number of freed memory blocks. It
is (presumably) not just implemented as an atomic integer count as
thread locking would be too slow, so instead each thread has its own
count and when a single thread's count exceeds some threshold it is
added to the overall count. The original `folly` implementation has two
ways of reading the count which are slow (fully accurate) and fast (not
fully accurate). ArcticDB only uses the fast option, so the
implementation is much simpler than `folly`'s, requiring fewer
`atomic`s.
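To illustrate the counting scheme, here is a Python sketch of the idea (not the C++ implementation; the threshold is made up):

```python
import threading

class ThreadCachedIntSketch:
    """Per-thread counters that flush into a shared total only when they
    exceed a threshold, so the shared state is touched rarely."""

    def __init__(self, flush_threshold=1000):
        self._flush_threshold = flush_threshold
        self._total = 0
        self._lock = threading.Lock()       # stands in for the C++ atomic
        self._local = threading.local()     # stands in for thread_specific_ptr

    def increment(self, n=1):
        cached = getattr(self._local, "count", 0) + n
        if cached >= self._flush_threshold:
            with self._lock:                # contended path taken rarely
                self._total += cached
            cached = 0
        self._local.count = cached

    def read_fast(self):
        # Like folly's fast read: may under-count by up to
        # flush_threshold per thread, but needs no synchronisation
        return self._total
```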

New class `ThreadCachedInt` in `thread_cached_int.hpp` is derived from
https://github.com/facebook/folly/blob/main/folly/ThreadCachedInt.h but
containing only the subset of functionality required. Instead of using
`folly'`s own `ThreadLocalPtr` this uses `boost::thread_specific_ptr`.
The `folly` source code claims that their implementation is 4x faster
than the `boost` one:


https://github.com/facebook/folly/blob/dbc9e565f54eabb40ad6600656ad9dea919f51c0/folly/ThreadLocal.h#L18-L20

but that claim dates from 12 years ago and this solution is simpler than
theirs. This does need to be benchmarked against `master` to confirm
that it is not measurably slower.

#### Any other comments?

The only place this is ultimately used is to control when `malloc_trim`
is called here


https://github.com/man-group/ArcticDB/blob/e3fab24b653439f9894495a2657bb2dcfc1fbb42/cpp/arcticdb/util/allocator.cpp#L286-L288

to release memory back to the system. This only occurs on Linux. Other
OSes could have all of this code removed but this would be a bigger
change with many `#ifdef` blocks, etc.

---------

Signed-off-by: Ian Thomas <[email protected]>
The SlabAlloc tests were super slow on the build servers.
Investigation showed this was due to poor performance of the slab allocator with many threads. More details:
https://manwiki.maninvestments.com/display/AlphaTech/Slab+Allocator+poor+multi-threaded+performance+with+aggressive+allocations

This commit does the following:
 - Places an upper limit on num_threads of 8 (which drastically speeds up the tests on the build servers)
 - Improves multi-threaded performance by ~20% by removing unneeded atomic::load() calls (atomic::compare_exchange_strong already does this for us)
 - Adds compiled-away logging for lock contention, which can be enabled with #define LOG_SLAB_ALLOC_INTERNALS

#### What does this implement or fix?

We should only write the version ref key once when we write with
`prune_previous_versions=True`. Currently we are writing it twice - once
after we write the tombstone all and once when we write the new version.
This means that there is a period of time where the symbol is
unreadable.

This was fixed a while ago with PR #1104 but regressed with PR #1355.
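For reference, the affected call shape (a sketch; symbol and data are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)
df = pd.DataFrame({"a": [1.0]}, index=pd.date_range("2024-01-01", periods=1))

# The ref key should be written exactly once during this call, so a concurrent
# reader never observes a window where the symbol is unreadable
lib.write("sym", df, prune_previous_versions=True)
```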
#### What does this implement or fix?

`SymbolDescription.last_update_time` is a
`pandas._libs.tslibs.timestamps.Timestamp` which is a
`datetime.datetime` subtype, not a `numpy.datetime64`.

#### Any other comments?

Should this be `Union[datetime.datetime, numpy.datetime64]`?
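For context, a quick way to observe the runtime type (symbol name illustrative):

```python
import datetime

from arcticdb import Arctic

lib = Arctic("lmdb://example")["demo"]
info = lib.get_description("sym")

# A pandas Timestamp is a datetime.datetime subtype, not a numpy.datetime64
assert isinstance(info.last_update_time, datetime.datetime)
print(type(info.last_update_time))  # pandas._libs.tslibs.timestamps.Timestamp
```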

…ilder (#1557)

#### Reference Issues/PRs
Fixes #1148.
Previously, if a type was mismatched in the QueryBuilder as follows
```python
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4], "col3": [4, 5, 6], "col_str": ["1", "2", "3"]})
sym = "symbol"
lib.write(sym, df1)
q = QueryBuilder()
q = q[q["col1"] + 1 == "3"]
lib.read(sym, query_builder=q)
```
it would show an unclear message with `UTF_DYNAMIC64` type shown for
strings
```
arcticdb_ext.exceptions.InternalException: E_ASSERTION_FAILURE Cannot compare TD<type=UTF_DYNAMIC64, dim=0> to TD<type=UINT32, dim=0> (possible categorical?)
```
Now the error is much clearer: column names are generated according
to the query, and the STRING type is shown for strings
```
arcticdb_ext.exceptions.UserInputException: E_INVALID_USER_ARGUMENT Invalid comparison (col1 + 1) (TD<type=INT64, dim=0>) == "3" (TD<type=STRING, dim=0>)
```
For a more complex query like `q = q[1 + q["col1"] * q["col2"] -
q["col3"] == q["col_str"]]` it will show column name `((1 + (col1 *
col2)) - col3)` in the error message which allows the user to better
understand the error.

#### What does this implement or fix?
This PR aims to provide a way to get the name/identifier of a given
store.
[This is what the
identifier](https://github.com/man-group/ArcticDB/blob/f208433e44d17f03011a96db64ef25dba3fccaa4/cpp/arcticdb/storage/s3/s3_storage.cpp#L57)
looks like now:
s3_storage-us-east-1/test_bucket_1/local/lib-kOGPh-3-True-target-1
- this really only applies to S3: {storage_type}-{region}/{bucket}/{prefix or lib_name}
- for the other stores: {storage_type}-{prefix or lib_name}
Signed-off-by: Julien Jerphanion <[email protected]>
Revoke removing assert

Remove useless CA cert path in non-SSL-enabled testing environment

Address PR comment

Better test

Address PR comments

Update docs/mkdocs/docs/api/arctic_uri.md

Co-authored-by: Alex Seaton <[email protected]>
I've added a few new points to the FAQs. Please find the link below; I
would appreciate any feedback.

https://1a1ca458.arcticdb-docs.pages.dev/dev/faq/
alexowens90 and others added 15 commits November 5, 2024 13:12
…rovided in QueryBuilder operations (#1976)

#### Reference Issues/PRs
Fixes #1970
…uct is too big (#1981)

#### Reference Issues/PRs
Fixes man-group/arcticdb-man#127

#### What does this implement or fix?
Changes error message when the metastruct for recursively normalized
data is too large to no longer reference user-defined metadata.
#### Reference Issues/PRs
Fixes #1841 

#### What does this implement or fix?
Before this change, if a `Series` had an empty-string as a name, this
would be roundtripped as a `None`.
This introduces a `has_name` bool to the normalization metadata
protobuf, as a backwards-compatible way of effectively making the `name`
field optional.

The behaviour (which has been verified) can be summarised as follows:
```
Writer version | Series name | Protobuf name field | Protobuf has_name field | Series name read by <=5.0.0 | Series name read by this branch
---------------|-------------|---------------------|-------------------------|-----------------------------|--------------------------------
       <=5.0.0 |     "hello" |             "hello" |             Not present |                     "hello" |                         "hello"
       <=5.0.0 |          "" |                  "" |             Not present |                        None |                            None
       <=5.0.0 |        None |                  "" |             Not present |                        None |                            None
   This branch |     "hello" |             "hello" |                    True |                     "hello" |                         "hello"
   This branch |          "" |                  "" |                    True |                        None |                              ""
   This branch |        None |                  "" |                   False |                        None |                            None
```
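A sketch of the round-trip behaviour this enables (library URI and symbols are illustrative):

```python
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb://example").get_library("demo", create_if_missing=True)

# An empty-string Series name used to round-trip as None; with has_name it survives
lib.write("named", pd.Series([1.0, 2.0], name=""))
assert lib.read("named").data.name == ""

lib.write("unnamed", pd.Series([1.0, 2.0]))  # name=None
assert lib.read("unnamed").data.name is None
```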
#### Reference Issues/PRs
Bitmagic is failing ArcticDB builds. The issue is reported
[here](tlk00/BitMagic#76) and is fixed in
master. A ticket to port the fix into vcpkg has been filed
[here](microsoft/vcpkg#41935). Use a custom
overlay while waiting for a proper port.

Also, since 4.5.1 is now available, we replace 4.5.0 with 4.5.1, which
should re-enable the lmdb and mongo tests with their segfaults fixed.
The issue with conda feedstock runners is that they use a variable for
abspath, which doesn't get correctly expanded by `os.path.expandvars`.

This requires using `shell=True` for linux as well.

This was tested to properly fix compat tests in #1931 combined with the
[feedstock pr](conda-forge/arcticdb-feedstock#322).
However, it still doesn't work for some of the macOS builds, which are not skipped correctly.

Still, this is a good change for local runs as well; we'll fix the
macOS issue in a separate run.
The destruction bug on mongo that I thought I fixed in #1862 can still be
seen.

We skip the mongo test for now.
After we fix the segfaults we can re-enable it.

#### Reference Issues/PRs
Fixes #1937 

#### What does this implement or fix?
See ticket for details. Being able to create the append ref linked-list
structure is useful for testing, so this was moved to the `LibraryTool`.
…ferent timestamps (#1978)

#### Reference Issues/PRs
Closes #1306 

The actual bug was fixed in #1227; this just adds a comprehensive
non-regression test to ensure we never reintroduce anything similar.
#### Reference Issues/PRs
Closes #1895 
Fixes man-group/arcticdb-man#171
Fixes #1936 
Fixes #1939 
Fixes #1940 

#### What does this implement or fix?
Schedules all work asynchronously in batch reads when processing is
involved, as well as when all symbols are being read directly.
Previously, symbols were processed sequentially, leading to idle CPUs
when processing lots of smaller symbols.

This works by making `read_frame_for_version` schedule work and return
futures, rather than actually performing the processing. This
implementation can then be used for all 4 combinations of
batch/non-batch and direct/with processing reads, significantly
simplifying the code and removing the now redundant `async_read_direct`
(the fact that there were two different implementations to achieve
effectively the same thing is what led to 2 of the bugs in the first
place).

Several bugs that were discovered during the implementation (flagged
above) have also been fixed.

Further work in this area covered in #1968
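For reference, the call shape this change affects: batch reads both direct and with processing (a sketch; symbols and query are illustrative, and `ReadRequest` is assumed importable from `arcticdb.version_store.library`):

```python
from arcticdb import Arctic, QueryBuilder
from arcticdb.version_store.library import ReadRequest

lib = Arctic("lmdb://example")["demo"]

q = QueryBuilder()
q = q[q["price"] > 100]

# Direct reads and reads with processing now all schedule work asynchronously
results = lib.read_batch(["sym1", "sym2", ReadRequest("sym3", query_builder=q)])
dfs = [r.data for r in results]
```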
…ent symbol (#1991)

#### Reference Issues/PRs
Closes man-group/arcticdb-ursus#8

#### What does this implement or fix?
The actual bug was fixed, probably in #1950; this just adds a test of the
correct behaviour.
#### Reference Issues/PRs

#### What does this implement or fix?
Add Coverity scan. The current implementation does not get PR comments
and does not block the build.

Co-authored-by: Vasil Pashov <[email protected]>
phoebusm and others added 3 commits November 15, 2024 12:59
#### Reference Issues/PRs

#### What does this implement or fix?
Make static analysis a cron job running at 3 A.M.
Disable the on-PR run, as currently there is a branch limit of 10
branches.
@grusev force-pushed the perf_first_bi_benchmark branch from 363c7e5 to 6076d15 on November 18, 2024 07:06
@G-D-Petrov (Collaborator) left a comment

Not necessary for this PR, but it might be nice to do these benchmarks also for the read_batch method.
But this will probably be more reasonable if we add more of these data sets in the future.

Collaborator:

We will need to set up git lfs for this file before merging the PR.

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def get_query_groupaby_city_count_isin_filter(self, q):
Collaborator:

Typo: should be groupby
Applies to all the methods with the same typo

Collaborator Author:

done!

def time_query_readall(self, times_bigger):
self.lib.read(f"{self.symbol}{times_bigger}")

def get_query_groupaby_city_count_all(self, q):
Collaborator:

nit: looks like there is no need for these to be methods of the class, can be defined outside the class which will make it a bit more readable

Collaborator Author:

done!

_df = self.df.copy(deep=True)
arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
_df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
arctic_df.sort_index(inplace=True)
Collaborator:

Looks like the sorting of the index is already done in the assert_frame_equal function, so it is not needed here.

Collaborator Author:

Indeed! A late refactoring omission.


del self.ac

def assert_frame_equal(self, pandas_df:pd.DataFrame, arctic_df:pd.DataFrame):
Collaborator:

nit: Looks like this doesn't need to be a method of the class

Collaborator Author:

agree!

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
Collaborator:

Might be a good idea to add a peakmem_... for the other time_... variants as well

Collaborator Author:

I was considering this also. I will add them later; we can remove them easily.

Collaborator Author:

I now also added printing of info for the large dataframes produced by this test. So our first test is with a 1 GB DF, while the last is with a 10 GB DF.

         OUTPUT -------->
         Parquet file exists!
         The procedure is creating N times larger dataframes
         by concatenating original DF N times
         DF for iterration xSize original ready:  10
         <class 'pandas.core.frame.DataFrame'>
         Index: 9126570 entries, 0 to 912656
         Data columns (total 31 columns):
          #   Column               Dtype
         ---  ------               -----
          0   City/Admin           object
          1   City/State           object
          2   City                 object
          3   Created Date/Time    float64
          4   Date Joined          float64
          5   FF Ratio             float64
          6   Favorites            int32
          7   First Link in Tweet  object
          8   Followers            int32
          9   Following            int32
          10  Gender               object
          11  Influencer?          int32
          12  Keyword              object
          13  LPF                  float64
          14  Language             object
          15  Lat                  float64
          16  Listed Number        int32
          17  Long Domain          object
          18  Long                 float64
          19  Number of Records    int32
          20  Region               object
          21  Short Domain         object
          22  State/Country        object
          23  State                object
          24  Tweet Text           object
          25  Tweets               int32
          26  Twitter Client       object
          27  User Bio             object
          28  User Loc             object
          29  Username 1           object
          30  Username             object
         dtypes: float64(6), int32(7), object(18)
         memory usage: 10.8 GB

@grusev force-pushed the perf_first_bi_benchmark branch from 5c21cbe to 600bd5a on November 25, 2024 17:12
@grusev closed this Nov 25, 2024
@grusev (Collaborator Author) commented Nov 25, 2024

A new PR was created to resolve the issues: #2019

grusev added a commit that referenced this pull request Dec 9, 2024
#### Reference Issues/PRs

From PR #1995 (all comments addressed)

    Sample test benchmark using one open-source BI CSV source.
    The logic of the test is:
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DataFrame
        - for each query we want to benchmark, pre-check that the query produces the SAME result on pandas and ArcticDB
        - run the benchmark tests


Co-authored-by: Georgi Rusev <Georgi Rusev>