
Support for first and last aggregators - string columns #1151

Open — wants to merge 16 commits into master
Conversation

Hind-M
Collaborator

@Hind-M Hind-M commented Dec 11, 2023

Fix #1105

@Hind-M Hind-M force-pushed the first_groupby_str branch 2 times, most recently from a9b25c0 to d423312 Compare December 19, 2023 18:08
@Hind-M Hind-M force-pushed the first_groupby_str branch from b931c22 to 34a369e Compare January 3, 2024 13:50
@Hind-M Hind-M marked this pull request as ready for review January 4, 2024 09:37
Collaborator

@alexowens90 alexowens90 left a comment


Can we add some tests with None and NaN values in the aggregation columns?
Can we also make the *_with_append tests a bit more complicated, possibly using hypothesis and lmdb_version_store_tiny_segment?
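A minimal sketch of what such a hypothesis-based test could look like. The ArcticDB round-trip (the `lmdb_version_store_tiny_segment` fixture and the write/append/read calls mentioned in this thread) is deliberately stubbed out with a plain-pandas reference, so this only illustrates the data generation with None and NaN values and the expected "first" semantics, not the actual test in the PR:

```python
# Sketch: property-based check of "first" aggregation semantics with
# None/NaN in the aggregation column. Plain pandas stands in for the
# ArcticDB read path, which is omitted here.
import numpy as np
import pandas as pd
from hypothesis import given, settings, strategies as st

# Values include None and NaN on purpose, per the review request.
values = st.one_of(st.none(), st.just(np.nan), st.floats(allow_nan=False))

@given(st.lists(st.tuples(st.sampled_from(["0", "00", "a"]), values),
                min_size=1, max_size=20))
@settings(max_examples=50, deadline=None)
def test_first_agg_reference(rows):
    df = pd.DataFrame(rows, columns=["grouping_column", "a"])
    df["a"] = df["a"].astype(float)  # None coerces to NaN
    result = df.groupby("grouping_column").agg({"a": "first"})
    # "first" should return the first non-null value per group,
    # or NaN if the whole group is null.
    for group, sub in df.groupby("grouping_column"):
        non_null = sub["a"].dropna()
        if len(non_null):
            assert result.loc[group, "a"] == non_null.iloc[0]
        else:
            assert np.isnan(result.loc[group, "a"])
```

In the real test the `result` dataframe would come from reading the symbol back through the version store fixture rather than from pandas.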

```diff
@@ -1013,7 +1013,7 @@ std::unique_ptr<StringReducer> get_string_reducer(
         const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
         string_reducer = std::make_unique<UnicodeConvertingStringReducer>(column, context, frame, frame_field, alloc_width);
     } else {
-        const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, false);
+        const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
```
Collaborator

This will be prohibitively expensive for large dataframes with fixed-width strings that have not been aggregated, where false would be sufficient. You need to pass into this method the information about what this parameter should be.

Collaborator

A hacky solution would be to set the orig_type in the aggregator/clause, so that was_coerced_from_dynamic_to_fixed is true in this case, and then add a comment that this is a big hack. Long term we want all reads to go through the ComponentManager, which will decide what to do at that point, so this is OK for now.

Collaborator Author

I first tried the hacky solution, but it just adds more complexity: the final dataframe compared against the expected one in the tests ends up with different indexes and data with respect to encoding, which would require additional hacking.
I think the first solution is fair enough (see this commit).

@Hind-M
Collaborator Author

Hind-M commented Feb 27, 2024

Can we add some tests with None and NaN values in the aggregation columns?

I added None values to the tests (cf. this commit). As for NaN values, the tests already included them; or were you thinking of something specific?

Can we also make the *_with_append tests a bit more complicated, possibly using hypothesis and lmdb_version_store_tiny_segment?

Using hypothesis in the *_with_append tests gives an unexpected output when the given input dataframes are the following:

df1:

  grouping_column    a
0               0  0.0

df2:

  grouping_column    a
0               0  0.0

df3:

  grouping_column    a
0               0  0.0
1              00  0.0

written with:

lib.write(symbol, df1)
lib.append(symbol, df2)
lib.append(symbol, df3)

Outputs are:

expected_df:
                   a
grouping_column
0                0.0
00               0.0

actual_dataframe:
                   a
grouping_column
0                0.0
0                0.0
00               0.0
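For reference, the expected result can be reproduced with plain pandas (a minimal sketch; the ArcticDB write/append/read round-trip above, which produces the wrong actual_dataframe, is omitted):

```python
# Plain-pandas reference for the expected output: grouping "0" must
# collapse to a single row even though its rows arrive via appends.
import pandas as pd

df1 = pd.DataFrame({"grouping_column": ["0"], "a": [0.0]})
df2 = pd.DataFrame({"grouping_column": ["0"], "a": [0.0]})
df3 = pd.DataFrame({"grouping_column": ["0", "00"], "a": [0.0, 0.0]})

combined = pd.concat([df1, df2, df3], ignore_index=True)
expected = combined.groupby("grouping_column").agg({"a": "first"})
# Two groups only: "0" and "00".
```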

I tried replicating this behavior in a basic test without hypothesis, but that one does produce the right output dataframe. It seems that something is happening in the PartitionClause at the repartition level which makes it behave this way; I'm not sure what that could be yet...

@Hind-M
Collaborator Author

Hind-M commented Feb 29, 2024

Update: For some reason, not all the values in grouping_column equal to "0" end up in the same bucket...

@Hind-M Hind-M force-pushed the first_groupby_str branch from aa673df to 848ab32 Compare March 18, 2024 16:22
Successfully merging this pull request may close these issues.

Add support for first and last aggregators - string columns
3 participants