
Support for first and last aggregators - string columns #1151

Open — wants to merge 16 commits into master
Conversation

Hind-M
Collaborator

@Hind-M Hind-M commented Dec 11, 2023

Fix #1105

@Hind-M Hind-M force-pushed the first_groupby_str branch 2 times, most recently from a9b25c0 to d423312 Compare December 19, 2023 18:08
@Hind-M Hind-M force-pushed the first_groupby_str branch from b931c22 to 34a369e Compare January 3, 2024 13:50
@Hind-M Hind-M marked this pull request as ready for review January 4, 2024 09:37
Collaborator

@alexowens90 alexowens90 left a comment


Can we add some tests with None and NaN values in the aggregation columns?
Can we also make the *_with_append tests a bit more complicated, possibly using hypothesis and lmdb_version_store_tiny_segment?
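A minimal sketch of what such a hypothesis-based test could look like. The ArcticDB round-trip (the `lmdb_version_store_tiny_segment` fixture and the write/append/read calls mentioned in this thread) is deliberately stubbed out with a plain-pandas reference, so this only illustrates the data generation with None and NaN values and the expected "first" semantics, not the actual test in the PR:

```python
# Sketch: property-based check of "first" aggregation semantics with
# None/NaN in the aggregation column. Plain pandas stands in for the
# ArcticDB read path, which is omitted here.
import numpy as np
import pandas as pd
from hypothesis import given, settings, strategies as st

# Values include None and NaN on purpose, per the review request.
values = st.one_of(st.none(), st.just(np.nan), st.floats(allow_nan=False))

@given(st.lists(st.tuples(st.sampled_from(["0", "00", "a"]), values),
                min_size=1, max_size=20))
@settings(max_examples=50, deadline=None)
def test_first_agg_reference(rows):
    df = pd.DataFrame(rows, columns=["grouping_column", "a"])
    df["a"] = df["a"].astype(float)  # None coerces to NaN
    result = df.groupby("grouping_column").agg({"a": "first"})
    # "first" should return the first non-null value per group,
    # or NaN if the whole group is null.
    for group, sub in df.groupby("grouping_column"):
        non_null = sub["a"].dropna()
        if len(non_null):
            assert result.loc[group, "a"] == non_null.iloc[0]
        else:
            assert np.isnan(result.loc[group, "a"])
```

In the real test the `result` dataframe would come from reading the symbol back through the version store fixture rather than from pandas.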

```diff
@@ -1013,7 +1013,7 @@ std::unique_ptr<StringReducer> get_string_reducer(
         const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
         string_reducer = std::make_unique<UnicodeConvertingStringReducer>(column, context, frame, frame_field, alloc_width);
     } else {
-        const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, false);
+        const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
```
Collaborator

This will be prohibitively expensive for large dataframes with fixed-width strings that have not been aggregated, where false would be sufficient. You need to pass into this method the information about what this parameter should be.

Collaborator

A hacky solution would be to set the orig_type in the aggregator/clause, so that was_coerced_from_dynamic_to_fixed is true in this case, and then add a comment that this is a big hack. Long term we want all reads to go through the ComponentManager, which will decide what to do at that point, so this is OK for now.

Collaborator Author

I first tried the hacky solution, but it just adds more complexity: the final dataframe compared against the expected one in the tests ends up with different indexes and data with respect to encoding, which would require additional hacking.
I think the first solution is fair enough (see this commit).

@Hind-M
Collaborator Author

Hind-M commented Feb 27, 2024

Can we add some tests with None and NaN values in the aggregation columns?

I added None values to the tests (cf. this commit). As for NaN values, the tests already included them; or were you thinking of something specific?

Can we also make the *_with_append tests a bit more complicated, possibly using hypothesis and lmdb_version_store_tiny_segment?

Using hypothesis in the *_with_append tests gives an unexpected output when the given input dataframes are the following:

df1:

  grouping_column    a
0               0  0.0

df2:

  grouping_column    a
0               0  0.0

df3:

  grouping_column    a
0               0  0.0
1              00  0.0

written with:

lib.write(symbol, df1)
lib.append(symbol, df2)
lib.append(symbol, df3)

Outputs are:

expected_df:
                   a
grouping_column
0                0.0
00               0.0

actual_dataframe:
                   a
grouping_column
0                0.0
0                0.0
00               0.0
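For reference, the expected result can be reproduced with plain pandas (a minimal sketch; the ArcticDB write/append/read round-trip above, which produces the wrong actual_dataframe, is omitted):

```python
# Plain-pandas reference for the expected output: grouping "0" must
# collapse to a single row even though its rows arrive via appends.
import pandas as pd

df1 = pd.DataFrame({"grouping_column": ["0"], "a": [0.0]})
df2 = pd.DataFrame({"grouping_column": ["0"], "a": [0.0]})
df3 = pd.DataFrame({"grouping_column": ["0", "00"], "a": [0.0, 0.0]})

combined = pd.concat([df1, df2, df3], ignore_index=True)
expected = combined.groupby("grouping_column").agg({"a": "first"})
# Two groups only: "0" and "00".
```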

I tried replicating this behavior in a basic test without hypothesis, but that one does produce the right output dataframe. It seems that something is happening in the PartitionClause at the repartition level which makes it behave this way; I'm not sure what that could be yet...

@Hind-M
Collaborator Author

Hind-M commented Feb 29, 2024

Update: For some reason, not all the values in grouping_column equal to "0" end up in the same bucket...

@Hind-M Hind-M force-pushed the first_groupby_str branch from aa673df to 848ab32 Compare March 18, 2024 16:22
Successfully merging this pull request may close these issues.

Add support for first and last aggregators - string columns
3 participants