Support for first and last aggregators - string columns
#1151
base: master
Can we add some tests with None and NaN values in the aggregation columns?
Can we also make the *_with_append tests a bit more complicated, possibly using hypothesis and lmdb_version_store_tiny_segment?
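As a rough sketch of the edge cases such tests would exercise, the snippet below models a string column with missing entries and computes first and last over it. It assumes missing values are skipped, which is how pandas' groupby first/last behave by default; whether the aggregators in this PR follow that rule is exactly what the tests should pin down. `Cell`, `first_present`, `last_present` and `check_missing_value_cases` are hypothetical names for illustration, not part of ArcticDB.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one string cell: std::nullopt stands in for None/NaN.
using Cell = std::optional<std::string>;

// First non-missing value in the column, or nullopt if every cell is missing.
// ASSUMPTION: missing values are skipped rather than returned.
inline Cell first_present(const std::vector<Cell>& column) {
    for (const auto& cell : column) {
        if (cell.has_value()) {
            return cell;
        }
    }
    return std::nullopt;
}

// Last non-missing value in the column, scanning from the back.
inline Cell last_present(const std::vector<Cell>& column) {
    for (auto it = column.rbegin(); it != column.rend(); ++it) {
        if (it->has_value()) {
            return *it;
        }
    }
    return std::nullopt;
}

// The cases a test would want to cover: missing values at the edges,
// and a group in which every value is missing.
inline void check_missing_value_cases() {
    const std::vector<Cell> mixed{std::nullopt, std::string("a"), std::string("b"), std::nullopt};
    assert(*first_present(mixed) == "a");
    assert(*last_present(mixed) == "b");

    const std::vector<Cell> all_missing{std::nullopt, std::nullopt};
    assert(!first_present(all_missing).has_value());
    assert(!last_present(all_missing).has_value());
}
```

The all-missing group is the case most likely to differ between implementations, so it is worth asserting on explicitly.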
cpp/arcticdb/pipeline/read_frame.cpp
Outdated
@@ -1013,7 +1013,7 @@ std::unique_ptr<StringReducer> get_string_reducer(
 const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
 string_reducer = std::make_unique<UnicodeConvertingStringReducer>(column, context, frame, frame_field, alloc_width);
 } else {
-const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, false);
+const auto alloc_width = get_max_string_size_in_column(column.data().buffer(), context, frame, frame_field, slice_map, true);
This will be prohibitively expensive for large dataframes with fixed-width strings that have not been aggregated, for which false would be sufficient. You need to pass into this method the information about what this parameter should be.
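One way to read this suggestion, sketched under stated assumptions: the caller tells the function whether a full scan of the values is needed, so the expensive path only runs for columns that require it. Everything below is a hypothetical stand-in (`FixedWidthColumn`, `WidthScan`, `alloc_width_for`); it does not reproduce the real signature of `get_max_string_size_in_column`, and it assumes the scan is only needed for columns produced by an aggregation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a fixed-width string column.
struct FixedWidthColumn {
    std::size_t declared_width;       // width recorded in the column's type
    std::vector<std::string> values;  // the stored strings
};

// Hypothetical flag that the caller passes in, instead of hard-coding true/false.
enum class WidthScan { UseDeclared, ScanValues };

// Returns the per-string width to allocate for the output buffer.
// UseDeclared is O(1); ScanValues walks every row, so it is O(rows).
inline std::size_t alloc_width_for(const FixedWidthColumn& column, WidthScan scan) {
    if (scan == WidthScan::UseDeclared) {
        return column.declared_width;
    }
    std::size_t max_len = 0;
    for (const auto& value : column.values) {
        max_len = std::max(max_len, value.size());
    }
    return max_len;
}

// Usage: a plain read trusts the declared width, while a read of an
// aggregated column (ASSUMPTION) asks for the scan.
inline void check_alloc_width() {
    const FixedWidthColumn column{8, {"a", "abcd", "ab"}};
    assert(alloc_width_for(column, WidthScan::UseDeclared) == 8);
    assert(alloc_width_for(column, WidthScan::ScanValues) == 4);
}
```

An enum rather than a bare bool keeps call sites readable, which matters here because the original diff flips a positional `false` to `true` with no indication of what it controls.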
A hacky solution would be to set the orig_type in the aggregator/clause so that was_coerced_from_dynamic_to_fixed is true in this case, then add a comment that this is a big hack. Long term we want all reads to go through the ComponentManager, at which point it will decide what to do, so this is OK for now.
I first tried the hacky solution, but it just adds complexity: the final dataframe that the tests compare against the expected one ends up with different indexes and differently encoded data, which requires further hacks.
I think the first solution is fair enough (see this commit).
I added
Using hypothesis in
df2:
df3:
using:
Outputs are:
I tried replicating this behavior with a basic test without using hypothesis, but it does give the right expected output dataframe. It seems that something is happening in the |
Update: For some reason, all the values in the |
Capture this directly in fcts
Fix #1105