fix: Optimize some functions to rewrite dictionary-encoded strings #627
Conversation
Thanks @vaibhawvipul. This is looking good. There is also this section in fn cast_array:

    fn cast_array(&self, array: ArrayRef) -> DataFusionResult<ArrayRef> {
        let to_type = &self.data_type;
        let array = array_with_timezone(array, self.timezone.clone(), Some(to_type))?;
        let from_type = array.data_type().clone();
        // unpack dictionary string arrays first
        // TODO: we are unpacking a dictionary-encoded array and then performing
        // the cast. We could potentially improve performance here by casting the
        // dictionary values directly without unpacking the array first, although this
        // would add more complexity to the code
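To make the TODO concrete, here is a minimal sketch (not the PR's actual implementation) of casting only the dictionary's values and reusing its keys, rather than unpacking the whole array first. It assumes a recent arrow-rs with DictionaryArray::try_new, Int32 keys, string values, and a hypothetical caller-supplied cast_values closure standing in for Comet's cast logic.

```rust
// A sketch only: cast the dictionary's values and reuse its keys, instead of
// unpacking to a plain array first. Assumes Int32 dictionary keys and string
// values; `cast_values` is a hypothetical stand-in for Comet's cast logic.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

fn cast_dict_values(
    array: &ArrayRef,
    cast_values: impl Fn(&StringArray) -> Result<ArrayRef, ArrowError>,
) -> Result<ArrayRef, ArrowError> {
    let dict = array
        .as_any()
        .downcast_ref::<DictionaryArray<Int32Type>>()
        .expect("expected a dictionary-encoded array");
    let values = dict
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("expected string dictionary values");
    // Cast only the (usually much smaller) values array...
    let new_values = cast_values(values)?;
    // ...then rebuild the dictionary with the original keys, so repeated
    // strings are cast exactly once.
    let casted = DictionaryArray::<Int32Type>::try_new(dict.keys().clone(), new_values)?;
    Ok(Arc::new(casted))
}
```

Because the values array is typically much smaller than the full column, casting it once and keeping the keys avoids re-casting repeated strings.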
This is done @andygrove. Thank you for pointing this out.
There is a test failure:

fixed
    );

    let casted_result = match to_type {
        DataType::Dictionary(_, _) => Arc::new(casted_dictionary.clone()),
I don't know if we would currently hit the use case of casting to a dictionary type, or whether this is something that could help us in the future (it seems like we could take advantage of this when running natively even though Spark doesn't support keeping data dictionary-encoded after a cast, as far as I know, but maybe @viirya could confirm).
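For reference, a hedged sketch of the branch under discussion, assuming arrow-rs and Int32 dictionary keys: if the requested output type is itself a dictionary type, the rewritten DictionaryArray is kept as-is; otherwise it is materialized to the plain target type. The finish_cast name and the use of arrow's generic cast kernel are illustrative, not Comet's actual code path.

```rust
// Sketch of the `match to_type` branch: keep the dictionary encoding when the
// requested output type is a dictionary, otherwise flatten to the plain type.
// `finish_cast` and the use of arrow's generic `cast` kernel are illustrative,
// not Comet's actual code path.
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;

fn finish_cast(
    casted_dictionary: DictionaryArray<Int32Type>,
    to_type: &DataType,
) -> Result<ArrayRef, ArrowError> {
    let casted_result = match to_type {
        // Caller asked for a dictionary: keep the data dictionary-encoded.
        DataType::Dictionary(_, _) => Arc::new(casted_dictionary) as ArrayRef,
        // Otherwise materialize a plain array of the requested type.
        _ => cast(&casted_dictionary, to_type)?,
    };
    Ok(casted_result)
}
```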
Awaiting @viirya's comment before making any change.
LGTM. Thanks for refactoring the code @vaibhawvipul, it is looking much cleaner now. I would like to see the additional test that I mentioned before we merge this.
I think this is ready to merge, but I'd like to get another review. Perhaps @parthchandra, @huaxingao, @kazuyukitanimura, or @viirya could take a look.
    } else {
        cast_array.append_null()
    }
    if to_type != &DataType::Date32 {
I don't see a logic change here. Is this just refactoring?
yes, no logic change here.
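For context, a minimal, hypothetical sketch of the kind of loop the snippet above sits in: strings that fail to parse append a null, and target types other than Date32 are rejected up front. The parse_date_to_days helper and its chrono-based, ISO-only parsing are purely illustrative and not Comet's Spark-compatible date handling.

```rust
// Hypothetical sketch of the surrounding loop: only Date32 is handled, and
// strings that fail to parse append a null. `parse_date_to_days` (chrono-based,
// ISO dates only) is illustrative and not Comet's Spark-compatible parsing.
use arrow::array::{Array, Date32Array, Date32Builder, StringArray};
use arrow::datatypes::DataType;
use chrono::Datelike;

fn cast_strings_to_date32(input: &StringArray, to_type: &DataType) -> Option<Date32Array> {
    if to_type != &DataType::Date32 {
        return None; // only Date32 targets are handled in this sketch
    }
    let mut cast_array = Date32Builder::with_capacity(input.len());
    for i in 0..input.len() {
        if input.is_null(i) {
            cast_array.append_null();
        } else if let Some(days) = parse_date_to_days(input.value(i)) {
            cast_array.append_value(days);
        } else {
            // Unparseable input becomes null.
            cast_array.append_null();
        }
    }
    Some(cast_array.finish())
}

// Days since the Unix epoch (1970-01-01); 719_163 is that date's num_days_from_ce.
fn parse_date_to_days(s: &str) -> Option<i32> {
    chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
        .ok()
        .map(|d| d.num_days_from_ce() - 719_163)
}
```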
lgtm
Thanks @vaibhawvipul and thanks for the review @parthchandra.
…pache#627)

* dedup code
* transforming the dict directly
* code optimization for cast string to timestamp
* minor optimizations
* fmt fixes and casting to dict array without unpacking to array first
* bug fixes
* revert unrelated change
* Added test case and code refactor
* minor optimization
* minor optimization again
* convert the cast to array
* Revert "convert the cast to array" (this reverts commit 9270aed)
* bug fixes
* rename the test to cast_dict_to_timestamp arr
Which issue does this PR close?
Closes #504 .
Rationale for this change
Improves performance by casting dictionary-encoded string arrays directly instead of unpacking them to plain string arrays before the cast.
What changes are included in this PR?
Rewrites dictionary-encoded string arrays directly during casts (e.g. string to timestamp), casting the dictionary values without unpacking the array first, and dedups/refactors the related cast code.
How are these changes tested?
A new test case was added for casting dictionary-encoded string arrays, and all existing test cases pass.
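As an illustration only (not the PR's actual test), the scenario the added test exercises looks roughly like this: a dictionary-encoded string column cast to a timestamp type. The sketch assumes a recent arrow-rs and uses arrow's generic cast kernel rather than Comet's cast implementation; the literal values and key layout are made up.

```rust
// Illustration only (not the PR's test): cast a dictionary-encoded string
// column to timestamps. Assumes a recent arrow-rs; uses arrow's generic `cast`
// kernel rather than Comet's cast implementation, and the values are made up.
use std::sync::Arc;

use arrow::array::{
    Array, ArrayRef, DictionaryArray, Int32Array, StringArray, TimestampMicrosecondArray,
};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type, TimeUnit};

fn main() {
    // Two distinct values repeated across four rows, so dictionary encoding pays off.
    let values = StringArray::from(vec!["2020-01-01T00:00:00", "2020-01-02T00:00:00"]);
    let keys = Int32Array::from(vec![0, 1, 0, 1]);
    let dict: DictionaryArray<Int32Type> =
        DictionaryArray::try_new(keys, Arc::new(values)).expect("valid dictionary");

    // Cast the dictionary-encoded strings to microsecond timestamps.
    let casted: ArrayRef = cast(&dict, &DataType::Timestamp(TimeUnit::Microsecond, None))
        .expect("cast should succeed");

    let ts = casted
        .as_any()
        .downcast_ref::<TimestampMicrosecondArray>()
        .expect("timestamp output");
    assert_eq!(ts.len(), 4);
}
```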