Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Optimize some functions to rewrite dictionary-encoded strings #627

Merged
merged 16 commits into from
Jul 15, 2024

Conversation

vaibhawvipul
Copy link
Contributor

Which issue does this PR close?

Closes #504 .

Rationale for this change

Improves performance

What changes are included in this PR?

rewrite dictionary encoded strings

How are these changes tested?

All test cases pass.

@vaibhawvipul vaibhawvipul marked this pull request as ready for review July 4, 2024 13:43
@andygrove
Copy link
Member

Thanks @vaibhawvipul. This is looking good. There is also this section in cast.rs:

    fn cast_array(&self, array: ArrayRef) -> DataFusionResult<ArrayRef> {
        let to_type = &self.data_type;
        let array = array_with_timezone(array, self.timezone.clone(), Some(to_type))?;
        let from_type = array.data_type().clone();

        // unpack dictionary string arrays first
        // TODO: we are unpacking a dictionary-encoded array and then performing
        // the cast. We could potentially improve performance here by casting the
        // dictionary values directly without unpacking the array first, although this
        // would add more complexity to the code

@vaibhawvipul
Copy link
Contributor Author

Thanks @vaibhawvipul. This is looking good. There is also this section in cast.rs:

    fn cast_array(&self, array: ArrayRef) -> DataFusionResult<ArrayRef> {
        let to_type = &self.data_type;
        let array = array_with_timezone(array, self.timezone.clone(), Some(to_type))?;
        let from_type = array.data_type().clone();

        // unpack dictionary string arrays first
        // TODO: we are unpacking a dictionary-encoded array and then performing
        // the cast. We could potentially improve performance here by casting the
        // dictionary values directly without unpacking the array first, although this
        // would add more complexity to the code

This is done @andygrove . Thank you for pointing this out.

@andygrove
Copy link
Member

There is a test failure:

Cause: org.apache.comet.CometNativeException: Unsupported data type Dictionary(Int32, Binary) for function md5

@vaibhawvipul
Copy link
Contributor Author

There is a test failure:

Cause: org.apache.comet.CometNativeException: Unsupported data type Dictionary(Int32, Binary) for function md5

fixed

@vaibhawvipul vaibhawvipul requested a review from andygrove July 12, 2024 05:42
);

let casted_result = match to_type {
DataType::Dictionary(_, _) => Arc::new(casted_dictionary.clone()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we would currently hit the use case of casting to a dictionary type, or whether this is something that could help us in the future (it seems like we could take advantage of this when running natively even though Spark doesn't support keeping data dictionary-encoded after a cast, as far as I know, but maybe @viirya could confirm).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awaiting @viirya 's comment before making any change.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for refactoring the code @vaibhawvipul, it is looking much cleaner now. I would like to see the additional test that I mentioned before we merge this.

@andygrove
Copy link
Member

I think this is ready to merge, but I'd like to get a another review. Perhaps @parthchandra @huaxingao @kazuyukitanimura or @viirya could take a look

} else {
cast_array.append_null()
}
if to_type != &DataType::Date32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a logic change here. Is this just refactoring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, no logic change here.

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andygrove
Copy link
Member

Thanks @vaibhawvipul and thanks for the review @parthchandra

@andygrove andygrove merged commit 086513c into apache:main Jul 15, 2024
74 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…pache#627)

* dedup code

* transforming the dict directly

* code optimization for cast string to timestamp

* minor optimizations

* fmt fixes and casting to dict array without unpacking to array first

* bug fixes

* revert unrelated change

* Added test case and code refactor

* minor optimization

* minor optimization again

* convert the cast to array

* Revert "convert the cast to array"

This reverts commit 9270aed.

* bug fixes

* rename the test to cast_dict_to_timestamp arr
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf: Optimize some functions to rewrite dictionary-encoded strings
3 participants