Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support murmur3_hash and sha2 family hash functions #226

Merged
merged 4 commits into from
Apr 26, 2024

Conversation

advancedxy
Copy link
Contributor

@advancedxy advancedxy commented Mar 22, 2024

Which issue does this PR close?

This partially closes #205

Rationale for this change

More expression coverage for comet

What changes are included in this PR?

  1. add user_defined scalar function: spark_murmur3_hash to support murmur3_hash in comet
  2. refine and re-enable md5 expression test.
  3. bridge sha2 family functions to Datafusion's built-in scalar functions
  4. glue code to convert Spark expression to proto messages
  5. new test code

How are these changes tested?

Added new test code

@codecov-commenter
Copy link

codecov-commenter commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 33.35%. Comparing base (8aab44c) to head (1c63018).
Report is 5 commits behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #226      +/-   ##
============================================
- Coverage     33.41%   33.35%   -0.06%     
- Complexity      768      770       +2     
============================================
  Files           107      107              
  Lines         36329    37057     +728     
  Branches       7935     8110     +175     
============================================
+ Hits          12138    12361     +223     
- Misses        21643    22097     +454     
- Partials       2548     2599      +51     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@advancedxy
Copy link
Contributor Author

All cleared, cc @sunchao @viirya PTAL.

@@ -1350,7 +1350,7 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
scalarExprToProto("lower", childExpr)

case Md5(child) =>
val childExpr = exprToProtoInternal(Cast(child, StringType), inputs)
val childExpr = exprToProtoInternal(child, inputs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to add a cast here? As Spark will perform the cast to Binary type and DataFusion supports both utf8 and binary input now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is binary support added to DataFusion Md5 recently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, I think binary support is added by this PR: apache/datafusion#3124, which is pretty old.

Anyway, currently all the digest func supports both utf8 and binary input.

@advancedxy advancedxy changed the title feat: Support murmur3_hash, md5 and sha2 hash functions feat: Support murmur3_hash and sha2 family hash functions Mar 22, 2024
@advancedxy
Copy link
Contributor Author

I have rebased again the latest main branch. cc @sunchao @viirya PTAL.

@advancedxy
Copy link
Contributor Author

Gently ping @sunchao and @viirya.

@viirya
Copy link
Member

viirya commented Apr 17, 2024

I will review this soon. Thanks.

@@ -983,8 +983,7 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
}
}

// TODO: enable this when we add md5 function to Comet
ignore("md5") {
test("md5") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I remember we explicitly disable it because DataFusion crypto_expressions feature includes blake3 which cannot be built on Mac platform.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see you add crypto_expressions to Cargo.toml, how does it work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think crypto_expressions is enabled as default features in DataFusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we explicitly disable it because DataFusion crypto_expressions feature includes blake3 which cannot be built on Mac platform.

Do you have any issues to track this one? I think I can build DataFusion by default on Apple Silicon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think crypto_expressions is enabled as default features in DataFusion?

Yes, but we don't use default features:

default-features = false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we explicitly disable it because DataFusion crypto_expressions feature includes blake3 which cannot be built on Mac platform.

Do you have any issues to track this one? I think I can build DataFusion by default on Apple Silicon.

Not sure if the crate is updated to fix that. We encountered the issue and disabled crypto_expressions one year ago (internally, before we open sourced Comet).

Maybe it is okay now. But I'm wondering as we don't add back crypto_expressions feature, is md5 function working? I think it is guarded by this feature in DataFusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm wondering as we don't add back crypto_expressions feature, is md5 function working? I think it is guarded by this feature in DataFusion.

That's new.
I did a quick digging. It seems that the crypto expressions are enabled in datafusion-functions as it's enabled as default features.

Let me ensure the crypto_expressions feature is enabled then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is md5 function working?

I manually checked with the web ui and the unit test passes:
image

@@ -1646,6 +1647,37 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
None
}

case Murmur3Hash(children, seed) if children.forall(c => supportedDataType(c.dataType)) =>
// TODO: support list/map/struct type for murmur3 hash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO seems unnecessary as other expressions also don't support nested types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
Let me create an issue to track the complex(list/map/struct) type support then? I think we can start by supporting them in literal and scalar expressions.

@advancedxy
Copy link
Contributor Author

@viirya do you have any other comments?

@viirya
Copy link
Member

viirya commented Apr 24, 2024

I will take another look today.

@viirya viirya merged commit 8485558 into apache:main Apr 26, 2024
28 checks passed
@viirya
Copy link
Member

viirya commented Apr 26, 2024

Merged. Thanks.

@advancedxy
Copy link
Contributor Author

Merged. Thanks.

Thanks for your comments and review.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: Support murmur3_hash and sha2 family hash functions

* address comments

* apply scalafix

* ensure crypto_expressions feature is enabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Epic] Hash Expressions support
3 participants