feat: Support murmur3_hash and sha2 family hash functions #226
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #226 +/- ##
============================================
- Coverage 33.41% 33.35% -0.06%
- Complexity 768 770 +2
============================================
Files 107 107
Lines 36329 37057 +728
Branches 7935 8110 +175
============================================
+ Hits 12138 12361 +223
- Misses 21643 22097 +454
- Partials 2548 2599 +51
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
@@ -1350,7 +1350,7 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
         scalarExprToProto("lower", childExpr)

       case Md5(child) =>
-        val childExpr = exprToProtoInternal(Cast(child, StringType), inputs)
+        val childExpr = exprToProtoInternal(child, inputs)
I don't think we need a cast here. Spark already casts the input to binary type, and DataFusion now supports both utf8 and binary input.
Was binary support added to DataFusion's md5 recently?
Hmm, I think binary support was added by this PR: apache/datafusion#3124, which is pretty old.
Anyway, all the digest functions currently support both utf8 and binary input.
1c63018 to e55f8fc
e55f8fc to 8931c10
I will review this soon. Thanks.
@@ -983,8 +983,7 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     }
   }

-  // TODO: enable this when we add md5 function to Comet
-  ignore("md5") {
+  test("md5") {
Hmm, I remember we explicitly disabled it because DataFusion's crypto_expressions feature includes blake3, which cannot be built on the Mac platform.
I don't see crypto_expressions added to Cargo.toml. How does this work?
Hmm, I think crypto_expressions is enabled as a default feature in DataFusion?
> I remember we explicitly disabled it because DataFusion's crypto_expressions feature includes blake3, which cannot be built on the Mac platform.
Is there an issue tracking this? I think I can build DataFusion with its default features on Apple Silicon.
> Hmm, I think crypto_expressions is enabled as a default feature in DataFusion?
Yes, but we don't use the default features:
default-features = false
> > I remember we explicitly disabled it because DataFusion's crypto_expressions feature includes blake3, which cannot be built on the Mac platform.
> Is there an issue tracking this? I think I can build DataFusion with its default features on Apple Silicon.
I'm not sure whether the crate has been updated to fix that. We hit the issue and disabled crypto_expressions a year ago (internally, before we open-sourced Comet).
Maybe it is okay now. But since we haven't added the crypto_expressions feature back, does the md5 function actually work? I think it is guarded by this feature in DataFusion.
> But since we haven't added the crypto_expressions feature back, does the md5 function actually work? I think it is guarded by this feature in DataFusion.
That's new.
I did some quick digging. It seems the crypto expressions are enabled in datafusion-functions because they are part of its default features.
Let me make sure the crypto_expressions feature is enabled, then.
@@ -1646,6 +1647,37 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
           None
         }

+      case Murmur3Hash(children, seed) if children.forall(c => supportedDataType(c.dataType)) =>
+        // TODO: support list/map/struct type for murmur3 hash
This TODO seems unnecessary as other expressions also don't support nested types.
I see.
Should I create an issue to track support for complex (list/map/struct) types, then? I think we can start by supporting them in literal and scalar expressions.
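For context on what `Murmur3Hash(children, seed)` computes, here is a minimal Rust sketch of Spark-style Murmur3 (x86, 32-bit) hashing, restricted to 32-bit integer inputs. It assumes the usual behaviour of Spark's hash expression: each non-null child is hashed with the running hash as its seed, and a null child leaves the running hash unchanged. The function names (`murmur3_hash_i32`, `hash_row`) are hypothetical and are not taken from this PR's `spark_murmur3_hash` implementation; complex types (list/map/struct) are deliberately out of scope, as discussed above.

```rust
// Murmur3 x86_32 mixing constants.
const C1: u32 = 0xcc9e_2d51;
const C2: u32 = 0x1b87_3593;

// Scramble one 4-byte block of input.
fn mix_k1(k1: u32) -> u32 {
    k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2)
}

// Fold a scrambled block into the running hash state.
fn mix_h1(h1: u32, k1: u32) -> u32 {
    (h1 ^ k1)
        .rotate_left(13)
        .wrapping_mul(5)
        .wrapping_add(0xe654_6b64)
}

// Final avalanche step; `len` is the input length in bytes.
fn fmix(mut h1: u32, len: u32) -> u32 {
    h1 ^= len;
    h1 ^= h1 >> 16;
    h1 = h1.wrapping_mul(0x85eb_ca6b);
    h1 ^= h1 >> 13;
    h1 = h1.wrapping_mul(0xc2b2_ae35);
    h1 ^= h1 >> 16;
    h1
}

// Hash a single 32-bit integer (4 bytes) with the given seed.
fn murmur3_hash_i32(value: i32, seed: u32) -> u32 {
    let k1 = mix_k1(value as u32);
    let h1 = mix_h1(seed, k1);
    fmix(h1, 4)
}

// Hypothetical helper: hash one row of nullable i32 columns.
// Each non-null value is hashed with the previous result as its seed;
// nulls are skipped, so they do not change the running hash.
fn hash_row(values: &[Option<i32>], seed: u32) -> i32 {
    let mut h = seed;
    for v in values {
        if let Some(x) = v {
            h = murmur3_hash_i32(*x, h);
        }
    }
    h as i32
}
```

For example, `hash_row(&[Some(1), None, Some(2)], 42)` chains two integer hashes starting from seed 42 (the default seed of Spark's `hash` function); the null in the middle does not affect the result.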
8931c10 to 991fec0
@viirya do you have any other comments?
I will take another look today.
Merged. Thanks.
Thanks for your comments and review.
* feat: Support murmur3_hash and sha2 family hash functions
* address comments
* apply scalafix
* ensure crypto_expressions feature is enabled
Which issue does this PR close?
This partially closes #205
Rationale for this change
More expression coverage for comet
What changes are included in this PR?
spark_murmur3_hash to support murmur3_hash in comet
How are these changes tested?
Added new test code