-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: correlation support #456
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #456 +/- ##
============================================
- Coverage 34.02% 34.02% -0.01%
- Complexity 857 859 +2
============================================
Files 116 116
Lines 38565 38671 +106
Branches 8517 8564 +47
============================================
+ Hits 13120 13156 +36
- Misses 22691 22753 +62
- Partials 2754 2762 +8 ☔ View full report in Codecov by Sentry. |
.setCorrelation(corrBuilder) | ||
.build()) | ||
} else { | ||
None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add withInfo()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I forgot this. Added.
sql(s"insert into $table values(1, 4, 1), (2, 5, 1), (3, 6, 2)") | ||
val expectedNumOfCometAggregates = 2 | ||
|
||
sql("SELECT corr(col1, col2) FROM test GROUP BY col3").show |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can reuse s$table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -1212,6 +1212,157 @@ class CometAggregateSuite extends CometTestBase with AdaptiveSparkPlanHelper { | |||
} | |||
} | |||
|
|||
test("correlation") { | |||
withSQLConf(CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") { | |||
Seq(false).foreach { cometColumnShuffleEnabled => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that we only test for cometColumnShuffleEnabled==false
? I have similar questions for parquet.enable.dictionary
and nullOnDivideByZero
, where we also iterate over a single boolean and don't test both for true and false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be both true
and false
. I removed one of them so it's easier for me to debug. Forgot to put back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merged. Thanks @huaxingao @andygrove @kazuyukitanimura |
Thanks @viirya @andygrove @kazuyukitanimura |
* feat: correlation support * fmt * remove un-used import * address comments * address comment --------- Co-authored-by: Huaxin Gao <[email protected]> (cherry picked from commit 6d23e28)
Which issue does this PR close?
Supports
CORR
The implementation mostly is the same as the DataFusion's implementation. The reason
we have our own implementation is that DataFusion has UInt64 for state_field count,
while Spark has Double for count. Also adding null_on_divide_by_zero
to be consistent with Spark's implementation.
Closes #.
Rationale for this change
What changes are included in this PR?
How are these changes tested?