feat: Port Datafusion Covariance to Comet #234

Merged
merged 9 commits into from
Apr 18, 2024

Conversation

huaxingao
Contributor

@huaxingao huaxingao commented Mar 26, 2024

Which issue does this PR close?

Closes #.
Port Covariance to Comet. We can't use DataFusion's Covariance implementation directly because the state of Spark's Covariance aggregate expression differs from DataFusion's state.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@huaxingao huaxingao marked this pull request as draft March 26, 2024 23:26
@codecov-commenter

Codecov Report

Attention: Patch coverage is 84.61538%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 33.36%. Comparing base (b0234a6) to head (06bbb36).
Report is 2 commits behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 84.61% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #234      +/-   ##
============================================
+ Coverage     33.32%   33.36%   +0.04%     
- Complexity      769      776       +7     
============================================
  Files           107      108       +1     
  Lines         37037    37179     +142     
  Branches       8106     8194      +88     
============================================
+ Hits          12341    12404      +63     
- Misses        22098    22154      +56     
- Partials       2598     2621      +23     

@huaxingao huaxingao force-pushed the port_covariance branch 2 times, most recently from 5324464 to 47604a3 Compare April 10, 2024 18:16
@huaxingao huaxingao marked this pull request as ready for review April 10, 2024 19:11
@huaxingao
Contributor Author

cc @andygrove @viirya

Comment on lines +41 to +42
expr1: Arc<dyn PhysicalExpr>,
expr2: Arc<dyn PhysicalExpr>,
Member

Are there better names for these two child expressions?

Contributor Author

Spark uses left and right.

Contributor Author

I will leave this as is unless you prefer left and right.

Comment on lines 156 to 160
fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
    Ok(Box::new(CovarianceAccumulator::try_new(
        StatsType::Population,
    )?))
}
Member

It looks like the only difference between CovariancePop and Covariance is the StatsType passed when creating the CovarianceAccumulator?

If so, maybe we only need Covariance, with StatsType as a parameter when creating it?

Contributor Author

https://github.com/apache/arrow-datafusion-comet/blob/955a0b9b9b026b477b5b527aba133d9e1402bd7e/core/src/execution/datafusion/expressions/covariance.rs#L368
Population covariance is computed over the entire dataset and divides by N, whereas sample covariance is computed over a sample and divides by N-1.
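
For illustration, the N versus N-1 distinction can be sketched as below. This is a minimal two-pass sketch, not the code at the link above; the names `co_moment`, `covar_pop`, and `covar_samp`, and the choice to return `None` for undefined results, are assumptions made for this example.

```rust
/// Sum of products of deviations from the means (the "co-moment").
/// Returns None when the inputs are empty or differ in length.
/// NOTE: hypothetical helper for illustration, not part of Comet or DataFusion.
fn co_moment(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.is_empty() || xs.len() != ys.len() {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    Some(
        xs.iter()
            .zip(ys.iter())
            .map(|(x, y)| (x - mean_x) * (y - mean_y))
            .sum(),
    )
}

/// Population covariance: divide the co-moment by N.
pub fn covar_pop(xs: &[f64], ys: &[f64]) -> Option<f64> {
    co_moment(xs, ys).map(|c| c / xs.len() as f64)
}

/// Sample covariance: divide the co-moment by N - 1.
/// With fewer than two rows the divisor would be zero, so this sketch returns None.
pub fn covar_samp(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() < 2 {
        return None;
    }
    co_moment(xs, ys).map(|c| c / (xs.len() - 1) as f64)
}
```

With xs = [1, 2, 3, 4] and ys = [2, 4, 6, 8] the co-moment is 10, so `covar_pop` gives 2.5 and `covar_samp` gives 10/3.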

Member

Yes, but the behavior is decided in CovarianceAccumulator based on its StatsType.

I mean these two structs, CovariancePop and Covariance. They are basically the same, except that the name and StatsType differ.

Member

It looks like we can simply have one Covariance struct and add a StatsType parameter to it.

Contributor Author

Changed. Thanks!
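
The consolidation discussed in this thread can be sketched roughly as follows. This is a self-contained illustration, not the merged code: it leaves out DataFusion's aggregate and accumulator traits, and the field names (`count`, `mean1`, `mean2`, `co_moment`), the constructor signature, and the `Option` return type are all assumptions made for the example.

```rust
/// Selects the divisor used when the accumulator is finalized.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StatsType {
    Population,
    Sample,
}

/// One aggregate description serving both COVAR_POP and COVAR_SAMP.
/// (Hypothetical shape; the real struct also carries child expressions.)
#[derive(Debug)]
pub struct Covariance {
    name: String,
    stats_type: StatsType,
}

impl Covariance {
    pub fn new(name: impl Into<String>, stats_type: StatsType) -> Self {
        Self { name: name.into(), stats_type }
    }

    pub fn name(&self) -> &str {
        &self.name
    }

    /// The only place where the two variants differ: the StatsType handed on.
    pub fn create_accumulator(&self) -> CovarianceAccumulator {
        CovarianceAccumulator::new(self.stats_type)
    }
}

/// Running state for an online covariance computation (assumed layout).
#[derive(Debug)]
pub struct CovarianceAccumulator {
    stats_type: StatsType,
    count: f64,
    mean1: f64,
    mean2: f64,
    co_moment: f64,
}

impl CovarianceAccumulator {
    pub fn new(stats_type: StatsType) -> Self {
        Self { stats_type, count: 0.0, mean1: 0.0, mean2: 0.0, co_moment: 0.0 }
    }

    /// Fold one (x, y) pair into the running means and co-moment.
    pub fn update(&mut self, x: f64, y: f64) {
        self.count += 1.0;
        let delta1 = x - self.mean1; // deviation from the previous mean of x
        self.mean1 += delta1 / self.count;
        self.mean2 += (y - self.mean2) / self.count;
        self.co_moment += delta1 * (y - self.mean2); // uses the updated mean of y
    }

    /// Divide by N or N - 1 depending on StatsType; None if the divisor is not positive.
    pub fn evaluate(&self) -> Option<f64> {
        let divisor = match self.stats_type {
            StatsType::Population => self.count,
            StatsType::Sample => self.count - 1.0,
        };
        if divisor > 0.0 {
            Some(self.co_moment / divisor)
        } else {
            None
        }
    }
}
```

Under this shape, `Covariance::new("covar_pop", StatsType::Population)` and `Covariance::new("covar_samp", StatsType::Sample)` share all code and differ only in the value passed at construction.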

Member

@viirya viirya left a comment

We should update EXPRESSIONS.md too.

PhysicalExpr,
};

/// COVAR_SAMP and COVAR_POP aggregate expression
Member

It would be better to mention why we need to port it to Comet.

Contributor Author

Comment added.

message CovSample {
  Expr child1 = 1;
  Expr child2 = 2;
  bool null_on_divide_by_zero = 3;
Member

We don't use this yet, right?

Member

I also don't see it specified on the JVM side.

Contributor Author

Right, this is not used yet. I will open a follow-up to address this.

Member

@viirya viirya left a comment

A few minor comments.

@viirya viirya merged commit 4710d62 into apache:main Apr 18, 2024
28 checks passed
@viirya
Member

viirya commented Apr 18, 2024

Merged. Thanks.

@huaxingao
Contributor Author

Thanks @viirya

@huaxingao huaxingao deleted the port_covariance branch April 18, 2024 01:22
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: Port Datafusion Covariance to Comet

* feat: Port Datafusion Covariance to Comet

* fmt

* update EXPRESSIONS.md

* combine COVAR_SAMP and COVAR_POP

* fix fmt

* address comment

---------

Co-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>