Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Variance #297

Merged
merged 5 commits into from
Apr 25, 2024
Merged

feat: Support Variance #297

merged 5 commits into from
Apr 25, 2024

Conversation

huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Apr 21, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Supports VAR_SAMP and VAR_POP
The implementation mostly is the same as the DataFusion's implementation. The reason
we have our own implementation is that DataFusion has UInt64 for state_field count,
while Spark has Double for count. Also adding null_on_divide_by_zero
to be consistent with Spark's implementation.

What changes are included in this PR?

How are these changes tested?

@huaxingao
Copy link
Contributor Author

cc @andygrove @viirya

.setVarianceSample(varBuilder)
.build())
} else {
None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the explainPlan info here as well?

.setVariancePopulation(varBuilder)
.build())
} else {
None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, add the explainPlan info here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thanks

Comment on lines 170 to 179
message VarianceSample {
Expr child = 1;
bool null_on_divide_by_zero = 2;
DataType datatype = 3;
}

message VariancePopulation {
Expr child = 1;
bool null_on_divide_by_zero = 2;
DataType datatype = 3;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to have the same struct defined twice here because we have separate field tags (14 and 15) in AggExpr to represent the two different expressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment @andygrove
I am not sure if there is a way to access the field tag. I added an enum StatisticsType so I don't need two structs for VarianceSample and VariancePopulation. Also I think adding a StatisticsType is more consistent with the implementation in variance.rs.

}
}

enum StatisticsType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @huaxingao

}

fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
let values = &cast(&values[0], &DataType::Float64)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to cast input array to Float64? Isn't it already Float64 array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariancePop's input type is DoubleType in Spark. I think we can be sure its input is Float64 array always.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. Thanks!

I have the same casting in covariance. There are a few problems I need to fix in covariance

  1. remove the unnecessary cast
  2. add null_on_divide_by_zero
  3. combine CovSample and CovPopulation in expr.proto

I will have a PR to fix these problems.

@viirya viirya merged commit 49bf503 into apache:main Apr 25, 2024
28 checks passed
@viirya
Copy link
Member

viirya commented Apr 25, 2024

Merged. Thanks @huaxingao @andygrove @parthchandra

@huaxingao
Copy link
Contributor Author

Thanks, everyone!

@huaxingao huaxingao deleted the variance branch April 25, 2024 18:33
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: Support Variance

* Add StatisticsType in expr.poto

* add explainPlan info and fix fmt

* remove iunnecessary cast

* remove unused import

---------

Co-authored-by: Huaxin Gao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants