-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support Variance #297
Conversation
.setVarianceSample(varBuilder) | ||
.build()) | ||
} else { | ||
None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add the explainPlan info here as well?
.setVariancePopulation(varBuilder) | ||
.build()) | ||
} else { | ||
None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, add the explainPlan info here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. Thanks
core/src/execution/proto/expr.proto
Outdated
message VarianceSample { | ||
Expr child = 1; | ||
bool null_on_divide_by_zero = 2; | ||
DataType datatype = 3; | ||
} | ||
|
||
message VariancePopulation { | ||
Expr child = 1; | ||
bool null_on_divide_by_zero = 2; | ||
DataType datatype = 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to have the same struct defined twice here because we have separate field tags (14 and 15) in AggExpr to represent the two different expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment @andygrove
I am not sure if there is a way to access the field tag. I added an enum StatisticsType
so I don't need two structs for VarianceSample
and VariancePopulation
. Also I think adding a StatisticsType
is more consistent with the implementation in variance.rs
.
} | ||
} | ||
|
||
enum StatisticsType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @huaxingao
} | ||
|
||
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
let values = &cast(&values[0], &DataType::Float64)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to cast input array to Float64? Isn't it already Float64 array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VariancePop
's input type is DoubleType
in Spark. I think we can be sure its input is Float64 array always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. Thanks!
I have the same casting in covariance. There are a few problems I need to fix in covariance
- remove the unnecessary cast
- add
null_on_divide_by_zero
- combine
CovSample
andCovPopulation
inexpr.proto
I will have a PR to fix these problems.
Merged. Thanks @huaxingao @andygrove @parthchandra |
Thanks, everyone! |
* feat: Support Variance * Add StatisticsType in expr.poto * add explainPlan info and fix fmt * remove iunnecessary cast * remove unused import --------- Co-authored-by: Huaxin Gao <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
Supports
VAR_SAMP
andVAR_POP
The implementation mostly is the same as the DataFusion's implementation. The reason
we have our own implementation is that DataFusion has UInt64 for state_field
count
,while Spark has Double for count. Also adding
null_on_divide_by_zero
to be consistent with Spark's implementation.
What changes are included in this PR?
How are these changes tested?