feat: Support Variance #297

huaxingao · 2024-04-21T16:10:55Z

Which issue does this PR close?

Closes #.

Rationale for this change

Supports VAR_SAMP and VAR_POP.
The implementation is mostly the same as DataFusion's. The reason
we have our own implementation is that DataFusion uses UInt64 for the count state field,
while Spark uses Double for count. This PR also adds null_on_divide_by_zero
to be consistent with Spark's implementation.
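
To illustrate the rationale, here is a minimal sketch of a variance state whose count is kept as an f64, with a sample-variance result that honors null_on_divide_by_zero. It is not the code in variance.rs: the names `VarianceState`, `update`, `merge` and `evaluate_sample` are hypothetical, the update and merge steps use the standard Welford/Chan formulas, and the single-row behavior (null or NaN) is my understanding of Spark's semantics rather than something stated in this PR.

```rust
/// Hypothetical running state for variance. The count is an f64 to mirror
/// the Double-typed count described above (DataFusion keeps it as UInt64).
#[derive(Debug, Clone, Copy, Default)]
struct VarianceState {
    count: f64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl VarianceState {
    /// Welford's online update for one non-null input value.
    fn update(&mut self, value: f64) {
        self.count += 1.0;
        let delta = value - self.mean;
        self.mean += delta / self.count;
        self.m2 += delta * (value - self.mean);
    }

    /// Merge a partial state produced elsewhere (Chan et al. formula).
    fn merge(&mut self, other: &VarianceState) {
        if other.count == 0.0 {
            return;
        }
        let total = self.count + other.count;
        let delta = other.mean - self.mean;
        self.mean += delta * other.count / total;
        self.m2 += other.m2 + delta * delta * self.count * other.count / total;
        self.count = total;
    }

    /// Sample variance. `None` stands in for SQL NULL.
    /// Assumption: with exactly one row the divisor (count - 1) is zero, so the
    /// result is NULL when the flag is set and NaN otherwise.
    fn evaluate_sample(&self, null_on_divide_by_zero: bool) -> Option<f64> {
        if self.count == 0.0 {
            None
        } else if self.count == 1.0 {
            if null_on_divide_by_zero { None } else { Some(f64::NAN) }
        } else {
            Some(self.m2 / (self.count - 1.0))
        }
    }
}
```

Feeding 1.0, 2.0, 3.0 and 4.0 through `update` leaves `m2` at 5.0, so `evaluate_sample` returns roughly 1.667 regardless of the flag; the flag only matters when a group contains a single row.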

What changes are included in this PR?

How are these changes tested?

huaxingao · 2024-04-22T15:13:56Z

cc @andygrove @viirya

parthchandra · 2024-04-22T21:03:33Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

+              .setVarianceSample(varBuilder)
+              .build())
+        } else {
+          None


Could you add the explainPlan info here as well?

parthchandra · 2024-04-22T21:03:58Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

+              .setVariancePopulation(varBuilder)
+              .build())
+        } else {
+          None


ditto, add the explainPlan info here as well

Added. Thanks

andygrove · 2024-04-23T14:02:12Z

core/src/execution/proto/expr.proto

+message VarianceSample {
+  Expr child = 1;
+  bool null_on_divide_by_zero = 2;
+  DataType datatype = 3;
+}
+
+message VariancePopulation {
+  Expr child = 1;
+  bool null_on_divide_by_zero = 2;
+  DataType datatype = 3;
+}


I don't think we need to define the same struct twice here, since we already have separate field tags (14 and 15) in AggExpr to represent the two different expressions.

Thanks for your comment, @andygrove.
I am not sure if there is a way to access the field tag. I added an enum StatisticsType so I don't need two structs for VarianceSample and VariancePopulation. I also think adding a StatisticsType is more consistent with the implementation in variance.rs.
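
The idea in this reply can be sketched in Rust as one expression description tagged with an enum, instead of two structurally identical types. This is only an illustration under assumptions: `StatsType`, its variant names, `VarianceExpr` and `finish` are hypothetical and are not taken from expr.proto or variance.rs, and the handling of empty and single-row groups is my reading of Spark's behavior.

```rust
/// Hypothetical Rust-side counterpart of a StatisticsType enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StatsType {
    Sample,
    Population,
}

/// One description covers both VAR_SAMP and VAR_POP; only the tag differs.
#[derive(Debug, Clone, Copy, PartialEq)]
struct VarianceExpr {
    stats_type: StatsType,
    null_on_divide_by_zero: bool,
}

/// Turn a final (count, m2) pair into a result; `None` stands in for SQL NULL.
fn finish(expr: &VarianceExpr, count: f64, m2: f64) -> Option<f64> {
    if count == 0.0 {
        return None; // no rows: NULL for both flavors (assumed)
    }
    match expr.stats_type {
        StatsType::Population => Some(m2 / count),
        // A single row makes the sample divisor zero.
        StatsType::Sample if count == 1.0 => {
            if expr.null_on_divide_by_zero { None } else { Some(f64::NAN) }
        }
        StatsType::Sample => Some(m2 / (count - 1.0)),
    }
}
```

With this shape, adding another statistic that shares the same state only requires handling the tag in one place, which is the consistency argument made above.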

andygrove · 2024-04-24T15:15:49Z

core/src/execution/proto/expr.proto

  }
 }

+enum StatisticsType {


andygrove

LGTM. Thanks @huaxingao

viirya · 2024-04-24T19:32:40Z

core/src/execution/datafusion/expressions/variance.rs

+    }
+
+    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
+        let values = &cast(&values[0], &DataType::Float64)?;


Why do we need to cast the input array to Float64? Isn't it already a Float64 array?

VariancePop's input type is DoubleType in Spark. I think we can be sure its input is always a Float64 array.

Removed. Thanks!

I have the same casting in covariance. There are a few problems I need to fix in covariance:

* remove the unnecessary cast
* add null_on_divide_by_zero
* combine CovSample and CovPopulation in expr.proto

I will have a PR to fix these problems.

viirya · 2024-04-25T18:12:46Z

Merged. Thanks @huaxingao @andygrove @parthchandra

huaxingao · 2024-04-25T18:33:08Z

Thanks, everyone!

* feat: Support Variance
* Add StatisticsType in expr.poto
* add explainPlan info and fix fmt
* remove iunnecessary cast
* remove unused import

---------

Co-authored-by: Huaxin Gao <[email protected]>

parthchandra reviewed Apr 22, 2024

andygrove reviewed Apr 23, 2024

huaxingao force-pushed the variance branch from b774461 to c250bea on April 24, 2024 00:58

Huaxin Gao added 3 commits April 23, 2024 18:00

feat: Support Variance

e183c0c

Add StatisticsType in expr.poto

9c1c808

add explainPlan info and fix fmt

862d20a

huaxingao force-pushed the variance branch from c250bea to 862d20a on April 24, 2024 01:14

andygrove reviewed Apr 24, 2024

core/src/execution/proto/expr.proto

}

}

enum StatisticsType {

andygrove Apr 24, 2024

👍

andygrove approved these changes Apr 24, 2024

viirya reviewed Apr 24, 2024

Huaxin Gao added 2 commits April 24, 2024 13:34

remove iunnecessary cast

d11f47e

remove unused import

86f68a4

viirya approved these changes Apr 24, 2024

viirya merged commit 49bf503 into apache:main Apr 25, 2024
28 checks passed

huaxingao deleted the variance branch April 25, 2024 18:33
