fix: coalesce should return correct datatype #168

Merged 2 commits on Mar 5, 2024
47 changes: 28 additions & 19 deletions spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
@@ -349,6 +349,29 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
       expr: Expression,
       input: Seq[Attribute],
       binding: Boolean = true): Option[Expr] = {
+    def castToProto(
+        timeZoneId: Option[String],
+        dt: DataType,
+        childExpr: Option[Expr]): Option[Expr] = {
+      val dataType = serializeDataType(dt)
+
+      if (childExpr.isDefined && dataType.isDefined) {
+        val castBuilder = ExprOuterClass.Cast.newBuilder()
+        castBuilder.setChild(childExpr.get)
+        castBuilder.setDatatype(dataType.get)
+
+        val timeZone = timeZoneId.getOrElse("UTC")
+        castBuilder.setTimezone(timeZone)
+
+        Some(
+          ExprOuterClass.Expr
+            .newBuilder()
+            .setCast(castBuilder)
+            .build())
+      } else {
+        None
+      }
+    }

     def exprToProtoInternal(expr: Expression, inputs: Seq[Attribute]): Option[Expr] = {
       SQLConf.get
@@ -363,24 +386,7 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {

         case Cast(child, dt, timeZoneId, _) =>
           val childExpr = exprToProtoInternal(child, inputs)
-          val dataType = serializeDataType(dt)
-
-          if (childExpr.isDefined && dataType.isDefined) {
-            val castBuilder = ExprOuterClass.Cast.newBuilder()
-            castBuilder.setChild(childExpr.get)
-            castBuilder.setDatatype(dataType.get)
-
-            val timeZone = timeZoneId.getOrElse("UTC")
-            castBuilder.setTimezone(timeZone)
-
-            Some(
-              ExprOuterClass.Expr
-                .newBuilder()
-                .setCast(castBuilder)
-                .build())
-          } else {
-            None
-          }
+          castToProto(timeZoneId, dt, childExpr)

         case add @ Add(left, right, _) if supportedDataType(left.dataType) =>
           val leftExpr = exprToProtoInternal(left, inputs)
@@ -1494,7 +1500,10 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde {

         case a @ Coalesce(_) =>
           val exprChildren = a.children.map(exprToProtoInternal(_, inputs))
-          scalarExprToProto("coalesce", exprChildren: _*)
+          val childExpr = scalarExprToProto("coalesce", exprChildren: _*)
+          // TODO: Remove this once we have new DataFusion release which includes
+          // the fix: https://github.com/apache/arrow-datafusion/pull/9459
+          castToProto(None, a.dataType, childExpr)
Comment on lines +1504 to +1506 (Member, Author):

This is a temporary workaround until a DataFusion release includes the fix: apache/datafusion#9459
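
The workaround can be illustrated with a small, dependency-free model. This is only a sketch: the types `DataType`, `Expr`, `Column`, `ScalarFunc` and `Cast`, and the object `CoalesceWorkaround`, are hypothetical stand-ins invented for illustration, not Comet's protobuf builders or DataFusion's API. It shows the shape of the change, which is to build the `coalesce` call first and then wrap it in a cast to the data type Spark expects, defaulting the timezone to "UTC" as `castToProto` does.

```scala
// Toy model only: every type here is a hypothetical stand-in, not Comet's real API.
sealed trait DataType
case object DateType extends DataType
case object StringType extends DataType

sealed trait Expr
final case class Column(name: String) extends Expr
final case class ScalarFunc(name: String, args: Seq[Expr]) extends Expr
final case class Cast(child: Expr, to: DataType, timezone: String) extends Expr

object CoalesceWorkaround {
  // Build the function call only if every child expression could be serialized.
  def scalarFunc(name: String, args: Option[Expr]*): Option[Expr] =
    if (args.forall(_.isDefined)) Some(ScalarFunc(name, args.map(_.get))) else None

  // Wrap an already-serialized expression in a cast to the expected type.
  // A missing timezone falls back to "UTC", mirroring castToProto in the diff.
  def castTo(timeZoneId: Option[String], dt: DataType, child: Option[Expr]): Option[Expr] =
    child.map(c => Cast(c, dt, timeZoneId.getOrElse("UTC")))

  // coalesce(...) followed by an explicit cast to the type the planner declared.
  def coalesce(expected: DataType, children: Option[Expr]*): Option[Expr] =
    castTo(None, expected, scalarFunc("coalesce", children: _*))
}
```

In this model, `CoalesceWorkaround.coalesce(DateType, Some(Column("a")), Some(Column("b")))` yields a `Cast` node around the `coalesce` call, and any unserializable child (`None`) makes the whole result `None`, so the caller can fall back to Spark.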

         // With Spark 3.4, CharVarcharCodegenUtils.readSidePadding gets called to pad spaces for
         // char types. Use rpad to achieve the behavior.
13 changes: 13 additions & 0 deletions spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala
@@ -34,6 +34,19 @@ import org.apache.comet.CometSparkSessionExtensions.{isSpark32, isSpark33Plus, i
 class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
   import testImplicits._

+  test("coalesce should return correct datatype") {
+    Seq(true, false).foreach { dictionaryEnabled =>
+      withTempDir { dir =>
+        val path = new Path(dir.toURI.toString, "test.parquet")
+        makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 10000)
+        withParquetTable(path.toString, "tbl") {
+          checkSparkAnswerAndOperator(
+            "SELECT coalesce(cast(_18 as date), cast(_19 as date), _20) FROM tbl")
+        }
+      }
+    }
+  }
Comment on lines +37 to +48 (Member, Author):

Because of apache/datafusion#9458, the return type that DataFusion's coalesce function reports differs from the type of the array it actually produces:

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (192.168.86.44 executor driver): org.apache.comet.CometNativeException: Arrow error: Invalid argument error: column types must match schema types, expected Utf8 but found Date32 at column index 0
        at org.apache.comet.Native.executePlan(Native Method)                                                      
        at org.apache.comet.CometExecIterator.executeNative(CometExecIterator.scala:65)
        at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:111)
        at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:126)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
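
The kind of check that produces this error can be sketched as follows. This is a toy model under stated assumptions: `ArrowType`, `Field`, `ColumnArray` and `BatchCheck` are hypothetical names invented here, and the logic is not Arrow's actual implementation. It only shows why a declared type of Utf8 paired with an actual Date32 array is rejected when a batch is assembled, and why making the two agree (as the cast workaround does) lets the batch through.

```scala
// Toy model only: hypothetical types, not the Arrow or DataFusion API.
sealed trait ArrowType
case object Utf8 extends ArrowType
case object Date32 extends ArrowType

final case class Field(name: String, dataType: ArrowType)
final case class ColumnArray(dataType: ArrowType, values: Vector[Any])

object BatchCheck {
  // Compare each declared field type with the type of the array actually produced.
  // Returns Left with a message for the first mismatch, or Right(()) if all match.
  def validate(schema: Seq[Field], columns: Seq[ColumnArray]): Either[String, Unit] = {
    val mismatch = schema.zip(columns).zipWithIndex.collectFirst {
      case ((field, col), idx) if field.dataType != col.dataType =>
        "column types must match schema types, " +
          s"expected ${field.dataType} but found ${col.dataType} at column index $idx"
    }
    mismatch.toLeft(())
  }
}
```

Here `BatchCheck.validate(Seq(Field("c0", Utf8)), Seq(ColumnArray(Date32, Vector(1))))` returns a `Left` carrying a message of the same form as the trace above, while a schema that declares `Date32` for the same column validates successfully.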


   test("bitwise shift with different left/right types") {
     Seq(false, true).foreach { dictionary =>
       withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {