build: Add spark-4.0 profile and shims #407
Conversation
@viirya @andygrove Please approve to start CI.
Triggered.
@viirya @andygrove Is there a way to start CI without bothering you? I remember only first-time contributors need approval to trigger CI.
Force-pushed from ae59e5b to d629df1.
@viirya @andygrove This passed all the tests on my personal GitHub Actions.
Triggered CI pipelines.
common/src/main/spark-4.0/org/apache/comet/shims/ShimFileFormat.scala (outdated; resolved)
common/src/main/spark-4.0/org/apache/spark/sql/comet/shims/ShimCometParquetUtils.scala (outdated; resolved)
spark/src/main/scala/org/apache/spark/sql/comet/DecimalPrecision.scala (outdated; resolved)
...k/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala (outdated; resolved)
override def write(
    rdd: RDD[_],
    inputs: Iterator[_],
If the input iterator includes the partitionId, we still need to get rid of it. I haven't run this, but it looks like it will cause failures.
We have ShimCometShuffleWriteProcessor.write() for Spark 3.x.
Yeah, I mean for the Spark 4.0 case we also need to get rid of the partitionId. I saw you replied to my other comment: #407 (comment)
I implemented the partitionId removal.
Thanks @viirya
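For reference, here is a minimal sketch of the shim idea discussed in this thread. It assumes Spark 4.0's ShuffleWriteProcessor.write receives an iterator of (partitionId, record) pairs instead of (rdd, partition); `doCometWrite` is a hypothetical stand-in for Comet's actual native-shuffle write path, not the code merged in this PR.

```scala
package org.apache.spark.sql.comet.shims

import org.apache.spark.{ShuffleDependency, TaskContext}
import org.apache.spark.scheduler.MapStatus
import org.apache.spark.shuffle.ShuffleWriteProcessor

trait ShimCometShuffleWriteProcessor extends ShuffleWriteProcessor {
  // Spark 4.0 hands the partition's records to write() directly.
  override def write(
      inputs: Iterator[_],
      dep: ShuffleDependency[_, _, _],
      mapId: Long,
      mapIndex: Int,
      context: TaskContext): MapStatus = {
    // Assumption: each element is (partitionId, record); strip the leading
    // partitionId so the downstream writer sees only the records.
    val records = inputs.asInstanceOf[Iterator[Product2[Int, Any]]].map(_._2)
    doCometWrite(records, dep, mapId, mapIndex, context)
  }

  // Hypothetical helper representing the version-independent write path.
  protected def doCometWrite(
      records: Iterator[Any],
      dep: ShuffleDependency[_, _, _],
      mapId: Long,
      mapIndex: Int,
      context: TaskContext): MapStatus
}
```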
...k/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala (outdated; resolved)
@viirya Please take another look.
.github/workflows/pr_build.yml (outdated)
matrix:
  os: [ubuntu-latest]
  java_version: [17]
  test-target: [rust, java]
Is it necessary to run the Rust tests for Spark 4.0 separately? There should be no difference from 3.4 or 3.3/3.2. We also don't run the Rust tests for Spark 3.3/3.2, only for Spark 3.4.
Removed.
.github/workflows/pr_build.yml (outdated)
- if: matrix.test-target == 'java'
  name: Install Spark
  shell: bash
  working-directory: ./apache-spark
  run: build/mvn install -Phive -Phadoop-cloud -DskipTests
Is this only needed for Spark 4.0? I don't see us installing it for other Spark versions.
Yes, this is only needed for Spark 4.0 because no 4.0.0-SNAPSHOT jar is publicly available.
/**
 * Returns a tuple of expressions for the `unhex` function.
 */
Updated.
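For context, a sketch of the kind of version-specific shim this doc comment sits on, assuming the Spark 4.0 `Unhex` expression carries a `failOnError` flag as in Spark 3.4+. The trait and method names follow the comment above, but the body is illustrative rather than necessarily the merged code.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, Unhex}

trait CometExprShim {
  /**
   * Returns a tuple of expressions for the `unhex` function.
   */
  protected def unhexSerde(unhex: Unhex): (Expression, Expression) = {
    // Assumption: Unhex exposes failOnError on Spark 3.4+/4.0; older versions
    // would pin this to Literal(false) in their own shim.
    (unhex.child, Literal(unhex.failOnError))
  }
}
```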
@@ -29,7 +29,7 @@ import org.apache.comet.CometConf
/**
 * This test suite contains tests for only Spark 3.4.
nit:
- * This test suite contains tests for only Spark 3.4.
+ * This test suite contains tests for only Spark 3.4+.
Updated.
Looks good overall. I have some minor comments.
@kazuyukitanimura I think you can try to merge this.
Thank you @viirya, merged.
This PR adds the spark-4.0 profile and shims. This is an initial commit; tests with the spark-4.0 profile do not pass yet. Tests for spark-3.x should pass. (cherry picked from commit 9b3e87b)
Which issue does this PR close?
Part of #372
Rationale for this change
To be ready for Spark 4.0
What changes are included in this PR?
This PR adds the spark-4.0 profile and shims
How are these changes tested?
This is an initial commit. Tests with the spark-4.0 profile do not pass yet. Tests for spark-3.x should pass.
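As a rough illustration of the shim pattern this PR extends (the file contents below are hypothetical; only the directory layout and profile name come from the PR): each Spark version gets its own source root containing an identically named shim, and the active Maven profile decides which root is compiled, so the rest of the codebase references one stable name.

```scala
// common/src/main/spark-3.4/org/apache/comet/shims/ShimFileFormat.scala would
// define the same object with the 3.4-specific value.

// common/src/main/spark-4.0/org/apache/comet/shims/ShimFileFormat.scala
package org.apache.comet.shims

object ShimFileFormat {
  // Hypothetical constant: callers compile against the same name regardless
  // of which version-specific source root the active Maven profile selects.
  val ROW_INDEX_TEMPORARY_COLUMN_NAME: String = "row_index"
}
```

Building with the spark-4.0 profile added here (e.g. `./mvnw install -Pspark-4.0`) would then compile the spark-4.0 source root in place of the 3.x one.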