test: Reduce end-to-end test time #109
Conversation
withSQLConf(
  CometConf.COMET_EXEC_ENABLED.key -> "true",
  CometConf.COMET_COLUMNAR_SHUFFLE_ENABLED.key -> "true",
  CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") {
Hmm, these configs are used to produce a specific situation or corner case. If they are changed, the test won't fail, but it may end up testing nothing.
I don't think anything changed. I just moved the common configs to the beginning.
If you only moved the common configs, it should be fine. There are many tests, so we should be careful that none is missed or has its configs changed.
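As a rough model of why hoisting common configs is safe: suite-level defaults and per-test overrides compose like map merges, with the per-test value winning on conflict. A minimal sketch with plain Scala maps; the key names are made up for illustration and are not Comet's actual config keys:

```scala
// Hypothetical model of suite-level defaults vs. per-test overrides.
// Key names are illustrative only, not real Comet configs.
object ConfLayering {
  val suiteDefaults: Map[String, String] = Map(
    "comet.exec.enabled" -> "true",
    "comet.exec.shuffle.enabled" -> "true")

  // Per-test overrides take precedence, like withSQLConf inside a test body.
  def effective(overrides: Map[String, String]): Map[String, String] =
    suiteDefaults ++ overrides
}
```

A test that overrides `comet.exec.enabled` to `"false"` still sees the untouched suite default for the shuffle key, which is why moving the common pairs up does not change what each test runs with.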
protected val numElementsForceSpillThreshold: Int = 10

override protected def sparkConf: SparkConf = {
  val conf = super.sparkConf
  conf
    .set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, adaptiveExecutionEnabled.toString)
    .set(CometConf.COMET_EXEC_ENABLED.key, "false")
Moved the common configs for columnar shuffle to `sparkConf`.
  }
}

test("Comet shuffle: different data type") {
We have this test, which covers both columnar and native shuffle, and "Comet columnar shuffle shuffle: different data type" below, which only covers columnar shuffle. This PR moves the columnar shuffle out of this test and into CometNativeShuffleSuite, to reduce the test time.
I think these tests focus on different aspects by design. For example, the "different data type" tests specifically exercise shuffling on different data types. There may be some overlap when we look at them now; at the beginning, we wanted as much test coverage as possible to reduce the chance of missing corner cases.
> We have this which tests both columnar and native shuffle

I looked at this test now. I think it is not meant to test both columnar and native shuffle, but only native shuffle. You can see it never enables columnar shuffle, and the feature is disabled by default.

When `CometExec` is disabled in this test, I think it checks whether we can still run Spark shuffle without issue, i.e., whether we fall back to the original Spark shuffle correctly.
Oops, you are right. I got confused when looking at `execEnabled`. It's interesting that for this test case, the check `checkCometExchange(shuffled, 1, true)` passes regardless of whether `execEnabled` is true or false. It looks like native shuffle can be enabled even when native execution is not.
Because the shuffle is directly on top of the scan:

+- BosonExchange hashpartitioning(_4#82021, 10), REPARTITION_BY_NUM, BosonNativeShuffle, [plan_id=342026]
   +- BosonScan parquet [_1#82018,_4#82021] Batched: true, DataFilters: [], Format: BosonParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/liangchi/repos/boson/spark/target/tmp/spark-f2ec6474-6664-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:boolean,_4:int>

So even when Comet exec is disabled, the shuffle still works.
Oh I see. I'll update it and keep `execEnabled`.
@@ -90,13 +90,13 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
    Range(0, len).map(_ => chars.charAt(r.nextInt(chars.length))).mkString
  }

  private def fuzzCastFromString(chars: String, maxLen: Int, toType: DataType) {
This change was added by accident (from `make format`). I'll remove it later.
I did a quick pass; I think the test settings are unchanged before and after this PR. Let me do another detailed code review and comment back.
The time reduction looks fantastic, thanks for your effort.
spark/src/test/scala/org/apache/comet/exec/CometColumnarShuffleSuite.scala
The TPC-DS run is now the bottleneck; I think we can reduce it in a follow-up PR.
Left some minor comments
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala
}
checkSparkAnswer(s"SELECT _g$gCol, SUM(_1), SUM(_2) FROM tbl GROUP BY _g$gCol")
Similar for `sum`, `count`, `min`, `avg` and `max`. `Count(distinct xx)` and `sum(distinct xx)` are different, though; they might have to be iterated over the 4 columns.
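A sketch of what that split could look like: plain aggregates combined into one query per grouping column, while the DISTINCT variants are generated per column. The object and query shape below are illustrative, not the suite's actual code:

```scala
// Hypothetical query generation: one combined query for plain aggregates,
// one query per column for the DISTINCT variants (different code path).
object AggQueries {
  def combined(gCol: Int): String =
    s"SELECT _g$gCol, SUM(_1), MIN(_2), MAX(_3), COUNT(_4) FROM tbl GROUP BY _g$gCol"

  def distinctPerColumn(gCol: Int): Seq[String] =
    (1 to 4).map(c => s"SELECT _g$gCol, SUM(DISTINCT _$c) FROM tbl GROUP BY _g$gCol")
}
```

Per grouping column this yields one combined query plus four DISTINCT queries, instead of one query per (column, aggregate) pair.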
- (1 to 4).foreach { col =>
-   (1 to 14).foreach { gCol =>
+ (1 to 14).foreach { gCol =>
+   (1 to 4).foreach { col =>
Another unrelated question: why `1 to 4`? It seems `_1` to `_4` are all integer types. We probably want to test other types like float/double, decimal, etc.? But this should be addressed in another PR.
Right, this only covers integer types. Other types like float/double and decimal are covered by other tests in the same suite. We could add them here, but that might cause an explosion in the total test time.
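For a sense of scale, the nested loops multiply out quickly; the arithmetic below just uses the loop bounds from the snippet above plus a hypothetical wider column range:

```scala
// Query count is the product of the loop bounds:
// grouping columns x value columns.
object LoopCost {
  def cases(gCols: Int, cols: Int): Int = gCols * cols
}
```

With 14 grouping columns and 4 value columns that is 56 runs per test body; doubling the value columns to cover, say, float/double/decimal would double it to 112, which is the explosion being avoided.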
  }
}

test("fix: comet native shuffle with binary data") {
Seems like this test case has already been covered by `test("native shuffle: different data type")`? Let me do a refactor in a follow-up PR?
Sure. It seems we don't have the binary type in the Parquet table, only FixedLengthByteArray. We can add it.
Thanks. Updated the PR.
Yes, it's easy to break it up.
LGTM, pending CI passes.
}
checkSparkAnswer(s"SELECT _g$gCol, SUM(_1), SUM(_2), COUNT(_3), COUNT(_4), " +
Are columns `_1`, `_2`, ... different from each other? If so, it seems this removes `SUM(_3)` and `SUM(_4)` and also misses some `SUM(DISTINCT xx)`? Previously it tested all columns with all aggregate expressions.
Yes, but the code path is pretty much the same, so I thought it's fine to lose some precision here.
      }
    }
  }

class CometAsyncShuffleSuite extends CometShuffleSuiteBase {
Yeah, async is only meaningful for columnar shuffle. We don't need to run the native shuffle test cases with async.
CometConf.COMET_BATCH_SIZE.key -> "1",
CometConf.COMET_COLUMNAR_SHUFFLE_ENABLED.key -> "false",
CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") {
withParquetTable((0 until 1000000).map(i => (1, (i + 1).toLong)), "tbl") {
I remember I set a big number because the bug only happens with many rows. We can probably reduce the number; I'm just not sure what a proper value is.
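One hedged idea, given the suite already sets `numElementsForceSpillThreshold = 10`: with a low force-spill threshold, a much smaller input should exercise the spill path repeatedly. The toy model below is not Comet's shuffle writer, just an illustration of that proportionality:

```scala
// Toy model: a buffer that "spills" every time it reaches the threshold.
// Not Comet's actual spill logic; only illustrates row-count scaling.
object SpillModel {
  def spillsFor(rows: Int, threshold: Int): Int = {
    var buffered = 0
    var spills = 0
    (1 to rows).foreach { _ =>
      buffered += 1
      if (buffered >= threshold) { spills += 1; buffered = 0 }
    }
    spills
  }
}
```

Under this model, 1,000 rows with threshold 10 already trigger 100 spills, so a row count well below 1,000,000 may still reproduce spill-related behavior; whether it reproduces this particular bug would need to be verified.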
  }
}

test("columnar shuffle: single partition") {
Do we need to change the test name? This is the native shuffle suite.
Yeah, let me update this.
Looks good. My only concern is the "different data types" test moved to the native shuffle suite. Maybe we can restore the change there.
Merged, thanks!
Which issue does this PR close?
Closes #.
Rationale for this change
Currently the Java tests take over 2 hours to finish, which is very long. This increases the end-to-end time for pull requests to be processed and decreases developer efficiency.
What changes are included in this PR?
This PR makes a few changes to the shuffle & aggregation tests:
- Split the shuffle tests into CometColumnarShuffleSuite and CometNativeShuffleSuite, for columnar and native shuffle respectively. This is more of a refactoring than a change to the tests themselves.
- Removed the NonFastMerge related tests. In Spark, the fast merge feature has been on by default since 1.4, so it is not very useful to run all tests with the feature turned off. If needed, we can add a dedicated test for it.
- Reduced iterations in CometAggregateSuite.

The total time is now reduced to ~40min.
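The suite split described above can be sketched roughly as follows; the class and method names are simplified stand-ins for the real suites, not the actual code:

```scala
// Hypothetical shape of the refactoring: shared test logic in a base class,
// with subclasses differing only in which shuffle mode they enable.
abstract class ShuffleSuiteSketch {
  def shuffleMode: String // e.g. "native" or "columnar"
  def differentDataTypeTest(): String =
    s"ran 'different data type' with $shuffleMode shuffle"
}
class NativeShuffleSketch extends ShuffleSuiteSketch { val shuffleMode = "native" }
class ColumnarShuffleSketch extends ShuffleSuiteSketch { val shuffleMode = "columnar" }
```

Each shared test then runs once per suite instead of looping over both modes inside one test body, which keeps coverage while letting the two suites run (and fail) independently.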
How are these changes tested?