add benchmarking guide

apache · May 17, 2024 · 3902258 · 3902258
1 parent f8fec7f
commit 3902258
Showing 2 changed files with 44 additions and 0 deletions.
diff --git a/docs/source/contributor-guide/benchmarking.md b/docs/source/contributor-guide/benchmarking.md
@@ -0,0 +1,43 @@
+# Comet Benchmarking Guide
+
+To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Benchmarking scripts are
+available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.
+
+Here is an example command for running the benchmarks. Adapt it to your Spark environment and the location of your
+data files. It is run from the `datafusion-comet` directory and expects `COMET_JAR` to point to the Comet JAR file.
+
+This command assumes that `datafusion-benchmarks` is checked out in a sibling directory of `datafusion-comet`.
+
+```shell
+$SPARK_HOME/bin/spark-submit \
+    --master "local[*]" \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.memory=64G \
+    --conf spark.executor.cores=16 \
+    --conf spark.cores.max=16 \
+    --conf spark.eventLog.enabled=true \
+    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
+    --jars $COMET_JAR \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+    --conf spark.comet.enabled=true \
+    --conf spark.comet.exec.enabled=true \
+    --conf spark.comet.exec.all.enabled=true \
+    --conf spark.comet.cast.allowIncompatible=true \
+    --conf spark.comet.explainFallback.enabled=true \
+    --conf spark.comet.parquet.io.enabled=false \
+    --conf spark.comet.batchSize=8192 \
+    --conf spark.comet.columnar.shuffle.enabled=false \
+    --conf spark.comet.exec.shuffle.enabled=true \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    --conf spark.sql.adaptive.coalescePartitions.enabled=false \
+    --conf spark.comet.shuffle.enforceMode.enabled=true \
+    ../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
+    --benchmark tpch \
+    --data /mnt/bigdata/tpch/sf100-parquet/ \
+    --queries ../datafusion-benchmarks/tpch/queries
+```
+
+Comet performance can be compared to regular Spark performance by running the benchmark twice, once with 
+`spark.comet.enabled` set to `true` and once with it set to `false`. 
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -57,6 +57,7 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
    Comet Plugin Overview <contributor-guide/plugin_overview>
    Development Guide <contributor-guide/development>
    Debugging Guide <contributor-guide/debugging>
+   Benchmarking Guide <contributor-guide/benchmarking>
    Profiling Native Code <contributor-guide/profiling_native_code>
    Github and Issue Tracker <https://github.com/apache/datafusion-comet>