
chore: Update README to highlight Comet benefits #497

Merged: 6 commits, May 31, 2024
96 changes: 61 additions & 35 deletions README.md
@@ -19,58 +19,84 @@ under the License.

# Apache DataFusion Comet

Apache DataFusion Comet is an Apache Spark plugin that uses [Apache DataFusion](https://datafusion.apache.org/)
as native runtime to achieve improvement in terms of query efficiency and query runtime.
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging existing commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

Comet runs Spark SQL queries using the native DataFusion runtime, which is
typically faster and more resource efficient than JVM based runtimes.
# Benefits of Using Comet

<a href="docs/source/_static/images/comet-overview.png"><img src="docs/source/_static/images/comet-system-diagram.png" align="center" width="500" ></a>
## Run Spark Queries at DataFusion Speeds

Comet aims to support:
Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights.

- a native Parquet implementation, including both reader and writer
- full implementation of Spark operators, including
Filter/Project/Aggregation/Join/Exchange etc.
- full implementation of Spark built-in expressions
- a UDF framework for users to migrate their existing UDF to native
The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format
using a single executor with 8 cores.

## Architecture
When using Comet, the overall run time is reduced from 649 seconds to 440 seconds, which is 1.5x faster.

The following diagram illustrates the architecture of Comet:
When running TPC-H queries with DataFusion standalone (without Spark), the overall run time is 3.9x faster.
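The quoted factors can be cross-checked with a quick calculation (a sketch; the two totals are the run times reported above, while the standalone DataFusion run time is implied by the 3.9x figure rather than stated directly):

```python
# Cross-check the quoted speedup factors from the reported total run times.
spark_seconds = 649.0  # all 22 TPC-H queries, Spark without Comet
comet_seconds = 440.0  # same queries with Comet enabled

comet_speedup = spark_seconds / comet_seconds
print(f"Comet: {comet_speedup:.2f}x faster")  # ~1.48x, quoted as 1.5x

# The 3.9x standalone DataFusion figure implies a total run time of roughly:
datafusion_seconds = spark_seconds / 3.9
print(f"DataFusion (implied): {datafusion_seconds:.0f} s")  # ~166 s
```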

<a href="docs/source/_static/images/comet-overview.png"><img src="docs/source/_static/images/comet-overview.png" align="center" height="600" width="750" ></a>
Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup
for many use cases.

> **Review comment (Contributor):** if pure DF gives 3.9x, is that possible Comet built on top of DF to give 4x?
>
> **Reply (Member, Author):** I was rounding up to the nearest whole percent here, but of course it will be challenging. DataFusion isn't performing shuffle operations, but I think that DataFusion performance sets a hard limit on what we can do with Comet, so 2-4x seems to be our expected range (unless there are future optimizations in DataFusion).

## Current Status
![](docs/source/_static/images/tpch_allqueries.png)

The project is currently integrated into Apache Spark 3.2, 3.3, and 3.4.
Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query.

## Feature Parity with Apache Spark
![](docs/source/_static/images/tpch_queries_compare.png)

The project strives to keep feature parity with Apache Spark, that is,
users should expect the same behavior (w.r.t features, configurations,
query results, etc) with Comet turned on or turned off in their Spark
jobs. In addition, Comet extension should automatically detect unsupported
features and fallback to Spark engine.
The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization
is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future.

To achieve this, besides unit tests within Comet itself, we also re-use
Spark SQL tests and make sure they all pass with Comet extension
enabled.
![](docs/source/_static/images/tpch_queries_speedup.png)

## Supported Platforms
These benchmarks can be reproduced in any environment using the documentation in the
[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
you to run your own benchmarks.
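The per-query comparison behind the charts above reduces to a ratio of run times between a baseline run and a Comet run; a minimal sketch (the query labels and timings below are hypothetical placeholders, not the published numbers):

```python
# Sketch: compare per-query timings from a Spark baseline run and a Comet run.
# All timings here are hypothetical placeholders, not the published results.
spark_times = {"q1": 30.0, "q2": 12.0, "q3": 25.0}  # seconds per TPC-H query
comet_times = {"q1": 18.0, "q2": 11.0, "q3": 14.0}

for query in spark_times:
    ratio = spark_times[query] / comet_times[query]
    print(f"{query}: {ratio:.2f}x speedup")
```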

Linux, Apple OSX (Intel and M1)
## Use Existing Hardware

## Requirements
Comet leverages your existing hardware infrastructure, eliminating the need for costly hardware upgrades or
specialized hardware accelerators. By maximizing the utilization of commodity hardware, Comet ensures
cost-effectiveness and scalability for your Spark deployments.

- Apache Spark 3.2, 3.3, or 3.4
- JDK 8, 11 and 17 (JDK 11 recommended because Spark 3.2 doesn't support 17)
- GLIBC 2.17 (Centos 7) and up
## Spark Compatibility

## Getting started
Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into
your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness
the benefits of Comet's acceleration capabilities without disrupting your Spark applications.
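Because no code changes are required, enabling Comet is purely a submit-time configuration step. A minimal sketch, using only settings that appear in the benchmarking commands later in this PR (the jar path and application name are placeholders):

```shell
# Minimal sketch: enable Comet for an existing Spark job, no application code changes.
# COMET_JAR is a placeholder path to the Comet jar built for your Spark version.
COMET_JAR=/path/to/comet-spark-shim.jar

$SPARK_HOME/bin/spark-submit \
  --jars $COMET_JAR \
  --conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your_existing_app.py
```

Setting `spark.comet.enabled=false` restores stock Spark behavior without touching the application.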

See the [DataFusion Comet User Guide](https://datafusion.apache.org/comet/user-guide/installation.html) for installation instructions.
## Tight Integration with Apache DataFusion

Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With
seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your
Spark workloads.

## Active Community

Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the
capabilities of Apache DataFusion and accelerating the performance of Apache Spark.

## Getting Started

To get started with Apache DataFusion Comet, follow the
[installation instructions](https://datafusion.apache.org/comet/user-guide/installation.html). Join the
[DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect
with other users, ask questions, and share your experiences with Comet.

## Contributing
See the [DataFusion Comet Contribution Guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html)
for information on how to get started contributing to the project.

We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing
bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in
shaping the future of Comet. Check out our
[contributor guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html) to get started.

## License

Apache DataFusion Comet is licensed under the Apache License 2.0. See the [LICENSE.txt](LICENSE.txt) file for details.

## Acknowledgments

We would like to express our gratitude to the Apache DataFusion community for their support and contributions to
Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark.
Binary file added docs/source/_static/images/tpch_allqueries.png
Binary file added docs/source/_static/images/tpch_queries_compare.png
Binary file added docs/source/_static/images/tpch_queries_speedup.png
85 changes: 51 additions & 34 deletions docs/source/contributor-guide/benchmarking.md
@@ -19,44 +19,61 @@ under the License.

# Comet Benchmarking Guide

To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Benchmarking scripts are
available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.
To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Documentation and scripts
for data generation and benchmarking are available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.

Here is an example command for running the benchmarks. This command will need to be adapted based on the Spark
environment and location of data files.
Here are example commands for running the benchmarks against a Spark cluster. These commands will need to be
adapted based on your Spark environment and the location of the data files.

This command assumes that `datafusion-benchmarks` is checked out in a parallel directory to `datafusion-comet`.

## Running Benchmarks Against Apache Spark

```shell
$SPARK_HOME/bin/spark-submit \
--master "local[*]" \
--conf spark.driver.memory=8G \
--conf spark.executor.memory=64G \
--conf spark.executor.cores=16 \
--conf spark.cores.max=16 \
--conf spark.eventLog.enabled=true \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.comet.parquet.io.enabled=false \
--conf spark.comet.batchSize=8192 \
--conf spark.comet.columnar.shuffle.enabled=false \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.sql.adaptive.coalescePartitions.enabled=false \
--conf spark.comet.shuffle.enforceMode.enabled=true \
../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100-parquet/ \
--queries ../datafusion-benchmarks/tpch/queries
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.memory=64G \
--conf spark.executor.cores=16 \
--conf spark.cores.max=16 \
--conf spark.eventLog.enabled=true \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
tpcbench.py \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--iterations 5
```

Comet performance can be compared to regular Spark performance by running the benchmark twice, once with
`spark.comet.enabled` set to `true` and once with it set to `false`.
## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled

```shell
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.memory=64G \
--conf spark.executor.cores=16 \
--conf spark.cores.max=16 \
--conf spark.eventLog.enabled=true \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.comet.parquet.io.enabled=false \
--conf spark.comet.batchSize=8192 \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.sql.adaptive.coalescePartitions.enabled=false \
tpcbench.py \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--iterations 5
```