diff --git a/README.md b/README.md index fb17535aa..b0b72fdb6 100644 --- a/README.md +++ b/README.md @@ -19,58 +19,86 @@ under the License. # Apache DataFusion Comet -Apache DataFusion Comet is an Apache Spark plugin that uses [Apache DataFusion](https://datafusion.apache.org/) -as native runtime to achieve improvement in terms of query efficiency and query runtime. +Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful +[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the +performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the +Spark ecosystem without requiring any code changes. -Comet runs Spark SQL queries using the native DataFusion runtime, which is -typically faster and more resource efficient than JVM based runtimes. +# Benefits of Using Comet - +## Run Spark Queries at DataFusion Speeds -Comet aims to support: +Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights. -- a native Parquet implementation, including both reader and writer -- full implementation of Spark operators, including - Filter/Project/Aggregation/Join/Exchange etc. -- full implementation of Spark built-in expressions -- a UDF framework for users to migrate their existing UDF to native +The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format +using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html) +for details of the environment used for these benchmarks. -## Architecture +When using Comet, the overall run time is reduced from 649 seconds to 440 seconds, a 1.5x speedup. 
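Since the README stresses that Comet is enabled purely through Spark configuration, a minimal sketch of turning it on for an existing job may be useful context when reviewing this change. The configuration keys below are taken from the benchmark configs in this patch; `$COMET_JAR` and `your_job.py` are placeholders:

```shell
# Sketch only: enable Comet on an existing spark-submit invocation.
# $COMET_JAR is a placeholder for the Comet jar built for your Spark version.
$SPARK_HOME/bin/spark-submit \
  --jars $COMET_JAR \
  --conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your_job.py
# Setting spark.comet.enabled=false reverts to regular Spark,
# with no changes to the application code.
```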
-The following diagram illustrates the architecture of Comet: +Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.9x +speedup compared to Spark. - +Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup +for many use cases. -## Current Status +![](docs/source/_static/images/tpch_allqueries.png) -The project is currently integrated into Apache Spark 3.2, 3.3, and 3.4. +Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query. -## Feature Parity with Apache Spark +![](docs/source/_static/images/tpch_queries_compare.png) -The project strives to keep feature parity with Apache Spark, that is, -users should expect the same behavior (w.r.t features, configurations, -query results, etc) with Comet turned on or turned off in their Spark -jobs. In addition, Comet extension should automatically detect unsupported -features and fallback to Spark engine. +The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization +is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future. -To achieve this, besides unit tests within Comet itself, we also re-use -Spark SQL tests and make sure they all pass with Comet extension -enabled. +![](docs/source/_static/images/tpch_queries_speedup.png) -## Supported Platforms +These benchmarks can be reproduced in any environment using the documentation in the +[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage +you to run your own benchmarks. -Linux, Apple OSX (Intel and M1) +## Use Commodity Hardware -## Requirements +Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or +specialized hardware accelerators, such as GPUs or FPGAs. 
By maximizing the utilization of commodity hardware, Comet +ensures cost-effectiveness and scalability for your Spark deployments. -- Apache Spark 3.2, 3.3, or 3.4 -- JDK 8, 11 and 17 (JDK 11 recommended because Spark 3.2 doesn't support 17) -- GLIBC 2.17 (Centos 7) and up +## Spark Compatibility -## Getting started +Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into +your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness +the benefits of Comet's acceleration capabilities without disrupting your Spark applications. -See the [DataFusion Comet User Guide](https://datafusion.apache.org/comet/user-guide/installation.html) for installation instructions. +## Tight Integration with Apache DataFusion + +Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With +seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your +Spark workloads. + +## Active Community + +Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the +capabilities of Apache DataFusion and accelerating the performance of Apache Spark. + +## Getting Started + +To get started with Apache DataFusion Comet, follow the +[installation instructions](https://datafusion.apache.org/comet/user-guide/installation.html). Join the +[DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect +with other users, ask questions, and share your experiences with Comet. ## Contributing -See the [DataFusion Comet Contribution Guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html) -for information on how to get started contributing to the project. + +We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. 
Whether it's fixing +bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in +shaping the future of Comet. Check out our +[contributor guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html) to get started. + +## License + +Apache DataFusion Comet is licensed under the Apache License 2.0. See the [LICENSE.txt](LICENSE.txt) file for details. + +## Acknowledgments + +We would like to express our gratitude to the Apache DataFusion community for their support and contributions to +Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark. diff --git a/docs/source/_static/images/tpch_allqueries.png b/docs/source/_static/images/tpch_allqueries.png new file mode 100644 index 000000000..a6788d5a4 Binary files /dev/null and b/docs/source/_static/images/tpch_allqueries.png differ diff --git a/docs/source/_static/images/tpch_queries_compare.png b/docs/source/_static/images/tpch_queries_compare.png new file mode 100644 index 000000000..927680612 Binary files /dev/null and b/docs/source/_static/images/tpch_queries_compare.png differ diff --git a/docs/source/_static/images/tpch_queries_speedup.png b/docs/source/_static/images/tpch_queries_speedup.png new file mode 100644 index 000000000..fb417ff1d Binary files /dev/null and b/docs/source/_static/images/tpch_queries_speedup.png differ diff --git a/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json new file mode 100644 index 000000000..38142151d --- /dev/null +++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json @@ -0,0 +1,201 @@ +{ + "engine": "datafusion-comet", + "benchmark": "tpch", + "data_path": "/mnt/bigdata/tpch/sf100/", + "query_path": "../../tpch/queries", + "spark_conf": { + "spark.comet.explainFallback.enabled": "true", + "spark.jars": 
"file:///home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar", + "spark.comet.cast.allowIncompatible": "true", + "spark.executor.extraClassPath": "/home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar", + "spark.executor.memory": "8G", + "spark.comet.exec.shuffle.enabled": "true", + "spark.app.name": "DataFusion Comet Benchmark derived from TPC-H / TPC-DS", + "spark.driver.port": "36573", + "spark.sql.adaptive.coalescePartitions.enabled": "false", + "spark.app.startTime": "1716923498046", + "spark.comet.batchSize": "8192", + "spark.app.id": "app-20240528131138-0043", + "spark.serializer.objectStreamReset": "100", + "spark.app.initial.jar.urls": "spark://woody.lan:36573/jars/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar", + "spark.submit.deployMode": "client", + "spark.sql.autoBroadcastJoinThreshold": "-1", + "spark.comet.exec.all.enabled": "true", + "spark.eventLog.enabled": "false", + "spark.driver.host": "woody.lan", + "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false", + "spark.sql.warehouse.dir": 
"file:/home/andy/git/apache/datafusion-benchmarks/runners/datafusion-comet/spark-warehouse", + "spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager", + "spark.comet.exec.enabled": "true", + "spark.repl.local.jars": "file:///home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar", + "spark.executor.id": "driver", + "spark.master": "spark://woody:7077", + "spark.executor.instances": "8", + "spark.comet.exec.shuffle.mode": "auto", + "spark.sql.extensions": "org.apache.comet.CometSparkSessionExtensions", + "spark.driver.memory": "8G", + "spark.driver.extraClassPath": "/home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar", + "spark.rdd.compress": "True", + "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false", + "spark.cores.max": "8", + "spark.comet.enabled": "true", + "spark.app.submitTime": "1716923497738", + "spark.submit.pyFiles": "", + "spark.executor.cores": "1", + "spark.comet.parquet.io.enabled": "false" + }, + "1": [ + 32.121661901474, + 27.997092485427856, + 27.756758451461792, + 28.55236315727234, + 
28.332542181015015 + ], + "2": [ + 18.269107580184937, + 16.200955629348755, + 16.194639682769775, + 16.745808839797974, + 16.59864115715027 + ], + "3": [ + 17.265466690063477, + 17.069786310195923, + 17.12887978553772, + 19.33678102493286, + 18.182055234909058 + ], + "4": [ + 8.367004156112671, + 8.172023296356201, + 8.023266077041626, + 8.350765228271484, + 8.258736610412598 + ], + "5": [ + 34.10048794746399, + 32.69314408302307, + 33.21383595466614, + 36.391114473342896, + 39.00048065185547 + ], + "6": [ + 3.1693499088287354, + 3.044705390930176, + 3.047694206237793, + 3.2817511558532715, + 3.274174928665161 + ], + "7": [ + 25.369214296340942, + 24.020941257476807, + 24.0787034034729, + 28.47402787208557, + 28.23443365097046 + ], + "8": [ + 40.06126809120178, + 39.828824281692505, + 45.250510454177856, + 44.406742572784424, + 48.98451232910156 + ], + "9": [ + 62.822797775268555, + 61.26328158378601, + 64.95581865310669, + 69.51708793640137, + 73.52380013465881 + ], + "10": [ + 20.55334782600403, + 20.546096324920654, + 20.57452392578125, + 22.84211039543152, + 23.724371671676636 + ], + "11": [ + 11.068235158920288, + 10.715423822402954, + 11.353424310684204, + 11.37632942199707, + 11.530814170837402 + ], + "12": [ + 10.264788389205933, + 8.67864990234375, + 8.845952033996582, + 8.593009233474731, + 8.540803909301758 + ], + "13": [ + 9.603406190872192, + 9.648627042770386, + 13.040799140930176, + 10.154011249542236, + 9.716034412384033 + ], + "14": [ + 6.20926308631897, + 6.0385496616363525, + 7.674488544464111, + 10.53052043914795, + 7.661675691604614 + ], + "15": [ + 11.466301918029785, + 11.473632097244263, + 11.279382228851318, + 13.291078329086304, + 12.81026816368103 + ], + "16": [ + 8.096073865890503, + 7.73410701751709, + 7.742897272109985, + 8.477537631988525, + 7.821273326873779 + ], + "17": [ + 43.69264578819275, + 43.33040428161621, + 46.291987657547, + 54.654345989227295, + 54.37124800682068 + ], + "18": [ + 27.205485105514526, + 26.785916090011597, 
+ 27.331408262252808, + 29.946768760681152, + 28.037617444992065 + ], + "19": [ + 8.100102186203003, + 7.845783472061157, + 8.52329158782959, + 8.907397985458374, + 9.13755488395691 + ], + "20": [ + 13.09695029258728, + 12.683861255645752, + 15.612725019454956, + 13.361177206039429, + 16.614356517791748 + ], + "21": [ + 43.69623780250549, + 43.26758122444153, + 46.91650056838989, + 47.875754833221436, + 57.9763662815094 + ], + "22": [ + 4.5090577602386475, + 4.420571804046631, + 4.639787673950195, + 5.118046998977661, + 5.017346143722534 + ] +} \ No newline at end of file diff --git a/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json new file mode 100644 index 000000000..f032d536d --- /dev/null +++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json @@ -0,0 +1,73 @@ +{ + "engine": "datafusion-python", + "datafusion-version": "38.0.1", + "benchmark": "tpch", + "data_path": "/mnt/bigdata/tpch/sf100/", + "query_path": "../../tpch/queries/", + "1": [ + 7.410699844360352 + ], + "2": [ + 2.966364622116089 + ], + "3": [ + 3.988652467727661 + ], + "4": [ + 1.8821499347686768 + ], + "5": [ + 6.957948684692383 + ], + "6": [ + 1.779731273651123 + ], + "7": [ + 14.559604167938232 + ], + "8": [ + 7.062309265136719 + ], + "9": [ + 14.908353805541992 + ], + "10": [ + 7.73533296585083 + ], + "11": [ + 2.346423387527466 + ], + "12": [ + 2.7248904705047607 + ], + "13": [ + 6.38663387298584 + ], + "14": [ + 2.4675676822662354 + ], + "15": [ + 4.799000024795532 + ], + "16": [ + 1.9091999530792236 + ], + "17": [ + 19.230653762817383 + ], + "18": [ + 25.15683078765869 + ], + "19": [ + 4.2268781661987305 + ], + "20": [ + 8.66620659828186 + ], + "21": [ + 17.696006059646606 + ], + "22": [ + 1.3805692195892334 + ] +} \ No newline at end of file diff --git 
a/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json new file mode 100644 index 000000000..012b05c3a --- /dev/null +++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json @@ -0,0 +1,184 @@ +{ + "engine": "datafusion-comet", + "benchmark": "tpch", + "data_path": "/mnt/bigdata/tpch/sf100/", + "query_path": "../../tpch/queries", + "spark_conf": { + "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false", + "spark.sql.warehouse.dir": "file:/home/andy/git/apache/datafusion-benchmarks/runners/datafusion-comet/spark-warehouse", + "spark.app.id": "app-20240528090804-0041", + "spark.app.submitTime": "1716908883258", + "spark.executor.memory": "8G", + "spark.master": "spark://woody:7077", + "spark.executor.id": "driver", + "spark.executor.instances": "8", + "spark.app.name": "DataFusion Comet Benchmark derived from TPC-H / TPC-DS", + "spark.driver.memory": "8G", + "spark.rdd.compress": "True", + "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false", + "spark.serializer.objectStreamReset": "100", + "spark.cores.max": "8", + "spark.submit.pyFiles": "", + "spark.executor.cores": "1", + "spark.submit.deployMode": "client", + "spark.sql.autoBroadcastJoinThreshold": "-1", + "spark.eventLog.enabled": "false", + "spark.app.startTime": "1716908883579", + "spark.driver.port": "33725", + "spark.driver.host": "woody.lan" + }, + "1": [ + 76.91316103935242, + 79.55859923362732, + 81.10397529602051, + 79.01998662948608, + 79.1286551952362 + ], + "2": [ + 23.977370262145996, + 22.214473247528076, + 22.686659812927246, + 22.016682386398315, + 21.766324520111084 + ], + "3": [ + 22.700742721557617, + 21.980144739151, + 21.876065969467163, + 21.661516189575195, + 21.69345998764038 + ], + "4": [ + 17.377647638320923, + 16.249598264694214, + 16.15747308731079, + 16.128843069076538, + 16.04338026046753 + ], + "5": [ + 44.38863182067871, + 45.47764492034912, + 45.76063895225525, + 45.16393995285034, + 60.848369121551514 + ], + "6": [ + 3.2041075229644775, + 2.970944881439209, + 2.891291856765747, + 2.9719409942626953, + 3.0702600479125977 + ], + "7": [ + 24.369274377822876, + 24.684266567230225, + 24.146574020385742, + 24.023175716400146, + 30.56047773361206 + 
], + "8": [ + 46.46081209182739, + 45.9838604927063, + 46.341185092926025, + 45.833823919296265, + 46.61182403564453 + ], + "9": [ + 67.67960548400879, + 67.34667444229126, + 70.34601259231567, + 71.24095153808594, + 84.38811421394348 + ], + "10": [ + 19.16477870941162, + 19.081010580062866, + 19.501060009002686, + 19.165698528289795, + 20.216782331466675 + ], + "11": [ + 17.158706426620483, + 17.05184030532837, + 17.714542150497437, + 17.004602909088135, + 17.700096130371094 + ], + "12": [ + 11.654477834701538, + 11.805298805236816, + 11.822469234466553, + 12.79678750038147, + 13.64478850364685 + ], + "13": [ + 20.430822372436523, + 20.18759250640869, + 21.26596975326538, + 21.234288454055786, + 20.189200162887573 + ], + "14": [ + 5.60215950012207, + 5.160705089569092, + 5.080057382583618, + 4.937625408172607, + 5.853632688522339 + ], + "15": [ + 14.17775845527649, + 13.898571729660034, + 14.215840578079224, + 14.316090106964111, + 14.356236457824707 + ], + "16": [ + 6.252386808395386, + 6.010213375091553, + 6.054978370666504, + 5.886059522628784, + 5.923115253448486 + ], + "17": [ + 71.41593313217163, + 70.25399804115295, + 72.07622528076172, + 72.27566242218018, + 72.20579051971436 + ], + "18": [ + 65.72738265991211, + 65.47461080551147, + 67.14260482788086, + 65.95489883422852, + 69.51795554161072 + ], + "19": [ + 7.1520891189575195, + 6.516514301300049, + 6.580992698669434, + 6.486274242401123, + 6.418147087097168 + ], + "20": [ + 12.619760036468506, + 12.235978126525879, + 12.116347551345825, + 12.161245584487915, + 12.30910348892212 + ], + "21": [ + 60.795483350753784, + 60.484593629837036, + 61.27316427230835, + 60.475560426712036, + 81.21473670005798 + ], + "22": [ + 8.926804065704346, + 8.113754034042358, + 8.029133796691895, + 7.99291467666626, + 8.439452648162842 + ] +} \ No newline at end of file diff --git a/docs/source/contributor-guide/benchmarking.md b/docs/source/contributor-guide/benchmarking.md index 502b35c29..3e9a61efb 100644 --- 
a/docs/source/contributor-guide/benchmarking.md +++ b/docs/source/contributor-guide/benchmarking.md @@ -19,44 +19,87 @@ under the License. # Comet Benchmarking Guide -To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Benchmarking scripts are -available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository. +To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Documentation and +scripts for data generation and benchmarking are available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository. -Here is an example command for running the benchmarks. This command will need to be adapted based on the Spark -environment and location of data files. +Here are example commands for running the benchmarks against a Spark cluster. These commands will need to be +adapted based on the Spark environment and location of data files. -This command assumes that `datafusion-benchmarks` is checked out in a parallel directory to `datafusion-comet`. +These commands are intended to be run from the `runners/datafusion-comet` directory in the `datafusion-benchmarks` +repository. 
+ +## Running Benchmarks Against Apache Spark ```shell -$SPARK_HOME/bin/spark-submit \ - --master "local[*]" \ - --conf spark.driver.memory=8G \ - --conf spark.executor.memory=64G \ - --conf spark.executor.cores=16 \ - --conf spark.cores.max=16 \ - --conf spark.eventLog.enabled=true \ - --conf spark.sql.autoBroadcastJoinThreshold=-1 \ - --jars $COMET_JAR \ - --conf spark.driver.extraClassPath=$COMET_JAR \ - --conf spark.executor.extraClassPath=$COMET_JAR \ - --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \ - --conf spark.comet.enabled=true \ - --conf spark.comet.exec.enabled=true \ - --conf spark.comet.exec.all.enabled=true \ - --conf spark.comet.cast.allowIncompatible=true \ - --conf spark.comet.explainFallback.enabled=true \ - --conf spark.comet.parquet.io.enabled=false \ - --conf spark.comet.batchSize=8192 \ - --conf spark.comet.columnar.shuffle.enabled=false \ - --conf spark.comet.exec.shuffle.enabled=true \ - --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \ - --conf spark.sql.adaptive.coalescePartitions.enabled=false \ - --conf spark.comet.shuffle.enforceMode.enabled=true \ - ../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \ - --benchmark tpch \ - --data /mnt/bigdata/tpch/sf100-parquet/ \ - --queries ../datafusion-benchmarks/tpch/queries +$SPARK_HOME/bin/spark-submit \ + --master $SPARK_MASTER \ + --conf spark.driver.memory=8G \ + --conf spark.executor.memory=32G \ + --conf spark.executor.cores=8 \ + --conf spark.cores.max=8 \ + --conf spark.sql.autoBroadcastJoinThreshold=-1 \ + tpcbench.py \ + --benchmark tpch \ + --data /mnt/bigdata/tpch/sf100/ \ + --queries ../../tpch/queries \ + --iterations 5 ``` -Comet performance can be compared to regular Spark performance by running the benchmark twice, once with -`spark.comet.enabled` set to `true` and once with it set to `false`. 
\ No newline at end of file +## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled + +```shell +$SPARK_HOME/bin/spark-submit \ + --master $SPARK_MASTER \ + --conf spark.driver.memory=8G \ + --conf spark.executor.memory=64G \ + --conf spark.executor.cores=8 \ + --conf spark.cores.max=8 \ + --conf spark.sql.autoBroadcastJoinThreshold=-1 \ + --jars $COMET_JAR \ + --conf spark.driver.extraClassPath=$COMET_JAR \ + --conf spark.executor.extraClassPath=$COMET_JAR \ + --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \ + --conf spark.comet.enabled=true \ + --conf spark.comet.exec.enabled=true \ + --conf spark.comet.exec.all.enabled=true \ + --conf spark.comet.cast.allowIncompatible=true \ + --conf spark.comet.explainFallback.enabled=true \ + --conf spark.comet.parquet.io.enabled=false \ + --conf spark.comet.batchSize=8192 \ + --conf spark.comet.exec.shuffle.enabled=true \ + --conf spark.comet.exec.shuffle.mode=auto \ + --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \ + --conf spark.sql.adaptive.coalescePartitions.enabled=false \ + tpcbench.py \ + --benchmark tpch \ + --data /mnt/bigdata/tpch/sf100/ \ + --queries ../../tpch/queries \ + --iterations 5 +``` + +## Current Performance + +Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup +for many use cases. + +The following benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and +data stored locally on NVMe storage. Performance characteristics will vary in different environments and we encourage +you to run these benchmarks in your own environments. + +![](../../_static/images/tpch_allqueries.png) + +Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query. 
+ +![](../../_static/images/tpch_queries_compare.png) + +The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization +is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future. + +![](../../_static/images/tpch_queries_speedup.png) + +The raw results of these benchmarks in JSON format are available here: + +- [Spark](./benchmark-results/2024-05-30/spark-8-exec-5-runs.json) +- [Comet](./benchmark-results/2024-05-30/comet-8-exec-5-runs.json) +- [DataFusion](./benchmark-results/2024-05-30/datafusion-python-8-cores.json) +
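The raw JSON result files above share a simple layout: string query numbers mapping to lists of per-run timings in seconds, alongside metadata keys such as `engine` and `spark_conf`. A minimal Python sketch of recomputing per-query speedups from that layout (the helper names are illustrative, not part of the benchmark scripts):

```python
import json
import statistics

def load_query_times(path):
    """Load a benchmark result file and return {query_number: [timings]}.

    Assumes the JSON layout used by these result files: numeric string
    keys ("1".."22") map to lists of per-run timings in seconds; other
    keys ("engine", "spark_conf", ...) are metadata and are skipped.
    """
    with open(path) as f:
        results = json.load(f)
    return {int(k): v for k, v in results.items() if k.isdigit()}

def speedups(baseline, comparison):
    """Per-query speedup of `comparison` relative to `baseline`.

    Uses the median of each query's runs to damp outlier runs.
    """
    return {
        q: statistics.median(baseline[q]) / statistics.median(comparison[q])
        for q in sorted(baseline)
        if q in comparison
    }
```

For example, `speedups(load_query_times("spark-8-exec-5-runs.json"), load_query_times("comet-8-exec-5-runs.json"))` should roughly reproduce the per-query ratios plotted in the speedup chart.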