From 30c20b2e98937cfacb6a59e7e8bcb372025df74f Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Sun, 28 Apr 2024 15:54:47 -0600
Subject: [PATCH] Add a plugin overview page to the contributors guide
---
 docs/source/contributor-guide/debugging.md | 2 +-
 .../contributor-guide/plugin_overview.md | 50 +++++++++++++++++++
 docs/source/index.rst | 5 +-
 3 files changed, 54 insertions(+), 3 deletions(-)
 create mode 100644 docs/source/contributor-guide/plugin_overview.md
diff --git a/docs/source/contributor-guide/debugging.md b/docs/source/contributor-guide/debugging.md
index 3b20ed0b2..38c396c15 100644
--- a/docs/source/contributor-guide/debugging.md
+++ b/docs/source/contributor-guide/debugging.md
@@ -99,7 +99,7 @@ https://mail.openjdk.org/pipermail/hotspot-dev/2019-September/039429.html
 Detecting the debugger
 https://stackoverflow.com/questions/5393403/can-a-java-application-detect-that-a-debugger-is-attached#:~:text=No.,to%20let%20your%20app%20continue.&text=I%20know%20that%20those%20are,meant%20with%20my%20first%20phrase).
-# Verbose debug
+## Verbose debug
 By default, Comet outputs the exception details specific for Comet.
diff --git a/docs/source/contributor-guide/plugin_overview.md b/docs/source/contributor-guide/plugin_overview.md
new file mode 100644
index 000000000..e8bcad16f
--- /dev/null
+++ b/docs/source/contributor-guide/plugin_overview.md
@@ -0,0 +1,50 @@
+
+
+# Comet Plugin Overview
+
+The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`:
+
+```
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
+```
+
+On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned.
+
+## CometScanRule
+
+`CometScanRule` replaces Parquet scans with the equivalent Comet Parquet scan classes.
+
+When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`.
+
+When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`.
+
+## CometExecRule
+
+`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
+
+This rule traverses the original Spark plan bottom-up and attempts to replace each node with a Comet equivalent. For example, a `ProjectExec` will be replaced by `CometProjectExec`.
+
+When replacing a node, various checks are performed to determine whether Comet can support the operator and its expressions. If an operator or expression is not supported by Comet, the reason is stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet, the original Spark plan is returned.
+
+Comet does not support replacing only a subset of the plan. Doing so would require adding transitions that convert between row-based and columnar data at the boundaries between Spark operators and Comet operators, and the overhead of these transitions could outweigh the benefits of running parts of the plan natively in Comet.
+
+Once the plan has been transformed, it is serialized into Comet protocol buffer format by the `QueryPlanSerde` class, and the serialized plan is passed into the native code by `CometExecIterator`.
+
+In the native code, a `PhysicalPlanner` struct (in `planner.rs`) converts the serialized plan into an Apache DataFusion physical plan. In some cases, Comet provides specialized physical operators and expressions that override the DataFusion versions to ensure compatibility with Apache Spark.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4462a8d87..0bf9929d4 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -49,9 +49,10 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 :caption: Contributor Guide
 Getting Started
+ Comet Plugin Overview
+ Development Guide
+ Debugging Guide
 Github and Issue Tracker
- contributor-guide/development
- contributor-guide/debugging
 .. _toc.asf-links:
 .. toctree::