Skip to content

Commit

Permalink
doc: Add Quickstart Comet doc section (#125)
Browse files Browse the repository at this point in the history
Co-authored-by: o_voievodin <[email protected]>
  • Loading branch information
comphead and o_voievodin authored Feb 28, 2024
1 parent 0a96145 commit 27f167b
Showing 1 changed file with 46 additions and 0 deletions.
46 changes: 46 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,3 +58,49 @@ Linux, Apple OSX (Intel and M1)
- Apache Spark 3.2, 3.3, or 3.4
- JDK 8 and up
- GLIBC 2.17 (Centos 7) and up

## Getting started

Make sure the requirements above are met and software installed on your machine

### Clone repo
```commandline
git clone https://github.com/apache/arrow-datafusion-comet.git
```

### Specify the Spark version and build the Comet
Spark 3.4 used for the example.
```
cd arrow-datafusion-comet
make release PROFILES="-Pspark-3.4"
```

### Run Spark with Comet enabled
Make sure `SPARK_HOME` points to the same Spark version as Comet has built for.

```
$SPARK_HOME/bin/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true
```

### Verify Comet enabled for Spark SQL query

Create a test Parquet source
```scala
scala> (0 until 10).toDF("a").write.mode("overwrite").parquet("/tmp/test")
```

Query the data from the test source and check:
- INFO message shows the native Comet library has been initialized.
- The query plan reflects Comet operators being used for this query instead of Spark ones
```scala
scala> spark.read.parquet("/tmp/test").createOrReplaceTempView("t1"); spark.sql("select * from t1 where a > 5").explain
INFO src/lib.rs: Comet native library initialized
== Physical Plan ==
*(1) ColumnarToRow
+- CometFilter [a#14], (isnotnull(a#14) AND (a#14 > 5))
+- CometScan parquet [a#14] Batched: true, DataFilters: [isnotnull(a#14), (a#14 > 5)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/test], PartitionFilters: [], PushedFilters: [IsNotNull(a), GreaterThan(a,5)], ReadSchema: struct<a:int>
```

0 comments on commit 27f167b

Please sign in to comment.