with datafusion comet,no performance improvement. #1084

Open
shaileneF opened this issue Nov 14, 2024 · 6 comments


@shaileneF

env:

host:
    os: CentOS Linux release 8.2.2004 (Core)
    kernel: 4.18.0-193.el8.x86_64
    memory: 1T
    jdk: 1.8
    maven: 3.9.6
    spark: 3.4
    scala: 2.12
container:
    os: CentOS Linux release 7.4.1708 (Core)
    kernel: 4.18.0-193.el8.x86_64
    cpu cores: 128
    spark: 3.4.3
    memory: 1T
    jdk: 11
    maven: 3.9.6
    spark: 3.4
    scala: 2.12

data:TPCDS 100G/1T

With DataFusion Comet, spark-submit command:

export COMET_JAR=/export/datafusion-test/comet-spark-spark3.4_2.12-0.3.0.jar

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=40G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=40G \
    --jars $COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2

Without DataFusion Comet, spark-submit command:

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=80G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=false \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2
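
Both commands use `--master local` together with executor-level settings. As far as I understand Spark's local mode, tasks run on threads inside the driver JVM (`local` means one thread, `local[*]` one per logical core), so settings aimed at separately launched executors probably have no effect there. The sketch below is a hypothetical helper, not part of Spark or Comet, that flags such settings. The key list is my assumption and is not exhaustive.

```python
# Settings that, to my understanding, only matter when a cluster manager
# launches separate executor processes (assumption, not an exhaustive list).
EXECUTOR_ONLY_KEYS = {
    "spark.executor.instances",
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.memoryOverhead",
    "spark.cores.max",
}


def local_threads(master):
    """Return the worker thread count implied by a local master string.

    "local" -> 1, "local[N]" -> N, "local[*]" -> "all cores".
    Returns None when the master is not a local one.
    """
    if master == "local":
        return 1
    if master.startswith("local[") and master.endswith("]"):
        # "local[N,F]" also carries a max-failures value; keep only N.
        inner = master[len("local["):-1].split(",")[0].strip()
        return "all cores" if inner == "*" else int(inner)
    return None


def review_conf(master, conf):
    """List notes about settings that likely do nothing in local mode."""
    threads = local_threads(master)
    if threads is None:
        return []
    notes = [f"master={master}: worker threads = {threads}"]
    for key in sorted(conf):
        if key in EXECUTOR_ONLY_KEYS:
            notes.append(f"{key}={conf[key]} probably has no effect in local mode")
    return notes


# A subset of the settings from the Comet command above.
conf = {
    "spark.driver.memory": "20G",
    "spark.executor.instances": "16",
    "spark.executor.memory": "40G",
    "spark.executor.cores": "8",
    "spark.cores.max": "128",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "40G",
}

for note in review_conf("local", conf):
    print(note)
```

If this reading is right, the plain `local` runs would use a single task thread regardless of the 128 available cores, which is worth ruling out before comparing the two configurations.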


description:

Hello, I am running Spark + DataFusion Comet + TPC-DS in local mode. Whether the master is set to local or local[*], DataFusion Comet does not significantly improve performance, and many queries even get slower. Could you please help check whether my configuration is incorrect? I tested with the 100 GB and 1 TB TPC-DS datasets, and the total query duration improves by only about 6% with DataFusion Comet. My container has 128 cores and 1 TB of memory. 🙏🙏🙏
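
To make the "about 6%" figure and the "slower queries" concrete, here is a minimal sketch of how the two runs could be compared. It assumes per-query durations are already available as plain dictionaries; how to load them from the benchmark output is not shown because that format is not described here. The timings are placeholders invented for illustration, not measurements from this issue.

```python
def total_improvement(baseline, candidate):
    """Percent reduction in total runtime of candidate relative to baseline.

    Only queries present in both runs are counted.
    """
    common = baseline.keys() & candidate.keys()
    base_total = sum(baseline[q] for q in common)
    cand_total = sum(candidate[q] for q in common)
    return 100.0 * (base_total - cand_total) / base_total


def regressions(baseline, candidate):
    """Names of queries that ran slower in candidate than in baseline."""
    common = baseline.keys() & candidate.keys()
    return sorted(q for q in common if candidate[q] > baseline[q])


# Placeholder durations in seconds (hypothetical, for illustration only).
spark = {"q1": 10.0, "q2": 20.0, "q3": 20.0}
comet = {"q1": 8.0, "q2": 22.0, "q3": 17.0}

print(f"total improvement: {total_improvement(spark, comet):.1f}%")
print("slower with Comet:", regressions(spark, comet))
```

Listing the regressed queries alongside the total helps show whether a small overall gain hides a few large slowdowns.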

@andygrove
Member

Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

@shaileneF
Author

> Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

Yes, 0.3.0. I downloaded the release jar from https://datafusion.apache.org/comet/user-guide/installation.html.
Thank you for running the benchmark. I would like to know whether my spark-submit config is correct.

@andygrove
Member

One more question @shaileneF ... is your data set partitioned by date?

@shaileneF
Author

> One more question @shaileneF ... is your data set partitioned by date?

The dataset was partitioned during generation, but not by date.

@shaileneF
Author

> One more question @shaileneF ... is your data set partitioned by date?

Here is the dataset generation script:
https://github.com/apache/incubator-gluten/tree/main/tools/workload/tpcds/gen_data/parquet_dataset

@andygrove
Member

We likely need to resolve #1123 to get better performance results
