pyspark not starting task on GPU #8094

saifmasood · 2023-04-13T08:37:41Z

saifmasood
Apr 13, 2023

I'm testing GPU support for PySpark with spark-rapids using a simple program that reads a CSV file into a DataFrame and displays it. However, no tasks are being run, and the PySpark progress bar simply displays (0 + 0) / 1, i.e., no tasks are active. Could anyone point out what I might be doing wrong?

pyspark-version: 3.3.0 (local mode)
Using rapids-4-spark_2.12-23.02.0.jar, cudf-23.02.0-cuda11.jar

Python code:

def get_spark_session():
    from pyspark.sql.session import SparkSession
    from pyspark import SparkContext, SparkConf
    
    spark_conf = SparkConf()
    spark_conf.set("spark.eventLog.enabled", "true")
    spark_conf.set("spark.rapids.sql.enabled", "true")
    spark_conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
    spark_conf.set('spark.rapids.sql.explain', 'ALL')
    spark_conf.set("spark.jars", "/home/local/ASUAD/snola119/DFInferBench/cudf-23.02.0-cuda11.jar,/home/local/ASUAD/snola119/DFInferBench/rapids-4-spark_2.12-23.02.0.jar,/home/local/ASUAD/snola119/DFInferBench/postgresql-42.6.0.jar")
    spark_conf.set("spark.executor.resource.gpu.discoveryScript","/home/local/ASUAD/snola119/DFInferBench/getGpusResources.sh")
    spark_conf.set("spark.driver.resource.gpu.discoveryScript","/home/local/ASUAD/snola119/DFInferBench/getGpusResources.sh")

    spark_conf.set("spark.task.resource.gpu.amount","0.125")
    spark_conf.set("spark.driver.resource.gpu.amount","1")
    spark_conf.set("spark.executor.resource.gpu.amount", "1")
    spark_conf.set('spark.rapids.sql.format.csv.read.enabled', 'true')
    spark_conf.set('spark.rapids.sql.format.csv.enabled', 'true')
    spark_conf.set("spark.rapids.sql.incompatibleOps.enabled", 'true') 

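    # getOrCreate is a classmethod that takes a SparkConf, so the app name goes on the conf.
    spark_conf.setAppName("DFInferBench")
    sc = SparkContext.getOrCreate(conf=spark_conf)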
    sc.setLogLevel("INFO")
    return SparkSession(sc)


def fetch_data(spark):
    path = "/home/local/ASUAD/snola119/DFInferBench/dataset/HIGGS_mtest.csv"
    df = spark.read.csv(path, inferSchema=True)
    df.show()
    # Return the DataFrame so the caller's assignment below is not None.
    return df

spark = get_spark_session()
df = fetch_data(spark)

Output:

nvidia-smi (before running the program)

nvidia-smi (after running the program)

I can see that the task has been scheduled but I'm not sure why it's not making any progress.

Answered by revans2

revans2 · 2023-04-13T13:27:45Z

revans2
Apr 13, 2023
Maintainer

@saifmasood thank you for filing this.

Reading a CSV file happens in two different stages. The first stage is schema discovery. Schema discovery happens if you do not provide a schema for the CSV data, like you are doing in your query. We have not optimized schema discovery for CSV or JSON for a number of reasons. The output from the plugin shows that it saw the schema discovery portion and tried to translate at least parts of it to the GPU.
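One way to sidestep the unaccelerated schema-discovery stage described above is to hand the reader an explicit schema, so there is nothing left to infer. The following is a minimal sketch, assuming a header-less CSV in which every column is a double. The helper `build_ddl_schema` and the column count used here are hypothetical and do not come from the original post.

```python
def build_ddl_schema(num_columns, dtype="DOUBLE", prefix="_c"):
    """Return a DDL schema string such as '_c0 DOUBLE, _c1 DOUBLE'.

    Hypothetical helper: the names mirror Spark's default column names for
    header-less CSV files (_c0, _c1, ...). The column count and the type are
    assumptions about the data and must match the real file.
    """
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    return ", ".join(f"{prefix}{i} {dtype}" for i in range(num_columns))


# Example with a deliberately small, made-up column count.
ddl = build_ddl_schema(3)
print(ddl)
```

The resulting string could then be passed as `spark.read.csv(path, schema=ddl)` in place of `inferSchema=True`, since the CSV reader accepts either a `StructType` or a DDL-formatted string for `schema`.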

I see a few potential problems with your configs depending on what mode you are running in.

If you are in local mode, Spark does not deal with GPU resources well at all and will hang. Please remove all requests for GPU resources in local mode; it is probably a good idea to remove all resource requests in general in local mode. Local mode also cannot deal with more than one GPU, so it will select one of the GPUs to use and ignore the other A10.
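The local-mode advice above could be sketched as a small filter over the settings from the question. This is only an illustration: the helper `conf_for_master` and the regular expression are hypothetical, and the sketch covers only the `spark.{driver,executor,task}.resource.gpu.*` keys that appear in the original code.

```python
import re

# Matches the GPU resource request keys used in the question, e.g.
# spark.task.resource.gpu.amount or spark.driver.resource.gpu.discoveryScript.
_GPU_RESOURCE_KEY = re.compile(r"^spark\.(driver|executor|task)\.resource\.gpu\.")


def conf_for_master(settings, master):
    """Return a copy of settings, dropping GPU resource requests in local mode."""
    if not master.startswith("local"):
        return dict(settings)
    return {k: v for k, v in settings.items() if not _GPU_RESOURCE_KEY.match(k)}


settings = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    "spark.task.resource.gpu.amount": "0.125",
    "spark.driver.resource.gpu.amount": "1",
    "spark.executor.resource.gpu.amount": "1",
}

# In local mode only the plugin settings survive; the resource requests are dropped.
local_settings = conf_for_master(settings, "local[*]")
print(sorted(local_settings))
```

The filtered pairs could then be applied with `SparkConf().setAll(local_settings.items())`, leaving the plugin settings in place while omitting the resource requests that local mode handles poorly.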

If you are in standalone, YARN, or Kubernetes, then it looks like you may have a problem with requesting a GPU for the driver.

    spark_conf.set("spark.driver.resource.gpu.amount","1")

You have 2 GPUs so it might work out okay to have one GPU for the driver, which it will not use, and one GPU for an executor. But that should not cause a hang.

There is also a warning printed by Spark indicating that you have 8 CPU cores, but are only asking for 1/4 of a GPU per task. This means that you would run with only 4 of your 8 CPU cores in an executor, and likely there would be no other executors running. If this is intended, then that is fine; I just wanted to call it out. It is very unlikely that it is causing the hang.
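The arithmetic behind that warning can be sketched as follows. This is a simplified model rather than Spark's scheduler code: it assumes a fractional GPU amount per task (at most 1) and uses a hypothetical helper name, `concurrent_tasks`.

```python
from fractions import Fraction


def concurrent_tasks(executor_cores, executor_gpus, task_gpu_amount, task_cpus=1):
    """Estimate how many tasks one executor can run at the same time.

    Simplified model (an assumption, not Spark's actual code): each task needs
    task_cpus cores and task_gpu_amount of one GPU, and the scarcer of the two
    resources sets the limit.
    """
    amount = Fraction(str(task_gpu_amount))
    if not 0 < amount <= 1:
        raise ValueError("this sketch only covers fractional GPU amounts in (0, 1]")
    cpu_slots = executor_cores // task_cpus
    # Fraction avoids floating-point surprises; int() truncates 1/amount.
    gpu_slots = executor_gpus * int(1 / amount)
    return min(cpu_slots, gpu_slots)


# 8 cores, 1 GPU, 1/4 GPU per task: the GPU share caps concurrency below the core count.
print(concurrent_tasks(8, 1, 0.25))
```

Under this model, lowering the per-task GPU amount to 1/8 would let all 8 cores run tasks on a single GPU.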

It would be great to see what the resource scheduler says about what is happening, and also to get jstack output for the Java process on the GPU. If there is a driver process too, jstack output for it would also help us understand what is going on.

saifmasood · 2023-04-14T09:26:01Z

saifmasood
Apr 14, 2023
Author

@revans2 Thank you for the detailed response. After going through your response, I moved to Spark standalone and found out that the workers did not have the resources to schedule the tasks. It turns out that the issue was with
spark_conf.set("spark.executor.resource.gpu.amount", "1")
I followed this resource and was able to make PySpark work in both local and standalone mode using RAPIDS. However, I see that the GPU doesn't outperform the CPU for my workloads (I see that a lot of tasks cannot be scheduled on the GPU).

Specifically, I am using PySpark's MLlib for training random forest models. Do you think spark-rapids would not help me in this scenario?

revans2 · 2023-04-14T13:53:58Z

revans2
Apr 14, 2023
Maintainer

The current code for the RAPIDS Accelerator does not accelerate MLlib, but https://github.com/NVIDIA/spark-rapids-ml should provide a mostly API-compatible library for many MLlib operations, including random forest.

1 reply

saifmasood Apr 14, 2023
Author

Thanks @revans2
