pyspark not starting task on GPU #8094
-
Replies: 3 comments 1 reply
-
@saifmasood thank you for filing this.
Reading a CSV file happens in two stages. The first stage is schema discovery, which happens when you do not provide a schema for the CSV data, as in your query. We have not optimized schema discovery for CSV or JSON for a number of reasons. The output from the plugin shows that it saw the schema discovery portion and tried to translate at least parts of it to the GPU.
I see a few potential problems with your configs, depending on what mode you are running in.
If you are in local mode, Spark does not deal with GPU resources well at all and will hang. Please remove all requests for GPU resources in local mode; it is probably good to remove all resource requests in general in local mode. Local mode also cannot use more than one GPU, so it will select one of the GPUs and ignore the other A10.
If you are in standalone, YARN, or Kubernetes, then it looks like you may have a problem with requesting a GPU for the driver. You have 2 GPUs, so it might work out okay to have one GPU for the driver, which it will not use, and one GPU for an executor. But that should not cause a hang.
There is also a warning from Spark indicating that you have 8 CPU cores but are only asking for 1/4 of a GPU per task. This means that you would run with only 4 of your 8 CPU cores in an executor, and likely there would be no other executors running. If this is intended, then that is fine; I just wanted to call it out. It is very unlikely that it is causing the hang.
It would be great to see what the resource scheduler says about what is happening, and also to get a jstack output for the Java process on the GPU. If there is a separate driver process, a jstack output for it would help too.
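For readers following along, here is a minimal sketch of the three points above: supplying a schema so the inference pass is skipped, dropping resource requests for local mode, and the arithmetic behind the "4 of your 8 CPU cores" remark. The helper names and the values in `example_conf` are hypothetical, since the original configs are not shown in this thread. The concurrency formula is a simplified model of how Spark limits tasks per executor, not the scheduler's actual code.

```python
from typing import Dict, List, Tuple


def build_ddl_schema(columns: List[Tuple[str, str]]) -> str:
    """Join (name, type) pairs into a DDL string such as 'id INT, name STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_csv_with_schema(spark, path: str, ddl: str):
    """Read a CSV with an explicit schema so Spark does not need to infer one.

    `spark` is an existing SparkSession; DataFrameReader.schema accepts a DDL string.
    """
    return spark.read.schema(ddl).option("header", "true").csv(path)


def drop_resource_requests(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf without any '.resource.' entries (for local mode)."""
    return {k: v for k, v in conf.items() if ".resource." not in k}


def concurrent_tasks_per_executor(conf: Dict[str, str]) -> int:
    """Simplified model: tasks are capped by both CPU slots and GPU slots."""
    cores = int(conf.get("spark.executor.cores", "1"))
    task_cpus = int(conf.get("spark.task.cpus", "1"))
    by_cpu = cores // task_cpus

    gpus = float(conf.get("spark.executor.resource.gpu.amount", "0"))
    gpu_per_task = float(conf.get("spark.task.resource.gpu.amount", "0"))
    if gpus == 0 or gpu_per_task == 0:
        return by_cpu  # no GPU constraint requested

    by_gpu = int(gpus / gpu_per_task)
    return min(by_cpu, by_gpu)


# Assumed values for illustration only, matching the 8 cores and 1/4 GPU per task
# mentioned above.
example_conf = {
    "spark.executor.cores": "8",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
}

if __name__ == "__main__":
    print(concurrent_tasks_per_executor(example_conf))
    print(sorted(drop_resource_requests(example_conf)))
```

With these assumed values the GPU request is the tighter limit. Raising the tasks-per-GPU share (a smaller `spark.task.resource.gpu.amount`) would be one way to let all 8 cores run tasks, if that is what you want.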
-
@revans2 Thank you for the detailed response. After going through your response, I moved to Spark standalone and found out that the workers did not have resources to schedule the tasks. It turns out that the issue was with Specifically, I am using PySpark's MLlib for training random forest models. Do you think spark-rapids would not help me in this scenario?
-
The current code for the RAPIDS Accelerator does not accelerate MLlib, but https://github.com/NVIDIA/spark-rapids-ml should provide a mostly API-compatible library for many MLlib operations, including random forest.
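As a rough illustration of what "mostly API compatible" could look like in practice, the sketch below picks the GPU implementation when it is installed and falls back to stock MLlib otherwise. It assumes spark-rapids-ml exposes `RandomForestClassifier` under `spark_rapids_ml.classification`, mirroring `pyspark.ml.classification`; check the project README for the exact module layout and supported parameters. The helper names are hypothetical.

```python
import importlib
import importlib.util


def random_forest_module(prefer_gpu: bool = True) -> str:
    """Return the module name to import RandomForestClassifier from."""
    # find_spec on a top-level name returns None when the package is not installed.
    if prefer_gpu and importlib.util.find_spec("spark_rapids_ml") is not None:
        return "spark_rapids_ml.classification"  # assumed module path
    return "pyspark.ml.classification"


def load_random_forest_classifier(prefer_gpu: bool = True):
    """Import and return the RandomForestClassifier class from the chosen module."""
    module = importlib.import_module(random_forest_module(prefer_gpu))
    return module.RandomForestClassifier
```

Usage would then look the same for either backend, for example `rf = load_random_forest_classifier()(featuresCol="features", labelCol="label")` followed by `rf.fit(train_df)`, where `train_df` is your own training DataFrame.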