My Dagster code is hanging or taking longer than I expect to execute. How can I troubleshoot? #14771
We recommend that you use py-spy to profile your code.

For hanging code, `py-spy dump` can give you a dump of each thread, which usually makes it immediately clear where the hang is happening. For slow code, `py-spy record` can produce an SVG file that gives you a flame graph of where the process is spending the most time. (We recommend `py-spy record -f speedscope --idle` to produce speedscope profiles and to include idle CPU time in the results.) Note that py-spy usually requires elevated permissions in order to run.

A typical workflow for generating a py-spy dump for a hanging run in Kubernetes:

a) Set up your Dagster deployment so that each run pod uses a security context that can run py-spy (see https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-kubernetes). For example, if you're running a Kubernetes agent, you can set this in the dagster_cloud.yaml file for your code location so that both your code servers and run pods can work with py-spy:
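The block below is a minimal sketch of what that dagster_cloud.yaml entry could look like. The location name, image, and package name are placeholders, and the exact `container_context.k8s` keys used here (`server_k8s_config`, `run_k8s_config`, `security_context`) are assumptions to verify against the configuration reference linked below:

```yaml
# dagster_cloud.yaml (sketch; confirm key names against the Kubernetes agent
# configuration reference before relying on them)
locations:
  - location_name: my-code-location      # placeholder
    image: my-registry/my-image:latest   # placeholder
    code_source:
      package_name: my_package           # placeholder
    container_context:
      k8s:
        # Applied to the code server pods for this location
        server_k8s_config:
          container_config:
            security_context:
              capabilities:
                add:
                  - SYS_PTRACE
        # Applied to each run pod launched from this location
        run_k8s_config:
          container_config:
            security_context:
              capabilities:
                add:
                  - SYS_PTRACE
```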
Or, if you're using the Dagster open-source Helm chart, you can configure the run launcher to launch each run with a similar security context:
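Here is a sketch of the corresponding Helm values, assuming the `K8sRunLauncher`'s `runK8sConfig` section is available in your chart version; check the customization guide linked below for the exact schema:

```yaml
# values.yaml for the Dagster Helm chart (sketch; verify the runK8sConfig
# schema for your chart version)
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      runK8sConfig:
        containerConfig:
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE
```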
(Note that this gives the pod elevated permissions; check with your cluster admins to make sure this is an acceptable change to make temporarily.) See https://docs.dagster.io/dagster-cloud/deployment/agents/kubernetes/configuration-reference#per-location-configuration (Cloud) and https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#per-job-or-per-op-kubernetes-configuration (OSS) for more information on how to apply these types of customizations to your Kubernetes pods.

b) Launch a run and wait until it hangs.

c) Check the event logs for the run to find the run pod, then open a shell in it:
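For example (assuming you have kubectl access to the cluster; the pod name and namespace are placeholders to fill in from the event logs):

```bash
# Open a shell in the hanging run pod (placeholder names)
kubectl exec -it <run-pod-name> -n <namespace> -- /bin/bash
```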
d) Install py-spy inside the pod, then run it against the run process:
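For example, from the shell inside the run pod (the run process is often PID 1 in the pod, but that is an assumption; confirm with `ps aux` and substitute the real PID):

```bash
# Inside the run pod: install py-spy and dump every thread of the run process
pip install py-spy
ps aux                # find the PID of the Dagster run process
py-spy dump --pid 1   # assumes the run process is PID 1; substitute the real PID
```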
This should output a dump of what each thread in the process is doing.
How would this work for the Docker-based deployment? Would it be possible to reuse the special tag from https://dagster.slack.com/archives/C02LJ7G0LAZ/p1686306297612339 to launch the run in a persistently running container (where I can install py-spy ahead of time)?
Hi @gibsondan, I am trying to run py-spy in the user_code container to understand why a job container takes a long time to get started (once ready on AWS) before actually running the job itself. The loading of the code location within the job container seems to be the issue. Do you know which process I should track with py-spy?