My Dagster code is hanging or taking longer than I expect to execute. How can I troubleshoot? #14771
We recommend that you use py-spy to profile your code.

For hanging code, `py-spy dump` can give you a dump of each thread, which usually makes it immediately clear where the hang is happening. For slow code, `py-spy record` can produce an SVG file that gives you a flame graph of where the process is spending the most time. (We recommend `py-spy record -f speedscope --idle` to produce speedscope profiles and to include idle CPU time in the results.) Note that py-spy usually requires elevated permissions in order to run.

A typical workflow for generating a py-spy dump for a hanging run in Kubernetes:

a) Set up your Dagster deployment so that each run pod uses a security context that can run py-spy (see https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-kubernetes). For example, if you're running a Kubernetes agent, you can set this in the dagster_cloud.yaml file for your code location so that both your code servers and run pods can work with py-spy:
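The block below is a minimal sketch of what that dagster_cloud.yaml entry could look like. The location name, image, and package name are placeholders, and the exact `container_context.k8s` keys used here (`server_k8s_config`, `run_k8s_config`, `security_context`) are assumptions to verify against the configuration reference linked below:

```yaml
# dagster_cloud.yaml (sketch; confirm key names against the Kubernetes agent
# configuration reference before relying on them)
locations:
  - location_name: my-code-location      # placeholder
    image: my-registry/my-image:latest   # placeholder
    code_source:
      package_name: my_package           # placeholder
    container_context:
      k8s:
        # Applied to the code server pods for this location
        server_k8s_config:
          container_config:
            security_context:
              capabilities:
                add:
                  - SYS_PTRACE
        # Applied to each run pod launched from this location
        run_k8s_config:
          container_config:
            security_context:
              capabilities:
                add:
                  - SYS_PTRACE
```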
Or, if you're using the Dagster open-source Helm chart, you can configure the run launcher to launch each run with a similar security context:
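Here is a sketch of the corresponding Helm values, assuming the `K8sRunLauncher`'s `runK8sConfig` section is available in your chart version; check the customization guide linked below for the exact schema:

```yaml
# values.yaml for the Dagster Helm chart (sketch; verify the runK8sConfig
# schema for your chart version)
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      runK8sConfig:
        containerConfig:
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE
```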
(Note that this gives the pod elevated permissions; check with your cluster admins to make sure this is an acceptable change to make temporarily.) See https://docs.dagster.io/dagster-cloud/deployment/agents/kubernetes/configuration-reference#per-location-configuration (Cloud) and https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#per-job-or-per-op-kubernetes-configuration (OSS) for more information on how to apply these types of customizations to your Kubernetes pods.

b) Launch a run and wait until it hangs.

c) Check the event logs for the run to find the run pod, then open a shell in it:
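For example (assuming you have kubectl access to the cluster; the pod name and namespace are placeholders to fill in from the event logs):

```bash
# Open a shell in the hanging run pod (placeholder names)
kubectl exec -it <run-pod-name> -n <namespace> -- /bin/bash
```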
d) Install py-spy inside the pod, then run it against the run process:
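For example, from the shell inside the run pod (the run process is often PID 1 in the pod, but that is an assumption; confirm with `ps aux` and substitute the real PID):

```bash
# Inside the run pod: install py-spy and dump every thread of the run process
pip install py-spy
ps aux                # find the PID of the Dagster run process
py-spy dump --pid 1   # assumes the run process is PID 1; substitute the real PID
```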
This should output a dump of what each thread in the process is doing.
How would this work for the Docker-based deployment? Would it be possible to reuse the special tag from https://dagster.slack.com/archives/C02LJ7G0LAZ/p1686306297612339 to launch the run in a persistently running container (where I can install py-spy ahead of time)?
Hi @gibsondan, I am trying to run py-spy in the user_code container to understand why a job container takes a long time to get started (once ready on AWS) before actually running the job itself. The loading of the code location within the job container seems to be the issue. Do you know which process I should track with py-spy?