Code location instability - gRPC Error code: UNAVAILABLE #17729
-
I've recently increased the number of concurrent jobs running on Dagster, and I've noticed that the code location has become less stable than before. I'm encountering errors such as `dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach the user code server. gRPC Error code: UNAVAILABLE`. What can I do to enhance the stability of the code location?
-
Hi @ytoast. The error `dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach the user code server. gRPC Error code: UNAVAILABLE` means that the process trying to connect to the user code server couldn't. This could be due to any number of reasons depending on your deployment setup. As for the job failures: the executing jobs don't communicate with the user code server, so the UNAVAILABLE error isn't causing the job failures, but whatever is causing the UNAVAILABLE error could also be causing the jobs to fail. You'll likely need to use the debugging tools for your particular deployment type to debug why the processes are failing.
-
hi @jamiedemaria :) I am facing the same issue on my EKS deployment. I am using two user code deployments, one with the default image "docker.io/dagster/dagster-celery-k8s" and one with my own image. The default user code deployment works fine. In the exact error message coming from Dagit, the 172.20.189.35 IP is the service IP of my custom user code deployment. I don't think my issue is related to networking, because Dagit connects to the service of the default user code without any problem. I have also tried running my image on my local setup, so what could be the problem? I can't get any useful info from the logs.
-
We have also been running into this periodically. As far as I can tell, the code deployments are fine the entire time, so it feels weird that the webserver can't seem to connect to them.
-
Checking in, representing another team that has been having this problem: irregular but frequent (several times per week) instances of the gRPC server becoming unreachable, with no other apparent problems. This leads to us having to not only restart Dagster but also go in and manually terminate jobs from the old instance (which Dagster will then "fail" to terminate, saying they're unreachable, but at least then they stop being erroneously blue on the asset graph).
-
What kind of debugging tools would you suggest, @jamiedemaria? I'm using a k8s deployment setup. I tried snooping around but can't find anything substantial.
-
@jamiedemaria My team is also experiencing this issue on our k8s deployment. We have around 40 assets actively materializing on 90-second interval sensors, which we'd like to scale up much further. In addition to intermittent gRPC errors, we are also finding the UI to be very slow or unresponsive at times. We are looking for ways to debug or tweak our setup for better stability, and we are also trying to understand whether there is an upper limit to the number of jobs a k8s Dagster deployment can handle. This is currently a barrier to our team adopting Dagster as a tool.
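For context, the only throttling we have in place so far is the queued run coordinator's concurrency limits in dagster.yaml, roughly like the sketch below (the numbers and the tag are illustrative, not a recommendation):

```yaml
# dagster.yaml -- illustrative values only
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 25          # cap total in-flight runs so launches don't pile up
    tag_concurrency_limits:          # optionally throttle particular groups of runs
      - key: "dagster/sensor_name"   # limit sensor-launched runs as a group
        limit: 5
```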
-
Hello 👋 I'm here also representing my team, which is facing the same issue. We have Dagster running on K8s with three pods. One is for the user code, one for dagit, and the last for the daemon/scheduler. The gRPC error happens randomly, and our pipelines don't fail, but the connection sometimes gets lost when there's a new deployment and the user code pod needs to be destroyed/created again. To "fix" this, we have to go to the UI and manually reload the workspace. As I said, this doesn't always happen, and considering that we have multiple deployments in a day, the probability is low, but either way, it would be nice to find a solution.
-
Hello, To expand on what was explained by @jamiedemaria: the UNAVAILABLE error means that some process could not reach your code location server. The underlying causes could include your code location being unhealthy, being resource constrained (CPU or memory exhaustion) and unable to answer within the configured timeouts, network issues, or underlying infrastructure issues (node cycling, eviction, spot instance termination, etc.). Sometimes one misconfigured node, or a malfunctioning essential daemonset on a specific node, can create recurring but unpredictable problems at the networking level (DNS resolution, iptables rules, depending on how your Kubernetes cluster implements ClusterIP services).
Off the top of my head: you can hit several soft limits as you scale your usage of Dagster on k8s. How much compute do you allocate to the various elements of the Dagster control plane, to Kubernetes itself, etc.? How did you configure node scaling for your clusters? How big are your jobs: are they running for hours and consuming tons of CPU or memory, or are they lightweight? Many such soft limits are not Dagster specific but are common to running large distributed workloads in general. What are the specific pain points you're encountering?
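As one concrete starting point on the resource side, make sure the code location (user code) servers have explicit CPU and memory requests and limits so they can't be starved or evicted under load. A minimal sketch of Helm values for the standard Dagster chart follows; the deployment name, image, and numbers are placeholders, and exact key names can vary between chart versions:

```yaml
# values.yaml for the Dagster Helm chart -- placeholder names and numbers
dagster-user-deployments:
  enabled: true
  deployments:
    - name: my-user-code                 # hypothetical code location name
      image:
        repository: my-registry/my-user-code
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/opt/dagster/app/repo.py"
      port: 3030
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi
```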
-
This might be irrelevant, as this error can occur for a lot of reasons, but we have been struggling with this issue on GKE. In our scenario, we occasionally have spot instances that are replaced, moving workloads to new GKE nodes. If this causes the Dagster postgres instance to be moved, the new postgres pod on the new node is delayed starting up again because the volume takes up to 5 minutes to copy. This seems to leave the code location in a broken state, and it doesn't seem to recover even when postgres finally comes back online. It's worth looking at the logs in the daemon process as well as the code location user deployment to see whether postgres connection errors are the underlying cause.
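One mitigation we are considering is keeping the Dagster postgres pod off the spot node pool entirely, so a spot reclamation never forces the volume to follow a new node. How you express that depends on how postgres is deployed (Helm subchart values vs. your own manifest); as a sketch, the pod template of whatever Deployment or StatefulSet runs postgres would get a node selector along these lines (the label is hypothetical, use whatever marks your on-demand pool):

```yaml
# fragment of the postgres Deployment/StatefulSet pod template -- sketch only
spec:
  template:
    spec:
      nodeSelector:
        node-pool: on-demand   # hypothetical label identifying a non-spot node pool
```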
-
A recurring culprit of slow code location server start up is the parsing of dbt models to generate the manifest file. See more about this in: Loading dbt models from a dbt project. To prevent this, ensure the dbt manifest.json is generated ahead of time, for example during your CI/CD build, rather than parsed every time the code server starts.
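As an illustration, a build step along these lines can produce the manifest before the user code image is built; the step name, paths, and adapter are hypothetical, so adapt them to your own pipeline (on older dbt versions, `dbt compile` also writes `target/manifest.json`):

```yaml
# hypothetical CI step (GitHub Actions syntax) -- generate the dbt manifest at build time
- name: Prepare dbt manifest
  run: |
    pip install dbt-core dbt-postgres          # adapter shown as an example; use your own
    # assumes a profiles.yml is available to dbt in CI
    dbt parse --project-dir my_dbt_project     # writes my_dbt_project/target/manifest.json
# then bake target/manifest.json into the user code image so the code server
# loads the prebuilt manifest instead of re-parsing the dbt project at startup
```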
-
Our team has been hit by this issue as well. We are just running "dagster dev" inside a docker container. The error message we see is slightly different from the one posted by @nadamakram.
I see no spikes in CPU/memory/network usage when this occurs, and the problem disappears by itself after an hour or so. We had to set "max_concurrent_runs: 1" to limit concurrency, as each of our assets runs in parallel using ray, but I don't think that has anything to do with this issue. Do you recommend setting any special options for the docker container (such as shm_size)?
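For reference, this is the kind of thing we are considering adding to our compose file; it's a sketch with guessed values rather than anything we have validated, and the image name is just a placeholder:

```yaml
# docker-compose.yml fragment -- guessed values, not validated
services:
  dagster:
    image: my-dagster-dev-image      # hypothetical image that runs `dagster dev`
    command: ["dagster", "dev", "-h", "0.0.0.0", "-p", "3000"]
    ports:
      - "3000:3000"
    shm_size: "2gb"     # larger shared memory, since ray's object store uses /dev/shm
    mem_limit: 8g       # explicit memory cap so OOM kills show up clearly
```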
-
Besides what has already been suggested above, I have tried everything, from increasing local_startup_timeout and reload_timeout under code_servers in dagster.yaml:

```yaml
code_servers:
  local_startup_timeout: 300
  reload_timeout: 300
```

to adding the environment variables:

```
DAGSTER_GRPC_TIMEOUT_SECONDS=300
DAGSTER_REPOSITORY_GRPC_TIMEOUT_SECONDS=300
```

Still, nothing seems to improve.