Code location instability - gRPC Error code: UNAVAILABLE #17729

ytoast · 2023-11-06T18:32:49Z

ytoast
Nov 6, 2023

I've recently increased the number of concurrent jobs running on Dagster, and I've noticed that the code location has become less stable than before. I'm encountering errors such as dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach the user code server. gRPC Error code: UNAVAILABLE. It seems that the gRPC server connection intermittently disconnects, approximately every 10-20 minutes. This connectivity issue is causing existing jobs to abruptly fail with dagster._core.errors.DagsterExecutionInterruptedError errors. It seems that I'm getting this error bcos it is failing the dagster api grpc-health-check

What can I do to enhance the stability of the code location?

Answered by jamiedemaria

Nov 7, 2023

Hi @ytoast . The dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach the user code server. gRPC Error code: UNAVAILABLE means that the process trying to connect to the user code server couldn't. This could be due to any number of reasons depending on your deployment setup. As for the job failures - the executing jobs don't communicate with the user code server, so the UNAVAILABLE error isn't causing the job failures, but the cause of the UNAVAILABLE error could also be causing the jobs to fail.

You'll likely need to use the debugging tools for your particular deployment type to debug why the processes are failing

View full answer

jamiedemaria · 2023-11-07T17:01:24Z

jamiedemaria
Nov 7, 2023
Maintainer

Hi @ytoast . The dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach the user code server. gRPC Error code: UNAVAILABLE means that the process trying to connect to the user code server couldn't. This could be due to any number of reasons depending on your deployment setup. As for the job failures - the executing jobs don't communicate with the user code server, so the UNAVAILABLE error isn't causing the job failures, but the cause of the UNAVAILABLE error could also be causing the jobs to fail.

You'll likely need to use the debugging tools for your particular deployment type to debug why the processes are failing

1 reply

mlarose Aug 14, 2024
Collaborator

Also, see these comments below to drill down on cause and for possible solutions:

nadamakram · 2023-11-08T02:27:43Z

nadamakram
Nov 8, 2023

hi @jamiedemaria :)

I am facing the same issue on my EKS deployment.

I am using 2 deployments of user code, one with the default image "docker.io/dagster/dagster-celery-k8s" and one is mine. The default user code is Running pod and my one is CrashLoopBackOff pod.

The exact error message coming from the Dagit is grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.189.35:3030: Failed to connect to remote host: Connection refused" debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.189.35:3030: Failed to connect to remote host: Connection refused {grpc_status:14

the 172.20.189.35 ip is the service ip of my custom user code, I don’t think my issue is related to networking because it connects with the service of the default user code.

I tried to run my image on my local setup by docker run -it my-image /bin/sh and then try dagster dev and dagster api grpc --module-name my-module-name --host 0.0.0.0 --port 3030and it is working properly! and dagster api grpc-health-check -p 3030 gives gRPC connection successful

so what could be the problem? I can’t get useful info from kubectl describe

0 replies

pablo-statsig · 2023-12-14T19:10:45Z

pablo-statsig
Dec 14, 2023

We have also been running into this periodically, as far as I can tell the code deployments are fine the entire time so it feels weird that the webserver cant seem to connect to it.

0 replies

maxwelljoslyn · 2024-01-23T22:13:32Z

maxwelljoslyn
Jan 23, 2024

Checking in representing another team who has been having this problem: irregular, but frequent (several times per week) instances of the GRPC server becoming unreachable, with no other apparent problems. Leading to us having to not only restart Dagster but also go in and manually terminate jobs from the old instance (which Dagster will then "fail" to terminate, saying they're unreachable - but at least then they stop being erroneously blue on the asset graph.)

0 replies

ytoast · 2024-01-24T07:02:05Z

ytoast
Jan 24, 2024
Author

What kind of debugging tools would you suggest? @jamiedemaria I'm using a k8 deployment setup. I tried snooping around but can't find anything substantial

0 replies

tlarrue · 2024-03-19T14:52:21Z

tlarrue
Mar 19, 2024

@jamiedemaria My team is also experiencing this issue on our k8s deployment. We have around 40 assets actively materializing on 90s interval sensors, which we'd like to scale up to much more. In addition to intermittent gRPC errors, we are also finding the UI to be very slow or unresponsive at times. We are looking for ways to debug or tweak our setup for better stability. We are also trying to understand if there is an upper limit to the number of jobs a k8s Dagster deployment can handle. This is currently a barrier to our team adopting Dagster as a tool.

0 replies

kennycontreras · 2024-03-20T14:52:11Z

kennycontreras
Mar 20, 2024

Hello 👋 I'm here also representing my team, which is facing the same issue.

We have Dagster running on K8s with three pods. One is for the user code, one for dagit, and the last for the daemon/scheduler.

The gRPC error happens randomly, and our pipelines don't fail, but the connection sometimes gets lost when there's a new deployment, and the user code pod needs to be destroyed/created again. To "fix" this, we have to go to the UI and manually reload the workspace. As I said, this doesn't always happen, and considering that we have multiple deployments in a day, the probability is low, but either way, it would be nice to find a solution.

0 replies

mlarose · 2024-03-20T16:16:50Z

mlarose
Mar 20, 2024
Collaborator

Hello,

To expand on what was explained by @jamiedemaria, the DagsterUserCodeUnreachableError can be emitted by a variety of orchestration routines with the code location, from health checks to launching runs, obtaining assets information, scheduling, sensors, etc.

The underlying causes of this error could include your code location being unhealthy, resource constrained (cpu or memory exhaustion) and unable to answer within a specific timeouts, network issues or underlying infrastructure issues (node cycling, eviction, spot instance termination, etc).

Sometimes one misconfigured node, or malfunctioning essential daemonset on a specific node, can create recurring but not predictable problems at the networking level (dns resolution, setting iptables rules depending on how your Kubernetes is configured to implement ClusterIp services).

so what could be the problem? I can’t get useful info from kubectl describe

Top of my head:

kubectl get events would give you events for the entire cluster which might yield clues
kubectl get pod -o json could give you information about problems such as container restart, exit code and what node it's running on
kubectl logs to get logs from your container or the --previous that might not have been forwarded to dagster
kubectl exec - to get on the pod and examine it from within, or another pod and try to curl to the grpc endpoint of your code location to see if it's reachable, or if it can resolve the dns address of the service, which could point to other services in kubernetes such as core-dns being unhealthy.
kubectl port-forward is another way to test if the grpc endpoint is responding

What kind of debugging tools would you suggest?

https://k9scli.io/ is fairly user-friendly and perhaps would allow you to diagnose some category of issues.
some open source or commercial metrics solution to monitor your containers cpu and memory usage, to identify possible problems, like prometheus and grafana
log collector and aggregation can help understand incidents and problems across services, pods, nodes, etc.
py-spy to perhaps investigate what your code location is doing when/if it's not responsive
you could preload your code location image with a slew of linux process and network debugging tools (dnsutils, psutil, strace or tcpdump) but for some of these you would also need to add capabilities to your container (like CAP_SYS_ADMIN) or run it as a privileged container.

We are also trying to understand if there is an upper limit to the number of jobs a k8s Dagster deployment can handle. This is currently a barrier to our team adopting Dagster as a tool.

You can hit several soft limit as you scale your usage of Dagster on k8s. How much you compute resources you allocate to the various element of the dagster control plane, kubernetes, etc. How did you configure node scaling for your clusters? How big are your jobs (are they running for hours and consuming tons of cpu or memory) or are they lightweight. Many such soft limits are not Dagster specific but common to running large distributed workloads in general. What are the specific pain points you're encountering?

0 replies

the4thamigo-uk · 2024-03-20T21:04:12Z

the4thamigo-uk
Mar 20, 2024

This might be irrelevant, as this error can occur for a lot of reasons, but we have been struggling with this issue on GKE. In our scenario, we occasionally have spot instances that are moved to new GKE nodes. If this causes the dagster postgres instance to be moved, then the new postgres instance on the new node is delayed starting up again because the volume takes up to 5mins to copy. This seems to leave the code location in a broken state and it doesnt seem to recover, even when the postgres finally comes back online.

Its worth looking at the logs in the daemon process as well as the code location user deployment to see if you are seeing postgres connection errors as an underlying cause.

1 reply

mlarose Mar 21, 2024
Collaborator

Thanks for that contribution @the4thamigo-uk. That is a very specific example, but it illustrates a category of infrastructure deployment related problem that might cause this issue indeed and may give insight to others.

mlarose · 2024-08-14T14:06:27Z

mlarose
Aug 14, 2024
Collaborator

A recurring culprit of slow code location server start up is the parsing of dbt models to generate its manifest file. See more about this in: Loading dbt models from a dbt project.

To prevent this, ensure dbt parse pregenerates this manifest from CI/CD, either by including this command in your docker image or in your PEX package if you are using serverless with fast deployment enabled.

0 replies

shayashi-aam · 2024-10-24T13:34:56Z

shayashi-aam
Oct 24, 2024

Our team has been hit by this issue as well. We are just running "dagster dev" inside a docker container. The error message we see is slightly different from the one that posted by @nadamakram

dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/dagster/_grpc/client.py", line 657, in start_run
    res = self._query(
          ^^^^^^^^^^^^
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/dagster/_grpc/client.py", line 216, in _query
    self._raise_grpc_exception(
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/dagster/_grpc/client.py", line 199, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: unix:/tmp/tmpda756w01: connect: No such file or directory (2)"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-10-24T00:00:37.058736227+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: unix:/tmp/tmpda756w01: connect: No such file or directory (2)"}"
>
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/dagster/_grpc/client.py", line 214, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/dagster/_grpc/client.py", line 174, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda/2023/envs/pandas2/lib/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I see no spikes in cpu/memory/network usage when this occurs, and the problem disappears after an hour or so by itself.

We had to set "max_concurrent_runs: 1" to limit the concurrency as each of our asset runs in parallel using ray. I don't think this has anything to do with this issue. Do you recommend setting any special options for docker container (such as shm_size?)

0 replies

kennycontreras · 2024-10-26T14:30:23Z

kennycontreras
Oct 26, 2024

Besides the dbt manifest suggestion, has anyone been able to find a solution?

I have tried everything, from increasing the local_startup_timeout and reload_time in the dagster.yaml like the following:

code_servers:
  local_startup_timeout: 300
  reload_timeout: 300

to add env variables

DAGSTER_GRPC_TIMEOUT_SECONDS=300
DAGSTER_REPOSITORY_GRPC_TIMEOUT_SECONDS=300

Still, nothing seems to improve.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code location instability - gRPC Error code: UNAVAILABLE #17729

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 12 comments 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Code location instability - gRPC Error code: UNAVAILABLE #17729

Replies: 12 comments · 2 replies

jamiedemaria Nov 7, 2023 Maintainer

mlarose Aug 14, 2024 Collaborator

ytoast Jan 24, 2024 Author

mlarose Mar 20, 2024 Collaborator

mlarose Mar 21, 2024 Collaborator

mlarose Aug 14, 2024 Collaborator

Replies: 12 comments 2 replies

jamiedemaria
Nov 7, 2023
Maintainer

mlarose Aug 14, 2024
Collaborator

ytoast
Jan 24, 2024
Author

mlarose
Mar 20, 2024
Collaborator

mlarose Mar 21, 2024
Collaborator

mlarose
Aug 14, 2024
Collaborator