[Guide] Managing sensor timeouts #20021
In this guide, we'll discuss how to handle sensors that time out. Note: all suggestions apply to both Dagster Open Source (OSS) and Dagster Plus unless specifically stated.

The default sensor timeout is 60 seconds, and sensors can hit that limit for a myriad of reasons. Most of the time, timeouts are the result of poorly optimized business logic or network issues, but occasionally the business logic genuinely requires long evaluation cycles. For these cases, we provide a mechanism to increase the default sensor timeout in both OSS and Plus. However, we consider changing the sensor timeout a last resort; the other options in this guide should be considered before attempting to change it.

## Finding the root of the timeout

The first step in addressing sensor timeouts is to understand why they are occurring. To do this, we recommend profiling the sensor evaluation to find its hot paths. Often, hot paths manifest as hanging calls to external APIs or as poorly optimized business logic. In the case of hanging external API calls, timing out is the appropriate failure mode; one might even consider making the sensor timeout shorter to improve resilience to these types of transient API errors.

## Breaking up evaluations

If, after understanding the hot paths in your code, you don't have hanging external API calls and there's no clear place for optimization, the next step is to consider breaking the expensive sensor evaluation into multiple steps. This can be done using sensor cursors; see our documentation for more info, and the sketch below for an illustration.
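Below is a minimal sketch of a cursor-based sensor. The `list_updated_records` helper, the `my_job` job, and the idea that the external system can be paged with a cursor are all assumptions for illustration; substitute your own evaluation logic.

```python
from dagster import RunRequest, SensorEvaluationContext, sensor

# Hypothetical imports for illustration only.
from my_project.api import list_updated_records  # assumed to return (records, new_cursor)
from my_project.jobs import my_job


@sensor(job=my_job)
def chunked_sensor(context: SensorEvaluationContext):
    # Resume from where the previous tick left off instead of scanning
    # everything in a single, potentially >60s, evaluation.
    last_cursor = context.cursor or ""

    # Process only a bounded chunk of work per tick.
    records, new_cursor = list_updated_records(since=last_cursor, limit=100)

    for record in records:
        # run_key deduplicates runs if the same record is seen again.
        yield RunRequest(run_key=str(record.id))

    # Persist progress so the next tick picks up the remaining work.
    context.update_cursor(new_cursor)
```

Each tick now does a fixed amount of work, so evaluation time stays well under the timeout even if the backlog is large.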
## Using the right tool for the job (no pun intended)

The above advice is well and good if evaluations take slightly over 60 seconds, but sensor evaluations may need to take much longer than that. Generally, if evaluations require more than 3 minutes to complete, a sensor may not be the correct tool for your needs. Sensors are best used for frequently evaluating small amounts of data to kick off real work down the line. If your sensors are running well beyond the expected limit, a scheduled job that sends run requests to kick off other work is better suited to your task; a sketch of this pattern follows.
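Here is a minimal sketch of that pattern, assuming a hypothetical `find_work_to_do` helper, a downstream job named `downstream_job`, and a webserver reachable at `localhost:3000`; the run config shape is also made up. It relies on the `dagster-graphql` client package.

```python
from dagster import OpExecutionContext, ScheduleDefinition, job, op
from dagster_graphql import DagsterGraphQLClient

# Hypothetical helper: the expensive evaluation that used to live in the sensor body.
from my_project.api import find_work_to_do


@op
def evaluate_and_launch(context: OpExecutionContext):
    # This op runs as a normal Dagster run, so it is not subject to the
    # sensor evaluation timeout and can take as long as it needs.
    work_items = find_work_to_do()

    # Submit a run for each piece of work via the webserver's GraphQL API.
    client = DagsterGraphQLClient("localhost", port_number=3000)
    for item in work_items:
        run_id = client.submit_job_execution(
            "downstream_job",
            run_config={"ops": {"process": {"config": {"item_id": item.id}}}},
        )
        context.log.info(f"Launched run {run_id} for item {item.id}")


@job
def evaluation_job():
    evaluate_and_launch()


# Run the expensive evaluation on a fixed cadence instead of in a sensor.
evaluation_schedule = ScheduleDefinition(job=evaluation_job, cron_schedule="*/10 * * * *")
```

Because the evaluation happens inside a run rather than inside the sensor daemon, its duration only affects that run.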
## Increasing the timeout

At this point in the guide, you should have identified that:

- there are no hanging external API calls and no obvious optimizations left in your evaluation logic,
- the evaluation cannot reasonably be broken up across ticks with a cursor, and
- a sensor, rather than a schedule, is still the right tool for the job.

In this case, it's time to consider increasing the sensor timeout.

### Dagster Open Source (OSS)

In Dagster OSS, the timeout can be increased via a setting on the Dagster daemon. The number of workers available on the code server can be configured in your instance's `dagster.yaml`:

```yaml
sensors:
  use_threads: true
  num_workers: 8
```

### Dagster Plus

The sensor timeout can currently be changed as a deployment setting. Set the corresponding value in your deployment settings.

### Hybrid deployments

Hybrid deployments have a few additional options and considerations. When you increase the timeout of your sensors, you are potentially placing more load on your user code deployment. In Hybrid deployments, it's important to be aware of the health of your system and the mitigations available to you if increasing the timeout leads to any instability.

Sensor requests are generated in Dagster's hosted cloud and sent to the agent, which forwards them to the code server running a particular sensor. As sensor throughput increases, there's the potential for bottlenecks anywhere in this flow. Each bottleneck is limited by three things:

1. the maximum number of concurrent requests the agent can serve,
2. the maximum number of concurrent requests the code server can serve, and
3. the CPU and memory available to the underlying container.
Bumping up against (1) or (2) usually manifests as increased lag between sensor evaluations. Bumping up against (3) can lead to the underlying container being killed and restarted. Monitoring the CPU and memory of the agent and code servers should be done using the underlying deployment service (Kubernetes, ECS, etc.) and is out of the scope of this guide. If you find that increasing the sensor timeout leads to issues, take the necessary action to increase your resource limits or request concurrency; the sections below describe how to monitor and raise request throughput.
### Exposing agent and code server metrics in Dagster Plus Hybrid

Let's turn our attention to the maximum request throughput of the agent and code server. With Dagster 1.6.4 and later, the agent and code servers can report utilization metrics. It is important to make sure that both your agent and your code server are running version 1.6.4 or greater: the metrics collection feature is not backwards compatible with older versions of Dagster, and you could run into errors if both are not upgraded.

To check the version of an agent, go to the Deployment > Agents tab.
To check the version of a code server, go to the Deployment > Code locations tab.
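You can also check the versions from inside the agent or code server environment. Below is a minimal sketch using the Python standard library; it assumes the standard `dagster` and `dagster-cloud` distributions are installed in the image being checked:

```python
from importlib.metadata import version

# Both the agent and the code server need dagster >= 1.6.4 for metrics
# collection to work end to end.
print("dagster:", version("dagster"))
print("dagster-cloud:", version("dagster-cloud"))
```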
After verifying that your agent and code server versions meet the requirements, you can enable throughput logging in the agent's configuration.

Once enabled, you should see metrics tracking throughput in your agent logs, not only for the agent itself but also for each code server. For the agent, the log line looks like this:

```
dagster_cloud_agent - Current agent threadpool utilization: 0/20 threads
```

For the code server, the request utilization information must be parsed out of a larger JSON blob of information:

```
user_code_launcher - Updated code server metrics for location sensor_test_87 in deployment prod: {'container_utilization': {'num_allocated_cores': 2, 'cpu_usage': 320.930756512, 'cpu_cfs_quota_us': -1.0, 'cpu_cfs_period_us': 100000.0, 'memory_usage': 150790144, 'memory_limit': 9223372036854771712, 'measurement_timestamp': 1708649283.031707, 'previous_cpu_usage': 320.823456119, 'previous_measurement_timestamp': 1708649221.99951}, 'request_utilization': {'max_concurrent_requests': 50, 'num_running_requests': 1, 'num_queued_requests': 0}, 'per_api_metrics': {'Ping': {'current_request_count': 1}}}
```

Many of the fields in this blob can be ignored. The important fields from the perspective of request throughput are the ones under `request_utilization` (a parsing sketch follows the list):

- `max_concurrent_requests`: the maximum number of requests the code server can serve at once
- `num_running_requests`: the number of requests currently being served
- `num_queued_requests`: the number of requests waiting for a free worker
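If you want to alert on utilization rather than eyeball the logs, here is a minimal sketch that pulls the `request_utilization` block out of a metrics log line. The log format it expects and the 80% threshold are assumptions, not a documented interface:

```python
import ast
from typing import Optional


def request_utilization_from_log_line(line: str) -> Optional[float]:
    """Return running/max request utilization from a code server metrics line.

    Assumes the line ends with a Python-dict-style metrics blob, as in the
    example above; returns None for non-metrics lines.
    """
    if "Updated code server metrics" not in line:
        return None

    # The metrics blob is the dict literal at the end of the line.
    metrics = ast.literal_eval(line[line.index("{"):])
    req = metrics["request_utilization"]
    return req["num_running_requests"] / req["max_concurrent_requests"]


# Example: warn when a code server is using most of its request capacity.
example_line = (
    "user_code_launcher - Updated code server metrics for location sensor_test_87 "
    "in deployment prod: {'request_utilization': {'max_concurrent_requests': 50, "
    "'num_running_requests': 45, 'num_queued_requests': 3}}"
)
utilization = request_utilization_from_log_line(example_line)
if utilization is not None and utilization > 0.8:
    print(f"Code server request utilization is high: {utilization:.0%}")
```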
### Increasing the agent's maximum concurrent requests

It may turn out that you are reaching full utilization of the number of requests your agent can serve. The default number of workers depends on the number of cores available to your container; this number can be increased via the agent's configuration.

### Increasing the maximum concurrent requests for the code server

The default number of workers for a code server likewise depends on the number of cores available to its container. However, the number of concurrent requests can also be increased for code servers, via the code server's configuration.
Replies: 2 comments 3 replies
Answer in description
@dpeng817, thanks for the guide. I can't find instructions on how to change deployment settings (i.e., the sensor timeout).