[Guide] Managing sensor timeouts #20021
In this guide, we'll discuss how to handle sensors that time out. Note: all suggestions apply to both Dagster Open Source (OSS) and Dagster Plus unless specifically stated.

The default sensor timeout is 60 seconds, and sensors can hit that limit for a myriad of reasons. Most of the time, timeouts are the result of poorly optimized business logic or network issues, but occasionally the business logic genuinely requires long evaluation cycles. For these cases, we provide a mechanism to increase the default sensor timeout in both OSS and Plus. However, we consider changing the sensor timeout a last resort; the other options in this guide should be considered before attempting to change it.

## Finding the root of the timeout

The first step in addressing sensor timeouts is to understand why they are occurring. To do this, we recommend profiling the sensor evaluation to find its hot paths. Often, hot paths manifest as hanging calls to external APIs or as poorly optimized business logic. In the case of hanging external API calls, timing out is the appropriate failure mode; one might even consider making the sensor timeout shorter to improve resilience to these types of transient API errors.

## Breaking up evaluations

If, after understanding the hot paths in your code, you don't have hanging external API calls and there's no clear place for optimization, the next step is to consider breaking the expensive sensor evaluation into multiple steps. This can be done using sensor cursors; see our documentation for more info, and the sketch below for an illustration.
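Below is a minimal sketch of a cursor-based sensor. The `list_updated_records` helper, the `my_job` job, and the idea that the external system can be paged with a cursor are all assumptions for illustration; substitute your own evaluation logic.

```python
from dagster import RunRequest, SensorEvaluationContext, sensor

# Hypothetical imports for illustration only.
from my_project.api import list_updated_records  # assumed to return (records, new_cursor)
from my_project.jobs import my_job


@sensor(job=my_job)
def chunked_sensor(context: SensorEvaluationContext):
    # Resume from where the previous tick left off instead of scanning
    # everything in a single, potentially >60s, evaluation.
    last_cursor = context.cursor or ""

    # Process only a bounded chunk of work per tick.
    records, new_cursor = list_updated_records(since=last_cursor, limit=100)

    for record in records:
        # run_key deduplicates runs if the same record is seen again.
        yield RunRequest(run_key=str(record.id))

    # Persist progress so the next tick picks up the remaining work.
    context.update_cursor(new_cursor)
```

Each tick now does a fixed amount of work, so evaluation time stays well under the timeout even if the backlog is large.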
## Using the right tool for the job (no pun intended)

The above advice is well and good if evaluations take slightly over 60 seconds, but sensor evaluations may need to take much longer than that. Generally, if evaluations require more than 3 minutes to complete, a sensor may not be the correct tool for your needs. Sensors are best used for frequently evaluating small amounts of data to kick off real work down the line. If your sensors are running well beyond the expected limit, a scheduled job that sends run requests to kick off other work is better suited to your task; a sketch of this pattern follows.
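Here is a minimal sketch of that pattern, assuming a hypothetical `find_work_to_do` helper, a downstream job named `downstream_job`, and a webserver reachable at `localhost:3000`; the run config shape is also made up. It relies on the `dagster-graphql` client package.

```python
from dagster import OpExecutionContext, ScheduleDefinition, job, op
from dagster_graphql import DagsterGraphQLClient

# Hypothetical helper: the expensive evaluation that used to live in the sensor body.
from my_project.api import find_work_to_do


@op
def evaluate_and_launch(context: OpExecutionContext):
    # This op runs as a normal Dagster run, so it is not subject to the
    # sensor evaluation timeout and can take as long as it needs.
    work_items = find_work_to_do()

    # Submit a run for each piece of work via the webserver's GraphQL API.
    client = DagsterGraphQLClient("localhost", port_number=3000)
    for item in work_items:
        run_id = client.submit_job_execution(
            "downstream_job",
            run_config={"ops": {"process": {"config": {"item_id": item.id}}}},
        )
        context.log.info(f"Launched run {run_id} for item {item.id}")


@job
def evaluation_job():
    evaluate_and_launch()


# Run the expensive evaluation on a fixed cadence instead of in a sensor.
evaluation_schedule = ScheduleDefinition(job=evaluation_job, cron_schedule="*/10 * * * *")
```

Because the evaluation happens inside a run rather than inside the sensor daemon, its duration only affects that run.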
## Increasing the timeout

At this point in the guide, you should have identified that:

- there are no hanging external API calls and no obvious optimizations left in your evaluation logic,
- the evaluation cannot reasonably be broken up across ticks with a cursor, and
- a sensor, rather than a schedule, is still the right tool for the job.

In this case, it's time to consider increasing the sensor timeout.

### Dagster Open Source (OSS)

In Dagster OSS, the timeout can be increased via a setting on the Dagster daemon. The number of workers available on the code server can be configured in your instance's `dagster.yaml`:

```yaml
sensors:
  use_threads: true
  num_workers: 8
```

### Dagster Plus

The sensor timeout can currently be changed as a deployment setting. Set the corresponding value in your deployment settings.

### Hybrid deployments

Hybrid deployments have a few additional options and considerations. When you increase the timeout of your sensors, you are potentially placing more load on your user code deployment. In Hybrid deployments, it's important to be aware of the health of your system and the mitigations available to you if increasing the timeout leads to any instability.

Sensor requests are generated in Dagster's hosted cloud and sent to the agent, which forwards them to the code server running a particular sensor. As sensor throughput increases, there's the potential for bottlenecks anywhere in this flow. Each bottleneck is limited by three things:

1. the maximum number of concurrent requests the agent can serve,
2. the maximum number of concurrent requests the code server can serve, and
3. the CPU and memory available to the underlying container.
Bumping up against (1) or (2) usually manifests as increased lag between sensor evaluations. Bumping up against (3) can lead to the underlying container being killed and restarted. Monitoring the CPU and memory of the agent and code servers should be done using the underlying deployment service (Kubernetes, ECS, etc.) and is out of the scope of this guide. If you find that increasing the sensor timeout leads to issues, take the necessary action to increase your resource limits or request concurrency; the sections below describe how to monitor and raise request throughput.
### Exposing agent and code server metrics in Dagster Plus Hybrid

Let's turn our attention to the maximum request throughput of the agent and code server. With Dagster 1.6.4 and later, the agent and code servers can report utilization metrics. It is important to make sure that both your agent and your code server are running version 1.6.4 or greater: the metrics collection feature is not backwards compatible with older versions of Dagster, and you could run into errors if both are not upgraded.

To check the version of an agent, go to the Deployment > Agents tab.
To check the version of a code server, go to the Deployment > Code locations tab.
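You can also check the versions from inside the agent or code server environment. Below is a minimal sketch using the Python standard library; it assumes the standard `dagster` and `dagster-cloud` distributions are installed in the image being checked:

```python
from importlib.metadata import version

# Both the agent and the code server need dagster >= 1.6.4 for metrics
# collection to work end to end.
print("dagster:", version("dagster"))
print("dagster-cloud:", version("dagster-cloud"))
```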
After verifying that your agent and code server versions meet the requirements, you can enable throughput logging in the agent's configuration.

Once enabled, you should see metrics tracking throughput in your agent logs, not only for the agent itself but also for each code server. For the agent, the log line looks like this:

```
dagster_cloud_agent - Current agent threadpool utilization: 0/20 threads
```

For the code server, the request utilization information must be parsed out of a larger JSON blob of information:

```
user_code_launcher - Updated code server metrics for location sensor_test_87 in deployment prod: {'container_utilization': {'num_allocated_cores': 2, 'cpu_usage': 320.930756512, 'cpu_cfs_quota_us': -1.0, 'cpu_cfs_period_us': 100000.0, 'memory_usage': 150790144, 'memory_limit': 9223372036854771712, 'measurement_timestamp': 1708649283.031707, 'previous_cpu_usage': 320.823456119, 'previous_measurement_timestamp': 1708649221.99951}, 'request_utilization': {'max_concurrent_requests': 50, 'num_running_requests': 1, 'num_queued_requests': 0}, 'per_api_metrics': {'Ping': {'current_request_count': 1}}}
```

Many of the fields in this blob can be ignored. The important fields from the perspective of request throughput are the ones under `request_utilization` (a parsing sketch follows the list):

- `max_concurrent_requests`: the maximum number of requests the code server can serve at once
- `num_running_requests`: the number of requests currently being served
- `num_queued_requests`: the number of requests waiting for a free worker
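If you want to alert on utilization rather than eyeball the logs, here is a minimal sketch that pulls the `request_utilization` block out of a metrics log line. The log format it expects and the 80% threshold are assumptions, not a documented interface:

```python
import ast
from typing import Optional


def request_utilization_from_log_line(line: str) -> Optional[float]:
    """Return running/max request utilization from a code server metrics line.

    Assumes the line ends with a Python-dict-style metrics blob, as in the
    example above; returns None for non-metrics lines.
    """
    if "Updated code server metrics" not in line:
        return None

    # The metrics blob is the dict literal at the end of the line.
    metrics = ast.literal_eval(line[line.index("{"):])
    req = metrics["request_utilization"]
    return req["num_running_requests"] / req["max_concurrent_requests"]


# Example: warn when a code server is using most of its request capacity.
example_line = (
    "user_code_launcher - Updated code server metrics for location sensor_test_87 "
    "in deployment prod: {'request_utilization': {'max_concurrent_requests': 50, "
    "'num_running_requests': 45, 'num_queued_requests': 3}}"
)
utilization = request_utilization_from_log_line(example_line)
if utilization is not None and utilization > 0.8:
    print(f"Code server request utilization is high: {utilization:.0%}")
```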
### Increasing the agent's maximum concurrent requests

It may turn out that you are reaching full utilization of the number of requests your agent can serve. The default number of workers depends on the number of cores available to your container; this number can be increased via the agent's configuration.

### Increasing the maximum concurrent requests for the code server

The default number of workers for a code server likewise depends on the number of cores available to its container. However, the number of concurrent requests can also be increased for code servers, via the code server's configuration.
Replies: 2 comments 3 replies
Answer in description
@dpeng817, thanks for the guide. I can't find instructions on how to change deployment settings (i.e., the sensor timeout).