diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index 5cce349f12..314237fc97 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -767,17 +767,47 @@ The procedure to investigate it is the same as the one for [`MimirSchedulerQueri
 
 This alert fires if queries are piling up in the query-scheduler.
 
-The size of the queue is shown on the `Queue length` dashboard panel on the `Mimir / Reads` (for the standard query path) or `Mimir / Remote Ruler Reads`
+#### Dashboard Panels
+
+The size of the queue is shown on the `Queue Length` dashboard panel on the [`Mimir / Reads`](https://admin-ops-eu-south-0.grafana-ops.net/grafana/d/e327503188913dc38ad571c647eef643) (for the standard query path) or `Mimir / Remote Ruler Reads`
 (for the dedicated rule evaluation query path) dashboards.
 
-How it **works**:
+The `Latency (Time in Queue)` panel in the dashboard row below is broken out by the "Expected Query Component":
+the scheduler queue itself is partitioned by the expected query component of each query,
+which is an estimate, based on the query time range, of which component the querier will use to fetch the data
+(ingester, store-gateway, both, or unknown).
+
+The row below that shows peak values for `Query-scheduler <-> Querier Inflight Requests`, also broken out by query component.
+This shows when the queriers are saturated with inflight query requests,
+as well as which query components are being used to service the queries.
+
+#### How it Works
 
 - A query-frontend API endpoint is called to execute a query
 - The query-frontend enqueues the request to the query-scheduler
 - The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
-- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query
+- The querier fetches data from ingesters, store-gateways, or both, runs the query against that data,
+  sends the response back directly to the query-frontend, and notifies the query-scheduler that it can process another query
 
-How to **investigate**:
+#### How to Investigate
+
+Note that elevated measures of _inflight_ queries at any point in the read path are likely a symptom, not a cause.
+
+**Ingester or Store-Gateway Issues**
+
+With querier autoscaling in place, the most common cause of a query backlog is that either the ingesters or the store-gateways
+are not able to keep up with their query load.
+
+Investigate the RPS and Latency panels for ingesters and store-gateways on the `Mimir / Reads` dashboard,
+and compare them to the `Latency (Time in Queue)` or `Query-scheduler <-> Querier Inflight Requests`
+breakouts on the `Mimir / Reads` or `Mimir / Remote Ruler Reads` dashboard.
+Additionally, check the `Mimir / Reads Resources` dashboard for elevated resource utilization or resource limiting on ingesters or store-gateways.
+
+Generally, this should show that either the ingesters or the store-gateways are experiencing issues,
+and that query component can then be investigated further on its own.
+Scaling up queriers is unlikely to help in this case, as it only places more load on an already-overloaded component.
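+
+To spot-check these signals outside the dashboards, a query such as the following approximates the per-component
+read-path latency those panels are built on. This is only a sketch: the metric name and `route` values are
+assumptions based on a typical Mimir deployment and may need to be adjusted for your environment.
+
+```promql
+# p99 latency of ingester and store-gateway query requests, by route.
+histogram_quantile(
+  0.99,
+  sum by (le, route) (
+    rate(cortex_request_duration_seconds_bucket{
+      namespace="<namespace>",
+      route=~"/cortex.Ingester/QueryStream|/gatewaypb.StoreGateway/Series"
+    }[5m])
+  )
+)
+```
+
+If only one of the two components shows elevated latency, focus the investigation on that component.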
+
+**Querier Issues**
 
 - Are queriers in a crash loop (eg. OOMKilled)?
   - `OOMKilled`: temporarily increase queriers memory request/limit
@@ -791,15 +821,24 @@ How to **investigate**:
   - Check if a specific tenant is running heavy queries
     - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
     - If remote ruler evaluation is enabled, make sure you understand which one of the read paths (user or ruler queries?) is being affected - check the alert message.
-    - Check the `Mimir / Slow Queries` dashboard to find slow queries
+    - Check the [`Mimir / Slow Queries`](https://admin-ops-eu-south-0.grafana-ops.net/grafana/d/6089e1ce1e678788f46312a0a1e647e6) dashboard to find slow queries
   - On multi-tenant Mimir cluster with **shuffle-sharing for queriers disabled**, you may consider to enable it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
   - On multi-tenant Mimir cluster with **shuffle-sharding for queriers enabled**, you may consider to temporarily increase the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenant queries. Alternatively, you may choose to do nothing and let Mimir return errors for that given user once the per-tenant queue is full.
   - On multi-tenant Mimir clusters with **query-sharding enabled** and **more than a few tenants** being affected: The workload exceeds the available downstream capacity. Scaling of queriers and potentially store-gateways should be considered.
   - On multi-tenant Mimir clusters with **query-sharding enabled** and **only a single tenant** being affected:
-    - Verify if the particular queries are hitting edge cases, where query-sharding is not benefical, by getting traces from the `Mimir / Slow Queries` dashboard and then look where time is spent. If time is spent in the query-frontend running PromQL engine, then it means query-sharding is not beneficial for this tenant. Consider disabling query-sharding or reduce the shard count using the `query_sharding_total_shards` override.
+    - Verify whether the particular queries are hitting edge cases where query-sharding is not beneficial, by getting traces from the [`Mimir / Slow Queries`](https://admin-ops-eu-south-0.grafana-ops.net/grafana/d/6089e1ce1e678788f46312a0a1e647e6) dashboard and then looking at where time is spent. If time is spent in the query-frontend running the PromQL engine, query-sharding is not beneficial for this tenant. Consider disabling query-sharding or reducing the shard count using the `query_sharding_total_shards` override.
     - Otherwise and only if the queries by the tenant are within reason representing normal usage, consider scaling of queriers and potentially store-gateways.
   - On a Mimir cluster with **querier auto-scaling enabled** after checking the health of the existing querier replicas, check to see if the auto-scaler has added additional querier replicas or if the maximum number of querier replicas has been reached and is not sufficient and should be increased.
+
+**Query-Scheduler Issues**
+
+In rare cases, the query-scheduler itself may be the bottleneck.
+When querier-connection utilization is low in the `Query-scheduler <-> Querier Inflight Requests` dashboard panels
+but the queue length or latency is high, it indicates that the query-scheduler is very slow in dispatching queries.
+
+In this case, if the scheduler is not resource-constrained, you can use CPU profiles
+to see where the scheduler's query dispatch process is spending its time.
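+
+As a sketch of how to do that, assuming the query-scheduler exposes the standard Go pprof endpoints on its HTTP port
+(as Mimir components typically do), and using placeholder pod and namespace names:
+
+```bash
+# In one terminal: forward the query-scheduler's HTTP port (8080 in this example).
+kubectl --namespace <namespace> port-forward pod/<query-scheduler-pod> 8080:8080
+
+# In another terminal: capture a 30-second CPU profile and inspect it interactively;
+# `top` and `web` show where CPU time is spent while dispatching queries.
+go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
+```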
+
 ### MimirCacheRequestErrors
 
 This alert fires if the Mimir cache client is experiencing a high error rate for a specific cache and operation.