From 043aa3dd9dd9b2bc131e43fa65241071c614ae1e Mon Sep 17 00:00:00 2001
From: Bryan Boreham
Date: Wed, 25 Sep 2024 15:25:02 +0100
Subject: [PATCH 1/3] Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries

---
 docs/sources/mimir/manage/mimir-runbooks/_index.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index 1b0ea26423f..c969e7d1598 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -41,7 +41,15 @@ If nothing obvious from the above, check for increased load:
 
 ### MimirIngesterReachingSeriesLimit
 
-This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed.
+This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.
+The threshold is set at 80%, to give some chance to react before the limit is reached.
+Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.
+
+Note that the error responses sent back to the sender are classed as "server error" (5xx), which should result in a retry by the sender.
+While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.
+If the condition is cleared in a short time, service can be restored with no data loss.
+
+This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.
 
 In case of **emergency**:
 

From 173f39ed7e2f8cb8d5692684b199ab29447f5475 Mon Sep 17 00:00:00 2001
From: Bryan Boreham
Date: Wed, 6 Nov 2024 16:02:50 +0000
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
---
 docs/sources/mimir/manage/mimir-runbooks/_index.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index c969e7d1598..daa4ad952e0 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -42,14 +42,14 @@ If nothing obvious from the above, check for increased load:
 ### MimirIngesterReachingSeriesLimit
 
 This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.
-The threshold is set at 80%, to give some chance to react before the limit is reached.
-Once the limit is reached, writes to the ingester will fail for new series. Appending samples to existing ones will continue to succeed.
+The threshold is set at 80% to give the chance to react before the limit is reached.
+After the limit is reached, writes to the ingester fail for new series. Appending samples to existing ones continue to succeed.
 
 Note that the error responses sent back to the sender are classed as "server error" (5xx), which should result in a retry by the sender.
-While this situation continues, these retries will stall the flow of data, and newer data will queue up on the sender.
+While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
 If the condition is cleared in a short time, service can be restored with no data loss.
 
-This is different to what happens when the `max_global_series_per_user` is exceeded, which is considered a "client error" (4xx) where excess data is discarded.
+This is different to what happens when the `max_global_series_per_user` limit is exceeded, which is considered a "client error" (4xx). In this case, excess data is discarded.
 
 In case of **emergency**:
 

From ae6519c33e56a628f89ad2b8307dc2c143fe09bb Mon Sep 17 00:00:00 2001
From: Bryan Boreham
Date: Wed, 6 Nov 2024 16:06:59 +0000
Subject: [PATCH 3/3] More review feedback

---
 docs/sources/mimir/manage/mimir-runbooks/_index.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index daa4ad952e0..53f389bc8b3 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -41,11 +41,11 @@ If nothing obvious from the above, check for increased load:
 
 ### MimirIngesterReachingSeriesLimit
 
-This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit.
+This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is close to reaching the limit.
 The threshold is set at 80% to give the chance to react before the limit is reached.
-After the limit is reached, writes to the ingester fail for new series. Appending samples to existing ones continue to succeed.
+After the limit is reached, write requests to the ingester fail for new series. Appending samples to existing ones continue to succeed.
 
-Note that the error responses sent back to the sender are classed as "server error" (5xx), which should result in a retry by the sender.
+Note that the error responses sent back to the sender are classified as "server errors" (5xx), which should result in a retry by the sender.
 While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
 If the condition is cleared in a short time, service can be restored with no data loss.
 
@@ -131,7 +131,7 @@ How to **fix** it:
 
 ### MimirIngesterReachingTenantsLimit
 
-This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.
+This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, write requests to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.
 
 The per-tenant memory utilisation in ingesters includes the overhead of allocations for TSDB stripes and chunk writer buffers. If the tenant number is high, this may contribute significantly to the total ingester memory utilization. The size of these allocations is controlled by `-blocks-storage.tsdb.stripe-size` (default 16KiB) and `-blocks-storage.tsdb.head-chunks-write-buffer-size-bytes` (default 4MiB), respectively.
 