Add MimirIngesterStuckProcessingRecordsFromKafka alert (#8147)
* Add MimirIngesterStuckProcessingRecordsFromKafka alert

Signed-off-by: Marco Pracucci <[email protected]>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

* Apply suggestions from code review

---------

Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
pracucci and pstibrany authored May 21, 2024
1 parent ff93a70 commit 7b811fd
Showing 6 changed files with 76 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -75,6 +75,7 @@
* [CHANGE] Alerts: Removed obsolete `MimirQueriesIncorrect` alert that used test-exporter metrics. Test-exporter support was however removed in Mimir 2.0 release. #7774
* [CHANGE] Alerts: Change threshold for `MimirBucketIndexNotUpdated` alert to fire before queries begin to fail due to bucket index age. #7879
* [FEATURE] Dashboards: added 'Remote ruler reads networking' dashboard. #7751
* [FEATURE] Alerts: Add `MimirIngesterStuckProcessingRecordsFromKafka` alert. #8147
* [ENHANCEMENT] Alerts: allow configuring alerts range interval via `_config.base_alerts_range_interval_minutes`. #7591
* [ENHANCEMENT] Dashboards: Add panels for monitoring distributor and ingester when using ingest-storage. These panels are disabled by default, but can be enabled using `show_ingest_storage_panels: true` config option. Similarly existing panels used when distributors and ingesters use gRPC for forwarding requests can be disabled by setting `show_grpc_ingestion_panels: false`. #7670 #7699
* [ENHANCEMENT] Alerts: add the following alerts when using ingest-storage: #7699 #7702
16 changes: 16 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1427,6 +1427,22 @@ How to **investigate**:
- Check ingester logs to see why requests are failing, and troubleshoot based on that.
### MimirIngesterStuckProcessingRecordsFromKafka
This alert fires when an ingester has successfully fetched records from Kafka but is not processing them at all.
How it **works**:
- The ingester reads records from Kafka and processes them locally. Processing means unmarshalling the data and handling the write requests stored in the records.
- Fetched records contain write requests, and the ingester is expected to process them by ingesting the write request data.
- This alert fires if no processing is occurring at all, for example because processing is stuck (e.g. a deadlock in the ingester).
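
As a rough illustration of the condition described above, the following Python sketch mirrors the two halves of the alert: a zero processing rate combined with a non-empty buffer. The function names `processing_rate` and `is_stuck` are hypothetical, and the rate calculation is a simplification of PromQL's `rate()`. The real alert additionally requires the condition to hold for 5 minutes.

```python
def processing_rate(counter_start, counter_end, window_seconds):
    """Simplified per-second increase of a counter over a window.

    Unlike PromQL's rate(), this uses only two samples and no extrapolation.
    A decrease is treated as a counter reset, so only the end value is counted.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if counter_end >= counter_start:
        delta = counter_end - counter_start
    else:
        delta = counter_end
    return delta / window_seconds


def is_stuck(processed_per_second, buffered_records):
    """Return True when nothing is processed while records are still buffered."""
    return processed_per_second == 0 and buffered_records > 0
```

An idle ingester with an empty buffer does not match, because the second condition excludes the case where there is simply nothing to process.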
How to **investigate**:
- Take a goroutine profile of the ingester and check whether any goroutine is calling `pushToStorage`:
  - If the call exists and it's waiting on a lock, then there may be a deadlock.
  - If the call doesn't exist, then either processing is not stuck (false positive) or `pushToStorage` wasn't called at all, in which case you should investigate its callers in the code.
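
The check above can be partly automated. The sketch below is a minimal Python helper, not part of Mimir, that scans a text goroutine dump for stacks containing `pushToStorage`. It assumes the dump is in the format produced by Go's `net/http/pprof` goroutine profile with `debug=2` (one `goroutine N [state]:` header per stack, stacks separated by blank lines). The helper name `find_push_goroutines` and the `LOCK_STATES` list are assumptions, and the wait reasons printed by the Go runtime may differ between Go versions.

```python
import re

# Wait reasons that suggest a goroutine is blocked on a lock.
# This list is an assumption and may vary across Go versions.
LOCK_STATES = ("semacquire", "sync.Mutex.Lock", "sync.RWMutex.Lock", "sync.RWMutex.RLock")

# Matches a stack header such as "goroutine 42 [semacquire, 6 minutes]:".
HEADER = re.compile(r"^goroutine (\d+) \[([^\]]+)\]:$")


def find_push_goroutines(dump, func_name="pushToStorage"):
    """Return (goroutine id, state, waiting_on_lock) for stacks that mention func_name."""
    results = []
    for block in dump.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = HEADER.match(lines[0])
        if not match:
            continue
        # Only keep stacks with at least one frame that mentions the function.
        if not any(func_name in line for line in lines[1:]):
            continue
        state = match.group(2)
        waiting = any(state.startswith(prefix) for prefix in LOCK_STATES)
        results.append((int(match.group(1)), state, waiting))
    return results
```

An empty result corresponds to the second case above (no `pushToStorage` call found), while an entry flagged as waiting on a lock for several minutes is a candidate for the deadlock case and still needs manual inspection of the full stack.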
### MimirIngesterFailsEnforceStrongConsistencyOnReadPath
This alert fires when too many read-requests with strong consistency are failing.
@@ -988,6 +988,19 @@ spec:
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
  annotations:
    message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
    runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
  expr: |
    # Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
    # NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
    (sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
    and
    (sum by (cluster, namespace, pod) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
  annotations:
    message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
13 changes: 13 additions & 0 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -962,6 +962,19 @@ groups:
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
  annotations:
    message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
    runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
  expr: |
    # Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
    (sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
    and
    # NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
    (sum by (cluster, namespace, instance) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
  annotations:
    message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
13 changes: 13 additions & 0 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -976,6 +976,19 @@ groups:
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
  annotations:
    message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
    runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
  expr: |
    # Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
    # NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
    (sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
    and
    (sum by (cluster, namespace, pod) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
  for: 5m
  labels:
    severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
  annotations:
    message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
20 changes: 20 additions & 0 deletions operations/mimir-mixin/alerts/ingest-storage.libsonnet
@@ -96,6 +96,7 @@
    },
  },

  // Alert firing if an ingester is failing to read from Kafka.
  {
    alert: $.alertName('IngesterFailsToProcessRecordsFromKafka'),
    'for': '5m',
@@ -110,6 +111,25 @@
    },
  },

  // Alert firing if an ingester is reading from Kafka, there are buffered records to process, but processing is stuck.
  {
    alert: $.alertName('IngesterStuckProcessingRecordsFromKafka'),
    'for': '5m',
    expr: |||
      # Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
      (sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
      and
      # NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
      (sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
    ||| % $._config,
    labels: {
      severity: 'critical',
    },
    annotations: {
      message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s is stuck processing write requests from Kafka.' % $._config,
    },
  },

  {
    alert: $.alertName('IngesterFailsEnforceStrongConsistencyOnReadPath'),
    'for': '5m',