Add MimirIngesterStuckProcessingRecordsFromKafka alert
Signed-off-by: Marco Pracucci <[email protected]>
pracucci committed May 15, 2024
1 parent 307567a commit f3aea1d
Showing 5 changed files with 75 additions and 0 deletions.
16 changes: 16 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1427,6 +1427,22 @@ How to **investigate**:
- Check ingester logs to see why requests are failing, and troubleshoot based on that.
### MimirIngesterStuckProcessingRecordsFromKafka
This alert fires when an ingester has successfully fetched records from Kafka but is not processing them at all.
How it **works**:
- The ingester reads records from Kafka and processes them locally. Processing means unmarshalling the data and handling the write requests stored in the records.
- Fetched records, which contain write requests, are expected to be processed by ingesting the write request data into the ingester.
- This alert fires if no processing is occurring at all, for example because processing is stuck (e.g. a deadlock in the ingester).
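The condition described above can be sketched outside of PromQL as follows. This is a simplified illustration, not the alert's implementation: the function names and the sample layout are hypothetical, the `sum by (...)` aggregation is omitted, and the increase is computed directly from raw samples rather than with Prometheus' `rate()` extrapolation.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase of a counter across the samples, tolerating resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a restart), so count from zero.
        increase += (curr - prev) if curr >= prev else curr
    return increase


def is_stuck(records_total: Sequence[Sample], buffered_records: float) -> bool:
    """True when no record was processed in the window while records are buffered."""
    if len(records_total) < 2:
        # Too few samples to derive a rate; the PromQL expression would return no data.
        return False
    return counter_increase(records_total) == 0 and buffered_records > 0
```

Here `records_total` stands in for the samples of `cortex_ingest_storage_reader_records_total` over the 5m window, and `buffered_records` for the current value of the `cortex_ingest_storage_reader_buffered_fetch_records_total` gauge.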
How to **investigate**:
- Take a goroutine profile of the ingester and check whether any goroutine is calling `pushToStorage` and, if so, what its state is:
  - If the call exists and it's waiting on a lock, then there may be a deadlock.
  - If the call doesn't exist, then either processing is not stuck (false positive) or `pushToStorage` wasn't called at all, in which case you should investigate its callers in the code.
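As a rough aid for the first step, the sketch below scans a text goroutine dump for goroutines whose stack mentions `pushToStorage` and reports their state. It assumes the dump is in the format Go's pprof handler produces for `/debug/pprof/goroutine?debug=2`, and that the ingester exposes that endpoint on its HTTP port. The helper name `goroutines_calling` is hypothetical and not part of Mimir.

```python
import re
from typing import List, Tuple

# Matches headers such as "goroutine 42 [sync.Mutex.Lock, 7 minutes]:".
_HEADER = re.compile(r"^goroutine (\d+) .*?\[([^\]]*)\]:")


def goroutines_calling(dump: str, func_name: str = "pushToStorage") -> List[Tuple[int, str]]:
    """Return (goroutine id, state) for each goroutine whose stack mentions func_name."""
    matches = []
    # Goroutines are separated by blank lines in the debug=2 text format.
    for block in re.split(r"\n\s*\n", dump.strip()):
        lines = block.splitlines()
        header = _HEADER.match(lines[0]) if lines else None
        if header is None:
            continue  # Not a goroutine block; skip it.
        if any(func_name in line for line in lines[1:]):
            matches.append((int(header.group(1)), header.group(2)))
    return matches
```

A state that mentions a lock or semaphore wait and a long wait time (for example `sync.Mutex.Lock, 7 minutes`) points towards the deadlock case. An empty result corresponds to the second case above.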
### MimirIngesterFailsEnforceStrongConsistencyOnReadPath
This alert fires when too many read-requests with strong consistency are failing.
@@ -988,6 +988,19 @@ spec:
for: 5m
labels:
severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
annotations:
message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
expr: |
# Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
# NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
(sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
and
(sum by (cluster, namespace, pod) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
for: 5m
labels:
severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
annotations:
message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
13 changes: 13 additions & 0 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -962,6 +962,19 @@ groups:
for: 5m
labels:
severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
annotations:
message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
expr: |
# Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
(sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
and
# NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
(sum by (cluster, namespace, instance) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
for: 5m
labels:
severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
annotations:
message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
13 changes: 13 additions & 0 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -976,6 +976,19 @@ groups:
for: 5m
labels:
severity: critical
- alert: MimirIngesterStuckProcessingRecordsFromKafka
annotations:
message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is stuck processing write requests from Kafka.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterstuckprocessingrecordsfromkafka
expr: |
# Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
# NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
(sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
and
(sum by (cluster, namespace, pod) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
for: 5m
labels:
severity: critical
- alert: MimirIngesterFailsEnforceStrongConsistencyOnReadPath
annotations:
message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to enforce strong-consistency on read-path.
20 changes: 20 additions & 0 deletions operations/mimir-mixin/alerts/ingest-storage.libsonnet
@@ -96,6 +96,7 @@
},
},

// Alert firing if an ingester is failing to read from Kafka.
{
alert: $.alertName('IngesterFailsToProcessRecordsFromKafka'),
'for': '5m',
@@ -110,6 +111,25 @@
},
},

// Alert firing if an ingester is reading from Kafka, there are buffered records to process, but processing is stuck.
{
alert: $.alertName('IngesterStuckProcessingRecordsFromKafka'),
'for': '5m',
expr: |||
# Alert if the reader is not processing any records, but there are buffered records to process in the Kafka client.
(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
and
# NOTE: the cortex_ingest_storage_reader_buffered_fetch_records_total metric is a gauge showing the current number of buffered records.
(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
||| % $._config,
labels: {
severity: 'critical',
},
annotations: {
message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s is stuck processing write requests from Kafka.' % $._config,
},
},

{
alert: $.alertName('IngesterFailsEnforceStrongConsistencyOnReadPath'),
'for': '5m',
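The relationship between the `ingest-storage.libsonnet` template and the compiled `alerts.yaml` files can be illustrated with a small sketch. Jsonnet's `%` string formatting follows the same rules as Python's, so the substitution can be reproduced directly in Python. The config values below are assumptions inferred from the compiled outputs in this commit (`pod` for the Kubernetes variant, `instance` for baremetal), and the names `EXPR_TEMPLATE`, `KUBERNETES`, `BAREMETAL` and `render_expr` are hypothetical.

```python
# PromQL template copied from the libsonnet alert, with the comments omitted.
EXPR_TEMPLATE = """\
(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_records_total[5m])) == 0)
and
(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (cortex_ingest_storage_reader_buffered_fetch_records_total) > 0)
"""

# Assumed config values, inferred from the compiled alerts.yaml files.
KUBERNETES = {"alert_aggregation_labels": "cluster, namespace", "per_instance_label": "pod"}
BAREMETAL = {"alert_aggregation_labels": "cluster, namespace", "per_instance_label": "instance"}


def render_expr(config: dict) -> str:
    """Substitute the %(name)s placeholders, as `||| % $._config` does in Jsonnet."""
    return EXPR_TEMPLATE % config


if __name__ == "__main__":
    print(render_expr(KUBERNETES))
    print(render_expr(BAREMETAL))
```

Rendering with the two configs yields the `sum by (cluster, namespace, pod)` and `sum by (cluster, namespace, instance)` variants seen in the compiled files, which is why a single template change produces matching edits in each compiled output.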
