blockbuilder: Basic alerts (#9723)
* mimir-mixin: basic alerting for block-builder

Signed-off-by: Vladimir Varankin <[email protected]>

* runbook

Signed-off-by: Vladimir Varankin <[email protected]>

* rebuild assets

Signed-off-by: Vladimir Varankin <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Marco Pracucci <[email protected]>

* per-instance alerting

Signed-off-by: Vladimir Varankin <[email protected]>

* rebuild assets

Signed-off-by: Vladimir Varankin <[email protected]>

* Apply suggestions from code review

Co-authored-by: Taylor C <[email protected]>

* add MimirBlockBuilderLaging

Signed-off-by: Vladimir Varankin <[email protected]>

* fixup! rebuild assets

* improve MimirBlockBuilderLagging

Signed-off-by: Vladimir Varankin <[email protected]>

* fixup! rebuild assets

---------

Signed-off-by: Vladimir Varankin <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
Co-authored-by: Taylor C <[email protected]>
3 people authored Oct 30, 2024
1 parent eda1a4b commit ad2ecd3
Showing 5 changed files with 169 additions and 0 deletions.
41 changes: 41 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1611,6 +1611,47 @@ How to **fix**:
1. Once ingesters are stable, revert the temporary config applied in the previous step.
### MimirBlockBuilderNoCycleProcessing
This alert fires when the block-builder stops reporting any processed cycles for an unexpectedly long time.
How it **works**:
- The block-builder periodically consumes a portion of the backlog from a Kafka partition and processes the consumed data into TSDB blocks. The block-builder calls these periods "cycles".
- If the block-builder doesn't process any cycles for an extended period of time, this could indicate that a block-builder instance is stuck and cannot complete cycle processing.
How to **investigate**:
- Check the block-builder logs to see what its pods have been busy with. The block-builder logs the `start consuming` and `done consuming` messages, which mark per-partition consume cycles. These log records include details about the cycle, such as the Kafka topic's offsets. Troubleshoot based on that.
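As a starting point for the investigation, the alert expression without its `== 0` comparison shows how many cycles each instance completed in the past hour. This is a sketch that assumes the Kubernetes mixin labels (`cluster`, `namespace`, `pod`); baremetal deployments would aggregate by `instance` instead.

```promql
# Consume cycles completed per block-builder pod over the past hour.
# A pod that returns 0 here is one the alert would fire for.
max by (cluster, namespace, pod) (
  histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))
)
```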
### MimirBlockBuilderLagging
This alert fires when the block-builder instances report a large number of unprocessed records in the Kafka partitions.
How it **works**:
- When the block-builder starts a new consume cycle, it checks how many records the Kafka partition has in the backlog. This number is tracked in the `cortex_blockbuilder_consumer_lag_records` metric.
- The block-builder must consume and process these records into TSDB blocks.
- At the end of the processing, the block-builder commits the offset of the last fully processed record into Kafka.
- If the block-builder reports high lag values, this could indicate that a block-builder instance cannot fully process and commit Kafka records.
How to **investigate**:
- Check if the per-partition lag, reported by the `cortex_blockbuilder_consumer_lag_records` metric, has been growing over the past hours.
- Explore the block-builder logs for any errors reported while it processed the partition.
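One way to check whether the lag is growing is to compare the current value against its value a few hours earlier. The queries below are a sketch, assuming the metric is a gauge (the alert applies `max_over_time` to it) and the Kubernetes mixin labels; the 3h window is an arbitrary choice, not a value taken from the alert.

```promql
# Peak lag per pod over the last 10 minutes, as evaluated by the alert.
max by (cluster, namespace, pod) (
  max_over_time(cortex_blockbuilder_consumer_lag_records[10m])
)

# Change in lag per series over the past 3 hours.
# A sustained positive value suggests the backlog is growing.
delta(cortex_blockbuilder_consumer_lag_records[3h])
```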
### MimirBlockBuilderCompactAndUploadFailed
How it **works**:
- The block-builder periodically consumes data from a Kafka topic and processes the consumed data into TSDB blocks.
- It compacts and uploads the produced TSDB blocks to object storage.
- If the block-builder encounters issues while compacting or uploading the blocks, it reports the failure metric, which then triggers the alert.
How to **investigate**:
- Check the block-builder logs for the errors reported while compacting or uploading blocks.
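To see which instances are failing and how often, the failure counter used by the alert can be summed over a longer window. This is a sketch, assuming the Kubernetes mixin labels; the 1h window is an arbitrary choice.

```promql
# Failed compact-and-upload attempts per pod over the past hour.
sum by (cluster, namespace, pod) (
  increase(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1h])
)
```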
## Errors catalog
Mimir has some codified error IDs that you might see in HTTP responses or logs.
@@ -1163,6 +1163,33 @@ spec:
          for: 5m
          labels:
            severity: critical
        - alert: MimirBlockBuilderNoCycleProcessing
          annotations:
            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
          expr: |
            max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
          for: 5m
          labels:
            severity: warning
        - alert: MimirBlockBuilderLagging
          annotations:
            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
          expr: |
            max by(cluster, namespace, pod) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
          for: 75m
          labels:
            severity: warning
        - alert: MimirBlockBuilderCompactAndUploadFailed
          annotations:
            message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
            runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
          expr: |
            sum by (cluster, namespace, pod) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
          for: 5m
          labels:
            severity: warning
    - name: mimir_continuous_test
      rules:
        - alert: MimirContinuousTestNotRunningOnWrites
27 changes: 27 additions & 0 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -1137,6 +1137,33 @@ groups:
        for: 5m
        labels:
          severity: critical
      - alert: MimirBlockBuilderNoCycleProcessing
        annotations:
          message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
        expr: |
          max by(cluster, namespace, instance) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
        for: 5m
        labels:
          severity: warning
      - alert: MimirBlockBuilderLagging
        annotations:
          message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
        expr: |
          max by(cluster, namespace, instance) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
        for: 75m
        labels:
          severity: warning
      - alert: MimirBlockBuilderCompactAndUploadFailed
        annotations:
          message: Mimir {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
        expr: |
          sum by (cluster, namespace, instance) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
        for: 5m
        labels:
          severity: warning
  - name: mimir_continuous_test
    rules:
      - alert: MimirContinuousTestNotRunningOnWrites
27 changes: 27 additions & 0 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -1151,6 +1151,33 @@ groups:
        for: 5m
        labels:
          severity: critical
      - alert: MimirBlockBuilderNoCycleProcessing
        annotations:
          message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} has not processed cycles in the past hour.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildernocycleprocessing
        expr: |
          max by(cluster, namespace, pod) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
        for: 5m
        labels:
          severity: warning
      - alert: MimirBlockBuilderLagging
        annotations:
          message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} reports partition lag of {{ printf "%.2f" $value }}%.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuilderlagging
        expr: |
          max by(cluster, namespace, pod) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
        for: 75m
        labels:
          severity: warning
      - alert: MimirBlockBuilderCompactAndUploadFailed
        annotations:
          message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} fails to compact and upload blocks.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirblockbuildercompactanduploadfailed
        expr: |
          sum by (cluster, namespace, pod) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
        for: 5m
        labels:
          severity: warning
  - name: mimir_continuous_test
    rules:
      - alert: MimirContinuousTestNotRunningOnWrites
47 changes: 47 additions & 0 deletions operations/mimir-mixin/alerts/ingest-storage.libsonnet
@@ -212,6 +212,53 @@
            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s Kafka client produce buffer utilization is {{ printf "%%.2f" $value }}%%.' % $._config,
          },
        },

        // Alert if block-builder didn't process cycles in the past hour.
        {
          alert: $.alertName('BlockBuilderNoCycleProcessing'),
          'for': '5m',
          expr: |||
            max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (histogram_count(increase(cortex_blockbuilder_consume_cycle_duration_seconds[60m]))) == 0
          ||| % $._config,
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s has not processed cycles in the past hour.' % $._config,
          },
        },

        // Alert if block-builder per partition lag is higher than the threshold.
        // The value of the threshold is arbitrarily large for now. We will reconsider this alert after we get the block-builder-scheduler.
        // Note on "for: 75m": we assume one cycle is 1hr; with the 10m lookback we expect the warning to trigger only if the metric is above the threshold for more than one cycle.
        {
          alert: $.alertName('BlockBuilderLagging'),
          'for': '75m',
          expr: |||
            max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cortex_blockbuilder_consumer_lag_records[10m])) > 4e6
          ||| % $._config,
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s reports partition lag of {{ printf "%%.2f" $value }}%%.' % $._config,
          },
        },

        // Alert if block-builder is failing to compact and upload any blocks.
        {
          alert: $.alertName('BlockBuilderCompactAndUploadFailed'),
          'for': '5m',
          expr: |||
            sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_blockbuilder_tsdb_compact_and_upload_failed_total[1m])) > 0
          ||| % $._config,
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: '%(product)s {{ $labels.%(per_instance_label)s }} in %(alert_aggregation_variables)s fails to compact and upload blocks.' % $._config,
          },
        },
      ],
    },
  ],