split ksm alerts in 2 separate ones #912

Merged
merged 11 commits into from
Sep 21, 2023
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- Split `KubeStateMetricsDown` alert into 2 alerts : `KubeStateMetricsDown` and `KubeStateMetricsNotRetrievingMetrics`

## [2.133.0] - 2023-09-19

### Changed
@@ -10,6 +10,34 @@ spec:
  groups:
  - name: kube-state-metrics
    rules:
    - alert: KubeStateMetricsDown
      annotations:
        description: '{{`KubeStateMetrics ({{ $labels.instance }}) is down.`}}'
        opsrecipe: kube-state-metrics-down/
      expr: |-
        (
          # modern clusters
          label_replace(up{app="kube-state-metrics",instance=~".*:8080"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",instance=~".*:8080"} == 1)
        )
        and
        (
          # vintage clusters without servicemonitor
          label_replace(up{app="kube-state-metrics",container=""}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",container=""} == 1)
        )
      for: 15m
      labels:
        area: kaas
        cancel_if_apiserver_down: "true"
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
        cancel_if_cluster_has_no_workers: "true"
        inhibit_kube_state_metrics_down: "true"

Does this inhibition make sense for KubeStateMetricsDown itself?

@hervenicol (Contributor), Sep 19, 2023

Inhibitions work with a source label and a target label: if an alert carrying the source label fires, alerts carrying the target label are inhibited.

Labels named inhibit_xxx are source labels. This one basically means: "when this alert fires, please inhibit the alerts that depend on kube_state_metrics_down."
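
To make the source/target mechanism concrete, here is roughly what a matching Alertmanager inhibition rule could look like. This is a minimal sketch, not the project's actual configuration: the pairing of `inhibit_kube_state_metrics_down` (source) with `cancel_if_kube_state_metrics_down` (target) is inferred from the labels in this diff.

```yaml
# Hypothetical Alertmanager fragment. The label pairing is an assumption
# based on the labels visible in this pull request.
inhibit_rules:
  - # Source side: a firing alert carrying this label triggers the inhibition,
    # e.g. KubeStateMetricsDown.
    source_matchers:
      - 'inhibit_kube_state_metrics_down="true"'
    # Target side: alerts carrying this label are muted while a source alert
    # fires, e.g. KubeStateMetricsNotRetrievingMetrics.
    target_matchers:
      - 'cancel_if_kube_state_metrics_down="true"'
    # A real configuration would likely also set `equal:` so that the rule
    # only applies to alerts from the same cluster; omitted here.
```

Under such a rule, `KubeStateMetricsDown` carries only the source label, so the rule never mutes it. An alert labelled `cancel_if_kube_state_metrics_down: "true"` would be muted for as long as a source alert is firing.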

        cancel_if_prometheus_agent_down: "true"
        cancel_if_kubelet_down: "true"
        cancel_if_outside_working_hours: "false"
        severity: page
        team: atlas
        topic: observability
    - alert: KubeStateMetricsSlow
      annotations:
        description: '{{`KubeStateMetrics ({{ $labels.instance }}) is too slow.`}}'
@@ -28,6 +56,27 @@ spec:
        severity: page
        team: atlas
        topic: observability
    - alert: KubeStateMetricsNotRetrievingMetrics
      annotations:
        description: '{{`KubeStateMetrics ({{ $labels.instance }}) is not retrieving metrics.`}}'
        opsrecipe: kube-state-metrics-down/
      expr: |-
        # When it looks up but we don't have metrics
        count({app="kube-state-metrics"}) < 10
      for: 20m
      labels:
        area: kaas
        cancel_if_apiserver_down: "true"
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
        cancel_if_cluster_has_no_workers: "true"
        inhibit_kube_state_metrics_down: "true"
        cancel_if_kubelet_down: "true"
        cancel_if_kube_state_metrics_down: "true"
        cancel_if_outside_working_hours: "true"
        severity: page
        team: atlas
        topic: observability
    - alert: KubeConfigMapCreatedMetricMissing
      annotations:
        description: '{{`kube_configmap_created metric is missing for cluster {{ $labels.cluster_id }}.`}}'
33 changes: 0 additions & 33 deletions helm/prometheus-rules/templates/alerting-rules/up.all.rules.yml
@@ -46,36 +46,3 @@ spec:
        severity: page
        team: atlas
        topic: observability
    - alert: KubeStateMetricsDown
      annotations:
        description: '{{`KubeStateMetrics ({{ $labels.instance }}) is down.`}}'
        opsrecipe: kube-state-metrics-down/
      expr: |-
        (
          # modern clusters
          label_replace(up{app="kube-state-metrics",instance=~".*:8080"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",instance=~".*:8080"} == 1)
        )
        and
        (
          # vintage clusters without servicemonitor
          label_replace(up{app="kube-state-metrics",container=""}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",container=""} == 1)
        )
        or
QuantumEnigmaa marked this conversation as resolved.
        (
          # When it looks up but we don't have metrics
          count({app="kube-state-metrics"}) < 10
        )
      for: 15m
      labels:
        area: kaas
        cancel_if_apiserver_down: "true"
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
        cancel_if_cluster_has_no_workers: "true"
        inhibit_kube_state_metrics_down: "true"
        cancel_if_kubelet_down: "true"
        cancel_if_outside_working_hours: "false"
        cancel_if_prometheus_agent_down: "true"
        severity: page
        team: atlas
        topic: observability
@@ -1,6 +1,6 @@
---
rule_files:
- up.all.rules.yml
- kube-state-metrics.rules.yml

tests:
# KubeStateMetricsDown tests