
split ksm alerts in 2 separate ones #912

Merged: 11 commits, Sep 21, 2023
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- Split `KubeStateMetricsDown` alert into 2 alerts: `KubeStateMetricsDown` and `KubeStateMetricsNotRetrievingMetrics`

## [2.132.0] - 2023-09-15

### Changed
26 changes: 21 additions & 5 deletions helm/prometheus-rules/templates/alerting-rules/up.all.rules.yml
@@ -58,11 +58,6 @@ spec:
# vintage clusters without servicemonitor
label_replace(up{app="kube-state-metrics",container=""}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",container=""} == 1)
)
or
(
# When it looks up but we don't have metrics
count({app="kube-state-metrics"}) < 10
)
for: 15m
labels:
area: kaas
@@ -76,3 +71,24 @@ spec:
severity: page
team: atlas
topic: observability
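For readers skimming the diff: the KubeStateMetricsDown expression above combines two checks, and label_replace only rewrites labels for the vintage clusters mentioned in the code comment. A minimal standalone sketch of the pattern (illustrative PromQL only, not the full rule from this chart):

# 1. The KSM target is scraped but reports down:
up{app="kube-state-metrics"} == 0

# 2. ...or the up series for KSM is missing entirely (target gone or never scraped):
absent(up{app="kube-state-metrics"} == 1)

# label_replace does not filter anything; it only derives an "ip" label from a
# vintage node name such as "ip-10-1-2-3.eu-west-1.compute.internal":
label_replace(up{app="kube-state-metrics"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*")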

- alert: KubeStateMetricsNotRetrievingMetrics
Contributor:
What's the difference with KubeSecretMetricMissing and KubeStateMetricsSlow?

Contributor:
Also we have a KSMdown inhibition :)

Contributor Author:
From my own understanding, KubeStateMetricsSlow monitors KSM's response time to make sure it doesn't take too long to serve metrics, while KubeStateMetricsNotRetrievingMetrics makes sure that KSM is actually retrieving metrics.

But maybe we can consider that if KSM takes too long to retrieve metrics then it is effectively unable to retrieve them, in which case we could drop KubeStateMetricsNotRetrievingMetrics and rely only on KubeStateMetricsSlow.
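To make that distinction concrete, a response-time alert of the KubeStateMetricsSlow kind is typically built on the per-target scrape duration that Prometheus records itself. The sketch below is only an illustration of that style of rule, assuming the standard scrape_duration_seconds metric and an arbitrary threshold; it is not the actual rule shipped in this repository.

- alert: KubeStateMetricsSlow
  annotations:
    description: 'KubeStateMetrics is taking too long to be scraped.'
  # Hypothetical sketch: the real rule may use a different metric, threshold, or duration.
  expr: scrape_duration_seconds{app="kube-state-metrics"} > 30
  for: 15m
  labels:
    severity: notify
    team: atlas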

Contributor:
Sure about the slow one, but is this alert not the same as those?

Also, maybe we should regroup all KSM-related alerts into this file.

Contributor Author:
Not sure: is kube_configmap_created a random metric normally retrieved by KSM that we check to make sure KSM is retrieving metrics in general?

In that case I'd still prefer keeping the new alert, because its name is more straightforward about its purpose, and it also indicates that KSM as a whole is not able to retrieve metrics, whereas KubeConfigMapCreatedMetricMissing would indicate that KSM is not able to retrieve metrics from a particular instance.

Contributor:
Well, kube_configmap_created is the metric exposed by KSM when it accesses the apiserver, but we can have both if you prefer.
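For comparison, an alert built on that single metric is usually just an absent() check. A sketch of what a KubeConfigMapCreatedMetricMissing-style rule could look like (hypothetical; the existing rule in this repository may differ):

- alert: KubeConfigMapCreatedMetricMissing
  annotations:
    description: 'The kube_configmap_created metric is missing, so KSM may be unable to list objects from the apiserver.'
  # Hypothetical sketch: fires when KSM stops exposing this one well-known metric.
  expr: absent(kube_configmap_created)
  for: 30m
  labels:
    severity: page
    team: atlas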

Contributor Author:
I guess it would make sense to have both, yes.

annotations:
description: '{{`KubeStateMetrics ({{ $labels.instance }}) is not retrieving metrics.`}}'
opsrecipe: kube-state-metrics-down/
expr: |-
# When the target looks up but we don't actually get metrics from it
count({app="kube-state-metrics"}) < 10
for: 60m
labels:
area: kaas
cancel_if_apiserver_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
inhibit_kube_state_metrics_down: "true"
cancel_if_kubelet_down: "true"
cancel_if_outside_working_hours: "false"
severity: page
team: atlas
topic: observability
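A quick way to sanity-check the new expression against a live cluster is to run its building blocks in the Prometheus UI before relying on the alert (the queries below assume the app label is attached by the scrape config, as in the rest of this file):

# Number of series currently exposed with the KSM label; a healthy KSM exports far
# more than 10 series (kube_pod_info, kube_configmap_created, ...).
count({app="kube-state-metrics"})

# The alert condition itself: with the 60m "for", it fires only after the series
# count has stayed below 10 for an hour, even though the target may still look up.
count({app="kube-state-metrics"}) < 10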