split ksm alerts in 2 separate ones #912
Conversation
@@ -76,3 +71,24 @@ spec:
severity: page
team: atlas
topic: observability
- alert: KubeStateMetricsNotRetrievingMetrics
What's the difference with KubeSecretMetricMissing and KubeStateMetricsSlow?
Also, we have a KSMDown inhibition :)
From my own understanding, KubeStateMetricsSlow monitors the response time from KSM to make sure it doesn't take too long to retrieve metrics, while KubeStateMetricsNotRetrievingMetrics makes sure that metrics are actually being retrieved by KSM.
But maybe we can consider that if KSM is taking too long to retrieve metrics, it means it's unable to retrieve them at all, and we could get rid of KubeStateMetricsNotRetrievingMetrics and rely only on KubeStateMetricsSlow.
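For context, the distinction could be sketched as two Prometheus rules along these lines (the expressions, thresholds, and durations below are illustrative assumptions, not the actual rules in this PR):

```yaml
# Hypothetical sketch only -- expressions/thresholds are assumptions.
- alert: KubeStateMetricsSlow
  # KSM answers, but scraping it takes too long.
  expr: scrape_duration_seconds{job="kube-state-metrics"} > 7
  for: 15m
- alert: KubeStateMetricsNotRetrievingMetrics
  # KSM is reachable but exposes (almost) no series at all.
  expr: count({job="kube-state-metrics"}) < 10
  for: 15m
```

Under this reading the two alerts can fire independently: a slow-but-complete scrape trips only the first, an empty response only the second.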
Sure about the slow one, but isn't this alert the same as this existing one?
prometheus-rules/helm/prometheus-rules/templates/alerting-rules/kube-state-metrics.rules.yml
Line 31 in 737a5e9:
- alert: KubeConfigMapCreatedMetricMissing
Also, maybe we should regroup all KSM-related alerts into this file.
Not sure; is kube_configmap_created a random metric normally retrieved by KSM that we check to make sure KSM is retrieving metrics in general?
In that case I'd prefer also keeping the new alert, because its name is more straightforward about its usage. Moreover, it indicates that KSM as a whole is not able to retrieve metrics, while KubeConfigMapCreatedMetricMissing would indicate that KSM is not able to retrieve metrics from a particular instance.
Well, kube_configmap_created is the metric exposed by KSM when it accesses the apiserver, but we can have both if you prefer.
I guess it would make sense to have both, yes.
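A canary-metric check of this kind is typically written with PromQL's absent() function; a hedged sketch of how KubeConfigMapCreatedMetricMissing could look (the duration and labels are assumptions, not the repository's actual rule):

```yaml
# Hypothetical sketch -- "for" duration and labels are assumptions.
- alert: KubeConfigMapCreatedMetricMissing
  # kube_configmap_created acts as a canary: if KSM can reach the
  # apiserver at all, this metric should exist on every cluster.
  expr: absent(kube_configmap_created)
  for: 30m
```

absent() returns a single series with value 1 only when no series of that name exists, which is what makes it suitable for "is KSM producing anything at all" checks.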
helm/prometheus-rules/templates/alerting-rules/kube-state-metrics.rules.yml
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
inhibit_kube_state_metrics_down: "true"
Does this inhibition make sense for KubeStateMetricsDown itself?
Inhibitions work with a source label and a target label: if an alert with the source label fires, alerts with the target label are inhibited. Labels inhibit_xxx are source labels. This basically means "when this alert fires, please inhibit alerts that depend on kube_state_metrics_down".
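The source/target mechanism described above maps to an Alertmanager inhibit_rules entry along these lines (a sketch; the exact matcher and equal label names are assumptions about this repository's setup):

```yaml
# Hypothetical sketch -- matcher names are assumptions.
inhibit_rules:
  # When an alert carrying the source label fires (e.g. KubeStateMetricsDown
  # with inhibit_kube_state_metrics_down: "true"), silence every alert that
  # declares the matching target label on the same cluster.
  - source_matchers:
      - inhibit_kube_state_metrics_down="true"
    target_matchers:
      - cancel_if_kube_state_metrics_down="true"
    equal: ["cluster_id"]
```

So putting the source label on KubeStateMetricsDown itself is harmless: the alert only inhibits others, not itself, unless it also carries the target label.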
3f16148 to 834eb01
LGTM
If Atlas is happy, I am happy.
This PR splits the KSMDown alert into 2 different alerts:
Checklist
oncall-kaas-cloud (GitHub group).