Check if KSM was up 2 minutes ago in WorkloadClusterCriticalPodNotRunningAWS alert #974
Conversation
Thanks for looking into this! Here is some initial feedback:
Actually we (Atlas) were talking about this alert this morning, and we thought we should split the
I did not check how easy/hard that would be, though.
@hervenicol Thanks for reviewing!
I had to write that down, didn't I... It seems we hit the exact case where KSM was down for a bit (~21 minutes), came back up, lifted the inhibition, and WorkloadClusterCriticalPodNotRunningAWS immediately triggered even though those pods were fine: https://opsg.in/a/i/giantswarm/6b1a8c26-396a-49bf-a100-35ceedc57b62-1701293882325
Some metrics on Grafana: `up{app="kube-state-metrics", cluster_id="pcn01"}`
Metrics seem to come back at 22:37:15.
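For reference, a minimal sketch of the kind of query used to inspect that gap; the label values come from the comment above, while the `changes()` variant is only an assumed helper for spotting the flap, not a query taken from the incident:

```
# KSM availability for the affected cluster (1 = up, 0 = down)
up{app="kube-state-metrics", cluster_id="pcn01"}

# Assumed helper: how often the KSM target flapped over the last hour
changes(up{app="kube-state-metrics", cluster_id="pcn01"}[1h])
```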
Oh, timing was really tight!

Idea: What I was thinking we should do is use "over_time" queries to check whether we had an issue recently. Here is a small demo with ... and here is the Prometheus link with my queries on argali / pcn01.

Caveats: We probably don't want the "ksmdown" alert(s) to fire for longer. Also, my example is quite simplistic, and we want to have the right behaviour for each of the different terms of the current KSMDown alerts.
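For illustration, here is a minimal sketch of that "over_time" idea (not the actual demo queries from the linked Prometheus UI); the 5m window and the label selector are assumptions:

```
# Assumed sketch: treat KSM as "recently healthy" only if `up` stayed at 1
# for every sample in the last 5 minutes; any recent gap keeps this at 0,
# giving dependent alerts a grace period after KSM comes back.
min_over_time(up{app="kube-state-metrics"}[5m]) == 1
```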
Is this PR still relevant, @weseven?
I think not; we can close this down. If the same issue catches my eye again, we'll start from here :)
Towards: https://github.com/giantswarm/giantswarm/issues/27895
This PR adds a condition that checks whether the Kube State Metrics pod was up 2 minutes before the query, to prevent the alert from triggering when KSM was down (its missing metrics satisfy the second half of the alert) and has just come back up, lifting the alert inhibition but potentially producing a false alert.
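The actual rule change lives in the diff (not reproduced here), but a hedged sketch of the described condition could look like the following; the pod expression, label matching, and metric names are illustrative assumptions, not the real alert:

```
# Illustrative only: the "critical pod not running" half of the alert,
# gated on KSM having been up 2 minutes before this evaluation, so the
# alert cannot fire in the window right after KSM recovers.
kube_pod_container_status_running{namespace="kube-system"} == 0
and on(cluster_id)
up{app="kube-state-metrics"} offset 2m == 1
```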
Checklist
oncall-kaas-cloud GitHub group).