
Check if KSM was up 2 minutes ago in WorkloadClusterCriticalPodNotRunningAWS alert #974

Closed · wants to merge 2 commits

Conversation

@weseven (Contributor) commented Nov 27, 2023

Towards: https://github.com/giantswarm/giantswarm/issues/27895

This PR adds a condition that checks whether the Kube State Metrics (KSM) pod was up 2 minutes before the query is evaluated. The goal is to prevent the alert from firing when KSM was down (the absence of metrics satisfies the second half of the alert expression) and has just come back up, which lifts the alert inhibition and can trigger this alert as a false positive.
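
For illustration, a minimal sketch of what such a condition could look like in PromQL. The label selectors, pod names, and the exact shape of the expression are assumptions based on the discussion, not the actual rule in this repository:

```promql
# Only flag a critical kube-system pod as not running if kube-state-metrics
# was already up 2 minutes before this evaluation.
(
  kube_pod_container_status_running{container=~"(k8s-api-server|k8s-controller-manager|k8s-scheduler)", namespace="kube-system"} == 0
)
and on (cluster_id)
(
  up{app="kube-state-metrics"} offset 2m == 1
)
```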


@weseven weseven marked this pull request as ready for review November 28, 2023 10:51
@weseven weseven requested a review from a team as a code owner November 28, 2023 10:51
@hervenicol (Contributor)

Thanks for looking into this!

Here is some initial feedback:

  • offset takes an instant value at a single point in the past, right? What if KSM was up 5 minutes ago, but had a glitch between 4 and 3 minutes ago? Maybe something like min_over_time would be safer (see the sketch at the end of this comment)?
  • the ksmdown alert is supposed to check that KSM is up and running, and it triggers an inhibition that prevents this alert from paging. Maybe we should rather increase the inhibition time window?

Actually we (Atlas) were talking about this alert this morning, and we thought we should split the ksmdown alert into 2:

  • one that fires the actual alert, that would not change
  • one that fires the inhibition alert, that would last longer

I did not check how easy/hard that would be though.
Happy to discuss it with you if you want to take over or have ideas.
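
For illustration, a minimal sketch of the min_over_time idea mentioned in the first bullet above. The selector and the 5-minute window are assumptions, not an existing rule:

```promql
# True only if KSM has been continuously up over the whole window,
# so a glitch 3-4 minutes ago would still hold the alert back.
min_over_time(up{app="kube-state-metrics"}[5m]) == 1
```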

@weseven (Contributor, Author) commented Nov 29, 2023

@hervenicol Thanks for reviewing!
You're absolutely right: if KSM is flapping, using up with offset might be misleading.
I think what you are suggesting (making the alert that controls the inhibitions a bit more "sticky", i.e. not lifting the inhibition immediately) is the better way to approach this.
That was something I was also considering, as it's the proper place to fix this for all the alerts, but I wasn't sure how to approach it since KubeStateMetricsSlow and KubeStateMetricsNotRetrievingMetrics already exist and possibly address similar use cases.
Also, I'm not sure whether reducing the time windows for those two alerts would help or make things worse; I guess we'll need to examine real-world occurrences of the issue, although (fortunately) I don't see this happening recently.

@weseven (Contributor, Author) commented Nov 30, 2023

> although (fortunately) I don't see this happening recently.

I had to write that down, didn't I...
It happened yesterday: https://gigantic.slack.com/archives/C067ZGZ6XNY

Seems we hit exactly this case: KSM was down for a bit (~21 minutes), came back up and lifted the inhibition, and WorkloadClusterCriticalPodNotRunningAWS triggered immediately even though those pods were fine: https://opsg.in/a/i/giantswarm/6b1a8c26-396a-49bf-a100-35ceedc57b62-1701293882325

Some metrics on grafana:

up{app="kube-state-metrics", cluster_id="pcn01"}

kube_pod_container_status_running{container=~"(k8s-api-server|k8s-controller-manager|k8s-scheduler)",namespace="kube-system", cluster_id="pcn01"}

Metrics seem to come back at 22:37:15.
I don't know if it's relevant, but looking at the "up" metric, it seems the "metric" endpoint (port 8081) of KSM was discovered as up at 22:36:30, ~45s before the http endpoint (port 8080).

@hervenicol (Contributor)

Oh, timing was really tight!
FYI, we have a dashboard on Grafana Cloud that shows the alerts that have been firing.

Idea

What I was thinking is that we should use "_over_time" queries to detect whether we had an issue recently.

Here is a small demo with absent, compared to an "_over_time" variant (screenshots not available):

...and here is the prometheus link with my queries on argali / pcn01
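
For reference, a hypothetical reconstruction of that kind of comparison, since the original screenshots and Prometheus link are not available here. The selector, the 10-minute window, and the subquery resolution are assumptions:

```promql
# absent(): returns 1 only while the metric is missing right now.
absent(kube_pod_container_status_running{namespace="kube-system", cluster_id="pcn01"})

# One way to keep matching for a while after the metric comes back:
# evaluate absent() over a 10-minute subquery window and check whether
# it was 1 at any step in that window.
max_over_time((absent(kube_pod_container_status_running{namespace="kube-system", cluster_id="pcn01"}))[10m:1m]) == 1
```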

Caveats

We probably don't want the "ksmdown" alert(s) to fire for longer.
So we probably want to duplicate these alerts, and create a counterpart for each of them that would only trigger the inhibition but not page.

Also, my example is quite simplistic, and we want to have the right behaviour for each of the different terms of the current KSMDown alerts.
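
As a rough sketch of that split, using the same subquery trick as above. The expressions, labels, and the 15-minute window are assumptions, not the actual KSMDown rules:

```promql
# 1) Paging alert (stand-in expression): fires only while KSM is down right now.
up{app="kube-state-metrics"} == 0

# 2) Inhibition-only counterpart: does not page, only raises the inhibition,
#    and keeps firing for ~15 minutes after KSM has recovered because it
#    looks at whether KSM was down at any point in the subquery window.
max_over_time((up{app="kube-state-metrics"} == bool 0)[15m:1m]) == 1
```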

@hervenicol (Contributor)

Is this PR still relevant @weseven?

@weseven (Contributor, Author) commented Mar 5, 2024

I think not; we can close this down. If the same issue catches my eye again, we'll start from here :)

@weseven weseven closed this Mar 5, 2024
@QuentinBisson QuentinBisson deleted the issue-27895-alert-after-KSM-down branch November 5, 2024 09:34