
Check if KSM was up 2 minutes ago in WorkloadClusterCriticalPodNotRunningAWS alert #974

Closed · wants to merge 2 commits

Conversation

@weseven (Contributor) commented Nov 27, 2023

Towards: https://github.com/giantswarm/giantswarm/issues/27895

This PR adds a condition that checks whether the Kube State Metrics (KSM) pod was up 2 minutes before the query is evaluated. The goal is to prevent the alert from firing when KSM was down (the absence of metrics satisfies the second half of the alert expression) and has just come back up, which lifts the alert inhibition and can trigger this alert as a false positive.
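
For illustration, a minimal sketch of what such a condition could look like in PromQL. The label selectors, pod names, and the exact shape of the expression are assumptions based on the discussion, not the actual rule in this repository:

```promql
# Only flag a critical kube-system pod as not running if kube-state-metrics
# was already up 2 minutes before this evaluation.
(
  kube_pod_container_status_running{container=~"(k8s-api-server|k8s-controller-manager|k8s-scheduler)", namespace="kube-system"} == 0
)
and on (cluster_id)
(
  up{app="kube-state-metrics"} offset 2m == 1
)
```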


@weseven weseven marked this pull request as ready for review November 28, 2023 10:51
@weseven weseven requested a review from a team as a code owner November 28, 2023 10:51
@hervenicol (Contributor)

Thanks for looking into this!

Here is some initial feedback:

  • offset takes an instant value at a single point in the past, right? What if KSM was up 5 minutes ago, but had a glitch between 4 and 3 minutes ago? Maybe something like min_over_time would be safer (see the sketch at the end of this comment)?
  • the ksmdown alert is supposed to check that KSM is up and running, and it triggers an inhibition that prevents this alert from paging. Maybe we should rather increase the inhibition time window?

Actually we (Atlas) were talking about this alert this morning, and we thought we should split the ksmdown alert into 2:

  • one that fires the actual alert, that would not change
  • one that fires the inhibition alert, that would last longer

I did not check how easy/hard that would be though.
Happy to discuss it with you if you want to take over or have ideas.
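
For illustration, a minimal sketch of the min_over_time idea mentioned in the first bullet above. The selector and the 5-minute window are assumptions, not an existing rule:

```promql
# True only if KSM has been continuously up over the whole window,
# so a glitch 3-4 minutes ago would still hold the alert back.
min_over_time(up{app="kube-state-metrics"}[5m]) == 1
```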

@weseven (Contributor, Author) commented Nov 29, 2023

@hervenicol Thanks for reviewing!
You're absolutely right: if KSM is flapping, using up with offset might be misleading.
I think what you are suggesting (making the alert that controls the inhibitions a bit more "sticky", i.e. not lifting the inhibition immediately) is the better way to approach this.
That was something I was also considering, as it's the proper place to fix this for all the alerts, but I wasn't sure how to approach it since KubeStateMetricsSlow and KubeStateMetricsNotRetrievingMetrics already exist and possibly address similar use cases.
Also, I'm not sure whether reducing the time windows for those two alerts would help or make things worse; I guess we'll need to examine real-world occurrences of the issue, although (fortunately) I don't see this happening recently.

@weseven (Contributor, Author) commented Nov 30, 2023

> although (fortunately) I don't see this happening recently.

I had to write that down, didn't I...
It happened yesterday: https://gigantic.slack.com/archives/C067ZGZ6XNY

Seems we hit exactly this case: KSM was down for a bit (~21 minutes), came back up and lifted the inhibition, and WorkloadClusterCriticalPodNotRunningAWS triggered immediately even though those pods were fine: https://opsg.in/a/i/giantswarm/6b1a8c26-396a-49bf-a100-35ceedc57b62-1701293882325

Some metrics on grafana:

up{app="kube-state-metrics", cluster_id="pcn01"}

kube_pod_container_status_running{container=~"(k8s-api-server|k8s-controller-manager|k8s-scheduler)",namespace="kube-system", cluster_id="pcn01"}

Metrics seem to come back at 22:37:15.
I don't know if it's relevant, but looking at the "up" metric, it seems the "metric" endpoint (port 8081) of KSM was discovered as up at 22:36:30, ~45s before the http endpoint (port 8080).

@hervenicol (Contributor)

Oh, timing was really tight!
FYI, we have a dashboard on Grafana Cloud that shows the alerts that have been firing.

Idea

What I was thinking is that we should use "_over_time" queries to detect whether we had an issue recently.

Here is a small demo with absent, compared to an "_over_time" variant (screenshots not available):

...and here is the prometheus link with my queries on argali / pcn01
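
For reference, a hypothetical reconstruction of that kind of comparison, since the original screenshots and Prometheus link are not available here. The selector, the 10-minute window, and the subquery resolution are assumptions:

```promql
# absent(): returns 1 only while the metric is missing right now.
absent(kube_pod_container_status_running{namespace="kube-system", cluster_id="pcn01"})

# One way to keep matching for a while after the metric comes back:
# evaluate absent() over a 10-minute subquery window and check whether
# it was 1 at any step in that window.
max_over_time((absent(kube_pod_container_status_running{namespace="kube-system", cluster_id="pcn01"}))[10m:1m]) == 1
```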

Caveats

We probably don't want the "ksmdown" alert(s) to fire for longer.
So we probably want to duplicate these alerts, and create a counterpart for each of them that would only trigger the inhibition but not page.

Also, my example is quite simplistic, and we want to have the right behaviour for each of the different terms of the current KSMDown alerts.
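
As a rough sketch of that split, using the same subquery trick as above. The expressions, labels, and the 15-minute window are assumptions, not the actual KSMDown rules:

```promql
# 1) Paging alert (stand-in expression): fires only while KSM is down right now.
up{app="kube-state-metrics"} == 0

# 2) Inhibition-only counterpart: does not page, only raises the inhibition,
#    and keeps firing for ~15 minutes after KSM has recovered because it
#    looks at whether KSM was down at any point in the subquery window.
max_over_time((up{app="kube-state-metrics"} == bool 0)[15m:1m]) == 1
```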

@hervenicol (Contributor)

Is this PR still relevant @weseven?

@weseven (Contributor, Author) commented Mar 5, 2024

I think not; we can close this down. If the same issue catches my eye again, we'll start from here :)

@weseven weseven closed this Mar 5, 2024
@QuentinBisson QuentinBisson deleted the issue-27895-alert-after-KSM-down branch November 5, 2024 09:34