split ksm alerts in 2 separate ones #912

Merged
QuantumEnigmaa merged 11 commits into master from split-ksm-alerts
Sep 21, 2023

Conversation

QuantumEnigmaa
Contributor

@QuantumEnigmaa QuantumEnigmaa commented Sep 19, 2023

This PR splits the KSMDown alert into two separate alerts:

  • KSMDown, which fires whenever the KSM component is actually down
  • KSMNotRetrievingMetrics, which fires when KSM is unable to retrieve metrics from the cluster
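
The split described above could look roughly like the following pair of rules. This is only a sketch: the expressions, the `for` durations, the `app="kube-state-metrics"` selector, and the `cluster_id` label are assumptions made for illustration, not the rules from this PR. Only the alert names and the `severity`, `team`, and `topic` labels come from the diff.

```yaml
# Hypothetical sketch of the two rules; every expression, duration, and
# selector below is an assumption, not the content of this PR.
groups:
  - name: kube-state-metrics
    rules:
      # KSM itself is down: its scrape target reports 0 or is missing entirely.
      - alert: KubeStateMetricsDown
        expr: 'up{app="kube-state-metrics"} == 0 or absent(up{app="kube-state-metrics"})'
        for: 15m
        labels:
          severity: page
          team: atlas
          topic: observability
      # KSM is up, but no kube_* series are exposed for that cluster.
      - alert: KubeStateMetricsNotRetrievingMetrics
        expr: |
          up{app="kube-state-metrics"} == 1
            unless on (cluster_id)
          count by (cluster_id) ({__name__=~"kube_.+", app="kube-state-metrics"})
        for: 20m
        labels:
          severity: page
          team: atlas
          topic: observability
```

Keeping the two conditions in separate rules lets each alert carry its own description and inhibition labels.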

Checklist

@QuantumEnigmaa QuantumEnigmaa requested review from a team September 19, 2023 09:06
@QuantumEnigmaa QuantumEnigmaa requested a review from a team as a code owner September 19, 2023 09:06
@QuantumEnigmaa QuantumEnigmaa self-assigned this Sep 19, 2023
@@ -76,3 +71,24 @@ spec:
severity: page
team: atlas
topic: observability

- alert: KubeStateMetricsNotRetrievingMetrics
Contributor

How does this differ from KubeSecretMetricMissing and KubeStateMetricsSlow?

Contributor

Also, we have a KSMDown inhibition :)

Contributor Author

From my understanding, KubeStateMetricsSlow monitors KSM's response time to make sure it doesn't take too long to retrieve metrics, while KubeStateMetricsNotRetrievingMetrics makes sure that KSM is actually retrieving metrics.

But maybe we can consider that if KSM takes too long to retrieve metrics, then it is effectively unable to retrieve them, so we could drop KubeStateMetricsNotRetrievingMetrics and rely only on KubeStateMetricsSlow.
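
For comparison, a latency-based rule of the kind described for KubeStateMetricsSlow might be sketched as follows. The 7-second threshold, the 15-minute duration, and the label selector are assumptions, not values taken from the repository.

```yaml
# Hypothetical sketch; the threshold, duration, and selector are assumptions.
- alert: KubeStateMetricsSlow
  # scrape_duration_seconds is recorded by Prometheus for every scrape target.
  expr: 'scrape_duration_seconds{app="kube-state-metrics"} > 7'
  for: 15m
```

A rule like this stays silent when KSM answers quickly but returns no object metrics, which is the case a presence-based alert is meant to cover.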

Contributor

Sure about the slow one, but isn't this alert the same as those?

Also, maybe we should group all KSM-related alerts in this file.

Contributor Author

Not sure. Is kube_configmap_created just an arbitrary metric normally retrieved by KSM, which we check to make sure KSM is retrieving metrics in general?

In that case I'd prefer to also keep the new alert: its name is more straightforward about its purpose, and it indicates that KSM as a whole is unable to retrieve metrics, whereas KubeConfigMapCreatedMetricMissing would indicate that KSM is unable to retrieve metrics from a particular instance.

Contributor

Well, kube_configmap_created is the metric exposed by KSM when it accesses the apiserver, but we can have both if you prefer.
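
A missing-metric check on kube_configmap_created is commonly written with PromQL's `absent()` function. The fragment below is a minimal sketch under that assumption; the expression and the 30-minute duration are illustrative and not copied from the existing KubeConfigMapCreatedMetricMissing rule.

```yaml
# Hypothetical sketch; the expression and duration are assumptions.
- alert: KubeConfigMapCreatedMetricMissing
  # absent() returns a single series with value 1 when no matching series exist.
  expr: absent(kube_configmap_created)
  for: 30m
```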

Contributor Author

I guess it would make sense to have both, yes.

cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
inhibit_kube_state_metrics_down: "true"
Contributor

Does this inhibition make sense for KubeStateMetricsDown itself?

Contributor

@hervenicol hervenicol Sep 19, 2023

Inhibitions work with a source label and a target label: if an alert with the source label fires, alerts with the target label are inhibited.

Labels of the form inhibit_xxx are source labels. This basically means: "when this alert fires, please inhibit alerts that depend on kube_state_metrics_down".
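
The source/target mechanism described here maps onto Alertmanager's `inhibit_rules`. The fragment below is a sketch only: the target label name `cancel_if_kube_state_metrics_down` is assumed by analogy with the `cancel_if_*` labels in the diff, and matching on `cluster_id` is likewise an assumption.

```yaml
# Hypothetical Alertmanager fragment; the target label name and the
# `equal` label are assumptions modelled on the labels in this diff.
inhibit_rules:
  - source_matchers:
      - 'inhibit_kube_state_metrics_down="true"'
    target_matchers:
      - 'cancel_if_kube_state_metrics_down="true"'
    # Only inhibit alerts coming from the same cluster as the source alert.
    equal:
      - cluster_id
```

With a rule of this shape, an alert carrying only the source label is not itself suppressed; it suppresses alerts that carry the target label.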

Contributor

@hervenicol hervenicol left a comment

LGTM

Contributor

@whites11 whites11 left a comment

If atlas is happy I am happy

@QuantumEnigmaa QuantumEnigmaa merged commit 47fcf66 into master Sep 21, 2023
4 checks passed
@QuantumEnigmaa QuantumEnigmaa deleted the split-ksm-alerts branch September 21, 2023 08:19
6 participants