Add inhibition expressions for CAPI clusters #1153
Conversation
helm/prometheus-rules/templates/alerting-rules/inhibit.management-cluster.rules.yml
@@ -80,7 +80,7 @@ spec:
     - alert: InhibitionClusterNodePoolsNotReady
       annotations:
         description: '{{`Cluster {{ $labels.cluster_id }} node pools are not ready. Either they have been scaled down to 0 or they are not up yet.`}}'
-      expr: cluster_operator_node_pool_desired_workers == 0 and cluster_operator_node_pool_ready_workers == 0
+      expr: (cluster_operator_node_pool_desired_workers == 0 and cluster_operator_node_pool_ready_workers == 0) or capi_machinepool_status_condition{type="Ready", status="False"} == 1
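For reference, here is a rough sketch of how the updated expression evaluates, relying only on standard PromQL semantics and assuming both cluster-operator metrics carry the same label set; all label values below are made up.

# Hypothetical input series for one cluster with two node pools:
#   cluster_operator_node_pool_desired_workers{cluster_id="abc12", node_pool_id="np001"} 0
#   cluster_operator_node_pool_ready_workers{cluster_id="abc12", node_pool_id="np001"}   0
#   cluster_operator_node_pool_desired_workers{cluster_id="abc12", node_pool_id="np002"} 3
#   cluster_operator_node_pool_ready_workers{cluster_id="abc12", node_pool_id="np002"}   3
#
# `and` intersects series with identical label sets, so the left-hand side is
# evaluated per node pool and here returns only the np001 series.
# `or` then adds every CAPI machine pool whose Ready condition is False.
(cluster_operator_node_pool_desired_workers == 0 and cluster_operator_node_pool_ready_workers == 0)
or capi_machinepool_status_condition{type="Ready", status="False"} == 1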
Thinking about it some more, what happens if there are multiple node pools (e.g. one per AZ, as is recommended)? This wouldn't equal 1, would it?
Correct, this would equal 1 even if only one node pool among N is not ready.
Is that not the desired behaviour? I see that cancel_if_cluster_with_notready_nodepools is used in the DNS error rate, API server latency, and network error rate alerts, so I assumed that even a single node pool not being ready would affect those.
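For context, an alert opts into this kind of inhibition through a label on its rule; the sketch below is purely illustrative (the alert name and expression are made up and are not rules from this repo).

# Hypothetical target alert; the cancel_if label is the only relevant part.
- alert: ExampleDNSErrorRateTooHigh
  expr: sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) > 0.1
  for: 10m
  labels:
    cancel_if_cluster_with_notready_nodepools: "true"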
I'm not sure, but I think the non-CAPI query would only match if all node pools were not ready. I'm not sure what is desired though; I was just pointing out that it looked odd to me.
I think it works the same way as in vintage: there's a node_pool_id label on the cluster_operator_node_pool_desired_workers and cluster_operator_node_pool_ready_workers metrics.
For example, it is currently triggering for deu01, which has a mix of active node pools with nodes and node pools scaled to 0.
It does sound odd the more I think about it, though... @QuentinBisson, do you have any historical context on this? (Just because git blame points to your commit on these lines 😅)
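One way to sanity-check this per-node-pool behaviour is to run just the vintage half of the expression and inspect the node_pool_id labels of the result (a sketch; narrow it down with an installation or cluster_id selector as needed):

# Returns one series per node pool whose desired and ready worker counts are
# both 0, so a cluster with a mix of healthy and scaled-down pools still matches.
cluster_operator_node_pool_desired_workers == 0
  and cluster_operator_node_pool_ready_workers == 0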
I have no context on this; it was moved from another repo 3 years ago. What you need to be sure about is that the inhibition and the alert can relate to a node pool label, but I'm afraid we do not have it in the alerts that use this inhibition?
The doubt is about triggering the inhibition for just one node pool that has been scaled to 0... but if we don't have any alerts using the inhibition currently, I'm also wondering whether we actually need it 😅
I would still keep it in this PR (with the CAPI update) and remove it in a later PR if it turns out to be useless. What do you think?
(maybe it could be useful for the cluster downscaler?)
Sure you can add and remove :D
Towards: giantswarm/roadmap#3338
This PR adds the expressions needed for inhibitions to work on CAPI clusters.
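For reviewers unfamiliar with how these inhibition alerts are consumed, the sketch below shows how such a rule is commonly wired up in an Alertmanager configuration; it is an assumption-based illustration using the label discussed above, not the actual config shipped by this repository.

# Hypothetical Alertmanager inhibit rule: while the source alert fires for a
# cluster, target alerts for the same cluster_id that carry the cancel_if
# label are suppressed.
inhibit_rules:
  - source_matchers:
      - alertname="InhibitionClusterNodePoolsNotReady"
    target_matchers:
      - cancel_if_cluster_with_notready_nodepools="true"
    equal: ["cluster_id"]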
Checklist
oncall-kaas-cloud GitHub group).