
Add inhibitions expressions for CAPI clusters #1153

Merged
merged 3 commits from update-inhibitions-for-capi into master on May 6, 2024

Conversation

weseven (Contributor)

@weseven weseven commented Apr 30, 2024

Towards: giantswarm/roadmap#3338
This PR adds the expressions needed to make inhibitions work on CAPI clusters.

@weseven weseven requested review from a team April 30, 2024 16:42
@weseven weseven requested a review from a team as a code owner April 30, 2024 16:42
@weseven weseven changed the title Added inhibitions expressions for CAPI clusters. Add inhibitions expressions for CAPI clusters Apr 30, 2024
```diff
@@ -80,7 +80,7 @@ spec:
     - alert: InhibitionClusterNodePoolsNotReady
       annotations:
         description: '{{`Cluster {{ $labels.cluster_id }} node pools are not ready. Either they have been scaled down to 0 or they are not up yet.`}}'
-      expr: cluster_operator_node_pool_desired_workers == 0 and cluster_operator_node_pool_ready_workers == 0
+      expr: (cluster_operator_node_pool_desired_workers == 0 and cluster_operator_node_pool_ready_workers == 0) or capi_machinepool_status_condition{type="Ready", status="False"} == 1
```
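For context on how this expression evaluates: PromQL `and` keeps only vector elements whose label sets match exactly on both sides, and `or` additionally returns right-hand-side elements that have no label match on the left. Assuming the metrics carry a per-pool label such as `node_pool_id` (as discussed in this thread), the expression yields one result per node pool rather than one per cluster. A minimal sketch with hypothetical sample series (label values invented for illustration):

```promql
# Hypothetical sample series:
#   cluster_operator_node_pool_desired_workers{cluster_id="deu01", node_pool_id="np1"}  0
#   cluster_operator_node_pool_ready_workers{cluster_id="deu01", node_pool_id="np1"}    0
#   cluster_operator_node_pool_desired_workers{cluster_id="deu01", node_pool_id="np2"}  3
#   cluster_operator_node_pool_ready_workers{cluster_id="deu01", node_pool_id="np2"}    3

# `and` matches elements with identical label sets, so this returns a series
# for np1 only; np2 drops out on the left-hand comparison (desired == 3):
cluster_operator_node_pool_desired_workers == 0
  and cluster_operator_node_pool_ready_workers == 0

# `or` then unions in any CAPI machine pool whose Ready condition is False, so
# a single not-ready pool is enough to produce a result for its cluster:
(cluster_operator_node_pool_desired_workers == 0
  and cluster_operator_node_pool_ready_workers == 0)
or capi_machinepool_status_condition{type="Ready", status="False"} == 1
```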
Member commented:

Thinking about it some more: what happens if there are multiple node pools (e.g. one per AZ, as is recommended)? This would equal 1 even then, wouldn't it?

Contributor (Author) commented:

Correct, this would equal 1 even if only one node pool among N is not ready.
Is that not the desired behaviour? I see that cancel_if_cluster_with_notready_nodepools is used in the DNS error rate, API server latency, and network error rate alerts, so I assumed even a single not-ready node pool would affect those.

Member commented:

I'm not sure, but I think the non-CAPI query would only match if all node pools were not ready. I'm not sure what the desired behaviour is either; I was just pointing out that it looked odd to me.

Contributor (Author) commented:

I think it works the same way as in vintage: there's a node_pool_id label on the cluster_operator_node_pool_desired_workers and cluster_operator_node_pool_ready_workers metrics.
For example, it is currently triggering for deu01, which has a mix of active node pools with nodes and node pools scaled to 0.
It does sound odd the more I think about it, though... @QuentinBisson, do you have any historical context on this? (Just because git blame points to your commit on these lines 😅)

Contributor commented:

I have no context on this; it was moved from another repo three years ago. What you need to be sure about is that the inhibition and the alert can be related via a node pool label, but I'm afraid we do not have that label in the alerts that use this inhibition.
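For readers unfamiliar with the mechanism being discussed: Alertmanager suppresses a target alert only when the labels listed under `equal` have identical values on the source and target alerts. The sketch below is hypothetical (it is not taken from this repo's actual configuration) and only illustrates the point: if node_pool_id is neither in the `equal` list nor on the target alerts, one not-ready node pool inhibits alerts for the whole cluster.

```yaml
# Hypothetical Alertmanager inhibit rule, for illustration only:
inhibit_rules:
  - source_matchers:
      - alertname = "InhibitionClusterNodePoolsNotReady"
    target_matchers:
      - cancel_if_cluster_with_notready_nodepools = "true"
    # Source and target must agree on these labels for inhibition to apply.
    # With only cluster_id here, a single not-ready pool in a cluster
    # inhibits every matching alert for that whole cluster.
    equal: ["cluster_id"]
```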

Contributor (Author) commented:

The doubt is about triggering the inhibition for just one node pool that has been scaled to 0... but if no alerts currently use the inhibition, I'm also wondering whether we actually need it 😅
I would still keep it in this PR (with the CAPI update) and remove it in a later PR if it turns out to be useless. What do you think?

Contributor (Author) commented:

(maybe it could be useful for the cluster downscaler?)

Contributor commented:

Sure, you can add it now and remove it later :D

@weseven weseven merged commit be49a73 into master May 6, 2024
5 of 6 checks passed
@weseven weseven deleted the update-inhibitions-for-capi branch May 6, 2024 07:55