Find CAPI alternatives for our inhibitions #3338

Closed
1 task
yulianedyalkova opened this issue Mar 18, 2024 · 3 comments

yulianedyalkova commented Mar 18, 2024

Motivation

Part of #3315.

As part of Atlas' efforts to migrate alerting to Mimir, they asked us to look into our inhibitions (cluster_upgrading, cluster_creating), since these use metrics coming from vintage components and only work on vintage clusters.

Todo

  • Find CAPI alternatives for our inhibitions
@yulianedyalkova yulianedyalkova converted this from a draft issue Mar 18, 2024
@yulianedyalkova yulianedyalkova moved this from Inbox 📥 to Up Next ➡️ in Roadmap Mar 19, 2024
@weseven weseven self-assigned this Apr 10, 2024
@weseven weseven moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Apr 10, 2024

weseven commented Apr 29, 2024

We can use CAPI conditions and phases (thanks @nprokopic for the suggestion!).

In particular, we already have metrics exposed by the CAPI controller.

On the cluster level (probably what we want for the inhibitions):

  • capi_cluster_status_condition has 4 types (Ready, ControlPlaneInitialized, ControlPlaneReady, InfrastructureReady), that can have 3 statuses (True, False, Unknown):

    • capi_cluster_status_condition{cluster_id="clustername", type="Ready", status="False"} is equal to 1 when the cluster is creating, deleting or updating
    • capi_cluster_status_condition{cluster_id="clustername", type="ControlPlaneInitialized", status="True"} is equal to 1 only after the cluster has finished creating (stays 1 when cluster is updating or deleting)
    • capi_cluster_status_condition{cluster_id="clustername", type="ControlPlaneReady", status="True"} is equal to 1 only when the ControlPlane is ready and available (after it has been initialized, and not when an update is rolling nodes or when the cluster is deleting)
  • capi_cluster_status_phase has different phases: in particular, we might want to use phase="Provisioning" for the cluster-creating inhibition and phase="Deleting" for the deleting one (a rough sketch follows this list)
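
As a rough sketch only (not the final rules), the cluster-level inhibitions could be raised as alerts along these lines; the alert and label names here are placeholders, not what ends up in prometheus-rules:

```yaml
# Sketch: possible Prometheus alerting rules raising inhibition alerts from the
# CAPI cluster metrics above. Alert and label names are assumptions.
groups:
  - name: capi-cluster-inhibitions
    rules:
      - alert: InhibitionClusterStatusCreating
        expr: capi_cluster_status_phase{phase="Provisioning"} == 1
        labels:
          cluster_status_creating: "true"
      - alert: InhibitionClusterStatusDeleting
        expr: capi_cluster_status_phase{phase="Deleting"} == 1
        labels:
          cluster_status_deleting: "true"
```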

On the control plane level:

  • capi_kubeadmcontrolplane_status_condition has 9 types (Available, CertificatesAvailable, ControlPlaneComponentsHealthy, EtcdClusterHealthy, MachinesCreated, MachinesReady, MachinesSpecUpToDate, Ready, Resized), with 3 statuses each (True, False, Unknown).
    • We can use ControlPlaneComponentsHealthy for an inhibition or alert specific to the control plane, and MachinesSpecUpToDate/MachinesReady for an inhibition on upgrades (see the sketch below).
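
A hedged sketch of what an upgrade inhibition based on these control plane conditions could look like; again, the alert and label names are placeholders:

```yaml
# Sketch: an upgrade inhibition derived from KubeadmControlPlane conditions.
# Alert and label names are assumptions, not the rules from prometheus-rules.
- alert: InhibitionClusterStatusUpdating
  expr: |
    capi_kubeadmcontrolplane_status_condition{type="MachinesSpecUpToDate", status="False"} == 1
    or capi_kubeadmcontrolplane_status_condition{type="MachinesReady", status="False"} == 1
  labels:
    cluster_status_updating: "true"
```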

On the worker nodes level:

  • capi_machinepool_status_phase with phase="Running" might be useful when checking for worker node availability after creation, together with
  • capi_machinepool_status_condition{type="Ready", status="True"} (the condition stays true even while the cluster is deleting).
    This can also be used for the cluster_with_notready_nodepools inhibition.
  • capi_machinepool_status_phase with phases ScalingUp and ScalingDown can be used for the cluster_with_scaling_nodepools inhibition (see the sketch after this list).
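
A rough sketch of node pool inhibitions built on these MachinePool metrics; alert and label names are placeholders:

```yaml
# Sketch: node pool inhibitions derived from the MachinePool metrics above.
# Alert and label names are assumptions.
- alert: InhibitionClusterWithScalingNodePools
  expr: capi_machinepool_status_phase{phase=~"ScalingUp|ScalingDown"} == 1
  labels:
    cluster_with_scaling_nodepools: "true"
- alert: InhibitionClusterNodePoolsNotReady
  expr: capi_machinepool_status_condition{type="Ready", status="False"} == 1
  labels:
    cluster_with_notready_nodepools: "true"
```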


weseven commented Apr 30, 2024

Opened giantswarm/prometheus-rules#1153 to update the inhibition expressions using CAPI metrics.
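
For context, inhibition alerts like the ones sketched above are typically consumed by Alertmanager roughly like this; this is illustrative only, not taken from the PR, and the matcher and label names are assumptions:

```yaml
# Illustrative Alertmanager snippet, not the configuration from
# giantswarm/prometheus-rules#1153. Matcher/label names are assumptions.
inhibit_rules:
  - source_matchers:
      - cluster_status_creating="true"
    target_matchers:
      - cancel_if_cluster_status_creating="true"
    equal: ["cluster_id"]
```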


weseven commented May 6, 2024

The PR has been merged.

Note: from the review, InhibitionClusterNodePoolsNotReady does not seem to be used currently.
giantswarm/prometheus-rules#1153 (comment)
Its expression has been updated for CAPI clusters too, but in the future we might want to evaluate whether it is actually useful. It was migrated from another repo three years ago, so we don't have context on why it was introduced.

It might be somewhat useful with the cluster downscaler, if scaling some worker node pools to 0 causes alerts, but at the moment it is not used.

@weseven weseven closed this as completed May 8, 2024
@github-project-automation github-project-automation bot moved this from In Progress ⛏️ to Done ✅ in Roadmap May 8, 2024