Find CAPI alternatives for our inhibitions #3338

Closed
1 task
yulianedyalkova opened this issue Mar 18, 2024 · 3 comments

yulianedyalkova commented Mar 18, 2024

Motivation

Part of #3315.

As part of Atlas' efforts to migrate alerting to Mimir, they asked us to look into our inhibitions (cluster_upgrading, cluster_creating), since these use metrics coming from vintage components and only work on vintage clusters.

Todo

  • Find CAPI alternatives for our inhibitions
@yulianedyalkova yulianedyalkova converted this from a draft issue Mar 18, 2024
@yulianedyalkova yulianedyalkova moved this from Inbox 📥 to Up Next ➡️ in Roadmap Mar 19, 2024
@weseven weseven self-assigned this Apr 10, 2024
@weseven weseven moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Apr 10, 2024

weseven commented Apr 29, 2024

We can use CAPI conditions and phases (thanks @nprokopic for the suggestion!).

In particular, we already have metrics exposed by the CAPI controller.

On the cluster level (probably what we want for the inhibitions):

  • capi_cluster_status_condition has 4 types (Ready, ControlPlaneInitialized, ControlPlaneReady, InfrastructureReady), that can have 3 statuses (True, False, Unknown):

    • capi_cluster_status_condition{cluster_id="clustername", type="Ready", status="False"} is equal to 1 when the cluster is creating, deleting or updating
    • capi_cluster_status_condition{cluster_id="clustername", type="ControlPlaneInitialized", status="True"} is equal to 1 only after the cluster has finished creating (stays 1 when cluster is updating or deleting)
    • capi_cluster_status_condition{cluster_id="clustername", type="ControlPlaneReady", status="True"} is equal to 1 only when the ControlPlane is ready and available (after it has been initialized, and not when an update is rolling nodes or when the cluster is deleting)
  • capi_cluster_status_phase has different phases: in particular, we might want to use phase="Provisioning" for the cluster-creating inhibition and phase="Deleting" for the deleting one (a rough sketch follows this list)
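
As a rough sketch only (not the final rules), the cluster-level inhibitions could be raised as alerts along these lines; the alert and label names here are placeholders, not what ends up in prometheus-rules:

```yaml
# Sketch: possible Prometheus alerting rules raising inhibition alerts from the
# CAPI cluster metrics above. Alert and label names are assumptions.
groups:
  - name: capi-cluster-inhibitions
    rules:
      - alert: InhibitionClusterStatusCreating
        expr: capi_cluster_status_phase{phase="Provisioning"} == 1
        labels:
          cluster_status_creating: "true"
      - alert: InhibitionClusterStatusDeleting
        expr: capi_cluster_status_phase{phase="Deleting"} == 1
        labels:
          cluster_status_deleting: "true"
```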

On the control plane level:

  • capi_kubeadmcontrolplane_status_condition has 9 types (Available, CertificatesAvailable, ControlPlaneComponentsHealthy, EtcdClusterHealthy, MachinesCreated, MachinesReady, MachinesSpecUpToDate, Ready, Resized), with 3 statuses each (True, False, Unknown).
    • We can use ControlPlaneComponentsHealthy for an inhibition or alert specific to the control plane, and MachinesSpecUpToDate/MachinesReady for an inhibition on upgrades (see the sketch below).
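
A hedged sketch of what an upgrade inhibition based on these control plane conditions could look like; again, the alert and label names are placeholders:

```yaml
# Sketch: an upgrade inhibition derived from KubeadmControlPlane conditions.
# Alert and label names are assumptions, not the rules from prometheus-rules.
- alert: InhibitionClusterStatusUpdating
  expr: |
    capi_kubeadmcontrolplane_status_condition{type="MachinesSpecUpToDate", status="False"} == 1
    or capi_kubeadmcontrolplane_status_condition{type="MachinesReady", status="False"} == 1
  labels:
    cluster_status_updating: "true"
```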

On the worker nodes level:

  • capi_machinepool_status_phase with phase="Running" might be useful when checking for worker node availability after creation, together with
  • capi_machinepool_status_condition{type="Ready", status="True"} (the condition stays true even while the cluster is deleting).
    This can also be used for the cluster_with_notready_nodepools inhibition.
  • capi_machinepool_status_phase with phases ScalingUp and ScalingDown can be used for the cluster_with_scaling_nodepools inhibition (see the sketch after this list).
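
A rough sketch of node pool inhibitions built on these MachinePool metrics; alert and label names are placeholders:

```yaml
# Sketch: node pool inhibitions derived from the MachinePool metrics above.
# Alert and label names are assumptions.
- alert: InhibitionClusterWithScalingNodePools
  expr: capi_machinepool_status_phase{phase=~"ScalingUp|ScalingDown"} == 1
  labels:
    cluster_with_scaling_nodepools: "true"
- alert: InhibitionClusterNodePoolsNotReady
  expr: capi_machinepool_status_condition{type="Ready", status="False"} == 1
  labels:
    cluster_with_notready_nodepools: "true"
```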


weseven commented Apr 30, 2024

Opened giantswarm/prometheus-rules#1153 to update the inhibition expressions using CAPI metrics.
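
For context, inhibition alerts like the ones sketched above are typically consumed by Alertmanager roughly like this; this is illustrative only, not taken from the PR, and the matcher and label names are assumptions:

```yaml
# Illustrative Alertmanager snippet, not the configuration from
# giantswarm/prometheus-rules#1153. Matcher/label names are assumptions.
inhibit_rules:
  - source_matchers:
      - cluster_status_creating="true"
    target_matchers:
      - cancel_if_cluster_status_creating="true"
    equal: ["cluster_id"]
```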


weseven commented May 6, 2024

The PR has been merged.

Note: from the review, InhibitionClusterNodePoolsNotReady does not seem to be used currently.
giantswarm/prometheus-rules#1153 (comment)
Its expression has been updated for CAPI clusters too, but in the future we might want to evaluate whether it is actually useful. It was migrated from another repo three years ago, so we don't have context on why it was introduced.

It might be somewhat useful with the cluster downscaler, if scaling some worker node pools to 0 causes alerts, but at the moment it is not used.

@weseven weseven closed this as completed May 8, 2024
@github-project-automation github-project-automation bot moved this from In Progress ⛏️ to Done ✅ in Roadmap May 8, 2024