add self-healing flavor and docs on machine health checks
1 parent d25603f, commit c588511
Showing 7 changed files with 100 additions and 3 deletions.
@@ -0,0 +1,25 @@
# Machine Health Checks

CAPL supports auto-remediation of workload cluster Nodes considered to be unhealthy
via [`MachineHealthChecks`](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking).

## Enabling Machine Health Checks

While it is possible to manually create a `MachineHealthCheck` resource and apply it to the management cluster,
using the `self-healing` flavor is the quickest way to get started:
```sh
clusterctl generate cluster $CLUSTER_NAME \
    --kubernetes-version v1.29.1 \
    --infrastructure linode:0.0.0 \
    --flavor self-healing \
    | kubectl apply -f -
```

This flavor deploys a `MachineHealthCheck` for the workers and another `MachineHealthCheck` for the control plane
of the cluster. It also configures the remediation strategy of the kubeadm control plane to prevent unnecessary load
on the infrastructure provider.

## Configuring Machine Health Checks

Refer to the [Cluster API documentation](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking)
for further information on configuring and using `MachineHealthChecks`.
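
To confirm that the flavor created both checks, something like the following could be run against the management cluster. This is a minimal sketch only: the helper name `show_machine_health_checks` and the `default` namespace fallback are assumptions for illustration and are not part of this commit.

```sh
#!/bin/sh
# Hypothetical helper (not part of CAPL): list the MachineHealthCheck
# resources in one namespace of the management cluster.
# Assumes kubectl's current context points at the management cluster.
show_machine_health_checks() {
    # Namespace the cluster manifests were applied to; "default" is an assumption.
    namespace="${1:-default}"
    kubectl get machinehealthchecks --namespace "$namespace" -o wide
}
```

Calling `show_machine_health_checks` after applying the generated manifests should list one check named `${CLUSTER_NAME}-node-unhealthy-5m` and one named `${CLUSTER_NAME}-kcp-unhealthy-5m`, matching the names used in the template added by this commit.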
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - machinehealthcheck.yaml
46 changes: 46 additions & 0 deletions
templates/addons/machine-health-check/machinehealthcheck.yaml
@@ -0,0 +1,46 @@
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-node-unhealthy-5m
spec:
  clusterName: ${CLUSTER_NAME}
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
  maxUnhealthy: 40%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy.
  # Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered unhealthy when
  # the Node it created has not yet registered with the cluster. This can be useful when
  # Nodes take a long time to start up or when you only want condition based checks for
  # Machine health.
  nodeStartupTimeout: 10m
  # Conditions to check on Nodes for matched Machines, if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: ${CLUSTER_NAME}-md-0
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-kcp-unhealthy-5m
spec:
  clusterName: ${CLUSTER_NAME}
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
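
The two `maxUnhealthy` values above behave quite differently. As the Cluster API documentation describes it, a percentage is applied to the number of Machines the check targets and rounded down, and remediation proceeds only while the unhealthy count does not exceed that result. The sketch below reproduces just that arithmetic; the function name `max_remediable` is hypothetical and the Machine counts are made-up inputs, not values from this commit.

```sh
#!/bin/sh
# max_remediable TOTAL PERCENT
# Prints the largest number of unhealthy Machines at which remediation still
# proceeds, assuming the percentage is rounded down (integer division).
max_remediable() {
    total="$1"
    percent="$2"
    echo $(( total * percent / 100 ))
}

max_remediable 5 40    # workers at 40%: prints 2
max_remediable 3 100   # control plane at 100%: prints 3
```

Under these assumptions, a five-node worker pool stops being remediated once three or more Nodes are unhealthy, while the 100% setting on the control plane check means the threshold never blocks remediation there.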
@@ -0,0 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../default
  - ../../addons/machine-health-check
patches:
  - target:
      group: controlplane.cluster.x-k8s.io
      version: v1beta1
      kind: KubeadmControlPlane
    patch: |-
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlane
      metadata:
        name: ${CLUSTER_NAME}-control-plane
      spec:
        remediationStrategy:
          maxRetry: 5
          retryPeriod: 2m
          minHealthyPeriod: 2h
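
In the patch above, `maxRetry` caps how many times a replacement for an unhealthy control plane Machine is itself remediated, `retryPeriod` is how long KubeadmControlPlane waits before such a retry, and `minHealthyPeriod` is the time after which a new failure is treated as unrelated and the retry counter starts over. To check that the patch reached a generated cluster, a sketch along these lines may help. The helper name `show_kcp_remediation` and the `default` namespace fallback are assumptions; the `-control-plane` suffix follows the name used in the patch.

```sh
#!/bin/sh
# Hypothetical helper (not part of CAPL): print the remediationStrategy of a
# cluster's KubeadmControlPlane as stored in the management cluster.
# Assumes kubectl's current context points at the management cluster.
show_kcp_remediation() {
    cluster_name="$1"
    # Namespace holding the cluster objects; "default" is an assumption.
    namespace="${2:-default}"
    kubectl get kubeadmcontrolplane "${cluster_name}-control-plane" \
        --namespace "$namespace" \
        -o jsonpath='{.spec.remediationStrategy}'
}
```

For a cluster generated with the `self-healing` flavor, `show_kcp_remediation "$CLUSTER_NAME"` would be expected to show the three values set in the patch.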