Add runbook for KubeMemoryOvercommit
Signed-off-by: Paul Gier <[email protected]>
pgier committed Jun 2, 2022
1 parent a6bc8ea commit 03af229
Showing 1 changed file with 33 additions and 0 deletions.
33 changes: 33 additions & 0 deletions content/runbooks/kubernetes/KubeMemoryOvercommit.md
@@ -0,0 +1,33 @@
# KubeMemoryOvercommit

## Meaning

This alert fires if the cluster does not have enough total memory to tolerate failure of the largest node in the cluster.

<details>
<summary>Full context</summary>

Each pod requests a certain amount of memory through the container spec field `resources.requests.memory`. This value is also
exposed by the metric `kube_pod_container_resource_requests{resource="memory"}`. If a node fails, some of its pods may not be
rescheduled because the remaining nodes lack the resources. It is therefore recommended that the cluster has enough total resources
to tolerate a failure of the largest node, at least until that node is replaced.

This alert is calculated by comparing the total memory requested by the pods to the total memory available in the cluster minus the
amount of memory on the largest node. It fires when the requested total is greater than that remainder.
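
The comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the alert rule itself: the function names and the sample node and pod sizes are hypothetical, and whether "memory available" means node capacity or allocatable memory depends on the rule definition deployed in your cluster.

```python
def tolerable_capacity(node_memory_bytes):
    """Memory left in the cluster if the largest node fails."""
    if not node_memory_bytes:
        return 0
    return sum(node_memory_bytes) - max(node_memory_bytes)


def is_memory_overcommitted(pod_request_bytes, node_memory_bytes):
    """True when total requests exceed what remains after losing the largest node."""
    return sum(pod_request_bytes) > tolerable_capacity(node_memory_bytes)


GIB = 1024 ** 3
nodes = [16 * GIB, 16 * GIB, 32 * GIB]      # hypothetical memory per node
requests = [10 * GIB, 12 * GIB, 14 * GIB]   # hypothetical memory requests per pod

# Requests total 36 GiB, but only 32 GiB remain without the 32 GiB node.
print(is_memory_overcommitted(requests, nodes))
```

In this made-up cluster the check reports an overcommit even though the requests fit comfortably while all three nodes are healthy.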

</details>

## Impact

This alert has no immediate impact. However, if a node failure occurs, cluster availability will likely be affected because some pods may not be rescheduled.

## Diagnosis

Check the number and types of nodes being used in the cluster to decide if an additional node is needed. This alert could also be caused by an imbalance of node groups. For example, if a single large node runs an app with a large memory requirement, that app may not be schedulable
on any other node if that one large node fails.

## Mitigation

Adding a node (of the largest type in the cluster) or reducing the pod memory requests will normally resolve this issue.
Alternatively, if there are multiple node groups of different types, it may be possible to re-balance the cluster by adding a large
node and removing some small nodes.
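
To compare mitigation options before changing the cluster, the headroom can be estimated for each option. The following is a minimal sketch assuming the alert condition is "total requests greater than total memory minus the largest node"; `headroom_bytes` and all sizes are hypothetical and do not come from any Kubernetes tooling.

```python
def headroom_bytes(pod_request_bytes, node_memory_bytes):
    """Memory remaining after the largest node fails, minus total requests.

    A negative result means the overcommit condition holds.
    """
    if not node_memory_bytes:
        return -sum(pod_request_bytes)
    remaining = sum(node_memory_bytes) - max(node_memory_bytes)
    return remaining - sum(pod_request_bytes)


GIB = 1024 ** 3
nodes = [16 * GIB, 16 * GIB, 32 * GIB]   # hypothetical current nodes
requests = [12 * GIB, 12 * GIB, 12 * GIB]  # hypothetical pod requests

options = {
    "current": headroom_bytes(requests, nodes),
    "add one large node": headroom_bytes(requests, nodes + [32 * GIB]),
    "reduce each request to 10 GiB": headroom_bytes([10 * GIB] * 3, nodes),
}

for name, value in options.items():
    status = "ok" if value >= 0 else "overcommitted"
    print(f"{name}: {value // GIB} GiB headroom ({status})")
```

Note that adding a node larger than any existing one raises the size of the "largest node" as well, so the headroom gained is the size of the previous largest node, not of the new one.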
