Author: Derek Carr (@derekwaynecarr)
Status: Proposed
This document presents a specification for how the kubelet evicts pods when compute resources are too low.
The node needs a mechanism to preserve stability when available compute resources are low.
This is especially important when dealing with incompressible compute resources such as memory or disk. If either resource is exhausted, the node would become unstable.
The kubelet has some existing support for influencing system behavior in response to a system OOM: the system OOM killer sees higher OOM score adjustments for containers that have consumed the largest amount of memory relative to their request. System OOM events are compute intensive and can stall the node until the OOM killing process has completed. In addition, the node is prone to return to an unstable state, since the containers killed due to OOM are either restarted or a new pod is scheduled onto the node.
Instead, we would prefer a system where the kubelet can pro-actively monitor for and prevent total starvation of a compute resource, and in cases where starvation appears imminent, pro-actively fail one or more pods so the workload can be moved and scheduled elsewhere when/if its backing controller creates a new pod.
This proposal defines a pod eviction policy for reclaiming compute resources.
In the first iteration, it focuses on memory; later iterations are expected to cover other resources like disk. The proposal focuses on a simple default eviction strategy intended to cover the broadest class of user workloads.
The kubelet will support the ability to trigger eviction decisions on the following signals.
| Eviction Signal | Description |
|-----------------|-------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
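For illustration, here is a minimal Go sketch of that derivation. The type and field names are illustrative only; the real kubelet sources these values from the node's capacity and cAdvisor's memory statistics.

```go
package main

import "fmt"

// NodeStats holds the two values needed to derive memory.available.
// Field names are illustrative, not the kubelet's actual types.
type NodeStats struct {
	MemoryCapacityBytes   int64 // node.status.capacity[memory]
	MemoryWorkingSetBytes int64 // node.stats.memory.workingSet
}

// memoryAvailable implements memory.available := capacity - workingSet.
func memoryAvailable(s NodeStats) int64 {
	return s.MemoryCapacityBytes - s.MemoryWorkingSetBytes
}

func main() {
	s := NodeStats{MemoryCapacityBytes: 8 << 30, MemoryWorkingSetBytes: 6 << 30}
	fmt.Printf("memory.available = %d bytes\n", memoryAvailable(s))
}
```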
The kubelet will support the ability to specify eviction thresholds.
An eviction threshold is of the following form: `<eviction-signal><operator><quantity>`

- valid `eviction-signal` tokens are as defined above.
- valid `operator` tokens are `<`.
- valid `quantity` tokens must match the quantity representation used by Kubernetes.
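As an illustration of this grammar, a rough Go sketch that splits an expression such as `memory.available<100Mi` into its signal, operator, and quantity parts. Quantity parsing here is simplified to a few binary suffixes; the real kubelet uses the full Kubernetes quantity parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is the parsed form of <eviction-signal><operator><quantity>.
type threshold struct {
	signal   string
	operator string // only "<" is valid per this proposal
	value    int64  // quantity in bytes
}

// parseThreshold splits an expression like "memory.available<100Mi".
// Only whole-number quantities with Ki/Mi/Gi suffixes are handled here.
func parseThreshold(expr string) (threshold, error) {
	i := strings.Index(expr, "<")
	if i < 0 {
		return threshold{}, fmt.Errorf("missing operator in %q", expr)
	}
	signal, qty := expr[:i], expr[i+1:]
	suffixes := map[string]int64{"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30}
	mult := int64(1)
	for suffix, m := range suffixes {
		if strings.HasSuffix(qty, suffix) {
			qty, mult = strings.TrimSuffix(qty, suffix), m
			break
		}
	}
	n, err := strconv.ParseInt(qty, 10, 64)
	if err != nil {
		return threshold{}, fmt.Errorf("bad quantity in %q: %v", expr, err)
	}
	return threshold{signal: signal, operator: "<", value: n * mult}, nil
}

func main() {
	t, err := parseThreshold("memory.available<100Mi")
	fmt.Println(t, err)
}
```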
If threshold criteria are met, the kubelet will take pro-active action to attempt to reclaim the starved compute resource associated with the eviction signal.
The kubelet will support soft and hard eviction thresholds.
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. No action is taken by the kubelet to reclaim resources associated with the eviction signal until that grace period has been exceeded. If no grace period is provided, the kubelet will error on startup.
In addition, if a soft eviction threshold has been met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the kubelet will use the lesser of `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period. If not specified, the kubelet will kill pods immediately with no graceful termination.
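A small Go sketch of that grace period selection, with illustrative names: the lesser of the pod's own termination grace period and the operator-configured maximum, or zero when no maximum is configured.

```go
package main

import "fmt"

// effectiveGracePeriodSeconds returns the grace period used when evicting a
// pod for a soft threshold: the lesser of the pod's own
// terminationGracePeriodSeconds and the configured maximum. A zero maximum
// means kill immediately with no graceful termination.
func effectiveGracePeriodSeconds(podGracePeriod, maxAllowed int64) int64 {
	if maxAllowed <= 0 {
		return 0
	}
	if podGracePeriod < maxAllowed {
		return podGracePeriod
	}
	return maxAllowed
}

func main() {
	fmt.Println(effectiveGracePeriodSeconds(30, 10)) // 10: capped by the max
	fmt.Println(effectiveGracePeriodSeconds(5, 10))  // 5: pod's own value is lesser
	fmt.Println(effectiveGracePeriodSeconds(30, 0))  // 0: no graceful termination
}
```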
To configure soft eviction thresholds, the following flags will be supported:

- `--eviction-soft=""`: A set of eviction thresholds (e.g. `memory.available<1.5Gi`) that, if met over a corresponding grace period, would trigger a pod eviction.
- `--eviction-soft-grace-period=""`: A set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
- `--eviction-max-pod-grace-period="0"`: Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
A hard eviction threshold has no grace period, and if observed, the kubelet will take immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the kubelet will kill the pod immediately with no graceful termination.
To configure hard eviction thresholds, the following flag will be supported:

- `--eviction-hard=""`: A set of eviction thresholds (e.g. `memory.available<1Gi`) that, if met, would trigger a pod eviction.
The kubelet will initially evaluate eviction thresholds at the same housekeeping interval as cAdvisor housekeeping. In Kubernetes 1.2, this defaulted to `10s`.

It is a goal to shrink the monitoring interval to a much shorter window. This may require changes to cAdvisor to let alternate housekeeping intervals be specified for selected data (google/cadvisor#1247).

For the purposes of this proposal, we expect the monitoring interval to be no more than `10s` to know when a threshold has been triggered, but we will strive to reduce that latency, time permitting.
The kubelet will support a node condition that corresponds to each eviction signal. If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet will report a condition that reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|----------------|-----------------|-------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
The kubelet will continue to report node status updates at the frequency specified by `--node-status-update-frequency`, which defaults to `10s`.
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, it would cause the corresponding node condition to constantly oscillate between true and false, and could cause poor scheduling decisions as a consequence.
To protect against this oscillation, the following flag is defined to control how long the kubelet must wait before transitioning out of a pressure condition.

- `--eviction-pressure-transition-period=5m0s`: Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.

The kubelet would ensure that it has not observed an eviction threshold being met for the specified pressure condition for the specified period before toggling the condition back to `false`.
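A minimal Go sketch of that dampening rule, with illustrative names: the condition only clears once the threshold has not been met for the full transition period.

```go
package main

import (
	"fmt"
	"time"
)

// pressureState tracks when an eviction threshold was last observed as met,
// so a node condition is only cleared after the transition period elapses.
// Names are illustrative, not the kubelet's actual types.
type pressureState struct {
	lastObservedMet  time.Time
	transitionPeriod time.Duration
}

// underPressure reports whether the node condition should still be true at
// time now: either the threshold is currently met, or it was met within the
// last transitionPeriod. The caller is assumed to update lastObservedMet
// whenever the threshold is observed as met.
func (p pressureState) underPressure(thresholdMet bool, now time.Time) bool {
	if thresholdMet {
		return true
	}
	return now.Sub(p.lastObservedMet) < p.transitionPeriod
}

func main() {
	p := pressureState{
		lastObservedMet:  time.Now().Add(-2 * time.Minute),
		transitionPeriod: 5 * time.Minute,
	}
	fmt.Println(p.underPressure(false, time.Now())) // true: still within the 5m window
}
```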
Let's assume the operator started the kubelet with the following:

- `--eviction-hard="memory.available<100Mi"`
- `--eviction-soft="memory.available<300Mi"`
- `--eviction-soft-grace-period="memory.available=30s"`
The kubelet will run a sync loop that looks at the available memory on the node, as reported by cAdvisor, by calculating (capacity - workingSet). If available memory is observed to drop below 100Mi, the kubelet will immediately initiate eviction. If available memory is observed as falling below 300Mi, it will record when that signal was observed internally in a cache. If at the next sync that criterion is no longer satisfied, the cache is cleared for that signal. If that signal is observed as being satisfied for longer than the specified period, the kubelet will initiate eviction to attempt to reclaim the resource that has met its eviction threshold.
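A rough Go sketch of that decision logic for this example configuration. The names and the package-level cache are illustrative; the kubelet's eviction manager is more involved.

```go
package main

import (
	"fmt"
	"time"
)

const (
	hardThreshold   = 100 << 20 // memory.available < 100Mi
	softThreshold   = 300 << 20 // memory.available < 300Mi
	softGracePeriod = 30 * time.Second
)

// softObservedAt records when the soft threshold was first observed as met;
// the zero value means it is not currently met.
var softObservedAt time.Time

// shouldEvict implements the example policy: evict immediately below the hard
// threshold, and below the soft threshold only once it has held for the grace period.
func shouldEvict(availableBytes int64, now time.Time) bool {
	if availableBytes < hardThreshold {
		return true
	}
	if availableBytes < softThreshold {
		if softObservedAt.IsZero() {
			softObservedAt = now // start tracking the soft signal
			return false
		}
		return now.Sub(softObservedAt) >= softGracePeriod
	}
	softObservedAt = time.Time{} // signal cleared; reset the cache
	return false
}

func main() {
	now := time.Now()
	fmt.Println(shouldEvict(90<<20, now))                      // true: hard threshold met
	fmt.Println(shouldEvict(250<<20, now))                     // false: soft threshold just observed
	fmt.Println(shouldEvict(250<<20, now.Add(45*time.Second))) // true: held past the 30s grace period
}
```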
If an eviction threshold has been met, the kubelet will initiate the process of evicting pods until it has observed the signal has gone below its defined threshold.
The eviction sequence works as follows:

- for each monitoring interval, if eviction thresholds have been met:
  - find candidate pod
  - fail the pod
  - block until pod is terminated on node
If a pod is not terminated because a container does not happen to die (e.g. processes stuck in disk IO), the kubelet may select an additional pod to fail instead. The kubelet will invoke the `KillPod` operation exposed on the runtime interface. If an error is returned, the kubelet will select a subsequent pod.
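A short Go sketch of that fallback behavior. The `pod` type and `killPod` function stand in for the kubelet's runtime interface and are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// pod and killPod are illustrative stand-ins for the runtime interface.
type pod struct{ name string }

func killPod(p pod) error {
	if p.name == "stuck-pod" {
		return errors.New("containers did not terminate")
	}
	return nil
}

// evictOne walks the ranked candidates and stops at the first pod that is
// successfully killed; on error it falls through to the next candidate.
func evictOne(candidates []pod) (pod, error) {
	for _, p := range candidates {
		if err := killPod(p); err != nil {
			fmt.Printf("failed to evict %s: %v; trying next candidate\n", p.name, err)
			continue
		}
		return p, nil
	}
	return pod{}, errors.New("no candidate pod could be evicted")
}

func main() {
	evicted, err := evictOne([]pod{{"stuck-pod"}, {"best-effort-pod"}})
	fmt.Println(evicted.name, err)
}
```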
The kubelet will implement a default eviction strategy oriented around the pod quality of service class.
It will target pods that are the largest consumers of the starved compute resource relative to their scheduling request. It ranks pods within a quality of service tier in the following order.
- `BestEffort` pods that consume the most of the starved resource are failed first.
- `Burstable` pods that consume the greatest amount of the starved resource relative to their request for that resource are killed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
- `Guaranteed` pods that consume the greatest amount of the starved resource relative to their request are killed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
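A simplified Go sketch of that ranking, ordering candidates by QoS class and then by how far usage exceeds the request. The types are illustrative, and the tie-breaking rule above (falling back to the largest absolute consumer) is compressed into a single sort key here.

```go
package main

import (
	"fmt"
	"sort"
)

// qosClass ordering for eviction: BestEffort first, Guaranteed last.
type qosClass int

const (
	bestEffort qosClass = iota
	burstable
	guaranteed
)

// candidate is an illustrative stand-in for a pod plus its observed usage of
// the starved resource and its scheduling request for that resource.
type candidate struct {
	name    string
	class   qosClass
	usage   int64 // bytes of the starved resource in use
	request int64 // bytes requested (0 for BestEffort)
}

// rank orders candidates for eviction: by QoS class first, then by how far
// usage exceeds the request.
func rank(pods []candidate) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].class != pods[j].class {
			return pods[i].class < pods[j].class
		}
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []candidate{
		{"guaranteed-a", guaranteed, 500 << 20, 500 << 20},
		{"burstable-a", burstable, 700 << 20, 200 << 20},
		{"besteffort-a", bestEffort, 100 << 20, 0},
	}
	rank(pods)
	for _, p := range pods {
		fmt.Println(p.name)
	}
}
```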
A guaranteed pod is guaranteed to never be evicted because of another pod's resource consumption. That said, guarantees are only as good as the underlying foundation they are built upon. If a system daemon (e.g. `kubelet`, `docker`, `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has guaranteed pod(s) remaining, then the node must choose to evict a guaranteed pod in order to preserve node stability and to limit the impact of the unexpected consumption on other guaranteed pod(s).
The kubelet will reject `BestEffort` pods if any of its associated eviction thresholds have been exceeded, independent of the configured grace period.
Let's assume the operator started the kubelet with the following:

- `--eviction-soft="memory.available<256Mi"`
- `--eviction-soft-grace-period="memory.available=30s"`
If the kubelet sees that it has less than `256Mi` of memory available on the node, but has not yet initiated eviction because the grace period criteria has not yet been met, it will still immediately fail any incoming best effort pods.

The reasoning for this decision is the expectation that the incoming pod is likely to further starve the particular compute resource, and the kubelet should return to a steady state before accepting new workloads.
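A minimal Go sketch of that admission check. The inputs are illustrative booleans rather than the kubelet's actual admission plumbing.

```go
package main

import "fmt"

// admitPod sketches the admission check described above: once any eviction
// threshold has been met (grace period aside), new BestEffort pods are rejected.
func admitPod(isBestEffort, anyThresholdMet bool) (bool, string) {
	if isBestEffort && anyThresholdMet {
		return false, "node is under memory pressure; rejecting BestEffort pod"
	}
	return true, ""
}

func main() {
	ok, reason := admitPod(true, true)
	fmt.Println(ok, reason)
}
```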
The node will report a condition when a compute resource is under pressure. The scheduler should view that condition as a signal to dissuade placing additional best effort pods on the node. In this case, a `MemoryPressure` condition that is `true` should dissuade the scheduler from placing new best effort pods on the node, since they will be rejected by the kubelet in admission.
It is never desired for a kubelet to evict a pod that was derived from a `DaemonSet`, since the pod will immediately be recreated and rescheduled back to the same node.
At the moment, the kubelet has no ability to distinguish a pod created from a `DaemonSet` versus any other object. If/when that information is available, the kubelet could pro-actively filter those pods from the candidate set of pods provided to the eviction strategy, as sketched below.
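A hypothetical Go sketch of that filtering, assuming ownership information were available. The `ownedByDaemonSet` field stands in for information the kubelet does not have today.

```go
package main

import "fmt"

// candidatePod is illustrative; ownedByDaemonSet stands in for ownership
// information the kubelet does not currently have.
type candidatePod struct {
	name             string
	ownedByDaemonSet bool
}

// filterDaemonSetPods drops DaemonSet-owned pods from the eviction candidate set.
func filterDaemonSetPods(pods []candidatePod) []candidatePod {
	kept := pods[:0]
	for _, p := range pods {
		if !p.ownedByDaemonSet {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	pods := []candidatePod{{"fluentd-abc", true}, {"web-xyz", false}}
	fmt.Println(filterDaemonSetPods(pods))
}
```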
In general, it should be strongly recommended that a `DaemonSet` not create `BestEffort` pods, to avoid them being identified as candidate pods for eviction.