Karpenter is an open-source project designed to enhance node lifecycle management within Kubernetes clusters. It automates provisioning and deprovisioning of nodes based on the specific scheduling needs of pods, allowing efficient scaling and cost optimization. Its main functions are:
-
Monitor pods that the Kubernetes scheduler cannot schedule due to resource constraints.
-
Evaluate the scheduling requirements (resource requests, node selectors, affinities, tolerations, etc.) of the unschedulable pods.
-
Provision new nodes that meet the requirements of those pods.
-
Remove nodes when they are no longer needed.
With Karpenter, you can define NodePools with constraints on node provisioning like taints, labels, requirements (instance types, zones, etc.), and limits on total provisioned resources. When deploying workloads, you can specify various scheduling constraints in the pod specifications like resource requests/limits, node selectors, node/pod affinities, tolerations, and topology spread constraints. Karpenter will then provision right sized nodes based on these specifications.
Reasons to use Karpenter
Before the launch of Karpenter, Kubernetes users relied primarily on Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler (CAS) to dynamically adjust the compute capacity of their clusters. With Karpenter, you don’t need to create dozens of node groups to achieve the flexibility and diversity you get with Karpenter. Unlike CAS, Karpenter is not as tightly coupled to Kubernetes versions and doesn’t require you to jump between AWS and Kubernetes APIs.
Karpenter consolidates instance orchestration responsibilities within a single system, which is simpler, more stable and cluster-aware. Karpenter was designed to overcome some of the challenges presented by Cluster Autoscaler by providing simplified ways to:
-
Provision nodes based on workload requirements.
-
Create diverse node configurations by instance type, using flexible NodePool options. Instead of managing many specific custom node groups, Karpenter could let you manage diverse workload capacity with a single, flexible NodePool.
-
Achieve improved pod scheduling at scale by quickly launching nodes and scheduling pods.
For information and documentation on using Karpenter, visit the karpenter.sh site.
Best practices are divided into sections on Karpenter itself, NodePools, and pod scheduling.
The following best practices cover topics related to Karpenter itself.
Karpenter brings scaling management closer to Kubernetes native APIs than do Autoscaling Groups (ASGs) and Managed Node Groups (MNGs). ASGs and MNGs are AWS-native abstractions where scaling is triggered based on AWS level metrics, such as EC2 CPU load. Cluster Autoscaler bridges the Kubernetes abstractions into AWS abstractions, but loses some flexibility because of that, such as scheduling for a specific availability zone.
Karpenter removes a layer of AWS abstraction to bring some of the flexibility directly into Kubernetes. Karpenter is best used for clusters with workloads that encounter periods of high, spiky demand or have diverse compute requirements. MNGs and ASGs are good for clusters running workloads that tend to be more static and consistent. You can use a mix of dynamically and statically managed nodes, depending on your requirements.
You need features that are still being developed in Karpenter. Because Karpenter is a relatively new project, consider other autoscaling projects for the time being if you have a need for features that are not yet part of Karpenter.
Karpenter is installed using a [Helm chart](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/#4-install-karpenter).
The Helm chart installs the Karpenter controller and a webhook pod as a Deployment
that needs to run before the controller can be used for scaling your cluster.
We recommend a minimum of one small node group with at least one worker
node. As an alternative, you can run these pods on EKS Fargate by
creating a Fargate profile for the karpenter
namespace. Doing so
will cause all pods deployed into this namespace to run on EKS Fargate.
Do not run Karpenter on a node that is managed by Karpenter.
There is no custom launch template support with v1 APIs. You can use custom user data and/or directly specifying custom AMIs in the EC2NodeClass. More information on how to do this is available at NodeClasses.
Consider excluding specific instances types with the
node.kubernetes.io/instance-type
key if they are not required by workloads running in your cluster.
The following example shows how to avoid provisioning large Graviton instances.
- key: node.kubernetes.io/instance-type
operator: NotIn
values:
- m6g.16xlarge
- m6gd.16xlarge
- r6g.16xlarge
- r6gd.16xlarge
- c6g.16xlarge
Karpenter supports
native
interruption handling and can handle involuntary interruption events
like Spot Instance interruptions, scheduled maintenance events, instance
termination/stopping events that could disrupt your workloads. When
Karpenter detects such events for nodes, it automatically taints, drains
and terminates the affected nodes ahead of time to start graceful
cleanup of workloads before disruption. For Spot interruptions with 2
minute notice, Karpenter quickly starts a new node so pods can be moved
before the instance is reclaimed. To enable interruption handling, you
configure the --interruption-queue
CLI argument with the name of the
SQS queue provisioned for this purpose. It is not advised to use
Karpenter interruption handling alongside Node Termination Handler as
explained here.
Pods that require checkpointing or other forms of graceful draining, requiring the 2-mins before shutdown should enable Karpenter interruption handling in their clusters.
When provisioning an EKS Cluster into a VPC with no route to the internet, you have to make sure you’ve configured your environment in accordance with the private cluster requirements that appear in EKS documentation. In addition, you need to make sure you’ve created an STS VPC regional endpoint in your VPC. If not, you will see errors similar to those that appear below.
{"level":"FATAL","time":"2024-02-29T14:28:34.392Z","logger":"controller","message":"Checking EC2 API connectivity, WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.<region>.amazonaws.com/\": dial tcp 54.239.32.126:443: i/o timeout","commit":"596ea97"}
These changes are necessary in a private cluster because the Karpenter Controller uses IAM Roles for Service Accounts (IRSA). Pods configured with IRSA acquire credentials by calling the AWS Security Token Service (AWS STS) API. If there is no outbound internet access, you must create and use an AWS STS VPC endpoint in your VPC.
Private clusters also require you to create a VPC endpoint for SSM. When Karpenter tries to provision a new node, it queries the Launch template configs and an SSM parameter. If you do not have a SSM VPC endpoint in your VPC, it will cause the following error:
{"level":"ERROR","time":"2024-02-29T14:28:12.889Z","logger":"controller","message":"Unable to hydrate the AWS launch template cache, RequestCanceled: request context canceled\ncaused by: context canceled","commit":"596ea97","tag-key":"karpenter.k8s.aws/cluster","tag-value":"eks-workshop"}
...
{"level":"ERROR","time":"2024-02-29T15:08:58.869Z","logger":"controller.nodeclass","message":"discovering amis from ssm, getting ssm parameter \"/aws/service/eks/optimized-ami/1.27/amazon-linux-2/recommended/image_id\", RequestError: send request failed\ncaused by: Post \"https://ssm.<region>.amazonaws.com/\": dial tcp 67.220.228.252:443: i/o timeout","commit":"596ea97","ec2nodeclass":"default","query":"/aws/service/eks/optimized-ami/1.27/amazon-linux-2/recommended/image_id"}
There is no VPC endpoint for the Price List Query API. As a result, pricing data will go stale over time. Karpenter gets around this by including on-demand pricing data in its binary, but only updates that data when Karpenter is upgraded. Failed requests for pricing data will result in the following error messages:
{"level":"ERROR","time":"2024-02-29T15:08:58.522Z","logger":"controller.pricing","message":"retreiving on-demand pricing data, RequestError: send request failed\ncaused by: Post \"https://api.pricing.<region>.amazonaws.com/\": dial tcp 18.196.224.8:443: i/o timeout; RequestError: send request failed\ncaused by: Post \"https://api.pricing.<region>.amazonaws.com/\": dial tcp 18.185.143.117:443: i/o timeout","commit":"596ea97"}
Refer to this documentation to use Karpenter in a completely Private EKS Clusters and to know which VPC endpoints to be created.
The following best practices cover topics related to creating NodePools.
When different teams are sharing a cluster and need to run their workloads on different worker nodes, or have different OS or instance type requirements, create multiple NodePools. For example, one team may want to use Bottlerocket, while another may want to use Amazon Linux. Likewise, one team might have access to expensive GPU hardware that wouldn’t be needed by another team. Using multiple NodePools makes sure that the most appropriate assets are available to each team.
It is recommended to create NodePools that are either mutually exclusive or weighted to provide consistent scheduling behavior. If they are not and multiple NodePools are matched, Karpenter will randomly choose which to use, causing unexpected results. Useful examples for creating multiple NodePools include the following:
Creating a NodePool with GPU and only allowing special workloads to run on these (expensive) nodes:
# NodePool for GPU Instances with Taints
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu
spec:
disruption:
consolidateAfter: 1m
consolidationPolicy: WhenEmptyOrUnderutilized
template:
metadata: {}
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
expireAfter: Never
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values:
- p3.8xlarge
- p3.16xlarge
- key: kubernetes.io/os
operator: In
values:
- linux
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
taints:
- effect: NoSchedule
key: nvidia.com/gpu
value: "true"
Deployment with toleration for the taint:
# Deployment of GPU Workload will have tolerations defined
apiVersion: apps/v1
kind: Deployment
metadata:
name: inflate-gpu
spec:
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
For a general deployment for another team, the NodePool spec could
include nodeAffinity. A Deployment could then use nodeSelectorTerms to
match billing-team
.
# NodePool for regular EC2 instances
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: generalcompute
spec:
template:
metadata:
labels:
billing-team: my-team
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
expireAfter: Never
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values:
- m5.large
- m5.xlarge
- m5.2xlarge
- c5.large
- c5.xlarge
- c5a.large
- c5a.xlarge
- r5.large
- r5.xlarge
- key: kubernetes.io/os
operator: In
values:
- linux
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
Deployment using nodeAffinity:
# Deployment will have spec.affinity.nodeAffinity defined
kind: Deployment
metadata:
name: workload-my-team
spec:
replicas: 200
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "billing-team"
operator: "In"
values: ["my-team"]
You can use timers on provisioned nodes to set when to delete nodes that
are devoid of workload pods or have reached an expiration time. Node
expiry can be used as a means of upgrading, so that nodes are retired
and replaced with updated versions. See
Expiration in the
Karpenter documentation for information on using
spec.template.spec
to configure node expiry.
Avoid overly constraining the Instance Types that Karpenter can provision, especially when utilizing Spot
When using Spot, Karpenter uses the Price Capacity Optimized allocation strategy to provision EC2 instances. This strategy instructs EC2 to provision instances from the deepest pools for the number of instances that you are launching and have the lowest risk of interruption. EC2 Fleet then requests Spot instances from the lowest priced of these pools. The more instance types you allow Karpenter to utilize, the better EC2 can optimize your spot instance’s runtime. By default, Karpenter will use all Instance Types EC2 offers in the region and availability zones your cluster is deployed in. Karpenter intelligently chooses from the set of all instance types based on pending pods to make sure your pods are scheduled onto appropriately sized and equipped instances. For example, if your pod does not require a GPU, Karpenter will not schedule your pod to an EC2 instance type supporting a GPU. When you’re unsure about which instance types to use, you can run the Amazon ec2-instance-selector to generate a list of instance types that match your compute requirements. For example, the CLI takes memory vCPU, architecture, and region as input parameters and provides you with a list of EC2 instances that satisfy those constraints.
$ ec2-instance-selector --memory 4 --vcpus 2 --cpu-architecture x86_64 -r ap-southeast-1
c5.large
c5a.large
c5ad.large
c5d.large
c6i.large
t2.medium
t3.medium
t3a.medium
You shouldn’t place too many constraints on Karpenter when using Spot instances because doing so can affect the availability of your applications. Say, for example, all of the instances of a particular type are reclaimed and there are no suitable alternatives available to replace them. Your pods will remain in a pending state until the spot capacity for the configured instance types is replenished. You can reduce the risk of insufficient capacity errors by spreading your instances across different availability zones, because spot pools are different across AZs. That said, the general best practice is to allow Karpenter to use a diverse set of instance types when using Spot.
The following best practices relate to deploying pods In a cluster using Karpenter for node provisioning.
If you need to run highly available applications, follow general EKS best practice recommendations. See Topology Spread in Karpenter documentation for details on how to spread pods across nodes and zones. Use Disruption Budgets to set the minimum available pods that need to be maintained, in case there are attempts to evict or delete pods.
Karpenter’s model of layered constraints allows you to create a complex set of NodePool and pod deployment constraints to get the best possible matches for pod scheduling. Examples of constraints that a pod spec can request include the following:
-
Needing to run in availability zones where only particular applications are available. Say, for example, you have pod that has to communicate with another application that runs on an EC2 instance residing in a particular availability zone. If your aim is to reduce cross-AZ traffic in your VPC, you may want to co-locate the pods in the AZ where the EC2 instance is located. This sort of targeting is often accomplished using node selectors. For additional information on Node selectors, please refer to the Kubernetes documentation.
-
Requiring certain kinds of processors or other hardware. See the Accelerators section of the Karpenter docs for a pod spec example that requires the pod to run on a GPU.
When you configure your cluster to automatically scale, you should create billing alarms to warn you when your spend has exceeded a threshold and add resource limits to your Karpenter configuration. Setting resource limits with Karpenter is similar to setting an AWS autoscaling group’s maximum capacity in that it represents the maximum amount of compute resources that can be instantiated by a Karpenter NodePool.
Note
|
It is not possible to set a global limit for the whole cluster. Limits apply to specific NodePools. |
The snippet below tells Karpenter to only provision a maximum of 1000
CPU cores and 1000Gi of memory. Karpenter will stop adding capacity only
when the limit is met or exceeded. When a limit is exceeded the
Karpenter controller will write
memory resource usage of 1001 exceeds limit of 1000
or a similar
looking message to the controller’s logs. If you are routing your
container logs to CloudWatch logs, you can create a
metrics
filter to look for specific patterns or terms in your logs and then
create a
CloudWatch
alarm to alert you when your configured metrics threshold is breached.
For further information using limits with Karpenter, see Setting Resource Limits in the Karpenter documentation.
spec:
limits:
cpu: 1000
memory: 1000Gi
If you don’t use limits or constrain the instance types that Karpenter can provision, Karpenter will continue adding compute capacity to your cluster as needed. While configuring Karpenter in this way allows your cluster to scale freely, it can also have significant cost implications. It is for this reason that we recommend that configuring billing alarms. Billing alarms allow you to be alerted and proactively notified when the calculated estimated charges in your account(s) exceed a defined threshold. See Setting up an Amazon CloudWatch Billing Alarm to Proactively Monitor Estimated Charges for additional information.
You may also want to enable Cost Anomaly Detection which is an AWS Cost Management feature that uses machine learning to continuously monitor your cost and usage to detect unusual spends. Further information can be found in the AWS Cost Anomaly Detection Getting Started guide. If you’ve gone so far as to create a budget in AWS Budgets, you can also configure an action to notify you when a specific threshold has been breached. With budget actions you can send an email, post a message to an SNS topic, or send a message to a chatbot like Slack. For further information see Configuring AWS Budgets actions.
If you are running a critical application on a Karpenter-provisioned
node, such as a long running batch job or stateful application, and
the node’s TTL has expired, the application will be interrupted when the
instance is terminated. By adding a
karpenter.sh/do-not-disrupt
annotation to the pod, you
are instructing Karpenter to preserve the node until the Pod is
terminated or the karpenter.sh/do-not-disrupt
annotation is removed.
See
Distruption
documentation for further information.
If the only non-daemonset pods left on a node are those associated with jobs, Karpenter is able to target and terminate those nodes so long as the job status is succeed or failed.
Consolidation and scheduling in general work by comparing the pods resource requests vs the amount of allocatable resources on a node. The resource limits are not considered. As an example, pods that have a memory limit that is larger than the memory request can burst above the request. If several pods on the same node burst at the same time, this can cause some of the pods to be terminated due to an out of memory (OOM) condition. Consolidation can make this more likely to occur as it works to pack pods onto nodes only considering their requests.
Because Kubernetes doesn’t set default requests or limits, a container’s consumption of resources from the underlying host, CPU, and memory is unbound. The Kubernetes scheduler looks at a pod’s total requests (the higher of the total requests from the pod’s containers or the total resources from the pod’s Init containers) to determine which worker node to schedule the pod onto. Similarly, Karpenter considers a pod’s requests to determine which type of instance it provisions. You can use a limit range to apply a sensible default for a namespace, in case resource requests are not specified by some pods.
Karpenter is able to launch nodes that best fit your workloads when its information about your workloads requirements is accurate. This is particularly important if using Karpenter’s consolidation feature.
When deploying CoreDNS pods on nodes managed by Karpenter, given Karpenter’s dynamic nature in rapidly terminating/creating new nodes to align with demand, it is advisable to adhere to the following best practices:
This will ensure that DNS queries are not directed to a CoreDNS Pod that is not yet ready or has been terminated.
As Karpenter takes an application-first approach to provision compute capacity for to the Kubernetes data plane, there are common workload scenarios that you might be wondering how to configure them properly. Karpenter Blueprints is a repository that includes a list of common workload scenarios following the best practices described here. You’ll have all the resources you need to even create an EKS cluster with Karpenter configured, and test each of the blueprints included in the repository. You can combine different blueprints to finally create the one you need for your workload(s).