Introduce AEP with Provisioning Request CRD #5848 (merged Sep 11, 2023)

cluster-autoscaler/proposals/provisioning-request.md
# Provisioning Request CRD

author: kisieland

## Background

Currently CA does not provide any way to express that a group of pods would like
to have capacity available.
This is caused by the fact that each CA loop picks a group of unschedulable pods
and works on provisioning capacity for them, meaning that the grouping is random
(as it depends on the kube-scheduler and CA loop interactions).
This is especially problematic in a couple of cases:
- Users would like to have all-or-nothing semantics for their workloads.
  Currently CA will try to provision this capacity and, if it is only partially
  successful, it will leave it in the cluster until the user removes the workload.
- Users would like to lower e2e scale-up latency for huge scale-ups (100+
  nodes). Due to CA's nature and kube-scheduler throughput, CA will create
  partial scale-ups, e.g. `0->200->400->600` rather than one `0->600`. This
  significantly increases the e2e latency as there is a non-negligible time tax
  on each scale-up operation.

## Proposal

### High level

Provisioning Request (abbr. ProvReq) is a new namespaced Custom Resource that
aims to allow users to ask CA for capacity for groups of pods.
It allows users to express the fact that a group of pods is connected and should
be treated as one entity.
This AEP proposes an API that can have multiple provisioning classes and can be
extended by cloud provider specific ones.
This object is meant as a one-shot request to CA, so if CA fails to provision
the capacity it is up to the users to retry (such retry functionality can be added
later on).

### ProvisioningRequest CRD

The following code snippets assume [kubebuilder](https://book.kubebuilder.io/)
is used to generate the CRD:

```go
// ProvisioningRequest is a way to express additional capacity
// that we would like to provision in the cluster. Cluster Autoscaler
// can use this information in its calculations and signal if the capacity
// is available in the cluster or actively add capacity if needed.
type ProvisioningRequest struct {
	metav1.TypeMeta `json:",inline"`
	// Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
	//
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// Spec contains the specification of the ProvisioningRequest object.
	// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status.
	//
	// +kubebuilder:validation:Required
	Spec ProvisioningRequestSpec `json:"spec"`
	// Status of the ProvisioningRequest. CA constantly reconciles this field.
	//
	// +optional
	Status ProvisioningRequestStatus `json:"status,omitempty"`
}

// ProvisioningRequestList is a list of ProvisioningRequest objects.
type ProvisioningRequestList struct {
	metav1.TypeMeta `json:",inline"`
	// Standard list metadata.
	//
	// +optional
	metav1.ListMeta `json:"metadata"`
	// Items is the list of ProvisioningRequest objects returned from the API.
	//
	// +optional
	Items []ProvisioningRequest `json:"items"`
}

// ProvisioningRequestSpec is a specification of additional pods for which we
// would like to provision additional resources in the cluster.
type ProvisioningRequestSpec struct {
	// PodSets lists groups of pods for which we would like to provision
	// resources.
	//
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:MaxItems=32
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	PodSets []PodSet `json:"podSets"`

	// ProvisioningClass describes the different modes of provisioning the resources.
	// Supported values:
	// * check-capacity.kubernetes.io - check if the current cluster state can fulfill
	//   this request, do not reserve the capacity.
	// * atomic-scale-up.kubernetes.io - provision the resources in an atomic manner.
	// * ... - potential other classes that are specific to the cloud providers.
	//
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	ProvisioningClass string `json:"provisioningClass"`

	// AdditionalParameters contains all other parameters that custom classes may require.
	//
	// +optional
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	AdditionalParameters map[string]string `json:"additionalParameters"`
}

type PodSet struct {
	// PodTemplateRef is a reference to a PodTemplate object representing the pods
	// that will consume this reservation (must be within the same namespace).
	// Users need to make sure that the fields relevant to the scheduler (e.g. node
	// selector, tolerations) are consistent between this template and the actual
	// pods consuming the Provisioning Request.
	//
	// +kubebuilder:validation:Required
	PodTemplateRef Reference `json:"podTemplateRef"`
	// Count contains the number of pods that will be created with the given
	// template.
	//
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=16384
	Count int32 `json:"count"`
}

type Reference struct {
	// Name of the referenced object.
	// More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names#names
	//
	// +kubebuilder:validation:Required
	Name string `json:"name,omitempty"`
}

// ProvisioningRequestStatus represents the status of the resource reservation.
type ProvisioningRequestStatus struct {
	// Conditions represent the observations of a Provisioning Request's
	// current state. These will contain information on whether the capacity
	// was found/created or whether there were any issues. The condition types
	// may differ between different provisioning classes.
	//
	// +listType=map
	// +listMapKey=type
	// +patchStrategy=merge
	// +patchMergeKey=type
	// +optional
	Conditions []metav1.Condition `json:"conditions"`

	// AdditionalStatus contains all other status values that custom provisioning
	// classes may require.
	//
	// +optional
	// +kubebuilder:validation:MaxProperties=64
	AdditionalStatus map[string]string `json:"additionalStatus"`
}
```
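
For illustration, a ProvisioningRequest object built from this schema might look like the following sketch. The API group/version, the object name, and the PodTemplate name are hypothetical, as this AEP does not fix them:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # hypothetical group/version
kind: ProvisioningRequest
metadata:
  name: capacity-check        # hypothetical name
  namespace: default
spec:
  provisioningClass: check-capacity.kubernetes.io
  podSets:
  - podTemplateRef:
      name: worker-template   # hypothetical PodTemplate in the same namespace
    count: 100
```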

### Provisioning Classes

#### check-capacity.kubernetes.io class

The `check-capacity.kubernetes.io` class is a one-off check verifying that the
cluster contains enough capacity to provision the given set of pods.

Note: If two such objects are created around the same time, CA will consider
them independently and place no guards on the capacity.
Also, the capacity is not reserved in any manner, so it may be scaled down.

#### atomic-scale-up.kubernetes.io class

The `atomic-scale-up.kubernetes.io` class aims to provision the resources required for the
specified pods in an atomic way. The proposed logic is to:
1. Try to provision the required VMs in one loop.
2. If that fails, remove the partially provisioned VMs and back off.
3. Stop the back-off after a given duration (optional), which would be passed
via the `AdditionalParameters` field, using the `ValidUntilSeconds` key, and would contain
a string denoting the duration for which we should retry (measured since creation of the CR).

Note that the VMs created in this mode are subject to the scale-down logic,
so the duration during which users need to create the Pods is equal to the
value of the `--scale-down-unneeded-time` flag.
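
A sketch of an atomic request using this parameter (the API group/version, object names, and the `600` value are illustrative, not fixed by this AEP):

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # hypothetical group/version
kind: ProvisioningRequest
metadata:
  name: batch-job-capacity    # hypothetical name
  namespace: default
spec:
  provisioningClass: atomic-scale-up.kubernetes.io
  additionalParameters:
    ValidUntilSeconds: "600"  # retry for up to 10 minutes after the CR is created
  podSets:
  - podTemplateRef:
      name: batch-job-template
    count: 600
```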

### Adding pods that consume given ProvisioningRequest

To avoid generating double scale-ups and to exclude pods that are meant to consume
the given capacity, CA should be able to differentiate those from all other pods.
To do so, users need to specify the following pod annotation (it is not required
in ProvReq's template, though it can be specified):

```yaml
annotations:
  "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "provreq-name"
```

If it is provided on pods that consume a ProvReq with the `check-capacity.kubernetes.io` class,
CA will not provision the capacity, even if it is needed (as some other pods might have been
scheduled on it), and will surface this via visibility events passed to the ProvReq and the pods.
If it is not provided, CA will behave normally and provision the capacity if it is needed.

Note: CA will match all pods with this annotation to a corresponding ProvReq and
ignore them when executing a scale-up loop (so it is up to users to make sure
that the ProvReq count matches the number of created pods).
If the ProvReq is missing, all of the pods that consume it will be unschedulable indefinitely.
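
A minimal Pod carrying this annotation could look like the following (the pod name, request name, and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0              # hypothetical name
  namespace: default
  annotations:
    "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "batch-job-capacity"
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest  # placeholder image
```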

### CRD lifecycle

1. A ProvReq will be created either by the end user or by a framework.
   At this point the needed PodTemplate objects should also be created.
2. CA will pick it up, choose a node pool (or create a new one if NAP is
   enabled), and try to create nodes.
3. If CA successfully creates the capacity, the ProvReq will receive information
   about this fact in the `Conditions` field.
4. At this moment, users can create the pods that will consume the ProvReq (in
   the same namespace); those will be scheduled on the capacity that was
   created by CA.
5. Once all of the pods are scheduled, users can delete the ProvReq object;
   otherwise it will be garbage collected after some time.
6. When the pods finish their work and the nodes become unused, CA will scale
   them down.

Note: Users can create a ProvReq and the pods consuming it at the same time (in a
"fire and forget" manner), but this may result in the pods being unschedulable
and triggering user-configured alerts.

### Canceling the requests

To cancel a pending Provisioning Request with the atomic class, all that users need to do is
delete the Provisioning Request object.
After that, CA will no longer guard the nodes from deletion and will proceed with the standard scale-down logic.

### Conditions

The following Condition states should encode the states of the ProvReq:

- Provisioned - VMs were created successfully (Atomic class)
- CapacityAvailable - cluster contains enough capacity to schedule pods (Check
  class)
  * `CapacityAvailable=true` will denote that the cluster contains enough capacity to schedule the pods
  * `CapacityAvailable=false` will denote that the cluster does not contain enough capacity to schedule the pods
- Failed - failed to create or check capacity (both classes)

The Reasons and Messages will contain more details about why the specific
condition was triggered.

Providers of the custom classes should reuse the conditions where available or create their own ones
if items from the above list cannot be used to denote a specific situation.
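
As an illustration, the status of a successfully handled atomic request might look like the following. The `reason` and `message` strings are hypothetical, since this AEP does not specify them:

```yaml
status:
  conditions:
  - type: Provisioned
    status: "True"
    lastTransitionTime: "2023-09-11T10:00:00Z"
    reason: CapacityCreated                                # hypothetical reason
    message: "Created 600 nodes in node group default-pool"  # hypothetical message
```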

### CA implementation details

The proposed implementation is to handle each ProvReq in a separate scale-up
loop. This will require changes in multiple parts of CA:

1. Listing unschedulable pods, where:
   - pods that consume a ProvReq need to be filtered out
   - pods that are represented by the ProvReq need to be injected (we need to
     ensure those are treated as one group by the sharding logic)
2. Scale-up logic, which as of now has no notion of atomicity or grouping of
   pods. This is simplified as the ScaleUp logic was recently put [behind an
   interface](https://github.com/kubernetes/autoscaler/pull/5597).
   - This is where the biggest part of the change will be made. Here
     many parts of the logic assume best-effort semantics and the scale-up
     size is lowered in many situations:
     - Estimation logic, which stops after some time-out or number of
       pods/nodes.
     - Size limiting, which caps the scale-up to match the size
       restrictions (on node group or cluster level).
3. Node creation, which needs to support atomic resize, either via native cloud
   provider APIs or best-effort with node removal if CA is unable to fulfill
   the scale-up.
   - This is also quite a substantial change; we can provide a generic
     best-effort implementation that will try to scale up and clean up nodes
     if it is unsuccessful, but it is up to cloud providers to integrate with
     provider-specific APIs.
4. The scale-down path is not expected to change much, but users should follow
   [best
   practices](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node)
   to avoid CA disturbing their workloads.

## Testing

The following e2e test scenarios will be created to check whether ProvReq
handling works as expected:

1. A new ProvReq with the `check-capacity.kubernetes.io` provisioning class is created; CA
   checks if there is enough capacity in the cluster to provision the specified pods.
2. A new ProvReq with the `atomic-scale-up.kubernetes.io` provisioning class is created; CA
   picks an appropriate node group and scales it up atomically.
3. A new atomic ProvReq is created for which NAP needs to provision a new
   node group. NAP creates it and CA scales it up atomically.
   - Here we should cover some of the different reasons why NAP may be
     required.
4. An atomic ProvReq fails due to node group size limits and NAP CPU and/or RAM
   limits.
5. Scalability tests.
- Scenario in which many small ProvReqs are created (strain on the number
of scale-up loops).
- Scenario in which big ProvReq is created (strain on a single scale-up
loop).

## Limitations

The current Cluster Autoscaler implementation does not take into account [Resource Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/).
The current proposal does not include handling of Resource Quotas, but it could be added later on.

## Future Expansions

### ProvisioningClass CRD

One of the possible expansions of this approach is to introduce a ProvisioningClass CRD,
which follows the same approach as the
[StorageClass object](https://kubernetes.io/docs/concepts/storage/storage-classes/).
Such an approach would allow cluster administrators to introduce a list of allowed
ProvisioningClasses. Such a CRD can also contain a preset configuration, i.e.
administrators may set that `atomic-scale-up.kubernetes.io` would retry for up to `2h`.

Possible CRD definition:
```go
// ProvisioningClass is a way to express provisioning classes available in the cluster.
type ProvisioningClass struct {
	// Name denotes the name of the object, which is to be used in the ProvisioningClass
	// field in the Provisioning Request CRD.
	//
	// +kubebuilder:validation:Required
	Name string `json:"name"`

	// AdditionalParameters contains all other parameters custom classes may require.
	//
	// +optional
	AdditionalParameters map[string]string `json:"additionalParameters"`
}
```
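
Under this expansion, an administrator could publish an instance like the following. The field layout is speculative, since the CRD above is only a sketch, and the `7200` value simply illustrates the `2h` retry example:

```yaml
kind: ProvisioningClass
name: atomic-scale-up.kubernetes.io
additionalParameters:
  ValidUntilSeconds: "7200"   # retry atomic scale-ups for up to 2h
```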