
Introduce AEP with Provisioning Request CRD #5848

Merged
merged 1 commit into kubernetes:master on Sep 11, 2023

Conversation

kisieland
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a new AEP that proposes a new CRD for the Cluster Autoscaler.
This new API will allow users to express that a group of pods should be provisioned together.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

N/A

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/cluster-autoscaler labels Jun 13, 2023
@qianlei90
Contributor

@mwielgus mwielgus self-assigned this Jun 14, 2023
@mwielgus
Contributor

@qianlei90 Yes, it should help with the issue.

cluster-autoscaler/proposals/provisioning-request.md (excerpt under review):
…created by the CA.
5. Once all of the pods are scheduled users can delete the ProvReq object, otherwise it will be garbage collected after some time.
6. When pods finish the work and nodes become unused the CA will scale them…

What are the rules for scale-down when nodes are provisioned but the pods didn't start (or haven't started yet)?

Contributor

It would be great if this was somehow configurable in the longer run.

Contributor Author

The current proposal is to use the same logic for scale-down, so it would be configurable via the --scale-down-unneeded-time flag.
For now there is no way for the CA to know which VMs come from the atomic scale-up, so we cannot differentiate.


cluster-autoscaler/proposals/provisioning-request.md (excerpt under review):
// ProvisioningClass describes the different modes of provisioning the resources.
// Supported values:
// * GENERIC_CHECK_CAPACITY - check if the current cluster state can fulfill this request

Is there a strong need for GENERIC_CHECK_CAPACITY? There is no scaling in that case, so it seems somewhat outside of "autoscaler" responsibilities. Could we maybe start with just GENERIC_ATOMIC?

Also, what happens if there are two CHECK_CAPACITY requests but there is capacity for only one? Does one get capacity and the other one fail? When does the hold expire, and is it consistent with how ProvisioningRequest scale-down works for GENERIC_ATOMIC_SCALE_UP?

Contributor

I think that there may be quite a few use cases. For example, trying to run low-priority jobs only when there is capacity on the statically allocated nodes. Currently K8s doesn't have gang scheduling, and this may work as a best-effort replacement.
The problem with invalidation exists in all cases where shared/non-exclusive resources are provisioned. This should be addressed in the doc. Possible options are:

  • Invalidate a request that is still there.
  • Treat the PR as a point-in-time query and don't bother about invalidation.

Given that we already have 2 possible behaviors for a single capacity, the params/classes may actually be needed from day 1.

cluster-autoscaler/proposals/provisioning-request.md (outdated review thread, resolved)
@kisieland kisieland force-pushed the aep-prov-req branch 2 times, most recently from 311bd7e to bfddb12 Compare June 19, 2023 15:54
@severinson

This capability is interesting for the Armada project, where we'd like to control resource provisioning separately from pod scheduling. Some context: with Armada we typically have long queues of pods that can't currently be scheduled, but we may not want to provision enough capacity to run those pods immediately.

@asm582
Member

asm582 commented Jun 23, 2023

Can other external controllers instead of CA use provisioning CRD to allocate resources?

@kisieland
Contributor Author

@asm582

Can other external controllers instead of CA use provisioning CRD to allocate resources?

CA is meant to be the only component reacting to the CRD. That said, if it is not installed, there is nothing blocking another component from stepping in to fill the void (e.g. the Karpenter component).
Note: One potential way we can have different components reacting to CRDs in the same cluster is to split them via ProvisioningClass.

As for interacting with the CRD, any other component can create ProvisioningRequest objects to ask the CA to provision the VMs.
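A minimal sketch of what "another component creates a ProvisioningRequest object" could look like in Go, using the client-go dynamic client. The group/version/resource coordinates, field names, and class value below are illustrative assumptions for this discussion, not the API defined by the AEP.

package provreqclient

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// createProvReq asks whichever component watches ProvisioningRequests
// (normally the CA) to provision capacity, by creating a CR through the
// dynamic client.
func createProvReq(ctx context.Context, cfg *rest.Config) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Assumed group/version/resource; the exact coordinates are not fixed by
	// the AEP text quoted here.
	gvr := schema.GroupVersionResource{
		Group:    "autoscaling.x-k8s.io",
		Version:  "v1beta1",
		Resource: "provisioningrequests",
	}

	provReq := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": gvr.Group + "/" + gvr.Version,
		"kind":       "ProvisioningRequest",
		"metadata":   map[string]interface{}{"name": "batch-job-capacity"},
		"spec": map[string]interface{}{
			// Illustrative class value; see the ProvisioningClass discussion above.
			"provisioningClass": "GENERIC_ATOMIC_SCALE_UP",
		},
	}}

	_, err = client.Resource(gvr).Namespace("default").Create(ctx, provReq, metav1.CreateOptions{})
	return err
}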

@asm582
Member

asm582 commented Jun 26, 2023

Thanks @kisieland, can you please explain the below statement a bit more?

Note: One potential way we can have different components reacting to CRDs in the same cluster is to split them via ProvisioningClass.

@kisieland
Contributor Author

@asm582

We can have ProvisioningClasses GenericAtomicScaleUp and CloudProviderSomeOtherProvisioningMode.
CRs with the first class would be handled by the CA, and CRs with the second would be handled by some other component.
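A rough sketch of how such routing by ProvisioningClass could look, using the GENERIC_* values from the API excerpt above as the CA-owned classes. The type and field names are assumptions for illustration, not the AEP's final Go API.

package provreq

// ProvisioningRequest is a minimal stand-in for the CRD's Go type; the real
// type would carry more fields (pod sets, conditions, etc.).
type ProvisioningRequest struct {
	Name              string
	ProvisioningClass string
}

// classesHandledByCA lists the classes the Cluster Autoscaler would act on.
// Requests with any other class are left untouched, so a cloud-provider
// specific controller (or e.g. Karpenter) can pick them up instead.
var classesHandledByCA = map[string]bool{
	"GENERIC_ATOMIC_SCALE_UP": true,
	"GENERIC_CHECK_CAPACITY":  true,
}

// shouldReconcile reports whether this controller owns the given request.
func shouldReconcile(pr ProvisioningRequest) bool {
	return classesHandledByCA[pr.ProvisioningClass]
}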

@kisieland
Contributor Author

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Jun 26, 2023
Member

@alculquicondor left a comment


I forgot to send my review last week :(

cluster-autoscaler/proposals/provisioning-request.md (3 outdated review threads, resolved)
@kisieland
Contributor Author

Thanks @alculquicondor, addressed your comments.

@jonathan-innis left a comment


From reading through this, this strikes me as a static capacity request for "ghost pods", i.e. I want negative-preemption pods to represent my capacity without those pods actually existing and being interacted with by the kube-scheduler. Is there a similar hierarchical relationship that we can build here, similar to how we have varying levels of abstraction for Pod/ReplicaSet/Deployment? Is there a way to represent a capacity request as a single pod and then build off of it with a replica count? Then CAS recognizes either real pods or these static-capacity ghost pods.

There's also a question in my head about the difference in use-cases here. One use-case is that I always want overhead capacity for a warm pool at all times; the other is that I eventually want that static capacity to be filled and I don't want to further scale up my capacity when that pre-provisioned static capacity is utilized.

@kisieland
Contributor Author

@liggitt PTAL

@mwielgus
Contributor

mwielgus commented Sep 7, 2023

@liggitt @jpbetz Do you have any more comments? If not (or there is no response) I will be merging the PR soon.

@kisieland
Contributor Author

kisieland commented Sep 8, 2023

Re @jpbetz:

... Is there a way to represent a capacity request as a single pod and then build off of it with a replica count?

Provisioning Requests can represent one pod. But building more of those objects and expecting CAS to group them defeats the purpose of the ProvReq object in the first place: why use such ProvReqs when one can just use pods?
The reason ProvReq aims to represent groups of Pods is to avoid the CA making decisions based on a partial state of the cluster.
If we used labels to group pods, CAS would not know when all of the pods have been created and might start work that is not needed.

One use-case is I always want overhead capacity for a warm pool at all times; the other is I eventually want that static capacity to be filled and I don't want to further scale up my capacity when that pre-provisioned static capacity is utilized.

Here we propose to implement the second mode, but there is nothing preventing you or anybody else from adding a constant-size buffer mode; all the required fields are here, only the CAS logic would need to be slightly modified.
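To illustrate the grouping point made earlier in this comment, here is a minimal sketch of how a spec could express a whole group of pods at once. The field names are assumptions for illustration only, not the exact shape proposed in the AEP.

package provreq

// PodSet describes one homogeneous group of pods and how many replicas of
// that shape the request needs capacity for.
type PodSet struct {
	// PodTemplateName references a PodTemplate holding the pod spec that the
	// capacity should be sized for.
	PodTemplateName string
	// Count is the number of pods of this shape covered by the request.
	Count int32
}

// ProvisioningRequestSpec carries the whole group up front, so the CA never
// has to guess whether more pods matching some label are still on their way.
type ProvisioningRequestSpec struct {
	ProvisioningClass string
	PodSets           []PodSet
}

// Example: capacity for 50 identical workers, requested atomically; the CA
// either provisions room for all 50 or reports failure, instead of acting on
// a partial view of the group.
var exampleSpec = ProvisioningRequestSpec{
	ProvisioningClass: "GENERIC_ATOMIC_SCALE_UP",
	PodSets: []PodSet{
		{PodTemplateName: "worker-template", Count: 50},
	},
}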

Contributor

@mwielgus left a comment


/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kisieland, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit 4e34992 into kubernetes:master Sep 11, 2023
Member

@kerthcet left a comment


Hi, I'm not quite familiar with the autoscaler source code, but I do care about this feature, so I left some comments; they may not be mature, forgive me. 🥲

@kisieland kisieland deleted the aep-prov-req branch September 12, 2023 11:28
@kisieland kisieland mentioned this pull request Sep 13, 2023
@liggitt
Member

liggitt commented Sep 14, 2023

@liggitt @jpbetz Do you have any more comments? If not (or there is no response) I will be merging the PR soon.

Sorry for the delay, we have a bi-weekly API review that will be reviewing the updates today.

I see a CRD was merged already in #6104 linking to this PR as API approval.
Lazy consensus does not constitute API approval for k8s.io APIs. Please move the CRD to the x-k8s.io domain if you want to proceed without API approval, or revert the PR until approval is given.

@mwielgus
Contributor

The API was meant to go to x-k8s.io from the very beginning. Thanks for noticing it. It will be fixed soon.

@kisieland
Contributor Author

Thanks @liggitt for noticing, created #6108 to address this!

Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.