Introduce AEP with Provisioning Request CRD #5848
Conversation
FYI:
@qianlei90 Yes, it should help with the issue.
created by the CA.
5. Once all of the pods are scheduled, users can delete the ProvReq object;
   otherwise it will be garbage collected after some time.
6. When pods finish the work and nodes become unused, the CA will scale them
   down.
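For illustration, here is a minimal sketch of the object shape this excerpt refers to. The field and type names below are assumptions based on this discussion, not the exact API from the AEP:

```go
// Illustrative sketch only: field names and structure are assumptions based on
// this thread, not the exact types proposed in the AEP.
package provreqsketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ProvisioningRequest asks the Cluster Autoscaler to provision capacity for a
// group of pods before (or instead of) the pods themselves being created.
type ProvisioningRequest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ProvisioningRequestSpec   `json:"spec"`
	Status ProvisioningRequestStatus `json:"status,omitempty"`
}

// ProvisioningRequestSpec describes the capacity being requested.
type ProvisioningRequestSpec struct {
	// ProvisioningClass selects how the request is handled (see the
	// discussion of check-capacity vs. atomic scale-up below).
	ProvisioningClass string `json:"provisioningClass"`
	// PodSets describe the pods the requested capacity must fit.
	PodSets []PodSet `json:"podSets"`
}

// PodSet is a pod template plus how many copies of it must fit.
type PodSet struct {
	Count    int32                  `json:"count"`
	Template corev1.PodTemplateSpec `json:"template"`
}

// ProvisioningRequestStatus reports whether the capacity was provisioned,
// e.g. via a condition that users and controllers can watch before deleting
// the object (or letting it be garbage collected), as described in the steps
// above.
type ProvisioningRequestStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```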
What are the rules for scale-down when nodes are provisioned but the pods didn't start (or haven't started yet)?
It would be great if this was somehow configurable in the longer run.
The current proposal is to use the same logic for scale-down, so it would be configurable via the --scale-down-unneeded-time flag.
For now there is no way for the CA to know which VMs come from the atomic scale-up, so we cannot treat them differently.
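To make this concrete, here is a small illustrative sketch; the flag name is the real CA flag mentioned above, but the surrounding program is invented for illustration and is not the autoscaler's actual code:

```go
// Illustrative only: nodes provisioned for a ProvisioningRequest would become
// scale-down candidates after the same unneeded time as any other node,
// because the CA cannot tell them apart today.
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// --scale-down-unneeded-time is the existing CA flag referenced above.
	scaleDownUnneededTime := flag.Duration("scale-down-unneeded-time", 10*time.Minute,
		"How long a node should be unneeded before it is eligible for scale down")
	flag.Parse()

	fmt.Printf("unused ProvReq nodes are removed after %s of being unneeded\n", *scaleDownUnneededTime)
}
```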
// ProvisioningClass describes the different modes of provisioning the resources.
// Supported values:
// * GENERIC_CHECK_CAPACITY - check if current cluster state can fulfil this request
Is there a strong need for GENERIC_CHECK_CAPACITY? There is no scaling in that case, so it seems somewhat outside of "autoscaler" responsibilities. Could we maybe start with just GENERIC_ATOMIC?
Also, what happens if there are two CHECK_CAPACITY requests but there is capacity for only one? Does one get the capacity and the other fail? When does the hold expire, and is it consistent with how ProvisioningRequest scale-down works for GENERIC_ATOMIC_SCALE_UP?
I think that there may be quite a few use cases. For example, trying to run low-priority jobs only when there is capacity on the statically allocated nodes. Currently K8s doesn't have gang scheduling, and this may work as a best-effort replacement.
The problem with invalidation exists in all cases where shared/non-exclusive resources are provisioned. This should be addressed in the doc. Possible options are:
- Invalidate a request that is still there.
- Treat the ProvReq as a point-in-time query and don't bother about invalidation.
Given that we already have two possible behaviors for a single capacity check, the params/classes may actually be needed from day 1.
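A rough sketch of how these two behaviors could surface as provisioning classes follows; the constant values come from the names used in this thread, and the semantics in the comments summarize the discussion rather than the final API:

```go
// Sketch only: names follow this thread's discussion, not necessarily the
// final API shape.
package provreqsketch

const (
	// GENERIC_CHECK_CAPACITY only checks whether the current cluster state
	// can fit the request; it never triggers a scale-up. As noted above, the
	// capacity is not reserved, so two such requests may both "succeed"
	// against the same free capacity (point-in-time query semantics) unless
	// an invalidation mechanism is added.
	ProvisioningClassCheckCapacity = "GENERIC_CHECK_CAPACITY"

	// GENERIC_ATOMIC_SCALE_UP asks the CA to provision all of the requested
	// capacity in a single all-or-nothing scale-up.
	ProvisioningClassAtomicScaleUp = "GENERIC_ATOMIC_SCALE_UP"
)
```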
This capability is interesting for the Armada project, where we'd like to control resource provisioning separately from pod scheduling. Some context: with Armada we typically have long queues of pods that can't currently be scheduled, but we may not want to provision enough capacity to run those pods immediately.
Can other external controllers, instead of the CA, use the provisioning CRD to allocate resources?
The CA is meant to be the only component reacting to the CRD. That said, if it is not installed, there is nothing blocking another component from stepping in to fill the void (e.g. the Karpenter component). As for interacting with the CRD, any other component can create ProvisioningRequest objects to ask the CA to provision the VMs.
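As an illustration of that last point, a hypothetical external controller could create a ProvisioningRequest roughly like this. The GroupVersionResource, the spec fields, and the object name are assumptions (the thread only mentions the API moving to the x-k8s.io group); the client-go calls themselves are real:

```go
// Hypothetical external controller asking the CA for capacity by creating a
// ProvisioningRequest. GVR and spec fields are assumptions, not the final API.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed group/version/resource for the CRD.
	gvr := schema.GroupVersionResource{
		Group:    "autoscaling.x-k8s.io",
		Version:  "v1beta1",
		Resource: "provisioningrequests",
	}

	provReq := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "autoscaling.x-k8s.io/v1beta1",
		"kind":       "ProvisioningRequest",
		"metadata":   map[string]interface{}{"name": "batch-capacity", "namespace": "default"},
		"spec": map[string]interface{}{
			"provisioningClass": "GENERIC_ATOMIC_SCALE_UP",
			// PodSets describing the pods the capacity must fit would go here.
		},
	}}

	if _, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), provReq, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```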
Thanks @kisieland, can you please explain this statement a bit more: "We can have ProvisioningClasses"
/label api-review |
I forgot to send my review last week :(
Thanks @alculquicondor, addressed your comments.
From reading through this, it strikes me as a static capacity request for "ghost pods", i.e. I want negative-preemption pods to represent my capacity without those pods actually existing or being interacted with by the kube-scheduler. Is there a similar hierarchical relationship we can build here, analogous to the varying levels of abstraction we have for Pod/ReplicaSet/Deployment? Is there a way to represent a capacity request as a single pod and then build off of it with a replica count? Then CAS would recognize either real pods or these static-capacity ghost pods.
There's also a question in my head about the difference in use cases here. One use case is that I always want overhead capacity for a warm pool; the other is that I eventually want that static capacity to be filled and I don't want to further scale up my capacity when that pre-provisioned static capacity is utilized.
@liggitt PTAL
Re @jpbetz:
Provisioning Requests can represent one pod. But building more of those objects and expecting CAS to group them defeats the purpose of the ProvReq object in the first place: why use such ProvReqs when one can just use pods?
Here we propose to implement the second mode, but there is nothing preventing you or anybody else from adding a constant-size buffer mode. All the required fields are here; only the CAS logic would need to be slightly modified.
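To illustrate the "all the required fields are here" point, a hypothetical constant-size buffer mode could reuse the same fields (a pod template plus a count); only the CA's reconciliation logic would differ. Everything below is invented for illustration and is not part of the AEP:

```go
// Hypothetical sketch: a "keep N pods' worth of capacity warm" request needs
// no new fields beyond what a ProvisioningRequest already carries.
package provreqsketch

import corev1 "k8s.io/api/core/v1"

// BufferRequest mirrors the existing ProvisioningRequest fields.
type BufferRequest struct {
	// Class would distinguish buffer semantics from one-shot provisioning;
	// this class name is invented here.
	Class string // e.g. "HYPOTHETICAL_CONSTANT_BUFFER"
	// Count is how many copies of Template's capacity to keep available.
	Count int32
	// Template describes the shape of a single pod in the buffer.
	Template corev1.PodTemplateSpec
}
```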
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kisieland, mwielgus. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Hi, I'm not very familiar with the autoscaler source code, but I do care about this feature, so I left some comments; they may not be mature, forgive me. 🥲
Sorry for the delay; we have a bi-weekly API review that will be reviewing the updates today. I see a CRD was already merged in #6104, linking to this PR as API approval.
The API was meant to go to x-k8s.io from the very beginning. Thanks for noticing it. It will be fixed soon.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a new AEP to support a new CRD in the Cluster Autoscaler.
This new API will allow users to specify a dependency between a group of pods, so that capacity for the whole group can be provisioned together.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
N/A