Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- add AEP link to the motivation doc Signed-off-by: vadasambar <[email protected]>
1 parent
9cdc8bf
commit 1913064
Showing
2 changed files
with
66 additions
and
78 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,78 +1,65 @@ | ||
## Status | ||
This project is in PoC phase. | ||
|
||
|
||
## Why `kwok` provider? | ||
This is covered in [motivation](./docs/motivation.md). | ||
Check the doc around [motivation](./docs/motivation.md). | ||
|
||
<!-- ## Pre-requisites | ||
* `kwok` should be running in the cluster | ||
* [In-cluster deployment guide](https://kwok.sigs.k8s.io/docs/user/kwok-in-cluster/) --> | ||
|
||
<!-- ## Not supported yet | ||
* `node-group-auto-discovery` flag. Nodegroups are built from the node templates file. | ||
* `node-group-auto-discovery` flag. Nodegroups are built from the node templates file. | ||
* Auto provisioned nodegroups | ||
* GPU nodegroups --> | ||
|
||
## What can I do with `kwok` provider? | ||
Here are a few of the things you can do with the `kwok` provider: | ||
* Test autoscaling behavior of CA with your workloads (without incurring any cost) | ||
* Test behavior of CA at scale (without incurring any cost) | ||
* Run CA locally (like a kubebuilder controller) | ||
|
||
# TODO: Add ASCII Cinema of every usecase. | ||
|
||
|
||
## How to use `kwok` provider | ||
### Pre-requisites | ||
1. Install `kwok` controller | ||
2. Set CA to use `kwok` cloud provider | ||
### Using static node templates | ||
|
||
### Using dynamic node templates | ||
|
||
## Usecases | ||
### 1. Use `kwok` provider for testing behavior of cluster-autoscaler | ||
As a user, here's what you should do: | ||
1. If you are already running cluster-autoscaler in your cluster, change the value of the `--cloud-provider` flag to `kwok`. This installs the `kwok` controller in your cluster (you can disable this with `installKwok: false` in the kwok provider config). You can change the fields of the controller as you like; the `kwok` provider only deploys the manifests into the cluster if none exist already. | ||
2. Add node templates | ||
1. If you want to use static node templates, specify <kwok-config-json-path-for-file> in the kwok provider config | ||
2. If you want to use dynamic node templates, specify <kwok-config-json-path-for-cluster> in the kwok provider config | ||
3. If you are using static node templates, mount the file on the cluster-autoscaler pod. | ||
4. Use cluster-autoscaler with `kwok` to perform your tests (or anything else) | ||
5. Once you are done, change the `--cloud-provider` flag from `kwok` back to its original value | ||
## Tweaking the `kwok` provider | ||
|
||
### 2. Run cluster-autoscaler on your local Kubernetes cluster using `kwok` provider | ||
|
||
### What is supported? | ||
### What is not supported? | ||
## I have a problem/suggestion/question/idea/feature request. What should I do? | ||
Awesome! Please: | ||
* [Create a new issue](https://github.com/kubernetes/autoscaler/issues/new/choose) around it. Mention `@vadasambar` (I try to respond within a working day). | ||
* Start a Slack thread around it in the Kubernetes `#sig-autoscaling` channel (for an invitation, check [this](https://slack.k8s.io/)). Mention `@vadasambar` (I try to respond within a working day) | ||
* Add it to the [weekly sig-autoscaling meeting agenda](https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit) (happens [on Mondays](https://github.com/kubernetes/community/tree/master/sig-autoscaling#meetings)) | ||
|
||
Please don't think too much about creating an issue. We can always close it if it doesn't make sense. | ||
|
||
## What is not supported? | ||
* Creating kwok nodegroups based on the `kubernetes.io/hostname` node label. Why? Imagine you have a `Deployment` with pod anti-affinity on the `kubernetes.io/hostname` label like this: | ||
![](./docs/images/kwok-provider-hostname-label.png) | ||
Imagine you have only 2 unique values for the `kubernetes.io/hostname` node label in your cluster: `hostname1`, `hostname2` | ||
If you increase the number of replicas in the `Deployment` to 3, CA creates a fake node internally and runs simulations on it to decide if it should scale up. This fake node has `kubernetes.io/hostname` set to the name of the fake node, which looks like `template-node-xxxx-xxxx` (the second `xxxx` is random). Since the value of `kubernetes.io/hostname` on the fake node is neither `hostname1` nor `hostname2`, CA thinks it can schedule the `Pending` pod on the fake node and hence keeps scaling up indefinitely (or until it can't). | ||
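The loop described above can be illustrated with a small toy model. This is only a sketch, not cluster-autoscaler code: the `Node` class, `fits`, and `simulate` are hypothetical names, the anti-affinity check is reduced to "no pod of the same app on a node with the same hostname value", and it assumes that every node actually created by a scale-up carries the nodegroup's hostname value (`hostname1`) rather than the template node's unique name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, simplified node: only the hostname label value and pod names.
    name: str
    hostname: str
    pods: list = field(default_factory=list)


def fits(candidate, app, nodes):
    """Simplified required anti-affinity on the hostname label: the pod fits
    only if no node sharing the candidate's hostname value runs the same app."""
    return not any(app in n.pods for n in nodes if n.hostname == candidate.hostname)


def simulate(max_rounds=5):
    """Return how many scale-ups happen before the pending pod is scheduled
    (or before max_rounds is reached)."""
    nodes = [
        Node("node-1", "hostname1", ["web"]),
        Node("node-2", "hostname2", ["web"]),
    ]
    pending = "web"  # the third replica
    scale_ups = 0
    for round_no in range(max_rounds):
        # Real scheduling: can any existing node take the pending pod?
        if any(fits(n, pending, nodes) for n in nodes):
            break
        # CA simulation: the template node's hostname is its own unique name.
        template_name = f"template-node-{round_no:04d}"
        template = Node(template_name, template_name)
        if not fits(template, pending, nodes):
            break  # simulation says scaling up would not help
        # Assumption: the node that is really created reuses the nodegroup's
        # hostname value, so the anti-affinity rule still rejects it.
        nodes.append(Node(f"kwok-node-{round_no}", "hostname1"))
        scale_ups += 1
    return scale_ups


print(simulate(max_rounds=5))
```

In this model the simulation always succeeds on the template node while the real nodes never accept the pod, so the number of scale-ups is bounded only by `max_rounds`.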
### TODO | ||
- [ ] remove outdated things in the doc (especially `Future plans`) | ||
- [ ] add docs around kwok config | ||
- [ ] specify required and optional fields | ||
|
||
|
||
### Future plans | ||
1. Support draining kwok nodes when cleaning up | ||
1. Support waiting for `kwok` controller's `Deployment` to come up. | ||
2. Support merging of static and dynamic node templates | ||
3. Evaluate adding support to check if `kwok` controller already exists | ||
4. Find a way to support getting GPU config from other providers (leads to cyclic import error) | ||
5. Refactor config loading and validation (uses a lot of `if`'s right now; a little difficult to maintain) | ||
6. Implement `Refresh` (unimplemented right now) | ||
7. Support customizing annotation used by kwok for managing nodes | ||
8. Clean-up previous installation of `kwok` | ||
* Right now `kwok` is installed when CA starts and uninstalled when CA pod is terminated | ||
* If by any chance there is a previous installation of `kwok` at version A and new CA pod starts and installs version B | ||
* When the new CA pod terminates it will attempt to delete manifests in version B | ||
* If version A had any extra manifests, those manifests would never be deleted | ||
9. Support automatically installing `kwok` when user changes `--cloud-provider` flag to `kwok` | ||
* Strong permissions need to be granted to the entire CA code to be able to do this which can pose security risks | ||
* This needs more discussion with the community before proceeding ahead | ||
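The clean-up problem from item 8 above (a previous installation at version A, followed by a new CA pod that installs and later deletes version B) can be sketched as plain set arithmetic. The manifest names below are made up for illustration and `leftover_manifests` is a hypothetical helper, not part of the provider.

```python
def leftover_manifests(version_a, version_b):
    """Manifests still in the cluster after the new CA pod terminates and
    deletes only the manifests it knows about (version B)."""
    in_cluster = set(version_a) | set(version_b)  # A was never cleaned up, B installed on top
    deleted_on_shutdown = set(version_b)          # only version B's manifests are deleted
    return sorted(in_cluster - deleted_on_shutdown)


# Hypothetical manifest names; version A ships one extra manifest.
version_a = [
    "deployment/kwok-controller",
    "clusterrole/kwok-controller",
    "configmap/kwok-legacy",
]
version_b = [
    "deployment/kwok-controller",
    "clusterrole/kwok-controller",
]

print(leftover_manifests(version_a, version_b))
```

Anything that exists only in version A stays behind, which is why cleaning up a previous installation needs to be handled explicitly.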
### I want a feature | ||
* Create a new issue and mention `@vadasambar`. SLA: reply within a week until the end of 2023 (after which I will revisit the SLO and might come up with a new one). | ||
|
||
### Troubleshooting | ||
1. Pods are still stuck in `Running` even after CA has cleaned up all the kwok nodes | ||
* `kwok` provider doesn't drain the nodes when it deletes them. It just deletes the nodes. You should see pods running on these nodes change from `Running` state to `Pending` state in a minute or two. But if you don't, try scaling down your workload and scaling it up again. If the issue persists, please create an issue :pray:. | ||
If you increase the number of replicas in the `Deployment` to 3, CA creates a fake node internally and runs simulations on it to decide if it should scale up. This fake node has `kubernetes.io/hostname` set to the name of the fake node, which looks like `template-node-xxxx-xxxx` (the second `xxxx` is random). Since the value of `kubernetes.io/hostname` on the fake node is neither `hostname1` nor `hostname2`, CA thinks it can schedule the `Pending` pod on the fake node and hence keeps scaling up indefinitely (or until it can't). | ||
|
||
### I have a cool idea around `kwok` provider | ||
* Please create an issue or start a Slack thread in `#sig-autoscaling` mentioning me. | ||
|
||
### I want to contribute | ||
Thank you ❤️ | ||
|
||
#### How to build CA | ||
#### How to run tests | ||
#### Where to reach for help | ||
* Issue comment (mention `@vadasambar`) | ||
* Slack thread in #sig-autoscaling (mention `@vadasambar`) | ||
* sig-autoscaling weekly meeting (happens on Mondays) | ||
## Troubleshooting | ||
1. Pods are still stuck in `Running` even after CA has cleaned up all the kwok nodes | ||
* `kwok` provider doesn't drain the nodes when it deletes them. It just deletes the nodes. You should see pods running on these nodes change from `Running` state to `Pending` state in a minute or two. But if you don't, try scaling down your workload and scaling it up again. If the issue persists, please create an issue :pray:. | ||
|
||
## I want to contribute | ||
Thank you ❤️ | ||
|
||
Here is some info to get you started: | ||
### Learn to build CA | ||
### Get yourself familiar with the `kwok` project | ||
### Try out the `kwok` provider | ||
### Look for a good first issue | ||
### Reach out for help if you get stuck | ||
You can get help in the following ways: | ||
* Mention `@vadasambar` in the issue/PR you are working on. | ||
* Start a slack thread in `#sig-autoscaling` mentioning `@vadasambar`. | ||
* Add it to the weekly sig-autoscaling meeting agenda (happens on Mondays) |