diff --git a/cluster-autoscaler/cloudprovider/kwok/README.md b/cluster-autoscaler/cloudprovider/kwok/README.md index 7ad9bd3e3dd3..78d003ebc402 100644 --- a/cluster-autoscaler/cloudprovider/kwok/README.md +++ b/cluster-autoscaler/cloudprovider/kwok/README.md @@ -1,78 +1,65 @@ -## Status -This project is in PoC phase. - - ## Why `kwok` provider? -This is covered in [motivation](./docs/motivation.md). +Check the doc around [motivation](./docs/motivation.md). +## What can I do with `kwok` provider? +Among different things you can do with `kwok` provider, here are a few: +* Test autoscaling behavior of CA with your workloads (without incurring any cost) +* Test behavior of CA at scale (without incurring any cost) +* Run CA locally (like a kubebuilder controller) + +# TODO: Add ASCII Cinema of every usecase. + + +## How to use `kwok` provider +### Pre-requisites +1. Install `kwok` controller +2. Set CA to use `kwok` cloud provider +### Using static node templates + +### Using dynamic node templates -## Usecases -### 1. Use `kwok` provider for testing behavior of cluster-autoscaler -As a user, here's what you should do: -1. If you are running a cluster-autoscaler in your cluster already, change the `--cloudprovider` flag value to `kwok`. This will install `kwok` controller in your cluster. (you can disable this with `installKwok: false` in the kwok provider config). You can change the fields of the controller as you like. `kwok` provider only deploys the manifests into the cluster if none exist already. -2. Add node templates - 1. If you want to use static node templates, specify in the kwok provider config - 2. If you want to use dynamic node templates, specify in the kwok provider config -3. If you are using static node templates, mount the file on the cluster-autoscaler pod. -4. Use cluster-autoscaler with `kwok` to perform your tests (or anything else) -5. 
Once you are done, change `--cloudprovider` flag from `kwok` to the original value +## Tweaking the `kwok` provider -### 2. Run cluster-autoscaler on your local Kubernetes cluster using `kwok` provider -### What is supported? -### What is not supported? +## I have a problem/suggestion/question/idea/feature request. What should I do? +Awesome! Please: +* [Create a new issue](https://github.com/kubernetes/autoscaler/issues/new/choose) around it. Mention `@vadasambar` (I try to respond within a working day). +* Start a slack thread around it in kubernetes `#sig-autoscaling` channel (for invitation, check [this](https://slack.k8s.io/)). Mention `@vadasambar` (I try to respond within a working day) +* Add it to the [weekly sig-autoscaling meeting agenda](https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit) (happens [on Mondays](https://github.com/kubernetes/community/tree/master/sig-autoscaling#meetings)) + +Please don't think too much about creating an issue. We can always close it if it doesn't make sense. + +## What is not supported? * Creating kwok nodegroups based on `kubernetes/hostname` node label. Why? Imagine you have a `Deployment` with pod anti-affinity on the `kubernetes/hostname` label like this: ![](./docs/images/kwok-provider-hostname-label.png) Imagine you have only 2 unique hostnames values for `kubernetes/hostname` node label in your cluster: `hostname1`, `hostname2` -If you increase the number of replicas in the `Deployment` to 3, CA creates a fake node internally and runs simulations on it to decide if it should scale up. This fake node has `kubernetes/hostname` set to the name of the fake node which looks like `template-node-xxxx-xxxx` (second `xxxx` is random). Since the value of `kubernetes/hostname` on the fake node is not `hostname1` or `hostname2`, CA thinks it can schedule the `Pending` pod on the fake node and hence keeps on scaling up to infinity (or until it can't). 
-### TODO -- [ ] remove outdated things in the doc (especially `Future plans`) -- [ ] add docs around kwok config - - [ ] specify required and optional fields - - -### Future plans -1. Support draining kwok nodes when cleaning up -1. Support waiting for `kwok` controller's `Deployment` to come up. -2. Support merging of static and dynamic node templates -3. Evaluate adding support to check if `kwok` controller already exists -4. Find a way to support getting GPU config from other providers (leads to cyclic import error) -5. Refactor config loading and validation (uses a lot of `if`'s right now; a little difficult to maintain) -6. Implement `Refresh` (unimplemented right now) -7. Support customizing annotation used by kwok for managing nodes -8. Clean-up previous installation of `kwok` - * Right now `kwok` is installed when CA starts and uninstalled when CA pod is terminated - * If by any chance there is a previous installation of `kwok` at version A and new CA pod starts and installs version B - * When the new CA pod terminates it will attempt to delete manifests in version B - * If version A had any extra manifests, those manifests would never be deleted -9. Support automatically installing `kwok` when user changes `--cloudprovider` flag to `kwok` - * Strong permissions need to be granted to the entire CA code to be able to do this which can pose security risks - * This needs more discussion with the community before proceeding ahead -### I want a feature -* Create a new issue and mention `@vadasambar`. SLA: reply within a week until end of 2023 (post which I will think about SLO again and might come up with a new one). - -### Troubleshooting -1. Pods are still stuck in `Running` even after CA has cleaned up all the kwok nodes -* `kwok` provider doesn't drain the nodes when it deletes them. It just deletes the nodes. You should see pods running on these nodes change from `Running` state to `Pending` state in a minute or two. 
But if you don't, try scaling down your workload and scaling it up again. If the issue persists, please create an issue :pray:. +If you increase the number of replicas in the `Deployment` to 3, CA creates a fake node internally and runs simulations on it to decide if it should scale up. This fake node has `kubernetes/hostname` set to the name of the fake node which looks like `template-node-xxxx-xxxx` (second `xxxx` is random). Since the value of `kubernetes/hostname` on the fake node is not `hostname1` or `hostname2`, CA thinks it can schedule the `Pending` pod on the fake node and hence keeps on scaling up to infinity (or until it can't). -### I have a cool idea around `kwok` provider -* Please create an issue or start a slack thread in #sig-autoscaling mentioning me. -### I want to contribute -Thank you ❤️ -#### How to build CA -#### How to run tests -#### Where to reach for help -* Issue comment (mention `@vadasambar`) -* Slack thread in #sig-autoscaling (mention `@vadasambar`) -* sig-autoscaling weekly meeting (happens on Mondays) \ No newline at end of file +## Troubleshooting +1. Pods are still stuck in `Running` even after CA has cleaned up all the kwok nodes + * `kwok` provider doesn't drain the nodes when it deletes them. It just deletes the nodes. You should see pods running on these nodes change from `Running` state to `Pending` state in a minute or two. But if you don't, try scaling down your workload and scaling it up again. If the issue persists, please create an issue :pray:. + +## I want to contribute +Thank you ❤️ + +Here is some info to get you started: +### Learn to build CA +### Get yourself familiar with the `kwok` project +### Try out the `kwok` provider +### Look for a good first issue +### Reach out for help if you get stuck +You can get help in the following ways: +* Mention `@vadasambar` in the issue/PR you are working on. +* Start a slack thread in `#sig-autoscaling` mentioning `@vadasambar`. 
+* Add it to the weekly sig-autoscaling meeting agenda (happens on Mondays) diff --git a/cluster-autoscaler/cloudprovider/kwok/docs/motivation.md b/cluster-autoscaler/cloudprovider/kwok/docs/motivation.md index c117523e2345..c83aadf8fe4c 100644 --- a/cluster-autoscaler/cloudprovider/kwok/docs/motivation.md +++ b/cluster-autoscaler/cloudprovider/kwok/docs/motivation.md @@ -1,16 +1,17 @@ # KWOK (Kubernetes without Kubelet) cloud provider +*This doc was originally a part of https://github.com/kubernetes/autoscaler/pull/5869* ## Introduction > [KWOK](https://sigs.k8s.io/kwok) is a toolkit that enables setting up a cluster of thousands of Nodes in seconds. Under the scene, all Nodes are simulated to behave like real ones, so the overall approach employs a pretty low resource footprint that you can easily play around on your laptop. https://kwok.sigs.k8s.io/ ## Problem -### 1. It is hard to reproduce an issue happening at scale on local machine +### 1. It is hard to reproduce an issue happening at scale on local machine e.g., https://github.com/kubernetes/autoscaler/issues/5769 -To reproduce such issues, we have the following options today: -### (a) setup [Kubemark](https://github.com/kubernetes/design-proposals-archive/blob/main/scalability/kubemark.md) on a public cloud provider and try reproducing the issue +To reproduce such issues, we have the following options today: +### (a) setup [Kubemark](https://github.com/kubernetes/design-proposals-archive/blob/main/scalability/kubemark.md) on a public cloud provider and try reproducing the issue You can [setup Kubemark](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md) ([related](https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/pre-existing/README.md)) and use the [`kubemark` cloudprovider](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/kubemark) (kubemark 
[proposal](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/kubemark_integration.md)) directly or [`cluster-api` cloudprovider with kubemark](https://github.com/kubernetes-sigs/cluster-api-provider-kubemark) In either case, @@ -23,42 +24,42 @@ In either case, https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/pre-existing/README.md#introduction -You need to setup a separate VM (Virtual Machine) with master components to get Kubemark running. +You need to setup a separate VM (Virtual Machine) with master components to get Kubemark running. > Currently we're running HollowNode with a limit of 0.09 CPU core/pod and 220MB of memory. However, if we also take into account the resources absorbed by default cluster addons and fluentD running on the 'external' cluster, this limit becomes ~0.1 CPU core/pod, thus allowing ~10 HollowNodes to run per core (on an "n1-standard-8" VM node). https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md#starting-a-kubemark-cluster -Kubemark can mimic 10 nodes with 1 CPU core. +Kubemark can mimic 10 nodes with 1 CPU core. In reality it might be lesser than 10 nodes, > Using Kubernetes and [kubemark](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scalability/kubemark.md) on GCP we have created a following 1000 node cluster setup: >* 1 master - 1-core VM >* 17 nodes - 8-core VMs, each core running up to 8 Kubemark nodes. >* 1 Kubemark master - 32-core VM ->* 1 dedicated VM for Cluster Autoscaler +>* 1 dedicated VM for Cluster Autoscaler https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md#test-setup -This is a cheaper option than (c) but if you want to setup Kubemark on your local machine you will need a master node and 1 core per 10 fake nodes i.e., if you want to mimic 100 nodes, that's 10 cores of CPU + extra CPU for master node. 
Unless you have 10-12 free cores on your local machine, it is hard to run scale tests with Kubemark for nodes > 100. +This is a cheaper option than (c) but if you want to setup Kubemark on your local machine you will need a master node and 1 core per 10 fake nodes i.e., if you want to mimic 100 nodes, that's 10 cores of CPU + extra CPU for master node. Unless you have 10-12 free cores on your local machine, it is hard to run scale tests with Kubemark for nodes > 100. -### (b) try to get as much information from the issue reporter as possible and try to reproduce the issue by tweaking our code tests +### (b) try to get as much information from the issue reporter as possible and try to reproduce the issue by tweaking our code tests This works well if the issue is easy to reproduce by tweaking tests e.g., you want to check why scale down is getting blocked on a particular pod. You can do so by mimicking the pod in the tests by adding an entry [here](https://github.com/kubernetes/autoscaler/blob/1009797f5585d7bf778072ba59fd12eb2b8ab83c/cluster-autoscaler/utils/drain/drain_test.go#L878-L887) and running ``` cluster-autoscaler/utils/drain$ go test -run TestDrain ``` -But when you want to test an issue related to scale e.g., CA is slow in scaling up, it is hard to do. -### (c) try reproducing the issue using the same CA setup as user with actual nodes in a public cloud provider +But when you want to test an issue related to scale e.g., CA is slow in scaling up, it is hard to do. +### (c) try reproducing the issue using the same CA setup as user with actual nodes in a public cloud provider e.g., if the issue reporter has a 200 node cluster in AWS, try creating a 200 node cluster in AWS and use the same CA flags as the issue reporter. -This is a viable option if you already have a cluster running with a similar size but otherwise creating a big cluster just to reproduce the issue is costly. 
+This is a viable option if you already have a cluster running with a similar size but otherwise creating a big cluster just to reproduce the issue is costly. ### 2. It is hard to confirm behavior of CA at scale For example, a user with a big Kubernetes cluster (> 100-200 nodes) wants to check if adding scheduling properties to their workloads (node affinity, pod affinity, node selectors etc.,) leads to better utilization of the nodes (which saves cost). To give a more concrete example, imagine a situation like this: 1. There is a cluster with > 100 nodes. cpu to memory ratio for the nodes is 1:1, 1:2, 1:8 and 1:16 2. It is observed that 1:16 nodes are underutilized on memory 3. It is observed that workloads with cpu to memory ratio of 1:7 are getting scheduled on 1:16 nodes thereby leaving some memory unused -e.g., +e.g., 1:16 node looks like this: CPUs: 8 Cores Memory: 128Gi @@ -83,7 +84,7 @@ resources wasted on the node: 8 % 1 CPU(s) + 64 % 7 Gi If 1:7 can somehow be scheduled on 1:8 node using node selector or required node affinity, the wastage would go down. User wants to add required node affinity on 1:7 workloads and see how CA would behave without creating actual nodes in public cloud provider. The goal here is to see if the theory is true and if there are any side-effects. -This can be done with Kubemark today but a public cloud provider would be needed to mimic the cluster of this size. It can't be done on a local cluster (kind/minikube etc.,). +This can be done with Kubemark today but a public cloud provider would be needed to mimic the cluster of this size. It can't be done on a local cluster (kind/minikube etc.,). ### How does it look in action? You can check it [here](https://github.com/kubernetes/autoscaler/issues/5769#issuecomment-1590541506). @@ -91,16 +92,16 @@ You can check it [here](https://github.com/kubernetes/autoscaler/issues/5769#iss ### FAQ 1. 
**Will this be patched back to older releases of Kubernetes?** - As of writing this, the plan is to release it as a part of Kubernetes 1.28 and patch it back to 1.27 and 1.26. -2. **Why did we not use GRPC or cluster-api provider to implement this?** + As of writing this, the plan is to release it as a part of Kubernetes 1.28 and patch it back to 1.27 and 1.26. +2. **Why did we not use GRPC or cluster-api provider to implement this?** The idea was to enable users/contributors to be able to scale-test issues around different cloud providers (e.g., https://github.com/kubernetes/autoscaler/issues/5769). Implementing the `kwok` provider in-tree means we are closer to the actual implementation of our most-used cloud providers (adding gRPC communication in between would mean an extra delay which is not there in our in-tree cloud providers). Although only in-tree provider is a part of this proposal, overall plan is to: * Implement in-tree provider to cover most of the common use-cases * Implement `kwok` provider for `clusterapi` provider so that we can provision `kwok` nodes using `clusterapi` provider ([someone is already working on this](https://kubernetes.slack.com/archives/C8TSNPY4T/p1685648610609449)) * Implement gRPC provider if there is user demand -3. **How performant is `kwok` provider really compared to `kubemark` provider?** +3. **How performant is `kwok` provider really compared to `kubemark` provider?** `kubemark` provider seems to need 1 core per 8-10 nodes (based on our [last scale tests](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md#test-setup)). This means we need roughly 10 cores to simulate 100 nodes in `kubemark`. -`kwok` provider can simulate 385 nodes for 122m of CPU and 521Mi of memory. This means, CPU wise `kwok` can simulate 385 / 0.122 =~ 3155 nodes per 1 core of CPU. +`kwok` provider can simulate 385 nodes for 122m of CPU and 521Mi of memory. 
This means, CPU wise `kwok` can simulate 385 / 0.122 =~ 3155 nodes per 1 core of CPU. ![](images/kwok-provider-grafana.png) ![](images/kwok-provider-in-action.png) -4. **Can I think of `kwok` as a dry-run for my actual `cloudprovider`?** -That is the goal but note that the definition of what exactly `dry-run` means is not very clear and can mean different things for different users. You can think of it as something similar to a `dry-run`. \ No newline at end of file +4. **Can I think of `kwok` as a dry-run for my actual `cloudprovider`?** +That is the goal but note that the definition of what exactly `dry-run` means is not very clear and can mean different things for different users. You can think of it as something similar to a `dry-run`.
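
The per-core figures quoted in FAQ 3 can be reproduced with a short calculation. This is only a sketch: the function names `nodes_per_core` and `cores_needed` are hypothetical, the inputs are the numbers quoted in the FAQ (385 nodes at 122m CPU for `kwok`, 8-10 nodes per core for `kubemark`), and it assumes CPU usage scales linearly with node count, which the measurements above do not by themselves confirm.

```python
def nodes_per_core(nodes: int, cpu_millicores: float) -> float:
    """Simulated nodes per full CPU core, given a measured CPU usage in millicores."""
    if cpu_millicores <= 0:
        raise ValueError("cpu_millicores must be positive")
    return nodes * 1000 / cpu_millicores


def cores_needed(target_nodes: int, per_core: float) -> float:
    """CPU cores needed to simulate target_nodes, assuming linear scaling (an assumption)."""
    if per_core <= 0:
        raise ValueError("per_core must be positive")
    return target_nodes / per_core


# Figures quoted in FAQ 3: 385 kwok nodes for 122m of CPU.
kwok_per_core = nodes_per_core(385, 122)

# Figures quoted in FAQ 3: kubemark needs 1 core per 8-10 nodes.
kubemark_best = cores_needed(100, 10)
kubemark_worst = cores_needed(100, 8)
kwok_for_100 = cores_needed(100, kwok_per_core)

if __name__ == "__main__":
    print(f"kwok: about {int(kwok_per_core)} nodes per core")
    print(f"kubemark: {kubemark_best} to {kubemark_worst} cores for 100 nodes")
    print(f"kwok: about {kwok_for_100:.3f} cores for 100 nodes")
```

Under the linear-scaling assumption, 100 simulated nodes cost 10 to 12.5 cores with `kubemark` and a small fraction of one core with `kwok`, which matches the comparison made in the FAQ.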