Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider Signed-off-by: vadasambar <[email protected]> feat: add builder for `kwok` - add logic to scale up and scale down nodes in `kwok` provider Signed-off-by: vadasambar <[email protected]> feat: wip parse node templates from file Signed-off-by: vadasambar <[email protected]> docs: add short README Signed-off-by: vadasambar <[email protected]> feat: implement remaining things - to get the provider in a somewhat working state Signed-off-by: vadasambar <[email protected]> docs: add in-cluster `kwok` as pre-requisite in the README Signed-off-by: vadasambar <[email protected]> fix: templates file not correctly marshalling into node list Signed-off-by: vadasambar <[email protected]> fix: `invalid leading UTF-8 octet` error during template parsing - remove encoding using `gob` - not required Signed-off-by: vadasambar <[email protected]> fix: use lister to get and list - instead of uncached kube client - add lister as a field on the provider and nodegroup struct Signed-off-by: vadasambar <[email protected]> fix: `did not find nodegroup annotation` error - CA was thinking the annotation is not present even though it is - fix a bug with parsing annotation Signed-off-by: vadasambar <[email protected]> fix: CA node recognizing fake nodegroups - add provider ID to nodes in the format `kwok:<node-name>` - fix invalid `KwokManagedAnnotation` - sanitize template nodes (remove `resourceVersion` etc.,) - not sanitizing the node leads to error during creation of new nodes - abstract code to get NG name into a separate function `getNGNameFromAnnotation` Signed-off-by: vadasambar <[email protected]> fix: node not getting deleted Signed-off-by: vadasambar <[email protected]> test: add empty test file Signed-off-by: vadasambar <[email protected]> chore: add OWNERS file Signed-off-by: vadasambar <[email protected]> feat: wip kwok provider config - add samples for static and dynamic template nodes Signed-off-by: vadasambar <[email protected]> feat: wip implement pulling node templates from cluster - add status field to kwok provider config - this is to capture how the nodes would be grouped by (can be annotation or label) - use kwok provider config status to get ng name from the node template Signed-off-by: vadasambar <[email protected]> fix: syntax error in calling `loadNodeTemplatesFromCluster` Signed-off-by: vadasambar <[email protected]> feat: first draft of dynamic node templates - this allows node templates to be pulled from the cluster - instead of having to specify static templates manually Signed-off-by: vadasambar <[email protected]> fix: syntax error Signed-off-by: vadasambar <[email protected]> refactor: abstract out related code into separate files - use named constants instead of hardcoded values Signed-off-by: vadasambar <[email protected]> feat: cleanup kwok nodes when CA is exiting - so that the user doesn't have to cleanup the fake nodes themselves Signed-off-by: vadasambar <[email protected]> refactor: return `nil` instead of err for `HasInstance` - because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented` Signed-off-by: vadasambar <[email protected]> test: start working on tests for kwok provider config Signed-off-by: vadasambar <[email protected]> feat: add `gpuLabelKey` under `nodes` field in kwok provider config - fix validation for kwok provider config Signed-off-by: vadasambar <[email protected]> docs: add motivation doc - update README with more details Signed-off-by: vadasambar <[email protected]> feat: update kwok provider config example to support pulling gpu labels and types from existing providers - still needs to be implemented in the code Signed-off-by: vadasambar <[email protected]> feat: wip update kwok provider config to get gpu label and available types Signed-off-by: vadasambar <[email protected]> feat: wip read gpu label and available types from specified provider - add available gpu types in kwok provider config status Signed-off-by: vadasambar <[email protected]> feat: add validation for gpu fields in kwok provider config - load gpu related fields in kwok provider config status Signed-off-by: vadasambar <[email protected]> feat: implement `GetAvailableGPUTypes` Signed-off-by: vadasambar <[email protected]> feat: add support to install and uninstall kwok - add option to disable installation - add option to manually specify kwok release tag - add future scope in readme Signed-off-by: vadasambar <[email protected]> docs: add future scope 'evaluate adding support to check if kwok controller already exists' Signed-off-by: vadasambar <[email protected]> fix: vendor conflict and cyclic import - remove support to get gpu config from the specified provider (can't be used because leads to cyclic import) Signed-off-by: vadasambar <[email protected]> docs: add a TODO 'get gpu config from other providers' Signed-off-by: vadasambar <[email protected]> refactor: rename `file` -> `configmap` - load config and templates from configmap instead of file - move `nodes` and `nodegroups` config to top level - add helper to encode configmap data into `[]bytes` - add helper to get current pod namespace Signed-off-by: vadasambar <[email protected]> feat: add new options to the kwok provider config - auto install kwok only if the version is >= v0.4.0 - add test for `GPULabel()` - use `kubectl apply` way of installing kwok instead of kustomize - add test for kwok helpers - add test for kwok config - inject service account name in CA deployment - add example configmap for node templates and kwok provider config in CA helm chart - add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests) - update kwok provider sample configs - update `README` Signed-off-by: vadasambar <[email protected]> chore: update go.mod to use v1.28 packages Signed-off-by: vadasambar <[email protected]> chore: `go mod tidy` and `go mod vendor` (again) Signed-off-by: vadasambar <[email protected]> refactor: kwok installation code - add functions to create and delete clusterrolebinding to create kwok resources - refactor kwok install and uninstall fns - delete manifests in the opposite order of install ] - add cleaning up left-over kwok installation to future scope Signed-off-by: vadasambar <[email protected]> fix: nil ptr error - add `TODO` in README for adding docs around kwok config fields Signed-off-by: vadasambar <[email protected]> refactor: remove code to automatically install and uninstall `kwok` - installing/uninstalling requires strong permissions to be granted to `kwok` - granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase - this can pose a security risk - I have removed the code related to install and uninstall for now - will proceed after discussion with the community Signed-off-by: vadasambar <[email protected]> chore: run `go mod tidy` and `go mod vendor` Signed-off-by: vadasambar <[email protected]> fix: add permission to create nodes - to fix permissions error for kwok provider Signed-off-by: vadasambar <[email protected]> test: add more unit tests - add tests for kwok helpers - fix and update kwok config tests - fix a bug where gpu label was getting assigned to `kwokConfig.status.key` - expose `loadConfigFile` -> `LoadConfigFile` - throw error if templates configmap does not have `templates` key (value of which is node templates) - finish test for `GPULabel()` - add tests for `NodeGroupForNode()` - expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap` - fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty) Signed-off-by: vadasambar <[email protected]> refactor: abstract provider ID code into `getProviderID` fn - fix provider name in test `kwok` -> `kwok:kind-worker-xxx` Signed-off-by: vadasambar <[email protected]> chore: run `go mod vendor` and `go mod tidy Signed-off-by: vadasambar <[email protected]> docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label` Signed-off-by: vadasambar <[email protected]> refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"` - `"kubernetes.io/hostname"` leads to infinite scale-up Signed-off-by: vadasambar <[email protected]> feat: support running CA with kwok provider locally Signed-off-by: vadasambar <[email protected]> refactor: use global informer factory Signed-off-by: vadasambar <[email protected]> refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates Signed-off-by: vadasambar <[email protected]> refactor: `Cleanup()` logic - clean up only nodes managed by the kwok provider Signed-off-by: vadasambar <[email protected]> fix/refactor: nodegroup creation logic - fix issue where fake node was getting created which caused fatal error - use ng annotation to keep track of nodegroups - (when creating nodegroups) don't process nodes which don't have the right ng nabel - suffix ng name with unix timestamp Signed-off-by: vadasambar <[email protected]> refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup` - pass only the required node lister to cloud provider instead of the entire informer factory - pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config - implement fake node lister for testing Signed-off-by: vadasambar <[email protected]> test: add test case for dynamic templates in `TestNodeGroupForNode` - remove non-required fields from template node Signed-off-by: vadasambar <[email protected]> test: add tests for `NodeGroups()` - add extra node template without ng selector label to add more variability in the test Signed-off-by: vadasambar <[email protected]> test: write tests for `GetNodeGpuConfig()` Signed-off-by: vadasambar <[email protected]> test: add test for `GetAvailableGPUTypes` Signed-off-by: vadasambar <[email protected]> test: add test for `GetResourceLimiter()` Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()` - abstract error msgs into variables to use them in tests Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn - add check for deleting too many nodes - rename err msg var names to make them consistent Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()` - abstract error msgs into variables (for easy use in tests) Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add test for ng `Nodes()` - add extra test case for `DecreaseTargetSize()` to check lister error Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add test for ng `TemplateNodeInfo` Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): improve tests for `BuildKwokProvider()` - add more test cases - refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()` Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): add test for ng `GetOptions` Signed-off-by: vadasambar <[email protected]> test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test - not doing so leads to failure in other tests - remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail Signed-off-by: vadasambar <[email protected]> chore: bump CA chart version - this is because of changes made related to kwok - fix type `everwhere` -> `everywhere` Signed-off-by: vadasambar <[email protected]> chore: fix linting checks Signed-off-by: vadasambar <[email protected]> chore: address CI lint errors Signed-off-by: vadasambar <[email protected]> chore: generate helm docs for `kwokConfigMapName` - remove `KWOK_CONFIG_MAP_KEY` (not being used in the code) - bump helm chart version Signed-off-by: vadasambar <[email protected]> docs: revise the outline for README - add AEP link to the motivation doc Signed-off-by: vadasambar <[email protected]> docs: wip create an outline for the README - remove `kwok` field from examples (not needed right now) Signed-off-by: vadasambar <[email protected]> docs: add outline for ascii gifs Signed-off-by: vadasambar <[email protected]> refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP` Signed-off-by: vadasambar <[email protected]> docs: update README with info around installation and benefits of using kwok provider - add `Kwok` as a provider in main CA README Signed-off-by: vadasambar <[email protected]> chore: run `go mod vendor` - remove TODOs that are not needed anymore Signed-off-by: vadasambar <[email protected]> docs: finish first draft of README Signed-off-by: vadasambar <[email protected]> fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP` Signed-off-by: vadasambar <[email protected]> refactor: remove redundant/deprecated code Signed-off-by: vadasambar <[email protected]> chore: bump chart version `9.30.1` -> `9.30.2` - because of kwok provider related changes Signed-off-by: vadasambar <[email protected]> chore: fix typo `offical` -> `official` Signed-off-by: vadasambar <[email protected]> chore: remove debug log msg Signed-off-by: vadasambar <[email protected]> docs: add links for getting help Signed-off-by: vadasambar <[email protected]> refactor: fix type in log `external cluster` -> `cluster` Signed-off-by: vadasambar <[email protected]> chore: add newline in chart.yaml to fix CI lint Signed-off-by: vadasambar <[email protected]> docs: fix mistake `sig-kwok` -> `sig-scheduling` - kwok is a part if sig-scheduling (there is no sig-kwok) Signed-off-by: vadasambar <[email protected]> docs: fix type `release"` -> `"release"` Signed-off-by: vadasambar <[email protected]> refactor: pass informer instead of lister to cloud provider builder fn Signed-off-by: vadasambar <[email protected]>
- Loading branch information