
[VMs pool] Add implementation for VMs pool #6951

Open

wenxuan0923 wants to merge 16 commits into master from wenx/vms-impl

Conversation

@wenxuan0923 (Contributor) commented Jun 21, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add the implementation to support VMs pool autoscaling. Right now only the basic scenario is covered: it does not include GPU pools; those will be handled in later PRs.

To test it with VMs pool:

1. Create a cluster in prod: az aks create -n play-vms -g wenxrg --vm-set-type "VirtualMachines" -l eastus
2. Deploy CAS with the command argument --nodes=1:6:nodepool1 and the required environment variables (note: the cluster name and cluster resource group name are required for VMs pools).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer June 21, 2024 00:55
@k8s-ci-robot k8s-ci-robot requested a review from gandhipr June 21, 2024 00:55
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 21, 2024
@k8s-ci-robot (Contributor):

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 21, 2024
@comtalyst (Contributor) left a comment:

Overall looking good, but I'm afraid I don't have permission to approve (I think).
Also, I don't have much expertise in interactions with the AKS API on AgentPool, so I'd rely on you to test.

return nil, err

// Use Service Principal
if len(cfg.AADClientID) > 0 && len(cfg.AADClientSecret) > 0 {
Contributor:

Does this support UseWorkloadIdentityExtension?

@wenxuan0923 (author):

I'm not really sure...but it should be fairly easy to add it if needed later.

Contributor:

It won't be used for managed, but it is the preferred method of authentication for self-hosted at the moment. I would recommend you add it.

Contributor:

Adding to what Robin said, the workload identity extension gets addressed in this function below: https://github.com/kubernetes/autoscaler/pull/6951/files#diff-e18278caa19fbc44db54dcd3c6d8e77d045fff33e3a7cd0b6a65f5a111543b83R288-R354

So I'd look into reusing that function; we use workload identity for local testing and dev.
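
For context, here is a minimal sketch of what the credential branching could look like, assuming the azidentity SDK; the authConfig struct and newCredential function are illustrative stand-ins for the provider's real Config type and client construction, not code from this PR.

// Sketch only: prefer workload identity when enabled, otherwise fall back to a
// service principal (client ID + secret), mirroring the check quoted above.
package azure

import (
	"fmt"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

type authConfig struct {
	TenantID                     string
	AADClientID                  string
	AADClientSecret              string
	UseWorkloadIdentityExtension bool
}

func newCredential(cfg authConfig) (azcore.TokenCredential, error) {
	if cfg.UseWorkloadIdentityExtension {
		// The workload identity webhook injects these environment variables.
		return azidentity.NewWorkloadIdentityCredential(&azidentity.WorkloadIdentityCredentialOptions{
			ClientID:      os.Getenv("AZURE_CLIENT_ID"),
			TenantID:      os.Getenv("AZURE_TENANT_ID"),
			TokenFilePath: os.Getenv("AZURE_FEDERATED_TOKEN_FILE"),
		})
	}
	if len(cfg.AADClientID) > 0 && len(cfg.AADClientSecret) > 0 {
		// Use Service Principal
		return azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
	}
	return nil, fmt.Errorf("no supported credential configuration found")
}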

@wenxuan0923 (author):

Updated the code to support UseWorkloadIdentityExtension

manager *AzureManager
resourceGroup string
manager *AzureManager
resourceGroup string // MC_ resource group for nodes
Contributor:

I'd actually prefer resourceGroup to be nodeResourceGroup, but the current convention is used in other places as well, so I'll save this idea for later.

@wenxuan0923 (author):

Yeah we want to keep this consistent with other places.

}

// scaleUpToCount sets node count for vms agent pool to target value through PUT AP call.
func (agentPool *VMsPool) scaleUpToCount(count int64) error {
Contributor:

I generally prefer single-use methods like this to be part of the code calling them, rather than a separate method, especially when there is no clear boundary of responsibility. I think the case of IncreaseSize and scaleUpToCount (and more) applies.
In case you share that opinion, I don't think you need to be restricted by the pattern in azure_scale_set.go for this case.
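
To make the suggestion concrete, here is a rough sketch (my illustration, not code from the PR) of IncreaseSize with the bounds check and the update done inline; getVMsPoolSize is the helper discussed elsewhere in this review, and updateAgentPoolCount is a hypothetical placeholder for the PUT AgentPool call.

// Sketch of the inlined alternative: IncreaseSize validates the delta and the
// upper bound itself, then issues the agent pool update directly instead of
// delegating to a single-use scaleUpToCount helper.
func (agentPool *VMsPool) IncreaseSize(delta int) error {
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive, have: %d", delta)
	}
	currentSize, err := agentPool.getVMsPoolSize()
	if err != nil {
		return err
	}
	if int(currentSize)+delta > agentPool.MaxSize() {
		return fmt.Errorf("size increase too large: current %d, delta %d, max %d",
			currentSize, delta, agentPool.MaxSize())
	}
	// updateAgentPoolCount stands in for the PUT call that sets the new count.
	return agentPool.updateAgentPoolCount(currentSize + int64(delta))
}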

@wenxuan0923 (author):

I don't have a strong preference here; I just thought it might be a bit easier for maintainers if it follows the same pattern as VMSS. I'd like to collect more feedback on this PR and make the changes together if needed.

@rakechill (Contributor) commented Jul 22, 2024:

it does not include Deallocate scale down policy

FYI, there is no support for this in core, which is almost certainly required for this nodegroup implementation too.

here's the issue + work exploring this: #6202
cc @jackfrancis

for i, providerID := range providerIDs {
// extract the machine name from the providerID by splitting the providerID by '/' and get the last element
// The providerID look like this:
// "azure:///subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"
Member:

Do we want to put real subscription IDs here, since this is open source? Even though they can't do anything with the subscription, it's still probably good not to expose it.

Suggested change
// "azure:///subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"
// "azure:///subscriptions/0000000-0000-0000-0000-00000000000/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"

@wenxuan0923 (author):

Updated the sub as suggested.

Comment on lines 312 to 313
providerIDParts := strings.Split(providerID, "/")
machineNames[i] = &providerIDParts[len(providerIDParts)-1]
Member:

There are a couple of places where we extract these provider ID parts. I'm wondering if a structure similar to https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/utils/subnet_parser.go#L25 might make sense.
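
For illustration, a small parser in that spirit could look like the sketch below; the parsedProviderID type, parseProviderID, and Validate names are hypothetical, not part of this PR or the karpenter code.

import (
	"fmt"
	"strings"
)

// parsedProviderID holds the components of an ID such as
// "azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name>".
type parsedProviderID struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// parseProviderID splits the ID once and validates that every piece is present.
func parseProviderID(providerID string) (parsedProviderID, error) {
	parts := strings.Split(strings.TrimPrefix(providerID, "azure://"), "/")
	p := parsedProviderID{}
	for i := 0; i+1 < len(parts); i++ {
		switch strings.ToLower(parts[i]) {
		case "subscriptions":
			p.SubscriptionID = parts[i+1]
		case "resourcegroups":
			p.ResourceGroup = parts[i+1]
		case "virtualmachines":
			p.Name = parts[i+1]
		}
	}
	return p, p.Validate()
}

// Validate reports an error when the ID did not contain the expected components.
func (p parsedProviderID) Validate() error {
	if p.SubscriptionID == "" || p.ResourceGroup == "" || p.Name == "" {
		return fmt.Errorf("provider ID is missing subscription, resource group, or VM name")
	}
	return nil
}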

Member:

Also, I was wondering if we could attach a consistent Validate() function to that ProviderIDParser that validates the provider ID shape. Do we validate the provider ID shape ahead of this function?

@wenxuan0923 (author), Jul 30, 2024:

We validate node.Spec.ProviderID when we call NodeGroupForNode here:
https://github.com/kubernetes/autoscaler/blob/3914bdebd1e9453396408f1f99efa4cc5eeaf078/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go#L105C1-L114C1

If the node doesn't have a valid ProviderID, CAS won't be able to scale down the corresponding node pool (if I understand correctly), so by the time we reach this code we should be able to assume the ID is valid.

Also, I just noticed that there's an existing utility method for name parsing here; I updated the code to use that instead:

func resourceName(ID string) (string, error) {

}

// DeleteNodes extracts the providerIDs from the node spec and
// delete or deallocate the nodes from the agent pool based on the scale down policy.
// delete or deallocate the nodes based on the scale down policy of agentpool.
Member:

We don't have a deallocate implementation in the upstream codebase here yet, right?

@wenxuan0923 (author), Jul 30, 2024:

We don't have it here in CAS upstream, but AKS-RP is able to deallocate or delete nodes based on the agent pool's properties.scaleDownMode property. If a VMs pool has scaleDownMode == Deallocate, a scale-down triggered by CAS will cause RP to deallocate nodes rather than delete them.

https://learn.microsoft.com/en-us/rest/api/aks/agent-pools/create-or-update?view=rest-aks-2024-02-01&tabs=HTTP#scaledownmode
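
For readers unfamiliar with scaleDownMode, this is roughly how it appears on the agent pool resource using the armcontainerservice SDK types; the module major version, function name, and literal count are assumptions for illustration, and this is AKS-side configuration rather than code in this PR.

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// deallocatingAgentPool sketches an agent pool whose scale-downs deallocate VMs;
// with ScaleDownModeDelete, the same CAS-triggered scale-down would remove the
// VMs entirely.
func deallocatingAgentPool(count int32) armcontainerservice.AgentPool {
	return armcontainerservice.AgentPool{
		Properties: &armcontainerservice.ManagedClusterAgentPoolProfileProperties{
			Count:         to.Ptr(count),
			ScaleDownMode: to.Ptr(armcontainerservice.ScaleDownModeDeallocate),
		},
	}
}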


@@ -102,28 +120,260 @@ func (agentPool *VMsPool) MaxSize() int {
// TargetSize returns the current TARGET size of the node group. It is possible that the
// number is different from the number of nodes registered in Kubernetes.
func (agentPool *VMsPool) TargetSize() (int, error) {
// TODO(wenxuan): Implement this method
return -1, cloudprovider.ErrNotImplemented
size, err := agentPool.getVMsPoolSize()
Member:

Why can't we just implement TargetSize directly rather than deferring to getVMsPoolSize? Can't we use TargetSize in place of this method wherever we need it?

@wenxuan0923 (author), Jul 30, 2024:

I actually followed the exact same pattern as VMSS here; I thought that would make it easier for maintainers:

size, err := scaleSet.GetScaleSetSize()
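
For reference, the pattern being described is roughly the following sketch, mirroring the VMSS code rather than quoting this PR verbatim:

// TargetSize stays a thin wrapper over the cached size getter, matching the
// VMSS GetScaleSetSize pattern; other methods reuse the same getter internally.
func (agentPool *VMsPool) TargetSize() (int, error) {
	size, err := agentPool.getVMsPoolSize()
	return int(size), err
}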


if _, err = poller.PollUntilDone(updateCtx, nil); err == nil {
// success path
klog.Infof("agentPoolClient.BeginCreateOrUpdate for aks cluster %s agentpool %s succeeded", agentPool.clusterName, agentPool.Name)
@Bryce-Soghigian (Member), Jul 29, 2024:

Nit: do we always want to log on success? Won't we in most cases log the result of this function, i.e. the error, so we can infer success that way? Do we log in other nodepool implementations?

We don't do a good job of this now, but we really should take better care with log verbosity across the codebase.

@wenxuan0923 (author):

Log removed

return err
}

klog.V(1).Infof("Scaling up vms pool %s to new target count: %d", agentPool.Name, count)
Member:

The target size is recorded outside of this, IIRC. Did you find that the existing logging was not enough here?

@wenxuan0923 (author):

Log removed


// if the target size is smaller than the min size, return an error
if int(currentSize) <= agentPool.MinSize() {
klog.V(3).Infof("min size %d reached, nodes will not be deleted", agentPool.MinSize())
Member:

In most places in core we log the error from DeleteNodes(), for example:

err = nodeGroup.DeleteNodes(nodesToDelete)

Should we be logging this twice?

@wenxuan0923 (author):

Log removed

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 30, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 30, 2024
Comment on lines 455 to 457
// TemplateNodeInfo is not implemented.
func (agentPool *VMsPool) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
// TODO(wenxuan): implement this method when vms pool can fully support GPU nodepool
Contributor:

Not having this means, I think, that scaling from zero is not supported (even on non-GPU nodepools).
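
For context on the scale-from-zero point, TemplateNodeInfo usually takes roughly the shape sketched below. This is not code from the PR: the capacity values and labels are placeholders, a real implementation would derive resources from the agent pool's VM SKU, and the import aliases (apiv1, metav1, resource, schedulerframework) are the ones conventionally used in the provider.

// Sketch: build a template node describing what a new node in this pool would
// look like, so the autoscaler can simulate scheduling onto an empty pool.
func (agentPool *VMsPool) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
	node := &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("%s-template", agentPool.Name),
			Labels: map[string]string{"agentpool": agentPool.Name},
		},
		Status: apiv1.NodeStatus{
			Capacity: apiv1.ResourceList{
				apiv1.ResourceCPU:    resource.MustParse("4"),    // placeholder vCPU count
				apiv1.ResourceMemory: resource.MustParse("16Gi"), // placeholder memory
				apiv1.ResourcePods:   resource.MustParse("110"),  // default max pods
			},
		},
	}
	node.Status.Allocatable = node.Status.Capacity
	nodeInfo := schedulerframework.NewNodeInfo()
	nodeInfo.SetNode(node)
	return nodeInfo, nil
}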

Comment on lines 376 to 381
agentPool.manager.invalidateCache()
_, err := agentPool.getVMsPoolSize()
if err != nil {
klog.Warningf("DecreaseTargetSize: failed with error: %v", err)
}
return nil
Contributor:

A functionally NOOP implementation deserves a comment.

@wenxuan0923 (author):

Oops, I'm supposed to return err here; updated the code. Thanks for pointing it out!

@tallaxes (Contributor), Aug 2, 2024:

Yes, but my point is a little different.

This implementation of DecreaseTargetSize (admittedly, the same as the current VMSS implementation) does not actually do anything except invalidate the cache, and yet essentially never fails. This is a latent problem: autoscaler core uses it to adjust the size of a nodepool that appears to need adjusting, and not returning an error makes core think the adjustment was made, which can in turn lead to persistent issues (that have been observed). It is not a problem if DecreaseTargetSize never gets called, which is actually the most common case for a well-behaved nodepool. But a correct implementation would likely a) actually adjust the target size (subject to the constraints in the comments for DecreaseTargetSize) and b) return an error if that is not possible or not successful. Not doing this is a choice (or a bug?), and that's what I meant by "NOOP implementation deserves a comment".

The VMSS version likely needs revisiting as well. It does have a comment explaining why it does nothing ("VMSS size should be changed automatically after the Node deletion, hence this operation is not required."), which I don't fully understand or agree with. It feels like it needs to either implement the adjustment properly or return an error. Just returning ErrNotImplemented is an option, though I have not tried it.

So please give some thought to whether a better implementation of DecreaseTargetSize for VM node pools (respecting the constraints outlined in the comments) is possible. For expediency, it is probably OK to leave it as is for now (maybe at least with a comment like "just following the VMSS implementation") and revisit later. If you do, it would be good to validate in testing that it never actually gets called. (Or maybe just change it to return ErrNotImplemented?)
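
To make the constraint concrete, one possible non-NOOP shape is sketched below; this is an illustration under the NodeGroup contract, not a proposal from the PR, and it reuses the PR's scaleUpToCount helper only because that helper simply PUTs a new count.

// Sketch: lower only the target, never existing nodes, and return an error
// whenever the adjustment cannot be made, so core does not assume a change
// that never happened.
func (agentPool *VMsPool) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, have: %d", delta)
	}
	currentTarget, err := agentPool.getVMsPoolSize()
	if err != nil {
		return err
	}
	nodes, err := agentPool.Nodes()
	if err != nil {
		return err
	}
	newTarget := int(currentTarget) + delta
	if newTarget < len(nodes) {
		// The NodeGroup contract forbids deleting existing nodes from this method.
		return fmt.Errorf("attempt to delete existing nodes: new target %d is below %d registered nodes",
			newTarget, len(nodes))
	}
	return agentPool.scaleUpToCount(int64(newTarget))
}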

@tallaxes (Contributor):

/lgtm
/approve

@@ -148,7 +148,7 @@ func (az *azDeploymentsClient) Delete(ctx context.Context, resourceGroupName, de
return future.Response(), err
}

//go:generate sh -c "mockgen k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure AgentPoolsClient >./agentpool_client.go"
//go:generate sh -c "mockgen -source=azure_client.go -destination azure_mock_agentpool_client.go -package azure -exclude_interfaces DeploymentsClient -copyright_file copyright.txt"
@wenxuan0923 (author), Aug 1, 2024:

@rakechill I'm fixing the mockgen command in this PR. I don't know why mockgen's reflect mode isn't working, so I changed it to use source mode instead.

@wenxuan0923 wenxuan0923 force-pushed the wenx/vms-impl branch 2 times, most recently from 355df75 to 1d15fd6 Compare August 2, 2024 00:02
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 16, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2024
@k8s-ci-robot (Contributor):

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 25, 2024
Labels: area/cluster-autoscaler, area/provider/azure, cncf-cla: yes, kind/feature, lifecycle/rotten, needs-rebase, ok-to-test, size/XXL
7 participants