
[VMs pool] Add implementation for VMs pool #6951

Open

wenxuan0923 wants to merge 16 commits into master from wenx/vms-impl

Conversation

@wenxuan0923 (Contributor) commented Jun 21, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add the implementation to support VMs pool autoscaling. Right now only the basic scenario is covered: it does not include GPU pools; those will be handled in later PRs.

To test it with VMs pool:

1. Create a cluster in prod: az aks create -n play-vms -g wenxrg --vm-set-type "VirtualMachines" -l eastus
2. Deploy CAS with the command argument --nodes=1:6:nodepool1 and the required environment variables (note: the cluster name and cluster resource group name are required for VMs pools).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer June 21, 2024 00:55
@k8s-ci-robot k8s-ci-robot requested a review from gandhipr June 21, 2024 00:55
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 21, 2024
@k8s-ci-robot (Contributor):

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 21, 2024
@comtalyst (Contributor) left a comment:

Overall looking good, but I'm afraid I don't have permission to approve (I think).
Also, I don't have much expertise in interactions with the AKS API on AgentPool, so I'd rely on you to test.

return nil, err

// Use Service Principal
if len(cfg.AADClientID) > 0 && len(cfg.AADClientSecret) > 0 {
Contributor:

Does this support UseWorkloadIdentityExtension?

@wenxuan0923 (author):

I'm not really sure...but it should be fairly easy to add it if needed later.

Contributor:

It won't be used for managed, but it is the preferred method of authentication for self-hosted at the moment. I would recommend you add it.

Contributor:

Adding to what Robin said, the workload identity extension gets addressed in this function below: https://github.com/kubernetes/autoscaler/pull/6951/files#diff-e18278caa19fbc44db54dcd3c6d8e77d045fff33e3a7cd0b6a65f5a111543b83R288-R354

So I'd look into reusing that function; we use workload identity for local testing and dev.
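
For context, here is a minimal sketch of what the credential branching could look like, assuming the azidentity SDK; the authConfig struct and newCredential function are illustrative stand-ins for the provider's real Config type and client construction, not code from this PR.

// Sketch only: prefer workload identity when enabled, otherwise fall back to a
// service principal (client ID + secret), mirroring the check quoted above.
package azure

import (
	"fmt"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

type authConfig struct {
	TenantID                     string
	AADClientID                  string
	AADClientSecret              string
	UseWorkloadIdentityExtension bool
}

func newCredential(cfg authConfig) (azcore.TokenCredential, error) {
	if cfg.UseWorkloadIdentityExtension {
		// The workload identity webhook injects these environment variables.
		return azidentity.NewWorkloadIdentityCredential(&azidentity.WorkloadIdentityCredentialOptions{
			ClientID:      os.Getenv("AZURE_CLIENT_ID"),
			TenantID:      os.Getenv("AZURE_TENANT_ID"),
			TokenFilePath: os.Getenv("AZURE_FEDERATED_TOKEN_FILE"),
		})
	}
	if len(cfg.AADClientID) > 0 && len(cfg.AADClientSecret) > 0 {
		// Use Service Principal
		return azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
	}
	return nil, fmt.Errorf("no supported credential configuration found")
}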

@wenxuan0923 (author):

Updated the code to support UseWorkloadIdentityExtension

manager *AzureManager
resourceGroup string
manager *AzureManager
resourceGroup string // MC_ resource group for nodes
Contributor:

I'd actually prefer resourceGroup to be nodeResourceGroup, but the current convention is used in other places as well, so I'll save this idea for later.

@wenxuan0923 (author):

Yeah we want to keep this consistent with other places.

}

// scaleUpToCount sets node count for vms agent pool to target value through PUT AP call.
func (agentPool *VMsPool) scaleUpToCount(count int64) error {
Contributor:

I generally prefer single-use methods like this to be part of the code calling them, rather than a separate method, especially when there is no clear boundary of responsibility. I think the case of IncreaseSize and scaleUpToCount (and more) applies.
In case you share that opinion, I don't think you need to be restricted by the pattern in azure_scale_set.go for this case.
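
To make the suggestion concrete, here is a rough sketch (my illustration, not code from the PR) of IncreaseSize with the bounds check and the update done inline; getVMsPoolSize is the helper discussed elsewhere in this review, and updateAgentPoolCount is a hypothetical placeholder for the PUT AgentPool call.

// Sketch of the inlined alternative: IncreaseSize validates the delta and the
// upper bound itself, then issues the agent pool update directly instead of
// delegating to a single-use scaleUpToCount helper.
func (agentPool *VMsPool) IncreaseSize(delta int) error {
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive, have: %d", delta)
	}
	currentSize, err := agentPool.getVMsPoolSize()
	if err != nil {
		return err
	}
	if int(currentSize)+delta > agentPool.MaxSize() {
		return fmt.Errorf("size increase too large: current %d, delta %d, max %d",
			currentSize, delta, agentPool.MaxSize())
	}
	// updateAgentPoolCount stands in for the PUT call that sets the new count.
	return agentPool.updateAgentPoolCount(currentSize + int64(delta))
}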

@wenxuan0923 (author):

I don't have a strong preference here; I just thought it might be a bit easier for maintainers if it follows the same pattern as VMSS. I'd like to collect more feedback on this PR and make the changes together if needed.

@rakechill (Contributor) commented Jul 22, 2024:

it does not include Deallocate scale down policy

FYI, there is no support for this in core, which is almost certainly required for this nodegroup implementation too.

here's the issue + work exploring this: #6202
cc @jackfrancis

for i, providerID := range providerIDs {
// extract the machine name from the providerID by splitting the providerID by '/' and get the last element
// The providerID look like this:
// "azure:///subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"
Member:

Do we want to put real subscription IDs here, since this is open source? Even though they can't do anything with the subscription, it's still probably good not to expose it.

Suggested change
// "azure:///subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"
// "azure:///subscriptions/0000000-0000-0000-0000-00000000000/resourceGroups/mc_wxrg_play-vms_eastus/providers/Microsoft.Compute/virtualMachines/aks-nodes-32301838-vms0"

@wenxuan0923 (author):

Updated the sub as suggested.

Comment on lines 312 to 313
providerIDParts := strings.Split(providerID, "/")
machineNames[i] = &providerIDParts[len(providerIDParts)-1]
Member:

There are a couple of places where we extract these provider ID parts. I'm wondering if a structure similar to https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/utils/subnet_parser.go#L25 might make sense.
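
For illustration, a small parser in that spirit could look like the sketch below; the parsedProviderID type, parseProviderID, and Validate names are hypothetical, not part of this PR or the karpenter code.

import (
	"fmt"
	"strings"
)

// parsedProviderID holds the components of an ID such as
// "azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name>".
type parsedProviderID struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// parseProviderID splits the ID once and validates that every piece is present.
func parseProviderID(providerID string) (parsedProviderID, error) {
	parts := strings.Split(strings.TrimPrefix(providerID, "azure://"), "/")
	p := parsedProviderID{}
	for i := 0; i+1 < len(parts); i++ {
		switch strings.ToLower(parts[i]) {
		case "subscriptions":
			p.SubscriptionID = parts[i+1]
		case "resourcegroups":
			p.ResourceGroup = parts[i+1]
		case "virtualmachines":
			p.Name = parts[i+1]
		}
	}
	return p, p.Validate()
}

// Validate reports an error when the ID did not contain the expected components.
func (p parsedProviderID) Validate() error {
	if p.SubscriptionID == "" || p.ResourceGroup == "" || p.Name == "" {
		return fmt.Errorf("provider ID is missing subscription, resource group, or VM name")
	}
	return nil
}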

Member:

Also, I was wondering if we could attach a consistent Validate() function to that ProviderIDParser that validates the provider ID shape. Do we validate the provider ID shape ahead of this function?

@wenxuan0923 (author), Jul 30, 2024:

We validate node.Spec.ProviderID when we call NodeGroupForNode here:
https://github.com/kubernetes/autoscaler/blob/3914bdebd1e9453396408f1f99efa4cc5eeaf078/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go#L105C1-L114C1

If the node doesn't have a valid ProviderID, CAS won't be able to scale down the corresponding node pool (if I understand correctly), so by the time we reach this code we should be able to assume the ID is valid.

Also, I just noticed that there's an existing utility method for name parsing here; I updated the code to use that instead:

func resourceName(ID string) (string, error) {

}

// DeleteNodes extracts the providerIDs from the node spec and
// delete or deallocate the nodes from the agent pool based on the scale down policy.
// delete or deallocate the nodes based on the scale down policy of agentpool.
Member:

We don't have a deallocate implementation in the upstream codebase here yet, right?

@wenxuan0923 (author), Jul 30, 2024:

We don't have it here in CAS upstream, but AKS-RP is able to deallocate or delete nodes based on the agent pool's properties.scaleDownMode property. If a VMs pool has scaleDownMode == Deallocate, a scale-down triggered by CAS will cause RP to deallocate nodes rather than delete them.

https://learn.microsoft.com/en-us/rest/api/aks/agent-pools/create-or-update?view=rest-aks-2024-02-01&tabs=HTTP#scaledownmode
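
For readers unfamiliar with scaleDownMode, this is roughly how it appears on the agent pool resource using the armcontainerservice SDK types; the module major version, function name, and literal count are assumptions for illustration, and this is AKS-side configuration rather than code in this PR.

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// deallocatingAgentPool sketches an agent pool whose scale-downs deallocate VMs;
// with ScaleDownModeDelete, the same CAS-triggered scale-down would remove the
// VMs entirely.
func deallocatingAgentPool(count int32) armcontainerservice.AgentPool {
	return armcontainerservice.AgentPool{
		Properties: &armcontainerservice.ManagedClusterAgentPoolProfileProperties{
			Count:         to.Ptr(count),
			ScaleDownMode: to.Ptr(armcontainerservice.ScaleDownModeDeallocate),
		},
	}
}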


@@ -102,28 +120,260 @@ func (agentPool *VMsPool) MaxSize() int {
// TargetSize returns the current TARGET size of the node group. It is possible that the
// number is different from the number of nodes registered in Kubernetes.
func (agentPool *VMsPool) TargetSize() (int, error) {
// TODO(wenxuan): Implement this method
return -1, cloudprovider.ErrNotImplemented
size, err := agentPool.getVMsPoolSize()
Member:

Why can't we just implement TargetSize directly rather than deferring to getVMsPoolSize? Can't we use TargetSize in place of this method wherever we need it?

@wenxuan0923 (author), Jul 30, 2024:

I actually followed the exact same pattern as VMSS here; I thought that would make it easier for maintainers:

size, err := scaleSet.GetScaleSetSize()
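
For reference, the pattern being described is roughly the following sketch, mirroring the VMSS code rather than quoting this PR verbatim:

// TargetSize stays a thin wrapper over the cached size getter, matching the
// VMSS GetScaleSetSize pattern; other methods reuse the same getter internally.
func (agentPool *VMsPool) TargetSize() (int, error) {
	size, err := agentPool.getVMsPoolSize()
	return int(size), err
}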


if _, err = poller.PollUntilDone(updateCtx, nil); err == nil {
// success path
klog.Infof("agentPoolClient.BeginCreateOrUpdate for aks cluster %s agentpool %s succeeded", agentPool.clusterName, agentPool.Name)
@Bryce-Soghigian (Member), Jul 29, 2024:

Nit: do we always want to log on success? Won't we in most cases log the result of this function, i.e. the error, so we can infer success that way? Do we log in other nodepool implementations?

We don't do a good job of this now, but we really should take better care with log verbosity across the codebase.

@wenxuan0923 (author):

Log removed

return err
}

klog.V(1).Infof("Scaling up vms pool %s to new target count: %d", agentPool.Name, count)
Member:

The target size is recorded outside of this, IIRC. Did you find that the existing logging was not enough here?

@wenxuan0923 (author):

Log removed


// if the target size is smaller than the min size, return an error
if int(currentSize) <= agentPool.MinSize() {
klog.V(3).Infof("min size %d reached, nodes will not be deleted", agentPool.MinSize())
Member:

In most places in core we log the error from DeleteNodes(), for example:

err = nodeGroup.DeleteNodes(nodesToDelete)

Should we be logging this twice?

@wenxuan0923 (author):

Log removed

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 30, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 30, 2024
Comment on lines 455 to 457
// TemplateNodeInfo is not implemented.
func (agentPool *VMsPool) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
// TODO(wenxuan): implement this method when vms pool can fully support GPU nodepool
Contributor:

Not having this means, I think, that scaling from zero is not supported (even on non-GPU nodepools).
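
For context on the scale-from-zero point, TemplateNodeInfo usually takes roughly the shape sketched below. This is not code from the PR: the capacity values and labels are placeholders, a real implementation would derive resources from the agent pool's VM SKU, and the import aliases (apiv1, metav1, resource, schedulerframework) are the ones conventionally used in the provider.

// Sketch: build a template node describing what a new node in this pool would
// look like, so the autoscaler can simulate scheduling onto an empty pool.
func (agentPool *VMsPool) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
	node := &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("%s-template", agentPool.Name),
			Labels: map[string]string{"agentpool": agentPool.Name},
		},
		Status: apiv1.NodeStatus{
			Capacity: apiv1.ResourceList{
				apiv1.ResourceCPU:    resource.MustParse("4"),    // placeholder vCPU count
				apiv1.ResourceMemory: resource.MustParse("16Gi"), // placeholder memory
				apiv1.ResourcePods:   resource.MustParse("110"),  // default max pods
			},
		},
	}
	node.Status.Allocatable = node.Status.Capacity
	nodeInfo := schedulerframework.NewNodeInfo()
	nodeInfo.SetNode(node)
	return nodeInfo, nil
}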

Comment on lines 376 to 381
agentPool.manager.invalidateCache()
_, err := agentPool.getVMsPoolSize()
if err != nil {
klog.Warningf("DecreaseTargetSize: failed with error: %v", err)
}
return nil
Contributor:

A functionally NOOP implementation deserves a comment.

@wenxuan0923 (author):

Oops, I'm supposed to return err here; updated the code. Thanks for pointing it out!

@tallaxes (Contributor), Aug 2, 2024:

Yes, but my point is a little different.

This implementation of DecreaseTargetSize (admittedly, the same as the current VMSS implementation) does not actually do anything except invalidate the cache, and yet essentially never fails. This is a latent problem: autoscaler core uses it to adjust the size of a nodepool that appears to need adjusting, and not returning an error makes core think the adjustment was made, which can in turn lead to persistent issues (that have been observed). It is not a problem if DecreaseTargetSize never gets called, which is actually the most common case for a well-behaved nodepool. But a correct implementation would likely a) actually adjust the target size (subject to the constraints in the comments for DecreaseTargetSize) and b) return an error if that is not possible or not successful. Not doing this is a choice (or a bug?), and that's what I meant by "NOOP implementation deserves a comment".

The VMSS version likely needs revisiting as well. It does have a comment explaining why it does nothing ("VMSS size should be changed automatically after the Node deletion, hence this operation is not required."), which I don't fully understand or agree with. It feels like it needs to either implement the adjustment properly or return an error. Just returning ErrNotImplemented is an option, though I have not tried it.

So please give some thought to whether a better implementation of DecreaseTargetSize for VM node pools (respecting the constraints outlined in the comments) is possible. For expediency, it is probably OK to leave it as is for now (maybe at least with a comment like "just following the VMSS implementation") and revisit later. If you do, it would be good to validate in testing that it never actually gets called. (Or maybe just change it to return ErrNotImplemented?)
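
To make the constraint concrete, one possible non-NOOP shape is sketched below; this is an illustration under the NodeGroup contract, not a proposal from the PR, and it reuses the PR's scaleUpToCount helper only because that helper simply PUTs a new count.

// Sketch: lower only the target, never existing nodes, and return an error
// whenever the adjustment cannot be made, so core does not assume a change
// that never happened.
func (agentPool *VMsPool) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, have: %d", delta)
	}
	currentTarget, err := agentPool.getVMsPoolSize()
	if err != nil {
		return err
	}
	nodes, err := agentPool.Nodes()
	if err != nil {
		return err
	}
	newTarget := int(currentTarget) + delta
	if newTarget < len(nodes) {
		// The NodeGroup contract forbids deleting existing nodes from this method.
		return fmt.Errorf("attempt to delete existing nodes: new target %d is below %d registered nodes",
			newTarget, len(nodes))
	}
	return agentPool.scaleUpToCount(int64(newTarget))
}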

@tallaxes (Contributor):

/lgtm
/approve

@@ -148,7 +148,7 @@ func (az *azDeploymentsClient) Delete(ctx context.Context, resourceGroupName, de
return future.Response(), err
}

//go:generate sh -c "mockgen k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure AgentPoolsClient >./agentpool_client.go"
//go:generate sh -c "mockgen -source=azure_client.go -destination azure_mock_agentpool_client.go -package azure -exclude_interfaces DeploymentsClient -copyright_file copyright.txt"
@wenxuan0923 (author), Aug 1, 2024:

@rakechill I'm fixing the mockgen command in this PR. I don't know why mockgen's reflect mode isn't working, so I changed it to use source mode instead.

@wenxuan0923 wenxuan0923 force-pushed the wenx/vms-impl branch 2 times, most recently from 355df75 to 1d15fd6 Compare August 2, 2024 00:02
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 16, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2024
@k8s-ci-robot (Contributor):

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 25, 2024
Labels: area/cluster-autoscaler, area/provider/azure, cncf-cla: yes, kind/feature, lifecycle/rotten, needs-rebase, ok-to-test, size/XXL
7 participants