
Fix: Check MinSize in fix nodegroup size #6650

Open · wants to merge 8 commits into base: master

Conversation

daimaxiaxie (Contributor)

What type of PR is this?

/kind bug

What this PR does / why we need it:

When an ASG is out of stock and the instance count falls below MinSize (especially with spot instances), the entire Cluster Autoscaler gets stuck because deleteCreatedNodesWithErrors and fixNodeGroupSize return errors.

Cluster Autoscaler cannot handle this situation well.
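
For illustration, a rough sketch of the failure mode (a hypothetical helper, not the actual AWS provider code): when the ASG cannot launch instances and the running count is already below MinSize, the attempt to shrink the target size back down is rejected, so the loop fails on every iteration.

```go
package sketch

import "fmt"

// decreaseTargetSize mimics a provider-side guard (e.g. an ASG wrapper) that
// refuses to set a target size below MinSize. Hypothetical sketch used only to
// illustrate why fixNodeGroupSize keeps failing in this situation.
func decreaseTargetSize(target, delta, minSize int) (int, error) {
	newTarget := target + delta // delta is negative for a decrease
	if newTarget < minSize {
		// fixNodeGroupSize propagates an error like this, so every autoscaler
		// loop fails at the same point and CA appears stuck.
		return target, fmt.Errorf("cannot decrease target to %d: below MinSize %d", newTarget, minSize)
	}
	return newTarget, nil
}
```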

Which issue(s) this PR fixes:

Fixes #6601

Special notes for your reviewer:

Respect MinSize.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/cluster-autoscaler labels Mar 22, 2024
@k8s-ci-robot k8s-ci-robot requested review from vadasambar and x13n March 22, 2024 04:46
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 22, 2024
@k8s-ci-robot (Contributor)

Hi @daimaxiaxie. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 22, 2024
@towca (Collaborator)

towca commented Apr 19, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 19, 2024
@MaciekPytel (Contributor)

This PR fixes the problem for ASGs, but at the same time breaks it for other cloud providers. We generally don't assume min size is enforced on the provider side (i.e. the expectation is that if CA decides to delete a node, it can do so), and the current behavior of going below it is the desired one. The reason is that there is no point keeping a VM that cannot register; it's not really providing capacity to the kubernetes cluster. Min size is there to guarantee a minimum capacity of the cluster, but a node that cannot register shouldn't really be counted.

We need a more nuanced solution to the problem, one that allows CA to delete unregistered nodes below min size, while giving a chance to providers to handle it. IIRC we discussed extending cloudprovider to support this with @Bryce-Soghigian - do I remember correctly that you have a fix for this that you wanted to upstream?

@MaciekPytel (Contributor)

Actually, a good question is why we have such a check in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L785. That feels really inconsistent: why are we OK with going below min in deleteCreatedNodesWithErrors, but not in removeOldUnregisteredNodes?

Sure, in the case of deleteCreatedNodesWithErrors we're probably deleting some placeholders and not actual VMs, but what exactly those placeholders are differs by provider, and at the end of the day we're changing the target size to be less than min size. If we say that's OK - and I think we should, since per my comment above the meaningful interpretation of min size is "minimum number of nodes" and a VM that cannot register is not really a node - we should do it in all paths where VMs fail to register as nodes. Thoughts @x13n @towca @BigDarkClown ?
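
As a rough paraphrase of the asymmetry being questioned here (hypothetical helper names, not the actual static_autoscaler.go code): removeOldUnregisteredNodes skips deletions that would take the group below MinSize, while deleteCreatedNodesWithErrors has no such guard.

```go
package sketch

// Hypothetical paraphrase of the asymmetry discussed above; not the actual
// CA core code.

// removeOldUnregisteredNodes-style guard: skip the deletion once it would take
// the group below MinSize.
func mayRemoveUnregistered(targetSize, alreadyRemoved, minSize int) bool {
	return targetSize-alreadyRemoved > minSize
}

// deleteCreatedNodesWithErrors-style behavior: nodes that failed to create are
// deleted regardless, even if the resulting target size ends up below MinSize.
func mayRemoveCreateError() bool {
	return true
}
```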

@Bryce-Soghigian (Member)

Yep @MaciekPytel #6658

The idea is that when the node count is below min count, we currently never remove unregistered nodes. This draft PR will add a respectMinCount argument to the delete nodes interface that will allow us to make min-count-violating deletions for unregistered nodes.

I need to sit down and finish it. I wanted to upstream the change so that all cloud providers could benefit from it as well, but that involves touching a lot more code, so it's taking longer than I thought it would :)

@Bryce-Soghigian (Member)

> Actually, a good question is why we have such a check in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L785. That feels really inconsistent: why are we OK with going below min in deleteCreatedNodesWithErrors, but not in removeOldUnregisteredNodes?

Another thing to note is that not all cloud providers enforce min count at the DeleteNodes() level either. Some of them rely on these higher-level checks. (They shouldn't?)

> the meaningful interpretation of min size is "minimum number of nodes" and a VM that cannot register is not really a node - we should do it in all paths where VMs fail to register as nodes.

For things like IsClusterHealthy or IsNodegroupHealthy, the opposite is true. We only represent real nodes, not these partial VMs (VMs that never join the API server), in the health signal. IMO, if the bootstrap is failing, it's more indicative of a health problem than not-ready nodes are.

@daimaxiaxie (Contributor, Author)

> The reason is that there is no point keeping a VM that cannot register; it's not really providing capacity to the kubernetes cluster. Min size is there to guarantee a minimum capacity of the cluster, but a node that cannot register shouldn't really be counted.

I very much agree with this view. Thank you all, @MaciekPytel @Bryce-Soghigian.

@daimaxiaxie (Contributor, Author)

daimaxiaxie commented Apr 22, 2024

> Actually, a good question is why we have such a check in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L785. That feels really inconsistent: why are we OK with going below min in deleteCreatedNodesWithErrors, but not in removeOldUnregisteredNodes?

I also have this question; this code comes from @kawych. I think removeOldUnregisteredNodes and deleteCreatedNodesWithErrors should be consistent because they face the same situation.
When the number of instances is <= MinSize, if the create-error nodes need to be deleted (even placeholders), the unregistered nodes also need to be deleted.

@towca (Collaborator)

towca commented Apr 25, 2024

@daimaxiaxie I think the logic was the same even before @kawych's changes, FWIW.

@MaciekPytel I have no idea why the behavior is inconsistent between the two places. I agree that CA should be able to delete the VMs in both scenarios, even if it means going below the MinSize.

@Bryce-Soghigian brought up a good point - if we expect the limit enforcement to happen outside of the cloud provider, it should also mean that cloud providers shouldn't enforce the limits within DeleteNodes, right? @MaciekPytel, the GCE cloud provider does so, for example; should we change it?

@Bryce-Soghigian (Member)

One could also argue that having the cloud provider enforce min count, in case some upper-level logic gets it wrong, is helpful. It means that no matter what, we can't violate min count or scale beyond max count.

That being said, there are cases (like unregistered nodes) where we want to be able to force a deletion.

@x13n (Member)

x13n commented Apr 26, 2024

I agree we should be able to go below min in case of long unregistered nodes and creation errors. There's no point in keeping such nodes around.

However, I'm not convinced that complicating CA logic to pass an "ignore min size" flag down to cloud providers is the right thing to do. When CA is managing a node group, it should be the source of truth for determining group size. When CA tries to delete a node, cloud provider code should actuate this decision (or return an error). If we're allowing force delete in certain cases, the additional safeguard on the cloud provider side isn't really going to provide a lot of additional protection, I think? What is the use case for sometimes honoring min size internally in cloud provider code? Is it meant as an extra layer of protection in case CA goes amok, or is there more to it?

@Bryce-Soghigian (Member)

Yeah, exactly. It serves as a safeguard if someone adds a new deletion call. Really, if we didn't have that cloud provider code I could just delete the unregistered nodes outright. But with the cloud provider safeguards, I cannot just delete the unregistered nodes.

Since min count is a fundamental principle and people expect it never to be violated, it makes sense to have that extra safety in the cloud provider. But I would also like to be able to remove nodes that are bad. Either modifying the DeleteNodes() interface or adding a new method called UnsafeDeleteNodes() might make sense.

@x13n (Member)

x13n commented Apr 29, 2024

That makes sense, thanks. Btw, ForceDeleteNodes()? Or DeleteNodes(force bool)? I see your PR currently has a respectMinCount arg, which is quite specific, and I would really like to avoid adding yet another arg or method to respect other possible constraints.
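
For reference, a minimal sketch of the two shapes being weighed here, written against the apiv1.Node type that the existing DeleteNodes call uses. The method names and signatures are assumptions about the eventual design, not the final cloudprovider.NodeGroup interface.

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// Option 1: a dedicated method that may take the group below MinSize.
type forceDeleter interface {
	ForceDeleteNodes(nodes []*apiv1.Node) error
}

// Option 2: a generic force flag on the existing deletion call, so future
// constraints do not each need their own argument.
type nodeDeleter interface {
	DeleteNodes(nodes []*apiv1.Node, force bool) error
}
```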

@Bryce-Soghigian (Member)

Force makes a lot of sense, let's do that instead.

@x13n (Member)

x13n commented May 7, 2024

Thanks, so I think we can close this one then.

/close

@k8s-ci-robot (Contributor)

@x13n: Closed this PR.

In response to this:

Thanks, so I think we can close this one then.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@daimaxiaxie (Contributor, Author)

Hi @x13n, I updated the code and it now only respects MinSize in fixNodeGroupSize. I hope it can be reopened.

Respecting MinSize is necessary in fixNodeGroupSize, even with ForceDeleteNodes, because fixNodeGroupSize does not remove nodes. Thanks.
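
For context, a minimal sketch of the kind of clamp this implies, consistent with the possibleDelta line visible in the diff below; the helper and its surrounding use are assumptions, not the exact patch:

```go
package sketch

// clampFixDelta bounds the (negative) delta that fixNodeGroupSize would apply
// so the resulting target size never drops below the node group's MinSize.
// Hypothetical helper: currentSize/expectedSize correspond to
// incorrectSize.CurrentSize/ExpectedSize in CA core.
func clampFixDelta(currentSize, expectedSize, minSize int) int {
	delta := currentSize - expectedSize
	// Smallest delta still allowed: expectedSize + delta >= minSize.
	if possibleDelta := minSize - expectedSize; delta < possibleDelta {
		delta = possibleDelta
	}
	return delta
}
```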

@daimaxiaxie (Contributor, Author)

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 21, 2024
@k8s-ci-robot (Contributor)

@daimaxiaxie: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: daimaxiaxie
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

```diff
@@ -728,6 +728,17 @@ func fixNodeGroupSize(context *context.AutoscalingContext, clusterStateRegistry
 			incorrectSize.ExpectedSize,
 			incorrectSize.CurrentSize,
 			delta)

+		possibleDelta := nodeGroup.MinSize() - incorrectSize.ExpectedSize
```
Member

Do we really want this check? MinSize is there to protect Cluster Autoscaler against scaling down below MinSize. If we want to fix the target size to something below minSize, it means that the nodegroup started below minSize, we tried to scale up, but nodes are not registering correctly. I would expect that adjusting the target size of the group even to a value below minSize is something that we want to be able to do; we're not violating the minSize constraint, just reacting to a failed scale-up situation.

Contributor Author

From the user's perspective, MinSize is the minimum value of the node group target, and it cannot be less than MinSize at any time.

On the other hand, many cloud vendors do not allow you to set a value lower than MinSize.

Member

I don't think this makes sense in CA core. For cloud providers that prevent setting the target size below min, adjusting the size can happen on the cloud provider side. Otherwise it is going to prevent cloud providers that do allow going below min from working correctly. Note that the incorrect size in CSR may stay incorrect if the target size is adjusted up to match min size, so subsequent loops will run this code again.

@better0332

I also encountered this bug

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 9, 2024
Labels
  • area/cluster-autoscaler
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cluster-autoscaler gets stuck with "Some nodes that failed to create were removed"
9 participants