Consider failed machine as terminating #118
Comments
Hello, can we have an update on the status?
This would be available from release 1.22 of CA.
Hello @himanshu-kun, do you know when 1.22 will be rolled out?
Hi @bbochev, I am planning to make the release next week, after the fix for this issue. Then it could take one more week for it to reach the Canary landscape.
Hello @himanshu-kun,
Hi @bbochev
After looking more into the issue, we realized that it needs a bigger change, where the state transitions of machines and the communication between CA and MCM have to be redesigned.
@himanshu-kun You have mentioned internal references in public. Please check.
Background
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingMachines))
if expectedReplicas == mdclone.Spec.Replicas {
	klog.Infof("MachineDeployment %q is already set to %d, skipping the update", mdclone.Name, expectedReplicas)
	break
}
mdclone.Spec.Replicas = expectedReplicas
```
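To make the race concrete, here is a minimal, hedged walk-through of that arithmetic; the numbers are illustrative assumptions, not values from the issue. A MachineDeployment has 3 replicas and one machine is Failed but does not yet carry a deletionTimestamp, so it is counted in machines but not in terminatingMachines:

```go
package main

import "fmt"

func main() {
	// Illustrative values only (assumptions, not taken from the issue).
	replicas := int32(3)    // mdclone.Spec.Replicas
	toRemove := int32(1)    // machines CA decided to remove (the Failed one)
	terminating := int32(0) // machines already carrying a deletionTimestamp

	expectedReplicas := replicas - toRemove + terminating // 3 - 1 + 0 = 2
	fmt.Println(expectedReplicas)

	// CA scales the MachineDeployment down to 2. Meanwhile MCM independently
	// deletes the Failed machine and creates a replacement, so the scale-down
	// can end up removing the freshly created, healthy machine instead.
}
```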
Grooming Decisions
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingOrFailedMachines))
```
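A minimal sketch of how such a terminatingOrFailedMachines list could be built, assuming the MCM v1alpha1 Machine API (Status.CurrentStatus.Phase with the MachineTerminating and MachineFailed phases); the helper name and the exact filtering are assumptions, not the code that was eventually merged:

```go
package example

import (
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// terminatingOrFailedMachines is a hypothetical helper: a machine counts as
// "already on its way out" if it carries a deletionTimestamp, is in the
// Terminating phase, or is in the Failed phase (which MCM will soon delete
// and replace on its own).
func terminatingOrFailedMachines(machines []*v1alpha1.Machine) []*v1alpha1.Machine {
	var result []*v1alpha1.Machine
	for _, m := range machines {
		phase := m.Status.CurrentStatus.Phase
		if m.DeletionTimestamp != nil ||
			phase == v1alpha1.MachineTerminating ||
			phase == v1alpha1.MachineFailed {
			result = append(result, m)
		}
	}
	return result
}
```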
We're not using Gardener, but are using MCM and hence Gardener's CA fork too. We have these settings:
In our specific case, we run into this bug when AWS can't provision nodes due to capacity issues. CA's unregistered node handler then kicks in after 10 minutes and triggers this bug. We also only appear to hit it during a rolling update of a MachineDeployment; simply causing CA to scale up doesn't appear to hit it. Is the right workaround here to adjust MCM's ...?
I don't know whether it helps at all, but my testing has shown that even the band-aid fix isn't helpful in our specific case. When AWS is under capacity pressure (particularly prevalent in eu-central-1a), we get an error back from the AWS API. This causes the Machine to enter a CrashLoopBackOff state, which is why the patch doesn't help us. That also explains why our adjustment of ... In short, whatever approach is used to address this bug will need to cater for both of those scenarios.
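Following the previous sketch, and purely as an assumption about how the scenario above could be covered (the CrashLoopBackOff phase constant exists in the MCM v1alpha1 API, but whether the final fix treats it this way is not settled in this thread), the same filter could also count machines stuck in CrashLoopBackOff:

```go
package example

import (
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// isGoingAway is a hypothetical extension of the earlier filter: machines stuck
// in CrashLoopBackOff (e.g. due to repeated AWS capacity errors) are also
// treated as "already on their way out", so CA does not double-count them.
func isGoingAway(m *v1alpha1.Machine) bool {
	phase := m.Status.CurrentStatus.Phase
	return m.DeletionTimestamp != nil ||
		phase == v1alpha1.MachineTerminating ||
		phase == v1alpha1.MachineFailed ||
		phase == v1alpha1.MachineCrashLoopBackOff
}
```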
the explanation still holds in your above scenario
no, this flag works for ...
/assign
What happened:
A race condition could happen between MCM and CA when a node does not join within 20 minutes (the default creationTimeout and maxNodeProvisionTimeout).
This could lead to MCM deleting the Failed machine and creating a new one, and then CA deleting the new one, thinking it is deleting the Failed machine.
What you expected to happen:
CA should not take action if the machine it wants to terminate is already terminating.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
This happens because of a delay in the machineSet controller marking a Failed machine with a deletionTimestamp, and so the machine is not considered as terminating.
Environment: