Consider failed machine as terminating #118
Comments
Hello, can we have an update on the status?
This would be available from release 1.22 of CA.
Hello @himanshu-kun, do you know when 1.22 will be rolled out?
Hi @bbochev, I am planning to make the release next week, after the fix for this issue. Then it could take one more week for it to reach the Canary landscape.
Hello @himanshu-kun,
Hi @bbochev
After looking more into the issue, we realized that it needs a bigger change, where the state transitions of machines and the communication between CA and MCM have to be redesigned.
@himanshu-kun You have mentioned internal references in public. Please check.
Background
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingMachines))
if expectedReplicas == mdclone.Spec.Replicas {
	klog.Infof("MachineDeployment %q is already set to %d, skipping the update", mdclone.Name, expectedReplicas)
	break
}
mdclone.Spec.Replicas = expectedReplicas
```
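To make the race concrete, here is a minimal, hedged walk-through of that arithmetic; the numbers are illustrative assumptions, not values from the issue. A MachineDeployment has 3 replicas and one machine is Failed but does not yet carry a deletionTimestamp, so it is counted in machines but not in terminatingMachines:

```go
package main

import "fmt"

func main() {
	// Illustrative values only (assumptions, not taken from the issue).
	replicas := int32(3)    // mdclone.Spec.Replicas
	toRemove := int32(1)    // machines CA decided to remove (the Failed one)
	terminating := int32(0) // machines already carrying a deletionTimestamp

	expectedReplicas := replicas - toRemove + terminating // 3 - 1 + 0 = 2
	fmt.Println(expectedReplicas)

	// CA scales the MachineDeployment down to 2. Meanwhile MCM independently
	// deletes the Failed machine and creates a replacement, so the scale-down
	// can end up removing the freshly created, healthy machine instead.
}
```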
Grooming Decisions
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingOrFailedMachines))
```
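A minimal sketch of how such a terminatingOrFailedMachines list could be built, assuming the MCM v1alpha1 Machine API (Status.CurrentStatus.Phase with the MachineTerminating and MachineFailed phases); the helper name and the exact filtering are assumptions, not the code that was eventually merged:

```go
package example

import (
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// terminatingOrFailedMachines is a hypothetical helper: a machine counts as
// "already on its way out" if it carries a deletionTimestamp, is in the
// Terminating phase, or is in the Failed phase (which MCM will soon delete
// and replace on its own).
func terminatingOrFailedMachines(machines []*v1alpha1.Machine) []*v1alpha1.Machine {
	var result []*v1alpha1.Machine
	for _, m := range machines {
		phase := m.Status.CurrentStatus.Phase
		if m.DeletionTimestamp != nil ||
			phase == v1alpha1.MachineTerminating ||
			phase == v1alpha1.MachineFailed {
			result = append(result, m)
		}
	}
	return result
}
```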
We're not using Gardener, but are using MCM and hence Gardener's CA fork too. We have these settings:
In our specific case, we run into this bug when AWS can't provision nodes due to capacity issues. CA's unregistered node handler then kicks in after 10 minutes and triggers this bug. We also only appear to hit it during a rolling update of a MachineDeployment; simply causing CA to scale up doesn't appear to hit it. Is the right workaround here to adjust MCM's ...?
I don't know whether it helps at all, but my testing has shown that even the band-aid fix isn't helpful in our specific case. When AWS is under capacity pressure (particularly prevalent in eu-central-1a), we get an error back from the AWS API. This causes the Machine to enter a CrashLoopBackOff state, which is why the patch doesn't help us. That also explains why our adjustment of ... In short, whatever approach is used to address this bug will need to cater for both of those scenarios.
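Following the previous sketch, and purely as an assumption about how the scenario above could be covered (the CrashLoopBackOff phase constant exists in the MCM v1alpha1 API, but whether the final fix treats it this way is not settled in this thread), the same filter could also count machines stuck in CrashLoopBackOff:

```go
package example

import (
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// isGoingAway is a hypothetical extension of the earlier filter: machines stuck
// in CrashLoopBackOff (e.g. due to repeated AWS capacity errors) are also
// treated as "already on their way out", so CA does not double-count them.
func isGoingAway(m *v1alpha1.Machine) bool {
	phase := m.Status.CurrentStatus.Phase
	return m.DeletionTimestamp != nil ||
		phase == v1alpha1.MachineTerminating ||
		phase == v1alpha1.MachineFailed ||
		phase == v1alpha1.MachineCrashLoopBackOff
}
```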
the explanation still holds in your above scenario
no, this flag works for ...
/assign
What happened:
A race condition could happen between MCM and CA when a node does not join within 20 minutes (the default creationTimeout and maxNodeProvisionTimeout).
This could lead to MCM deleting the Failed machine and creating a new one, and then CA deleting the new one, thinking it is deleting the Failed machine.
What you expected to happen:
CA should not take action if the machine it wants to terminate is already terminating.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
This happens because of a delay in the machineSet controller marking a Failed machine with a deletionTimestamp, and so the machine is not considered as terminating.
Environment: