Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CrashLoopBackOff condition as soon as timeout expires #790

Open
himanshu-kun opened this issue Feb 24, 2023 · 4 comments · May be fixed by #849
Open

Update CrashLoopBackOff condition as soon as timeout expires #790

himanshu-kun opened this issue Feb 24, 2023 · 4 comments · May be fixed by #849
Assignees
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug lifecycle/stale Nobody worked on this for 6 months (will further age) needs/planning Needs (more) planning with other MCM maintainers priority/2 Priority (lower number equals higher priority)

Comments

@himanshu-kun
Copy link
Contributor

himanshu-kun commented Feb 24, 2023

How to categorize this issue?

/area robustness
/kind bug
/priority 2

What happened:

Currently MCM doesn't turn CrashLoopBackoff(CLBF) machines to Failed as soon as creationTimeout expires , but delays have been observed which can range to 2min to any time.

There are 2 parts of the problem:

  1. timeout check for CLBF machine is done after making CreateMachine() driver call . CreateMachine() itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before)
  2. even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of ShortRetry(3min). This happens because the retry period is not calculated considering the time left before timeout but is just a constant value of ShortRetry(3min)

What you expected to happen:
Turn to Failed as soon as timeout expires.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

We could take inspiration from machineDeployment logic which turns Progressing condition to False with reason ProgressDeadlineExceeded as soon as the deadline exceeds.

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Feb 24, 2023
@elankath
Copy link
Contributor

elankath commented Feb 24, 2023

Grooming Discussion Results

  • Our retry handling in machine creation is present in machineCreateErrorHandler in machine_util.go. This method specifies a medium retry (3min) for enqueuing the machine object if the VM instance failed to be created.
  • The error handler method mentioned above leverages the helper method getCreateFailurePhase which transitions the the machine phase: CrashLoopBackoffFailed on creation timeout.
  • We should ideally transition the machine phase to Failed faster if we are very close to the creation timeout and if the next retry added to the current time exceeds the creation timeout.
  • To ensure that the above works fine, we should establish that Driver.CreateMachine returns within a bounded time. We can do this by making sure there is a deadline on the Context passed to the CreateMachine method. The deadline should be the time pending for us to reach machine creation timeout.

@himanshu-kun himanshu-kun added the needs/planning Needs (more) planning with other MCM maintainers label Feb 24, 2023
@aaronfern aaronfern linked a pull request Sep 12, 2023 that will close this issue
@aaronfern
Copy link
Contributor

aaronfern commented Nov 10, 2023

On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario that when CA runs alongside this change can cause CAs backoff mechanism to not work.

The case we focussed on was CA backing off from a MachineDeployment if the Machine does not get ready and CA knows that there is another MachineDeployment whose Machine spec is sufficient and adequate for incoming workload. In normal circumstances, CA would backoff, scale down the MachineDeployment it had originally scaled up, and scale up a different MachineDeployment whose Machines would then be used.

With this change however (assuming MCMs --machine-creation-timeout and CAs --max-node-provision-time are the same), MCM will replace the Machine before CA has a chance to look at it and calculate it's backoff parameters. So from CAs point of view it won't see a node not in the Running state for longer than required for it to backoff from the MachineDeployment


Solution:

* Set the --machine-creation-timeout in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. Field can be set by the user in the shoot spec

* Document this behaviour and advice that MCMs --machine-creation-timeout should be a value greater then CAs --max-node-provision-time

Updating the solution to be more specific and descriptive

  • Update MCM to be able to interpret a negative or zero timeout value for --machine-creation-timeout. This means that if this flag is set to 0 or some negative value, then MCM will not have a timeout for machine creation. This does not change the default value for this flag, and if no value is passed, MCM will still use it's current default

  • Update gardenlet code to set a negative value to --machine-creation-timeout for MCM be default if no value is set via the shoot spec. If a value is specified in the shoot spec, then the value specified is always used
    This behaviour needs to be documented and there needs to be a recommendation to set MCMs --machine-creation-timeout value to be greater than CAs --max-node-provision-time value.

  • Validation needs to be added to the shoot to not allow machineCreationTimeout to be greater than maxNodeProvisionTime

  • No additional change from CA is needed

@elankath
Copy link
Contributor

Regarding Set the --machine-creation-timeout in MCM to 0 by default - is this OK for standalone consumers of the MCM ? Do we have any stakeholders where folks use the MCM but not our fork of the CA?

@unmarshall
Copy link
Contributor

We should just add a doc that the default value is set to 0. There are 2 different recommendations:

  1. If MCM is used standalone without using CA then consumers should use a value > 0.
  2. If MCM is used along with CA then:
    1. It is recommended that you leave the default value of machine-creation-timeout to 0.
    2. If there is a real need to set this value then machine-creation-timeout should be > CA's max-node-provision-time

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug lifecycle/stale Nobody worked on this for 6 months (will further age) needs/planning Needs (more) planning with other MCM maintainers priority/2 Priority (lower number equals higher priority)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants