Expose instance auto-restart status in the console #2469

hawkw · 2024-09-24T18:19:11Z

PR oxidecomputer/omicron#6503 implemented automatic restarts of instances in the Failed state. This change introduced some additional instance state that should be exposed to users. In particular:

- When a Failed instance is automatically restarted, a cooldown timer is started for that instance. If that instance fails again while the cooldown period is still active, it will not be automatically restarted again until the cooldown period has elapsed.
- Some instances may be configured with auto-restart policies that do not permit them to be restarted when they are Failed.

New fields were added to the external-API instance message to report state related to automatic restarts. Instances now have an auto_restart_enabled: boolean field that indicates if their auto-restart policy permits restarting the instance, and an auto_restart_cooldown_expiration: string representing the date and time at which the cooldown period will have completed (allowing the instance to be restarted again). See: https://github.com/oxidecomputer/omicron/blob/45813be40b62167eff75333c410515e8bee24211/openapi/nexus.json#L15094-L15104

This data should probably be exposed to users: if an instance is in the Failed state, the user will want to know why it has not yet been automatically restarted, whether it will ever be automatically restarted, and, if so, when that will happen. We probably only need to display this information for instances which are Failed. If a Failed instance has auto_restart_enabled set to false, we should tell the user that auto-restart is disabled for that instance. Otherwise, if there is an auto_restart_cooldown_expiration timestamp, we should tell the user that the instance will be restarted only after that time. If auto_restart_enabled is true and there is no auto_restart_cooldown_expiration timestamp, then the instance will be automatically restarted; we might want to indicate that as well.
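The display rules above could be sketched roughly as follows. This is only an illustration, not the console's implementation: the `InstanceAutoRestartView` type, the `run_state` property name, the `autoRestartMessage` helper, and the message wording are all hypothetical. Only `auto_restart_enabled` and `auto_restart_cooldown_expiration` come from the API change described here. The sketch also assumes that an expiration timestamp already in the past no longer blocks a restart.

```typescript
// Hypothetical subset of the instance object. Only the two auto_restart_*
// fields come from the API change; `run_state` is an assumed property name.
type InstanceAutoRestartView = {
  run_state: string
  auto_restart_enabled: boolean
  auto_restart_cooldown_expiration?: string | null // date-time string
}

// Returns a user-facing message for Failed instances, or null when there is
// nothing to show (the instance is not Failed).
function autoRestartMessage(
  instance: InstanceAutoRestartView,
  now: Date = new Date()
): string | null {
  if (instance.run_state !== "failed") return null

  if (!instance.auto_restart_enabled) {
    return "Auto-restart is disabled for this instance."
  }

  const expiration = instance.auto_restart_cooldown_expiration
  if (expiration) {
    const expiresAt = new Date(expiration)
    // Assumption: a cooldown that has already expired is treated the same
    // as no cooldown at all.
    if (!Number.isNaN(expiresAt.getTime()) && expiresAt.getTime() > now.getTime()) {
      return `This instance will be restarted automatically after ${expiresAt.toISOString()}.`
    }
  }

  return "This instance will be restarted automatically."
}
```

Passing `now` explicitly keeps the helper deterministic, which would make it easy to cover the three cases in tests.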

askfongjojo · 2024-12-14T01:34:20Z

Tagged this for v13 as this is not gaining the visibility it deserves.

benjaminleonard · 2024-12-19T11:09:15Z

We can probably slip this into the properties table state and perhaps the instance list table too if we figure out an elegant popover.

I need to wrap my head around the state flow of that a little – thank you for your documentation on this!

Regarding the policy itself – we should also be adding the ability to manage auto_restart_policy to:

a. Instance create form (we can probably tuck into the advanced accordion)
b. Instance view page – perhaps a settings tab, with the idea other items might eventually be there also

benjaminleonard · 2024-12-19T14:37:12Z

A few initial questions, @hawkw:

1. You had mentioned we might want to show that an instance is starting as a result of auto-restart. Is there a way to discern that from the current API? E.g. auto_restart_cooldown_expiration is present and instance is Starting
2. Does the cooldown reset each time the instance state is Failed?
3. Do you anticipate a significant time between the cooldown expiration and attempting to start the instance? Does it get queued?
4. Do we anticipate ever needing to bubble up some logs as to why an instance has failed? I suppose there's some complicated stuff there around permissions; is it likely to just be a system issue, or can a dodgy image / bad configuration of something cause it?

benjaminleonard · 2024-12-19T14:37:33Z

Accidentally closed!

hawkw · 2024-12-19T22:33:50Z

> You had mentioned we might want to show that an instance is starting as a result of auto-restart. Is there a way to discern that from the current API? E.g. auto_restart_cooldown_expiration is present and instance is Starting

The control plane internally tracks why an instance is being started in the instance_start saga, but that information isn't currently stored in the database outside of the saga, and it's not exposed in the API for viewing instance states, so I don't think you currently have any way to determine that. Wiring that through probably won't require too much additional work, but we've not done it yet.

I'd definitely like to get that into the console (and CLI etc) eventually, but I'd file it under "future work" for now.

> Does the cooldown reset each time the instance state is Failed?

The cooldown period starts when an instance is automatically restarted. It does not reset if the instance is automatically restarted and then fails again: the intention behind the cooldown is primarily to reduce the impact on the rest of the system when an instance crashes every time it's restarted.

We restart the instance immediately the first time it crashes, and we start tracking the cooldown period once we restart it. If the instance fails again before the cooldown has elapsed, it will be restarted once the cooldown has elapsed. If it fails after the cooldown has elapsed, it will be restarted immediately. Each automatic restart begins a new cooldown period. This way, we avoid restarting an instance that crashes right away in a hot loop, which could impact other instances, but if it fails today and then fails again a few weeks later, it will be automatically restarted immediately both times.
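As a rough model of that timing, and not the control plane's actual code: the sketch below assumes the cooldown is measured from the last automatic restart. The `earliestRestartTime` helper, its parameters, and the one-hour cooldown in the usage lines are hypothetical. It also ignores the delay added by the restart task's own schedule.

```typescript
// Hypothetical model: given when the instance failed and when it was last
// auto-restarted (null if never), return the earliest time it may be
// restarted again.
function earliestRestartTime(
  failedAt: Date,
  lastAutoRestartAt: Date | null,
  cooldownMs: number
): Date {
  // First failure: no cooldown is being tracked, so restart immediately.
  if (lastAutoRestartAt === null) return failedAt

  const cooldownEndsAt = new Date(lastAutoRestartAt.getTime() + cooldownMs)

  // Failed inside the cooldown window: wait until the cooldown has elapsed.
  // Failed after it: restart immediately.
  return cooldownEndsAt.getTime() > failedAt.getTime() ? cooldownEndsAt : failedAt
}

// Usage with an assumed one-hour cooldown (the real value is not stated here).
const ONE_HOUR_MS = 60 * 60 * 1000
const lastRestart = new Date("2024-12-01T10:00:00Z")

// Fails ten minutes after the restart: must wait for the cooldown to end.
const soonAfter = earliestRestartTime(new Date("2024-12-01T10:10:00Z"), lastRestart, ONE_HOUR_MS)

// Fails weeks later: eligible right away.
const weeksLater = earliestRestartTime(new Date("2024-12-19T10:00:00Z"), lastRestart, ONE_HOUR_MS)
```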

> Do you anticipate a significant time between the cooldown expiration and attempting to start the instance? Does it get queued?

It shouldn't take too long. I believe the task responsible for automatically restarting failed instances will run about once every minute, and there's some internal bookkeeping that must be done before an instance can be restarted in order to clean up any resources left behind by its past incarnation. So, there's some delay. I would generally expect a Failed instance that's eligible to be restarted to transition back to Starting within 2-5 minutes of it going to Failed.

> Do we anticipate ever needing to bubble up some logs as to why an instance has failed? I suppose there's some complicated stuff there around permissions; is it likely to just be a system issue, or can a dodgy image / bad configuration of something cause it?

I think this falls under the purview of the ongoing fault management work --- we'll definitely want to generate more detailed reports of why an instance has failed within FMA. At present, the control plane doesn't really know anything about why an instance has failed.

askfongjojo added this to the 13 milestone Dec 14, 2024

benjaminleonard closed this as completed Dec 19, 2024

benjaminleonard reopened this Dec 19, 2024
