
Go unhealthy if not ready before timeout #306

Open

JohnGarbutt wants to merge 12 commits into main

Conversation

JohnGarbutt (Contributor)

Users get confused when k8s clusters are not converging after a long time waiting. There can be many causes, such as the cloud being full. For now, mark the cluster as unhealthy after the timeout.

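As a rough illustration of the idea (not the actual azimuth-capi code), the check could look something like the sketch below; `PENDING_TIMEOUT`, `is_stuck`, and its arguments are hypothetical names standing in for the operator's status fields and configuration:

```python
import datetime as dt

# Hypothetical threshold; the real operator would take this from its settings
PENDING_TIMEOUT = dt.timedelta(minutes=30)

def is_stuck(phase, transitional_phases, last_updated, now=None):
    """Return True if the cluster has sat in a transitional phase for too long.

    Sketch of the behaviour described in the PR description: if the cluster
    has not converged within the timeout, mark it unhealthy rather than
    leaving it pending forever.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    if phase not in transitional_phases:
        # Terminal or healthy phases never time out
        return False
    if last_updated is None:
        # No update has been recorded yet, so there is nothing to time out against
        return False
    return (now - last_updated) > PENDING_TIMEOUT
```

When the check fires, the operator would move the cluster to an unhealthy phase instead of leaving it pending indefinitely.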
JohnGarbutt requested a review from mkjpryor on October 14, 2024 at 16:48
azimuth_capi/status.py: outdated comment, resolved
JohnGarbutt marked this pull request as ready for review on October 15, 2024 at 11:16
mkjpryor (Collaborator) left a comment

I'm still not convinced that we need to unset the updated-at time... It might be quite nice to have.

Comment on lines 390 to 393
last_updated_timestamp: schema.Optional[dt.datetime] = Field(
default = None,
description = "Used to trigger the timeout of pending states"
)
mkjpryor (Collaborator), Oct 15, 2024

Minor nit, but I think the "timestamp" suffix is not needed.

JohnGarbutt (Contributor, Author)

yeah, true.

return

# Trigger a timeout check if we are in a transitional state
if instance.status.phase in {None, api.ClusterPhase.PENDING,
mkjpryor (Collaborator)

None shouldn't be possible here - is it a condition you have seen? You should get UNKNOWN instead.

JohnGarbutt (Contributor, Author)

I believe I saw this in my testing: this can trigger before the first update completes, because I have not set the idle timer here.

JohnGarbutt (Contributor, Author)

Ah, but I should include UNKNOWN in here; maybe it was UNKNOWN that I hit.
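For illustration, here is what including UNKNOWN in the transitional set might look like; `ClusterPhase` below is a stand-in enum, not the real `api.ClusterPhase`, and only lists the values mentioned in this review:

```python
import enum

class ClusterPhase(str, enum.Enum):
    # Stand-in for api.ClusterPhase with only the values discussed here
    UNKNOWN = "Unknown"
    PENDING = "Pending"
    RECONCILING = "Reconciling"
    UPGRADING = "Upgrading"
    READY = "Ready"

# Phases in which the cluster is still converging and the timeout should apply.
# UNKNOWN covers the window before the first status update, so the check does
# not need to treat a missing phase (None) as a special case.
TRANSITIONAL_PHASES = {
    ClusterPhase.UNKNOWN,
    ClusterPhase.PENDING,
    ClusterPhase.RECONCILING,
    ClusterPhase.UPGRADING,
}
```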

azimuth_capi/status.py: outdated comment, resolved
azimuth_capi/status.py: outdated comment, resolved
Comment on lines 420 to 423
# To help trigger timeouts if we get stuck, save the last_updated_timestamp
if not instance.status.last_updated_timestamp:
await save_cluster_status(instance)

mkjpryor (Collaborator)

This should always reset to the current timestamp, because an update has just been applied.

JohnGarbutt (Contributor, Author)

That stops us detecting an error loop where we are stuck retrying the update over and over again.
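To make the trade-off concrete, here is a tiny sketch of the two behaviours being discussed, using stand-in names rather than the operator's actual types:

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterStatus:
    # Minimal stand-in for the status object; only the field under discussion
    last_updated: Optional[dt.datetime] = None

def touch_always(status: ClusterStatus) -> None:
    # Reviewer's suggestion: always restart the clock, since an update was just applied
    status.last_updated = dt.datetime.now(dt.timezone.utc)

def touch_if_unset(status: ClusterStatus) -> None:
    # The patch's behaviour: only start the clock if it is not already running,
    # so a handler stuck in an error/retry loop cannot keep pushing the deadline back
    if status.last_updated is None:
        status.last_updated = dt.datetime.now(dt.timezone.utc)
```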

Comment on lines 489 to 492
# Reset the timeout counter, now we have completed the update
# and we need to wait for all the components to do their updates
instance.status.last_updated_timestamp = dt.datetime.now(dt.timezone.utc)
await save_cluster_status(instance)
mkjpryor (Collaborator)

Not sure we need to do this? I thought the idea was to have a timeout from the point that an update was made? The statement above accomplishes this just fine.

JohnGarbutt (Contributor, Author)

So the idea is that it can take quite a while to reach this statement if you need to keep retrying the code above while waiting for the lease to go active, so I reset the timestamp here. Maybe that is overkill, but it was useful in my odd testing edge cases.

JohnGarbutt (Contributor, Author)

I remember what it was: it covers the case where we timed out in an error loop, the operator then fixes up the problem, and the update finally completes; at that point I reset the counter so you see progress in the UI again. It's an edge case for sure...

azimuth_capi/config.py: outdated comment, resolved
JohnGarbutt requested a review from mkjpryor on November 12, 2024 at 11:07
mkjpryor (Collaborator) left a comment

I think this patch has more fundamental issues that may require a change of tack. For example, it won't go into error if the cluster tries to autoscale and the autoscaling fails.

Rather than recording the last time we saw an update to the spec, we should record the last time the cluster phase changed, then transition to error if the phase is reconciling or upgrading and the last phase change was longer ago than the threshold.
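A rough sketch of that alternative, using plain strings for phases and hypothetical names (`last_phase_change`, `PHASE_TIMEOUT`) rather than the operator's real fields or settings:

```python
import datetime as dt

# Illustrative threshold; the real value would come from operator configuration
PHASE_TIMEOUT = dt.timedelta(minutes=30)

def resolve_phase(current_phase, new_phase, last_phase_change, now=None):
    """Track phase changes and time out long-running transitional phases.

    Returns the phase to record (possibly overridden to an error phase) and
    the timestamp of the most recent phase change.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    if new_phase != current_phase:
        # The phase just changed, so restart the clock
        return new_phase, now
    stuck = (
        last_phase_change is not None
        and (now - last_phase_change) > PHASE_TIMEOUT
    )
    if new_phase in {"Reconciling", "Upgrading"} and stuck:
        # The cluster has been reconciling or upgrading for longer than the
        # threshold, so surface an error instead of waiting forever
        return "Error", last_phase_change
    return new_phase, last_phase_change
```

The intent, as described above, is that a cluster stuck reconciling (for example after a failed autoscale) eventually surfaces as an error even though its spec never changed.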

# if not a terminal state, ensure timestamp has been set
if cluster.status.last_updated is None:
now = dt.datetime.now(dt.timezone.utc)
cluster.status.last_updated = now
JohnGarbutt (Contributor, Author)

So this case should still work when the state change isn't triggered by a spec change. I think this means that when we get an auto-healing failure, we should time out the operation correctly. But I should re-test that case, for sure.
