Go unhealthy if not ready before timeout #306
base: main
Conversation
Users get confused when k8s clusters have not converged after a long wait. There can be many causes, such as the cloud being full. For now, mark the cluster as unhealthy after the timeout.
I'm still not convinced that we need to unset the updated at time... It might be quite nice to have.
last_updated_timestamp: schema.Optional[dt.datetime] = Field(
    default = None,
    description = "Used to trigger the timeout of pending states"
)
Minor nit, but I think the "timestamp" is not needed.
yeah, true.
return

# Trigger a timeout check if we are in a transitional state
if instance.status.phase in {None, api.ClusterPhase.PENDING,
None shouldn't be possible here - is it a condition you have seen? You should get UNKNOWN instead.
I believe I saw this in my testing: this can trigger before the first update completes, because I have not set the idle timer here.
Ah, but I should include UNKNOWN in here - maybe I did hit UNKNOWN.
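For illustration only, a minimal sketch of the check being discussed, with UNKNOWN in the transitional set instead of None. The phase names and the needs_timeout_check helper are assumptions based on this thread, not the actual azimuth_capi code:

# Transitional phases that should eventually time out; UNKNOWN replaces None.
# Phase names are assumptions taken from this thread, not the real api.ClusterPhase enum.
TRANSITIONAL_PHASES = {"Unknown", "Pending", "Reconciling", "Upgrading"}

def needs_timeout_check(phase):
    # True if the cluster is in a phase that should be checked against the timeout
    return phase in TRANSITIONAL_PHASES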
azimuth_capi/operator.py
# To help trigger timeouts if we get stuck, save the last_updated_timestamp
if not instance.status.last_updated_timestamp:
    await save_cluster_status(instance)
This should always reset to the current timestamp, because an update has just been applied.
That stops us detecting an error loop where we are stuck retrying the update over and over again.
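To make the trade-off concrete, here is a hedged sketch of the two options, using a plain dict in place of the real status object; the function names are invented for illustration:

import datetime as dt

def stamp_if_missing(status):
    # Author's option: only set the timestamp when it is missing, so a stuck retry
    # loop keeps ageing the old timestamp and eventually trips the timeout.
    if status.get("last_updated_timestamp") is None:
        status["last_updated_timestamp"] = dt.datetime.now(dt.timezone.utc)

def stamp_always(status):
    # Reviewer's option: reset on every pass, which restarts the clock on each retry
    # and so can hide an error loop that never converges.
    status["last_updated_timestamp"] = dt.datetime.now(dt.timezone.utc)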
azimuth_capi/operator.py
# Reset the timeout counter, now we have completed the update
# and we need to wait for all the components to do their updates
instance.status.last_updated_timestamp = dt.datetime.now(dt.timezone.utc)
await save_cluster_status(instance)
Not sure we need to do this? I thought the idea was to have a timeout from the point that an update was made? The statement above accomplishes this just fine.
So the idea is that it can take quite a while to get to this statement if you need to retry the steps above while waiting for the lease to go active, so I reset it here. Maybe that is overkill; it was useful in my odd testing edge cases.
I remember what it was: it was for the case where we timed out in an error loop, then the operator fixes up the problem and the update finally completes, and at that point I reset the counter so you see progress in the UI again. It's an edge case for sure...
…o feature/timeout-k8s-changes
I think this patch has more fundamental issues that may require a change of tack. For example, it won't go into error if the cluster tries to autoscale and the autoscaling fails.
Rather than recording the last time we saw an update to the spec, we should record the last time the cluster phase changed, then transition to error if the phase is reconciling or upgrading and the last phase change was longer ago than the threshold.
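As a sketch of that suggestion (the phase names, the UNHEALTHY target and the threshold are assumptions based on this thread, not the actual azimuth_capi API):

import datetime as dt
import enum

class ClusterPhase(enum.Enum):
    # Assumed phase names; the real enum lives in azimuth_capi
    PENDING = "Pending"
    RECONCILING = "Reconciling"
    UPGRADING = "Upgrading"
    READY = "Ready"
    UNHEALTHY = "Unhealthy"

TIMEOUT = dt.timedelta(minutes=30)  # illustrative threshold, not a real default

def record_phase(status, new_phase):
    # Stamp the time only when the phase actually changes
    if status.get("phase") != new_phase:
        status["phase"] = new_phase
        status["last_phase_change"] = dt.datetime.now(dt.timezone.utc)

def check_timeout(status):
    # Go unhealthy if a transitional phase has lasted longer than the threshold
    if status.get("phase") in {ClusterPhase.RECONCILING, ClusterPhase.UPGRADING}:
        changed = status.get("last_phase_change")
        if changed and dt.datetime.now(dt.timezone.utc) - changed > TIMEOUT:
            status["phase"] = ClusterPhase.UNHEALTHY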
# if not a terminal state, ensure timestamp has been set
if cluster.status.last_updated is None:
    now = dt.datetime.now(dt.timezone.utc)
    cluster.status.last_updated = now
So this case should still work when the state change isn't triggered by a spec change. I think this means that when we get an auto-healing failure, we should time out the operation correctly. But I should re-test that case, for sure.