
Go unhealthy if not ready before timeout #306

Open

JohnGarbutt wants to merge 12 commits into main

Conversation

JohnGarbutt (Contributor)

Users get confused when k8s clusters are not converging after a long time waiting. There can be many causes, such as the cloud being full. For now, mark the cluster as unhealthy after the timeout.

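As a rough illustration of the idea (not the actual azimuth-capi code), the check could look something like the sketch below; `PENDING_TIMEOUT`, `is_stuck`, and its arguments are hypothetical names standing in for the operator's status fields and configuration:

```python
import datetime as dt

# Hypothetical threshold; the real operator would take this from its settings
PENDING_TIMEOUT = dt.timedelta(minutes=30)

def is_stuck(phase, transitional_phases, last_updated, now=None):
    """Return True if the cluster has sat in a transitional phase for too long.

    Sketch of the behaviour described in the PR description: if the cluster
    has not converged within the timeout, mark it unhealthy rather than
    leaving it pending forever.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    if phase not in transitional_phases:
        # Terminal or healthy phases never time out
        return False
    if last_updated is None:
        # No update has been recorded yet, so there is nothing to time out against
        return False
    return (now - last_updated) > PENDING_TIMEOUT
```

When the check fires, the operator would move the cluster to an unhealthy phase instead of leaving it pending indefinitely.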
JohnGarbutt requested a review from mkjpryor on October 14, 2024 at 16:48
azimuth_capi/status.py: outdated comment, resolved
JohnGarbutt marked this pull request as ready for review on October 15, 2024 at 11:16
mkjpryor (Collaborator) left a comment

I'm still not convinced that we need to unset the updated-at time... It might be quite nice to have.

Comment on lines 390 to 393
last_updated_timestamp: schema.Optional[dt.datetime] = Field(
default = None,
description = "Used to trigger the timeout of pending states"
)
mkjpryor (Collaborator), Oct 15, 2024

Minor nit, but I think the "timestamp" suffix is not needed.

JohnGarbutt (Contributor, Author)

yeah, true.

return

# Trigger a timeout check if we are in a transitional state
if instance.status.phase in {None, api.ClusterPhase.PENDING,
mkjpryor (Collaborator)

None shouldn't be possible here - is it a condition you have seen? You should get UNKNOWN instead.

JohnGarbutt (Contributor, Author)

I believe I saw this in my testing: this can trigger before the first update completes, because I have not set the idle timer here.

JohnGarbutt (Contributor, Author)

Ah, but I should include UNKNOWN in here; maybe it was UNKNOWN that I hit.
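For illustration, here is what including UNKNOWN in the transitional set might look like; `ClusterPhase` below is a stand-in enum, not the real `api.ClusterPhase`, and only lists the values mentioned in this review:

```python
import enum

class ClusterPhase(str, enum.Enum):
    # Stand-in for api.ClusterPhase with only the values discussed here
    UNKNOWN = "Unknown"
    PENDING = "Pending"
    RECONCILING = "Reconciling"
    UPGRADING = "Upgrading"
    READY = "Ready"

# Phases in which the cluster is still converging and the timeout should apply.
# UNKNOWN covers the window before the first status update, so the check does
# not need to treat a missing phase (None) as a special case.
TRANSITIONAL_PHASES = {
    ClusterPhase.UNKNOWN,
    ClusterPhase.PENDING,
    ClusterPhase.RECONCILING,
    ClusterPhase.UPGRADING,
}
```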

azimuth_capi/status.py: outdated comment, resolved
azimuth_capi/status.py: outdated comment, resolved
Comment on lines 420 to 423
# To help trigger timeouts if we get stuck, save the last_updated_timestamp
if not instance.status.last_updated_timestamp:
await save_cluster_status(instance)

mkjpryor (Collaborator)

This should always reset to the current timestamp, because an update has just been applied.

JohnGarbutt (Contributor, Author)

That stops us detecting an error loop where we are stuck retrying the update over and over again.
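To make the trade-off concrete, here is a tiny sketch of the two behaviours being discussed, using stand-in names rather than the operator's actual types:

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterStatus:
    # Minimal stand-in for the status object; only the field under discussion
    last_updated: Optional[dt.datetime] = None

def touch_always(status: ClusterStatus) -> None:
    # Reviewer's suggestion: always restart the clock, since an update was just applied
    status.last_updated = dt.datetime.now(dt.timezone.utc)

def touch_if_unset(status: ClusterStatus) -> None:
    # The patch's behaviour: only start the clock if it is not already running,
    # so a handler stuck in an error/retry loop cannot keep pushing the deadline back
    if status.last_updated is None:
        status.last_updated = dt.datetime.now(dt.timezone.utc)
```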

Comment on lines 489 to 492
# Reset the timeout counter, now we have completed the update
# and we need to wait for all the components to do their updates
instance.status.last_updated_timestamp = dt.datetime.now(dt.timezone.utc)
await save_cluster_status(instance)
mkjpryor (Collaborator)

Not sure we need to do this? I thought the idea was to have a timeout from the point that an update was made? The statement above accomplishes this just fine.

JohnGarbutt (Contributor, Author)

So the idea is that it can take quite a while to reach this statement if you need to keep retrying the code above while waiting for the lease to go active, so I reset the timestamp here. Maybe that is overkill, but it was useful in my odd testing edge cases.

JohnGarbutt (Contributor, Author)

I remember what it was: it covers the case where we timed out in an error loop, the operator then fixes up the problem, and the update finally completes; at that point I reset the counter so you see progress in the UI again. It's an edge case for sure...

azimuth_capi/config.py: outdated comment, resolved
JohnGarbutt requested a review from mkjpryor on November 12, 2024 at 11:07
mkjpryor (Collaborator) left a comment

I think this patch has more fundamental issues that may require a change of tack. For example, it won't go into error if the cluster tries to autoscale and the autoscaling fails.

Rather than recording the last time we saw an update to the spec, we should record the last time the cluster phase changed, then transition to error if the phase is reconciling or upgrading and the last phase change was longer ago than the threshold.
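A rough sketch of that alternative, using plain strings for phases and hypothetical names (`last_phase_change`, `PHASE_TIMEOUT`) rather than the operator's real fields or settings:

```python
import datetime as dt

# Illustrative threshold; the real value would come from operator configuration
PHASE_TIMEOUT = dt.timedelta(minutes=30)

def resolve_phase(current_phase, new_phase, last_phase_change, now=None):
    """Track phase changes and time out long-running transitional phases.

    Returns the phase to record (possibly overridden to an error phase) and
    the timestamp of the most recent phase change.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    if new_phase != current_phase:
        # The phase just changed, so restart the clock
        return new_phase, now
    stuck = (
        last_phase_change is not None
        and (now - last_phase_change) > PHASE_TIMEOUT
    )
    if new_phase in {"Reconciling", "Upgrading"} and stuck:
        # The cluster has been reconciling or upgrading for longer than the
        # threshold, so surface an error instead of waiting forever
        return "Error", last_phase_change
    return new_phase, last_phase_change
```

The intent, as described above, is that a cluster stuck reconciling (for example after a failed autoscale) eventually surfaces as an error even though its spec never changed.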

# if not a terminal state, ensure timestamp has been set
if cluster.status.last_updated is None:
now = dt.datetime.now(dt.timezone.utc)
cluster.status.last_updated = now
JohnGarbutt (Contributor, Author)

So this case should still work when the state change isn't triggered by a spec change. I think this means that when we get an auto-healing failure, we should time out the operation correctly. But I should re-test that case, for sure.
