
Remove health check and "bad" node concept from ClusterManager #132

Open · wants to merge 5 commits into main from remove-polling-in-loop-from-cluster-manager
Conversation

dandavison (Contributor):
A concern was raised that our samples should not be demonstrating any form of polling from within an entity workflow loop, even if the poll frequency is low. Instead, we intend to point to long-running activity patterns.
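
For context, a minimal sketch of the long-running activity pattern referred to here. This is illustrative only and not part of this PR; the `check_cluster_health` activity and `HealthCheckInput` type are hypothetical names. The point is that the periodic work lives inside an activity that heartbeats, rather than in a workflow-level polling loop.

```python
# Hypothetical illustration only: the health-check loop runs inside a
# long-running activity that heartbeats, instead of the workflow polling
# on a timer.
import asyncio
from dataclasses import dataclass

from temporalio import activity


@dataclass
class HealthCheckInput:
    nodes: list[str]
    interval_seconds: float


@activity.defn
async def check_cluster_health(input: HealthCheckInput) -> None:
    while True:
        for node in input.nodes:
            ...  # ping the node and record its status
        # Heartbeat so the server knows the activity is still alive and so a
        # cancellation (e.g. on cluster shutdown) is delivered promptly.
        activity.heartbeat()
        await asyncio.sleep(input.interval_seconds)
```

The workflow would then start such an activity with a heartbeat timeout and cancel it on shutdown, instead of sleeping and polling itself.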

dandavison force-pushed the remove-polling-in-loop-from-cluster-manager branch from a9d1707 to c0e27a6 on July 24, 2024, 14:00
A concern was raised that our samples should not be demonstrating any
form of polling from within an entity workflow loop, even if the poll
frequency is low. Instead, we should point to long-running activity patterns.

@@ -229,9 +196,7 @@ def should_continue_as_new(self) -> bool:
    async def run(self, input: ClusterManagerInput) -> ClusterManagerResult:
        self.init(input)
        await workflow.wait_condition(lambda: self.state.cluster_started)
        # Perform health checks at intervals.
        while True:

Member:
There is no value in this being a loop anymore and no value in the wait condition having a timeout (and therefore no value in a sleep interval setting)

dandavison (Contributor, author), Jul 24, 2024:
Ah, very good point! (Fixed the same bug in the TypeScript version.) I think this actually improves the sample a lot, since it may not be obvious to users that they can implement the main "loop" so simply, and it makes continue-as-new usage clearer and less intimidating.

)
if self.should_continue_as_new():

cretz (Member), Jul 24, 2024:
I will say this is a bit confusing to a user who may not understand tasks and how the nodes lock is handled. And I may have been wrong about removing the loop. Basically the interesting question is: can you guarantee that if should_continue_as_new() is True in the wait-condition lambda, it is also True here? The only way it could become True in the wait condition but False here is if nodes_lock went from unlocked to locked in the interim. The nodes_lock does not surround the entire handlers, it's only acquired after the cluster has started, and we don't really provide known coroutine ordering (only deterministic ordering).

Take this scenario:

  1. The workflow has been running for a long time and has a lot of history.
  2. A handler runs and awaits the wait condition for cluster started (which has long since been true).
  3. The event loop runs and hits both wait conditions in order, setting both futures to true (the should_continue_as_new wait condition is true because nodes_lock is not locked, and cluster started is true).
  4. Now maybe the handler acquires the lock and begins to run.
  5. Now maybe the workflow function here runs, and now should_continue_as_new() is False.

We may need the loop back, or to split the cluster-shutdown and should-continue-as-new wait conditions into separate tasks so you know which one completed.

Another scenario: what if steps 4 and 5 above are switched? What if this workflow thinks it should continue-as-new, but then, as coroutines settle post-completion, the update runs? We'd continue-as-new in the middle of its activities. I think this needs that new all_handlers_finished thing.

dandavison (Contributor, author):
Right, I'm trying to move too quickly here. Reinstated the loop, and added a wait for all_handlers_finished() before continue-as-new (CAN). It would be ideal to add some tests exercising CAN.
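
For readers following along, a rough sketch of the shape being described; this is not the exact sample code, and names such as `cluster_shutdown` and `get_state_for_continue_as_new()` are illustrative. `workflow.all_handlers_finished()` is the sdk-python API the PR is waiting on.

```python
# Illustrative only: main loop reinstated, with a drain of signal/update
# handlers before continue-as-new.
from temporalio import workflow

# ...inside ClusterManagerWorkflow.run(), after the cluster has started:
while True:
    await workflow.wait_condition(
        lambda: self.state.cluster_shutdown or self.should_continue_as_new()
    )
    if self.state.cluster_shutdown:
        break
    if self.should_continue_as_new():
        # Let in-flight handlers finish so we don't continue-as-new in the
        # middle of their activities.
        await workflow.wait_condition(lambda: workflow.all_handlers_finished())
        workflow.continue_as_new(self.get_state_for_continue_as_new())
```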

dandavison (Contributor, author):

This is now blocked on an sdk-python release.

drewhoskins-temporal (Contributor) left a comment:
The purpose is to demonstrate how to use a mutex to protect against interleaving between the handlers and the main coroutine, and to show something that solves a real-world problem where events come both from the workflow itself and from outside. Is there an alternative way to code up a health check that wouldn't involve polling?
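
For readers who haven't seen the sample, a minimal sketch of the mutex pattern being described, with illustrative names; the real ClusterManagerWorkflow has more state and more handlers.

```python
import asyncio
from typing import Optional

from temporalio import workflow


@workflow.defn
class ClusterManagerSketch:
    """Illustrative only: an update handler and the main coroutine share node
    state, so both take the same asyncio.Lock before mutating it."""

    def __init__(self) -> None:
        self.nodes: dict[str, Optional[str]] = {}
        self.nodes_lock = asyncio.Lock()
        self.shutdown_requested = False

    @workflow.signal
    def shutdown(self) -> None:
        self.shutdown_requested = True

    @workflow.update
    async def assign_nodes_to_job(self, job_name: str) -> list[str]:
        # Serialize with anything else that touches self.nodes, including run().
        async with self.nodes_lock:
            free = [k for k, v in self.nodes.items() if v is None][:2]
            for node in free:
                self.nodes[node] = job_name
            return free

    @workflow.run
    async def run(self) -> None:
        async with self.nodes_lock:
            self.nodes = {str(i): None for i in range(25)}
        await workflow.wait_condition(lambda: self.shutdown_requested)
```

(The PR description's suggestion for the health-check question is to move the check itself into a long-running activity, as sketched earlier in this thread.)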
