
Remove health check and "bad" node concept from ClusterManager #132

Open · wants to merge 5 commits into main from remove-polling-in-loop-from-cluster-manager
Conversation

dandavison (Contributor):
A concern was raised that our samples should not be demonstrating any form of polling from within an entity workflow loop, even if the poll frequency is low. Instead, we intend to point to long-running activity patterns.
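
For context, a minimal sketch of the long-running activity pattern referred to here. This is illustrative only and not part of this PR; the `check_cluster_health` activity and `HealthCheckInput` type are hypothetical names. The point is that the periodic work lives inside an activity that heartbeats, rather than in a workflow-level polling loop.

```python
# Hypothetical illustration only: the health-check loop runs inside a
# long-running activity that heartbeats, instead of the workflow polling
# on a timer.
import asyncio
from dataclasses import dataclass

from temporalio import activity


@dataclass
class HealthCheckInput:
    nodes: list[str]
    interval_seconds: float


@activity.defn
async def check_cluster_health(input: HealthCheckInput) -> None:
    while True:
        for node in input.nodes:
            ...  # ping the node and record its status
        # Heartbeat so the server knows the activity is still alive and so a
        # cancellation (e.g. on cluster shutdown) is delivered promptly.
        activity.heartbeat()
        await asyncio.sleep(input.interval_seconds)
```

The workflow would then start such an activity with a heartbeat timeout and cancel it on shutdown, instead of sleeping and polling itself.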

dandavison force-pushed the remove-polling-in-loop-from-cluster-manager branch from a9d1707 to c0e27a6 on July 24, 2024, 14:00
A concern was raised that our samples should not be demonstrating any
form of polling from within an entity workflow loop, even if the poll
frequency is low. Instead, we should point to long-running activity patterns.

@@ -229,9 +196,7 @@ def should_continue_as_new(self) -> bool:
    async def run(self, input: ClusterManagerInput) -> ClusterManagerResult:
        self.init(input)
        await workflow.wait_condition(lambda: self.state.cluster_started)
        # Perform health checks at intervals.
        while True:

Member:
There is no value in this being a loop anymore and no value in the wait condition having a timeout (and therefore no value in a sleep interval setting)

dandavison (Contributor, author), Jul 24, 2024:
Ah, very good point! (Fixed the same bug in the TypeScript version.) I think this actually improves the sample a lot, since it may not be obvious to users that they can implement the main "loop" so simply, and it makes continue-as-new usage clearer and less intimidating.

)
if self.should_continue_as_new():

cretz (Member), Jul 24, 2024:
I will say this is a bit confusing to a user who may not understand tasks and how the nodes lock is handled. And I may have been wrong about removing the loop. Basically the interesting question is: can you guarantee that if should_continue_as_new() is True in the wait-condition lambda, it is also True here? The only way it could become True in the wait condition but False here is if nodes_lock went from unlocked to locked in the interim. The nodes_lock does not surround the entire handlers, it's only acquired after the cluster has started, and we don't really provide known coroutine ordering (only deterministic ordering).

Take this scenario:

  1. The workflow has been running for a long time and has a lot of history.
  2. A handler runs and awaits the wait condition for cluster started (which has long since been true).
  3. The event loop runs and hits both wait conditions in order, setting both futures to true (the should_continue_as_new wait condition is true because nodes_lock is not locked, and cluster started is true).
  4. Now maybe the handler acquires the lock and begins to run.
  5. Now maybe the workflow function here runs, and now should_continue_as_new() is False.

We may need the loop back, or to split the cluster-shutdown and should-continue-as-new wait conditions into separate tasks so you know which one completed.

Another scenario: what if steps 4 and 5 above are switched? What if this workflow thinks it should continue-as-new, but then, as coroutines settle post-completion, the update runs? We'd continue-as-new in the middle of its activities. I think this needs that new all_handlers_finished thing.

dandavison (Contributor, author):
Right, I'm trying to move too quickly here. Reinstated the loop, and added a wait for all_handlers_finished() before continue-as-new (CAN). It would be ideal to add some tests exercising CAN.
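
For readers following along, a rough sketch of the shape being described; this is not the exact sample code, and names such as `cluster_shutdown` and `get_state_for_continue_as_new()` are illustrative. `workflow.all_handlers_finished()` is the sdk-python API the PR is waiting on.

```python
# Illustrative only: main loop reinstated, with a drain of signal/update
# handlers before continue-as-new.
from temporalio import workflow

# ...inside ClusterManagerWorkflow.run(), after the cluster has started:
while True:
    await workflow.wait_condition(
        lambda: self.state.cluster_shutdown or self.should_continue_as_new()
    )
    if self.state.cluster_shutdown:
        break
    if self.should_continue_as_new():
        # Let in-flight handlers finish so we don't continue-as-new in the
        # middle of their activities.
        await workflow.wait_condition(lambda: workflow.all_handlers_finished())
        workflow.continue_as_new(self.get_state_for_continue_as_new())
```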

dandavison (Contributor, author):

This is now blocked on an sdk-python release.

drewhoskins-temporal (Contributor) left a comment:
The purpose is to demonstrate how to use a mutex to protect against interleaving between the handlers and the main coroutine, and to show something that solves a real-world problem where events come both from the workflow itself and from outside. Is there an alternative way to code up a health check that wouldn't involve polling?
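
For readers who haven't seen the sample, a minimal sketch of the mutex pattern being described, with illustrative names; the real ClusterManagerWorkflow has more state and more handlers.

```python
import asyncio
from typing import Optional

from temporalio import workflow


@workflow.defn
class ClusterManagerSketch:
    """Illustrative only: an update handler and the main coroutine share node
    state, so both take the same asyncio.Lock before mutating it."""

    def __init__(self) -> None:
        self.nodes: dict[str, Optional[str]] = {}
        self.nodes_lock = asyncio.Lock()
        self.shutdown_requested = False

    @workflow.signal
    def shutdown(self) -> None:
        self.shutdown_requested = True

    @workflow.update
    async def assign_nodes_to_job(self, job_name: str) -> list[str]:
        # Serialize with anything else that touches self.nodes, including run().
        async with self.nodes_lock:
            free = [k for k, v in self.nodes.items() if v is None][:2]
            for node in free:
                self.nodes[node] = job_name
            return free

    @workflow.run
    async def run(self) -> None:
        async with self.nodes_lock:
            self.nodes = {str(i): None for i in range(25)}
        await workflow.wait_condition(lambda: self.shutdown_requested)
```

(The PR description's suggestion for the health-check question is to move the check itself into a long-running activity, as sketched earlier in this thread.)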
