Eager Workflow Start (#606) #622

antlai-temporal · 2023-10-31T18:32:47Z

What was changed

Adds to clients the ability to track the workers attached to them, so that we can implement Eager Workflow Start,
a latency optimization that lets the client dispatch the first workflow task directly to the worker, eliminating a server round-trip.
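
As a rough illustration of that flow (a simplified model, not the code in this PR), the sketch below has a client that remembers which local worker serves each namespace+task queue and hands the first task straight to it over a channel. `Client`, `LocalWorker`, `FirstTask`, `attach_worker`, and `start_workflow` are hypothetical names, and the server call that would return the first task inline is assumed rather than shown.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the first workflow task of a new workflow.
#[derive(Debug, PartialEq)]
pub struct FirstTask(pub String);

/// Hypothetical handle to a local worker: the sending half of its task channel.
pub struct LocalWorker {
    tx: Sender<FirstTask>,
}

/// Hypothetical client that tracks local workers by (namespace, task_queue).
#[derive(Default)]
pub struct Client {
    workers: HashMap<(String, String), LocalWorker>,
}

impl Client {
    /// Attach a local worker and return the channel on which it receives eager tasks.
    pub fn attach_worker(&mut self, namespace: &str, task_queue: &str) -> Receiver<FirstTask> {
        let (tx, rx) = channel();
        self.workers
            .insert((namespace.to_owned(), task_queue.to_owned()), LocalWorker { tx });
        rx
    }

    /// Returns true when the first task was handed directly to a local worker,
    /// false when no local worker matches and the worker would have to poll as usual.
    pub fn start_workflow(&self, namespace: &str, task_queue: &str, workflow_id: &str) -> bool {
        match self.workers.get(&(namespace.to_owned(), task_queue.to_owned())) {
            // Assumption: the start call has already returned the first task inline,
            // so it can be forwarded without another server round-trip.
            Some(worker) => worker.tx.send(FirstTask(workflow_id.to_owned())).is_ok(),
            None => false,
        }
    }
}
```

A worker attached for ("default", "orders") would receive `FirstTask("wf-1")` on its channel after `start_workflow("default", "orders", "wf-1")`, while a start on any other queue falls back to the normal path.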

Why?

Significantly reduces the latency of short-lived workflows.

Checklist

Closes
[Feature Request] Enable Eager Workflow Start #606

Signed-off-by: Antonio Lain <[email protected]>

client/src/lib.rs

client/src/worker_registry/mod.rs

client/src/raw.rs

client/src/worker_registry/mod.rs

cretz · 2023-10-31T20:24:19Z

client/src/worker_registry/mod.rs

+}
+
+/// This trait enables local workers to make themselves visible to a shared client instance.
+pub trait WorkerRegistry: Send + Sync {


Could also have a single register that returns a WorkerPermit with unregister on it. That way you could actually get rid of this trait and change fn workers(&self) -> Arc<RwLock<dyn WorkerRegistry>> { to something like fn register_slot_provider(&self, Box<dyn SlotProvider>) -> WorkerPermit. Also it will simplify your unregister logic instead of asking the caller to give you their ID again. Also may be worth choosing to use either "worker" or "slot provider" terminology instead of both here.

(none of my suggestions like these are required, just discussion starters)
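
For readers following the thread, the permit idea could look roughly like the sketch below. It is one possible reading of the suggestion, not code from the PR: `register_slot_provider` and `WorkerPermit` take their names from the comment above, while `Client`, `QueueProvider`, `registered_queues`, and the simplified `SlotProvider` trait are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the PR's `SlotProvider` trait.
pub trait SlotProvider: Send + Sync {
    fn task_queue(&self) -> &str;
}

/// Hypothetical provider used only to exercise the sketch.
pub struct QueueProvider(pub String);

impl SlotProvider for QueueProvider {
    fn task_queue(&self) -> &str {
        &self.0
    }
}

type Providers = Arc<Mutex<HashMap<u64, Box<dyn SlotProvider>>>>;

/// Hypothetical client owning the registered providers.
#[derive(Default)]
pub struct Client {
    providers: Providers,
    next_id: AtomicU64,
}

/// Returned by registration; dropping it removes the provider again.
pub struct WorkerPermit {
    id: u64,
    providers: Providers,
}

impl Client {
    /// The registry picks the id, so the caller never has to supply one.
    pub fn register_slot_provider(&self, provider: Box<dyn SlotProvider>) -> WorkerPermit {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.providers.lock().unwrap().insert(id, provider);
        WorkerPermit { id, providers: Arc::clone(&self.providers) }
    }

    /// Sorted task queues of all currently registered providers.
    pub fn registered_queues(&self) -> Vec<String> {
        let mut queues: Vec<String> = self
            .providers
            .lock()
            .unwrap()
            .values()
            .map(|p| p.task_queue().to_owned())
            .collect();
        queues.sort();
        queues
    }
}

impl WorkerPermit {
    /// Explicit unregister: consumes the permit so that `Drop` runs.
    pub fn unregister(self) {}
}

impl Drop for WorkerPermit {
    fn drop(&mut self) {
        self.providers.lock().unwrap().remove(&self.id);
    }
}
```

With this shape a worker would store the permit as a field and unregistration happens when the worker is dropped, at the cost of the worker having to hold on to the value returned by registration.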

With the move to a single worker per namespace+queue+client it helps to have a worker id on register to detect multiple workers claiming the same namespace+queue+client, and ignore the registration. The logic on the caller worker gets easier too, they don't have to deal with whether they got a permit or not. So I think I'll stick to register(provider)/unregister(id)...

Can't you just consider a second attempted registration of the same namespace+queue a failure? Why do you have to account for identifier at all?

No, I can't, because registration will happen whether the programmer is using eager start or not in their code, and I want to make sure that I don't break anything in existing code... I also worry about a worker trying to register twice, and the id helps to decide to do a warn! or silently ignore it.
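
A minimal sketch of the behaviour described here, assuming a worker-supplied string id and one owner per namespace+task queue. `Registry` and the `Registration` enum are hypothetical, and `eprintln!` stands in for the `warn!` macro used in the PR.

```rust
use std::collections::HashMap;

/// Outcome of a registration attempt (hypothetical type, not from the PR).
#[derive(Debug, PartialEq, Eq)]
pub enum Registration {
    Accepted,
    /// The same worker registered again: ignored silently.
    AlreadyRegistered,
    /// A different worker already owns this namespace+task queue: ignored with a warning.
    Conflict,
}

#[derive(Default)]
pub struct Registry {
    /// (namespace, task_queue) -> id of the owning worker.
    owners: HashMap<(String, String), String>,
}

impl Registry {
    pub fn register(&mut self, namespace: &str, task_queue: &str, worker_id: &str) -> Registration {
        let key = (namespace.to_owned(), task_queue.to_owned());
        match self.owners.get(&key) {
            None => {
                self.owners.insert(key, worker_id.to_owned());
                Registration::Accepted
            }
            Some(owner) if owner.as_str() == worker_id => Registration::AlreadyRegistered,
            Some(_) => {
                // Stand-in for `warn!`; existing code keeps working, it just is not eager-capable.
                eprintln!("Ignoring registration for worker: {key:?}.");
                Registration::Conflict
            }
        }
    }

    /// Removes only entries owned by `worker_id`, so a worker whose registration
    /// was ignored can call this unconditionally during shutdown.
    pub fn unregister(&mut self, worker_id: &str) {
        self.owners.retain(|_, owner| owner.as_str() != worker_id);
    }
}
```

The point of the id is visible in `unregister`: a worker that lost the race for a queue cannot accidentally evict the worker that won it.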

I think this kind of registry/registration should be eager-workflow-start specific, and we shouldn't perform eager-workflow-start-enabled-worker-register if that's disabled. Future needs to tie clients and workers together can be done via their separate needs. Also makes your registered worker map much smaller (not that it matters much).

I don't think that's the goal though, and at least for now it's always disabled. Eager workflow start capability is enabled/disabled at a worker level (but we may allow additional overriding to disable it at the start level if enabled at the worker level) and it will be disabled by default (and we're by no means settled on whether that will change in the future).

I don't think it is enabled/disabled at the worker level in either Java or Go. It is also unnecessary complexity. Once we decide that we want to enable eager in the server, it should just work if clients use the flag in start_workflow....

I see, my mistake! So I guess you will register every one as eager-capable. I'm now not as big of a fan of erroring on duplicates, heh.

So how would a worker ever call register for itself twice necessitating an identifier (instead of once and given a thing for it to unregister itself)?

With the current implementation it won't happen; if the constructor fails you get a new id. But I was thinking that one day the registration may not be just for a single process, but for a pool of local processes that can all benefit from early start, and having a unique id for a worker in the trait method will hide that implementation. For debugging it also helps to have a unique id per worker, especially if we are imposing the restriction that only one can register.

And as I just mentioned, the "given a thing to unregister" forces the client to deal with failed registrations, while unregister with id is just transparent.

I personally wouldn't be concerned with a potential future of deduplicating, and having an optional "lease" for a successful registration stored on the worker to unregister is normal. The obvious benefits are 1) you don't need a separate trait compared to a single call, and 2) a worker doesn't have to be responsible for its identifier. But if consensus is that we want workers to have this identifier and we want this separate registry, ok.

client/src/worker_registry/mod.rs

cretz · 2023-10-31T20:49:52Z

core/src/worker/mod.rs

+            wft_semaphore.clone(),
+            external_wft_tx,
+        );
+        client.workers().write().register(Box::new(provider));
        Self {


Hrmm, wonder if it's clearer for the worker to just impl SlotProvider trait directly

At first I started that way, but I had to use Arc workers + Weak refs, and the comparisons with the hashmap+vector were getting unnecessarily ugly, and it required more intrusive changes because Workers::new() doesn't return Arc worker refs.
It was also more difficult to debug compared to just using simple uuid strings, so I think the indirection helps...

client/src/raw.rs

client/src/worker_registry/mod.rs

core/src/worker/slot_provider.rs

client/src/worker_registry/mod.rs

cretz · 2023-10-31T20:55:54Z

client/src/worker_registry/mod.rs

+#[cfg_attr(test, mockall::automock)]
+pub trait SlotProvider: Send + Sync + Debug {
+    /// A unique identifier for the worker.
+    fn uuid(&self) -> String;


There's a lot of string cloning going on, can you have a scalar identifier? If you switch to the WorkerPermit thing mentioned elsewhere, the worker registry can be responsible for ID creation and therefore can just keep an atomic counter of u64 or something. No need for a worker to provide its ID nor is there a need for a UUID.
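
To make the scalar-identifier suggestion concrete, here is a small sketch in which the registry hands out `Copy` ids from an atomic counter, so nothing needs to be cloned to identify a worker. `WorkerId` and `Registry` are hypothetical names, and the map stores a label where a real registry would store the provider.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Cheap, `Copy` identifier handed out by the registry (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct WorkerId(u64);

#[derive(Default)]
pub struct Registry {
    next_id: AtomicU64,
    /// The value is just a label here; a real registry would hold the provider.
    workers: Mutex<HashMap<WorkerId, String>>,
}

impl Registry {
    pub fn register(&self, label: &str) -> WorkerId {
        // Relaxed is enough: only uniqueness of the counter value matters.
        let id = WorkerId(self.next_id.fetch_add(1, Ordering::Relaxed));
        self.workers.lock().unwrap().insert(id, label.to_owned());
        id
    }

    /// Returns true if the id was registered and has now been removed.
    pub fn unregister(&self, id: WorkerId) -> bool {
        self.workers.lock().unwrap().remove(&id).is_some()
    }
}
```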

Merge branch 'eager-wf-start' of github.com:antlai-temporal/sdk-core into eager-wf-start

core/src/worker/slot_provider.rs

Sushisource

Looking pretty good to me. Just a few more small changes and I think that's it.

client/src/worker_registry/mod.rs

Sushisource · 2023-11-02T22:41:45Z

client/src/worker_registry/mod.rs

    #[cfg(test)]
    /// Returns (num_providers, num_buckets), where a bucket key is namespace+task_queue.
+    /// There is only one provider per bucket so `num_providers` should be equal to `num_buckets`.
    pub fn num_providers(&self) -> (usize, usize) {
        self.manager.read().num_providers()
    }


I think you can just get rid of this entirely now

I was checking the invariant that both maps have the same size in the tests, but yes, not much value...

Sorry, or you meant just get rid of the comment?

I meant the test generally

I use it in all the unit tests in that file, I need to check that things are deallocated/allocated properly, and with the wrapping I can't get to the maps directly...
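
The invariant being discussed can be illustrated with a reduced model: one map from namespace+task queue to provider, one index from worker key to that bucket, and a `num_providers` helper returning both sizes. This is a sketch only; the PR's index is a SlotMap, which is replaced here by a plain `HashMap` with a counter, and the provider is reduced to a string.

```rust
use std::collections::HashMap;

/// Bucket key: (namespace, task_queue).
type SlotKey = (String, String);

#[derive(Default)]
pub struct SlotManager {
    /// One provider per bucket (a string stands in for the boxed provider).
    providers: HashMap<SlotKey, String>,
    /// Worker key -> bucket, used to find the provider on unregister.
    index: HashMap<u64, SlotKey>,
    next_key: u64,
}

impl SlotManager {
    /// Returns a worker key, or None if the bucket already has a provider.
    pub fn register(&mut self, namespace: &str, task_queue: &str) -> Option<u64> {
        let key = (namespace.to_owned(), task_queue.to_owned());
        if self.providers.contains_key(&key) {
            return None;
        }
        let worker_key = self.next_key;
        self.next_key += 1;
        self.providers.insert(key.clone(), format!("{namespace}/{task_queue}"));
        self.index.insert(worker_key, key);
        Some(worker_key)
    }

    /// Removes both entries; unknown or already-removed keys are a no-op.
    pub fn unregister(&mut self, worker_key: u64) {
        if let Some(key) = self.index.remove(&worker_key) {
            self.providers.remove(&key);
        }
    }

    /// Returns (num_providers, num_buckets); the two should always be equal.
    pub fn num_providers(&self) -> (usize, usize) {
        (self.providers.len(), self.index.len())
    }
}
```

A unit test can then register and unregister a few workers and assert that both counts move together, which is the allocation/deallocation check mentioned above.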

core/src/worker/slot_provider.rs

client/src/worker_registry/mod.rs

Sushisource

Just a few more minor things, but, LGTM. Thanks!

Sushisource · 2023-11-03T21:58:44Z

client/Cargo.toml

@@ -23,6 +23,8 @@ once_cell = "1.13"
 opentelemetry = { workspace = true, features = ["metrics"] }
 parking_lot = "0.12"
 prost-types = "0.11"
+rand = "0.8.3"


We don't actually need the rand dep any more

Good catch thanks

Sushisource · 2023-11-03T22:00:59Z

client/src/worker_registry/mod.rs

+            p.insert(provider);
+            Some(self.index.insert(key))
+        } else {
+            warn!("Ignoring registration for worker in bucket {key:?}.");


"Bucket" isn't going to mean much for a user seeing the warning.

Suggested change

-            warn!("Ignoring registration for worker in bucket {key:?}.");
+            warn!("Ignoring registration for worker: {key:?}.");

Sushisource · 2023-11-03T22:09:22Z

client/src/worker_registry/mod.rs

+#[cfg_attr(test, mockall::automock)]
+pub trait WorkerRegistry {
+    /// Register a local worker that can provide WFT processing slots.
+    fn register(&self, provider: Box<dyn SlotProvider + Send + Sync>) -> Option<WorkerKey>;
+    /// Unregister a provider, typically when its worker starts shutdown.
+    fn unregister(&self, id: WorkerKey);
+}


I'm inclined still to say this trait can just go away and these can be implemented on the SlotManager directly - and in general I try to avoid using mocks except when mocking things that require external resources (ex: the existing mocks are pretty much only for clients / pollers), but this is all internal so typically there's not a big reason to mock anything.

The other two traits make a bit more sense to me since they're implemented outside of the client crate.

I didn't want "core" to see try_reserve_wft_slot, which can only be called by the client, but I'll remove it...

That's what pub(crate) is for

Good point, changing the visibility...
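
A short sketch of the visibility approach, under the assumption that the manager and its caller live in the same crate. The `SlotManager` here is reduced to a counter of free slots, and `free_slots` and `start_workflow_eagerly` are hypothetical names; only `try_reserve_wft_slot` comes from the discussion.

```rust
pub mod worker_registry {
    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Simplified stand-in for the PR's `SlotManager`: it only counts free slots.
    pub struct SlotManager {
        free_slots: AtomicUsize,
    }

    impl SlotManager {
        pub fn new(free_slots: usize) -> Self {
            Self { free_slots: AtomicUsize::new(free_slots) }
        }

        /// Public: callable from any crate that depends on this one.
        pub fn free_slots(&self) -> usize {
            self.free_slots.load(Ordering::Acquire)
        }

        /// `pub(crate)`: callable anywhere in this crate, invisible to other crates.
        pub(crate) fn try_reserve_wft_slot(&self) -> bool {
            // Decrement only if a slot is free; `checked_sub` returns None at zero.
            self.free_slots
                .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
                .is_ok()
        }
    }
}

/// Client-side entry point in the same crate, so it may reserve a slot.
pub fn start_workflow_eagerly(manager: &worker_registry::SlotManager) -> bool {
    manager.try_reserve_wft_slot()
}
```

Code in another crate can still construct and inspect the manager, but a call to `try_reserve_wft_slot` from there would fail to compile, which gives the separation that the trait was providing.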

Sushisource · 2023-11-03T22:11:34Z

core/src/worker/mod.rs

@@ -75,7 +81,8 @@ use {
 pub struct Worker {
    config: WorkerConfig,
    wf_client: Arc<dyn WorkerClient>,
-
+    /// Registration key for this worker


Worth saying what the registration key is for

antlai-temporal added 8 commits October 20, 2023 16:33

First complete implementation

643f0e8

Signed-off-by: Antonio Lain <[email protected]>

Merge remote-tracking branch 'upstream/master' into eager-wf-start

fdc0d73

Draft

a491b7a

Signed-off-by: Antonio Lain <[email protected]>

Exclude mock from merging streams

656df5a

Adding unit tests

f4c5ce7

Add integration test

829bb35

Merge remote-tracking branch 'upstream/master' into eager-wf-start

3addfb0

Adding recover integration test

6d2eb19

antlai-temporal requested a review from a team as a code owner October 31, 2023 18:32

antlai-temporal added 2 commits October 31, 2023 12:26

Lint the test code

6d26d35

Enable dynamic config for buildkite

833610a

antlai-temporal changed the title ~~Eager Workflow Start #606~~ Eager Workflow Start (#606) Oct 31, 2023

cretz reviewed Oct 31, 2023

Sushisource reviewed Oct 31, 2023

cretz reviewed Oct 31, 2023

antlai-temporal and others added 5 commits October 31, 2023 21:14

Many review comments addressed

0a5c48e

Fix macro needing dummy string

9ec20a1

Consume slot in shedule_wft

fd0de7c

Adding Spencer's macro trick

415c48a

Merge branch 'eager-wf-start' of github.com:antlai-temporal/sdk-core into eager-wf-start

Hide the RwLock for SlotManager

7f10e09

Sushisource reviewed Nov 2, 2023

Forcing one provider per namespace+task_queue+client

3e03980

Sushisource reviewed Nov 2, 2023

antlai-temporal added 3 commits November 3, 2023 11:06

Remove quiet flag

cbb18bb

Replace uuids by SlotMap keys

62db6cf

Merge remote-tracking branch 'upstream/master' into eager-wf-start

ff6015b

Sushisource approved these changes Nov 3, 2023

antlai-temporal added 3 commits November 3, 2023 15:55

Remove WorkerRegistry trait

27e5626

Merge remote-tracking branch 'upstream/master' into eager-wf-start

9e8eb40

Make try_reserve_wft_slot only visible in the client crate

f7a6e8d

antlai-temporal merged commit 2e9e3b0 into temporalio:master Nov 3, 2023
5 checks passed

antlai-temporal mentioned this pull request Nov 14, 2023

[Bug] Shutdown worker after replay test #630

Closed
