Consistency question: what statements can we make about the execution interleavings of an Activity invocation? #2821

NCarter3 · 2024-05-17T22:32:45Z


Hi Functions Team!

I've been reading the documentation and "Durable Functions: Semantics for Stateful Serverless", and have come away with a couple of fairly fundamental questions about the consistency model.

My current understanding of the framework's guarantees is as follows:
- Activities: at-least-once execution
- Entities: effectively-once execution (as defined in this answer: the function may be executed multiple times for a message, but the changes to the durable state will be committed exactly once)

My questions mainly concern side effects (external service calls). My application creates, updates, and deletes external state (Azure resources and database state). In the durable framework, can we make the following statements? And if we cannot, can we make similar but weaker statements along these lines?

- After an invocation of an Activity produces a result that is committed durably, that Activity is no longer running.
- After an invocation of an Entity produces a result that is committed durably, that Entity is no longer running.
- For a given Activity invocation, at most one instance of that code is running in parallel.
- For a given Entity invocation, at most one instance of that code is running in parallel.

The exhortation to write idempotent Activities is well heard, and I want to understand precisely what we mean by idempotency. There is a large difference between Activities that are idempotent under serial re-invocation and Activities that are idempotent under all possible interleavings with themselves, and that have well-defined behavior if the orchestration continues while an Activity is still running! I haven't observed problems here that I can point to, but I intend to start using Durable Functions for some more ordering-sensitive tasks, and I want to make sure I'm on sound theoretical footing before proceeding.
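The gap between the two notions of idempotency can be illustrated with a toy model. This is only a sketch under stated assumptions, not Durable Functions code: `ExternalService` and `create_if_absent` are hypothetical names, and the `yield` marks a point where another execution of the same Activity invocation could be scheduled. A check-then-create Activity is idempotent under serial re-invocation, but creates a duplicate when two executions interleave between the check and the create.

```python
from typing import Generator, List


class ExternalService:
    """Hypothetical external system (not a real Azure API)."""

    def __init__(self) -> None:
        self.created: List[str] = []  # one entry per create call

    def exists(self, name: str) -> bool:
        return name in self.created

    def create(self, name: str) -> None:
        self.created.append(name)


def create_if_absent(svc: ExternalService, name: str) -> Generator[None, None, None]:
    """Check-then-create Activity body; the yield is a possible preemption point."""
    present = svc.exists(name)
    yield  # another execution of the same invocation may run here
    if not present:
        svc.create(name)


def run_serial(svc: ExternalService, name: str, times: int) -> None:
    """Run executions one after another, each to completion."""
    for _ in range(times):
        for _ in create_if_absent(svc, name):
            pass


def run_interleaved(svc: ExternalService, name: str) -> None:
    """Run two executions so that both check before either creates."""
    a = create_if_absent(svc, name)
    b = create_if_absent(svc, name)
    next(a)  # execution a performs its check
    next(b)  # execution b performs its check
    for gen in (a, b):
        for _ in gen:  # resume each execution until it finishes
            pass


serial = ExternalService()
run_serial(serial, "vm-1", times=2)

interleaved = ExternalService()
run_interleaved(interleaved, "vm-1")

print(len(serial.created), len(interleaved.created))
```

In this model the serial run creates the resource once and the interleaved run creates it twice, which is exactly the difference the question is about: whether the stronger, interleaving-safe form is required depends on whether the framework can actually produce the second schedule.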

Using @sebastianburckhardt, @cgillum, et al.'s wonderful formalization of Durable Functions semantics, I don't think we can make any of these claims. In fact, the paper assumes that they can be violated:

Workers continuously look for work items, i.e. entries 𝜅 ⟨𝑔in, 𝑥pre⟩ for which 𝑔in is not empty. After a worker processes such an item (as defined in §5.2), it tries to atomically commit the new execution state 𝑥post, as well as all task messages 𝑔out that were produced by the execution. [...] The commit happens only if the original 𝑔in and 𝑥pre match. This ensures that even if multiple workers attempt to execute the same work item, only one worker can commit it. (5.1.3)
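The conditional commit described in this passage can be sketched as a compare-and-swap on the storage entry. The code below is a single-threaded toy model, not the framework's implementation: `Entry` and `try_commit` are hypothetical names, and a real store would have to perform the comparison and the write atomically.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    """Toy storage entry: pending input messages and execution state."""
    g_in: Tuple[str, ...]
    x_pre: str
    outbox: List[str] = field(default_factory=list)


def try_commit(entry: Entry, seen_g_in: Tuple[str, ...], seen_x_pre: str,
               x_post: str, g_out: List[str]) -> bool:
    """Commit only if the entry still matches what the worker originally read."""
    if entry.g_in != seen_g_in or entry.x_pre != seen_x_pre:
        return False  # another worker already committed this work item
    entry.g_in = ()           # the input messages are consumed
    entry.x_pre = x_post      # the new execution state becomes current
    entry.outbox.extend(g_out)
    return True


entry = Entry(g_in=("call A",), x_pre="x0")

# Two workers read the same work item before either commits.
snapshot_a = (entry.g_in, entry.x_pre)
snapshot_b = (entry.g_in, entry.x_pre)

# Both workers would run user code here; only the commits are modeled.
committed_a = try_commit(entry, *snapshot_a, x_post="x1", g_out=["result from a"])
committed_b = try_commit(entry, *snapshot_b, x_post="x1", g_out=["result from b"])

print(committed_a, committed_b, entry.outbox)
# → True False ['result from a']
```

Note what the check does and does not cover in this model: it prevents the second commit, but nothing in it prevents the second worker from having executed the work item.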

However, the paper also leaves this tantalizing future direction, which may indicate that it is overly conservative in its claims:

The compute-storage model currently does not include external calls or critical sections. Adding external calls to the formalism is technically easy, but requires an alternate formulation of the correctness guarantees, because duplication of external calls (unlike internal calls) is observable, and can happen when workers make repeated attempts at processing a work item. (5)

To be clear about my ask: it's not the duplication of calls that concerns me; it's the possibility that a given Activity interleaves with itself, or keeps running after the Orchestration believes that Activity is complete. Can the framework forget or orphan an execution and create a second execution while the first is still running? Here is an example:

Let $O_1$ be a specific orchestration invocation of Orchestrator $O$.
Let $A_1$ be the first invocation of Activity $A$ by Orchestrator $O$.
Let $A_{1a}$ be the first processing (running of code) of the invocation $A_1$.
Let $A_{1b}$ be the second processing of the invocation $A_1$.

1. $O_1$ starts.
2. $O_1$ calls $A_1$; $O_1$ yields.
3. $A_1$ is picked up by a worker, execution $A_{1a}$ is started.
4. $A_1$ is picked up by a (different) worker, execution $A_{1b}$ is started.
5. $A_{1a}$ calls external service; returns successfully.
6. $A_{1a}$ returns a result; result is durably committed.
7. $O_1$ replays and resumes with $A_{1a}$ result.
8. $A_{1b}$ calls external service; returns successfully.
9. $A_{1b}$ returns a result; result cannot be committed as $A_{1a}$ already committed; result is dropped.

Is it possible in this framework for (3) and (4) to both occur, and both begin running user code interleaved?
Is it possible for $A_{1b}$ to still be executing at (8) even after a result for $A_1$ is committed and visible at (6-7)?
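For concreteness, the trace above can be replayed in a toy model. This is a sketch under assumptions only: it models nothing more than a first-writer-wins commit, the `execution` generator and the `state` dictionary are hypothetical, and the fact that the model admits this trace says nothing about whether the real framework does.

```python
from typing import Dict, Generator, List, Optional

log: List[str] = []  # observable order of events
state: Dict[str, Optional[str]] = {"committed": None}  # committed result for A_1


def execution(name: str) -> Generator[None, None, None]:
    """One processing of invocation A_1; each yield is a scheduling point."""
    log.append(f"{name} started")
    yield
    log.append(f"{name} external call")
    yield
    if state["committed"] is None:  # conditional commit: first writer wins
        state["committed"] = name
        log.append(f"{name} committed")
    else:
        log.append(f"{name} dropped")


a, b = execution("A_1a"), execution("A_1b")
next(a)                      # step 3: A_1a starts
next(b)                      # step 4: A_1b starts
next(a)                      # step 5: A_1a calls the external service
next(a, None)                # step 6: A_1a commits its result
log.append("O_1 resumed")    # step 7: the orchestration continues
next(b)                      # step 8: A_1b calls the external service
next(b, None)                # step 9: A_1b's result is dropped

for event in log:
    print(event)
```

In this model both executions start before either commits, and the external call made by $A_{1b}$ is logged after both the commit and the orchestration's resumption. The two questions ask whether the framework rules out either of these orderings.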

Thank you for any insights that can be added here!
