Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async API specifications #2

Open
stevana opened this issue Dec 2, 2024 · 0 comments
Open

Async API specifications #2

stevana opened this issue Dec 2, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@stevana
Copy link
Collaborator

stevana commented Dec 2, 2024

Currently only synchronous API specifications are supported, it would be nice to be able to specify asynchronous (message passing) APIs as well though.

First of all, in the same way as OpenAPI specifications serve as basis for the current sync HTTP API specifications, AsyncAPI or CloudEvents can serve as a basis for the async message passing specifications. We don't want to use their JSON syntax, but it's good to have something to fall back on with regards to completeness (i.e. what should an async specification be able to cover).

Second, async APIs need to interact well with protocols (see #1 for details), because protocols are important in the async setting as well.

Regarding syntax, here an initial example specification of the bully algorithm for leader election:

node N_{1,2,3} where

-- If we ping the leader and we get a timeout we need to elect a new leader.
follower, leader = i & N_i!Ping, N_i?Pong & follower | Timeout & candidate

-- Broadcast election.
candidate & Ns!Election & election

follower & N_i?Election N_i!Ok & candidate

election & Ns?Ok & follower
             | 
	     ?Timeout & won

won & Ns!Coordinator & leader

follower & N_i?Coordinator & follower, leader=i

-- or, we can combine the last two rules as such:

election & ?Ok+ & follower
           | 
	   ?Timeout !Cordinator & leader

-- candidate is a useful initial state, so that's why I didn't inline that
-- into the ping operation.

We use ! for send and ? for receive, like in the session type literature. The + means "more than one".

Here's another example of the two-phase commit (2PC) protocol:

-- Transaction manager
node TM where

init & Client?Write, RMs!prepare & working

working & RM_i?aborted, RMs!abort & aborted
working & RMs?prepared, RMs!commit & committed

-- Resource manager
node RM_{1,2} where

working & TM?prepare, TM!prepared & prepared
                    | TM!aborted & aborted

prepared & TM?commit & committed

prepared | aborted & TM?abort & aborted

(For comparison, here's a TLA+ version of the 2PC protocol.)

Contract checkers would work largely the same as described in #1, one difference might be that sometimes we can receive "more than one" of a message (indicated by the + above), in that case the checker might have to be more lenient after transitioning to the next state with what messages it can receive.

When testing in the sync setting, the fuzzer pretends to be a client of the server (which should respect some sync API specification), in the async case there's often not a client/server separation. The the bully algorithm leader election example from above, each node is a client and a server at the same time.

One way to deal with this situation is to configure the nodes to send all the messages to the test driver/fuzzer. The test driver can keep a heap of all messages and each time a new message arrives, it can generate a new arrival time and push it into the heap, before popping the top message of the heap and delivering it to the appropriate node. One could also imagine only testing one node at the time and using the protocol to generate valid messages, i.e. simulating the other nodes.

The contract checkers will ensure that the blackbox behaviour (which is determined by in- and output messages) for each node is correct. In addition global properties can be ensured by having the contract checkers record all (inputState, session type, outputState) triples and at the end of a test run (once faults have been removed and the system recovered) we can run LTL formulas over all traces, e.g. for the bully algorithm:

property leaderGetsElected = 
  always (N_i.state' == candidate ==> eventually (N_i.state' == follower || N_i.state' == leader))

For 2PC the property might look like:

property 
  always eventually RM_1.state == RM_2.state && 
    RM_1.state == committed || RM_1.state == aborted

To make the above outlined testing easier, it's convenient to require certain functionality/structure of the nodes under test:

  1. If the nodes are implemented as state machines which take an input and generate zero or more outputs, then this simplifies the scheduling of messages in the test driver;
  2. In order to avoid waiting for timeouts, one should make nodes be able to fire any pending timeouts that would have happened by some time T. That way we can fire the timeouts at the arrival time of the next message we feed to a node.

One last thing regarding functionality/structure of the nodes is the ability to dump it's current state, which is usually easy to do if the node is implemented as a state machine. This is useful when debugging, as it allows us to step through and see how the state evolved over time when it received messages.

@stevana stevana added the enhancement New feature or request label Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant