-
Notifications
You must be signed in to change notification settings - Fork 14
Synchronization Barrier Protocol
This describes the protocol used by a distributed PlusCal algorithm compiled by PGo. Before processes can start executing their algorithms, they need to agree whether global state has been properly initialized and to establish connections with one another. In this protocol, a process is arbitrarily chosen to play the role of coordinator and the other processes get information about node locations from it.
This protocol is currently used at least twice in an application compiled by PGo:
- before each process starts their algorithm defined in the PlusCal spec
- upon termination, to make sure processes holding on to global state will not terminate before other processes (that might depend on it) also terminate.
When processes are created, they attempt to send a DIAL
message to the coordinator (determined at compile-time). In case that fails, processes keep trying to contact the process coordinator until it finally succeeds (in the example above, P1 is started before P2). The coordinator acknowledges DIAL
messages, and processes then wait for a message from the coordinator.
As soon as the coordinator receives DIAL
messages from every process, it sends each process a list of IP:port addresses where each other process in listening on. These addresses can be used to communicate across processes.
When a process receives the list of other processes in the system, it establishes a connection with them. When that is completed, a CONNECTED
message is sent to the coordinator to indicate that that process is ready to start any time.
When the coordinator received CONNECTED
messages from every other process, it is then responsible for starting the actual execution of the algorithms specified in PlusCal. The coordinator sends a START
message to every process, indicating that it can now start algorithm execution.
- By the end of the protocol outlined above, all processes will be connected to each other forming a fully connected mesh topology.
- Failures are not handled
- Processes do not disconnect
- Communication happens over Go RPC, which runs on top of TCP.
- The choice of the coordinator process is currently arbitrary and set to the first element in the list of
peers
declared in the configuration file.