Distributed and Parallel Computing - Distributed Computing
Introduction
Distributed computing means computing on multiple nodes/processes
Each node has a unique ID
Nodes are connected via channels (i.e. edges)
Nodes do not share memory or a global clock
Any two nodes either share a channel or they don't; there are no partial links
Networks are strongly connected (i.e. there is a path between any two nodes)
Networks don't necessarily have to be complete (i.e. an edge between every
pair of nodes)
Channels may be directed or undirected; whether a channel is directed is a
property of the channel itself, separate from whether two nodes are connected
at all
Communication is passing messages over channels
Channels are not necessarily FIFO
Communication is asynchronous
Parameters of the Network
N number of nodes
E number of edges
D diameter of the network, i.e. the longest "shortest path" between any
two nodes
Failures
Failure rate (FR): Number of failures per unit time
Mean time between failures (MTBF): 1/FR
e.g.
1000 nodes, all critical
MTBF of each node is 10,000
FRₙ = 1/MTBF
FRₙ = 1/10,000
FRₛ = sum(FRₙ for n in nodes)
FRₛ = 1,000 * (1/10,000)
FRₛ = 1/10
MTBFₛ = 1/(1/10)
MTBFₛ = 10
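As a quick sanity check, the same arithmetic in Python (time units are
whatever the MTBF is quoted in):

```python
nodes = 1000
mtbf_node = 10_000            # MTBF of a single node

fr_node = 1 / mtbf_node       # failure rate of one node
fr_system = nodes * fr_node   # all nodes are critical, so failure rates add
mtbf_system = 1 / fr_system

print(fr_system)    # 0.1  -> one failure per 10 time units
print(mtbf_system)  # 10.0
```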
Spanning Trees
Spanning tree:
Contains all the nodes of the network
Edges are a subset of the network's edges
No cycles
Undirected
Tree edges: edges in the spanning tree
Frond edges: edges in the network but not in the spanning tree
Sink tree: tree made by taking all edges of a spanning tree, and making them
directed so that all paths end up at some chosen root node
Usage
We want to broadcast a message to the whole network
If every node naively forwards the message to all its neighbours, it
circulates forever (the network may contain cycles)
If we send the message to the spanning tree root, and then follow the reverse
of the sink tree (i.e. the edges reversed, pointing away from the root), every
node sees the message exactly once; a sketch follows
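A minimal sketch of this broadcast, assuming the sink tree has already been
reversed into a child map (the names here are illustrative, not from the
notes):

```python
# Hypothetical spanning tree, written as node -> children after reversing
# the sink tree's edges so they point away from the root "r".
children = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}

def broadcast(node, message):
    """Deliver `message` to `node`, then forward it along tree edges only.
    Every node is reached exactly once because a tree has no cycles."""
    print(f"{node} received {message!r}")
    for child in children[node]:
        broadcast(child, message)

broadcast("r", "hello")
```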
Transition Systems
The behaviour of a distributed algorithm is given by a transition system:
Set of configurations: γ, δ ∈ C
Global state of the algorithm
C is the set of all possible configurations of an algorithm
Binary transition relation on C: γ → δ
Changes the global state from one configuration to another
A set of initial configurations: I
A configuration is terminal if there are no transitions out of the
configuration
An execution is a sequence of configurations, beginning with an element of I,
in which each configuration is reached from the previous one via the
transition relation; it is either infinite or ends in a terminal configuration
A configuration is reachable if there is an execution that contains it
States and Events
Configuration consists of
The set of local states of each node
The messages in transit between the nodes
Transitions are associated with events:
Internal: an action that modifies the internal state of a process
Send: A message is sent from a process
Receive: A message is received by a process
Initiator: a process whose first event is an internal or a send event
Centralised: only one initiator
Decentralised: multiple initiators
Assertions
Safety property
Must be true on every reachable configuration
Liveness property
From every reachable configuration, a configuration with this property must
be reachable
Orders
Total order
a ≤ a reflexivity
a ≤ b && b ≤ a → a = b antisymmetry
a ≤ b && b ≤ c → a ≤ c transitivity
a ≤ b || b ≤ a totality
Partial order
a ≤ a reflexivity
a ≤ b && b ≤ a → a = b antisymmetry
a ≤ b && b ≤ c → a ≤ c transitivity
No totality
Causal order
a ≺ b iff a must occur before b in any execution
If a and b occur in the same process and a occurs before b, then a ≺ b
If a is a send event and b is the corresponding receive event, then a ≺ b
a ≺ b && b ≺ c → a ≺ c
Computations
Executions are too specific: different orderings of concurrent events give
different executions, even though the same work is done
Instead, we talk about "computations": a computation is a set of executions
that are equivalent up to permutations of concurrent events
Local Clocks
A logical clock maps events to a partially ordered set (e.g. the integers)
such that
a ≺ b ⇒ C(a) < C(b)
Requires each process maintains a record of:
Its local logical clock, measuring the process's own progress
Its global logical clock, measuring its perception of global time
Lamport's Clock
Notated as LC(event)
Tracks local and global logical clocks in one variable
Let a be an event, and k be the clock value of the previous event at the
same process
If a is internal or send, LC(a) = k + 1
If a is receive, and b is the corresponding send event, LC(a) = max(k, LC(b)) + 1
Consistent with causality, but not strongly consistent
i.e. a ≺ b ⇒ LC(a) < LC(b), but LC(a) < LC(b) does not imply a ≺ b
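A minimal sketch of these update rules in Python (class and method names are
mine):

```python
class LamportClock:
    def __init__(self):
        self.time = 0  # clock value k of the most recent event

    def internal_or_send(self):
        # Internal and send events: LC(a) = k + 1
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # Receive events: LC(a) = max(k, LC(b)) + 1, where b is the send
        self.time = max(self.time, sender_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
stamp = p.internal_or_send()  # p sends a message stamped 1
q.receive(stamp)              # q's clock jumps past the sender's stamp
```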
Vector Clock
Each process keeps a vector, the same length as the number of processes
vᵢ[i] is the local logical clock
vᵢ[j] where i ≠ j, is process i's most recent knowledge of process j's
logical clock
Initialise all of vᵢ to 0
If a is internal or send
vᵢ[i] += 1
If a is receive, and m is the vector clock sent with a
vᵢ = max(vᵢ, m) component-wise; then vᵢ[i] += 1
Ordering
u = v ↔ ∀i. u[i] = v[i]
u ≤ v ↔ ∀i. u[i] ≤ v[i]
u < v ↔ u ≤ v && ∃i. u[i] < v[i]
u ∥ v ↔ !(u ≤ v) && !(v ≤ u) (u and v are concurrent)
Strongly consistent! i.e. VC(a) < VC(b) ⇒ a ≺ b also holds
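A sketch of the update and comparison rules (function names are mine):

```python
def new_clock(n):
    return [0] * n                     # one component per process

def internal_or_send(v, i):
    v[i] += 1                          # only the local component advances

def receive(v, i, m):
    for j in range(len(v)):            # component-wise max with the
        v[j] = max(v[j], m[j])         # sender's vector clock m
    v[i] += 1

def leq(u, v):
    return all(a <= b for a, b in zip(u, v))

def concurrent(u, v):                  # u ∥ v in the notation above
    return not leq(u, v) and not leq(v, u)
```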
Mutual Exclusion
Centralised
Works (guarantees mutual exclusion)
Easy to implement
Fair (in order of request)
No starvation (no node waits forever)
Only 3 messages per use of resource (ask, grant, release)
However, single point of failure, and a big bottleneck
Decentralised (Ricart-Agrawala)
To request a resource, send a timestamped request message to every other
process
When a process receives a request message:
If it is not using the resource and doesn't want it
Send OK back to the sender
If it currently has access
No reply, but queue the request
If it wants the resource but doesn't have it
Compare the requester's logical clock value with that of its own request
If the requester's is lower, send OK
If the requester's is higher, queue the request and send nothing
After sending its request, a process waits for an OK from everyone else; once
these arrive it can access the resource
When done with the resource, send OK to every queued request
Advantages: works, fair, no starvation
Disadvantages: 2(N - 1) messages per use, N points of failure, and every node
needs to keep track of every other node in the system; a sketch of the
receive-side logic follows
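A sketch of that receive-side logic, assuming each request carries a Lamport
timestamp and that process IDs break ties (the class and the `send_ok`
callback are illustrative):

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = "RELEASED"       # or "WANTED" / "HELD"
        self.my_request = None        # (timestamp, pid) of our own request
        self.deferred = []            # requesters to OK once we are done

    def request(self, timestamp):
        self.state = "WANTED"
        self.my_request = (timestamp, self.pid)
        # ...broadcast the request and wait for N - 1 OKs...

    def on_request(self, timestamp, sender, send_ok):
        theirs = (timestamp, sender)
        if self.state == "HELD" or (
            self.state == "WANTED" and self.my_request < theirs
        ):
            self.deferred.append(sender)  # our request has priority: queue it
        else:
            send_ok(sender)               # yield to the earlier request

    def on_release(self, send_ok):
        self.state = "RELEASED"
        for sender in self.deferred:      # unblock everyone we deferred
            send_ok(sender)
        self.deferred.clear()
```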
Snapshots
Allows saving program state so we can resume later
Allows returning to a previous state if things break
In a distributed system there is no global clock, so processes cannot all
save their state "at the same instant"
Messages may also be in flight, and this channel state is difficult to record
Recording local snapshots must be coordinated correctly to ensure a consistent
global snapshot
If each process takes a local snapshot:
An event is pre-snapshot if it occurs in a process before the local snapshot
Otherwise it is post-snapshot
A snapshot is consistent if
When a is pre-snapshot, x ≺ a implies x is pre-snapshot
A message is included in channel state if its sending is pre-snapshot and
its receiving is post-snapshot
Chandy-Lamport Algorithm
Applies to FIFO channel systems only
Send control messages called markers along channels to separate pre- and
post-snapshot events and trigger local snapshots
Initiator takes a local snapshot and sends a marker through all outgoing
channels
When process pₘ receives a marker along channel cₙₘ:
If pₘ has not yet saved its state
pₘ saves its local state
pₘ records the state of cₙₘ as {}
pₘ sends a marker through all outgoing channels
Else
pₘ records the state of cₙₘ as the set of all basic messages received on it
after pₘ saved its local state, and before it received the marker message
from pₙ
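A rough sketch of the marker rule, assuming FIFO channel objects with a
`send` method and that `p.local_state` starts as `None` (all names are mine;
the initiator simply calls `take_snapshot` directly):

```python
def take_snapshot(p):
    """Save local state and start recording every incoming channel."""
    p.local_state = p.save_state()
    p.channel_state = {c: [] for c in p.in_channels}
    p.recording = set(p.in_channels)
    for c in p.out_channels:
        c.send("MARKER")

def on_marker(p, channel):
    if p.local_state is None:
        take_snapshot(p)
    # The marker closes `channel`: its recorded state is now final (empty
    # if this very marker triggered the snapshot, since channels are FIFO).
    p.recording.discard(channel)

def on_basic_message(p, channel, msg):
    if p.local_state is not None and channel in p.recording:
        p.channel_state[channel].append(msg)  # message was in transit
    p.deliver(msg)
```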
Correctness
If a ≺ b and b is pre-snapshot then a is pre-snapshot
If a is the send and b the receive of the same message, at processes p and q
b pre-snapshot → q has not received a marker when b occurs
Since channels are FIFO, p has not sent a marker when a occurs
Hence, a is pre-snapshot
By transitivity, this argument extends along any chain of causally related
events
Message m between p, q is in channel state Cpq iff send of m at p is
pre-snapshot, and receive at q is post-snapshot
Forward direction
Assume m ∈ Cpq
q only records m after saving its local state (triggered by a marker on
another channel, or because q is the initiator), so the receive of m is
post-snapshot
q records m before receiving the marker from p; by FIFO order, p sent m
before sending its marker, i.e. before p saved its state, so the send of m
is pre-snapshot
Backwards direction
Assume the send of m is pre-snapshot and the receive of m is post-snapshot
Send of m pre-snapshot implies p saves its state after sending m, hence the
marker is sent down Cpq after m
And received in the same order, since channels are FIFO
Receive of m post-snapshot implies that q has already saved its state, on
receiving a marker from a different node (or because q is the initiator)
Since q has saved its local state but hasn't yet received the marker from
p, it records m in Cpq
Lai-Yang-Mattern Algorithm
Works on non-FIFO channels
Rather than having separate marker messages, attach a boolean flag to basic
messages
Typically described as a colour: white (pre-snapshot) or red (post-snapshot)
The original Lai-Yang algorithm needed no control messages, but required
keeping the entire message history
The Lai-Yang-Mattern variant uses control messages carrying message counts
instead
Algorithm
Every process is initialised to white
When a process saves its local state:
It turns red
It sends a control message on each outgoing channel saying how many white
messages it has sent down that channel
Every basic message is the same colour as the process that sends it
A white process can save its local state at any time
But it must do so no later than on receiving a red message, and before
processing that message
When a process receives a control message:
It saves its local state, if it hasn't already
It knows how many white messages it has received so far on that channel
The control message tells it how many white messages it must receive on that
channel in total
It waits for the remaining white messages
Each process computes the state of each incoming channel as the set of white
messages it receives on that channel after saving its local state; a sketch
follows
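A rough sketch of these rules (all names are mine; `sent_white[c]` counts
white basic messages sent on channel c, and the bookkeeping that counts
received white messages per channel to detect completion is elided):

```python
def save_state(p):
    p.colour = "red"
    p.local_state = p.save()
    p.channel_state = {c: [] for c in p.in_channels}
    for c in p.out_channels:
        c.send(("CONTROL", p.sent_white[c]))  # white messages sent on c

def on_basic(p, channel, colour, msg):
    if colour == "red" and p.colour == "white":
        save_state(p)                 # snapshot before processing a red msg
    if colour == "white" and p.colour == "red":
        p.channel_state[channel].append(msg)  # white msg crossed the snapshot
    p.deliver(msg)

def on_control(p, channel, sent_white):
    if p.colour == "white":
        save_state(p)
    # `channel`'s state is complete once `sent_white` white messages have
    # arrived on it in total.
    p.expected_white[channel] = sent_white
```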
Correctness
a ≺ b ∧ pre(b) → pre(a)
If a and b occur in the same process, trivially true
If a := send(m) and b := recv(m):
pre(b) → white(b)
white(b) → white(a)
white(a) → pre(a)
m ∈ Cpq → pre(send(m)) ∧ post(recv(m))
m ∈ Cpq → white(m) → pre(send(m)), by the colouring rules; and m is only
recorded after q saves its state, so recv(m) is post-snapshot
If two nodes initiate the snapshot concurrently, both turn red and their
local snapshots combine into the same consistent global snapshot
Wave Algorithms
Sends a request through the network to gather information
Can be used for: termination detection, routing, leader election, transaction
commit voting
A wave algorithm must satisfy three conditions:
It is finite (it terminates)
It contains one or more decide events
∀ a ∈ D, p ∈ P. ∃ b ∈ p.E. b ≺ a where
D is the set of decide events
P is the set of processes
p.E is the set of events in process p
i.e. every decide event is causally preceded by an event in every process,
so every process participates in the wave
Traversal Algorithm
An initiator sends a token to visit each process
The token may collect/distribute information on the way
The token returns to the initiator
The initiator makes the decision
Tarry's Algorithm
Traversal algorithm for undirected networks
Two main rules:
A process never forwards the token through the same channel twice
A process only forwards the token to its parent when there is no other
option
A process's parent is the process from which it first received the token
Performance
Number of messages: 2E
Time to complete: 2E
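A sequential simulation of the token's walk, for an undirected network given
as an adjacency map (function and variable names are mine):

```python
def tarry(graph, initiator):
    used = set()                      # channel-directions the token has used
    parent = {initiator: None}
    u = initiator
    while True:
        # Rule 2: prefer any unused channel that does not lead to the parent.
        options = [v for v in graph[u]
                   if (u, v) not in used and v != parent[u]]
        if not options and parent[u] is not None:
            if (u, parent[u]) not in used:
                options = [parent[u]]  # no other option left: go to parent
        if not options:
            return parent              # token is back at the initiator: done
        v = options[0]
        used.add((u, v))               # Rule 1: never reuse this direction
        if v not in parent:
            parent[v] = u              # first holder to reach v is its parent
        u = v

# Example: a triangle network; returns each process's parent in the
# resulting spanning tree.
tree = tarry({"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}, "a")
```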
Correctness
Token ends up at initiator
Token never sent through the same channel in same direction twice
When a non-initiator holds the token, it has received it one more time than
it has sent it
Hence there is always a channel it has not sent it down yet, so can always
send it on
Unless it is the initiator, in which case the algorithm ends
The token travels through each channel in each direction once
Do this through reductio ad absurdum
Assume that at termination, a channel has not been traversed by token in
both directions
Let p be the (a) earliest visited (b) non-initiator process (c) that has a
channel not traversed in the outgoing direction
Let s be the parent of p; by the choice of p, all of s's channels have been
traversed
As s is p's parent, p must have sent the token back to s
But p only sends to its parent when there is no other option, so p must have
sent the token down all of its channels
Contradiction!
Depth first search
Token is forwarded to a process that has not yet held the token, in preference
to one that has
Means that frond edges will only connect ancestors/descendants
We can make Tarry's algorithm DFS by adding a rule:
If rules 1 & 2 allow it, send the token back down the channel it arrived on,
as soon as you receive it
We can use this to reduce the number of messages:
Let the token carry the list of all processes it has visited (at the cost of
larger messages)
Avoid forwarding the token along frond edges, i.e. to processes that have
already held it
The token then only travels along spanning tree edges, so 2E → 2N - 2
Echo Algorithm
Wave, but not traversal algorithm
Centralised
Undirected networks
Outline:
Initiator sends message to all neighbours
When a non-initiator receives its first message:
It makes the sender its parent
It sends the message to all neighbours except its parent
When a non-initiator has received messages from all its neighbours:
It sends a message to its parent
When the initiator has received messages from all its neighbours, the
algorithm terminates
This builds a spanning tree
Number of messages: 2E
Worst case time to complete: 2N - 2
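A sequential sketch that reproduces one possible execution of the echo
algorithm and the spanning tree it builds (names are mine; real executions
interleave the waves concurrently):

```python
def echo(graph, initiator):
    parent = {initiator: None}

    def wave(u):
        for v in graph[u]:
            if v not in parent:
                parent[v] = u     # v's first message came from u: its parent
                wave(v)           # v forwards to all neighbours except u
        # At this point all of u's neighbours have replied, so u would echo
        # back to its parent.

    wave(initiator)
    return parent                 # the spanning tree that was built
```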
Deadlocks
Process stuck in infinite wait
Communication deadlock
Cycle of processes, each waiting for the next to send a message
Resource deadlock
Cycle of processes waiting for a resource held by the next process
Each process in the cycle may be waiting on a different resource
Dealing with Deadlocks
Make deadlocks impossible by designing protocols with this in mind
Only obtain resource if global state ensures it is safe
Detect deadlocks, and break the chain when they occur
Waits-For Graph (WFG)
Directed graph
Nodes are processes
Edge from p to q means that p is waiting for q to respond
If there's a cycle in the WFG, then a deadlock has occurred (in the simpler
models below)
Single-resource model
Process can only have one outstanding request for a resource
Cycle in WFG means deadlock
Simplest model
AND model
A process can request multiple resources simultaneously, and needs all of
them to unblock
Cycle in WFG means deadlock
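For the single-resource and AND models, detection therefore reduces to
finding a cycle in the WFG; a minimal sketch over an adjacency map:

```python
def has_cycle(wfg):
    visited, on_stack = set(), set()

    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in wfg.get(u, ()):
            if v in on_stack or (v not in visited and dfs(v)):
                return True          # back edge: a waits-for cycle exists
        on_stack.discard(u)
        return False

    return any(dfs(u) for u in wfg if u not in visited)

# p waits for q, q waits for r, r waits for p: deadlock.
assert has_cycle({"p": ["q"], "q": ["r"], "r": ["p"]})
```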
OR model
A process can request multiple resources simultaneously, and needs only one
of them to unblock
Cycle in WFG does not mean deadlock
Knot in WFG means deadlock
A knot is a set of vertices such that every vertex u reachable from a knot
vertex v can also reach v
AND-OR model
Generalises AND model and OR model
No simple graph structure for detecting deadlocks; the model is used because
it is more expressive, and deadlock is instead detected algorithmically
(e.g. by Bracha-Toueg, below)
p-out-of-q model
Equivalent to AND-OR model
A process requests q resources and needs p of them granted to unblock
Unrestricted model
Problems
We need to maintain the WFG
We need to find cycles/knots in WFG
A deadlock detection algorithm must guarantee:
Progress
All existing deadlocks must be found in finite time
Safety
Must not report deadlocks that do not exist
WFGs Continued
There is a node v for each process in the network
Nodes can be active or blocked
Active node can make n-out-of-m requests of other nodes and then becomes
blocked, or grant requests to other nodes
A blocked node can not make or grant requests, but becomes active again once
n of its m requests have been granted
When a blocked node gets n-out-of-m requests granted, it purges the remaining
m - n requests
When node a receives a request from node b, there is a dependency b → a
When a grants b's request, the dependency is replaced by a → b until b
releases the resource
Distributed WFG
Do not wish to centralise
Each node retains information about local part of WFG
Distributed deadlock detection algorithm invoked by initiator
Each node u has a set of variables:
OUTᵤ: Set of nodes that u has sent ungranted requests to
INᵤ: Set of nodes that u has received requests from
nᵤ: Number of grants that u needs to receive before it becomes unblocked
0 ≤ nᵤ ≤ |OUTᵤ|
nᵤ = 0 ⇒ OUTᵤ = {}
Bracha-Toueg Deadlock Detection Algorithm
Idea: simulate granting of grantable requests and check if initiator node is
unblocked
Variations:
Network with instant messages, base algorithm is static during deadlock
detection
Needs INᵤ, OUTᵤ, nᵤ to be precalculated from the local state and
channel states of a globally consistent snapshot
Network with time delays in message delivery, base algorithm is static
Relaxes the need for the channel states to be used
Network with time delays, base algorithm is dynamic
Relaxes the need for the global snapshot to be precalculated, i.e.
integrates taking snapshot with deadlock detection
We will only look at variation 1
Builds spanning trees by effectively running two nested echo algorithms
The first spanning tree is rooted at the initiator (using notify/done
messages)
Nested spanning trees are rooted at each active node (using grant/ack
messages)
```
def notify(u):
    """Traverse the network, building a spanning tree. If we find a node
    that is not waiting on any resources (i.e. `u.n == 0`), call `grant()`
    on it."""
    u.notified = True
    for w in u.out:
        w ! NOTIFY          # `!` denotes sending a message
    if u.n == 0:
        grant(u)
    for w in u.out:
        w ? DONE            # `?` denotes a blocking receive

def grant(u):
    """Grant the resource held by this node."""
    u.free = True
    for w in u.in:
        w ! GRANT
    for w in u.in:
        w ? ACK

def receive(u, message):
    if message == NOTIFY:
        if not u.notified:
            notify(u)
        return DONE
    elif message == GRANT:
        if u.n > 0:
            u.n -= 1
            if u.n == 0:    # all required grants received: u is unblocked
                grant(u)
        return ACK
```
The algorithm reports whether the WFG is deadlocked at the time of the
snapshot, not whether a deadlock will occur later