-
Notifications
You must be signed in to change notification settings - Fork 2
replication algo
There is a mistake: FastAccept has to check final_deps, because execution-linearizability requires new instance have to include all preceding instance.
- Remove infinite strongly-connected-components and livelock.
- Remove
seq
. - Remove
defer
during recovery.
Major changes from epaxos:
-
When updating
deps
, setdeps
to accumulated deps, which includes all reachable instances by walking through the dep-graph. -
When updating
deps
for FastAccept ofa
, addx
intoa.deps
only whenx
does not knowa
:x < a
-
initial_deps
is useless and removed: recovery does not needinitial_deps
. onlydeps
. -
Instances by a same leader has a strong depends-on relation. A later instance always depends on a former one. This is guaranteed by handling FastAccept request sequentially.
-
Use the all-committed constrain. all-initial-value and only-to-quorum is not used.
-
R0
,R1
... orR[0]
,R[1]
... : replica. -
a
,b
...x
,y
... : instance. -
La
,Lb
: is the leader replica of an instancea
orb
. -
F
: number of max allowed failed replica that. -
n
: number of replicas,n = 2F+1
. -
Qc
: classic quorum:Qc = F+1
. -
Qf
: fast quorum:Qf = F+⌊Qc/2⌋ = F + ⌊(F+1)/2⌋
. -
a₀
: initial value of instancea
. -
a₁ⁱ
: updated instancea
byR[i]
when it is forwarded to replicaR[i]
. -
a₂
: value of instancea
some relica believes to be safe. -
→
: depends-on:a → b
meansa
knows ofb
. -
<
: do-not-know:a < b
meansa
does not know ofb
.
An instance is an internal representation of a client request.
Two essential fields are:
-
cmds
the commands a client wants to execute. -
deps
what other instance it sees before being committed. This field is to determine the instance execution order.
type InstanceID(ReplicaID, i64)
type Instance {
deps: Vec<InstanceID>;
committed: bool;
executed: bool;
cmds: Vec<Commands>;
ballot: BallotNum;
}
a.deps
: is instance id set of all instances that directly or indirectly interfering with a
.
E.g., a ~ b ~ c
but a ≁ c
, a.deps
has c
in it because c
indirectly
interfering with a
through b
.
Describe deps
:
a.direct_deps
are the highest instances a
directcly interferes.
TODO proof that we only need to update from directly interfering instances.
On implementation, a.deps
is split into N
subset, where N
is number of replicas.
Every subset contains only instances from leader Ri
:
a.deps[Ri] = {x | x.replicaID == Ri and a → x}
.
And a.deps[Ri]
records only the max instance id.
When updating deps
for FastAccept of a
,
add x
into a.deps
only when x
does not know a
: x < a
.
i.e., if x → a
, then a < x
.
On all replicas that there are both a
and x
,
FastAccept status a, x
can not have a < x
and x < a
.
If a < x
because of an Accepted x
, which means there is another replica on
which x → a
, that conflict with our assumption that intersection.
∴ a < b
always hold if a, b
is initiated by a same leader and a ~ b
.
The entire instance space is a 3d array:
R[i][j][idx]
Fields:
- i: replicaID.
- j: replicaID.
- idx: index of a instance.
Explain:
-
R[i]
: all data on replicai
-
R[i][j]
: instances initiated by replicaj
those are stored on replicai
. -
R[i][j][idx]
:idx
-th instance initiated by replicaj
.
| |
| |
| |
| c f c f c f |
| a b e a b e a b e |
| ---------------- ---------------- ---------------- |
| leader: [0] [1] [2] [0] [1] [2] [0] [1] [2] |
| ================ ================ ================ |
| replica: R[0] R[1] R[2] |
We may write R[0]
as R0
for short.
-
Initially, there are 3 instances
x, y, z
.When
a
is initiated onR0
,R0
believes it knows of all others:a₀ → {x, y, z}
.When
b
is initiated onR1
,R1
believes it knows of all others:b₀ → {x, y, z}
.When
c
is initiated onR2
,R2
believes it knows of all others:c₀ → {x, y, z}
.When
d
is initiated onR0
,R0
believes it knows of all others:d₀ → {a, x, y, z}
.d ↓ a b c x y z x y z x y z ----- ----- ----- R0 R1 R2
When d
is replicated to R1
,
R1
believes that d₁¹ → {a, b, x, y, z}
.
d₁¹
got a new relation d₁¹ → b
:
d d
↓ ↘
a b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
Then c
is replicated to R1
,
R1
believes that c₁¹ → {d, a, b, x, y, z}
.
c₁¹
got three new relations c₁¹ → {b, d, a}
(
because R1
believes d → a
thus c₁¹ → a
):
.c
↙ |
d d |
↓ ↘ ↙
a b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
Then a
is replicated to R1
,
R1
believes that a₁¹ → {b, x, y, z}
.
a₁¹
got only one new relation a₁¹ → b
:
R1
already believes d₀ → a
because it had received d₀
from R0
.
c₁¹ → d
thus c₁¹ → a
.
.c
↙ |
d d |
↓ ↓↘ ↙
a a→b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
We see that different replicas have their own view of instance relations.
The action commit is to broadcast to all replica about what value is safe.
Some value(e.g. an instance or a relation or something else)
is safe if:
it has been forwarded to enough replicas and constituted a quorum(Qf
or Qc
)
so that no process(command leader or recovery process) would never choose other
value for it to commit.
a
is safe if every a.deps[Ri]
is safe.
a₁¹ a₁² a₁³
| | ↓
| | c₁³
↓ ↓ ↓
a₀ b₀ c₀ b₀
--- --- --- --- ---
R0 R1 R2 R3 R4
a₁¹.deps = {b}
a₁².deps = {c}
a₁³.deps = {b, c}
Thus a.deps = {b, c}
can be committed.
In this algorithm we need to ensure two things to be safe, before committing it:
-
What to execute:
a.cmds
.To commit
a.cmds
, forwards it toQc=F+1
replicas, becausea.cmds
never changes. -
and when to execute:
a.deps
.a.deps
have different values on different replicas. Thus it requiresQf
replicas to have the identical value to be safe.
Since a.deps
has n
indepedent fields:
a.deps = {
0: x,
1: y,
...
}
-
If all
a.deps[Ri]
is safe,a
is safe. Then leader commit it on fast-path. -
Otherwise if any of
a.deps[Ri]
is not safe, run another round of Accept to make it safe(slow path).
Conditions must be sastisified to commit on fast-path:
-
For every updated
a.deps[i] == x
, the leader received at least one reply with committedx
TODO remove this: and
x < a
. -
a.deps[i] == x
constitutes a fast-quorum.
These two condition guarantees that x
will never depends on a
.
This is necessary to recover a fast-committed instance.
TODO obviously
There is only one value could be chosen to be safe.
∴ finally an instance is committed the same value on all replias.
∴ All replicas have the same set of instances.
Leader:
-
Initiate instance
a
build
a.deps
:a.deps = {a} for x in local_instances: if a ~ x: a.deps ∪= x.deps
-
FastAccept: forward
a
to other replicas. -
Handle-FastAcceptReply
Update
a.deps
:a.commitDeps = []; for i in 0..n: values = {a.deps[i] for a in all_replies} for v in values: if (count(v, all_replies) >= fast_quorum and v is committed): a.commitDeps[i] = v break if a.commitDeps[i] is NULL: quit fast-path. commit(a)
Non-leader replicas:
-
Handle-FastAccept
TODO need proof of linearizability with this. TODO explain why this is efficient reducing conflict.
update
a.deps'
committed flag are ignored in this pseudo code for clarity
La = leaderOf(a) // on other replica there may be more deps for x in a.deps: if x < a: a.deps ∪= x.deps // check instances not included in a.deps for x in local_instances - a.deps: if a ~ x: if x < a: a.deps ∪= x.deps // update committed flag for x in a.deps: committed[leaderOf(x)] = x.isCommitted reply(a, committed)
Leader:
-
Choose
a.deps
-
Send Accept to replicas
-
Handle AcceptReply
Non-leader replicas:
- Handle Accept
Just commit.
-
All request messages have 3 common fields:
-
req_type
identify type: FastAccept, Accept, Commit or Prepare. -
ballot
is the ballot number,- For FastAccept it is always
0
. - Fast path Accept ballot is
1
. - Slow path Accept ballot is
2
or greater. -
ballot
in Commit message is useless. -
ballot
in a Prepare is chosen by recovery process and should be>2
.
- For FastAccept it is always
-
instance_id
is the instance id this request for.
-
-
All reply messages have 3 common fields:
-
req_type
. -
last_ballot
is the ballot number before processing the request. -
instance_id
.
-
TODO Changes:
deps_committed
is useless in fast-accept:
Without deps_committed
no fast-commit will be delayed.
To fast-commit a > x
:
If x
is slow-committed, an Accept status x
will be seen.
Thus a
can be fast-committed.
If x
is fast-committed:
-
If
a
reachedLx
, thena
know ifx
is committed, becauseLx
is the first to commit. Although there is chancex
is committed aftera
reachesLx
,Lx
broadcastsx is committed
very likely earlier than another instance bringsx is committed
through its fast-accept request. -
If
a
did not reachLx
, thena
must have reachedg - {La, Lx}
, this prevent other value ofa > y
to commit. ∴a > x
is safe to fast commit.
-
cmds
: the commands to run. -
deps
: the deps when leader initiate the instance.
-
deps
: udpated deps by a replica. -
deps_committed
: a vector of committed flag of every instance indeps
TODO: to fast-commita > x
,x is accepted
is also an acceptable condition. use another field to describe status ofx
. maybex.ballot
.
-
final_deps
: the deps chosen by leader or recovery process.
Nothing except the common fileds.
-
cmds
: the commands to run. -
final_deps
: the deps chosen by leader or recovery process.
Nothing except the common fileds.
Nothing except the common fileds.
-
committed
is the committed flag of the instance on a replica.
On a replica, a.deps
represents local relation between a
and other instances
on this replia. There is not circle, e.g.: a > b > c > a
on a replica.
But a.final_deps
could have.
E.g. initially:
a₀ b₀
----- -----
R0 R1
R0
and R1
forward a
and b
to the other:
b₁⁰ a₁¹
↙ ↘
a₀ b₀
----- -----
R0 R1
Finaly a.final_deps = {b}
,
b.final_deps = {a}
.
For this reason execution order is determined not by relation >
,
but by comparing the vecotr a.final_deps
.
TODO
Execution consistency:
If two interfering commands a
and b
are successfully committed,
they will be executed in the same order by every replica.
Every instance must be executed in finite time, e.g. no livelock with a SCC.
Execution linearizability:
If two instances are serialized by client(a
is proposed only after b
is
committed by any replica), then every replica will execute b
before a
.
TODO Def-after
Order is defined as:
-
a.deps ⊃ b.deps
: execa
afterb
. From Def-after, ifa.deps ⊃ b.deps
, executea
afterb
guarantees linearizability. -
Otherwise: exec
a
andb
in instance id order.
If a
is initiated after b
became safe,
b
will never see a
, and a
will definitely see b
.
Then b
will be added into a.deps
.
Thus a.deps ⊃ b.deps
TODO what if see a accepted
∴ execute a
after b
will never break guarantees.
Guarantees:
In the following digram, a.deps ⊃ d.deps
thus a
should be executed after
d
.
b
and e
interferes but b.deps ⊅ e.deps
or e.deps ⊅ b.deps
.
They could be executed in any order that is identical on every replica.
a b
↘ ↙ ~
c ~ d ~ e
↘ ↙ ↘
f g
One general exec algo is by walking the depends-on graph, remove some edges to reduce the graph to a DAG and instances have determined order to execute.
See other doc TODO
Another exec algo is much simpler to proof correctness but requires additional constrains: for instances on a leader, a newer instance must be executed after an older instance.
Our replication algo One of the constrain must be applied to replication:
- A replica must handle older instance before handling a newer one.
- Or the
deps
of an older instance must not includedeps
of the newer instance, when handling FastAccept.
Either one of the above guarantees newer.deps ⊃ older.deps
.
See other doc TODO
Assumes:
- The instance to recover is
a
. - The leader of
a
La
isR0
- The recovery process is
P
(P != R0
).
After Preparing on a quorum(F+1
):
-
If
P
sawR0
, exit and wait forR0
to commita
. -
If
P
saw a committeda
: broadcast and quit. -
If
P
sawa
withballot>0
: run classic paxos with this value and quit.TODO explain ballot
∴ P
only need to recover if all of a
it saw are in FastAccept phase.
Recovery is to choose a value of a.deps
that could have been committed on
fast-path.
P
tries to choose a value for a.deps[0]
, a.deps[1]
... one by one.
First we start to recover a.deps[1]
.
a.deps[La]
is will never change thus do not need to recover it.
After Prepare on a quorum,
P
could see different values of a.deps[1]
(x
, y
...) from different replicas.
Assumes that x > y
and leader of x
, y
is R1
.
- Define
Nx
to be the the number of PrepareReply witha.deps[1] == x
. - Define
Ny
to be the the number of PrepareReply witha.deps[1] == y
. - ...
As the following diagram shows:
x ... a.deps[1]=x a.deps[1]=y
y
a z
--- --- ... --- ---
R0 R1 ... R2
E.g.:
x | a→x a a
↓ | ↓ | |
a y | y `→y `→y
--- --- | --- --- ---
R0 R1 | R2 R3 R4
La Lb
down down
x
is not SlowCommit-ed.
If x
is FastCommit-ed, x → a
.
∴ From FP-condition, a → x
is not FastCommit-ed.
In this situation, both a.deps[1] == x
and a.deps[1] == y
could have been committed
on fast-path.
∵ a.deps[1] = x
has been seen.
∴ Initially, a < x
is not on R1
and it will never be.
But
R1
could havea → x
on it.
∵ a.deps[1] = y
has been seen.
∴ Initially, a → x
is not on R0
and it will never be.
But
R0
could havea → y
on it.
There are F + Nx
replicas
accepted or could have accepted a.deps[1] == x
.
F
unreached replicas plus Nx
.
There are F -1 + Ny
replicas
accepted or could have accepted a.deps[1] == y
.
F
unreached replicas except R1
plus Ny
.
If both of these two constituted fast-quorum(F + ⌊(F+1)/2⌋)
), it requires:
Nx >= ⌊(F+1)/2⌋
Ny >= ⌊(F+1)/2⌋ + 1
Nx + Ny <= F + 1
Thus if F=2k
there could be two value possibly committed on fast-path:
Nx = ⌊(F+1)/2⌋ = k
Ny = ⌊(F+1/2)⌋ + 1 = k + 1
E.g.:
-
R0, R1, R2
could have constituted a fast-quorum fora.deps[1] = x
. -
R0, R3, R4
could have constituted a fast-quorum fora.deps[1] = y
.
P
needs to eliminate one of them
-
a
did not see a fast-committedx
If a.deps[1] == x
is committed,
by fast-commit requirements,
the committed x.deps
must not contain a
(x < a
)
x
could only have been committed on fast path with x > a
:
∵ Nx = ⌊(F+1)/2⌋
.
∴ There are at most F - 1 + Nx = F + ⌊(F+1)/2⌋ - 1 < F + ⌊(F+1)/2⌋
replicas accepts x < a
in FastAccept phase:
F
unreached replicas exceptR0
, plusNx
.
∴ x < a
can not constitute a fast-quorum.
∴ If x
is fast-committed, it must be committed with x > a
.
After Prepare on x
.
Get the value of x
that is accepted with the latest ballot,
or the value of committed x
:
Choose a.deps[1] == x
if: the value of x is NOT nil and x < a
.
Otherwise continue try checking a.deps[1] == y
.
-
If the value is NOT nil,
x
could have been committed on slow-path.-
If
x > a
, From FP-condition,a.deps[1] == x
could not have been committed on fast-path.Discard
a.deps[1] == x
, try other value ofa.deps[1]
. -
If
x < a
, there is a at least classic-quorum on whichx
came beforea
.∴
x
must be ina.deps
whena
is committed.∴
a.deps[1] == x
is the only possible value to commit.
-
-
If the value is nil,
x
is not committed on slow-path.Assumes
x
is committed on fast-path:From Lemma-x-fast-gt-a,
x > a
must be committed.∴ From FP-condition,
a.deps[1] == x
can NOT be committed on fast-path.∴ Discard
a.deps[1] == x
, try other value ofa.deps[1]
.
Continue checking if a.deps[1] == y
can be committed on fast-path, and so on.
If no value of a.deps[1]
could have been committed, use the highest instnace
id.
Choose the only value to commit a.deps[1]
.
E.g.:
| x a→x a |
| ↓ ↓ | |
a | y y '→y |
--- | --- --- --- | ---
R0 | R1 R2 R3 | R4
La Lb
down down
If x
is committed,
-
If
x < a
, then onlya.deps[1] == x
can be committed. Because ifa.deps[1] == y
is committed,a
andx
has no relation, which is impossible. -
If
x > a
, thena.deps[1] == x
can not be committed. Because onR2
x
does not havex > a
, this is not a committed value ofx
.
If x
is not committed, wait until it is committed.
Continue repeat these step on a.deps[1] == y
to choose a value.
Prepare for a
retrieve instance a
on every replicas in a classic-quorum,
along with the dependent instance.
:
from R0: a.deps = [x, y, z, ...]; x = (accepted, x.deps=...), y = ...
from R1: a.deps = [u, v, w, ...]; w = (fast, u.deps=...), v = ...
...
collect all values of a.dpes[1]
: [x, y, z]
sort them in top-down order,
check if there is one that could have been fast-committed by La
:
for x
,
- if
Lx
is in quorum, wait for Lx to commit x. TODO does not need to wait. - otherwise, choose
a.deps[1] == x
if: the value of x is NOT nil andx < a
.
If there is such an x
, choose this a.deps[1] = x
.
Otherwise choose the first.
Send Accept requests and commit.