-
Notifications
You must be signed in to change notification settings - Fork 2
replication dep view
2020 Aug 11
TODO change: relation between a and b: the order between a and b
Major simplifications with comparison to epaxos:
-
Removed livelock issue with infinite SCC from execution algo.
-
Removed
seq
thus recovery is significantly simplified. Addknows
attribute to determine execution order.knows
is a more generalized concept thanseq
and can be specialized toseq
. -
Accept-ed, Commit-ed status now are per-dep field.
-
deps
now contains instance id and all instances it knows. -
Get rid of
defered recovery
from recovery. Instead, one Prepare+Accept is quite enough. -
Simplify the proof of correctness.
-
Recovery is the as simple as a Prepare and standard replication protocol.
-
When updating
deps
for FastAccept ofa
, addx
intoa.deps
only whenx
does not depends-ona
, i.e.,x < a
. This reduces some of the unnecessary cycle of depends-on. -
Instances by a same leader has a depends-on relation. A later instance always depends on a former one. This is guaranteed by handling FastAccept request sequentially.
-
Ballot
is a replica attribute, instead of an instance attribute.
-
R0
,R1
... orR[0]
,R[1]
... : replica. -
a
,b
...x
,y
... : instance. -
a₀
: the value ofa
whena
is initiated on its leader. -
L(a)
,La
,L(b)
,Lb
: is the leader replica of an instancea
orb
. -
f
: number of max allowed failed replica that. -
n
: number of replicas,n = 2f+1
. -
Qc
: classic quorum:|Qc| = f+1
. -
Qf
: quorum of fast round:|Qf| = f+⌊|Qc|/2⌋ = f+⌊(f+1)/2⌋ = ⌊(3f+1)/2⌋ = ⌊(3n-1)/4⌋
-
a ~ b
: interfere:a
andb
can not exchange execution order in any instance(command) sequence. -
a → b
: depends-on:a
interferes withb
and has seenb
. -
a < b
: do-not-know:a
did not seeb
. -
a ↦ b
: exec-depends-on:a
execute afterb
.
-
G-consistency: two interfering instance
a, b
must have at least one relation established. -
G-exec-linearizability
-
Safety
The entire instance space is a 3d array:
R[i][j][idx]
Fields:
- i: replicaID.
- j: replicaID.
- idx: index of a instance.
Explain:
-
R[i]
: all data on replicai
-
R[i][j]
: instances initiated by replicaj
those are stored on replicai
. -
R[i][j][idx]
:idx
-th instance initiated by replicaj
.
| |
| |
| |
| c f c f c f |
| a b e a b e a b e |
| ---------------- ---------------- ---------------- |
| leader: [0] [1] [2] [0] [1] [2] [0] [1] [2] |
| ================ ================ ================ |
| replica: R[0] R[1] R[2] |
Ballot is same as the ballot in paxos, except it is a vector and is used to protect a relation between two instance.
A replica has a attribute term
, and a ballot is vector of the two leaders of a
relation to decide:
blt = {Ra: Ra.term, Rb:Rb.term}
.
In this way, operations on relation of a, b
are serialized.
A replica remembers term
s of all replicas, playing the same role of last_blt
in paxos: Ri.last_blt = {R0: R0.term, R1: R1.term, ...}
.
If the request blt
is not greater-or-equal the replica last_blt
, the reqeust
is declined, i.e., a request will be served if blt[a] >= last_blt[a] and blt[b] >= last_blt[b]
.
-
blt
does not decline a FastAccept request, because FastAccept does not change anything, it just reads an unknown value. TODO.
Ballot is used to identify a leader, just in the way paxos does.
A recovery process Prepare with a higher blt
to become the new leader of a
replica.
A replica seizes the leadership the same way.
E.g.:
Initially La
Lb
have ballot 1, 2
, FastAccept and Accept leader use this
ballot to run.
If a recovery process starts to recover b
, it increment ballot of Lb to 3.
Then run recovery with ballot 1, 3
.
(1, 3)
type Ballot HashMap<ReplicaID, i64>;
An instance is the internal representation of a client request.
type InstanceID(ReplicaID, i64)
type Knows HashMap<InstanceID>
type Dep {
id: InstanceID;
status: FastAccepted | Accepted | Committed,
knows: Knows;
}
type Instance {
id: InstanceID;
cmds: Vec<Commands>;
knows: Knows;
deps: HashMap<InstanceID, Dep>;
}
-
id
: is the instance id. -
cmds
is the commands a client wants to execute. -
knows
is all instances an instance knows and is used to determines order between instances when there is cycle of depends-on.a.knows = {a} ∪ a.deps[0].knows ∪ a.deps[1].knows ∪ ...
-
deps
is the set of intefering instances thata
depends-on:a.deps = {x : a → x}
with additional paxos related info.Two dependencies are different if:
x.id != y.id or x.knows != y.nows
Instances proposed by a same leader have a natural depends-on relationship: A newer instance depends-on all preceding ones.
This serializes execution of instances of a same leader but doesnot affect the out-of-order commit feature.
TODO move down
There is only one value could be chosen to be safe.
∴ finally an instance is committed the same value on all replias.
∴ All replicas have the same set of instances and relations.
A value of something is safe if:
it has been forwarded to enough replicas and constituted a quorum(Qf
or Qc
)
so that no other process(command leader or recovery process) would choose other
value for it to commit.
The action commit is to broadcast to all replica about what value is safe.
With this protocol we need to ensure two info to be safe,
before committing an instance a
:
-
What to execute:
a.cmds
.To commit
a.cmds
, forwards it to aQc
, becausea.cmds
never changes. -
and the orders between
a
and all other interfering instances.
For the order of two interfering instances a ~ b
,
there are 3 legal relation pairs:
a → b and b < a: a depends-on b and b doesnot knows of a;
a → b and b → a: a b depends-on each other;
a < b and b → a: a doesnot knows of b and b depends-on a;
a < b and b < a
is not allowed:
From G-exec-linearizability,
two interfering instance must have at
least one relation established.
Otherwise a
and b
can be executed in arbitrary order.
A leader is the process that initiates an instance.
La
initiates a
then tries to commit one of a → b
or a < b
.
Lb
initiates b
then tries to commit one of b → a
or b < a
.
Round-1, aka the FastPath, La
reads the relation stored on every replica.
If a value that is safe is observed, FastCommit it.
Otherwise, run into Round-2, aka, the SlowPath, follow the paxos protocol,
Accept then Commit.
Initially a replica stores relationship for every pairs of instance, but in a unknown status.
The a
read it first, a replica decides the order is a < b
.
FastAccept only read the value on a replica,
If a replica doesnot have a
or b
on it, the first arived decide the order:
e.g., if a
arives first, the replica decides a < b
.
Otherwise the replica returns the Commit-ed order.
If a value is FastCommit-ed, i.e., committed with one round, The protocol must guarantees recovery always choose this value.
∴ we have:
The chosen value for FastCommit must be seen by any other quorum, i.e.,
A FastQuorum has intersection with every Qc
.
The value in a FastQuorum must be all the same to FastCommit:
Proof:
If a Qc
can choose a value v
by seeing some v
and some other different value, then the other values are not needed.
∴ all values in FastQuorum are all the same.
If two interfering instances a ~ b
are both FastCommit-ed,
the value of b
can be determined from the value of a
: a < b ⇔ b → a
.
Proof:
a ~ b
are both FastCommit-ed,
from Def-basic-idea, TODO
the FastAccpet request respect the existent relation.
Thus there is at least one replica on which the order is a → b
and b < a
.
∴ From Lemma-FastCommit-all-equal,
that a
is FastCommit-ed with a → b
implies that b
is FastCommit-ed with b < a
and vice versa.
TODO define recovery goals.
TODO define recovery-safety.
TODO define recovery need only a Qc
.
If a recovery process reached either of leader of a
(La
) or leader of b
(Lb
),
it is able to find the relation between a, b
from the leader.
From Lemma-FastCommit-entanglement,
We only need to recover an order when it does not reach either of La
and Lb
.
This places demands on the definition of FastQuorum:
A FastQuorum is a set of replicas to which La
forwarded that guarantees
recovery-safety.
A FastQuorum requires: |Qf'(a, b)| > |Qc|/2 + |Qc| - 3
,
where Qf'(a, b) = Qf(a) \ {La, Lb}
, i.e., Qf(a)
without leader of a
or
b
.
Proof:
Assumes
La
and Lb
have chosen their FastQuorum as Qf(a)
and Qf(b)
.
Recovery process has chosen a classic quorum Qc
for recovery.
From Recovery-without-leaders, {La, Lb} ∩ Qc = ø
.
To satisfy FastCommit-guarantee, in a Qc
, only
one of a < b
and b < a
is allowed to constitute a FastQuorum.
Qf(a), Qf(b)
must satisfies:
Qf'(a, b) ∩ Qc
and Qf'(b, a) ∩ Qc
must have intersection.
I.e., |Qf'(a, b) ∩ Qc| > |Qc|/2
for any Qc
.
∴ |Qf'(a, b) ∩ Qc| > |Qc|/2
∴ |Qf'(a, b)| - (n-2-|Qc|) > |Qc|/2
∴ |Qf'(a, b)| > |Qc|/2 + |Qc| - 3
Examples:
- For 5 replicas,
Qc=3
,|Qf'(a, b)|=2
. - For 7 replicas,
Qc=4
,|Qf'(a, b)|=4
. - For 9 replicas,
Qc=5
,|Qf'(a, b)|=5
.
If b
is committed with b < a
, then a can only be FastCommit-ed with a → b
,
because a
must see a a → b
.
If b
is committed with b → a
, then only a < b
will be FastCommit-ed.
because a → b
can be committed only when b
is committed.
From TODO.
a a→b a→b a<b b
--------------
------------
La Lb
If a is FastCommit-ed with a→b, Lb still can SlowCommit b→a
La != Lb
, otherwise no distributed algo is required.
-
La
sends FastAccept request to aQf(a)
. So doesLb
. -
If
La
received identical replies fromQf(a)
, e.g.,a → b
ora < b
, FastCommit it. -
If incompatible order is seen, e.g.,
a → b
anda < b
, run into Accept.
From Lemma-entanglement-with-committed,
If Accept-ed b → a
is seen, b < a
is not FastCommit-ed.
A SlowCommit-ed b < a
must have constituted a Qc
thus only a → b
can be committed.
TODO:
b → a
can be SlowCommit-ed.
From Lemma-entanglement-with-committed,
If Commit-ed b < a
is seen, La
can just FastCommit a → b
.
Because two FastCommit-ed value is entangled from FastCommit-entanglement.
and SlowCommit-ed b < a
always has intersection with another Qc
thus it can
be recovered.
And this directly infers epaxos fast commit condition:
Inference: the value of relationship between a, x
,
If x
is committed with x < a
, then only a → x
can be FastCommit-ed.
∴ A more strict conclusion is:
From Lemma-entanglement-with-committed, another FastCommit condition is:
seeing a committed b
, and forwarded a
to a Qc
.
Conditions must be sastisified to commit on FastPath:
-
For every updated
a → x
, the leader received at least one reply with committedx
. -
a → x
constitutes a fast-quorum.
All following statements requires FastAccept are sent to a Qc
(including the leader).
- condition-1
a < b
constituted aQf'(a, b)
: commita < b
. The same fora > b
. - condition-2 Accept-ed
a < b
constituted aQc
(including leader): commita < b
. The same fora > b
. - condition-3
a > b
are Commit-ed: commita > b
.
Otherwise, SlowPath:
- send Accept
a < b
ifa > b
is not seen. - send Accept
a → b
ifa > b
is seen.
Excluding a, b
, FastPath requires at least |Qc|/2 + |Qc| - 2
replicas,
i.e., ⌊f+1⌋/2 + F-1
.
Including the leader itself, it is ⌊f+1⌋/2 + f
.
as epaxos specified.
If FastAccept of a
reached Lb
,
Lb
does not provide any info to recovery.
Thus it requires some other constrain, to commit with no more replicas.
E.g. using the condition-3, a committed relation.
∴ The FastPath requirements are:
- If
Lb ∈ Qf(a)
:|Qf(a)| >= f+1
andb
is committed. - If
Lb ∉ Qf(a)
:|Qf(a)| >= ⌊f+1⌋/2 + f
.
The coresponding recovery is: after prepare:
-
a:A/C
: commit a. -
b:A/C
: commit b.
If only one is committed, from Lemma-entanglement-with-committed to choose the other value and commit.
-
a:no-A and b:no-A
:a
orb
can only be FastCommit-ed with condition-1. If some value has more than half of|Qc|
, choose it and commit. -
Otherwise, no value is FastCommit-ed. run SlowPath.
And epaxos is a special case of this protocol,
epaxos strengthened the condition to be:
Always requires b
to be committed
Always use |Qf| = ⌊f+1⌋/2 + f
.
∴ From Def-depends-on-previous,
If L(a) == L(b)
and a < b
,
only b → a
can be committed, a → b
or a ↔ b
can never be committed.
The previous section discussed how to determine the order of two instances.
But execution requires more to determine the order.
E.g., a cycle a ↔ b
or a → b → c → a
.
From TODO,
we use knows
to determine execution order.
With knows
, if b₁.knows != b₂.knows
, b₁, b₂
are different instances.
thus the two instance protocol applies to a ~ b₁
and a ~ b₂
.
fn interferings(a):
rst = {}
for x in local_instances:
if x ~ a:
rst.insert(x)
return rst
fn new_dep(x):
return Dep{
id: x.id,
accepted: false,
committed: false,
knows: x.knows,
ballot: 0,
}
fn update_dep(a, b):
a.deps[b.id].knows ∪= b.knows
if b.committed:
a.deps[b.id].committed = true
if b.accepted:
a.deps[b.id].accepted = true
fn union(a, b):
c = {}
c.deps.keys = a.deps.keys ∪ b.deps.keys
for id in c.deps:
update_dep(c.deps[id], a.deps[id])
update_dep(c.deps[id], b.deps[id])
return c
-
Leader: Initiate instance
a₀
: filla₀.deps
with all local instances that interferes witha₀
. Then forwarda₀
to all other replicas. -
NonLeader: handle FastAccept request: initiated
a'
the same way the leader initiatesa₀
. Then uniona'
anda₀
:a₁ = union(a', a₀)
.Reply
a₁
. -
Leader: Handle FastAccept replies: For every instance
x
in replieda.deps
, i.e.,x: x ∈ a.deps
, if it satisfies FP-condition: commitx
. Otherwise enter SlowPath.
-
Leader:
union()
at least|Qc|
replies to builda₂
. Send Accept request witha₂
to replicas. -
NonLeader: handle Accept request: If the ballot in request is not smaller than the ballot in instance
a
locally,req.blt >= last_blt
, accept it. Then reply thelast_blt
-
Leader: Handle Accept reply: if at least
|Qc|
OK replies are received, commit it.
| |
| Leader | Non Leader
| --- | ---
| fn init(a₀): | ---
| a₀.deps = {} |
| |
| for x in interferings(a₀): |
| a₀.deps.insert(x.id, new_dep(x)) |
| |
| init(a₀) |
| forward(a₀) |
| --- | ---
| | fn handle_fast_accept_request(req):
| | a₀ = req.a₀
| | a' = a₀; a'.deps = {}
| | init(a')
| | a₁ = union(a', a₀)
| | reply(a₁)
| --- | ---
| fn handle_fast_accept_replies(replies): |
| |
| dep_ids = {x.id: x ∈ replies[i]} |
| for dep_id in dep_ids: |
| same = true |
| committed = false |
| for repl in replies: |
| r0 = replies[0].deps[dep_id] |
| r = repl.deps[dep_id] |
| same = same and r0 == r |
| if r.committed: |
| committed = true |
| |
| if same: |
| if (r0 == a₀.deps[dep_id]): |
| commit(a, replies, dep_id) |
| else if committed: |
| commit(a, replies, dep_id) |
| else: |
| slowpath(a, replies, dep_id) |
| else: |
| slowpath(a, replies, dep_id) |
| |
| --- | ---
| fn slowpath(a, replies, dep_id): | ---
| |
| for repl in replies: |
| update_dep(a, repl.deps[dep_id]) |
| a.deps[dep_id].accepted = true |
| |
| send_accept(a, dep_id) |
| --- | ---
| | fn handle_accept_request(req):
| |
| | a = req.q; dep_id = req.dep_id
| | if req.blt >= last_blt:
| | save(a.deps[dep_id])
| | last_blt = req.blt
| | reply(OK)
| | else:
| | reply(last_blt)
| |
| --- | ---
| fn handle_accept_replies(replies): |
| ok = 0 |
| for repl in replies: |
| if repl.ok: |
| ok++ |
| if ok >= Qc: |
| commit(a, dep_id) |
| |
| --- | ---
| commit(a, dep_id) |
| a.deps[dep_id].committed = true |
| send_commit(a, dep_id) |
Order is defined as:
-
a.knows ⊃ b.knows
: execa
afterb
. From Def-after, ifa.knows ⊃ b.knows
, executea
afterb
guarantees linearizability. -
Otherwise: exec
a
andb
in instance id order.
See exec-update-accumulated.md
Assumes:
- The instance to recover is
a
. - The leader of
a
La
isR0
- The recovery process is
P
(P != R0
).
Recovery-1: increment ballot: blt₁ = blt₀; blt₁[La]++
Recovery-2: Send FastAccept of a
with blt₁
.
Recovery-3: If conflicting ballot is returned in last_blt
, update local ballot and retry from Recovery-1.
After Preparing on a quorum(Qc
):
-
If
P
sawLa
, exit and wait forLa
to commita
. -
If
P
saw a Commit-eda
: broadcasta
and quit. -
If
P
saw a Accept-eda
: run Accept with this value and quit.
∴ P
need to continue recovery only if all of a
it saw are in FastAccept-ed.
P
tries to choose an order for every instance b: b ∈ a.deps
one by one.
P
could see different values of b: b ∈ a.deps
.
For two value x, y
(e.g.
a → b
and a < b
are different,
a → b₁
and a → b₂
are different, where b₁, b₂
have different knows
):
If P
reached Lb
, wait for Lb
to commit b
:
- If
b
is committed withbᵢ < a
, choosea → bᵢ
. From TODO, onlya → bᵢ
can be FastCommit-ed. - If
b
is committed withbᵢ → a
, From TODO, onlya < b
can be FastCommit-ed. to determine which is committed, just prepare and accept witha < b
to see if there is a conflict value.
If P
didnot reach Lb
:
From Def-FastQuorum:
FastCommit requires ⌊|Qc|/2⌋+1
identical FastAccept-ed value in a Qc: {La, Ly} ∩ Qc = ø
.
∴ choose the value that has at least ⌊|Qc|/2⌋+1
, run Accept, and finally commit.
Use the value of a
in Recovery-5, run Accept and Commit.
From Ballot-helps-relation,
No other z < a
can be Accept-ed, because La
increments its ballot.
∴ If Accept-ed, no other a → z
can be FastCommit-ed.
Vector ballot ensures interfering instances can not Accept.
On implementation, a.deps
is split into N
subset, where N
is number of replicas.
Every subset contains only instances from leader Ri
:
a.deps[Ri] = {x | x.replicaID == Ri and a → x}
.
And a.deps[Ri]
records only the max instance id.
This way,
Dep.id
Dep.accepted
Dep.committed
are all recorded with a int.
Because only Accept-ed a implies all instances on La before a are also
Accept-ed.
So is Commit-ed.
Prepare with a new ballot and FastAccept can be combined.
optimization maybe:
a a a→x a←x a←x x x
a
always knows x knows a, so a does not need to know x.
When handling FastAccept, need to exclude instance known but not on local replica. This guarantees only need to check the highest interfering instance when recovery.
No way. a after b
requires a
knows of all b
knows of.
Prepare and FastAccept can be done together Because Accept only check ballot. FastAccept does not change existent FastAccept.
inexistent FastAccept-ed value can be treat as existent value but dont know what it is.
∵ FastAccept never change a value. ∴ reading an old value does not need to be protected with Ballot.
seeing a Accept-ed instance: A recovery always choose a possibly FastCommit-ed value. ∴ accept this FastAccept request wont let any other one to choose a different value.
As an optimization, respect the accepted value is more efficient. re-populating an Accept-ed value does not break consistency.
Instance.knows
is a virtual attribute and is not need to be persisted on disk.
Because it can be calculated dynamically from deps
:
a.knows = {a} ∪ a.deps[0].knows ∪ a.deps[1].knows ∪ ...
From Def-depends-on-previous,
Because newer instance on La always depends-on older instance,
∴ a.deps[La]
will never change thus do not need to recover it.
If x
is FastCommit-ed:
-
If
a
reachedLx
, thena
know ifx
is committed, becauseLx
is the first to commit. Although there is chancex
is committed aftera
reachesLx
,Lx
broadcastsx is committed
very likely earlier than another instance bringsx is committed
through its FastAccept request. -
If
a
did not reachLx
, thena
must have reachedg - {La, Lx}
, this prevent other value ofa > y
to commit. ∴a > x
is safe to FastCommit.