4945: Fix timeouts growing in Zug when network is stalled r=EdHastingsCasperAssociation a=fizyk20

If more than 1/3 of the validators by weight go offline, the network stalls. This is expected, since the network requires at least 2/3 of the validators by weight to be correct (online and adhering to protocol) in order to make progress.
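The stall condition stated above can be sketched as a small weight check. This is only an illustration that takes the description literally; `is_stalled` and the weights are hypothetical and are not taken from the node's code.

```rust
/// Hypothetical helper: true if strictly more than one third of the total
/// validator weight is offline, which is the stall condition described above.
fn is_stalled(offline_weight: u64, total_weight: u64) -> bool {
    // Widen to u128 so that multiplying by 3 cannot overflow.
    (offline_weight as u128) * 3 > total_weight as u128
}

fn main() {
    // Illustrative weights only.
    let total = 100;
    println!("30 of {total} offline: stalled = {}", is_stalled(30, total));
    println!("40 of {total} offline: stalled = {}", is_stalled(40, total));
}
```

With these illustrative numbers, 30 offline out of 100 is not above one third, while 40 out of 100 is.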

Right now, if the network gets stalled while using Zug, the remaining validators start to time out waiting for a proposal to get accepted and start increasing their timeouts in order to adjust to what they perceive as network delays. However, after timing out, they set another timer, and the cycle repeats, causing the timeouts to grow without bound.
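A rough simulation of that cycle, assuming a hypothetical rule that doubles the timeout every time the timer fires (the real adjustment rule in Zug is not shown in this commit and may differ):

```rust
/// Hypothetical growth rule: each timeout doubles the proposal timeout.
fn grow(timeout_ms: u64) -> u64 {
    timeout_ms.saturating_mul(2)
}

/// Simulates a stalled round in which the timer fires `fires` times and
/// every firing grows the timeout, as in the behavior described above.
fn timeout_after_stall(initial_ms: u64, fires: u32) -> u64 {
    let mut timeout_ms = initial_ms;
    for _ in 0..fires {
        timeout_ms = grow(timeout_ms);
    }
    timeout_ms
}

fn main() {
    // Starting at 1 second, ten consecutive timeouts in one stalled round.
    println!("{} ms", timeout_after_stall(1_000, 10));
}
```

Under this assumed doubling rule, ten firings in a single stalled round already turn a 1 second timeout into more than 17 minutes, and nothing caps further growth while the stall lasts.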

This PR changes it so that validators only time out at most once per round. This way if the network gets stalled, they increase their timeout _once_ and wait for the round to end (by either becoming skippable, or having an accepted proposal). This will happen once enough validators are back online, but while the network is stalled, they no longer increase their timeout further - which fixes casper-network#4927 while preserving the algorithm's assumptions.
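The once-per-round rule can be sketched in isolation as follows. `RoundState`, `vote_false`, and the doubling rule are hypothetical stand-ins; the sketch assumes, as the guard in the diff suggests, that creating the vote yields no outcomes when this validator has already voted in the round.

```rust
/// Hypothetical, simplified per-round state; not the real `Zug` struct.
struct RoundState {
    voted_false: bool,
    proposal_timeout_ms: u64,
}

impl RoundState {
    /// Stand-in for creating and gossiping a `Vote(false)`: returns the
    /// outgoing messages, or nothing if we already voted in this round.
    fn vote_false(&mut self) -> Vec<&'static str> {
        if self.voted_false {
            Vec::new()
        } else {
            self.voted_false = true;
            vec!["Vote(false)"]
        }
    }

    /// Called whenever the proposal timer fires for the current round.
    fn on_timer(&mut self) -> Vec<&'static str> {
        let msgs = self.vote_false();
        // Only grow the timeout the first time we time out in this round.
        if !msgs.is_empty() {
            // Assumed growth rule, for illustration only.
            self.proposal_timeout_ms = self.proposal_timeout_ms.saturating_mul(2);
        }
        msgs
    }
}

fn main() {
    let mut round = RoundState { voted_false: false, proposal_timeout_ms: 1_000 };
    for _ in 0..5 {
        round.on_timer();
    }
    println!("{} ms", round.proposal_timeout_ms);
}
```

In this sketch the timer fires five times but the timeout grows only once, because later firings produce no new vote.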


Co-authored-by: Bartłomiej Kamiński <[email protected]>
casperlabs-bors-ng[bot] and fizyk20 authored Oct 31, 2024
2 parents 7606e94 + f8ffc02 commit 1848e2e
Showing 1 changed file with 11 additions and 8 deletions.
19 changes: 11 additions & 8 deletions node/src/components/consensus/protocols/zug.rs
@@ -1664,8 +1664,13 @@ impl<C: Context + 'static> Zug<C> {
                 .current_round_start
                 .saturating_add(self.proposal_timeout());
             if now >= current_timeout {
-                outcomes.extend(self.create_and_gossip_message(round_id, Content::Vote(false)));
-                self.update_proposal_timeout(now);
+                let msg_outcomes = self.create_and_gossip_message(round_id, Content::Vote(false));
+                // Only update the proposal timeout if this is the first time we timed out in this
+                // round
+                if !msg_outcomes.is_empty() {
+                    self.update_proposal_timeout(now);
+                }
+                outcomes.extend(msg_outcomes);
             } else if self.faults.contains_key(&self.leader(round_id)) {
                 outcomes.extend(self.create_and_gossip_message(round_id, Content::Vote(false)));
             }
@@ -1685,12 +1690,10 @@ impl<C: Context + 'static> Zug<C> {
                 // that time.
                 debug!(our_idx, %now, %timestamp, "update_round - schedule update 1");
                 outcomes.extend(self.schedule_update(timestamp));
-            } else {
-                if self.current_round_start > now {
-                    // A proposal could be made now. Start the timer and propose if leader.
-                    self.current_round_start = now;
-                    outcomes.extend(self.propose_if_leader(maybe_parent_round_id, now));
-                }
+            } else if self.current_round_start > now {
+                // A proposal could be made now. Start the timer and propose if leader.
+                self.current_round_start = now;
+                outcomes.extend(self.propose_if_leader(maybe_parent_round_id, now));
                 let current_timeout = self
                     .current_round_start
                     .saturating_add(self.proposal_timeout());