Populate topology after restart if finality is lagging behind current session #6913

alexggh · 2024-12-16T16:20:26Z

There is a small issue on restart: if finality is lagging across a session boundary and the validator restarts, the validator can no longer contribute assignments/approvals or gossip for blocks from the previous session. After the restart it builds the topology only for the new session, and since everything in approval-distribution is gated on having a topology for the block, it cannot distribute assignments and approvals for those blocks.

The fix is to also keep track of the last finalized block and its session; if that session is not in the list of encountered sessions, build its topology and send it to the rest of the subsystems.
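The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `TopologyTracker`, `sessions_with_topology`, and `sessions_to_build` are hypothetical names, and emitting the older session first is an assumption of this sketch.

```rust
use std::collections::HashSet;

type SessionIndex = u32;

/// Hypothetical stand-in for the state gossip-support keeps across leaf updates.
#[derive(Default)]
struct TopologyTracker {
    /// Sessions for which a topology has already been built and sent.
    sessions_with_topology: HashSet<SessionIndex>,
}

impl TopologyTracker {
    /// Returns the sessions that still need a topology, given the session of
    /// the current leaf and the session of the last finalized block.
    fn sessions_to_build(
        &mut self,
        current_session: SessionIndex,
        finalized_session: SessionIndex,
    ) -> Vec<SessionIndex> {
        let mut to_build = Vec::new();
        // Assumption of this sketch: the finalized (older) session is handled first.
        for session in [finalized_session, current_session] {
            // `insert` returns true only if the session was not seen before.
            if self.sessions_with_topology.insert(session) {
                to_build.push(session);
            }
        }
        to_build
    }
}

fn main() {
    let mut tracker = TopologyTracker::default();
    // Restart in session 11 while finality is still in session 10:
    // both sessions need a topology.
    assert_eq!(tracker.sessions_to_build(11, 10), vec![10, 11]);
    // On the next leaf nothing new has to be built.
    assert!(tracker.sessions_to_build(11, 10).is_empty());
}
```

When finality is not lagging, both arguments are the same session and only one topology is built, which matches the behaviour before the change.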

Signed-off-by: Alexandru Gheorghe <[email protected]>

AndreiEres · 2024-12-18T14:40:40Z

polkadot/node/network/gossip-support/src/lib.rs

@@ -28,6 +28,7 @@ use std::{
 	collections::{HashMap, HashSet},
 	fmt,
 	time::{Duration, Instant},
+	u32,


Do we need to import it?

alindima · 2024-12-23T14:43:14Z

How will this work for subsystems like statement-distribution and bitfield-distribution, which need to keep backing new candidates on the latest session? AFAIU, with this PR they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them.

alexggh · 2025-01-06T10:14:13Z

> How will this work for subsystems like statement-distribution and bitfield-distribution, which need to keep backing new candidates on the latest session? AFAIU, with this PR they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them.

Good catch, thanks! statement-distribution keeps the topology in a per-session structure, so there is no problem there. bitfield-distribution, however, keeps only the last received topology, and that is a problem. I'm looking into how to fix that.
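To make the difference concrete, here is a minimal sketch of the two storage strategies. The types `Topology`, `PerSessionTopologies`, and `LastReceivedTopology` are simplified, hypothetical stand-ins and not the actual structures in statement-distribution or bitfield-distribution.

```rust
use std::collections::HashMap;

type SessionIndex = u32;

/// Simplified stand-in for a gossip topology (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Topology {
    session: SessionIndex,
}

/// Per-session storage: a topology for an old session lands in its own slot.
#[derive(Default)]
struct PerSessionTopologies {
    by_session: HashMap<SessionIndex, Topology>,
}

impl PerSessionTopologies {
    fn on_new_topology(&mut self, topology: Topology) {
        self.by_session.insert(topology.session, topology);
    }

    fn get(&self, session: SessionIndex) -> Option<&Topology> {
        self.by_session.get(&session)
    }
}

/// Single-slot storage: whichever topology arrives last wins.
#[derive(Default)]
struct LastReceivedTopology {
    current: Option<Topology>,
}

impl LastReceivedTopology {
    fn on_new_topology(&mut self, topology: Topology) {
        self.current = Some(topology);
    }
}

fn main() {
    let new = Topology { session: 11 };
    let old = Topology { session: 10 };

    // Per-session storage keeps both topologies side by side.
    let mut per_session = PerSessionTopologies::default();
    per_session.on_new_topology(new.clone());
    per_session.on_new_topology(old.clone());
    assert_eq!(per_session.get(11), Some(&new));

    // Single-slot storage ends up on the old session if it arrives last.
    let mut last = LastReceivedTopology::default();
    last.on_new_topology(new);
    last.on_new_topology(old.clone());
    assert_eq!(last.current, Some(old));
}
```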

alindima · 2025-01-06T10:52:29Z

Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem. With the current form it's also a bit confusing that we're sending NewGossipTopology for an old session

alexggh · 2025-01-07T12:08:47Z

> Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem.

I've looked closer and it is not that simple, because the flow right now looks like this:

1. Gossip-support sends NetworkBridgeRxMessage::NewGossipTopology.
2. Network bridge rx receives this message, adds some useful information, and converts it to NetworkBridgeEvent::NewGossipTopology.
3. approval-distribution, statement-distribution and bitfield-distribution receive NetworkBridgeEvent::NewGossipTopology.
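The one-way flow above can be pictured with this minimal sketch. The two enums reuse the names of the real messages, but their shape here is a heavily simplified stand-in: the `session` and `enriched` fields and the `bridge_rx_convert` function are hypothetical and exist only for illustration.

```rust
type SessionIndex = u32;

/// Simplified stand-in for the message gossip-support sends (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
enum NetworkBridgeRxMessage {
    NewGossipTopology { session: SessionIndex },
}

/// Simplified stand-in for the event the distribution subsystems receive.
/// `enriched` is a placeholder for the extra information network-bridge-rx adds.
#[derive(Debug, Clone, PartialEq)]
enum NetworkBridgeEvent {
    NewGossipTopology { session: SessionIndex, enriched: bool },
}

/// Step 2: network-bridge-rx converts the message into an event.
fn bridge_rx_convert(msg: NetworkBridgeRxMessage) -> NetworkBridgeEvent {
    match msg {
        NetworkBridgeRxMessage::NewGossipTopology { session } => {
            NetworkBridgeEvent::NewGossipTopology { session, enriched: true }
        }
    }
}

fn main() {
    // Step 1: gossip-support produces the message.
    let msg = NetworkBridgeRxMessage::NewGossipTopology { session: 10 };
    // Step 2: network-bridge-rx enriches and converts it.
    let event = bridge_rx_convert(msg);
    // Step 3: every distribution subsystem receives its own copy of the event.
    let subscribers = [
        "approval-distribution",
        "statement-distribution",
        "bitfield-distribution",
    ];
    let delivered: Vec<(&str, NetworkBridgeEvent)> =
        subscribers.iter().map(|name| (*name, event.clone())).collect();
    assert_eq!(delivered.len(), 3);
}
```

Data only moves downstream in this sketch; no subsystem waits on a reply from another.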

Requesting would mean reproducing the same flow in reverse: approval-distribution -> network-bridge-rx -> gossip-support, with each subsystem blocking while it waits for the other to respond. I don't think we should introduce this, because it is easy to produce deadlocks with this type of interaction.

> With the current form it's also a bit confusing that we're sending NewGossipTopology for an old session

The message carries a session_index that clearly tells you which session the topology is for, which allows receivers to determine whether they need the older ones or not.

In the end I opted to ignore older topologies in bitfield-distribution, so let me know what you think about going this route.
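A minimal sketch of that guard, assuming the subsystem keeps a single topology slot: `LatestTopology` and `update` are hypothetical names, and replacing the slot when the session is equal is an assumption of this sketch rather than something stated in the PR.

```rust
type SessionIndex = u32;

/// Hypothetical single-slot holder, reduced to the session index for brevity.
#[derive(Debug, Default)]
struct LatestTopology {
    session: Option<SessionIndex>,
}

impl LatestTopology {
    /// Applies the incoming topology unless it belongs to an older session.
    /// Returns true if the slot was updated.
    fn update(&mut self, incoming: SessionIndex) -> bool {
        match self.session {
            // An older session must not override a newer one.
            Some(current) if incoming < current => false,
            _ => {
                self.session = Some(incoming);
                true
            }
        }
    }
}

fn main() {
    let mut latest = LatestTopology::default();
    assert!(latest.update(11));
    // The topology for the lagging session 10 is ignored.
    assert!(!latest.update(10));
    assert_eq!(latest.session, Some(11));
    // A newer session still replaces the stored one.
    assert!(latest.update(12));
}
```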

alindima · 2025-01-08T09:07:50Z

> Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where each subsystem blocks waiting for the other to respond. I don't think we should introduce this because it is easy to produce deadlocks with this type of interaction.

To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.

> The message has session_index on it that clearly tells you which session the topology is for, so that allows callers to determine if they need the older ones or not.

You do have the session index there but the naming is confusing. Subsystems that use this code could easily overlook this (like a new leaf update containing an old leaf).

> In the end I opted to ignore older topologies for bitfield-distribution, so let me know what you think about going this route.

There is still code that runs in bitfield-distribution even if update_topologies did nothing.
Maybe a good compromise would be adding a new message type, so that this change is explicit and is only handled by the subsystems that need it?

paritytech-workflow-stopper · 2025-01-08T09:21:53Z

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12667016328
Failed job name: test-linux-stable-no-try-runtime

Populate topology when finality is lagging behind current session

2e7a7fd

Signed-off-by: Alexandru Gheorghe <[email protected]>

alexggh requested review from ordian, sandreim and AndreiEres and removed request for ordian December 16, 2024 16:20

AndreiEres approved these changes Dec 18, 2024

Do not override newer topology with older ones

10a9d87

Signed-off-by: Alexandru Gheorghe <[email protected]>

Merge branch 'master' into alexggh/update_topology_at_restart

5c6f111
