Populate topology after restart if finality is lagging behind current session #6913
base: master
Conversation
Signed-off-by: Alexandru Gheorghe <[email protected]>
@@ -28,6 +28,7 @@ use std::{
     collections::{HashMap, HashSet},
     fmt,
     time::{Duration, Instant},
+    u32,
Do we need to import it?
How will this work for subsystems like statement-distribution and bitfield-distribution, which need to keep backing new candidates on the latest session? AFAIU, with this PR they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them.
Good catch, thanks! In statement-distribution we keep the topology in a per-session structure, so there is no problem there. However, bitfield-distribution keeps only the last received topology, and that is a problem; looking into how to fix it.
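The difference between the two subsystems can be sketched with two minimal storage shapes. The type and field names below are illustrative stand-ins, not the actual polkadot-node types: a per-session map keeps old sessions usable, while a latest-only slot gets clobbered by a late-arriving old topology.

```rust
use std::collections::HashMap;

// Hypothetical minimal stand-in for a gossip topology; the real type
// carries the validator grid, not just a session index.
#[derive(Clone, Debug, PartialEq)]
struct SessionGridTopology {
    session: u32,
}

// statement-distribution style: topologies are kept per session, so a
// topology for an older session is stored alongside the current one.
#[derive(Default)]
struct PerSessionView {
    per_session: HashMap<u32, SessionGridTopology>,
}

impl PerSessionView {
    fn update_topology(&mut self, t: SessionGridTopology) {
        self.per_session.insert(t.session, t);
    }
    fn topology(&self, session: u32) -> Option<&SessionGridTopology> {
        self.per_session.get(&session)
    }
}

// bitfield-distribution style: only the last received topology is kept,
// so an *old* session's topology overwrites the current one.
#[derive(Default)]
struct LatestOnlyView {
    latest: Option<SessionGridTopology>,
}

impl LatestOnlyView {
    fn update_topology(&mut self, t: SessionGridTopology) {
        self.latest = Some(t);
    }
}

fn main() {
    let mut per_session = PerSessionView::default();
    per_session.update_topology(SessionGridTopology { session: 10 });
    per_session.update_topology(SessionGridTopology { session: 9 }); // old session, still fine
    assert!(per_session.topology(10).is_some());
    assert!(per_session.topology(9).is_some());

    let mut latest_only = LatestOnlyView::default();
    latest_only.update_topology(SessionGridTopology { session: 10 });
    latest_only.update_topology(SessionGridTopology { session: 9 }); // clobbers session 10
    assert_eq!(latest_only.latest.as_ref().map(|t| t.session), Some(9));
}
```

This is why sending an old session's topology is harmless for statement-distribution but breaks bitfield-distribution as originally written.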
Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem. With the current form it's also a bit confusing that we're sending
Signed-off-by: Alexandru Gheorghe <[email protected]>
I've looked closer and it is not that simple because the flow right now is like:
Requesting would mean reproducing the same flow in reverse: approval-distribution -> network-bridge-rx -> gossip-support, with each subsystem blocking while it waits for the other to respond. I don't think we should introduce this, because it is easy to produce deadlocks with this type of interaction.
The message has … In the end I opted to ignore older topologies in bitfield-distribution, so let me know what you think about going this route.
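The chosen fix can be sketched as a small session-index guard in bitfield-distribution: drop any incoming topology whose session is older than the one currently held. The names below are illustrative, not the actual subsystem types.

```rust
// Hypothetical slice of bitfield-distribution state: we only remember the
// session index of the topology we currently hold.
struct State {
    topology_session: Option<u32>,
}

impl State {
    // Returns true if the incoming topology was accepted, false if it was
    // ignored because it belongs to an older session than the current one.
    fn handle_new_topology(&mut self, incoming_session: u32) -> bool {
        match self.topology_session {
            Some(current) if incoming_session < current => false, // stale: ignore
            _ => {
                self.topology_session = Some(incoming_session);
                true
            }
        }
    }
}

fn main() {
    let mut state = State { topology_session: None };
    assert!(state.handle_new_topology(10)); // first topology: accepted
    assert!(!state.handle_new_topology(9)); // older session: ignored
    assert_eq!(state.topology_session, Some(10));
}
```

With this guard, gossip-support can safely send topologies for old sessions (needed by approval-distribution) without clobbering bitfield-distribution's latest-session topology.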
To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.
You do have the session index there but the naming is confusing. Subsystems that use this code could easily overlook this (like a new leaf update containing an old leaf).
There is still code that is being run in bitfield-distribution even if the …
All GitHub workflows were cancelled due to the failure of one of the required jobs.
There is a small issue on restart: if finality is lagging across a session boundary and the validator restarts, the validator won't be able to contribute assignments/approvals and gossip for blocks from the previous session, because after restart it builds the topology only for the new session. Without a topology it can't distribute assignments and approvals, since everything in approval-distribution is gated on having a topology for the block.

The fix is to also keep track of the last finalized block and its session; if that session is not in the list of encountered sessions, build its topology as well and send it to the rest of the subsystems.
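The fix described above can be sketched as a small pure function, under assumed names (none of these are the actual polkadot-node identifiers): on restart, compare the session of the last finalized block with the sessions for which topologies were already built, and add the finalized session if finality lags across a session boundary.

```rust
use std::collections::HashSet;

// Decide which sessions still need a topology built and gossiped on
// (re)start. `already_built` holds the sessions we have topologies for.
fn sessions_needing_topology(
    current_session: u32,
    finalized_session: u32,
    already_built: &HashSet<u32>,
) -> Vec<u32> {
    let mut needed = Vec::new();
    // The current session's topology is always required.
    if !already_built.contains(&current_session) {
        needed.push(current_session);
    }
    // If finality lags across a session boundary, the finalized block's
    // session also needs a topology, or approval-distribution cannot
    // distribute assignments/approvals for those pre-restart blocks.
    if finalized_session != current_session && !already_built.contains(&finalized_session) {
        needed.push(finalized_session);
    }
    needed
}

fn main() {
    let built: HashSet<u32> = HashSet::new();
    // Finality lagging behind the session boundary: both sessions needed.
    assert_eq!(sessions_needing_topology(11, 10, &built), vec![11, 10]);
    // Finality caught up: only the current session.
    assert_eq!(sessions_needing_topology(11, 11, &built), vec![11]);
}
```

This mirrors the PR's intent: the restart path only gains one extra check on the finalized block's session, and the normal (finality caught up) path is unchanged.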