
Fix order of resending messages after restart #6729

Merged
merged 7 commits into master from alexggh/fix_resending_after_restart on Dec 10, 2024

Conversation

alexggh
Contributor

@alexggh alexggh commented Dec 2, 2024

The way we build the messages we need to send to approval-distribution can result in a situation where, if we have multiple assignments covered by a coalesced approval, the messages are sent in this order:

ASSIGNMENT1, APPROVAL, ASSIGNMENT2. This happens because we iterate over each candidate and add both the assignment and the approval for that candidate to the queue of messages, so when the approval reaches the approval-distribution subsystem it won't be imported and gossiped, because one of the assignments it covers is not known yet.

So in a network where a lot of nodes restart at the same time, we could end up in a situation where one set of nodes correctly received the assignments and approvals before the restart, approved their blocks, and don't trigger their assignments. The other set of nodes should receive the assignments and approvals after the restart, but because the approvals never get broadcast anymore due to this bug, the only way they could approve is if other nodes start broadcasting their assignments.

I think this bug contributed to the reason the network did not recover on `25-11-25 15:55:40` after the restarts.

Tested this scenario with a `zombienet` where nodes are finalising blocks because of aggression and all nodes are restarted at once; confirmed that the network lags and doesn't recover before the fix, and it does recover after it.
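
To make the ordering issue concrete, here is a minimal Rust sketch (hypothetical, simplified types; not the actual approval-voting / approval-distribution code) contrasting the per-candidate queuing that produces ASSIGNMENT1, APPROVAL, ASSIGNMENT2 with an ordering that sends all covered assignments before the coalesced approval:

```rust
// Hypothetical, simplified model of the resend queue; not the real subsystem types.
#[derive(Debug, Clone)]
enum Message {
    Assignment { candidate: u32 },
    // A coalesced approval covers several candidates at once.
    Approval { candidates: Vec<u32> },
}

// Buggy ordering: iterating per candidate and queuing the (shared) approval right
// after the first assignment it covers yields ASSIGNMENT1, APPROVAL, ASSIGNMENT2,
// so approval-distribution sees the approval before all of its assignments.
fn per_candidate_order(candidates: &[u32]) -> Vec<Message> {
    let mut queue = Vec::new();
    let mut approval_queued = false;
    for &c in candidates {
        queue.push(Message::Assignment { candidate: c });
        if !approval_queued {
            queue.push(Message::Approval { candidates: candidates.to_vec() });
            approval_queued = true;
        }
    }
    queue
}

// Fixed ordering: queue every assignment first, then the coalesced approval, so all
// assignments it covers are already known when the approval is imported.
fn assignments_first_order(candidates: &[u32]) -> Vec<Message> {
    let mut queue: Vec<Message> = candidates
        .iter()
        .map(|&c| Message::Assignment { candidate: c })
        .collect();
    queue.push(Message::Approval { candidates: candidates.to_vec() });
    queue
}

fn main() {
    let candidates = [1, 2];
    println!("buggy: {:?}", per_candidate_order(&candidates));
    println!("fixed: {:?}", assignments_first_order(&candidates));
}
```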


Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh marked this pull request as ready for review December 9, 2024 08:31
@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12235807472
Failed job name: fmt

Contributor

@sandreim sandreim left a comment


🚀

Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
Contributor

@AndreiEres AndreiEres left a comment


Nice!

@alexggh alexggh added T8-polkadot This PR/Issue is related to/affects the Polkadot network. A3-backport Pull request is already reviewed well in another branch. A4-needs-backport Pull request must be backported to all maintained releases. and removed A3-backport Pull request is already reviewed well in another branch. labels Dec 10, 2024
@alexggh alexggh enabled auto-merge December 10, 2024 10:03
@alexggh alexggh added this pull request to the merge queue Dec 10, 2024
Merged via the queue into master with commit 65a4e5e Dec 10, 2024
197 of 199 checks passed
@alexggh alexggh deleted the alexggh/fix_resending_after_restart branch December 10, 2024 13:28
@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2407:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-6729-to-stable2407
git worktree add --checkout .worktree/backport-6729-to-stable2407 backport-6729-to-stable2407
cd .worktree/backport-6729-to-stable2407
git reset --hard HEAD^
git cherry-pick -x 65a4e5ee06b11844d536730379d4e1cab337beb4
git push --force-with-lease

@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2409:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-6729-to-stable2409
git worktree add --checkout .worktree/backport-6729-to-stable2409 backport-6729-to-stable2409
cd .worktree/backport-6729-to-stable2409
git reset --hard HEAD^
git cherry-pick -x 65a4e5ee06b11844d536730379d4e1cab337beb4
git push --force-with-lease

github-actions bot pushed a commit that referenced this pull request Dec 10, 2024

Signed-off-by: Alexandru Gheorghe <[email protected]>
(cherry picked from commit 65a4e5e)
@paritytech-cmd-bot-polkadot-sdk

Successfully created backport PR for stable2412:

@Polkadot-Forum

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2025-11-25-kusama-parachains-spammening-aftermath/11108/1

EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2407` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2409` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6729 into `stable2412` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


Co-authored-by: Alexandru Gheorghe <[email protected]>
Ank4n pushed a commit that referenced this pull request Dec 15, 2024

Signed-off-by: Alexandru Gheorghe <[email protected]>
dudo50 pushed a commit to paraspell-research/polkadot-sdk that referenced this pull request Jan 4, 2025

Signed-off-by: Alexandru Gheorghe <[email protected]>
Labels
A4-needs-backport Pull request must be backported to all maintained releases. T8-polkadot This PR/Issue is related to/affects the Polkadot network.