Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move more logic in blockstore_processor behind allow_dead_slots flag #2341

Merged
merged 2 commits into from
Jul 31, 2024

Conversation

steviez
Copy link

@steviez steviez commented Jul 29, 2024

Problem

A flag in ProcessOptions can be set to allow dead slots. If the flag is
not set and a block is marked dead, fetching the entries will fail as
the blockstore method to retrive entries checks if the slot is dead.

Within process_next_slots(), new Banks are created as children of
already replayed banks are discovered. The children Banks are created
prior to fetching entries, and thus, a Bank could be created for a
dead slot that will eventually be discarded.

Summary of Changes

Instead of allowing the extra work of creating a Bank to proceed, check
if a slot is dead (and allow_dead_slots=false) BEFORE creating a Bank
for the slot.

For the case of a validator, allow_dead_slots=false for the local ledger replay at startup. So, we will avoid processing any known dead slots altogether. On the other hand, ReplayStage will create a Bank for dead slots, see it is dead and then proceed on other forks. By avoiding the creation of Bank in local ledger replay, we'll avoid this non-fatal error:

[... ERROR solana_accounts_db::accounts_db] set_hash: already exists; multiple forks with shared slot X as child (parent: Y)!?

Fixes solana-labs#28343

steviez added 2 commits July 29, 2024 17:04
A flag in ProcessOptions can be set to allow dead slots. If the flag is
not set and a block is marked dead, fetching the entries will fail as
the entry fetch method internally checks if the slot is dead.

Within process_next_slots(), new Banks are created as replay progresses
through slots. If allow_dead_slots=false and a dead slot is loaded,
replay of the slot will error. That error is handled and results in the
Bank being removed from BankForks.

Instead of allowing the extra work to proceed, check if new slots to
replay are dead (and allow_dead_slots=false) BEFORE the creation of a
new Bank.
@steviez steviez force-pushed the bstore_proc_dead_slots branch from 1e76423 to 2dc1b2d Compare July 29, 2024 22:04
@steviez steviez requested review from AshwinSekar and bw-solana July 30, 2024 03:44
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1725,18 +1725,21 @@ fn process_next_slots(
blockstore: &Blockstore,
leader_schedule_cache: &LeaderScheduleCache,
pending_slots: &mut Vec<(SlotMeta, Bank, Hash)>,
halt_at_slot: Option<Slot>,
opts: &ProcessOptions,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't love the name, but I know you're just maintaining the existing convention, so I'll hold my nose

Copy link

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
With this change, do we feel confident to upgrade the set_hash to a panic? i.e. running solana-labs#33186 on master/inv net for a while and see what happens?

@steviez steviez merged commit d7b22e2 into anza-xyz:master Jul 31, 2024
41 checks passed
@steviez steviez deleted the bstore_proc_dead_slots branch July 31, 2024 16:05
@steviez
Copy link
Author

steviez commented Jul 31, 2024

LGTM. With this change, do we feel confident to upgrade the set_hash to a panic? i.e. running solana-labs#33186 on master/inv net for a while and see what happens?

Potentially, let me think about it a little and will followup with you. Per this Discord message, Brooks had a similar idea at one point but mentioned seeing a CI failure. Obviously, if we have a test that hits this, that would be great to fix, but for a live cluster, there is still more unknown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic from attempting to replay a dead slot
3 participants