
Panic from attempting to replay a dead slot #28343

Closed · steviez opened this issue Oct 11, 2022 · 1 comment · Fixed by anza-xyz/agave#2341
Labels: validator (Issues that relate to the validator)

steviez commented Oct 11, 2022

Problem

On a node running v1.14 (the issue is present in master too), we observed a panic shortly after the node fired up ReplayStage. The following annotated log lines tell the story of what happened and why we end up with a panic:

// Validator process starts
[2022-10-07T20:05:04.522762389Z INFO  solana_validator] Starting validator with: ArgsOs {

// Ledger processing begins for what is already present
[2022-10-07T20:10:04.332963953Z INFO  solana_ledger::blockstore_processor] Processing ledger from slot 154185240...
[2022-10-07T20:10:04.333307300Z INFO  solana_ledger::blockstore_processor] ledger holds data through slot 154185596

// 154185568 bank created
[2022-10-07T20:12:35.339310609Z INFO  solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot=154185568i block_height=139237307i parent_slot=154185563i ...

// 154185568 fails to verify and is marked dead
[2022-10-07T20:12:36.251737344Z WARN  solana_ledger::blockstore_processor] slot 154185568 failed to verify: failed to load entries, error: blockstore error

// Ledger processing finished, 154185563 still an active bank
[2022-10-07T20:12:42.324282843Z INFO  solana_ledger::blockstore_processor] ledger processed in 2 minutes, 37 seconds, 990 ms, 457 µs and 693 ns. root slot is 154185555, 32 banks: 154185555, 154185556, 154185557, 154185558, 154185559, 154185560, 154185561, 154185562, 154185563, 154185564, 154185565, 154185566, 154185567, 154185569, 154185570, 154185571, 154185572, 154185573, 154185574, 154185575, 154185576, 154185577, 154185578, 154185579, 154185580, 154185581, 154185582, 154185583, 154185584, 154185585, 154185586, 154185587

// set_hash error (this is a subroutine of bank creation for 154185568)
[2022-10-07T20:12:42.373863730Z ERROR solana_runtime::accounts_db] set_hash: already exists; multiple forks with shared slot 154185568 as child (parent: 154185563)!?

// 154185568 bank created a second time since its parent (154185563) is not yet rooted and is thus found by generate_new_bank_forks()
[2022-10-07T20:12:42.374852562Z INFO  solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot=154185568i block_height=139237307i parent_slot=154185563i ...

// ReplayStage rejects 154185568 since we had previously marked it dead
[2022-10-07T20:12:42.610532191Z ERROR solana_metrics::metrics] datapoint: replay-stage-mark_dead_slot error="error: FailedToLoadEntries(DeadSlot)" slot=154185568i

// Debug panic that was put in to provide more information than the eventual panic that we would hit
[2022-10-07T20:12:56.778393865Z INFO  solana_runtime::accounts_db] remove_dead_slots_metadata: 1 dead slots, dead_slots: [154185568]
thread 'solBgAccounts' panicked at 'cleaning slot that was duplicate: 154185568', runtime/src/accounts_db.rs:7590:21
stack backtrace:
[2022-10-07T20:12:56.798757132Z INFO  solana_metrics::metrics] datapoint: Gossip streamer-send-sample_duration_ms=393i streamer-send-host_count=1164i streamer-send-bytes_total=2180067i streamer-send-pkt_count_total=2060i streamer-send-host_bytes_min=1050i streamer-send-host_bytes_max=9536i streamer-send-host_bytes_mean=1875i streamer-send-host_bytes_90pct=3179i streamer-send-host_bytes_50pct=1060i streamer-send-host_bytes_10pct=1060i
[2022-10-07T20:12:56.798764606Z INFO  solana_streamer::streamer] streamer send Gossip hosts: count:1164 [(69.197.13.5, SendStats { bytes: 9531, count: 9 }), (65.109.52.186, SendStats { bytes: 9531, count: 9 }), (65.108.127.108, SendStats { bytes: 8472, count: 8 }), (66.23.234.14, SendStats { bytes: 7413, count: 7 }), (65.108.226.27, SendStats { bytes: 7413, count: 7 })]
   0: rust_begin_unwind
             at ./rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
   2: solana_runtime::accounts_db::AccountsDb::purge_slot_cache_pubkeys
   3: solana_runtime::accounts_db::AccountsDb::purge_slot_cache
   4: solana_runtime::accounts_db::AccountsDb::purge_slot
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2022-10-07T20:12:56.820420647Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solBgAccounts" one=1i message="panicked at 'cleaning slot that was duplicate: 154185568', runtime/src/accounts_db.rs:7590:21" location="runtime/src/accounts_db.rs:7590:21" version="\"1.14.6 (src:devbuild; feat:940802714)\""

So, basically, the issue is that the slot gets marked dead during startup processing, but is then processed by ReplayStage a second time. ReplayStage can handle replaying dead slots; however, since this was the second time a bank was created for the slot, we hit that set_hash error and things were already in a bad state.
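
For illustration only (this is a toy model, not the actual AccountsDb code path), the invariant behind that "set_hash: already exists" error is roughly that a slot's bank hash entry should only ever be created once, so building a second bank for 154185568 trips it:

```rust
use std::collections::HashMap;

// Toy stand-in for the per-slot bank hash table kept by AccountsDb.
struct BankHashes(HashMap<u64, [u8; 32]>);

impl BankHashes {
    // Creating the entry for a slot is only expected to happen once.
    fn set_hash(&mut self, slot: u64, parent: u64) -> Result<(), String> {
        if self.0.contains_key(&slot) {
            return Err(format!(
                "set_hash: already exists; multiple forks with shared slot {slot} as child (parent: {parent})!?"
            ));
        }
        self.0.insert(slot, [0u8; 32]);
        Ok(())
    }
}

fn main() {
    let mut hashes = BankHashes(HashMap::new());
    // First bank for 154185568, created during startup ledger processing.
    assert!(hashes.set_hash(154185568, 154185563).is_ok());
    // Second bank for the same slot, created by generate_new_bank_forks() because
    // its parent 154185563 is still unrooted: the entry already exists.
    assert!(hashes.set_hash(154185568, 154185563).is_err());
}
```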

Proposed Solution

A couple options I see:

  • Make the blockstore function that ReplayStage::generate_new_bank_forks() calls filter out dead slots (a rough sketch follows this list)
  • Do more cleanup on the dead slot in initial processing such that creating & dropping the bank the second time doesn't error
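
A rough sketch of the first option, using hypothetical types in place of the real Blockstore/BankForks plumbing: when collecting child slots of still-active banks, drop any slot the blockstore has already marked dead, so ReplayStage never builds a second bank for it.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the blockstore's record of slots that failed replay.
struct DeadSlots(HashSet<u64>);

impl DeadSlots {
    fn is_dead(&self, slot: u64) -> bool {
        self.0.contains(&slot)
    }
}

/// Sketch: collect (parent, child) pairs for new banks, skipping dead children.
fn next_slots_to_replay(
    children_by_parent: &HashMap<u64, Vec<u64>>, // parent slot -> child slots seen in blockstore
    active_banks: &[u64],                        // frozen-but-unrooted banks, e.g. 154185563
    dead_slots: &DeadSlots,
) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for &parent in active_banks {
        if let Some(children) = children_by_parent.get(&parent) {
            for &child in children {
                // 154185568 would be filtered here instead of getting a second bank.
                if !dead_slots.is_dead(child) {
                    out.push((parent, child));
                }
            }
        }
    }
    out
}

fn main() {
    let mut children = HashMap::new();
    children.insert(154185563u64, vec![154185568u64]);
    let dead = DeadSlots([154185568u64].into_iter().collect());
    // With the dead child filtered out, no new bank is created for 154185568.
    assert!(next_slots_to_replay(&children, &[154185563], &dead).is_empty());
    println!("dead child slots filtered from new bank forks");
}
```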

Full log

bug.log

steviez commented Oct 11, 2022

This is a distinct failure from #28236, but similar in the sense that we were creating a bank for the same slot multiple times.

sakridge added the validator label Oct 21, 2022
github-actions bot added the stale label Oct 23, 2023
github-actions bot closed this as not planned Oct 31, 2023
steviez removed the stale label Oct 31, 2023
steviez reopened this Feb 21, 2024
steviez self-assigned this Feb 21, 2024