Problem
On a node running v1.14 (the issue is present in master too), we observed a panic shortly after the node fired up ReplayStage. The following annotated log lines tell the story of what is happening and why we end up with a panic:
// Validator process starts
[2022-10-07T20:05:04.522762389Z INFO solana_validator] Starting validator with: ArgsOs {
// Ledger processing begins for what is already present
[2022-10-07T20:10:04.332963953Z INFO solana_ledger::blockstore_processor] Processing ledger from slot 154185240...
[2022-10-07T20:10:04.333307300Z INFO solana_ledger::blockstore_processor] ledger holds data through slot 154185596
// 154185568 bank created
[2022-10-07T20:12:35.339310609Z INFO solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot=154185568i block_height=139237307i parent_slot=154185563i ...
// 154185568 fails to verify and is marked dead
[2022-10-07T20:12:36.251737344Z WARN solana_ledger::blockstore_processor] slot 154185568 failed to verify: failed to load entries, error: blockstore error
// Ledger processing finished, 154185563 still an active bank
[2022-10-07T20:12:42.324282843Z INFO solana_ledger::blockstore_processor] ledger processed in 2 minutes, 37 seconds, 990 ms, 457 µs and 693 ns. root slot is 154185555, 32 banks: 154185555, 154185556, 154185557, 154185558, 154185559, 154185560, 154185561, 154185562, 154185563, 154185564, 154185565, 154185566, 154185567, 154185569, 154185570, 154185571, 154185572, 154185573, 154185574, 154185575, 154185576, 154185577, 154185578, 154185579, 154185580, 154185581, 154185582, 154185583, 154185584, 154185585, 154185586, 154185587
// set_hash error (this is a subroutine of bank creation for 154185568)
[2022-10-07T20:12:42.373863730Z ERROR solana_runtime::accounts_db] set_hash: already exists; multiple forks with shared slot 154185568 as child (parent: 154185563)!?
// 154185568 bank created a second time since its parent (154185563) is not yet rooted and is therefore found by generate_new_bank_forks()
[2022-10-07T20:12:42.374852562Z INFO solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot=154185568i block_height=139237307i parent_slot=154185563i ...
// ReplayStage rejects 154185568 since we had previously marked it dead
[2022-10-07T20:12:42.610532191Z ERROR solana_metrics::metrics] datapoint: replay-stage-mark_dead_slot error="error: FailedToLoadEntries(DeadSlot)" slot=154185568i
// Debug panic that was put in to provide more information than the eventual panic that we would hit
[2022-10-07T20:12:56.778393865Z INFO solana_runtime::accounts_db] remove_dead_slots_metadata: 1 dead slots, dead_slots: [154185568]
thread 'solBgAccounts' panicked at 'cleaning slot that was duplicate: 154185568', runtime/src/accounts_db.rs:7590:21
stack backtrace:
0: rust_begin_unwind
at ./rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at ./rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
2: solana_runtime::accounts_db::AccountsDb::purge_slot_cache_pubkeys
3: solana_runtime::accounts_db::AccountsDb::purge_slot_cache
4: solana_runtime::accounts_db::AccountsDb::purge_slot
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2022-10-07T20:12:56.820420647Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solBgAccounts" one=1i message="panicked at 'cleaning slot that was duplicate: 154185568', runtime/src/accounts_db.rs:7590:21" location="runtime/src/accounts_db.rs:7590:21" version="\"1.14.6 (src:devbuild; feat:940802714)\""
So, basically, the issue is that the slot gets marked dead during startup ledger processing, but is then processed by ReplayStage a second time. ReplayStage can handle replaying dead slots; however, since this was the second time a bank was created for the slot, we hit that set_hash error and things were already in a bad state.
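To make the double processing concrete, here is a heavily simplified sketch of the two passes over the same slot. None of the types or the set_hash signature below are the real accounts_db code; they are hypothetical stand-ins that only mirror the ordering visible in the log above.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for the per-slot hash bookkeeping that set_hash guards;
// not the real AccountsDb.
#[derive(Default)]
struct ToyAccountsDb {
    slot_hashes: HashMap<u64, &'static str>,
}

impl ToyAccountsDb {
    // Mirrors the "set_hash: already exists" guard seen in the log: a slot is
    // only expected to receive a hash entry once.
    fn set_hash(&mut self, slot: u64, parent: u64) -> Result<(), String> {
        if self.slot_hashes.contains_key(&slot) {
            return Err(format!(
                "set_hash: already exists; multiple forks with shared slot {slot} as child (parent: {parent})!?"
            ));
        }
        self.slot_hashes.insert(slot, "placeholder-hash");
        Ok(())
    }
}

fn main() {
    let mut db = ToyAccountsDb::default();
    let mut dead_slots: HashSet<u64> = HashSet::new();

    // Pass 1: startup ledger processing creates the bank for 154185568, then
    // fails to verify the slot and marks it dead. Its parent 154185563 is
    // still unrooted (the root is 154185555).
    db.set_hash(154185568, 154185563).unwrap();
    dead_slots.insert(154185568);

    // Pass 2: ReplayStage::generate_new_bank_forks() still sees 154185568 as a
    // child of the unrooted 154185563 and creates a second bank for it, which
    // trips the guard; the later dead-slot cleanup then panics.
    let err = db.set_hash(154185568, 154185563).unwrap_err();
    println!("{err}");
}
```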
Proposed Solution
A couple of options I see:
Make the blockstore function that ReplayStage::generate_new_bank_forks() calls filter out dead slots (see the sketch below)
Do more cleanup on the dead slot in initial processing such that creating & dropping the bank the second time doesn't error
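A minimal sketch of what the first option could look like, using hypothetical stand-in types rather than the real Blockstore/ReplayStage APIs: when collecting children of the active banks, consult the dead-slot record and skip anything already marked dead, so a second bank is never created for a slot like 154185568.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in; the real change would live in the blockstore helper
// that ReplayStage::generate_new_bank_forks() calls.
struct ToyBlockstore {
    dead_slots: HashSet<u64>,
    children_by_parent: HashMap<u64, Vec<u64>>,
}

impl ToyBlockstore {
    fn is_dead(&self, slot: u64) -> bool {
        self.dead_slots.contains(&slot)
    }

    // Children of `parents` that still need banks, with dead slots filtered out.
    fn live_children(&self, parents: &[u64]) -> Vec<u64> {
        parents
            .iter()
            .flat_map(|p| self.children_by_parent.get(p).cloned().unwrap_or_default())
            .filter(|child| !self.is_dead(*child)) // option 1: skip dead slots here
            .collect()
    }
}

fn main() {
    let mut blockstore = ToyBlockstore {
        dead_slots: HashSet::new(),
        // Parentage simplified for illustration: 154185569 is just a
        // hypothetical live sibling here.
        children_by_parent: HashMap::from([(154185563, vec![154185568, 154185569])]),
    };
    // Startup processing marked 154185568 dead.
    blockstore.dead_slots.insert(154185568);

    // 154185568 must never get a second bank; only the live child comes back.
    assert_eq!(blockstore.live_children(&[154185563]), vec![154185569]);
    println!("dead slot filtered out of new bank forks");
}
```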
Full log
bug.log