Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.16: Add ability to output Bank hash details (backport of #32632) #34257

Merged
merged 3 commits into from
Dec 1, 2023

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Nov 28, 2023

This is an automatic backport of pull request #32632 done by Mergify.
Cherry-pick of 6bbf514 has failed:

On branch mergify/bp/v1.16/pr-32632
Your branch is up to date with 'origin/v1.16'.

You are currently cherry-picking commit 6bbf514e78.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   core/src/replay_stage.rs
	modified:   runtime/src/bank.rs
	new file:   runtime/src/bank/bank_hash_details.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   Cargo.lock
	both modified:   ledger-tool/src/args.rs
	both modified:   ledger-tool/src/main.rs
	both modified:   programs/sbf/Cargo.lock
	both modified:   runtime/Cargo.toml
	both modified:   runtime/src/accounts_db.rs
	both modified:   validator/src/main.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally


Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

  • @Mergifyio refresh will re-evaluate the rules
  • @Mergifyio rebase will rebase this PR on its base branch
  • @Mergifyio update will merge the base branch into this PR
  • @Mergifyio backport <destination> will backport this PR on <destination> branch

Additionally, on Mergify dashboard you can:

  • look at your merge queues
  • generate the Mergify configuration with the config editor.

Finally, you can contact us on https://mergify.com

When a consensus divergance occurs, the current workflow involves a
handful of manual steps to hone in on the offending slot and
transaction. This process isn't overly difficult to execute; however, it
is tedious and currently involves creating and parsing logs.

This change introduces functionality to output a debug file that
contains the components go into the bank hash. The file can be generated
in two ways:
- Via solana-validator when the node realizes it has diverged
- Via solana-ledger-tool verify by passing a flag

When a divergance occurs now, the steps to debug would be:
- Grab the file from the node that diverged
- Generate a file for the same slot with ledger-tool with a known good
  version
- Diff the files, they are pretty-printed json

(cherry picked from commit 6bbf514)

# Conflicts:
#	Cargo.lock
#	ledger-tool/src/args.rs
#	ledger-tool/src/main.rs
#	programs/sbf/Cargo.lock
#	runtime/Cargo.toml
#	runtime/src/accounts_db.rs
#	validator/src/main.rs
@mergify mergify bot added the conflicts label Nov 28, 2023
Copy link

codecov bot commented Nov 29, 2023

Codecov Report

Merging #34257 (3c6b136) into v1.16 (302238b) will increase coverage by 0.0%.
Report is 1 commits behind head on v1.16.
The diff coverage is 89.7%.

❗ Current head 3c6b136 differs from pull request most recent head d0f7ff2. Consider uploading reports for the commit d0f7ff2 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##            v1.16   #34257    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         766      767     +1     
  Lines      209077   209251   +174     
========================================
+ Hits       171128   171312   +184     
+ Misses      37949    37939    -10     

@t-nelson
Copy link
Contributor

we really shouldn't be debugging many legit hash mismatches on 1.16 in the next ~6wks. do we really need this?

@steviez
Copy link
Contributor

steviez commented Nov 29, 2023

we really shouldn't be debugging many legit hash mismatches on 1.16 in the next ~6wks. do we really need this?

Well, we've had a handful of panic reports on v1.16:

The couple of these failures that we've looked into haven't been reproducible with solana-ledger-tool. At this point, we really have no idea what the cause is (hardware issue, non-deterministic bug, gremlins, etc). This change isn't a fix for these issues, but it could allow us to gain additional insight into the failures and potentially lead to a solution.

To me, not backporting this change is saying that we're ok with intermittent panics until we get to v1.17 on mnb, either because v1.17 somehow resolves the issue or because v1.17 has this change present and will allow us to debug then.

I realize the diff is large for backport this late in the release cycle, but if you trace it down, the only place where it gets calls on a production validator is when the node is essentially already doomed in the dump/repair/retry loop

@t-nelson
Copy link
Contributor

right, but the stable branch should be sufficiently solidified at this point that carrying the patch set externally won't be much effort. we can just build a new patched bin should the need arise. no reason to impose any risk on production

@steviez
Copy link
Contributor

steviez commented Nov 29, 2023

right, but the stable branch should be sufficiently solidified at this point that carrying the patch set externally won't be much effort. we can just build a new patched bin should the need arise. no reason to impose any risk on production

I think the most important part of this PR is that solana-validator will create the file when it is about to panic after it has diverged. If we don't merge this PR, the order of operations is:

  • Validator hits panic
  • If we're lucky, we get validator to apply patch
  • We wait and see if we observe the panic on the node that applied the patch
  • Repeat the above process for each new validator that encounters the panic

In #33740 specifically, I got Zan to apply the patch eventually after he ran into the panic several times. However, he has not yet seen the panic with the patch present in his binary. Additionally, he rightfully so wants some stability with his node

@steviez
Copy link
Contributor

steviez commented Nov 29, 2023

Quick datapoint, looking at mnb metrics from the last 7 days as of time of me writing this:

  • There are 35 instances of the We are attempting to dump a block that we produced ... (node is leader)
  • There are 28 instances of the "We have tried to repair duplicate blocks ...` logs (node is NOT leader)

Skimming through, a couple of these are operator error (ie running v1.14.X), but plenty are current versions. This also doesn't account for number of people who may have hit this who aren't reporting metrics. Granted, someone has to chime up in Discord that they hit this and send us the file, but if we bake the change into the binary, I feel like we'll have a higher percentage of getting some hits

@steviez steviez requested a review from t-nelson November 30, 2023 21:54
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ @brooksprumo's approval

@steviez
Copy link
Contributor

steviez commented Dec 1, 2023

We should also backport #34256 to v1.16 once this backport lands; this PR covers the non-leader path, that PR will create the same debug file when the node in question is the leader

brooksprumo
brooksprumo previously approved these changes Dec 1, 2023
Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accounts-db directories look good to me.

There's some nit comments that are basically questions on consistency with v1.17/master, but they are not deal breakers IMO.

validator/src/main.rs Show resolved Hide resolved
ledger-tool/src/args.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@steviez steviez merged commit 08eb469 into v1.16 Dec 1, 2023
31 checks passed
@steviez steviez deleted the mergify/bp/v1.16/pr-32632 branch December 1, 2023 17:12
@bji
Copy link
Contributor

bji commented Dec 6, 2023

right, but the stable branch should be sufficiently solidified at this point that carrying the patch set externally won't be much effort. we can just build a new patched bin should the need arise. no reason to impose any risk on production

I think the most important part of this PR is that solana-validator will create the file when it is about to panic after it has diverged. If we don't merge this PR, the order of operations is:

  • Validator hits panic
  • If we're lucky, we get validator to apply patch
  • We wait and see if we observe the panic on the node that applied the patch
  • Repeat the above process for each new validator that encounters the panic

In #33740 specifically, I got Zan to apply the patch eventually after he ran into the panic several times. However, he has not yet seen the panic with the patch present in his binary. Additionally, he rightfully so wants some stability with his node

I'm ashamed to admit that I did not return to 1.16 with the patch that enables the extra debugging. I am still on 1.17.5. I promise that tomorrow I will downgrade my secondary to 1.16.21. 1.17.5 has been completely solid for me, so I feel like "that's the solution anyway". But still I'll downgrade to help out with the debugging of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants