-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.16: Add ability to output Bank hash details (backport of #32632) #34257
Conversation
When a consensus divergance occurs, the current workflow involves a handful of manual steps to hone in on the offending slot and transaction. This process isn't overly difficult to execute; however, it is tedious and currently involves creating and parsing logs. This change introduces functionality to output a debug file that contains the components go into the bank hash. The file can be generated in two ways: - Via solana-validator when the node realizes it has diverged - Via solana-ledger-tool verify by passing a flag When a divergance occurs now, the steps to debug would be: - Grab the file from the node that diverged - Generate a file for the same slot with ledger-tool with a known good version - Diff the files, they are pretty-printed json (cherry picked from commit 6bbf514) # Conflicts: # Cargo.lock # ledger-tool/src/args.rs # ledger-tool/src/main.rs # programs/sbf/Cargo.lock # runtime/Cargo.toml # runtime/src/accounts_db.rs # validator/src/main.rs
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## v1.16 #34257 +/- ##
========================================
Coverage 81.8% 81.8%
========================================
Files 766 767 +1
Lines 209077 209251 +174
========================================
+ Hits 171128 171312 +184
+ Misses 37949 37939 -10 |
we really shouldn't be debugging many legit hash mismatches on 1.16 in the next ~6wks. do we really need this? |
Well, we've had a handful of panic reports on v1.16:
The couple of these failures that we've looked into haven't been reproducible with To me, not backporting this change is saying that we're ok with intermittent panics until we get to v1.17 on mnb, either because v1.17 somehow resolves the issue or because v1.17 has this change present and will allow us to debug then. I realize the diff is large for backport this late in the release cycle, but if you trace it down, the only place where it gets calls on a production validator is when the node is essentially already doomed in the dump/repair/retry loop |
right, but the stable branch should be sufficiently solidified at this point that carrying the patch set externally won't be much effort. we can just build a new patched bin should the need arise. no reason to impose any risk on production |
I think the most important part of this PR is that
In #33740 specifically, I got Zan to apply the patch eventually after he ran into the panic several times. However, he has not yet seen the panic with the patch present in his binary. Additionally, he rightfully so wants some stability with his node |
Quick datapoint, looking at mnb metrics from the last 7 days as of time of me writing this:
Skimming through, a couple of these are operator error (ie running v1.14.X), but plenty are current versions. This also doesn't account for number of people who may have hit this who aren't reporting metrics. Granted, someone has to chime up in Discord that they hit this and send us the file, but if we bake the change into the binary, I feel like we'll have a higher percentage of getting some hits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ @brooksprumo's approval
We should also backport #34256 to v1.16 once this backport lands; this PR covers the non-leader path, that PR will create the same debug file when the node in question is the leader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The accounts-db directories look good to me.
There's some nit comments that are basically questions on consistency with v1.17/master, but they are not deal breakers IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
I'm ashamed to admit that I did not return to 1.16 with the patch that enables the extra debugging. I am still on 1.17.5. I promise that tomorrow I will downgrade my secondary to 1.16.21. 1.17.5 has been completely solid for me, so I feel like "that's the solution anyway". But still I'll downgrade to help out with the debugging of this issue. |
This is an automatic backport of pull request #32632 done by Mergify.
Cherry-pick of 6bbf514 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Mergify commands and options
More conditions and actions can be found in the documentation.
You can also trigger Mergify actions by commenting on this pull request:
@Mergifyio refresh
will re-evaluate the rules@Mergifyio rebase
will rebase this PR on its base branch@Mergifyio update
will merge the base branch into this PR@Mergifyio backport <destination>
will backport this PR on<destination>
branchAdditionally, on Mergify dashboard you can:
Finally, you can contact us on https://mergify.com