Warn if validator appears to be repairing excessively #33320
Conversation
Force-pushed from 18e6e57 to b396244.
Codecov Report

@@            Coverage Diff            @@
##           master   #33320     +/-   ##
=========================================
- Coverage    81.8%    81.8%     -0.1%
=========================================
  Files         806      806
  Lines      217499   217660      +161
=========================================
+ Hits       178019   178138      +119
- Misses      39480    39522       +42
Force-pushed from 466a2e7 to ed49337.
Force-pushed from 3e61009 to bc04854.
mind breaking the plumbing out to a separate commit to ease review? same pr is fine, just wanna see the functional changes in isolation so i don't miss anything
Force-pushed from 2549a13 to 062e8ec.
The change seems overly intrusive (34 files, 347 LOC) just to emit a warning, and most operators do not monitor logs anyway.
Granted it touches lots of files, but it's pretty basic refactoring. Yes, we have the tools to investigate as in #33166, but it still requires an investigation. Inspecting logs is inconvenient for us, but more convenient for operators. If this reduces instances where we have to investigate, I think it would be a win.
Force-pushed from 062e8ec to b3937c8.
Do operators look at warnings in the log? @t-nelson
ping @behzadnouri
Force-pushed from b3937c8 to df3e5c2.
does posting screenshots on discord constitute "looking"?
I guess I'm asking if it's worth trying to help operators with log messages? 🤷
If the code change were small, adding some more logs would probably be fine.
considering only the second commit, it's not so bad. the first is big 'cause
The health status calculation is being changed by #33651. I'll update this PR following that change.
how many nodes are typically in a state that would be logging with this? how frequently? we can already observe these nodes from metrics, right?
Not a huge amount, maybe ~1 per week in the top 100. I'll look across the full set of validators.
Yeah, we see a bump in repair, then find the outliers. I was hoping this would help eliminate some noise.
Problem
It is possible for validators to stay in sync with a cluster solely using repair. This is substantially less efficient than receiving data via turbine and most likely indicates a network-related misconfiguration. Misconfigured nodes generating substantial repair traffic also skew metric data.
Summary of Changes
Maintain slot repair history and generate warnings if a node appears to be excessively repairing slots while indicating that it is "caught up" with the cluster.
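To make the idea concrete, here is a minimal, self-contained sketch of the approach described above. This is not the PR's actual code: `RepairHistory`, `record_slot`, `maybe_warn`, the 100-slot window, and the 50% threshold are all hypothetical names and values chosen for illustration; the real validator would hook into its existing repair stats and health check and emit the warning through its normal logging facility.

```rust
use std::collections::VecDeque;

// Hypothetical tuning knobs (not taken from the PR).
const WINDOW: usize = 100; // how many recent slots to consider
const REPAIR_RATIO_WARN_THRESHOLD: f64 = 0.5; // warn above 50% repaired

#[derive(Default)]
struct RepairHistory {
    // true = slot was completed using repaired shreds,
    // false = slot was received via turbine alone.
    recent: VecDeque<bool>,
}

impl RepairHistory {
    /// Record whether the most recently completed slot needed repair,
    /// keeping only the last WINDOW slots.
    fn record_slot(&mut self, used_repair: bool) {
        if self.recent.len() == WINDOW {
            self.recent.pop_front();
        }
        self.recent.push_back(used_repair);
    }

    /// Fraction of recent slots that were completed via repair.
    fn repair_ratio(&self) -> f64 {
        if self.recent.is_empty() {
            return 0.0;
        }
        let repaired = self.recent.iter().filter(|&&r| r).count();
        repaired as f64 / self.recent.len() as f64
    }

    /// Warn when the node claims to be caught up with the cluster
    /// yet most of its recent slots arrived via repair.
    /// `is_caught_up` would come from the node's existing health check.
    fn maybe_warn(&self, is_caught_up: bool) {
        let ratio = self.repair_ratio();
        if is_caught_up && ratio > REPAIR_RATIO_WARN_THRESHOLD {
            // A real validator would use its logging macros here.
            eprintln!(
                "WARN: {:.0}% of the last {} slots were completed via repair; \
                 check that turbine ports are reachable",
                ratio * 100.0,
                self.recent.len()
            );
        }
    }
}

fn main() {
    let mut history = RepairHistory::default();
    // Simulate a node that repairs ~2/3 of slots while reporting healthy.
    for i in 0..WINDOW {
        history.record_slot(i % 3 != 0);
    }
    history.maybe_warn(true);
}
```

Gating the warning on the health check is what keeps it from firing during normal catch-up: a node that is far behind is expected to repair heavily, so only a node that both reports itself caught up and keeps repairing is flagged.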
Fixes #