metrics: surface validator status, e.g. jailed #3746
From current testnet:
There are a few problems in there, such as `active=0` and "jailed" being a negative value. Let's inspect the transition logic in the staking component and make sure gauges are incremented and decremented correctly.
Similarly, the missed blocks metric is not accurate:
Those values are far too high: the total block height is only ~60k at time of writing. At a guess, we're incrementing the `missed_blocks` gauge by the entirety of historical missed blocks for that validator; see `penumbra/crates/core/component/stake/src/component/validator_handler/validator_manager.rs`, lines 165 to 166, at commit `6ab2175`.
I'm surprised […] edit: ah nvm, got it.
Ok, I get why they're different, but not why those values are so high.
We should be computing […]
Okay, wow - I would think that […]
Took a closer look at this today. As for the validator state metrics, the current setup can't possibly work: we're incrementing and decrementing gauges based on validator state transitions, but gauges default to zero, so we're actually emitting metrics on the validator state transitions observed by a specific instance of pd over the course of its uptime. Bouncing the pd service will wipe those metrics and cause them to tick up again over time. That's why we're seeing negative numbers: a freshly started pd instance begins with its count of jailed validators at 0, so when validators that were jailed before startup leave the jailed state, the decrements drive that number negative. To resolve, we must ground the metrics in actual historical state. Two options:
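A minimal sketch of the failure mode and the grounded alternative (all names here are illustrative stand-ins, not the actual pd code):

```rust
// Stand-in for a validator's state; the real staking component has more states.
#[derive(Clone, Copy, PartialEq)]
enum ValidatorState {
    Active,
    Jailed,
}

// Delta-based gauge, as in the current setup: it starts at zero on every pd
// restart and only reflects transitions observed during this process's uptime.
// If a validator jailed *before* startup gets unjailed, the decrement pushes
// the gauge below zero.
fn delta_gauge_after_restart(jails_seen: i64, unjails_seen: i64) -> i64 {
    // gauge starts at 0 at process start
    jails_seen - unjails_seen
}

// Grounded gauge: recompute the value from actual historical state and
// clobber the gauge with it, so restarts can't skew the number.
fn grounded_jailed_count(validators: &[ValidatorState]) -> i64 {
    validators
        .iter()
        .filter(|s| **s == ValidatorState::Jailed)
        .count() as i64
}

fn main() {
    // A fresh pd instance that observes two unjailings and no jailings
    // reports a nonsensical -2 jailed validators.
    assert_eq!(delta_gauge_after_restart(0, 2), -2);

    // Grounding in state yields the true count regardless of process uptime.
    use ValidatorState::*;
    let validators = [Active, Jailed, Active, Jailed, Jailed];
    assert_eq!(grounded_jailed_count(&validators), 3);
    println!("ok");
}
```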
Here's a lightly edited terminal session illustrating the problem:
Once fixed, we should tack on an integration test to make sure these metrics exist at network start: active validators should never be zero at time of join.
Should we be using metrics for this at all? I think the answer is probably no. I don't think we should be using metrics for anything that requires reconciling ongoing events with historical data. We already have a system for that, the event mechanism, and we should use it. And there is a standard pattern for rendering events on blockchains, namely a block explorer. If someone wants to check whether their validator is online, or what its status is, they should open the validator's page in a block explorer. But we shouldn't try to build a half-featured block explorer inside Grafana just because our existing block explorer isn't functional.
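For illustration, the event-based pattern could look roughly like this (a hypothetical sketch; the types and names are made up, not Penumbra's actual event API):

```rust
// A structured event emitted once per state transition, at a given height.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Active,
    Jailed,
}

struct StateTransitionEvent {
    validator: String,
    new_state: Status,
    height: u64,
}

// An off-chain indexer (e.g. one backing a block explorer page) derives a
// validator's current status by taking its latest transition. There's no
// gauge arithmetic in pd, and no dependence on any one process's uptime.
fn current_status(events: &[StateTransitionEvent], validator: &str) -> Option<Status> {
    events
        .iter()
        .filter(|e| e.validator == validator)
        .max_by_key(|e| e.height)
        .map(|e| e.new_state.clone())
}

fn main() {
    let events = vec![
        StateTransitionEvent {
            validator: "validator-a".into(),
            new_state: Status::Jailed,
            height: 100,
        },
        StateTransitionEvent {
            validator: "validator-a".into(),
            new_state: Status::Active,
            height: 250,
        },
    ];
    // The latest transition (height 250) determines the current status.
    assert_eq!(current_status(&events, "validator-a"), Some(Status::Active));
    println!("ok");
}
```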
We were inappropriately `increment`ing a gauge to track missed blocks, when instead we should have been clobbering the value via `set`, as we know the precise number from the state machine. This fixes a broken metric for missed blocks that was growing too fast, emitting wrong values. Refs #3746.
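The difference can be shown with a toy gauge (a simplified stand-in for a metrics gauge, since the exact macro API depends on the crate version; the function names are illustrative):

```rust
// Toy gauge modeling the relevant part of a metrics gauge.
#[derive(Default)]
struct Gauge {
    value: f64,
}

impl Gauge {
    fn increment(&mut self, v: f64) {
        self.value += v;
    }
    fn set(&mut self, v: f64) {
        self.value = v;
    }
}

// Buggy pattern: incrementing by the cumulative total every time the
// handler runs compounds the value.
fn buggy_missed_blocks(total_missed: f64, runs: usize) -> f64 {
    let mut g = Gauge::default();
    for _ in 0..runs {
        g.increment(total_missed);
    }
    g.value
}

// Fixed pattern: clobber the gauge with the exact value known from the
// state machine, so repeated runs are idempotent.
fn fixed_missed_blocks(total_missed: f64, runs: usize) -> f64 {
    let mut g = Gauge::default();
    for _ in 0..runs {
        g.set(total_missed);
    }
    g.value
}

fn main() {
    // A validator with 100 historical missed blocks, processed 3 times:
    assert_eq!(buggy_missed_blocks(100.0, 3), 300.0); // grows without bound
    assert_eq!(fixed_missed_blocks(100.0, 3), 100.0); // correct
    println!("ok");
}
```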
This is a great take, and we have progress toward the event pattern in #4277.
Closing in favor of #4336, which reframes the ask as event emission.
During Testnet 64, folks in Discord have reported an uptick in jailing rates. It's difficult to reason concretely about that observation because we don't have good visibility into rates of banning. We do emit metrics such as `penumbra_stake_validators_jailed` in pd, but aren't visualizing those anywhere out of the box. We also try to log the status change via a span (https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/core/component/stake/src/component.rs#L125-L135), but oddly I'm not seeing that span label return results in our Testnet 64 logs, even though we have debug logs on for the validators we run.