metrics: surface validator status, e.g. jailed #3746

Closed
conorsch opened this issue Feb 6, 2024 · 11 comments
Labels: A-telemetry (Area: Metrics, logging, and other observability-related features), _P-V1 (Priority: slated for V1 release)

@conorsch
Contributor

conorsch commented Feb 6, 2024

During Testnet 64, folks in Discord have reported an uptick in jailing rates. It's difficult to reason concretely about that observation because we don't have good visibility into jailing rates. We do emit metrics such as penumbra_stake_validators_jailed in pd, but we aren't visualizing them anywhere out of the box. We also try to log the status change via a span (https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/core/component/stake/src/component.rs#L125-L135), but oddly I'm not seeing that span label return results in our Testnet 64 logs, even though we have debug logging enabled for the validators we run.

@conorsch conorsch added the A-telemetry Area: Metrics, logging, and other observability-related features label Feb 6, 2024
@conorsch conorsch self-assigned this Feb 6, 2024
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Feb 6, 2024
@github-project-automation github-project-automation bot moved this to Future in Testnets Feb 6, 2024
@conorsch
Contributor Author

From current testnet:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake | rg -v missed_blocks
penumbra_stake_validators_inactive 363
penumbra_stake_validators_jailed -23
penumbra_stake_validators_disabled 29
penumbra_stake_validators_active 0
penumbra_stake_validators_defined 64

There are a few problems in there, such as active=0 and "jailed" being a negative value. Let's inspect the transition logic in the staking component and make sure gauges are incremented and decremented correctly.
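For reference, here's a minimal, self-contained sketch of the bookkeeping we'd expect on each transition. This is not the actual staking component code: the ValidatorState enum and helper functions are illustrative, but the gauge API matches the metrics crate the code already uses.

use metrics::{gauge, Gauge};

// Illustrative mirror of the validator states exported above.
#[derive(Copy, Clone)]
enum ValidatorState {
    Defined,
    Inactive,
    Active,
    Jailed,
    Disabled,
}

// Look up the gauge tracking how many validators are currently in `state`.
fn gauge_for(state: ValidatorState) -> Gauge {
    match state {
        ValidatorState::Defined => gauge!("penumbra_stake_validators_defined"),
        ValidatorState::Inactive => gauge!("penumbra_stake_validators_inactive"),
        ValidatorState::Active => gauge!("penumbra_stake_validators_active"),
        ValidatorState::Jailed => gauge!("penumbra_stake_validators_jailed"),
        ValidatorState::Disabled => gauge!("penumbra_stake_validators_disabled"),
    }
}

// On a transition, the old state's gauge should go down by exactly one and the
// new state's gauge up by exactly one; anything else will skew the counts.
fn record_state_transition(old: ValidatorState, new: ValidatorState) {
    gauge_for(old).decrement(1.0);
    gauge_for(new).increment(1.0);
}

Any code path that updates only one side of that pair, or runs more than once per transition, would produce counts like the ones above.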

@conorsch
Contributor Author

Similarly, the missed blocks metric is not accurate:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake_missed_blocks | sort -k 2 -n -r | head -n 20
penumbra_stake_missed_blocks{identity_key="penumbravalid14e54ukuppxs3f6dsg6flx9tsmqcv363tvvzrdku8fgmagjd435fqexu7zs"} 148359350
penumbra_stake_missed_blocks{identity_key="penumbravalid1u60u8405jg8mlz0epmhqfslqqkjurx6l4pm8n3j0ves66sztwqrqctw4f2"} 137059677
penumbra_stake_missed_blocks{identity_key="penumbravalid1qrxgv0nqm8a5edffr5t5ykhjuchkmwkejsgfhtc38t2y0kf33qrsql244n"} 133699997
penumbra_stake_missed_blocks{identity_key="penumbravalid1g3wfpjkplhktcq5d3t3l5ud4u34qwh6kxacwrkd8uacvh5jlcqqqzqnhxt"} 130544894
penumbra_stake_missed_blocks{identity_key="penumbravalid1ms5kyy9qz0dggseq9ud9gl2wmlyl8c767hrw66ku35tksczsmv8qxlxxsd"} 127920260
penumbra_stake_missed_blocks{identity_key="penumbravalid1qqqvkxl2eewjl0f3kakcvryv4g6kq6ua8w8xk4g5j7n9pn50ycqstgekuv"} 119628111
penumbra_stake_missed_blocks{identity_key="penumbravalid1sqmemw9qeswm8fmvk8p8g2z49rhrxkyu09h9keqz5s6xkxh3tc8q9j32w0"} 99881573
penumbra_stake_missed_blocks{identity_key="penumbravalid15tu5hycly4yltuk6c4vsx45kn003amet69eqp2sxcjnw46udaczqf3n354"} 92749412
penumbra_stake_missed_blocks{identity_key="penumbravalid1ajryejwx349xl9lvh8sym09jle9chaum4628k7c0n5rpaw9v2cgqrtn9z4"} 85621965
penumbra_stake_missed_blocks{identity_key="penumbravalid1uzn79d9s736t6nga7060ts9ghncuzhahr92nldamwhjhc726458qdf509n"} 57060921
penumbra_stake_missed_blocks{identity_key="penumbravalid1lndq7e036xvuu7s7kavf9ncnqq70czyyj8gdenyzn8k9cyuwvg9qqm33vm"} 48428299
penumbra_stake_missed_blocks{identity_key="penumbravalid1wpkhqt7p3t84ux3vxdtscp8t8rz89xrarfc0ey0z79tux0na0y8qz72s8r"} 47233969
penumbra_stake_missed_blocks{identity_key="penumbravalid1zjdq22kr0n0e7u3z363h9ussktu768nqpryr47y6w7rxylx75gpqyfkqw6"} 45140250
penumbra_stake_missed_blocks{identity_key="penumbravalid1nnl5c8ad9gm0kpc42ttmyw29vrrsjaurk8euuj73kj9hvape4vqsm8c2f6"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid182uscvf6qaqq3r7tdy3uzf3thrrym08gnzg3furvdgvfguvgn5qs97p3tm"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid1227gvj8qmfp9pwnxwue3smduhjjk6junn9pgu9p5lru52x6a4spq3m9t4d"} 42277330
penumbra_stake_missed_blocks{identity_key="penumbravalid1mzfd4lk4kpx5gtvwas6e26mhkzh986g7clqwmucxe64md8kuty9qs0pf9c"} 25093054
penumbra_stake_missed_blocks{identity_key="penumbravalid16jcq73cy93hx6dc857drpednfmpe032uxrvzjg00xc0ser54xq8s6j6jjp"} 10619136
penumbra_stake_missed_blocks{identity_key="penumbravalid18sgu2nrded90ntn747223ttv4ggna37t8d8wlfwj42qu487lw58q3adpln"} 10504890
penumbra_stake_missed_blocks{identity_key="penumbravalid15tjp4mww35qmq8fvpkhjsd87ndnj9pthwzf33sn0n6p6vncrvyrqlf7275"} 9501019

Those values are far too high: the total block height is only ~60k at time of writing. At a guess, we're incrementing the missed_blocks gauge by the entirety of historical missed blocks for that validator:

metrics::gauge!(metrics::MISSED_BLOCKS, "identity_key" => identity_key.to_string())
.increment(0.0);

That .increment() call should instead be .set() on the gauge.
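Concretely, the fix probably looks something like this sketch (total_missed here is a stand-in for whatever running count the state machine already tracks; the real code refers to the metric name via the metrics::MISSED_BLOCKS constant):

use metrics::gauge;

// Clobber the gauge with the known total, rather than adding that total to
// whatever value the gauge already holds.
fn record_missed_blocks(identity_key: &str, total_missed: u64) {
    gauge!("penumbra_stake_missed_blocks", "identity_key" => identity_key.to_string())
        .set(total_missed as f64);
}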

@erwanor
Member

erwanor commented Feb 16, 2024

I'm surprised increment(0) and set(0) are any different

edit: ah nvm. got it.

@erwanor
Member

erwanor commented Feb 16, 2024

Ok I get why they're different, but not why those values are so high

@conorsch
Contributor Author

Ok I get why they're different, but not why those values are so high

My understanding is that we should be computing missed_blocks += n, but instead we're computing missed_blocks += (missed_blocks + n).
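To make the compounding concrete, here's a tiny standalone simulation (purely illustrative: it assumes one block is missed at every height, and keeps the state machine's true running total separate from the exported gauge):

// Compare the buggy strategy (increment the gauge by the running total) with
// the intended one (set the gauge to the running total).
fn main() {
    let (mut total, mut gauge_buggy, mut gauge_fixed) = (0u64, 0u64, 0u64);
    for _height in 0..5 {
        total += 1;           // the state machine's true missed-block count
        gauge_buggy += total; // increment by the running total: compounds every block
        gauge_fixed = total;  // set to the running total: always correct
    }
    println!("true total = {total}, buggy gauge = {gauge_buggy}, fixed gauge = {gauge_fixed}");
    // prints: true total = 5, buggy gauge = 15, fixed gauge = 5
}

The buggy gauge compounds every block, which is how a chain only ~60k blocks tall can report missed-block figures in the hundreds of millions.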

@erwanor
Member

erwanor commented Feb 20, 2024

Okay, wow - I would think that increment(N) does counter += N and set(N) does counter = N but I guess not, good catch

@cratelyn
Contributor

x-ref #3995, i stumbled across this today. i'm going to self-assign this, because i believe i can fix it as part of #4070 / #4040.

@cratelyn cratelyn assigned cratelyn and unassigned conorsch Mar 22, 2024
@cratelyn cratelyn moved this from 🗄️ Backlog to In progress in Penumbra Mar 25, 2024
@cratelyn cratelyn added this to the Sprint 3 milestone Mar 25, 2024
@cratelyn cratelyn moved this from In progress to 🗄️ Backlog in Penumbra Mar 25, 2024
@cratelyn cratelyn removed this from the Sprint 3 milestone Mar 25, 2024
@cratelyn cratelyn added this to the Sprint 4 milestone Apr 8, 2024
@cratelyn cratelyn moved this from Backlog to In progress in Penumbra Apr 8, 2024
@cratelyn cratelyn moved this from In progress to Backlog in Penumbra Apr 9, 2024
@aubrika aubrika assigned conorsch and unassigned cratelyn Apr 10, 2024
@aubrika aubrika added the _P-V1 Priority: slated for V1 release label Apr 10, 2024
@conorsch
Contributor Author

Took a closer look at this today. The missed_blocks issue appears quite simple: the increment -> set fix discussed above should resolve it.

As for the validator state metrics, the current setup can't possibly work: we're incrementing and decrementing gauges based on validator state transitions, but gauges default to zero, so what we're actually emitting is a tally of the state transitions observed by a specific instance of pd over the course of its uptime. Bouncing the pd service will wipe those metrics and let them drift again over time. That's also why we're seeing negative numbers: a freshly started pd instance initializes every gauge at 0, so whenever a validator transitions out of a state it entered before that instance started observing (for example, a jailed validator becoming unjailed, or a validator leaving the active set), the corresponding gauge is decremented below zero.

To resolve this, we must ground the metrics in actual historical state. Two options:

  1. on pd start, query the stake service for the current counts of active, defined, jailed, etc., initialize the gauges to those values, then leave the existing increment/decrement logic in place and trust it to keep the values updated over time; or
  2. modify the existing stake metrics function to perform the stake query itself and set the gauges from those values, so they're always correct (a rough sketch of this approach follows below).
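A rough sketch of what option 2 could look like (illustrative only; the ValidatorCounts struct stands in for whatever the stake query actually returns):

use metrics::gauge;

// Per-state validator counts, however they're obtained from chain state
// (the query itself is elided here).
struct ValidatorCounts {
    defined: u64,
    inactive: u64,
    active: u64,
    jailed: u64,
    disabled: u64,
}

// Derive the counts from actual state on every update and clobber the gauges,
// so the exported values are correct regardless of how long this pd instance
// has been running.
fn set_validator_state_gauges(counts: &ValidatorCounts) {
    gauge!("penumbra_stake_validators_defined").set(counts.defined as f64);
    gauge!("penumbra_stake_validators_inactive").set(counts.inactive as f64);
    gauge!("penumbra_stake_validators_active").set(counts.active as f64);
    gauge!("penumbra_stake_validators_jailed").set(counts.jailed as f64);
    gauge!("penumbra_stake_validators_disabled").set(counts.disabled as f64);
}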

Here's a lightly edited terminal session illustrating the problem:

❯ # shortly after joining, the metrics are nil:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
❯ # a bit later, as an epoch was crossed during syncing up to current height
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
penumbra_stake_validators_active -1
penumbra_stake_validators_jailed 1
❯ # evidence that we're still catching up:
❯ curl -s http://localhost:26657/status | jq -r .result.sync_info.catching_up
true
❯ # bounce the service, see relevant metrics reset:
❯ sudo systemctl restart penumbra cometbft
[sudo] password for conor:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort

Once fixed, we should tack on an integration test to make sure these metrics exist at network start: active validators should never be zero at time of join.
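Such a check could be as simple as scraping the metrics endpoint after the node joins. A rough sketch, assuming pd's metrics listener on localhost:9000 as in the sessions above, plus the reqwest and anyhow crates:

// Assert that the active-validators gauge is exported and nonzero shortly
// after joining a network.
#[test]
fn active_validator_gauge_is_nonzero() -> anyhow::Result<()> {
    let body = reqwest::blocking::get("http://localhost:9000/metrics")?.text()?;
    let active: f64 = body
        .lines()
        .find_map(|line| line.strip_prefix("penumbra_stake_validators_active "))
        .ok_or_else(|| anyhow::anyhow!("active-validators gauge not exported"))?
        .trim()
        .parse()?;
    assert!(active > 0.0, "active validators should never be zero after joining");
    Ok(())
}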

@hdevalence
Member

Should we be using metrics for this at all? I think the answer is probably no. I don't think we should be using metrics for anything that requires reconciling ongoing events with historical data. We already have a system for that, the event mechanism, and we should use it. And there is a standard pattern for rendering events on blockchains, namely a block explorer.

If someone wants to check if their validator is online or its status, they should open the validator's page in a block explorer. But we shouldn't try to build a half-featured block explorer inside Grafana just because our existing block explorer isn't functional.

@cratelyn cratelyn modified the milestones: Sprint 4, Sprint 5 Apr 22, 2024
@aubrika aubrika moved this from Backlog to Todo in Penumbra Apr 22, 2024
@aubrika aubrika modified the milestones: Sprint 5, Sprint 6 May 6, 2024
@aubrika aubrika added the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
conorsch added a commit that referenced this issue May 6, 2024
We were inappropriately `increment`ing a gauge to track missed blocks,
when instead we should have been clobbering the value via `set`,
as we know the precise number from the state machine. This fixes
a broken metric for missed blocks that was growing too fast,
emitting wrong values.

Refs #3746.
conorsch added a commit that referenced this issue May 6, 2024
@conorsch
Contributor Author

conorsch commented May 6, 2024

Should we be using metrics for this at all? [...] We already have a system for that, the event mechanism, and we should use it.

This is a great take, and we have progress toward the event pattern in #4277.

@cratelyn cratelyn removed the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
conorsch added a commit that referenced this issue May 6, 2024
@conorsch
Contributor Author

conorsch commented May 6, 2024

Closing in favor of #4336, which reframes the ask as event emission.

@conorsch conorsch closed this as completed May 6, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Penumbra May 6, 2024
conorsch added a commit that referenced this issue May 16, 2024
The validator state metrics are not accurate, and should be refactored
to use ABCI event emission so that developers can leverage the state
transitions in a sane way. In the meantime, let's remove the incorrect
metrics entirely.

Refs #3746, #4336.
conorsch added a commit that referenced this issue May 20, 2024