metrics: surface validator status, e.g. jailed #3746

Closed
conorsch opened this issue Feb 6, 2024 · 11 comments
Labels: A-telemetry (Area: Metrics, logging, and other observability-related features), _P-V1 (Priority: slated for V1 release)

@conorsch
Contributor

conorsch commented Feb 6, 2024

During Testnet 64, folks in Discord have reported an uptick in jailing rates. It's difficult to reason concretely about that observation because we don't have good visibility into jailing rates. We do emit metrics such as penumbra_stake_validators_jailed in pd, but we aren't visualizing them anywhere out of the box. We also try to log the status change via a span (https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/core/component/stake/src/component.rs#L125-L135), but oddly I'm not seeing that span label return results in our Testnet 64 logs, even though we have debug logging enabled for the validators we run.

@conorsch conorsch added the A-telemetry Area: Metrics, logging, and other observability-related features label Feb 6, 2024
@conorsch conorsch self-assigned this Feb 6, 2024
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Feb 6, 2024
@github-project-automation github-project-automation bot moved this to Future in Testnets Feb 6, 2024
@conorsch
Contributor Author

From current testnet:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake | rg -v missed_blocks
penumbra_stake_validators_inactive 363
penumbra_stake_validators_jailed -23
penumbra_stake_validators_disabled 29
penumbra_stake_validators_active 0
penumbra_stake_validators_defined 64

There are a few problems in there, such as active=0 and "jailed" being a negative value. Let's inspect the transition logic in the staking component and make sure gauges are incremented and decremented correctly.
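For reference, here's a minimal, self-contained sketch of the bookkeeping we'd expect on each transition. This is not the actual staking component code: the ValidatorState enum and helper functions are illustrative, but the gauge API matches the metrics crate the code already uses.

use metrics::{gauge, Gauge};

// Illustrative mirror of the validator states exported above.
#[derive(Copy, Clone)]
enum ValidatorState {
    Defined,
    Inactive,
    Active,
    Jailed,
    Disabled,
}

// Look up the gauge tracking how many validators are currently in `state`.
fn gauge_for(state: ValidatorState) -> Gauge {
    match state {
        ValidatorState::Defined => gauge!("penumbra_stake_validators_defined"),
        ValidatorState::Inactive => gauge!("penumbra_stake_validators_inactive"),
        ValidatorState::Active => gauge!("penumbra_stake_validators_active"),
        ValidatorState::Jailed => gauge!("penumbra_stake_validators_jailed"),
        ValidatorState::Disabled => gauge!("penumbra_stake_validators_disabled"),
    }
}

// On a transition, the old state's gauge should go down by exactly one and the
// new state's gauge up by exactly one; anything else will skew the counts.
fn record_state_transition(old: ValidatorState, new: ValidatorState) {
    gauge_for(old).decrement(1.0);
    gauge_for(new).increment(1.0);
}

Any code path that updates only one side of that pair, or runs more than once per transition, would produce counts like the ones above.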

@conorsch
Contributor Author

Similarly, the missed blocks metric is not accurate:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake_missed_blocks | sort -k 2 -n -r | head -n 20
penumbra_stake_missed_blocks{identity_key="penumbravalid14e54ukuppxs3f6dsg6flx9tsmqcv363tvvzrdku8fgmagjd435fqexu7zs"} 148359350
penumbra_stake_missed_blocks{identity_key="penumbravalid1u60u8405jg8mlz0epmhqfslqqkjurx6l4pm8n3j0ves66sztwqrqctw4f2"} 137059677
penumbra_stake_missed_blocks{identity_key="penumbravalid1qrxgv0nqm8a5edffr5t5ykhjuchkmwkejsgfhtc38t2y0kf33qrsql244n"} 133699997
penumbra_stake_missed_blocks{identity_key="penumbravalid1g3wfpjkplhktcq5d3t3l5ud4u34qwh6kxacwrkd8uacvh5jlcqqqzqnhxt"} 130544894
penumbra_stake_missed_blocks{identity_key="penumbravalid1ms5kyy9qz0dggseq9ud9gl2wmlyl8c767hrw66ku35tksczsmv8qxlxxsd"} 127920260
penumbra_stake_missed_blocks{identity_key="penumbravalid1qqqvkxl2eewjl0f3kakcvryv4g6kq6ua8w8xk4g5j7n9pn50ycqstgekuv"} 119628111
penumbra_stake_missed_blocks{identity_key="penumbravalid1sqmemw9qeswm8fmvk8p8g2z49rhrxkyu09h9keqz5s6xkxh3tc8q9j32w0"} 99881573
penumbra_stake_missed_blocks{identity_key="penumbravalid15tu5hycly4yltuk6c4vsx45kn003amet69eqp2sxcjnw46udaczqf3n354"} 92749412
penumbra_stake_missed_blocks{identity_key="penumbravalid1ajryejwx349xl9lvh8sym09jle9chaum4628k7c0n5rpaw9v2cgqrtn9z4"} 85621965
penumbra_stake_missed_blocks{identity_key="penumbravalid1uzn79d9s736t6nga7060ts9ghncuzhahr92nldamwhjhc726458qdf509n"} 57060921
penumbra_stake_missed_blocks{identity_key="penumbravalid1lndq7e036xvuu7s7kavf9ncnqq70czyyj8gdenyzn8k9cyuwvg9qqm33vm"} 48428299
penumbra_stake_missed_blocks{identity_key="penumbravalid1wpkhqt7p3t84ux3vxdtscp8t8rz89xrarfc0ey0z79tux0na0y8qz72s8r"} 47233969
penumbra_stake_missed_blocks{identity_key="penumbravalid1zjdq22kr0n0e7u3z363h9ussktu768nqpryr47y6w7rxylx75gpqyfkqw6"} 45140250
penumbra_stake_missed_blocks{identity_key="penumbravalid1nnl5c8ad9gm0kpc42ttmyw29vrrsjaurk8euuj73kj9hvape4vqsm8c2f6"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid182uscvf6qaqq3r7tdy3uzf3thrrym08gnzg3furvdgvfguvgn5qs97p3tm"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid1227gvj8qmfp9pwnxwue3smduhjjk6junn9pgu9p5lru52x6a4spq3m9t4d"} 42277330
penumbra_stake_missed_blocks{identity_key="penumbravalid1mzfd4lk4kpx5gtvwas6e26mhkzh986g7clqwmucxe64md8kuty9qs0pf9c"} 25093054
penumbra_stake_missed_blocks{identity_key="penumbravalid16jcq73cy93hx6dc857drpednfmpe032uxrvzjg00xc0ser54xq8s6j6jjp"} 10619136
penumbra_stake_missed_blocks{identity_key="penumbravalid18sgu2nrded90ntn747223ttv4ggna37t8d8wlfwj42qu487lw58q3adpln"} 10504890
penumbra_stake_missed_blocks{identity_key="penumbravalid15tjp4mww35qmq8fvpkhjsd87ndnj9pthwzf33sn0n6p6vncrvyrqlf7275"} 9501019

Those values are far too high: the total block height is only ~60k at time of writing. At a guess, we're incrementing the missed_blocks gauge by the entirety of historical missed blocks for that validator:

metrics::gauge!(metrics::MISSED_BLOCKS, "identity_key" => identity_key.to_string())
.increment(0.0);

That .increment() call should instead be .set() on the gauge.
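Concretely, the fix probably looks something like this sketch (total_missed here is a stand-in for whatever running count the state machine already tracks; the real code refers to the metric name via the metrics::MISSED_BLOCKS constant):

use metrics::gauge;

// Clobber the gauge with the known total, rather than adding that total to
// whatever value the gauge already holds.
fn record_missed_blocks(identity_key: &str, total_missed: u64) {
    gauge!("penumbra_stake_missed_blocks", "identity_key" => identity_key.to_string())
        .set(total_missed as f64);
}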

@erwanor
Member

erwanor commented Feb 16, 2024

I'm surprised increment(0) and set(0) are any different

edit: ah nvm. got it.

@erwanor
Member

erwanor commented Feb 16, 2024

Ok I get why they're different, but not why those values are so high

@conorsch
Contributor Author

Ok I get why they're different, but not why those values are so high

My understanding is that we should be computing missed_blocks += n, but instead we're computing missed_blocks += (missed_blocks + n).
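To make the compounding concrete, here's a tiny standalone simulation (purely illustrative: it assumes one block is missed at every height, and keeps the state machine's true running total separate from the exported gauge):

// Compare the buggy strategy (increment the gauge by the running total) with
// the intended one (set the gauge to the running total).
fn main() {
    let (mut total, mut gauge_buggy, mut gauge_fixed) = (0u64, 0u64, 0u64);
    for _height in 0..5 {
        total += 1;           // the state machine's true missed-block count
        gauge_buggy += total; // increment by the running total: compounds every block
        gauge_fixed = total;  // set to the running total: always correct
    }
    println!("true total = {total}, buggy gauge = {gauge_buggy}, fixed gauge = {gauge_fixed}");
    // prints: true total = 5, buggy gauge = 15, fixed gauge = 5
}

The buggy gauge compounds every block, which is how a chain only ~60k blocks tall can report missed-block figures in the hundreds of millions.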

@erwanor
Member

erwanor commented Feb 20, 2024

Okay, wow - I would think that increment(N) does counter += N and set(N) does counter = N but I guess not, good catch

@cratelyn
Contributor

x-ref #3995, i stumbled across this today. i'm going to self-assign this, because i believe i can fix it as part of #4070 / #4040.

@cratelyn cratelyn assigned cratelyn and unassigned conorsch Mar 22, 2024
@cratelyn cratelyn moved this from 🗄️ Backlog to In progress in Penumbra Mar 25, 2024
@cratelyn cratelyn added this to the Sprint 3 milestone Mar 25, 2024
@cratelyn cratelyn moved this from In progress to 🗄️ Backlog in Penumbra Mar 25, 2024
@cratelyn cratelyn removed this from the Sprint 3 milestone Mar 25, 2024
@cratelyn cratelyn added this to the Sprint 4 milestone Apr 8, 2024
@cratelyn cratelyn moved this from Backlog to In progress in Penumbra Apr 8, 2024
@cratelyn cratelyn moved this from In progress to Backlog in Penumbra Apr 9, 2024
@aubrika aubrika assigned conorsch and unassigned cratelyn Apr 10, 2024
@aubrika aubrika added the _P-V1 Priority: slated for V1 release label Apr 10, 2024
@conorsch
Contributor Author

Took a closer look at this today. The missed_blocks issue appears quite simple: the increment -> set fix discussed above should resolve it.

As for the validator state metrics, the current setup can't possibly work: we're incrementing and decrementing gauges based on validator state transitions, but gauges default to zero, so what we're actually emitting is a tally of the state transitions observed by a specific instance of pd over the course of its uptime. Bouncing the pd service will wipe those metrics and let them drift again over time. That's also why we're seeing negative numbers: a freshly started pd instance initializes every gauge at 0, so whenever a validator transitions out of a state it entered before that instance started observing (for example, a jailed validator becoming unjailed, or a validator leaving the active set), the corresponding gauge is decremented below zero.

To resolve this, we must ground the metrics in actual historical state. Two options:

  1. on pd start, query the stake service for the current counts of active, defined, jailed, etc., initialize the gauges to those values, then leave the existing increment/decrement logic in place and trust it to keep the values updated over time; or
  2. modify the existing stake metrics function to perform the stake query itself and set the gauges from those values, so they're always correct (a rough sketch of this approach follows below).
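A rough sketch of what option 2 could look like (illustrative only; the ValidatorCounts struct stands in for whatever the stake query actually returns):

use metrics::gauge;

// Per-state validator counts, however they're obtained from chain state
// (the query itself is elided here).
struct ValidatorCounts {
    defined: u64,
    inactive: u64,
    active: u64,
    jailed: u64,
    disabled: u64,
}

// Derive the counts from actual state on every update and clobber the gauges,
// so the exported values are correct regardless of how long this pd instance
// has been running.
fn set_validator_state_gauges(counts: &ValidatorCounts) {
    gauge!("penumbra_stake_validators_defined").set(counts.defined as f64);
    gauge!("penumbra_stake_validators_inactive").set(counts.inactive as f64);
    gauge!("penumbra_stake_validators_active").set(counts.active as f64);
    gauge!("penumbra_stake_validators_jailed").set(counts.jailed as f64);
    gauge!("penumbra_stake_validators_disabled").set(counts.disabled as f64);
}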

Here's a lightly edited terminal session illustrating the problem:

❯ # shortly after joining, the metrics are nil:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
❯ # a bit later, as an epoch was crossed during syncing up to current height
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
penumbra_stake_validators_active -1
penumbra_stake_validators_jailed 1
❯ # evidence that we're still catching up:
❯ curl -s http://localhost:26657/status | jq -r .result.sync_info.catching_up
true
❯ # bounce the service, see relevant metrics reset:
❯ sudo systemctl restart penumbra cometbft
[sudo] password for conor:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort

Once fixed, we should tack on an integration test to make sure these metrics exist at network start: active validators should never be zero at time of join.
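Such a check could be as simple as scraping the metrics endpoint after the node joins. A rough sketch, assuming pd's metrics listener on localhost:9000 as in the sessions above, plus the reqwest and anyhow crates:

// Assert that the active-validators gauge is exported and nonzero shortly
// after joining a network.
#[test]
fn active_validator_gauge_is_nonzero() -> anyhow::Result<()> {
    let body = reqwest::blocking::get("http://localhost:9000/metrics")?.text()?;
    let active: f64 = body
        .lines()
        .find_map(|line| line.strip_prefix("penumbra_stake_validators_active "))
        .ok_or_else(|| anyhow::anyhow!("active-validators gauge not exported"))?
        .trim()
        .parse()?;
    assert!(active > 0.0, "active validators should never be zero after joining");
    Ok(())
}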

@hdevalence
Member

Should we be using metrics for this at all? I think the answer is probably no. I don't think we should be using metrics for anything that requires reconciling ongoing events with historical data. We already have a system for that, the event mechanism, and we should use it. And there is a standard pattern for rendering events on blockchains, namely a block explorer.

If someone wants to check if their validator is online or its status, they should open the validator's page in a block explorer. But we shouldn't try to build a half-featured block explorer inside Grafana just because our existing block explorer isn't functional.

@cratelyn cratelyn modified the milestones: Sprint 4, Sprint 5 Apr 22, 2024
@aubrika aubrika moved this from Backlog to Todo in Penumbra Apr 22, 2024
@aubrika aubrika modified the milestones: Sprint 5, Sprint 6 May 6, 2024
@aubrika aubrika added the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
conorsch added a commit that referenced this issue May 6, 2024
We were inappropriately `increment`ing a gauge to track missed blocks,
when instead we should have been clobbering the value via `set`,
as we know the precise number from the state machine. This fixes
a broken metric for missed blocks that was growing too fast,
emitting wrong values.

Refs #3746.
conorsch added a commit that referenced this issue May 6, 2024
@conorsch
Contributor Author

conorsch commented May 6, 2024

Should we be using metrics for this at all? [...] We already have a system for that, the event mechanism, and we should use it.

This is a great take, and we have progress toward the event pattern in #4277.

@cratelyn cratelyn removed the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
conorsch added a commit that referenced this issue May 6, 2024
@conorsch
Contributor Author

conorsch commented May 6, 2024

Closing in favor of #4336, which reframes the ask as event emission.

@conorsch conorsch closed this as completed May 6, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Penumbra May 6, 2024
conorsch added a commit that referenced this issue May 16, 2024
The validator state metrics are not accurate, and should be refactored
to use ABCI event emission so that developers can leverage the state
transitions in a sane way. In the meantime, let's remove the incorrect
metrics entirely.

Refs #3746, #4336.
conorsch added a commit that referenced this issue May 20, 2024