
introduce a simple readiness probe that doesn't check bootstrapped st… #6

Merged
merged 1 commit into from
Feb 14, 2024

Contributor

@nicolasochem nicolasochem commented Feb 13, 2024

…atus

Recently on mondaynet we have seen the node's RPC subsystem become unresponsive while the node itself keeps running instead of crashing.

We normally have a readiness probe to get alerted when this happens.

But the readiness probe is overkill: it checks whether the chain is bootstrapped by measuring the age of the head block, failing if it's over 10 minutes.
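As a rough illustration of the head-age check described above, here is a minimal Python sketch. It is not the probe's actual code: the function name `head_is_fresh`, the timestamp format, and the idea that the caller passes in the current time are all assumptions made for the example. Only the 10-minute threshold comes from the description.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the description: the probe fails when the head is over 10 minutes old.
MAX_HEAD_AGE = timedelta(minutes=10)


def head_is_fresh(head_timestamp: str, now: datetime,
                  max_age: timedelta = MAX_HEAD_AGE) -> bool:
    """Return True if the head block is at most max_age old.

    head_timestamp is assumed (for this sketch) to be a UTC timestamp
    formatted like "2024-02-13T12:00:00Z".
    """
    head_time = datetime.strptime(
        head_timestamp, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    # "Over 10 minutes" fails, so an age of exactly 10 minutes still passes.
    return now - head_time <= max_age
```

On a freshly activated test network that is still waiting for quorum, no new blocks are produced, so a check like this would start failing 10 minutes after the last block even though the node is healthy.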

We generally don't want this on test networks, and especially not on weeklynet. After activation, we wait for the website to be published, other participants to come online, and quorum to be met.

If we had this probe, the chain would be marked as unbootstrapped and stop responding to RPC and p2p, so we would never get quorum.

But we still want to be alerted when the RPC subsystem is down.

I'm introducing 2 readiness probe settings:

  • bootstrapped_readiness_probe: identical to existing readiness_probe
  • rpc_readiness_probe: checks for RPC only
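An RPC-only check can be much simpler than the bootstrapped check: it only needs to know whether the RPC subsystem answers at all. The sketch below is an assumption about the general shape, not the code in this PR. The name `rpc_is_responsive` is hypothetical, and the actual request is injected as a callable because the description does not say which endpoint is queried.

```python
from typing import Callable


def rpc_is_responsive(call_rpc: Callable[[], object]) -> bool:
    """Return True if the supplied RPC call completes without raising.

    call_rpc is a hypothetical callable that performs one request against
    the node's RPC interface (ideally with a short timeout). The content
    of the response is deliberately ignored: only reachability matters.
    """
    try:
        call_rpc()
    except Exception:
        # Timeouts, refused connections, and HTTP errors all count as "not ready".
        return False
    return True
```

Unlike the head-age check, this stays green while the chain is idle and only fails when the RPC subsystem itself stops answering.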

Both settings are on by default, so for mondaynet the following should be set:

nodes:
  nodex:
    bootstrapped_readiness_probe: false
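To make the default behaviour concrete, the following Python sketch shows one way the two settings could be resolved for a node. The function `enabled_probes` and the probe labels `"rpc"` and `"bootstrapped"` are hypothetical and exist only for this example. The setting names and the "on by default" rule are taken from the description above.

```python
from typing import Dict, List


def enabled_probes(node_config: Dict[str, bool]) -> List[str]:
    """Return the readiness checks that apply to one node's settings.

    A setting that is missing from node_config is treated as enabled,
    matching the "on by default" behaviour described above.
    """
    probes = []
    if node_config.get("rpc_readiness_probe", True):
        probes.append("rpc")
    if node_config.get("bootstrapped_readiness_probe", True):
        probes.append("bootstrapped")
    return probes
```

With the mondaynet settings shown above, `enabled_probes({"bootstrapped_readiness_probe": False})` leaves only the RPC check active, while an empty configuration keeps both.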

Contributor

@craigbuckler craigbuckler left a comment


Tested on dailynet. All seems good to go.

@nicolasochem nicolasochem merged commit 9c93681 into main Feb 14, 2024
16 checks passed
@nicolasochem nicolasochem deleted the nicolasochem@always_check_rpc branch February 14, 2024 15:40