
Force client disconnects when node is unhealthy #13

Open · wants to merge 14 commits into base: master-2.2
Conversation

pmantica11

Clients are prevented from connecting to an unhealthy gRPC instance, but they are not disconnected from a lagging one. This PR disconnects clients from an unhealthy instance so that they are forced to reconnect to a healthy one.

solana-sdk = "~2.1.2"
solana-transaction-status = "~2.1.2"
solana-client = "~2.1.2"
solana-rpc-client-api = "~2.1.2"
Author


Using the master git versions was giving me package conflict errors, so I will use the latest release version instead. I tested the code against the Solana test validator and it worked, so we should be good.


This is risky; we should keep using the master versions, since geyser usually has breaking changes between minor versions.

Author


In retrospect I agree, but since this version has been battle-tested with 2.2 and is fixed, I now prefer it over using the master version.

yellowstone-grpc-geyser/config.json (resolved)
use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
// Must be `static`, not `const`: a `const` item is inlined at every use
// site, so each read would construct a fresh `Lazy` and the flag would
// never actually be shared.
pub static IS_NODE_UNHEALTHY: Lazy<Arc<AtomicBool>> = Lazy::new(|| Arc::new(AtomicBool::new(false)));
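Worth noting: because `AtomicBool::new` is a `const fn`, the flag doesn't need `Lazy` or `Arc` at all. A minimal std-only sketch of the shared flag (the helper functions below are hypothetical, not from the PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// A plain `static` is enough: `AtomicBool::new` is a const fn, so the flag
// can be initialized at compile time and shared across threads directly.
static IS_NODE_UNHEALTHY: AtomicBool = AtomicBool::new(false);

// Hypothetical setter, e.g. called from a background health-monitor task.
fn mark_unhealthy(unhealthy: bool) {
    IS_NODE_UNHEALTHY.store(unhealthy, Ordering::SeqCst);
}

// Hypothetical check used by the stream loop before sending a message.
fn should_force_disconnect() -> bool {
    IS_NODE_UNHEALTHY.load(Ordering::SeqCst)
}

fn main() {
    assert!(!should_force_disconnect());
    mark_unhealthy(true);
    assert!(should_force_disconnect());
    println!("health flag ok");
}
```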
Author


I considered disconnecting clients only after the node had been unhealthy for some amount of time, but I think that was unnecessarily complicated. If a node is 100 slots behind, it must have been unhealthy for roughly 40 seconds (at ~400 ms per slot), which is enough.

yellowstone-grpc-geyser/src/monitor.rs (outdated, resolved)
yellowstone-grpc-geyser/src/plugin.rs (outdated, resolved)
yellowstone-grpc-geyser/src/monitor.rs (outdated, resolved)
@@ -838,6 +839,13 @@ impl GrpcService {
}
}
message = messages_rx.recv() => {
let num_slots_behind = NUM_SLOTS_BEHIND.load(Ordering::SeqCst);
if num_slots_behind > HEALTH_CHECK_SLOT_DISTANCE {
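The check in the hunk above can be sketched std-only as follows (`NUM_SLOTS_BEHIND` is assumed to be updated elsewhere from slot notifications; this is an illustration, not the PR's exact code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;

// Assumed to be updated by the plugin whenever a slot notification arrives.
static NUM_SLOTS_BEHIND: AtomicU64 = AtomicU64::new(0);

/// Returns true when the stream loop should drop the client so that it
/// reconnects to a healthier gRPC instance.
fn should_force_disconnect() -> bool {
    NUM_SLOTS_BEHIND.load(Ordering::SeqCst) > HEALTH_CHECK_SLOT_DISTANCE
}

fn main() {
    NUM_SLOTS_BEHIND.store(50, Ordering::SeqCst);
    assert!(!should_force_disconnect());
    NUM_SLOTS_BEHIND.store(150, Ordering::SeqCst);
    assert!(should_force_disconnect());
    println!("slot distance check ok");
}
```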


Not required in the first release, but I would prefer this configured as two checks:

  1. Auto-disconnect if >100 slots behind
  2. Disconnect if >20 slots behind for the last 5 checks
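A sketch of what this two-check scheme could look like (the struct and field names are hypothetical; only the 100/20/5 thresholds come from the comment):

```rust
/// Tracks recent slots-behind samples and applies the two proposed checks.
struct LagTracker {
    recent: Vec<u64>, // slots-behind samples, newest last, capped at 5
}

impl LagTracker {
    fn new() -> Self {
        LagTracker { recent: Vec::new() }
    }

    fn record(&mut self, slots_behind: u64) {
        self.recent.push(slots_behind);
        if self.recent.len() > 5 {
            self.recent.remove(0);
        }
    }

    fn should_disconnect(&self) -> bool {
        let latest = *self.recent.last().unwrap_or(&0);
        // Check 1: immediate disconnect if more than 100 slots behind.
        if latest > 100 {
            return true;
        }
        // Check 2: disconnect if more than 20 slots behind for 5 checks.
        self.recent.len() == 5 && self.recent.iter().all(|&s| s > 20)
    }
}

fn main() {
    let mut sustained = LagTracker::new();
    for _ in 0..5 {
        sustained.record(30); // moderately behind, but for 5 checks in a row
    }
    assert!(sustained.should_disconnect());

    let mut spike = LagTracker::new();
    spike.record(150); // single large spike trips check 1 immediately
    assert!(spike.should_disconnect());
    spike.record(10); // recovered: neither check fires
    assert!(!spike.should_disconnect());
    println!("two-check ok");
}
```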

Author


This adds too much configuration complexity, and it's not worth it. It only helps when we are stuck between 20 and 100 slots behind for a long time, and I don't believe that happens frequently: nodes either die and can't catch up at all, or they are healthy. I haven't seen a node lag by around 20 slots for more than a minute.

I actually decided to simplify this further: I use the RPC health check directly instead of relying on a configured health slot distance.
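For reference, Solana's JSON-RPC `getHealth` method returns `"ok"` as the result when the node is within the health slot distance, and a JSON-RPC error object otherwise. A minimal, illustrative interpretation of the response body (production code should deserialize the JSON properly, e.g. with serde_json, rather than substring-match):

```rust
/// Naive check of a `getHealth` JSON-RPC response body. Illustrative only.
fn is_healthy(response_body: &str) -> bool {
    response_body.contains("\"result\":\"ok\"")
}

fn main() {
    // Healthy node: result is the string "ok".
    assert!(is_healthy(r#"{"jsonrpc":"2.0","result":"ok","id":1}"#));
    // Unhealthy node: an error object is returned instead of a result.
    assert!(!is_healthy(
        r#"{"jsonrpc":"2.0","error":{"code":-32005,"message":"Node is unhealthy"},"id":1}"#
    ));
    println!("getHealth parse ok");
}
```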

@vovkman

vovkman commented Nov 27, 2024

Is it the case that we will still accept connections and then disconnect when a message comes through and it's behind? We should probably prevent connecting in general if the node is behind.

@pmantica11
Author

> Is it the case that we will still accept connections and then disconnect when a message comes through and it's behind? We should probably prevent connecting in general if the node is behind.

Good point. I'll also update the code to handle that.
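Gating new connections could look roughly like the following sketch (the function name is hypothetical; in the real plugin the error would map to a gRPC status such as UNAVAILABLE rather than a plain string):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Shared health flag, assumed to be kept up to date by a monitor task.
static IS_NODE_UNHEALTHY: AtomicBool = AtomicBool::new(false);

/// Hypothetical guard at subscription time: refuse the stream up front
/// instead of accepting it and disconnecting on the first message.
fn accept_subscription() -> Result<(), &'static str> {
    if IS_NODE_UNHEALTHY.load(Ordering::SeqCst) {
        return Err("node is unhealthy; reconnect to another endpoint");
    }
    Ok(())
}

fn main() {
    assert!(accept_subscription().is_ok());
    IS_NODE_UNHEALTHY.store(true, Ordering::SeqCst);
    assert!(accept_subscription().is_err());
    println!("connection gate ok");
}
```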

3 participants