transport: reset the node up/down marker on topology update #607

Draft · wants to merge 3 commits into main

Conversation

@havaker (Contributor) commented Nov 25, 2022

For node up/down markers to work reliably, a source of truth other than CQL events is needed, because CQL event delivery is prone to connection failures. By querying the `system.cluster_status` table from time to time, we can obtain accurate information about the status of nodes in the cluster.

This pull request implements this periodic querying and resets each node's up/down marker to the value obtained from it.
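A minimal sketch of that idea (not the driver's actual code): a background task periodically queries `system.cluster_status` and overwrites the event-driven markers, so a lost UP/DOWN event cannot leave a marker stale forever. The `Node`, `ClusterData`, and `query_cluster_status` names are illustrative, and `tokio` is assumed for the timer.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Illustrative stand-ins for the driver's internal types.
struct Node {
    // Up/down marker, normally flipped by CQL UP/DOWN events.
    is_up: AtomicBool,
}

struct ClusterData {
    nodes: HashMap<IpAddr, Arc<Node>>,
}

// Hypothetical helper: `SELECT peer, up FROM system.cluster_status`
// issued over the control connection.
async fn query_cluster_status() -> HashMap<IpAddr, bool> {
    unimplemented!("run the query and collect (peer, up) pairs")
}

// Periodically overwrite the event-driven markers with the queried state.
async fn refresh_up_markers(cluster: Arc<ClusterData>, period: Duration) {
    loop {
        tokio::time::sleep(period).await;
        let status = query_cluster_status().await;
        for (addr, node) in &cluster.nodes {
            if let Some(&up) = status.get(addr) {
                node.is_up.store(up, Ordering::Relaxed);
            }
        }
    }
}
```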

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks, and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I added appropriate Fixes: annotations to the PR description.

For node up/down markers to work reliably, a source of truth other than
CQL events is needed, because CQL event delivery is prone to connection
failures. By querying the `system.cluster_status` table from time to
time, we can obtain accurate information about the status of nodes in
the cluster.

This change will let us correct the up/down markers, because every
`Node` is created (and can also be updated) based on information from a
corresponding `Peer`.

Done to match the marker name with the `system.cluster_status` column
name, `up`.

Node up markers should be reset according to the information from
`Metadata`, which we treat as the source of truth about the cluster (in
contrast to information received via events, whose delivery is
unreliable).
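A rough sketch of how these commits could fit together, assuming a metadata refresh yields a list of `Peer`s and the cluster keeps `Arc<Node>`s keyed by address; the names and layout are illustrative, not the driver's exact API.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Illustrative stand-ins for the driver's internal types.
struct Peer {
    address: IpAddr,
    up: Option<bool>, // value of the `up` column, if available
}

struct Node {
    address: IpAddr,
    up: AtomicBool, // marker normally driven by UP/DOWN events
}

struct Metadata {
    peers: Vec<Peer>,
}

// On every topology update, build the new node map from `Metadata`:
// known nodes keep their identity but have their marker reset from the
// metadata (treated as the source of truth); new nodes are created from
// their `Peer`.
fn apply_topology_update(
    old_nodes: &HashMap<IpAddr, Arc<Node>>,
    metadata: &Metadata,
) -> HashMap<IpAddr, Arc<Node>> {
    let mut new_nodes = HashMap::new();
    for peer in &metadata.peers {
        let node = match old_nodes.get(&peer.address) {
            Some(node) => {
                if let Some(up) = peer.up {
                    node.up.store(up, Ordering::Relaxed);
                }
                Arc::clone(node)
            }
            None => Arc::new(Node {
                address: peer.address,
                up: AtomicBool::new(peer.up.unwrap_or(true)),
            }),
        };
        new_nodes.insert(peer.address, node);
    }
    new_nodes
}
```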
@havaker (Contributor, Author) commented Nov 25, 2022

Oops, it looks like `system.cluster_status` appeared in ScyllaDB only recently (scylladb/scylladb@7c95bd3), and it does not exist at all in Cassandra.

In the absence of the `system.cluster_status` table, we will be forced to use another way to detect down nodes. I think that periodically sending a keepalive query to each node would be a good substitute - the infrastructure to do so already exists (#395), but the question is how it can be used.

For now, I'm converting this PR into a draft.
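A hedged sketch of the keepalive alternative mentioned in the comment above: each node gets a background task that periodically sends a cheap request and flips the marker according to whether an answer arrived. `send_keepalive` is a hypothetical helper (e.g. an OPTIONS frame or a trivial `SELECT` on one of the node's connections), and `tokio` is assumed for the timer.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

struct Node {
    is_up: AtomicBool,
}

// Hypothetical helper: returns Ok(()) if the node answered the request.
async fn send_keepalive(_node: &Node) -> Result<(), ()> {
    unimplemented!("send a cheap request on one of the node's connections")
}

// Mark the node down when it stops answering keepalives and back up when
// it answers again, independently of CQL UP/DOWN event delivery.
async fn keepalive_loop(node: Arc<Node>, period: Duration) {
    loop {
        tokio::time::sleep(period).await;
        let alive = send_keepalive(&node).await.is_ok();
        node.is_up.store(alive, Ordering::Relaxed);
    }
}
```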

@wprzytula (Collaborator) commented:

This effort is waiting for more time resources to become available.

@Lorak-mmk (Collaborator) commented:

Time resources are not everything. I think someone told me that `cluster_status` is not the data source we are looking for (I don't remember why) - maybe it was @piodul? We would also need to implement different behavior for ScyllaDB and Cassandra.
Even if some correct data source exists, should we stop sending queries to node A just because node B thinks that A is down, even though we still have open, working connections to it? Maybe it's better to just rely on the driver?

@Lorak-mmk removed their assignment on Jul 31, 2024