Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hang in gocql init #1795

Merged
merged 2 commits into from
Nov 1, 2024
Merged

Fix hang in gocql init #1795

merged 2 commits into from
Nov 1, 2024

Conversation

rukai
Copy link
Member

@rukai rukai commented Oct 31, 2024

closes #1793

This PR reproduces the issue in an integration test and then fixes the issue with a code change.

The problem

Shotover works well with clients that perform token aware routing of prepared execute requests.
However, if the client is not performing token aware routing, shotover may not be able to route the execute request correctly.
And while the gocql driver is initializing it sends execute requests through its control connection (not routed by token) therefore hitting this issue.
As a result the gocql driver ends up in a loop as it keeps retrying but never succeeding to prepared and execute the query.

The exact issue follows this flow:

  1. client sends request to create prepared query over control connection
  2. shotover routes request to all nodes in its rack
  3. all cassandra nodes in the rack respond with success
  4. shotover combines these responses into a single success response to send back to client
  5. client attempts to execute query over control connection
  6. shotover (correctly) routes the request to its true destination which is outside of shotovers rack.
  7. cassandra responds with error because there is no such prepared query on this cassandra node since it is outside of shotovers rack and only the nodes within shotovers rack have the query prepared.
  8. shotover returns error response to client.
  9. client receives error and retries the whole process, returning to step 1

The original fix

The absolute simplest fix is to have shotover send the prepare request to all cassandra nodes in the cluster, not just the cassandra nodes in shotover's rack.

However, this has some performance issues as we now need to send a message to all nodes in the cluster, likely forcing shotover to open new connections to them, and requiring out of rack communication.
As a result I considered alternative fixes.

The final fix

The final fix used in this PR is to route the request to a random node within the rack and have that node route to the correct node.
This:

* simplifies our routing logic.
* avoids the performance concerns of routing prepare requests to all nodes.
* Has the downside that we remove the fallback to out of node requests.

The only case where this fallback would be used is if all cassandra nodes in a shotover instance's rack are down.
This is very rare, but possible.
However if we have lost an entire rack of cassandra nodes then we have lost a large chunk of the cassandra cluster, so if we start sending out of rack requests we might start to overwhelm any still functioning racks.
Better for shotover to just handle this case as an error.

The final fix is implemented as a 2nd commit ontop of the original fix which is the 1st commit.
Final fix is reverted, it is common to run with a rack of 1 cassandra node, so this is actually quite dangerous.

We will explore any possible performance fixes once we have clusters running with shotover with high cassandra node count.

Copy link

codspeed-hq bot commented Oct 31, 2024

CodSpeed Performance Report

Merging #1795 will not alter performance

Comparing rukai:fix_gocql_init_hang (995f90e) with main (8d374bc)

Summary

✅ 38 untouched benchmarks

@rukai rukai force-pushed the fix_gocql_init_hang branch 3 times, most recently from b649a1a to aea08e6 Compare October 31, 2024 23:41
@rukai rukai marked this pull request as ready for review November 1, 2024 00:14
@rukai rukai requested a review from ronycsdu November 1, 2024 01:03
@rukai rukai force-pushed the fix_gocql_init_hang branch from 56c72fd to 980dcb0 Compare November 1, 2024 01:21
@ronycsdu ronycsdu self-requested a review November 1, 2024 01:35
@rukai rukai enabled auto-merge (squash) November 1, 2024 01:41
@rukai rukai merged commit cdbd339 into shotover:main Nov 1, 2024
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gocql client hangs when connecting to Cassandra via Shotover. The client hangs if a keyspace is specified
3 participants