Fix hang in gocql init #1795

rukai · 2024-10-31T04:28:49Z

This PR reproduces the issue in an integration test and then fixes the issue with a code change.

The problem

Shotover works well with clients that perform token aware routing of prepared execute requests.
However, if the client is not performing token aware routing, shotover may not be able to route the execute request correctly.
While the gocql driver is initializing, it sends execute requests through its control connection (not routed by token), and therefore hits this issue.
As a result, the gocql driver ends up in a loop: it keeps retrying but never succeeds in preparing and executing the query.

The exact issue follows this flow:

1. client sends request to create prepared query over control connection
2. shotover routes request to all nodes in its rack
3. all cassandra nodes in the rack respond with success
4. shotover combines these responses into a single success response to send back to client
5. client attempts to execute query over control connection
6. shotover (correctly) routes the request to its true destination, which is outside of shotover's rack
7. cassandra responds with an error because there is no such prepared query on this cassandra node: it is outside of shotover's rack, and only the nodes within shotover's rack have the query prepared
8. shotover returns the error response to client
9. client receives error and retries the whole process, returning to step 1
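
The flow above can be reproduced with a small model. This is only a sketch: `Node`, `prepare_in_rack`, `execute_on` and `example_cluster` are hypothetical names invented for illustration and are not shotover's real types or functions. It assumes a 3 node cluster where shotover sits in `rack1` and the query's true destination is the node in `rack2`.

```rust
use std::collections::HashSet;

/// Hypothetical model of a cassandra node: its rack and the ids of the
/// queries it has prepared. Not shotover's real types.
struct Node {
    rack: &'static str,
    prepared: HashSet<u64>,
}

/// Steps 2-4: prepare `query_id` only on the nodes in `rack`.
/// Returns how many nodes prepared the query.
fn prepare_in_rack(nodes: &mut [Node], rack: &str, query_id: u64) -> usize {
    let mut count = 0;
    for node in nodes.iter_mut().filter(|n| n.rack == rack) {
        node.prepared.insert(query_id);
        count += 1;
    }
    count
}

/// Steps 6-8: execute on the node at index `owner`.
/// Fails if that node never prepared the query.
fn execute_on(nodes: &[Node], owner: usize, query_id: u64) -> Result<(), String> {
    if nodes[owner].prepared.contains(&query_id) {
        Ok(())
    } else {
        Err(format!("unprepared query {query_id} on node {owner}"))
    }
}

/// Assumed example cluster: two nodes in rack1 and one node in rack2.
fn example_cluster() -> Vec<Node> {
    vec![
        Node { rack: "rack1", prepared: HashSet::new() },
        Node { rack: "rack1", prepared: HashSet::new() },
        Node { rack: "rack2", prepared: HashSet::new() },
    ]
}
```

After `prepare_in_rack(&mut nodes, "rack1", 7)`, executing on node 0 or 1 succeeds but executing on node 2 returns an error, no matter how many times the client repeats the prepare. That is the retry loop the driver gets stuck in.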

The original fix

The absolute simplest fix is to have shotover send the prepare request to all cassandra nodes in the cluster, not just the cassandra nodes in shotover's rack.

However, this has some performance issues as we now need to send a message to all nodes in the cluster, likely forcing shotover to open new connections to them, and requiring out of rack communication.
As a result I considered alternative fixes.
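
To make the trade-off concrete, here is a rough sketch of the "prepare everywhere" approach. The names `Node`, `prepare_on_all` and `can_execute` are hypothetical and do not come from the shotover codebase. The function returns the number of prepare messages that had to leave shotover's rack, which is the cost described above.

```rust
use std::collections::HashSet;

/// Hypothetical model of a cassandra node; not shotover's real types.
struct Node {
    rack: &'static str,
    prepared: HashSet<u64>,
}

/// Prepare `query_id` on every node in the cluster.
/// Returns how many of those prepare messages went to nodes outside `local_rack`.
fn prepare_on_all(nodes: &mut [Node], local_rack: &str, query_id: u64) -> usize {
    let mut out_of_rack = 0;
    for node in nodes.iter_mut() {
        node.prepared.insert(query_id);
        if node.rack != local_rack {
            out_of_rack += 1;
        }
    }
    out_of_rack
}

/// An execute succeeds on any node that has the query prepared.
fn can_execute(node: &Node, query_id: u64) -> bool {
    node.prepared.contains(&query_id)
}
```

With one node in `rack1` and two in `rack2`, a single prepare from a shotover instance in `rack1` results in two out of rack messages, but afterwards every node can execute the query regardless of where the execute is routed.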

The final fix

The final fix used in this PR is to route the request to a random node within the rack and have that node route to the correct node.
This:
* simplifies our routing logic.
* avoids the performance concerns of routing prepare requests to all nodes.
* has the downside that we remove the fallback to out of rack requests.

The only case where this fallback would be used is if all cassandra nodes in a shotover instance's rack are down.
This is very rare, but possible.
However, if we have lost an entire rack of cassandra nodes then we have lost a large chunk of the cassandra cluster, so if we start sending out of rack requests we might start to overwhelm any still functioning racks.
Better for shotover to just handle this case as an error.

~~The final fix is implemented as a 2nd commit on top of the original fix which is the 1st commit.~~
The final fix has been reverted: it is common to run with a rack of 1 cassandra node, so this is actually quite dangerous.
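
For reference, the reverted in-rack routing can be sketched roughly as follows. This is an illustration only: `Node`, `RouteError` and `route_in_rack` are hypothetical names, not the code from the reverted commit, and `seed` stands in for a random number so the sketch needs no external crate.

```rust
/// Hypothetical node state; not shotover's real types.
struct Node {
    rack: &'static str,
    is_up: bool,
}

#[derive(Debug, PartialEq)]
enum RouteError {
    /// Every node in shotover's rack is down and there is no out of rack fallback.
    NoNodesUpInRack,
}

/// Pick the index of an in-rack node that is up, selected by `seed`.
fn route_in_rack(nodes: &[Node], rack: &str, seed: usize) -> Result<usize, RouteError> {
    let candidates: Vec<usize> = nodes
        .iter()
        .enumerate()
        .filter(|(_, n)| n.rack == rack && n.is_up)
        .map(|(i, _)| i)
        .collect();
    if candidates.is_empty() {
        // No fallback to out of rack nodes: surface an error instead.
        return Err(RouteError::NoNodesUpInRack);
    }
    Ok(candidates[seed % candidates.len()])
}
```

With a rack of 1 cassandra node the candidate list has a single entry, so every request goes to that one node whatever the seed is, and the moment that node is down every request becomes an error.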

We will explore any possible performance fixes once we have clusters running with shotover that have a high cassandra node count.

codspeed-hq · 2024-10-31T04:40:27Z

CodSpeed Performance Report

Merging #1795 will not alter performance

Comparing rukai:fix_gocql_init_hang (995f90e) with main (8d374bc)

Summary

✅ 38 untouched benchmarks

shotover/src/transforms/cassandra/sink_cluster/rewrite.rs

ronycsdu approved these changes Oct 31, 2024

rukai force-pushed the fix_gocql_init_hang branch 3 times, most recently from b649a1a to aea08e6 on October 31, 2024 23:41

rukai marked this pull request as ready for review November 1, 2024 00:14

rukai requested a review from ronycsdu November 1, 2024 01:03

Fix hang in gocql init

980dcb0

rukai force-pushed the fix_gocql_init_hang branch from 56c72fd to 980dcb0 on November 1, 2024 01:21

justinweng-instaclustr approved these changes Nov 1, 2024

ronycsdu reviewed Nov 1, 2024

shotover/src/transforms/cassandra/sink_cluster/rewrite.rs (resolved)

ronycsdu self-requested a review November 1, 2024 01:35

ronycsdu approved these changes Nov 1, 2024

rukai enabled auto-merge (squash) November 1, 2024 01:41

Merge branch 'main' into fix_gocql_init_hang

995f90e

rukai merged commit cdbd339 into shotover:main Nov 1, 2024
41 checks passed
