closes #1793
This PR reproduces the issue in an integration test and then fixes the issue with a code change.
The problem
Shotover works well with clients that perform token aware routing of prepared execute requests.
However, if the client is not performing token aware routing, shotover may not be able to route the execute request correctly.
While the gocql driver is initializing, it sends execute requests through its control connection (not routed by token), and therefore hits this issue.
As a result, the gocql driver ends up in a loop: it keeps retrying but never succeeds in preparing and executing the query.
The exact issue follows this flow:
The original fix
The absolute simplest fix is to have shotover send the prepare request to all cassandra nodes in the cluster, not just the cassandra nodes in shotover's rack.
However, this has some performance issues as we now need to send a message to all nodes in the cluster, likely forcing shotover to open new connections to them, and requiring out of rack communication.
As a result I considered alternative fixes.
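To make the trade-off concrete, here is a minimal sketch of the two ways of choosing which nodes receive a prepare request. It is not shotover's actual code: the `Node` struct and both function names are hypothetical and exist only for illustration.

```rust
/// Hypothetical, simplified view of a cassandra node as seen by the proxy.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    address: String,
    rack: String,
}

/// Behaviour before the fix: only nodes in the proxy's own rack receive the prepare.
fn prepare_targets_local_rack<'a>(nodes: &'a [Node], local_rack: &str) -> Vec<&'a Node> {
    nodes.iter().filter(|n| n.rack == local_rack).collect()
}

/// The original fix: every node in the cluster receives the prepare,
/// so a later execute succeeds no matter which node it lands on.
fn prepare_targets_all(nodes: &[Node]) -> Vec<&Node> {
    nodes.iter().collect()
}
```

With three racks of equal size, the second function returns three times as many targets as the first, which is where the extra connections and out of rack traffic come from.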
The final fix
The final fix used in this PR is to route the request to a random node within the rack and have that node route to the correct node.
This:
* simplifies our routing logic.
* avoids the performance concerns of routing prepare requests to all nodes.
* has the downside that we remove the fallback to out of rack requests.

The only case where this fallback would be used is if all cassandra nodes in a shotover instance's rack are down.
This is very rare, but possible.
However if we have lost an entire rack of cassandra nodes then we have lost a large chunk of the cassandra cluster, so if we start sending out of rack requests we might start to overwhelm any still functioning racks.
Better for shotover to just handle this case as an error.
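The routing rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not shotover's implementation: `Node`, `RouteError` and `route_in_rack` are hypothetical names, and the `pick` closure stands in for a random number source so that the sketch needs no external crates.

```rust
/// Hypothetical, simplified view of a cassandra node as seen by the proxy.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    address: String,
    rack: String,
    is_up: bool,
}

/// Hypothetical error returned when the request cannot be routed.
#[derive(Debug, PartialEq)]
enum RouteError {
    NoLiveNodeInRack,
}

/// Pick one live node from the local rack and let it forward the request
/// to the node that owns the data. There is no out of rack fallback:
/// if every node in the rack is down, the request fails with an error.
/// `pick` receives the number of candidates and returns an index (assumed random).
fn route_in_rack<'a>(
    nodes: &'a [Node],
    local_rack: &str,
    pick: impl FnOnce(usize) -> usize,
) -> Result<&'a Node, RouteError> {
    let candidates: Vec<&Node> = nodes
        .iter()
        .filter(|n| n.is_up && n.rack == local_rack)
        .collect();
    if candidates.is_empty() {
        return Err(RouteError::NoLiveNodeInRack);
    }
    // Clamp the index so a misbehaving `pick` cannot go out of bounds.
    let index = pick(candidates.len()) % candidates.len();
    Ok(candidates[index])
}
```

Note that a rack containing a single node has exactly one candidate, so one node going down is enough to reach the error branch.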
The final fix is implemented as a 2nd commit on top of the original fix, which is the 1st commit.
The final fix is reverted: it is common to run with a rack of 1 cassandra node, so this is actually quite dangerous.
We will explore any possible performance fixes once we have clusters with a high cassandra node count running with shotover.