Fix low connection_count performance #1459
Merged
closes #1442
While investigating the low performance of the connection_count=1 benchmarks I observed that the size of the message batches passed through transforms was very low for all benchmarks. Usually a batch contained only a single message, though occasionally it was larger.
This seemed very strange to me and indicated that pipelining of cassandra messages wasn't occurring properly.
This left me with 2 possibilities:

1. The benchmark itself wasn't sending messages concurrently, so there was nothing for shotover to pipeline.
2. Shotover was receiving pipelined messages but failing to batch them as they passed through.
I started by exploring the first possibility. I inspected the bench implementation and found that it has 100 concurrent tasks sending messages, so on a 2 core EC2 instance we should be seeing at least some pipelining occurring.
I then checked the stream_id of the cassandra messages passing through shotover; they were all between 0 and 99, which indicated that the driver was definitely expecting these messages to be pipelined.
To double check this I increased the number of tasks to 1000 and found that the stream_id values were now between 0 and 999, which validated this theory.
So I mostly ruled out the first possibility and started exploring the second.
The way we implemented our codecs was a bit weird and seemed to go against the contract described in the docs: https://docs.rs/tokio-util/latest/tokio_util/codec/trait.Decoder.html#tymethod.decode
Our implementation was returning as many messages as it could from each decode call, while the docs assume that you return a single message and don't mention anything about returning multiple messages.
So it's not clear from the docs what the performance impact of our deviation would be.
But it seemed possible that the tokio codec abstraction was tuned for implementations that always return a single item.
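For reference, here is a minimal sketch of the two decoder shapes. The framing and types are toy placeholders (a u32 length prefix and a hypothetical `Message` wrapper), not shotover's actual cassandra codec; the point is just the difference between `type Item = Message` and `type Item = Vec<Message>`.

```rust
use bytes::{Buf, BytesMut};
use tokio_util::codec::Decoder;

// Hypothetical stand-in for shotover's message type.
struct Message(Vec<u8>);

// The shape the tokio docs assume: at most one frame per decode call.
struct SingleDecoder;

impl Decoder for SingleDecoder {
    type Item = Message;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Message>, Self::Error> {
        // Toy framing: a u32 length prefix followed by the payload.
        if src.len() < 4 {
            return Ok(None); // need more bytes
        }
        let len = u32::from_be_bytes(src[0..4].try_into().unwrap()) as usize;
        if src.len() < 4 + len {
            return Ok(None); // frame not fully buffered yet
        }
        src.advance(4);
        Ok(Some(Message(src.split_to(len).to_vec())))
    }
}

// The shape we had: each decode call drains every complete frame into a batch.
struct BatchingDecoder(SingleDecoder);

impl Decoder for BatchingDecoder {
    type Item = Vec<Message>;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Vec<Message>>, Self::Error> {
        let mut batch = vec![];
        while let Some(msg) = self.0.decode(src)? {
            batch.push(msg);
        }
        Ok(if batch.is_empty() { None } else { Some(batch) })
    }
}
```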
So I moved the batching to the point where we read messages off the decoder task, instead of doing it within the decoder itself.
This gave us the large win of this PR.
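The new batching point looks roughly like the sketch below (simplified, with hypothetical names, not the exact shotover code): await one decoded message, then opportunistically drain whatever the decoder has already produced, so batches still form whenever messages are pipelined.

```rust
use futures::{FutureExt, Stream, StreamExt};

// Wait for at least one item, then greedily collect anything that is
// already decoded and ready without awaiting, forming a natural batch.
async fn read_batch<S, T>(stream: &mut S) -> Option<Vec<T>>
where
    S: Stream<Item = T> + Unpin,
{
    let first = stream.next().await?;
    let mut batch = vec![first];

    // now_or_never polls once: if the next item isn't ready yet we stop
    // and hand the batch to the transform chain instead of waiting.
    while let Some(Some(item)) = stream.next().now_or_never() {
        batch.push(item);
    }
    Some(batch)
}
```

With this shape the decoder stays a plain one-frame-per-call `Decoder`, and the batch size adapts naturally to how many messages are in flight.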
Future work
I would like to investigate replacing `Vec<Message>` with `Message` within the encoder/decoder, as I think that at the very least that should let us avoid an extra allocation per message. But I'll leave that to a follow up as it'll be easier to evaluate once this PR has landed.
Benchmarks
Huge wins for cassandra:
Some wins for kafka:
Redis seems unaffected:
Curious about this, I investigated and found that the batch sizes were also unaffected: both before and after this PR we were getting batch sizes of around 50.