
Fix low connection_count performance #1459

Merged
rukai merged 2 commits into shotover:main from fix_low_connection_count_performance on Feb 8, 2024

Conversation

rukai
Member

@rukai rukai commented Feb 8, 2024

closes #1442

While investigating the low performance of the connection_count=1 benchmarks I observed that the message batches passed through transforms were very small for all benchmarks. Usually a batch contained only a single message, but occasionally they were larger.
This seemed very strange to me and indicated that pipelining of cassandra messages wasn't occurring properly.
This left me with 2 possibilities:

  1. The cassandra driver used in the benchmark isn't properly pipelining messages.
  2. Something between reading client requests off the socket and first processing those requests is preventing proper pipelining/batching of messages.

I started by exploring the first option: I inspected the bench implementation and found that it has 100 concurrent tasks sending messages, so on a 2 core EC2 instance we should see at least some pipelining occurring.
I then checked the stream_id of the cassandra messages passing through shotover; they were all between 0-99, which indicated that the driver was definitely expecting these messages to be pipelined.
To double check this I increased the number of tasks to 1000 and found that the stream_id values were now between 0-999, which validated this theory.

So I mostly ruled out the first possibility and started exploring the second.
The way we implemented codecs was unusual and seemed to go against the contract described in the documentation: https://docs.rs/tokio-util/latest/tokio_util/codec/trait.Decoder.html#tymethod.decode
Our implementation returned as many messages as it could per call, while the docs assume you return a single message and don't mention returning multiple messages at all.
So it's not clear from the docs what the performance impact of our deviation would be.
But it seemed possible that the tokio codec abstraction was tuned for implementations that always return a single item.
So I moved the batching to the point where we read messages off the decoder task, instead of doing it within the decoder.
This gave us the large win of this PR.
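
To make the shape of the change concrete, here is a minimal sketch (not shotover's actual code: the Message type, framing, and channel layout are illustrative stand-ins) of a decoder that returns at most one message per call, paired with batching at the point the decoder task's output is read:

```rust
use std::io;

use bytes::{Buf, BytesMut};
use tokio::sync::mpsc;
use tokio_util::codec::Decoder;

// Illustrative stand-in; shotover's real Message type is more involved.
pub struct Message {
    pub bytes: Vec<u8>,
}

pub struct MessageDecoder;

impl Decoder for MessageDecoder {
    // The codec still yields Vec<Message>, but each decode call now returns at
    // most one message, as the tokio-util Decoder docs assume.
    type Item = Vec<Message>;
    type Error = io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
        // Hypothetical framing: 4-byte big-endian length prefix, then payload.
        if src.len() < 4 {
            return Ok(None);
        }
        let len = u32::from_be_bytes(src[..4].try_into().unwrap()) as usize;
        if src.len() < 4 + len {
            // Not enough data yet; reserve space and wait for another read.
            src.reserve(4 + len - src.len());
            return Ok(None);
        }
        src.advance(4);
        Ok(Some(vec![Message {
            bytes: src.split_to(len).to_vec(),
        }]))
    }
}

/// Batch at the point messages are read off the decoder task (which, in this
/// sketch, forwards individual messages over an mpsc channel): wait for one
/// message, then drain whatever has already arrived so that downstream
/// transforms see a batch rather than a stream of single messages.
pub async fn read_batch(rx: &mut mpsc::Receiver<Message>, max: usize) -> Option<Vec<Message>> {
    let mut batch = vec![rx.recv().await?];
    while batch.len() < max {
        match rx.try_recv() {
            Ok(message) => batch.push(message),
            Err(_) => break,
        }
    }
    Some(batch)
}
```

The idea is that the codec stays within the contract the tokio-util docs describe, while batching is recovered cheaply by draining whatever the channel already holds.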

Future work

I would like to investigate replacing Vec<Message> with Message within the encoder/decoder, as I think that should at the very least let us avoid an extra allocation per message.
But I'll leave that to a follow up as it'll be easier to evaluate once this PR has landed.
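
As a rough illustration of the type-level change that follow up describes (the codec names are hypothetical and the framing is elided):

```rust
use std::io;

use bytes::BytesMut;
use tokio_util::codec::Decoder;

pub struct Message; // placeholder

// Current shape: every decoded message is handed out inside its own Vec,
// costing an allocation per message even when the Vec holds a single item.
pub struct VecMessageCodec;

impl Decoder for VecMessageCodec {
    type Item = Vec<Message>;
    type Error = io::Error;

    fn decode(&mut self, _src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
        // Framing elided; only the Item type matters for this comparison.
        Ok(None)
    }
}

// Possible follow-up shape: return the message directly and let the read side
// build batches, avoiding the per-message Vec allocation.
pub struct SingleMessageCodec;

impl Decoder for SingleMessageCodec {
    type Item = Message;
    type Error = io::Error;

    fn decode(&mut self, _src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
        // Framing elided; only the Item type matters for this comparison.
        Ok(None)
    }
}
```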

Benchmarks

Huge wins for cassandra:
[image: cassandra benchmark results]

Some wins for kafka:
[image: kafka benchmark results]

Redis seems unaffected:
Curious about this, I investigated and found that the batch sizes were also unaffected: both before and after this PR we were getting batch sizes of around 50.
[image: redis benchmark results]

@rukai rukai force-pushed the fix_low_connection_count_performance branch from 721ddd9 to f11fea4 on February 8, 2024 02:28

github-actions bot commented Feb 8, 2024

0 benchmarks regressed. 4 benchmarks improved. Please check the benchmark workflow logs for full details: https://github.com/shotover/shotover-proxy/actions/runs/7824080116

  2 (2.00%) high mild
  5 (5.00%) high severe

kafka_codec/decode_request_metadata
                        time:   [241.13 ns 241.29 ns 241.52 ns]
                        change: [-28.990% -28.540% -28.122%] (p = 0.00 < 0.05)
                        Performance has improved.
--
  1 (1.00%) low severe
  4 (4.00%) high mild
  1 (1.00%) high severe
kafka_codec/decode_request_list_offsets
                        time:   [327.99 ns 331.24 ns 336.11 ns]
                        change: [-24.077% -23.318% -22.456%] (p = 0.00 < 0.05)
                        Performance has improved.
--
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) low severe
  3 (3.00%) high severe
kafka_codec/decode_request_fetch
                        time:   [365.88 ns 366.73 ns 368.10 ns]
                        change: [-23.502% -22.983% -22.463%] (p = 0.00 < 0.05)
                        Performance has improved.
--
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  2 (2.00%) high mild
  3 (3.00%) high severe
kafka_codec/decode_all  time:   [242.44 ns 242.62 ns 242.85 ns]
                        change: [-81.725% -81.665% -81.620%] (p = 0.00 < 0.05)
                        Performance has improved.

@rukai rukai marked this pull request as ready for review February 8, 2024 03:49
@rukai rukai requested a review from conorbros February 8, 2024 03:49
@rukai rukai enabled auto-merge (squash) February 8, 2024 11:36
@rukai rukai merged commit 5bea9e9 into shotover:main Feb 8, 2024
40 checks passed