Fix low connection_count performance #1459
Merged
closes #1442
While investigating the low performance of the connection_count=1 benchmarks I observed that the size of the message batches passed through transforms was very low for all benchmarks. Usually a batch contained only a single message, though occasionally it was larger.
This seemed very strange to me and indicated that pipelining of cassandra messages wasn't occurring properly.
This left me with 2 possibilities:

1. The benchmark itself wasn't sending messages concurrently, so there was nothing for shotover to pipeline.
2. Shotover was receiving pipelined messages but failing to batch them as they passed through.
I started by exploring the first possibility. I inspected the bench implementation and found that it has 100 concurrent tasks sending messages, so on a 2 core EC2 instance we should be seeing at least some pipelining occurring.
I then checked the stream_id of the cassandra messages passing through shotover; they were all between 0 and 99, which indicated that the driver was definitely expecting these messages to be pipelined.
To double check this I increased the number of tasks to 1000 and found that the stream_id values were now between 0 and 999, which validated this theory.
So I mostly ruled out the first possibility and started exploring the second.
The way we implemented our codecs was a bit weird and seemed to go against the contract described in the docs: https://docs.rs/tokio-util/latest/tokio_util/codec/trait.Decoder.html#tymethod.decode
Our implementation was returning as many messages as it could from each decode call, while the docs assume that you return a single message and don't mention anything about returning multiple messages.
So it's not clear from the docs what the performance impact of our deviation would be.
But it seemed possible that the tokio codec abstraction was tuned for implementations that always return a single item.
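For reference, here is a minimal sketch of the two decoder shapes. The framing and types are toy placeholders (a u32 length prefix and a hypothetical `Message` wrapper), not shotover's actual cassandra codec; the point is just the difference between `type Item = Message` and `type Item = Vec<Message>`.

```rust
use bytes::{Buf, BytesMut};
use tokio_util::codec::Decoder;

// Hypothetical stand-in for shotover's message type.
struct Message(Vec<u8>);

// The shape the tokio docs assume: at most one frame per decode call.
struct SingleDecoder;

impl Decoder for SingleDecoder {
    type Item = Message;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Message>, Self::Error> {
        // Toy framing: a u32 length prefix followed by the payload.
        if src.len() < 4 {
            return Ok(None); // need more bytes
        }
        let len = u32::from_be_bytes(src[0..4].try_into().unwrap()) as usize;
        if src.len() < 4 + len {
            return Ok(None); // frame not fully buffered yet
        }
        src.advance(4);
        Ok(Some(Message(src.split_to(len).to_vec())))
    }
}

// The shape we had: each decode call drains every complete frame into a batch.
struct BatchingDecoder(SingleDecoder);

impl Decoder for BatchingDecoder {
    type Item = Vec<Message>;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Vec<Message>>, Self::Error> {
        let mut batch = vec![];
        while let Some(msg) = self.0.decode(src)? {
            batch.push(msg);
        }
        Ok(if batch.is_empty() { None } else { Some(batch) })
    }
}
```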
So I moved the batching to the point where we read messages off the decoder task, instead of doing it within the decoder itself.
This gave us the large win of this PR.
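The new batching point looks roughly like the sketch below (simplified, with hypothetical names, not the exact shotover code): await one decoded message, then opportunistically drain whatever the decoder has already produced, so batches still form whenever messages are pipelined.

```rust
use futures::{FutureExt, Stream, StreamExt};

// Wait for at least one item, then greedily collect anything that is
// already decoded and ready without awaiting, forming a natural batch.
async fn read_batch<S, T>(stream: &mut S) -> Option<Vec<T>>
where
    S: Stream<Item = T> + Unpin,
{
    let first = stream.next().await?;
    let mut batch = vec![first];

    // now_or_never polls once: if the next item isn't ready yet we stop
    // and hand the batch to the transform chain instead of waiting.
    while let Some(Some(item)) = stream.next().now_or_never() {
        batch.push(item);
    }
    Some(batch)
}
```

With this shape the decoder stays a plain one-frame-per-call `Decoder`, and the batch size adapts naturally to how many messages are in flight.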
Future work
I would like to investigate replacing `Vec<Message>` with `Message` within the encoder/decoder, as I think that at the very least that should let us avoid an extra allocation per message. But I'll leave that to a follow up as it'll be easier to evaluate once this PR has landed.
Benchmarks
Huge wins for cassandra:
Some wins for kafka:
Redis seems unaffected:
Curious about this, I investigated and found that the batch sizes were also unaffected: both before and after this PR we were getting batch sizes of around 50.