
Psync Zero-Copy Transmission #1335

Open · yairgott opened this issue Nov 21, 2024 · 7 comments

@yairgott commented Nov 21, 2024

Problem Statement

In the current design, the primary maintains a replication buffer to record mutation commands for syncing the replicas. This replication buffer is implemented as a linked list of chunked buffers. The primary periodically transmits these recorded commands to each replica by issuing socket writes on the replica connections, which involve copying data from the user-space buffer to the kernel.
The transmission is performed by the writeToReplica function, which uses connWrite to send data over the socket.

This user-space-to-kernel buffer copy consumes CPU cycles and increases the memory footprint. The overhead becomes more noticeable when a replica lags significantly behind the primary, as psync triggers a transmission burst. This burst may temporarily reduce the primary's responsiveness, with excessive copying and potential TCP write buffer exhaustion being major contributing factors.

Proposal

Modern Linux systems support zero-copy transmission, which operates by:

  • The user-space provides a buffer directly to the kernel for transmission.
  • The kernel uses this buffer without duplicating it.
  • Transmission progress is communicated back to user space through notifications; epoll can be used as the notification mechanism (a minimal sketch of this interface follows the list).
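For concreteness, the flow above corresponds to the Linux MSG_ZEROCOPY socket interface. Below is a minimal sketch of the send side, using only the standard kernel API (this is generic illustration, not Valkey code, and error handling is trimmed):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* provided by newer kernel headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* provided by newer kernel headers */
#endif

/* Opt the connected TCP socket into zero-copy; returns 0 on success. */
static int enable_zero_copy(int fd) {
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Hand the buffer to the kernel without copying it. The buffer must stay
 * alive and unmodified until the matching completion notification is read
 * from the socket's error queue. */
static ssize_t zero_copy_send(int fd, const void *buf, size_t len) {
    ssize_t n = send(fd, buf, len, MSG_ZEROCOPY);
    if (n < 0 && errno == ENOBUFS) {
        /* The kernel could not pin the pages right now; the caller can
         * retry later or fall back to a regular copying send(). */
    }
    return n;
}
```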

The primary downside of zero-copy is the need for user space to manage the send buffer. However, this limitation is much less applicable to the psync use case, as Valkey already manages the psync replication buffers.

It’s important to note that using zero-copy for psync requires careful adjustments to the replica client write buffer management logic. Specifically, the logic must ensure that the total accumulated replication write buffer size, across all replica connections, stays within the value of client-output-buffer-limit replica.
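A rough illustration of that accounting, with hypothetical names (none of these identifiers exist in Valkey today): bytes handed to the kernel via zero-copy are still owned by user space until their completion notification arrives, so they should keep counting against the limit.

```c
#include <stddef.h>

/* Hypothetical per-replica accounting sketch; names are illustrative only. */
typedef struct {
    size_t buffered_bytes;      /* replication data queued but not yet written */
    size_t zerocopy_in_flight;  /* bytes sent with MSG_ZEROCOPY, completion pending */
} replica_output_accounting;

/* The client-output-buffer-limit replica check would need to treat in-flight
 * zero-copy bytes as still buffered, since their memory cannot be freed until
 * the kernel signals completion. */
static int replica_over_output_limit(const replica_output_accounting *acct,
                                     size_t hard_limit_bytes) {
    return acct->buffered_bytes + acct->zerocopy_in_flight > hard_limit_bytes;
}
```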

Further reading on zero-copy can be found here.
Note that this article states that zero-copy is most effective for large payloads, and experimentation is necessary to determine the minimum payload size. For Memorystore vector search cluster communication, enabling zero-copy in gRPC improved QPS by approximately 8.6%.

Zero-Copy Beyond Psync

Zero-copy can also optimize transmission to clients. In the current implementation, dictionary entries are first copied into the client object's write buffer and then copied again during transmission to the client socket, resulting in two memory copies. Using zero-copy eliminates the client socket copy.
Similarly to the psync use case, implementing zero-copy for client transmission requires careful adjustments to the client’s write buffer management logic. The following considerations, while not exhaustive, outline key aspects to address:

  1. Adjustments to the logic that caps the accumulated client send buffer size may be required, as in the psync use case. Note that this upper limit is controlled by the client-output-buffer-limit normal parameter.
  2. Currently, for large dictionary entries, transmission to the client involves multiple iterations of copying chunks of data from the dictionary entry into the client send buffer, and subsequently from the send buffer to the socket. If the socket's TCP send buffer becomes full, the copying process is suspended until send buffer space becomes available, preventing excessive memory usage.
    Since zero-copy doesn’t consume TCP send buffers, excessive memory usage must be prevented by other means. One approach is to defer copying the next portion of the dictionary entry until a confirmation is received that a significant part of the pending write data has been delivered to the client (a minimal gating sketch follows this list).
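One possible shape for the deferral described in point 2, again with hypothetical names: copy the next chunk of the dictionary entry only while the amount of zero-copy data awaiting completion stays below a watermark, and resume once completion notifications drain it.

```c
#include <stddef.h>

/* Hypothetical flow-control sketch for streaming a large value to a client. */
typedef struct {
    size_t zerocopy_pending;        /* bytes sent, completion not yet received */
    size_t pending_high_watermark;  /* stop producing new chunks above this */
} client_stream_state;

/* Called before copying the next chunk of the dictionary entry. */
static int can_emit_next_chunk(const client_stream_state *st) {
    return st->zerocopy_pending < st->pending_high_watermark;
}

/* Called when a zero-copy completion notification is processed. */
static void on_zero_copy_completed(client_stream_state *st, size_t bytes) {
    st->zerocopy_pending = (bytes >= st->zerocopy_pending)
                               ? 0
                               : st->zerocopy_pending - bytes;
}
```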
@murphyjacob4 (Contributor) commented:

I think zero-copy is pretty promising. As you mentioned, zero-copy isn't a "free lunch" though: there is some overhead in tracking references, registering for new events in the event loop for the SO_EE_ORIGIN_ZEROCOPY messages, and pinning pages in kernel space. The documentation recommends against it for small writes, but then again, on a single-threaded architecture there may be less contention for the page pinning, so it may do better than they expect.
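For reference, the SO_EE_ORIGIN_ZEROCOPY notifications mentioned here are read off the socket's error queue, which epoll reports as EPOLLERR on the connection. A generic sketch of that kernel interface (not the prototype's actual code):

```c
#include <linux/errqueue.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Drain pending zero-copy completions. Each notification carries a counter
 * range [ee_info, ee_data]: every MSG_ZEROCOPY send on the socket increments
 * the counter by one, so the range identifies which sends have completed and
 * whose buffers may now be released or reused. */
static void drain_zero_copy_completions(int fd) {
    for (;;) {
        char control[128];
        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) break; /* EAGAIN: queue empty */

        for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
             cmsg = CMSG_NXTHDR(&msg, cmsg)) {
            struct sock_extended_err *serr =
                (struct sock_extended_err *)CMSG_DATA(cmsg);
            if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                continue;
            uint32_t first = serr->ee_info, last = serr->ee_data;
            /* SO_EE_CODE_ZEROCOPY_COPIED in serr->ee_code means the kernel
             * fell back to copying for this range; the send still succeeded. */
            (void)first; (void)last; /* hand the range to buffer bookkeeping */
        }
    }
}
```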

I think it makes sense to proceed with a prototype and get some benchmark information on how this changes PSYNC/replication streaming performance. It could be the case it doesn't improve much, or it could be a big improvement.

I have something in the works - will add a comment when I have some data to share

@murphyjacob4 (Contributor) commented:

So I have a prototype where I enable zero copy on outgoing replication links. I can post a draft PR soon.

I did some local testing on my development machine. Test setup is as follows:

  1. Run a test Valkey server on port 6379 (primary) and on port 6380 (replica):
    a. Primary: valkey-server --port 6379 --save "" --client-output-buffer-limit "replica 0 0 0"
    b. Replica: valkey-server --replicaof localhost 6379 --port 6380 --save "" --repl-diskless-load swapdb
    c. I use client-output-buffer-limit "replica 0 0 0" to prevent the replica from being disconnected when I send a lot of write traffic
  2. Run Memtier to fill with lots of data quickly:

    ```
    memtier_benchmark --protocol=redis --server=localhost --port=6379 \
      --key-maximum=100000 --data-size=409600 --pipeline=64 --randomize \
      --clients=6 --threads=6 --requests=allkeys --ratio=1:0 --key-pattern=P:P
    ```

  3. Measure the time to fill the primary, and the time for the replica to fully catch up (by checking info replication)

What I found is the following:

| Setting | Time to write keys to primary | Time for replica to catch up | Total time |
| --- | --- | --- | --- |
| Zero Copy Off | 48.0846 sec | 14.1706 sec | 62.2552 sec |
| Zero Copy On | 49.0943 sec | 1.1838 sec | 50.2781 sec |
| Delta | +2.1% | -91.6% | -19.24% |

I want to test this on a network interface that isn't loopback next. I am guessing things may look a bit different if we are actually going over the wire.

@yairgott (Author) commented:

Thanks for picking this up and coming up with a working prototype so quickly!

The numbers are showing solid improvement and I'm curious to see the numbers over the network interface.

Can you also capture the memory footprint while the replica is catching up?

@murphyjacob4 (Contributor) commented Nov 26, 2024

Tested in a cloud deployment on GCP with two n2-highmem-2 machines.

This time, I use a key space of 4 GiB (10,000 keys of 400 KiB each) and use a test time in Memtier of 30 seconds. This way, we can measure the write throughput against the primary and estimate the write throughput on the replication stream by how long it takes to catch up:

| Setting | Operations | Time to Insert on Primary | Time to Reflect on Replica (including time to insert on primary) | Primary Throughput | Replica Throughput |
| --- | --- | --- | --- | --- | --- |
| Zero Copy Off | 92717 ops | 30 sec | 31.4452 sec | 3090.566667 ops/s | 2948.526325 ops/s |
| Zero Copy On | 105922 ops | 30 sec | 35.943 sec | 3530.733333 ops/s | 2946.943772 ops/s |
| Delta | | | | +14.24% | -0.05% |

I think these results make sense. We are able to increase primary throughput by removing the data copy to the TCP kernel buffer, while the replica throughput is not significantly changed.

@murphyjacob4 (Contributor) commented:

Across different data sizes:

| Zero Copy On/Off | Key Size | Primary Throughput (ops/s) | Replica Throughput (ops/s) | Delta |
| --- | --- | --- | --- | --- |
| Off | 1 KiB | 298213.93 | 298213.93 | 0% |
| On | 1 KiB | 291036.33 | 291036.33 | -2.41% |
| Off | 4 KiB | 120160.97 | 120160.97 | 0% |
| On | 4 KiB | 118990.6 | 118990.6 | -0.97% |
| Off | 40 KiB | 18100.68 | 18100.68 | 0% |
| On | 40 KiB | 19097.21 | 19097.21 | 5.51% |
| Off | 400 KiB | 3090.56 | 2948.52 | 0% |
| On | 400 KiB | 3530.73 | 2946.94 | 14.24% |

@murphyjacob4 (Contributor) commented:

Given the data I've collected so far and the advice from the Linux documentation, I think it would make sense to limit zero-copy to writes over a certain size (e.g. 10 KiB). I'll update my prototype with this logic and see what default makes sense, but it may vary on a per-deployment basis, so making it a parameter (e.g. zero_copy_min_write_size) might make sense.
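A sketch of how such a threshold could gate the send path (zero_copy_min_write_size is the parameter name proposed above; the helper is hypothetical and assumes SO_ZEROCOPY was already enabled on the socket):

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical config value, in bytes (10 KiB per the experiments below). */
static size_t zero_copy_min_write_size = 10 * 1024;

/* Small writes take the normal copying path, where per-send page pinning and
 * completion tracking would cost more than the copy they avoid; only writes
 * at or above the threshold use MSG_ZEROCOPY, which also means only those
 * writes need completion bookkeeping. */
static ssize_t conn_write_maybe_zero_copy(int fd, const void *buf, size_t len) {
    int flags = (len >= zero_copy_min_write_size) ? MSG_ZEROCOPY : 0;
    return send(fd, buf, len, flags);
}
```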

@murphyjacob4 (Contributor) commented:

I ran some experiments with 10 KiB as the minimum write size to activate TCP zero copy, and it seems to be a net positive:

| Zero Copy On | Data size (B) | Key count | Primary Ops/s | Replica Ops/s | Primary Delta | Replica Delta |
| --- | --- | --- | --- | --- | --- | --- |
| FALSE | 1024 | 4194304 | 247954.49 | 247954.49 | 0.00% | 0.00% |
| TRUE | 1024 | 4194304 | 253675.76 | 253675.76 | 2.31% | 2.31% |
| FALSE | 4096 | 1048576 | 115197.56 | 115197.56 | 0.00% | 0.00% |
| TRUE | 4096 | 1048576 | 120152.36 | 120152.36 | 4.30% | 4.30% |
| FALSE | 10240 | 419430 | 62259.01 | 62259.01 | 0.00% | 0.00% |
| TRUE | 10240 | 419430 | 67037.02 | 67037.02 | 7.67% | 7.67% |
| FALSE | 40960 | 104857 | 16429.05 | 16429.05 | 0.00% | 0.00% |
| TRUE | 40960 | 104857 | 19331.41 | 19331.41 | 17.67% | 17.67% |
| FALSE | 102400 | 41943 | 8345.53 | 8345.53 | 0.00% | 0.00% |
| TRUE | 102400 | 41943 | 9682.31 | 9682.31 | 16.02% | 16.02% |
| FALSE | 409600 | 10485 | 2900.53 | 2772.36 | 0.00% | 0.00% |
| TRUE | 409600 | 10485 | 3389.10 | 2816.00 | 16.84% | 1.57% |

Limiting zero-copy to writes over the threshold also improves large-write performance, likely because large writes can get split across the borders of replication blocks. The performance boost at the <10 KiB data sizes could be attributed to batching of writes, or could just be transient differences in hardware (this runs in a cloud environment, so there could be some discrepancy there).

I'll put the performance testing code here for future reference (it is a python script that uses Memtier and redis-cli):

zero_copy_perf.zip
