doca tcp frame builder performance improvement #10

Open
masaruito110 opened this issue Apr 17, 2024 · 3 comments
masaruito110 commented Apr 17, 2024

Purpose

Based on commit d0ea36e, I measured the performance of docagpunetio.

The current server structure is shown below.

|doca flow| --------(tcp dst_port 1234)--> |frame builder|
                |---(tcp dst_port 1235)--> |frame builder|
                    : (number of server instances I specified)

Frame builder structure:

|receive_tcp|<--semaphore-->|send_ack| <--semaphore-->|makeframe|<--semaphore-->|notify frame built|

receive_tcp polls doca_gpu_dev_eth_rxq_receive_warp. send_ack sends an ACK to the client and calculates the latest sequence number for makeframe. makeframe builds frames using cudaMemcpyAsync.
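
As a rough host-side analogy of this four-stage chain (not the actual GPU implementation), the handoff between stages can be sketched with Python threads and counting semaphores. The stage names come from the diagram above; `run_pipeline`, `num_items` and the shared log are hypothetical, and the real stages are CUDA kernels synchronized through DOCA semaphores, which this sketch does not model.

```python
import threading

STAGES = ["receive_tcp", "send_ack", "makeframe", "notify_frame_built"]


def run_pipeline(num_items=8):
    """Run a toy 4-stage pipeline; return the order in which work was done."""
    # sems[i] is released by stage i and acquired by stage i + 1.
    sems = [threading.Semaphore(0) for _ in range(len(STAGES) - 1)]
    log = []
    log_lock = threading.Lock()

    def stage(idx):
        for item in range(num_items):
            if idx > 0:
                sems[idx - 1].acquire()      # wait for the upstream stage
            with log_lock:
                log.append((STAGES[idx], item))
            if idx < len(sems):
                sems[idx].release()          # hand the item downstream

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(len(STAGES))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for name, item in run_pipeline(3):
        print(name, item)
```

Note that plain counting semaphores give no backpressure: in this toy version receive_tcp can run arbitrarily far ahead of makeframe, which a fixed-size receive buffer would not tolerate.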

Based on several trials, the throughput depends on how often the client checks for ACKs and on the number of sessions.

Environment

1
|connectx7 on PCIe4|<------>|connectx7 on PCIe3|
|A100 40GB GPU on PCIe4|

2
|connectx7 on PCIe4|<------>|connectx7 on PCIe3|
                    |<----->|connectx6 on PCIe3|
|A100 40GB GPU on PCIe4|

Result

Here are the results.
env is the setup number from the Environment section.
process is the number of processes, and session/process is the number of sessions per process. For example, when process is 2 and session/process is 1, the total number of sessions is 2. chunk size is the number of bytes the client sends before each check for an ACK from the server. Gbps/session is the throughput per session.

Theoretically, when env is 1, the total throughput is 100 Gbps, so we expect 100 Gbps with 1 session and 50 Gbps/session with 2 sessions. When env is 2, we use 2 ports of the ConnectX-7, so the total throughput is 200 Gbps.
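
The expectation above is simple arithmetic, sketched here for reference. The function name and the `port_gbps` parameter are hypothetical; the 100 Gbps per port figure and the port counts (1 for env 1, 2 for env 2) are taken from the paragraph above, and the sketch assumes the link is shared evenly between sessions.

```python
def expected_gbps_per_session(ports, processes, sessions_per_process, port_gbps=100.0):
    """Ideal per-session throughput if the link is shared evenly."""
    total_sessions = processes * sessions_per_process
    return ports * port_gbps / total_sessions


# env 1 has one 100 Gbps port, env 2 uses two ports.
print(expected_gbps_per_session(ports=1, processes=1, sessions_per_process=1))
print(expected_gbps_per_session(ports=1, processes=1, sessions_per_process=2))
print(expected_gbps_per_session(ports=2, processes=2, sessions_per_process=1))
```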

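| env | process | session/process | chunk size [MByte] | Gbps/session |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 38.9 |
| 1 | 1 | 1 | 2 | 39.39 |
| 1 | 1 | 1 | 4 | 41.72 |
| 1 | 1 | 1 | 8 | 43.46 |
| 1 | 1 | 1 | 16 | 44.36 |
| 1 | 1 | 2 | 1 | 6 |
| 1 | 2 | 1 | 1 | 18.06 |
| 2 | 2 | 1 | 1 | 17.52 |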

With env 1, process 1 and session/process 1, chunk sizes over 16 MByte do not work because the cyclic buffer handled by DOCA gets overwritten. As the chunk size increases, the throughput improves. This indicates that the average RTT is long: the client stalls while it waits for each ACK, and larger chunks amortize that wait.
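
To illustrate why a long ACK wait shows up as a chunk-size dependence, here is a minimal stop-and-wait model. It is an assumption for illustration, not a description of the actual client: the client sends one chunk at line rate, then idles for a fixed `stall_us` before sending the next. The function name, `stall_us`, and the example stall value are hypothetical, and 1 MByte is taken as 10^6 bytes.

```python
def stop_and_wait_gbps(chunk_mbytes, line_rate_gbps, stall_us):
    """Throughput when each chunk is followed by a fixed idle time."""
    chunk_bits = chunk_mbytes * 8e6                   # assumes 1 MByte = 10^6 bytes
    transfer_s = chunk_bits / (line_rate_gbps * 1e9)  # time on the wire
    stall_s = stall_us * 1e-6                         # time waiting for the ACK
    return chunk_bits / (transfer_s + stall_s) / 1e9


if __name__ == "__main__":
    # The 80 usec stall is an arbitrary example value, not a measurement.
    for chunk in (1, 2, 4, 8, 16):
        print(chunk, round(stop_and_wait_gbps(chunk, 100.0, 80.0), 2))
```

In this model the throughput rises toward the line rate as the chunk grows, because the fixed wait is spread over more bytes.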

With env 1 or 2, process 2 and session/process 1, each session gets about half the throughput of the (env 1, process 1, session/process 1) case. We expected the per-session throughput to stay the same because NIC bandwidth, PCIe bandwidth and GPU device memory bandwidth are sufficient. So there appear to be one or more limitations in the DOCA library.

With env 1, process 1 and session/process 2, we only get 6 Gbps per session. We expected the same result as with process 2 and session/process 1. CUDA kernels may be affecting other kernels in the same process.
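
One way to read the comparisons above is to convert the per-session figures into aggregate throughput. The sketch below does that for the 1 MByte rows of the table; the helper name is hypothetical and the numbers are copied from the table.

```python
# (env, process, session/process, Gbps/session) for the 1 MByte chunk rows.
ROWS_1MB = [
    (1, 1, 1, 38.9),
    (1, 1, 2, 6.0),
    (1, 2, 1, 18.06),
    (2, 2, 1, 17.52),
]


def aggregate_gbps(process, sessions_per_process, gbps_per_session):
    """Total throughput across all sessions of all processes."""
    return process * sessions_per_process * gbps_per_session


if __name__ == "__main__":
    for env, proc, spp, gbps in ROWS_1MB:
        total = aggregate_gbps(proc, spp, gbps)
        print(f"env={env} process={proc} session/process={spp}: {total:.2f} Gbps total")
```

Read this way, the two-process runs total about 36 and 35 Gbps, close to the 38.9 Gbps of the single-session run, while the run with two sessions in one process totals only 12 Gbps.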

@masaruito110 masaruito110 self-assigned this Apr 17, 2024

masaruito110 commented Apr 18, 2024

Additional report.

In this run, the frame builder skips frame building and only sends ACKs.

| env | process | session/process | chunk size [MByte] | Gbps/session |
|---|---|---|---|---|
| 1 | 1 | 1 | 32 | 90.93 |
| 1 | 1 | 2 | 32 | 49.1 |
| 2 | 2 | 1 | 32 | 55.59 |
| 2 | 2 | 1 | 128 | 88.88 |

The results for (env, process, session/process) = (1, 1, 1) and (1, 1, 2) are close to the theoretical performance.
For (2, 2, 1), we got high performance when the chunk size is 128 MByte. So with multiple sessions the latency gets worse, but high throughput is still potentially achievable.
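
The latency observation can be made a little more concrete by estimating the extra time spent per chunk relative to an ideal transfer. This is a rough sketch under stated assumptions: the ideal per-session rates (100, 50, 100 and 100 Gbps) follow the even-sharing expectation stated earlier in this issue, 1 MByte is taken as 10^6 bytes, and the function name is hypothetical.

```python
def per_chunk_overhead_us(chunk_mbytes, measured_gbps, ideal_gbps):
    """Extra time per chunk, in usec, compared with sending at the ideal rate."""
    chunk_gbit = chunk_mbytes * 8e6 / 1e9   # assumes 1 MByte = 10^6 bytes
    measured_s = chunk_gbit / measured_gbps
    ideal_s = chunk_gbit / ideal_gbps
    return (measured_s - ideal_s) * 1e6


# (env, process, session/process, chunk MByte, measured Gbps/session, ideal Gbps/session)
ROWS = [
    (1, 1, 1, 32, 90.93, 100.0),
    (1, 1, 2, 32, 49.1, 50.0),
    (2, 2, 1, 32, 55.59, 100.0),
    (2, 2, 1, 128, 88.88, 100.0),
]

if __name__ == "__main__":
    for env, proc, spp, chunk, measured, ideal in ROWS:
        extra = per_chunk_overhead_us(chunk, measured, ideal)
        print(f"({env}, {proc}, {spp}) chunk={chunk} MByte: {extra:.0f} usec extra per chunk")
```

Under these assumptions the (2, 2, 1) runs lose on the order of milliseconds per chunk, against a few hundred microseconds or less for the single-process runs.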


masaruito110 commented Apr 19, 2024

Additional report, same as #12 (comment)

I ran a heavy memcpy kernel alongside the ACK-only app (frame builder without frame building).

The results are below. We measured the performance with one or two heavy memcpy kernels running. With a large enough chunk size we still get high performance, but with chunk sizes at which frame building can work, the throughput is low and similar to the actual frame building results. Also, the more heavy memcpy kernels run, the more the performance drops.

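| env | process | session/process | chunk size [MByte] | Gbps/session with one heavy memcpy | Gbps/session with two heavy memcpy |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 27.82 | 17.72 |
| 1 | 1 | 1 | 2 | 34.63 | 23.08 |
| 1 | 1 | 1 | 4 | 41.54 | 27.69 |
| 1 | 1 | 1 | 8 | 41.56 | 27.69 |
| 1 | 1 | 1 | 16 | 55.43 | 36.93 |
| 1 | 1 | 1 | 32 | 55.44 | 36.96 |
| 1 | 1 | 1 | 40 | 69.21 | 46.16 |
| 1 | 1 | 1 | 48 | 83.13 | 55.46 |
| 1 | 1 | 1 | 64 | 83.13 | 73.78 |
| 1 | 1 | 1 | 128 | 88.7 | 88.53 |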

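Performance without heavy memcpy:

| env | process | session/process | chunk size [MByte] | Gbps/session |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 63.02 |
| 1 | 1 | 1 | 2 | 72.31 |
| 1 | 1 | 1 | 4 | 81.92 |
| 1 | 1 | 1 | 8 | 91.04 |
| 1 | 1 | 1 | 16 | 89.52 |
| 1 | 1 | 1 | 32 | 91.13 |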
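
To compare the loaded and unloaded runs directly, the sketch below computes the throughput with heavy memcpy as a fraction of the baseline, for the chunk sizes present in both tables. The numbers are copied from the tables in this comment; the dictionary and function names are hypothetical.

```python
# chunk size [MByte] -> Gbps/session
BASELINE = {1: 63.02, 2: 72.31, 4: 81.92, 8: 91.04, 16: 89.52, 32: 91.13}
ONE_MEMCPY = {1: 27.82, 2: 34.63, 4: 41.54, 8: 41.56, 16: 55.43, 32: 55.44}
TWO_MEMCPY = {1: 17.72, 2: 23.08, 4: 27.69, 8: 27.69, 16: 36.93, 32: 36.96}


def relative_throughput(loaded, baseline):
    """Fraction of the baseline throughput kept under load, per chunk size."""
    return {chunk: loaded[chunk] / baseline[chunk] for chunk in baseline if chunk in loaded}


if __name__ == "__main__":
    one = relative_throughput(ONE_MEMCPY, BASELINE)
    two = relative_throughput(TWO_MEMCPY, BASELINE)
    for chunk in sorted(one):
        print(f"{chunk:>3} MByte: one memcpy {one[chunk]:.2f}, two memcpy {two[chunk]:.2f}")
```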


masaruito110 commented Apr 24, 2024

Performance on a VMware VM.

The VM has an RTX A6000 (Ampere architecture), a ConnectX-6, PCIe Gen2, and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz.

| process | session/process | chunk size [MByte] | Gbps/session |
|---|---|---|---|
| 1 | 1 | 1 | 46.38 |
| 1 | 1 | 2 | 55.52 |
| 1 | 1 | 4 | 60 |
| 1 | 1 | 8 | 60 |

RTT performance

Each chunk size was sent 16k times. RTT values are in usec.

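| chunk size [MByte] | RTT avg [usec] | RTT stddev [usec] | RTT min [usec] | RTT max [usec] |
|---|---|---|---|---|
| 1 | 147.4199 | 11.37497 | 130.888 | 242.101 |
| 2 | 265.0977 | 11.41619 | 240.277 | 319.813 |
| 4 | 486.8668 | 19.04014 | 454.549 | 549.641 |
| 8 | 945.3377 | 30.31007 | 900.802 | 1025.787 |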
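
The average RTT grows almost linearly with chunk size, so a straight-line fit separates a fixed cost from a per-MByte cost. The sketch below is an ordinary least-squares fit over the four averages in the table; the function name is hypothetical, and treating the relationship as linear is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


CHUNK_MBYTES = [1, 2, 4, 8]
RTT_AVG_US = [147.4199, 265.0977, 486.8668, 945.3377]

if __name__ == "__main__":
    slope, intercept = fit_line(CHUNK_MBYTES, RTT_AVG_US)
    print(f"about {slope:.1f} usec per MByte plus {intercept:.1f} usec fixed")
```

With these four points the fit gives roughly 114 usec per MByte plus a fixed cost of roughly 35 usec.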
