doca udp frame builder performance improvement #12

Open
masaruito110 opened this issue Apr 19, 2024 · 1 comment

masaruito110 commented Apr 19, 2024

# Purpose

Based on commit 7de28d5, I measured the performance of docagpunetio.

Current server structure

|doca flow| ---------> |frame builder|

frame builder structure

|receive_tcp|<--semaphore-->|makeframe|<--semaphore-->|notify frame built|

# Environment

The environment is the same as in #10.

# Result

Unlike #10, the chunk size cannot be set here, because UDP has no ACK to check. Hence, only the maximum performance is shown. The trend looks the same as in #10.

| env | process | session/process | Gbps/session |
|-----|---------|-----------------|--------------|
| 1 | 1 | 1 | 43 |
| 2 | 2 | 1 | 18 |
@masaruito110

I ran a simple app that performs heavy memory copies while the frame builder runs.

The result is similar to that of the multiple-session case.

| env | process | session/process | Gbps/session |
|-----|---------|-----------------|--------------|
| 1 | 1 | 1 | 18 |

DOCA performance seems to be affected by other copy-heavy kernels.

The simple app is below.

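#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Load generator: the threads of one block keep issuing device-to-device copies forever.
// Calling cudaMemcpyAsync from device code requires relocatable device code (nvcc -rdc=true).
__global__ void heavy_memcpy(uint8_t* dst, uint8_t* src, size_t chunk, size_t frame_size)
{
    size_t cnt = 0;
    while (true) {
        cnt++;
        // Report progress once every 1000 passes, from a single thread.
        if (cnt % 1000 == 0 && threadIdx.x == 0) {
            printf("copying %llu\n", (unsigned long long)cnt);
        }
        // Threads stride over the chunks of the frame.
        for (size_t i = threadIdx.x; i < frame_size / chunk - 1; i += blockDim.x) {
            cudaMemcpyAsync(dst + i * chunk, src + i * chunk, chunk, cudaMemcpyDeviceToDevice);
        }
    }
}

void heavy_memcpy_cpu()
{
    uint8_t* dst;
    uint8_t* src;

    // Two 4 GiB device buffers, copied in 8000-byte chunks.
    size_t frame_size = (size_t)4 * 1024 * 1024 * 1024;
    size_t chunk = 8000;

    cudaMalloc((void**)&dst, frame_size);
    cudaMalloc((void**)&src, frame_size);

    // One block of 1024 threads; the kernel never returns, so this call blocks.
    heavy_memcpy<<<1, 1024>>>(dst, src, chunk, frame_size);
    cudaDeviceSynchronize();
}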
