[QST] Could the cudaStreamSynchronize method be a bottleneck? #105
Hi! What you're measuring isn't the time that the CPU is active. The nvcomp compression functions just queue up the compression kernels to occur on the GPU asynchronously.
Hi, I actually used a misleading time column name. What I am measuring is the system time that elapses between loading the data to be compressed from host to device and retrieving the compressed data from the device. The updated table follows. Here is a short pseudocode sketch showing where I read the system time:
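Roughly like this (a minimal sketch; `timedCompress`, the buffer names, and the commented-out nvcomp call are placeholders rather than my exact code):

```cpp
#include <chrono>
#include <cuda_runtime.h>

// Sketch only: the interval I measure. The nvcomp batched compress call
// is elided; all buffer names are placeholders.
double timedCompress(const void* h_input, size_t inputBytes,
                     void* d_input, void* d_compressed,
                     void* h_compressed, size_t compressedCapacity,
                     cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();

    // 1) Host -> device copy of the raw data (queued, returns immediately)
    cudaMemcpyAsync(d_input, h_input, inputBytes,
                    cudaMemcpyHostToDevice, stream);

    // 2) Queue the compression kernels on the same stream
    // nvcompBatchedZstdCompressAsync(..., stream);

    // 3) Device -> host copy of the compressed output (also queued)
    cudaMemcpyAsync(h_compressed, d_compressed, compressedCapacity,
                    cudaMemcpyDeviceToHost, stream);

    // 4) Block until the stream drains: this is where the CPU waits
    cudaStreamSynchronize(stream);

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```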
I would add that I perform the compression using the approach indicated by the low-level example, as I need to […]. However, to try to understand how to reduce the system time (and thus the bottleneck caused by cudaStreamSynchronize), I used Nsight (as suggested) to see what was happening between the GPU and CPU. The first suggestion says: It seems that one of the suggestions is to appropriately size the number of threads per block. Thank you
That looks like Nsight Compute, instead of Nsight Systems. Sorry, it's a bit confusing that the two different programs are named so similarly. Also, if you look at the Duration column, you'll see that the selected function it's referring to takes a negligible amount of time, so speeding it up wouldn't gain much overall.
The purpose of calling cudaStreamSynchronize is to wait for the GPU stream to finish the kernels queued up in it, so if you don't need the CUDA kernels launched by nvcomp to finish at that point, you don't need to call cudaStreamSynchronize there. Your program could continue running other computations, and call cudaStreamSynchronize when the results of the kernels are needed. As for things you can do to make the compression kernels run faster, it will depend on the kernel and the data, but trying a different number of chunks to split the input data into may or may not help, and LZ4 has a […]
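For example (a minimal sketch of that deferral; `doOtherCpuWork` and `consumeCompressed` are hypothetical placeholders):

```cpp
// Everything below returns immediately on the CPU; the GPU works in the
// background until cudaStreamSynchronize is reached.
cudaMemcpyAsync(d_input, h_input, inputBytes, cudaMemcpyHostToDevice, stream);
// nvcompBatchedLZ4CompressAsync(..., stream);  // queued, not yet finished
cudaMemcpyAsync(h_compressed, d_compressed, compressedCapacity,
                cudaMemcpyDeviceToHost, stream);

doOtherCpuWork();                // hypothetical: overlaps with the GPU kernels

cudaStreamSynchronize(stream);   // wait only when the result is needed
consumeCompressed(h_compressed); // hypothetical: save to disk, send, ...
```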
I call cudaStreamSynchronize on the specific stream exactly when I need the compressed data.
Finally, I copy the compressed data from device to host using the following snippet:
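Roughly like this (a simplified sketch of a typical retrieval with the low-level batched API; the names here are placeholders for my actual buffers):

```cpp
// The actual compressed sizes are written to device memory by the compress
// call, so copy them back first, then copy each compressed chunk.
std::vector<size_t> h_compressed_bytes(batch_size);
cudaMemcpyAsync(h_compressed_bytes.data(), device_compressed_bytes,
                batch_size * sizeof(size_t),
                cudaMemcpyDeviceToHost, _cudaStream);
cudaStreamSynchronize(_cudaStream);  // sizes must be on the host before use

// h_compressed_ptrs: host-side array of the device pointers to each chunk;
// h_compressed_chunks: preallocated host buffers of max compressed size.
for (size_t i = 0; i < batch_size; ++i) {
    cudaMemcpyAsync(h_compressed_chunks[i], h_compressed_ptrs[i],
                    h_compressed_bytes[i],
                    cudaMemcpyDeviceToHost, _cudaStream);
}
cudaStreamSynchronize(_cudaStream);  // compressed data is now host-side
```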
Dear all, is there any suggestion?
Hi @andreamartini
My question is: is it possible to reduce the CPU time required by cudaStreamSynchronize?
Context:
My goal is to use the nvComp library while occupying as little CPU time as possible for compression operations.
For that reason, I measured the CPU time with std::chrono, immediately before the data transfer operations from host to device,
and immediately after retrieving the compressed data from the device. To retrieve the compressed data on the host, in order to
perform CPU-side operations (save compressed data to disk, send it over the network, ...), I have to call cudaStreamSynchronize(_cudaStream), where _cudaStream is the stream used by all cudaMemcpyAsync operations and by the nvcompBatchedZstdCompressAsync (or LZ4, Deflate, and so on) compression method. All operations, i.e. host-to-device transfer, compression, and device-to-host transfer, are performed using asynchronous methods on _cudaStream. The problem is that the CPU time I measure after invoking cudaStreamSynchronize seems high to me, and for my purposes it appears to be the bottleneck.
Some details:
OS: Windows 11, 32 GB RAM
CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
GPU: NVIDIA GeForce GTX 1650
The table above contains the results of compressing data blocks (consisting of a gradually increasing number of images: a 1-frame block, a 10-frame block, a 20-frame block, and a 90-frame block) and the respective CPU times. Each image is 3.9 MB (all images have the same dimensions).
Looking at the "CPU Comp ElapsedTime (ms)" column, you can see that for a 3.9 MB image, compression requires 20 to 30
ms (apart from the first case, which has a very high time that I couldn't explain). This time grows linearly as the file size increases, but the average CPU time per image (the "AvgCPU Compression Time per Image (ms)" column, obtained as CPU time / number of frames) varies between 12 and 17 ms.
Is it possible to reduce the CPU time required by cudaStreamSynchronize? That is, is it possible to get the average CPU time per image
under 5 ms?
I tried creating k async std::threads to compress k images, using a different stream for each thread.
Most likely I'm doing something wrong, as the total time for compressing the k images turns out to be high (500 ms).
Could a multithreaded approach reduce the CPU time caused by cudaStreamSynchronize? A sketch of what I tried follows.
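(Simplified; the Image struct and compressOneImage are placeholders, not my real code.)

```cpp
#include <thread>
#include <vector>
#include <cuda_runtime.h>

struct Image { const void* data; size_t bytes; };  // placeholder type

// Hypothetical: queues the H2D copy, nvcomp compression, and D2H copy
// for one image on the given stream.
void compressOneImage(const Image& img, cudaStream_t stream);

void compressBatch(const std::vector<Image>& images)
{
    std::vector<cudaStream_t> streams(images.size());
    std::vector<std::thread> workers;

    for (size_t i = 0; i < images.size(); ++i) {
        cudaStreamCreate(&streams[i]);
        workers.emplace_back([&, i] {
            compressOneImage(images[i], streams[i]);
            cudaStreamSynchronize(streams[i]);  // each thread waits on its own stream
        });
    }
    for (auto& w : workers) w.join();
    for (auto& s : streams) cudaStreamDestroy(s);
}
```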
Thank you in advance
Andrea