
[QST] Could the cudaStreamSynchronize method be a bottleneck? #105

Open
andreamartini opened this issue Jul 3, 2024 · 9 comments
Labels: question (further information is requested), inactive-30d

@andreamartini

My question is: is it possible to reduce the CPU time required by cudaStreamSynchronize?

Context:
My goal is to use the nvComp library for compression while occupying as little CPU time as possible. For that reason, I measured the CPU time with std::chrono immediately before the data transfer from host to device, and immediately after retrieving the compressed data on the host. To retrieve the compressed data on the host side, in order to perform CPU-side operations (saving the compressed data to disk, sending it over the network, ...), I have to call cudaStreamSynchronize(_cudaStream), where _cudaStream is the stream used by all cudaMemcpyAsync operations and by the nvcompBatchedZstdCompressAsync (or LZ4, Deflate, and so on) compression method. All operations, i.e. the host-to-device transfer, the compression, and the device-to-host transfer, are performed asynchronously on _cudaStream. The problem is that the CPU time I measure after invoking cudaStreamSynchronize seems high to me, and for my purposes it appears to be the bottleneck.

Some details:
OS: Windows 11, 32 GB RAM
CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
GPU: NVIDIA GeForce GTX 1650
[image: table of compression timing results]

The table above contains the results of compressing data blocks made of a gradually increasing number of images (1-frame, 10-frame, 20-frame, and 90-frame blocks) and the respective CPU times. Each image is 3.9 MB (all images have the same dimensions).

Looking at the "CPU Comp ElapsedTime (ms)" column, you can see that for a 3.9 MB image, compression takes 20 to 30 ms (apart from the first case, which has a very high time that I couldn't explain). This time grows linearly as the total size increases, but the average CPU time per image (the "AvgCPU Compression Time per Image (ms)" column, obtained as CPU time / number of frames) varies between 12 and 17 ms.

Is it possible to reduce the CPU time required by cudaStreamSynchronize? That is, is it possible to get the average CPU time per image under 5 ms?

I tried creating k std::threads to compress k images concurrently, using a different stream for each thread, roughly as sketched below.
Most likely I'm doing something wrong, because the total time for compressing the k images turns out to be high (500 ms).
Could a multithreaded approach reduce the CPU time spent in cudaStreamSynchronize?
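(A simplified sketch of the attempt, with placeholder names such as compressOneImage, images, and k; this is not my exact code.)

```
#include <thread>
#include <vector>

// One thread per image, each with its own CUDA stream (placeholder names).
std::vector<std::thread> workers;
for (int t = 0; t < k; ++t)
{
    workers.emplace_back([&, t]
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // H2D copy, nvcompBatched*CompressAsync, and D2H copy all on `stream`
        compressOneImage(images[t], stream);   // hypothetical helper

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    });
}
for (auto& w : workers)
    w.join();
```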

Thank you in advance
Andrea

andreamartini added the "? - Needs Triage" and "question" labels on Jul 3, 2024
@ndickson-nvidia
Contributor

Hi! What you're measuring isn't the time that the CPU is active. The nvcomp compression functions just queue up the compression kernels to occur on the GPU asynchronously. cudaStreamSynchronize simply waits for the GPU stream to finish all of those queued GPU computations. If you profile your program using Nsight Systems on the command line and then open the report file using the Nsight Systems user interface, you should be able to get a better view of what's going on between the GPU and CPU code.

@andreamartini
Author

Hi, I actually used a misleading name for the time column. What I am measuring is the wall-clock time that elapses between starting to load the data to be compressed from host to device and retrieving the compressed data from the device. Here is the updated table:

[image: updated table of timing results]

Here is a short pseudocode snippet that shows where I read the system time:


```
// start timer
auto tStartCompression = std::chrono::steady_clock::now();

// copy data from host to device using async copy
// ...

// call the compression algorithm, e.g.:
nvcompStatus_t BatchData::_runZStandardCompression()
{
    return nvcompBatchedZstdCompressAsync(
        _pDevice_inputData_as_Chunks_ptrs,
        _pDevice_inputData_chunks_size,
        _stChunkSize,              // the maximum chunk size
        _stNumberOfChunks,
        _pDevice_temp_space_ptr,
        _stTemp_bytes,
        _pDevice_outputCompressed_Chunks_ptrs,
        _pDevice_compressed_chunks_size,
        _nvcompBatchedZStdOpts,
        _cudaStream);
}

// get the compressed data back from device to host, using async copy
// ...

// MUST synchronize GPU and CPU code before reading the compressed data
cudaStreamSynchronize(_cudaStream);   // it seems to be the reason for the bottleneck

// stop timer
auto tEndCompression = std::chrono::steady_clock::now();
```

I would add that I perform the compression using the approach shown in the low-level example, because I need to decompress on another PC without using nvcomp, and therefore I need to recover the size of each compressed chunk (I have not found anything in the high-level example that allows me to do this).
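For reference, the setup before the compress call looks roughly like this (a simplified sketch assuming the usual nvcomp batched-API query functions; allocation error checking omitted):

```
// Sketch of the low-level batched Zstd setup (simplified).
size_t temp_bytes = 0;
nvcompBatchedZstdCompressGetTempSize(
    _stNumberOfChunks, _stChunkSize, _nvcompBatchedZStdOpts, &temp_bytes);
cudaMalloc(&_pDevice_temp_space_ptr, temp_bytes);

size_t max_out_bytes = 0;
nvcompBatchedZstdCompressGetMaxOutputChunkSize(
    _stChunkSize, _nvcompBatchedZStdOpts, &max_out_bytes);
// ... allocate one device output buffer of max_out_bytes per chunk ...
```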

However, to try to understand how to reduce the elapsed time (and thus the bottleneck caused by cudaStreamSynchronize), I used Nsight to see what was happening between the GPU and the CPU (as suggested).

[image: Nsight output with kernel launch configuration suggestions]

The first suggestion says:
"Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 1 threads per block.
Consequently, some threads in a warp are masked off and those hardware resources are unused.
Try changing the number of threads per block to be a multiple of 32 threads.
Between 128 and 256 threads per block is a good initial range for experimentation.
Use smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance.
This is particularly beneficial to kernels that frequently call __syncthreads().
See the Hardware Model description for more details on launch configurations."

One of the suggestions seems to be to size the number of threads per block appropriately.
Is it possible to configure nvcomp to use a specific number of threads per block?
Is there any other approach to reduce the waiting time caused by cudaStreamSynchronize?

Thank you

@ndickson-nvidia
Copy link
Contributor

That looks like Nsight Compute rather than Nsight Systems. Sorry, it's a bit confusing that the two programs are named so similarly. Also, if you look at the Duration column, you'll see that the selected function it refers to takes a negligible amount of time, so speeding it up wouldn't gain much overall.

is there any other approach to reduce the waiting time caused by the cudaStreamSynchronize?

The purpose of calling cudaStreamSynchronize is to wait for the GPU stream to finish the kernels queued up in it, so if you don't need the CUDA kernels launched by nvcomp to finish at that point, you don't need to call cudaStreamSynchronize there. Your program could continue running other computations, and call cudaStreamSynchronize when the results of the kernels are needed.
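For example (a minimal sketch with hypothetical helper names, assuming there is independent CPU work available to overlap):

```
// Queue everything asynchronously, do unrelated CPU work, then wait
// only at the point where the compressed output is actually needed.
cudaMemcpyAsync(d_in, h_in, inBytes, cudaMemcpyHostToDevice, stream);
launchCompression(stream);                  // hypothetical: calls nvcompBatched*CompressAsync
cudaMemcpyAsync(h_out, d_out, outBytes, cudaMemcpyDeviceToHost, stream);

doIndependentCpuWork();                     // hypothetical: runs while the GPU works

cudaStreamSynchronize(stream);              // wait only when the results are needed
useCompressedData(h_out);                   // hypothetical consumer
```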

As for things you can do to make the compression kernels run faster, it will depend on the kernel and the data. Trying a different number of chunks to split the input data into may or may not help. LZ4 has a data_type option in its nvcompBatchedLZ4Opts_t structure that can be NVCOMP_TYPE_CHAR (the default), NVCOMP_TYPE_SHORT, or NVCOMP_TYPE_INT, for treating the input data as 1-byte, 2-byte, or 4-byte integers. The equivalent options structure for Deflate has an option that selects among algorithms trading compression quality against throughput.
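For instance (a sketch; the buffer names are placeholders mirroring the batched Zstd call above, and nvcompBatchedLZ4DefaultOpts is assumed as the starting options value):

```
// Sketch: treat the input as 2-byte integers when compressing with LZ4.
nvcompBatchedLZ4Opts_t opts = nvcompBatchedLZ4DefaultOpts;
opts.data_type = NVCOMP_TYPE_SHORT;   // default is NVCOMP_TYPE_CHAR

nvcompBatchedLZ4CompressAsync(
    d_input_chunk_ptrs, d_input_chunk_sizes,
    maxChunkSize, numChunks,
    d_temp, tempBytes,
    d_output_chunk_ptrs, d_output_chunk_sizes,
    opts, stream);
```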

@andreamartini
Author

The purpose of calling cudaStreamSynchronize is to wait for the GPU stream to finish the kernels queued up in it,
so if you don't need the CUDA kernels launched by nvcomp to finish at that point, you don't need to call
cudaStreamSynchronize there. Your program could continue running other computations, and call cudaStreamSynchronize
when the results of the kernels are needed.

I call cudaStreamSynchronize on the specific stream exactly when I need the compressed data. I mean:
First I copy data from host to device using cudaMemcpyAsync on _cudaStream.
Then I compress the data using the nvcomp API, i.e.:

```
nvcompBatchedZstdCompressAsync(
    _pDevice_inputData_as_Chunks_ptrs,
    _pDevice_inputData_chunks_size,
    _stChunkSize,              // the maximum chunk size
    _stNumberOfChunks,
    _pDevice_temp_space_ptr,
    _stTemp_bytes,
    _pDevice_outputCompressed_Chunks_ptrs,
    _pDevice_compressed_chunks_size,
    _nvcompBatchedZStdOpts,
    _cudaStream);
```

Finally, I copy the compressed data from device to host.

To do that, I use the following code snippet:
```
// Compression in action ...
// Invoke the compression routine via the member-function pointer _fpSpecializedCompression
nvcompStatus_t compressionResult = (this->*_fpSpecializedCompression)();

if (compressionResult != nvcompSuccess)
{
    bRet = false;
}
else
{
    _stTotalCompressedSize = 0;

    /* Get compressed data ... */
    /* outputCompressedData is std::vector<std::vector<uint8_t>> */
    outputCompressedData.resize(_stNumberOfChunks);
    /* compressedChunksSize is std::vector<size_t> */
    compressedChunksSize.resize(_stNumberOfChunks);

    /* Get the compressed chunks' sizes */
    gpuErrchk(cudaMallocHost(&_pHost_compressed_chunks_size, sizeof(size_t) * _stNumberOfChunks));
    gpuErrchk(cudaMemcpyAsync(_pHost_compressed_chunks_size, _pDevice_compressed_chunks_size,
                              sizeof(size_t) * _stNumberOfChunks, cudaMemcpyDeviceToHost, _cudaStream));

    // uint8_t** _pHost_readback_outputCompressed_Chunks_ptrs;
    gpuErrchk(cudaMallocHost(&_pHost_readback_outputCompressed_Chunks_ptrs,
                             sizeof(uint8_t*) * _stNumberOfChunks));

    // Copy each compressed chunk from device to host, and then from host to
    // the std::vector used for saving the compressed data to file
    for (size_t i = 0; i < _stNumberOfChunks; i++)
    {
        size_t chunkCompressedSize = _pHost_compressed_chunks_size[i];
        gpuErrchk(cudaMallocHost(&_pHost_readback_outputCompressed_Chunks_ptrs[i],
                                 sizeof(uint8_t) * chunkCompressedSize));
        uint8_t* pHost_CompressedData_dst = _pHost_readback_outputCompressed_Chunks_ptrs[i];
        const uint8_t* pDevice_Compressedsrc = static_cast<uint8_t*>(_pHost_outputCompressed_Chunks_ptrs[i]);
        gpuErrchk(cudaMemcpyAsync(pHost_CompressedData_dst, pDevice_Compressedsrc,
                                  sizeof(uint8_t) * chunkCompressedSize, cudaMemcpyDeviceToHost, _cudaStream));

        // POI 'A'
        /* I need to synchronize here, after the copy of each compressed chunk */
        gpuErrchk(cudaStreamSynchronize(_cudaStream));

        /* Copy the compressed chunk into the vector I use externally for saving data to file */
        compressedChunksSize[i] = chunkCompressedSize;
        outputCompressedData[i].resize(chunkCompressedSize);
        memcpy(outputCompressedData[i].data(), pHost_CompressedData_dst, chunkCompressedSize * sizeof(uint8_t));
        _stTotalCompressedSize += chunkCompressedSize;
        // END POI 'A'
    }

    /*
    // POI 'B'
    gpuErrchk(cudaStreamSynchronize(_cudaStream));
    for (size_t i = 0; i < _stNumberOfChunks; i++)
    {
        size_t chunkCompressedSize = _pHost_compressed_chunks_size[i];
        uint8_t* pHost_CompressedData = _pHost_readback_outputCompressed_Chunks_ptrs[i];
        compressedChunksSize[i] = chunkCompressedSize;
        outputCompressedData[i].resize(chunkCompressedSize);
        memcpy(outputCompressedData[i].data(), pHost_CompressedData, chunkCompressedSize * sizeof(uint8_t));
        _stTotalCompressedSize += chunkCompressedSize;
    }
    // END POI 'B'
    */
}

// I record cudaStartEvent on _cudaStream just before the copy of data from host to device
cudaEventRecord(cudaEndEvent, _cudaStream);

// tStartCompression = std::chrono::steady_clock::now() was called before copying data from host to device
auto           tEndCompression = std::chrono::steady_clock::now();
_fmillisec_    elapsedTime = tEndCompression - tStartCompression;
_milliseconds_ msElapsed = std::chrono::duration_cast<_milliseconds_>(elapsedTime);

compressedReport.stCpuCompressionElapsedTime_ms = msElapsed.count();

float fGpuCompressedElapsedTime_ms = 0.0f;
cudaEventElapsedTime(&fGpuCompressedElapsedTime_ms, cudaStartEvent, cudaEndEvent);
```

In the code above, I marked two points of interest (POI), A and B. Currently I'm using POI A.
With POI A, I call cudaStreamSynchronize _stNumberOfChunks times and I get the correct compressed data, but the elapsed time seems too high (as shown in the images in the previous posts: on average, compressing 4 MB takes 12 to 17 ms). If I comment out the POI A block and uncomment the POI B block, so that cudaStreamSynchronize is called only once, pHost_CompressedData is null: the data seems not to be ready. So I have to use the POI A block.

If I am making conceptual errors, could you please suggest the right approach?

Moreover, could 12 ms be a correct time for the compression of a 4 MB file? With a GeForce GTX 1650 card, what might a reference time be? Finally, is the time measured across the host-to-device transfer, the compression, and the device-to-host transfer a valid indicator of compression performance?

Thank you.


github-actions bot commented Aug 7, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@andreamartini
Author

Dear all, are there any suggestions?


github-actions bot commented Sep 6, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@naveenaero
Collaborator

Hi @andreamartini
Apologies for the delayed response.
The code snippet you shared is functionally correct. As @ndickson-nvidia mentioned, the cudaStreamSynchronize on _cudaStream is meant to make sure that the GPU kernel has finished compressing the data and that the compressed data has finished copying from the device back to the host. Only in the first iteration of the loop does the host wait for the GPU compression to finish and the first chunk to be copied from device to host; from the second iteration onwards, the synchronization only waits for the chunk copy.
I do see that you have a memcpy inside the for-loop, which could affect the overall time you are measuring.
To measure compression performance you should ideally measure only the compression time, not the H2D and D2H transfer times. You can also try running the benchmarks we have here: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/nvCOMP
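For example, compression-only timing with CUDA events could look like this (a minimal sketch; the buffer setup and compress arguments are omitted):

```
// Sketch: time only the compression kernel with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, _cudaStream);      // just before the compress call
nvcompBatchedZstdCompressAsync(/* ... same arguments as above ... */, _cudaStream);
cudaEventRecord(stop, _cudaStream);       // just after the compress call

cudaEventSynchronize(stop);               // wait for the kernel to finish
float compressMs = 0.0f;
cudaEventElapsedTime(&compressMs, start, stop);
// compressMs now excludes the H2D and D2H transfer times

cudaEventDestroy(start);
cudaEventDestroy(stop);
```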

