Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Is it available BITCOMP decompression for cpu? #104

Open
andreamartini opened this issue Jun 15, 2024 · 7 comments
Open

[QST] Is it available BITCOMP decompression for cpu? #104

andreamartini opened this issue Jun 15, 2024 · 7 comments
Labels
inactive-30d question Further information is requested

Comments

@andreamartini
Copy link

In the same way of deflate, zstandard, lz4 decompression, with which it is possible to decompress on cpu something compressed via GPU and nvcomp (i.e. using low level approach), i am interested in BITCOMP cpu decompression. That is, compress on GPU using nvcomp and BITCOMP, and then decompress on CPU (using BITCOMP or, maybe, other algorithm). If yes, is there some links where i can take a look?
Thank you.

@andreamartini andreamartini added ? - Needs Triage question Further information is requested labels Jun 15, 2024
@gthomascollignon
Copy link
Contributor

Andrea,
yes you can use bitcomp on CPU, but you'll have to use the bitcomp native API as nvcomp doesn't expose this feature (and the lossy FP compression). The native API might also save some overhead and be a little faster.

Bitcomp's native API works with a handle, the idea is to create a plan (data type, size, algorithm) and then this plan can be use over and over to compress or decompress data of the same size and type, on GPU (completely asynchronously, all the info needed to launch kernels is in the handle) or on CPU.
The CPU API is single threaded so it's going to be slow, but decompression can be easily parallelized by using the partial decompression. For example, a 1GB buffer can be parallelized on 10 threads, each thread performing a partial decompression of a 100MB chunk, with you choice of parallelism (OpenMP, pthreads, ...)

Check the API in native/bitcomp.h.

Here is an example of mixed compression and decompression on GPU and CPU using the native API.
(BTCHK and CUCHK are just macros to check the returned values for Bitcomp and CUDA calls)

    // Including the native bitcomp include file instead of nvcomp's
    #include <native/bitcomp.h>
    <...>

     // All buffers allocated as managed memory to be visible on CPU and GPU
    CUCHK(cudaMallocManaged((void **)&data, size));
    CUCHK(cudaMallocManaged((void **)&decomp, size));
    <...> // Input data init

    // Creating a bitcomp handle to compress floating point data
    // For lossless compression, we treat the FP32 data as unsigned int (same 32b size)
    bitcompHandle_t handle;
    BTCHK(bitcompCreatePlan(&handle,
                            size,
                            BITCOMP_UNSIGNED_32BIT,
                            BITCOMP_LOSSLESS,
                            BITCOMP_DEFAULT_ALGO));

    // Query the maximum output size to allocate the compressed buffer
    size_t lcomp = bitcompMaxBuflen(size);
    printf("Max compressed buffer size (%lu) = %lu\n", size, lcomp);
    void *output;
    CUCHK(cudaMallocManaged(&output, lcomp));

    // Launch the compression (asynchronous) on GPU
    // The handle could be associated with another stream if needed.
    BTCHK(bitcompCompressLossless(handle, data, output));

    // Wait for the compression to finish before querying the compressed size
    CUCHK(cudaDeviceSynchronize());

    size_t compsize;
    BTCHK (bitcompGetCompressedSize (output, &compsize));
    float ratio = (float)size / (float)compsize;
    printf ("Compressed (GPU) to %lu bytes, ratio = %.1f\n", compsize, ratio);

    // Decompress the data on CPU
    BTCHK(bitcompHostUncompress(handle, output, decomp));

    // Decompressed data should be the same as the original (lossless)
    for (int i=0; i<n; i++)
        if (decomp[i] != data[i])
        {
            printf ("Error, decompressed data does not match\n");
            return -1;
        }
    printf ("Decompressed on CPU, matches original data\n");

    // Compress the data on CPU
    BTCHK(bitcompHostCompressLossless(handle, data, output));
    BTCHK (bitcompGetCompressedSize (output, &compsize));
    ratio = (float)size / (float)compsize;
    printf ("Compressed (CPU) to %lu bytes, ratio = %.1f\n", compsize, ratio);

    // Decompress the data on GPU (asynchronous)
    BTCHK(bitcompUncompress(handle, output, decomp));
    
    // Wait for the decompression to finish
    CUCHK(cudaDeviceSynchronize());
    

Managed memory is used here to make the buffers visible on CPU and GPU, just to mix CPU and GPU for the example.

Bitcomp can compress over PCIe (or nvlink) as well, for example with the uncompressed data in GPU memory, and the output buffer in host pinned memory. It works well if the compression is high enough for the PCIe bus to not be a bottleneck.
If you want to do that, you should also call this to speed up the compression (it allocates a tiny bit of memory locally on the device to cache some metadata of the output buffer. The handle becomes "optimized" for the current device and can't be used for several compression in parallel in different streams, but that doesn't apply to CPU calls, they won't even use this cache).

   * @brief Turn on compression acceleration when the compressed output is not in the global memory
   * of the device running the compression (e.g. host pinned memory, or another device's memory)
   * This is optional and only affects the performance.
   * NOTE: This makes the handle become device-specific. A plan that has this acceleration turned on
   * should always be used on the same device.
   * 
   * @param handle (input) Bitcomp handle
   * @return Returns BITCOMP_SUCCESS if successful, or an error
   */
  bitcompResult_t bitcompAccelerateRemoteCompression(bitcompHandle_t handle);

@andreamartini
Copy link
Author

Hi gthomascollignon, thank you for your reply. Where i can find native bitcomp repository? I'm having some trouble finding it through google engine.

@andreamartini
Copy link
Author

I found https://github.com/boyuanzhang62/bitcomp_lossless_example/blob/master/bitcomp_example.cu
I suppose it needs cuda and graphic board cuda compliant. Isn't true?

Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@naveenaero
Copy link
Collaborator

Hi @andreamartini
You can find the include/nvcomp/native/bitcomp.h header in the nvCOMP v4.0.1 binaries package located here. Please let me know how it goes.

Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inactive-30d question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants