+++
title = "CUDA.jl 5.1: Unified memory and cooperative groups"
author = "Tim Besard"
external = true
abstract = """
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit:
unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative
groups, which offer a more modular approach to kernel programming."""
+++

{{abstract}}

{{ redirect "https://info.juliahub.com/cuda-jl-5-1-unified-memory" }}

<!--
# CUDA.jl 5.1: Unified memory and cooperative groups

CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified
memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which
offer a more modular approach to kernel programming.

## Unified memory

Unified memory is a feature of CUDA that allows the programmer to **access memory from both
the CPU and GPU**, relying on the driver to move data between the two. This can be useful
for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU
has available, or to incrementally port code to the GPU while still running parts of the
application on the CPU.

CUDA.jl already supported unified memory, but only for the most basic use cases. With
CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that
memory from the CPU:
```julia-repl
julia> gpu = cu([1., 2.]; unified=true)
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 1.0
 2.0

julia> # accessing GPU memory from the CPU
       gpu[1] = 3;

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 3.0
 2.0
```

Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is **safe and
efficient to perform scalar iteration on `CuArray`s backed by unified memory**. This greatly
simplifies porting applications to the GPU, as it is no longer a problem when code uses
`AbstractArray` fallbacks from Base that process arrays element by element.
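
As a small illustration, here is a hypothetical snippet (not taken from the release itself)
showing an element-by-element loop over a unified-memory array running on the CPU, without
the usual scalar-indexing error:

```julia
using CUDA

# hypothetical example: element-by-element access to a unified-memory CuArray,
# served from the CPU without an explicit copy
a = cu(Float32[1, 2, 3, 4]; unified=true)

function cpu_total(xs)
    total = zero(eltype(xs))
    for i in eachindex(xs)
        total += xs[i]   # scalar indexing; safe for unified memory in CUDA.jl 5.1
    end
    return total
end

cpu_total(a)   # 10.0f0
```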

In addition, CUDA.jl 5.1 makes it **easier to convert `CuArray`s to `Array` objects**.
This is important when you want to use high-performance CPU libraries like BLAS or LAPACK
that do not support `CuArray`s:
```julia-repl
julia> cpu = unsafe_wrap(Array, gpu)
2-element Vector{Float32}:
 3.0
 2.0

julia> LinearAlgebra.BLAS.scal!(2f0, cpu);

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 6.0
 4.0
```

The reverse is also possible: CPU-based `Array`s can now trivially be converted to `CuArray`
objects for use on the GPU, **without the need to explicitly allocate unified memory**. This
further simplifies memory management, as it makes it possible to use the GPU inside an
existing application without having to copy data into a `CuArray`:
```julia-repl
julia> cpu = [1, 2];

julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1
 2

julia> CUDA.@sync gpu .+= 1;

julia> cpu
2-element Vector{Int64}:
 2
 3
```

Note that the above methods are prefixed `unsafe` because they require **careful
management of object lifetimes**: when creating an `Array` from a `CuArray`, the `CuArray`
must be kept alive for as long as the `Array` is used, and vice versa when creating a
`CuArray` from an `Array`. Explicit synchronization (i.e. waiting for the GPU to finish
computing) is also required, as CUDA.jl cannot synchronize automatically when accessing GPU
memory through a CPU pointer.
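
For instance, here is a hypothetical sketch of one way to address both concerns, by rooting
the originating `CuArray` with `GC.@preserve` and synchronizing before touching the data
from the CPU:

```julia
using CUDA

gpu = cu([1f0, 2f0]; unified=true)
cpu = unsafe_wrap(Array, gpu)

# hypothetical sketch: keep `gpu` alive while its CPU view is in use
GC.@preserve gpu begin
    CUDA.synchronize()   # wait for pending GPU work on this memory
    cpu .*= 2            # now safe to access through the CPU view
end
```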

For now, CUDA.jl still defaults to device memory for allocations that do not specify a
memory type. This can be changed using the `default_memory`
[preference](https://github.com/JuliaPackaging/Preferences.jl) of the CUDA.jl package, which
can be set to either `"device"`, `"unified"`, or `"host"`. When these changes have been
sufficiently tested, and the remaining rough edges have been smoothed out, we may consider
switching the default allocator.
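
For example, one way to set this preference persistently is through Preferences.jl, which the
preference mechanism is built on (a sketch; the exact workflow may differ):

```julia
using CUDA, Preferences

# hypothetical snippet: record the preference in LocalPreferences.toml;
# it takes effect the next time CUDA.jl is loaded
set_preferences!(CUDA, "default_memory" => "unified")
```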

## Cooperative groups

Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA
cooperative groups API. Cooperative groups are a low-level feature of CUDA that makes it
possible to **write kernels that are more flexible than the traditional approach** of
differentiating computations based on thread and block indices. Instead, cooperative groups
allow the programmer to use objects representing groups of threads, pass those around, and
differentiate computations based on queries on those objects.

For example, let's port the example from the [introductory NVIDIA blog
post](https://developer.nvidia.com/blog/cooperative-groups/), which provides a function to
compute the sum of an array in parallel:
```julia
function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each thread adds its partial sum[i] to sum[lane+i]
    i = CG.num_threads(group) ÷ 2
    while i > 0
        temp[lane] = val
        CG.sync(group)
        if lane <= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val # note: only thread 1 will return full sum
end
```
When the threads of a group call this function, they cooperatively compute the sum of the
values passed by each thread in the group. For example, let's write a kernel that calls this
function using a group representing the current thread block:
```julia
function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)

    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= length(input)
        sum += input[i]
        i += stride
    end

    return sum
end

n = 1<<24
threads = 256
blocks = cld(n, threads)

data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)

@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)
```
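
As a quick sanity check (a hypothetical addition, not part of the example above), the
single-element result can be compared against a CPU reduction:

```julia
# hypothetical check: copy the one-element result back and compare with a plain CPU sum;
# small floating-point differences are expected from the different summation orders
gpu_result = Array(sum)[]
cpu_result = Base.sum(Array(data))   # qualify with Base, since `sum` is shadowed above
isapprox(gpu_result, cpu_result; rtol=1e-3)
```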

This style of programming makes it possible to write kernels that are safer and more modular
than traditional kernels. Some CUDA features also require the use of cooperative groups; for
example, asynchronous memory copies between global and shared memory are done using the
`CG.memcpy_async` function.

With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support
has been added for implicit groups (with the exception of cluster groups and the deprecated
multi-grid groups), all relevant queries on these groups, as well as many important
collective functions, such as `shuffle`, `vote`, and `memcpy_async`. Support for explicit
groups is still missing, as are collectives like `reduce` and `invoke`. For more
information, refer to [the CUDA.jl
documentation](https://cuda.juliagpu.org/dev/development/kernel/#Cooperative-groups).

## Other updates

Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and
improvements:

- Support for CUDA 12.3
- Performance improvements related to memory copies, which had regressed in CUDA.jl 5.0
- Improvements to the native profiler (`CUDA.@profile`), which now also shows local memory
  usage, supports more NVTX metadata, and has better support for Pluto.jl and Jupyter (a
  minimal invocation is sketched after this list)
- Many CUSOLVER and CUSPARSE improvements by [@amontoison](https://github.com/amontoison)
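
For reference, a hypothetical snippet showing a minimal use of the profiler (the printed
report is omitted here):

```julia
using CUDA

a = CUDA.rand(1024)

# profile a simple broadcast; displays a table of host and device activity in the REPL
CUDA.@profile a .+ 1
```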
-->
