+++
title = "CUDA.jl 5.1: Unified memory and cooperative groups"
author = "Tim Besard"
external = true
abstract = """
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit:
unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative
groups, which offer a more modular approach to kernel programming."""
+++

{{abstract}}

{{ redirect "https://info.juliahub.com/cuda-jl-5-1-unified-memory" }}

<!--
# CUDA.jl 5.1: Unified memory and cooperative groups

CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified
memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which
offer a more modular approach to kernel programming.

## Unified memory

Unified memory is a feature of CUDA that allows the programmer to **access memory from both
the CPU and GPU**, relying on the driver to move data between the two. This can be useful
for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU
has available, or to incrementally port code to the GPU while still running parts of the
application on the CPU.

CUDA.jl already supported unified memory, but only for the most basic use cases. With
CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that
memory from the CPU:
```julia-repl
julia> gpu = cu([1., 2.]; unified=true)
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 1.0
 2.0

julia> # accessing GPU memory from the CPU
       gpu[1] = 3;

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 3.0
 2.0
```

Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is **safe and
efficient to perform scalar iteration on `CuArray`s backed by unified memory**. This greatly
simplifies porting applications to the GPU, as it is no longer a problem when code uses
`AbstractArray` fallbacks from Base that process arrays element by element.
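
As a small illustration, here is a hypothetical snippet (not taken from the release itself)
showing an element-by-element loop over a unified-memory array running on the CPU, without
the usual scalar-indexing error:

```julia
using CUDA

# hypothetical example: element-by-element access to a unified-memory CuArray,
# served from the CPU without an explicit copy
a = cu(Float32[1, 2, 3, 4]; unified=true)

function cpu_total(xs)
    total = zero(eltype(xs))
    for i in eachindex(xs)
        total += xs[i]   # scalar indexing; safe for unified memory in CUDA.jl 5.1
    end
    return total
end

cpu_total(a)   # 10.0f0
```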

In addition, CUDA.jl 5.1 makes it **easier to convert `CuArray`s to `Array` objects**.
This is important when you want to use high-performance CPU libraries like BLAS or LAPACK
that do not support `CuArray`s:
```julia-repl
julia> cpu = unsafe_wrap(Array, gpu)
2-element Vector{Float32}:
 3.0
 2.0

julia> LinearAlgebra.BLAS.scal!(2f0, cpu);

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 6.0
 4.0
```

The reverse is also possible: CPU-based `Array`s can now trivially be converted to `CuArray`
objects for use on the GPU, **without the need to explicitly allocate unified memory**. This
further simplifies memory management, as it makes it possible to use the GPU inside an
existing application without having to copy data into a `CuArray`:
```julia-repl
julia> cpu = [1, 2];

julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1
 2

julia> CUDA.@sync gpu .+= 1;

julia> cpu
2-element Vector{Int64}:
 2
 3
```

Note that the above methods are prefixed `unsafe` because they require **careful
management of object lifetimes**: when creating an `Array` from a `CuArray`, the `CuArray`
must be kept alive for as long as the `Array` is used, and vice versa when creating a
`CuArray` from an `Array`. Explicit synchronization (i.e. waiting for the GPU to finish
computing) is also required, as CUDA.jl cannot synchronize automatically when accessing GPU
memory through a CPU pointer.
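
For instance, here is a hypothetical sketch of one way to address both concerns, by rooting
the originating `CuArray` with `GC.@preserve` and synchronizing before touching the data
from the CPU:

```julia
using CUDA

gpu = cu([1f0, 2f0]; unified=true)
cpu = unsafe_wrap(Array, gpu)

# hypothetical sketch: keep `gpu` alive while its CPU view is in use
GC.@preserve gpu begin
    CUDA.synchronize()   # wait for pending GPU work on this memory
    cpu .*= 2            # now safe to access through the CPU view
end
```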

For now, CUDA.jl still defaults to device memory for allocations that do not specify a
memory type. This can be changed using the `default_memory`
[preference](https://github.com/JuliaPackaging/Preferences.jl) of the CUDA.jl package, which
can be set to either `"device"`, `"unified"`, or `"host"`. When these changes have been
sufficiently tested, and the remaining rough edges have been smoothed out, we may consider
switching the default allocator.
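
For example, one way to set this preference persistently is through Preferences.jl, which the
preference mechanism is built on (a sketch; the exact workflow may differ):

```julia
using CUDA, Preferences

# hypothetical snippet: record the preference in LocalPreferences.toml;
# it takes effect the next time CUDA.jl is loaded
set_preferences!(CUDA, "default_memory" => "unified")
```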

## Cooperative groups

Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA
cooperative groups API. Cooperative groups are a low-level feature of CUDA that makes it
possible to **write kernels that are more flexible than the traditional approach** of
differentiating computations based on thread and block indices. Instead, cooperative groups
allow the programmer to use objects representing groups of threads, pass those around, and
differentiate computations based on queries on those objects.

For example, let's port the example from the [introductory NVIDIA blog
post](https://developer.nvidia.com/blog/cooperative-groups/), which provides a function to
compute the sum of an array in parallel:
```julia
function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each thread adds its partial sum[i] to sum[lane+i]
    i = CG.num_threads(group) ÷ 2
    while i > 0
        temp[lane] = val
        CG.sync(group)
        if lane <= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val # note: only thread 1 will return full sum
end
```
When the threads of a group call this function, they cooperatively compute the sum of the
values passed by each thread in the group. For example, let's write a kernel that calls this
function using a group representing the current thread block:
```julia
function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)

    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= length(input)
        sum += input[i]
        i += stride
    end

    return sum
end

n = 1<<24
threads = 256
blocks = cld(n, threads)

data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)

@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)
```
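
As a quick sanity check (a hypothetical addition, not part of the example above), the
single-element result can be compared against a CPU reduction:

```julia
# hypothetical check: copy the one-element result back and compare with a plain CPU sum;
# small floating-point differences are expected from the different summation orders
gpu_result = Array(sum)[]
cpu_result = Base.sum(Array(data))   # qualify with Base, since `sum` is shadowed above
isapprox(gpu_result, cpu_result; rtol=1e-3)
```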

This style of programming makes it possible to write kernels that are safer and more modular
than traditional kernels. Some CUDA features also require the use of cooperative groups; for
example, asynchronous memory copies between global and shared memory are done using the
`CG.memcpy_async` function.

With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support
has been added for implicit groups (with the exception of cluster groups and the deprecated
multi-grid groups), all relevant queries on these groups, as well as many important
collective functions, such as `shuffle`, `vote`, and `memcpy_async`. Support for explicit
groups is still missing, as are collectives like `reduce` and `invoke`. For more
information, refer to [the CUDA.jl
documentation](https://cuda.juliagpu.org/dev/development/kernel/#Cooperative-groups).

## Other updates

Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and
improvements:

- Support for CUDA 12.3
- Performance improvements related to memory copies, which had regressed in CUDA.jl 5.0
- Improvements to the native profiler (`CUDA.@profile`), which now also shows local memory
  usage, supports more NVTX metadata, and has better support for Pluto.jl and Jupyter (a
  minimal invocation is sketched after this list)
- Many CUSOLVER and CUSPARSE improvements by [@amontoison](https://github.com/amontoison)
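
For reference, a hypothetical snippet showing a minimal use of the profiler (the printed
report is omitted here):

```julia
using CUDA

a = CUDA.rand(1024)

# profile a simple broadcast; displays a table of host and device activity in the REPL
CUDA.@profile a .+ 1
```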
-->
