Showing 1 changed file with 210 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,210 @@ | ||
+++ | ||
title = "CUDA.jl 5.1: Unified memory and cooperative groups" | ||
author = "Tim Besard" | ||
external = true | ||
abstract = """ | ||
CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: | ||
unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative | ||
groups, which offer a more modular approach to kernel programming.""" | ||
+++ | ||
|
||
{{abstract}} | ||
|
||
{{ redirect "https://info.juliahub.com/cuda-jl-5-1-unified-memory" }} | ||
|
||
<!-- | ||
# CUDA.jl 5.1: Unified memory and cooperative groups | ||
CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified | ||
memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which | ||
offer a more modular approach to kernel programming. | ||
## Unified memory | ||
Unified memory is a feature of CUDA that allows the programmer to **access memory from both | ||
the CPU and GPU**, relying on the driver to move data between the two. This can be useful | ||
for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU | ||
has available, or to be able to incrementally port code to the GPU and still have parts of | ||
the application run on the CPU. | ||
CUDA.jl already supported unified memory, but only for the most basic use cases. With | ||
CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that | ||
memory from the CPU: | ||
```julia-repl | ||
julia> gpu = cu([1., 2.]; unified=true) | ||
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}: | ||
1.0 | ||
2.0 | ||
julia> # accessing GPU memory from the CPU | ||
gpu[1] = 3; | ||
julia> gpu | ||
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}: | ||
3.0 | ||
2.0 | ||
``` | ||
Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is **safe and | ||
efficient to perform scalar iteration on `CuArray`s backed by unified memory**. This greatly | ||
simplifies porting applications to the GPU, as it is no longer a problem when code uses | ||
`AbstractArray` fallbacks from Base that process arrays element by element. | ||
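As a minimal sketch of what this enables, consider a generic routine that walks an array one element at a time. The helper `running_max` below is hypothetical and only stands in for CPU-oriented code of that kind; the sketch assumes a functional CUDA GPU and uses the `unified=true` keyword of `cu` shown above.

```julia
using CUDA

# Hypothetical helper: a generic, element-by-element routine of the kind
# commonly found in CPU-oriented code. Nothing here is GPU-specific.
function running_max(xs::AbstractVector)
    best = first(xs)
    for x in xs              # scalar iteration over the array
        best = max(best, x)
    end
    return best
end

# With unified memory, the same routine can be called on a CuArray.
gpu = cu([3f0, 1f0, 4f0, 1f0, 5f0]; unified=true)
result = running_max(gpu)
```

Such code runs on the CPU and reads the unified buffer directly, so it is convenient for porting but is not a substitute for a real GPU kernel when performance matters.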
CUDA.jl 5.1 also makes it **easier to convert `CuArray`s to `Array` objects**. | ||
This matters when you want to use high-performance CPU libraries like BLAS or LAPACK, | ||
which do not support `CuArray`s: | ||
```julia-repl | ||
julia> cpu = unsafe_wrap(Array, gpu) | ||
2-element Vector{Float32}: | ||
3.0 | ||
2.0 | ||
julia> LinearAlgebra.BLAS.scal!(2f0, cpu); | ||
julia> gpu | ||
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}: | ||
6.0 | ||
4.0 | ||
``` | ||
The reverse is also possible: CPU-based `Array`s can now trivially be converted to `CuArray` | ||
objects for use on the GPU, **without the need to explicitly allocate unified memory**. This | ||
further simplifies memory management, as it makes it possible to use the GPU inside of an | ||
existing application without having to copy data into a `CuArray`: | ||
```julia-repl | ||
julia> cpu = [1, 2]; | ||
julia> gpu = unsafe_wrap(CuArray, cpu) | ||
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}: | ||
1 | ||
2 | ||
julia> CUDA.@sync gpu .+= 1; | ||
julia> cpu | ||
2-element Vector{Int64}: | ||
2 | ||
3 | ||
``` | ||
Note that the above methods are prefixed with `unsafe` because they require **careful | ||
management of object lifetimes**: when creating an `Array` from a `CuArray`, the `CuArray` | ||
must be kept alive for as long as the `Array` is used, and vice versa when creating a | ||
`CuArray` from an `Array`. Explicit synchronization (i.e., waiting for the GPU to finish | ||
computing) is also required, as CUDA.jl cannot synchronize automatically when GPU memory is | ||
accessed through a CPU pointer. | ||
For now, CUDA.jl still defaults to device memory for unspecified allocations. This can be | ||
changed using the `default_memory` | ||
[preference](https://github.com/JuliaPackaging/Preferences.jl) of the CUDA.jl module, which | ||
can be set to `"device"`, `"unified"`, or `"host"`. When these changes have been | ||
sufficiently tested, and the remaining rough edges have been smoothed out, we may consider | ||
switching the default allocator. | ||
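A minimal sketch of setting this preference from Julia, using `set_preferences!` from Preferences.jl. It assumes that CUDA.jl is part of the active project and that the preference is read when the package is loaded, so Julia typically needs to be restarted before the change has any effect.

```julia
using Preferences, CUDA

# Writes the preference to the LocalPreferences.toml of the active project.
# `force=true` overwrites a previously stored value.
set_preferences!(CUDA, "default_memory" => "unified"; force=true)

# Read the stored value back (this does not change the running session).
stored = load_preference(CUDA, "default_memory")
```

After restarting Julia, allocations that do not specify a memory type should then use unified memory.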
## Cooperative groups | ||
Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA | ||
cooperative groups API. Cooperative groups are a low-level feature of CUDA that makes it | ||
possible to **write kernels that are more flexible than the traditional approach** of | ||
differentiating computations based on thread and block indices. Instead, cooperative groups | ||
allow the programmer to use objects representing groups of threads, pass those around, and | ||
differentiate computations based on queries on those objects. | ||
For example, let's port the example from the [introductory NVIDIA blog | ||
post](https://developer.nvidia.com/blog/cooperative-groups/), which provides a function to | ||
compute the sum of an array in parallel: | ||
```julia | ||
using CUDA | ||
using CUDA: CG  # CUDA.jl's cooperative groups module | ||
function reduce_sum(group, temp, val) | ||
    lane = CG.thread_rank(group) | ||
    # Each iteration halves the number of active threads | ||
    # Each thread adds its partial sum[i] to sum[lane+i] | ||
    i = CG.num_threads(group) ÷ 2 | ||
    while i > 0 | ||
        temp[lane] = val | ||
        CG.sync(group) | ||
        if lane <= i | ||
            val += temp[lane + i] | ||
        end | ||
        CG.sync(group) | ||
        i ÷= 2 | ||
    end | ||
    return val # note: only thread 1 will return full sum | ||
end | ||
``` | ||
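To see what each iteration does, the same halving scheme can be simulated on the CPU. The sketch below is illustrative only: `simulate_reduce_sum` is a hypothetical helper that is not part of CUDA.jl, with a plain vector standing in for shared memory and a loop over lanes standing in for the threads of a group. Note that the halving scheme, as written, only covers every element when the group size is a power of two.

```julia
# CPU-only simulation of the tree reduction performed by `reduce_sum`.
# `simulate_reduce_sum` is a hypothetical helper, not part of CUDA.jl.
function simulate_reduce_sum(vals::Vector{T}) where T
    n = length(vals)
    @assert ispow2(n) "the halving scheme assumes a power-of-two group size"
    val = copy(vals)        # one running value per simulated thread
    temp = similar(vals)    # stands in for the shared-memory buffer
    i = n ÷ 2
    while i > 0
        temp .= val         # every lane stores its value (then "syncs")
        for lane in 1:i     # only the first i lanes accumulate
            val[lane] += temp[lane + i]
        end
        i ÷= 2
    end
    return val              # val[1] holds the full sum, as for thread 1
end

partial = simulate_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
total = partial[1]          # equals sum(1:8)
```

Only the first entry holds the complete sum; the other entries hold partial sums, which mirrors the note in the kernel that only thread 1 returns the full result.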
When the threads of a group call this function, they cooperatively compute the sum of the | ||
values passed by each thread in the group. For example, let's write a kernel that calls this | ||
function using a group representing the current thread block: | ||
```julia | ||
function sum_kernel_block(sum::AbstractArray{T}, | ||
                          input::AbstractArray{T}) where T | ||
    # have each thread compute a partial sum | ||
    my_sum = thread_sum(input) | ||
    # perform a cooperative summation | ||
    temp = CuStaticSharedArray(T, 256) | ||
    g = CG.this_thread_block() | ||
    block_sum = reduce_sum(g, temp, my_sum) | ||
    # combine the block sums | ||
    if CG.thread_rank(g) == 1 | ||
        CUDA.@atomic sum[] += block_sum | ||
    end | ||
    return | ||
end | ||
function thread_sum(input::AbstractArray{T}) where T | ||
    sum = zero(T) | ||
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x | ||
    stride = blockDim().x * gridDim().x | ||
    while i <= length(input) | ||
        sum += input[i] | ||
        i += stride | ||
    end | ||
    return sum | ||
end | ||
n = 1<<24 | ||
threads = 256 | ||
blocks = cld(n, threads) | ||
data = CUDA.rand(n) | ||
sum = CUDA.fill(zero(eltype(data)), 1) | ||
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data) | ||
``` | ||
This style of programming makes it possible to write kernels that are safer and more modular | ||
than traditional kernels. Some CUDA features also require the use of cooperative groups; for | ||
example, asynchronous memory copies between global and shared memory are done using the | ||
`CG.memcpy_async` function. | ||
With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support | ||
has been added for implicit groups (with the exception of cluster groups and the deprecated | ||
multi-grid groups), all relevant queries on these groups, and many important | ||
collective functions, such as `shuffle`, `vote`, and `memcpy_async`. Support for explicit | ||
groups is still missing, as are collectives like `reduce` and `invoke`. For more | ||
information, refer to [the CUDA.jl | ||
documentation](https://cuda.juliagpu.org/dev/development/kernel/#Cooperative-groups). | ||
## Other updates | ||
Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements: | ||
- Support for CUDA 12.3 | ||
- Performance improvements related to memory copies, which regressed in CUDA.jl 5.0 | ||
- Improvements to the native profiler (`CUDA.@profile`), now also showing local memory | ||
usage, supporting more NVTX metadata, and with better support for Pluto.jl and Jupyter | ||
- Many CUSOLVER and CUSPARSE improvements by [@amontoison](https://github.com/amontoison) | ||
--> |