Add blog post on CUDA.jl 5.2 and 5.3. (#42)
maleadt authored Apr 26, 2024
1 parent 0f3b777 commit 95dcd9c
Showing 2 changed files with 213 additions and 5 deletions.
5 changes: 0 additions & 5 deletions post/2023-11-07-cuda_5.1.md
@@ -10,11 +10,6 @@ abstract = """

{{abstract}}

-# CUDA.jl 5.1: Unified memory and cooperative groups
-
-CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified
-memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which
-offer a more modular approach to kernel programming.

## Unified memory

213 changes: 213 additions & 0 deletions post/2024-04-26-cuda_5.2_5.3.md
@@ -0,0 +1,213 @@
+++
title = "CUDA.jl 5.2 and 5.3: Maintenance releases"
author = "Tim Besard"
external = true
abstract = """
CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug
fixes and minor improvements, but also come with a number of interesting new
features. This blog post summarizes the changes in these releases."""
+++

{{abstract}}


## Profiler improvements

CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia
GPU applications without having to use Nsight Systems or other external tools.
The tool has seen continued development, mostly improving its robustness, but
CUDA.jl now also provides `CUDA.@bprofile`, which runs your application
multiple times and reports on the time distribution of individual events:

```julia-repl
julia> CUDA.@bprofile CuArray([1]) .+ 1
Profiler ran for 1.0 s, capturing 1427349 events.
Host-side activity: calling CUDA APIs took 792.95 ms (79.29% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│ 19.27% │ 192.67 ms │ 109796 │ 1.75 µs ± 10.19 ( 0.95 ‥ 1279.83) │ cuMemAllocFromPoolAsync │
│ 17.08% │ 170.8 ms │ 54898 │ 3.11 µs ± 0.27 ( 2.15 ‥ 23.84) │ cuLaunchKernel │
│ 16.77% │ 167.67 ms │ 54898 │ 3.05 µs ± 0.24 ( 0.48 ‥ 16.69) │ cuCtxSynchronize │
│ 14.11% │ 141.12 ms │ 54898 │ 2.57 µs ± 0.79 ( 1.67 ‥ 70.57) │ cuMemcpyHtoDAsync │
│ 1.70% │ 17.04 ms │ 54898 │ 310.36 ns ± 132.89 (238.42 ‥ 5483.63) │ cuStreamSynchronize │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘
Device-side activity: GPU was busy for 87.38 ms (8.74% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│ 6.66% │ 66.61 ms │ 54898 │ 1.21 µs ± 0.16 ( 0.95 ‥ 1.67) │ kernel │
│ 2.08% │ 20.77 ms │ 54898 │ 378.42 ns ± 147.66 (238.42 ‥ 1192.09) │ [copy to device] │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘
NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│ 98.99% │ 989.94 ms │ 54898 │ 18.03 µs ± 49.88 ( 15.26 ‥ 10731.22) │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘
```

By default, `CUDA.@bprofile` runs the application for 1 second, but this can be
adjusted using the `time` keyword argument.
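
For example, to profile for a tenth of a second instead (a minimal sketch,
assuming the `time=` option is passed before the expression, as with other
profiler macro options):

```julia
using CUDA

# capture events for roughly 0.1 s instead of the default 1 s
CUDA.@bprofile time=0.1 CuArray([1]) .+ 1
```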

Display of the time distribution isn't limited to `CUDA.@bprofile`, and will
also be used by `CUDA.@profile` when any operation is called more than once. For
example, with the broadcasting example from above we allocate both the input
`CuArray` and the broadcast result, which results in two calls to the allocator:

```julia-repl
julia> CUDA.@profile CuArray([1]) .+ 1
Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│ 99.92% │ 99.42 ms │ 1 │ │ cuMemcpyHtoDAsync │
│ 0.02% │ 21.22 µs │ 2 │ 10.61 µs ± 6.57 ( 5.96 ‥ 15.26) │ cuMemAllocFromPoolAsync │
│ 0.02% │ 17.88 µs │ 1 │ │ cuLaunchKernel │
│ 0.00% │ 953.67 ns │ 1 │ │ cuStreamSynchronize │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘
```


## Kernel launch debugging

A common issue with CUDA programming is that kernel launches may fail when they
exhaust certain resources, such as shared memory or registers. This typically
results in a cryptic error message, but CUDA.jl will now try to diagnose launch
failures and provide a more helpful error message, as suggested by
[@simonbyrne](https://github.com/simonbyrne).

For example, when using more parameter memory than allowed by the architecture:

```julia-repl
julia> kernel(x) = nothing
julia> @cuda kernel(ntuple(_->UInt64(1), 2^13))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.
```

Or when using an invalid launch configuration, violating a device limit:

```julia-repl
julia> @cuda threads=2000 identity(nothing)
ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
```

We also diagnose launch failures that involve kernel-specific limits, such as
exceeding the number of threads that are allowed in a block (e.g., because of
register use):

```julia-repl
julia> @cuda threads=1024 heavy_kernel()
ERROR: Number of threads per block exceeds kernel limit (1024 > 512).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
```
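
If you run into these limits, you can also inspect a kernel's launch limits up
front. The sketch below compiles a kernel without launching it and queries an
occupancy-derived configuration using existing CUDA.jl APIs (`@cuda launch=false`
and `launch_configuration`); `my_kernel` is just a placeholder:

```julia
using CUDA

my_kernel() = nothing   # placeholder; imagine a register-heavy kernel here

# compile the kernel without launching it
k = @cuda launch=false my_kernel()

# query a launch configuration that respects the kernel's own limits
config = launch_configuration(k.fun)
@show config.threads config.blocks

# launch within those limits
k(; threads=config.threads)
```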


## Sorting improvements

Thanks to [@xaellison](https://github.com/xaellison), our bitonic sorting
implementation now supports sorting specific dimensions, making it possible to
implement `sortperm` for multi-dimensional arrays:

```julia-repl
julia> A = cu([8 7; 5 6])
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
8 7
5 6
julia> sortperm(A, dims = 1)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
2 4
1 3
julia> sortperm(A, dims = 2)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
3 1
2 4
```
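
The permutation returned for a given `dims` consists of linear indices, so it
can be used directly to gather the sorted values (a small sketch; vectorized
indexing with an index array is assumed to be supported on the GPU here):

```julia
using CUDA

A = cu([8 7; 5 6])
perm = sortperm(A; dims=1)     # linear indices, as shown above
sorted = A[perm]               # gather the sorted values
@assert Array(sorted) == sort(Array(A); dims=1)
```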

The bitonic kernel is now used for all sorting operations, replacing the often
slower quicksort implementation:

```julia-repl
# before (quicksort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
2.760 ms (30 allocations: 1.02 KiB)
# after (bitonic sort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
246.386 μs (567 allocations: 13.66 KiB)
# reference CPU time
julia> @btime sort($(rand(Float32, 1024, 1024)); dims=1)
4.795 ms (1030 allocations: 5.07 MiB)
```


## Unified memory fixes

CUDA.jl 5.1 greatly improved support for unified memory, and this has continued
in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting `CuArray`s we now
correctly preserve the memory type of the input arrays. This means that if you
broadcast a `CuArray` that is allocated as unified memory, the result will also
be allocated as unified memory. In case of a conflict, e.g. broadcasting a
unified `CuArray` with one backed by device memory, we will prefer unified
memory:

```julia-repl
julia> cu([1]; host=true) .+ 1
1-element CuArray{Int64, 1, Mem.HostBuffer}:
2
julia> cu([1]; host=true) .+ cu([2]; device=true)
1-element CuArray{Int64, 1, Mem.UnifiedBuffer}:
3
```
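
As a small illustration (assuming the `unified` keyword introduced with the
CUDA.jl 5.1 unified memory work), the memory type of an explicitly unified
array is now carried through broadcasts:

```julia
using CUDA

x = cu([1, 2, 3]; unified=true)   # explicitly request unified memory
y = x .+ 1                        # the result should be unified-backed as well
@show typeof(x) typeof(y)
```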


## Software updates

Finally, we also did routine updates of the software stack to support the latest
and greatest from NVIDIA. This includes support for **CUDA 12.4** (Update 1),
**cuDNN 9**, and **cuTENSOR 2.0**. This latest release of cuTENSOR is noteworthy
as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to
follow this change. For more details, refer to the [cuTENSOR 2 migration
guide](https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html) by
NVIDIA.

Of course, cuTENSOR.jl also provides a high-level Julia API which has been
mostly unaffected by these changes:

```julia
using CUDA
A = CUDA.rand(7, 8, 3, 2)
B = CUDA.rand(3, 2, 2, 8)
C = CUDA.rand(3, 3, 7, 2)

using cuTENSOR
tA = CuTensor(A, ['a', 'f', 'b', 'e'])
tB = CuTensor(B, ['c', 'e', 'd', 'f'])
tC = CuTensor(C, ['b', 'c', 'a', 'd'])

using LinearAlgebra
mul!(tC, tA, tB)
```

This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and
have to adapt to the new API, now is a good time to consider improving the
high-level interface instead!


## Future releases

The next release of CUDA.jl is gearing up to be a much larger release, with
significant changes to both the API and internals of the package. Although the
intent is to keep these changes non-breaking, it is always possible that some
code will be affected in unexpected ways, so we encourage users to test the
upcoming release by simply running `] add CUDA#master`, and to report any issues.
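
A rough sketch of such a test run, using the standard Pkg workflow (nothing
here is specific to the upcoming release):

```julia-repl
pkg> add CUDA#master

julia> using CUDA

julia> CUDA.versioninfo()             # include this output when reporting issues

julia> import Pkg; Pkg.test("CUDA")   # optionally run the test suite
```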
