Add blog post on CUDA.jl v5.4

JuliaGPU · May 28, 2024 · 4008b7a · 4008b7a
1 parent dbb9765
commit 4008b7a
Showing 1 changed file with 314 additions and 0 deletions.
diff --git a/post/2024-05-28-cuda_5.4.md b/post/2024-05-28-cuda_5.4.md
@@ -0,0 +1,314 @@
++++
+title = "CUDA.jl 5.4: Memory management mayhem"
+author = "Tim Besard"
+external = true
+abstract = """
+  CUDA.jl 5.4 comes with many memory-management related changes that should improve
+  performance of memory-heavy applications, and make it easier to work with heterogeneous
+  set-ups involving multiple GPUs or using both the CPU and GPU."""
++++
+
+{{abstract}}
+
+Before anything else, let's get the breaking changes out of the way. CUDA.jl v5.4 only
+bumps the minor version, so it should be compatible with existing codebases. However, there
+are a couple of API changes that, although covered by appropriate deprecation warnings,
+applications should be updated to:
+
+* The `CUDA.Mem` submodule has been removed. All identifiers have been moved to the parent
+  `CUDA` submodule, with a couple being renamed in the process:
+  - `Mem.Device` and `Mem.DeviceBuffer` have been renamed to `CUDA.DeviceMemory` (the same
+    applies to `Mem.Host` and `Mem.Unified`);
+  - enums from the `Mem` submodule have gained a `MEM` suffix, e.g., `Mem.ATTACH_GLOBAL` has
+    been renamed to `CUDA.MEM_ATTACH_GLOBAL`;
+  - `Mem.set!` has been renamed to `CUDA.memset`;
+  - `Mem.info()` has been renamed to `CUDA.memory_info()`;
+* `CUDA.memory_status()` has been renamed to `CUDA.pool_status()`;
+- `CUDA.available_memory()` has been renamed to `CUDA.free_memory()`.
+
+The meat of this release is in the memory management improvements detailed below. These
+changes can have a significant impact of the performance of your application, so it's
+recommended to thoroughly test your application after upgrading!
+
+
+## Eager garbage collection
+
+Julia is a garbage collected language, which means that (GPU) allocations can fail because
+garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled
+this at the allocation site, detecting out-of-memory errors and triggering the GC. This was
+not ideal, as it could lead to significant pauses and a bloated memory usage.
+
+To improve this, **CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that
+information to trigger the GC early at appropriate times**, e.g., when waiting for a kernel
+to finish. This should lead to more predictable performance, both by distributing the cost
+of garbage collection over time and by potentially masking it behind other operations.
+
+For example, the following toy model implemented with Flux.jl allocates a ton of memory:
+
+```julia
+using CUDA, Flux
+using MLUtils: DataLoader
+
+n_obs = 300_000
+n_feature = 1000
+X = rand(n_feature, n_obs)
+y = rand(1, n_obs)
+train_data = DataLoader((X, y) |< gpu; batchsize = 2048, shuffle=false)
+
+model = Dense(n_feature, >) |< gpu
+loss(m, _x, _y) = Flux.Losses.mse(m(_x), _>)
+opt_state = Flux.setup(Flux.Adam(), model)
+Flux.train!(loss, model, train_data, opt_state)
+for epoch in 1:100
+  Flux.train!(loss, model, train_data, opt_state)
+end
+```
+
+Without eager garbage collection, this leads to expensive pauses while freeing a large
+amount of memory at every epoch. We can simulate this by artificially limiting the memory
+available to the GPU, while also disabling the new eager garbage collection feature by
+setting the `JULIA_CUDA_GC_EARLY` environment variable to `false` (this is a temporary knob
+that will be removed in the future, but may be useful now for evaluating the new feature):
+
+```text
+❯ JULIA_CUDA_GC_EARLY=false JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \
+  julia --project train.jl
+...
+[ Info: Epoch 90 train time 0.031s
+retry_reclaim: freed 2.865 GiB
+[ Info: Epoch 91 train time 0.031s
+[ Info: Epoch 92 train time 0.027s
+retry_reclaim: freed 2.865 GiB
+[ Info: Epoch 93 train time 0.03s
+retry_reclaim: freed 2.873 GiB
+[ Info: Epoch 94 train time 0.031s
+retry_reclaim: freed 2.873 GiB
+[ Info: Epoch 95 train time 0.03s
+retry_reclaim: freed 2.873 GiB
+[ Info: Epoch 96 train time 0.031s
+[ Info: Epoch 97 train time 0.027s
+retry_reclaim: freed 2.873 GiB
+[ Info: Epoch 98 train time 0.031s
+retry_reclaim: freed 2.865 GiB
+[ Info: Epoch 99 train time 0.031s
+retry_reclaim: freed 2.865 GiB
+[ Info: Epoch 100 train time 0.031s
+[ Info: Total time 4.307s
+```
+
+With eager garbage collection enabled, more frequent but less costly pauses result in
+significantly improved performance:
+
+```text
+❯ JULIA_CUDA_GC_EARLY=true JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \
+  julia --project wip.jl
+...
+[ Info: Epoch 90 train time 0.031s
+maybe_collect: collected 1.8 GiB
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 91 train time 0.033s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 92 train time 0.031s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 93 train time 0.031s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 94 train time 0.03s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 95 train time 0.03s
+maybe_collect: collected 1.8 GiB
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 96 train time 0.033s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 97 train time 0.03s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 98 train time 0.03s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 99 train time 0.03s
+maybe_collect: collected 1.8 GiB
+[ Info: Epoch 100 train time 0.03s
+[ Info: Total time 3.76s
+```
+
+Eager garbage collection is driven by a heuristic that considers the current memory
+pressure, how much memory was freed during previous collections, and how much time that
+took. It is possible that the current implementation is not optimal, so if you encounter
+performance issues, please file an issue.
+
+
+## Tracked memory allocations
+
+When working with multiple GPUs, it is important to differentiate between the device that
+memory was allocated on, and the device used to execute code. Practically, this meant that
+users of CUDA.jl had to manually remember that allocating and using `CuArray` objects
+(typically) needed to happen with the same device active. The same is true for streams,
+which are used to order operations executing on a single GPU.
+
+To improve this, **CUDA.jl now keeps track of the device that owns the memory, and the
+stream last used to access it, enabling the package to "do the right thing" when using that
+memory** in kernels or with library functionality. This does **not** mean that CUDA.jl will
+automatically switch the active device: We want to keep the user in control of that, as it
+often makes sense to access memory from another device, if your system supports it.
+
+Let's break down what the implications are of this change.
+
+**1. Using multiple GPUs**
+
+If you have multiple GPUs, it may be possible that direct P2P access between devices is
+possible (e.g., using NVLink, or just over PCIe). In this case, CUDA.jl will now
+automatically configure the system to allow such access, making it possible to seamlessly
+use memory allocated on one device in kernels executing on a different device:
+
+```julia
+julia> # Allocate memory on device 0
+       device!(0)
+CuDevice(0): Tesla V100-PCIE-16GB
+julia> a = CuArray([1]);
+
+julia> # Use on device 1
+       device!(1)
+CuDevice(1): Tesla V100S-PCIE-32GB
+julia> a .+ 1;
+```
+
+If P2P access between devices is not possible, CUDA.jl will now raise an error instead of
+throwing an illegal memory access error as it did before:
+
+```julia
+julia> # Use on incompatible device 2
+       device!(2)
+CuDevice(2): NVIDIA GeForce GTX 1080 Ti
+julia> a .+ 1
+ERROR: cannot take the GPU address of inaccessible device memory.
+
+You are trying to use memory from GPU 0 on GPU 2.
+P2P access between these devices is not possible;
+either switch to GPU 0 by calling `CUDA.device!(0)`,
+or copy the data to an array allocated on device 2.
+```
+
+As the error message suggests, you can always copy memory between devices using the
+`copyto!` function. In this case, CUDA.jl will fall back to staging the copy on the host
+when P2P access is not possible.
+
+**2. Using multiple streams**
+
+Streams are used to order operations executing on a single GPU. In CUDA.jl, every Julia task
+has its own stream, making it very easy to group independent operations together, and make
+it possible for the GPU to potentially overlap execution of these operations.
+
+Before CUDA.jl v5.4, users had to be careful about synchronizing data used in multiple
+tasks. It was recommended, for example, to end every data-producing task with an explicit
+call to `synchronize()`, or alternatively make sure to `device_synchronize()` at the start
+of a data-consuming task. Now that CUDA.jl keeps track of the stream used to last access
+memory, it can automatically synchronize streams when needed:
+
+```julia
+# Allocate some data
+a = CUDA.zeros(4096, 4096)
+b = CUDA.zeros(4096, 4096)
+#synchronize()  # No longer needed
+
+# Perform work on a task
+t = @async begin
+  a * b
+  #synchronize()  # No longer needed
+end
+
+# Fetch the results
+c = fetch(t)
+```
+
+**3. Using capturing APIs**
+
+All of the above is implemented by piggybacking on the function that converts memory objects
+to pointers, in the assumption that this will be the final operation before the memory is
+used. This is generally true, with one important exception: APIs that capture memory. For
+example, when recording an operation using the CUDA graph APIs, a memory address may be
+captured and used later without CUDA.jl being aware of it.
+
+CUDA.jl accounts for this by detecting conversions during stream capture, however, some APIs
+may not covered yet. If you encounter issues with capturing APIs, let us know, and keep
+using additional synchronization calls to ensure correctness.
+
+
+## Unified memory iteration
+
+Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and
+the GPU. We have now greatly **improved the performance of using unified memory with CPU
+code that iterates over elements** of a `CuArray`. Although this is typically unwanted,
+triggering the dreaded "scalar indexing" error when accessing device memory in such a way,
+it can be useful when incrementaly porting code to the GPU.
+
+Concretely, accessing elements of a unified `CuArray` on the CPU is much faster now:
+
+```julia-repl
+julia> # Reference
+       a = [1];
+julia> @btime $a[];
+  1.959 ns (0 allocations: 0 bytes)
+
+julia> b = cu(a; unified=true);
+
+julia> # Before
+       @btime $b[]
+  2.617 μs (0 allocations: 0 bytes);
+
+julia> # After
+       @btime $b[];
+  4.140 ns (0 allocations: 0 bytes)
+```
+
+Notice the different unit! This has a massive impact on real-life performance, for example,
+as demonstrated by calling `foldl` which does not have a GPU-optimized implementation:
+
+```julia-repl
+julia> a = cu(rand(1024, 1024); unified=true);
+
+julia> # Before
+       @b foldl(+, a)
+4.210 s (9 allocs: 208 bytes, without a warmup)
+
+julia> # After
+       @b foldl(+, a)
+3.107 ms (9 allocs: 208 bytes)
+```
+
+For completeness, doing this with regular device memory triggers a scalar indexing error:
+
+```julia-repl
+julia> a = cu(rand(1024, 1024));
+
+julia> foldl(+, a)
+ERROR: Scalar indexing is disallowed.
+```
+
+These changes should make it easier to port applications to the GPU by incrementally moving
+parts of the codebase to the GPU without having to worry about the performance of accessing
+memory from the CPU. The only requirement is to use unified memory, e.g., by calling `cu`
+with `unified=true`, or setting the CUDA.jl preference `default_memory` to use unified
+memory by default. However, as unified memory comes with a slight cost, and results in
+synchronous allocation behavior, it is still recommended to switch back to regular device
+memory when your application has been fully ported to the GPU.
+
+
+## Other changes
+
+To keep this post from becoming even longer, a quick rundown of other changes:
+
+* [@wsmoses](https://github.com/wsmoses) introduced initial support for automatic
+  differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have
+  to differentiate through host and device code separately, and manually set up rules for
+  crossing the host/device boundary. Now, you can differentiate through entire applications
+  with ease;
+* `CUDA.@profile` now [automatically detects external
+  profilers](https://github.com/JuliaGPU/CUDA.jl/pull/2339), so it should not be required to
+  specify `external=true` anymore when running under NSight;
+* [Exception output has been improved](https://github.com/JuliaGPU/CUDA.jl/pull/2342), only
+  reporting a single error message instead of generating output on each thread, and better
+  forwarding the exception type;
+* Cached handles from libraries [will now be
+  freed](https://github.com/JuliaGPU/CUDA.jl/pull/2352) when under memory pressure;
+* Tegra devices [are now supported](https://github.com/JuliaGPU/CUDA.jl/pull/2374) by our
+  artifacts, obviating the use of a local toolkit;
+* Support for [CUDA 12.5](https://github.com/JuliaGPU/CUDA.jl/pull/2392) has been added, as
+  well as [initial support for Julia 1.12](https://github.com/JuliaGPU/CUDA.jl/pull/2390).