From d26014f45ccc509ac40aefa9c915297a5afeef47 Mon Sep 17 00:00:00 2001 From: Tim Besard Date: Mon, 27 May 2024 17:33:06 +0200 Subject: [PATCH] Add blog post on CUDA.jl v5.4 --- post/2024-05-28-cuda_5.4.md | 313 ++++++++++++++++++++++++++++++++++++ 1 file changed, 313 insertions(+) create mode 100644 post/2024-05-28-cuda_5.4.md diff --git a/post/2024-05-28-cuda_5.4.md b/post/2024-05-28-cuda_5.4.md new file mode 100644 index 0000000..9144b78 --- /dev/null +++ b/post/2024-05-28-cuda_5.4.md @@ -0,0 +1,313 @@ ++++ +title = "CUDA.jl 5.4: Memory management mayhem" +author = "Tim Besard" +external = true +abstract = """ + CUDA.jl 5.4 is an important release that comes with many memory-management related changes, + improving the performance and usability of several key features.""" ++++ + +{{abstract}} + +Before anything else, let's get the breaking changes out of the way. CUDA.jl v5.4 only +bumps the minor version, so it should be compatible with existing codebases. However, there +are a couple of API changes that, although covered by appropriate deprecation warnings, +applications should be updated to: + +* The `CUDA.Mem` submodule has been removed. All identifiers have been moved to the parent + `CUDA` submodule, with a couple being renamed in the process: + - `Mem.Device` and `Mem.DeviceBuffer` have been renamed to `CUDA.DeviceMemory` (the same + applies to `Mem.Host` and `Mem.Unified`); + - enums from the `Mem` submodule have gained a `MEM` suffix, e.g., `Mem.ATTACH_GLOBAL` has + been renamed to `CUDA.MEM_ATTACH_GLOBAL`; + - `Mem.set!` has been renamed to `CUDA.memset`; + - `Mem.info()` has been renamed to `CUDA.memory_info()`; +* `CUDA.memory_status()` has been renamed to `CUDA.pool_status()`; +- `CUDA.available_memory()` has been renamed to `CUDA.free_memory()`. + +The meat of this release is in the memory management improvements detailed below. These +changes can have a significant impact of the performance of your application, so it's +recommended to thoroughly test your application after upgrading! + + +## Eager garbage collection + +Julia is a garbage collected language, which means that (GPU) allocations can fail because +garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled +this at the allocation site, detecting out-of-memory errors and triggering the GC. This was +not ideal, as it could lead to significant pauses and a bloated memory usage. + +To improve this, **CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that +information to trigger the GC early at appropriate times**, e.g., when waiting for a kernel +to finish. This should lead to more predictable performance, both by distributing the cost +of garbage collection over time and by potentially masking it behind other operations. + +For example, the following toy model implemented with Flux.jl allocates a ton of memory: + +```julia +using CUDA, Flux +using MLUtils: DataLoader + +n_obs = 300_000 +n_feature = 1000 +X = rand(n_feature, n_obs) +y = rand(1, n_obs) +train_data = DataLoader((X, y) |< gpu; batchsize = 2048, shuffle=false) + +model = Dense(n_feature, >) |< gpu +loss(m, _x, _y) = Flux.Losses.mse(m(_x), _>) +opt_state = Flux.setup(Flux.Adam(), model) +Flux.train!(loss, model, train_data, opt_state) +for epoch in 1:100 + Flux.train!(loss, model, train_data, opt_state) +end +``` + +Without eager garbage collection, this leads to expensive pauses while freeing a large +amount of memory at every epoch. We can simulate this by artificially limiting the memory +available to the GPU, while also disabling the new eager garbage collection feature by +setting the `JULIA_CUDA_GC_EARLY` environment variable to `false` (this is a temporary knob +that will be removed in the future, but may be useful now for evaluating the new feature): + +```text +❯ JULIA_CUDA_GC_EARLY=false JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \ + julia --project train.jl +... +[ Info: Epoch 90 train time 0.031s +retry_reclaim: freed 2.865 GiB +[ Info: Epoch 91 train time 0.031s +[ Info: Epoch 92 train time 0.027s +retry_reclaim: freed 2.865 GiB +[ Info: Epoch 93 train time 0.03s +retry_reclaim: freed 2.873 GiB +[ Info: Epoch 94 train time 0.031s +retry_reclaim: freed 2.873 GiB +[ Info: Epoch 95 train time 0.03s +retry_reclaim: freed 2.873 GiB +[ Info: Epoch 96 train time 0.031s +[ Info: Epoch 97 train time 0.027s +retry_reclaim: freed 2.873 GiB +[ Info: Epoch 98 train time 0.031s +retry_reclaim: freed 2.865 GiB +[ Info: Epoch 99 train time 0.031s +retry_reclaim: freed 2.865 GiB +[ Info: Epoch 100 train time 0.031s +[ Info: Total time 4.307s +``` + +With eager garbage collection enabled, more frequent but less costly pauses result in +significantly improved performance: + +```text +❯ JULIA_CUDA_GC_EARLY=true JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \ + julia --project wip.jl +... +[ Info: Epoch 90 train time 0.031s +maybe_collect: collected 1.8 GiB +maybe_collect: collected 1.8 GiB +[ Info: Epoch 91 train time 0.033s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 92 train time 0.031s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 93 train time 0.031s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 94 train time 0.03s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 95 train time 0.03s +maybe_collect: collected 1.8 GiB +maybe_collect: collected 1.8 GiB +[ Info: Epoch 96 train time 0.033s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 97 train time 0.03s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 98 train time 0.03s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 99 train time 0.03s +maybe_collect: collected 1.8 GiB +[ Info: Epoch 100 train time 0.03s +[ Info: Total time 3.76s +``` + +Eager garbage collection is driven by a heuristic that considers the current memory +pressure, how much memory was freed during previous collections, and how much time that +took. It is possible that the current implementation is not optimal, so if you encounter +performance issues, please file an issue. + + +## Tracked memory allocations + +When working with multiple GPUs, it is important to differentiate between the device that +memory was allocated on, and the device used to execute code. Practically, this meant that +users of CUDA.jl had to manually remember that allocating and using `CuArray` objects +(typically) needed to happen with the same device active. The same is true for streams, +which are used to order operations executing on a single GPU. + +To improve this, **CUDA.jl now keeps track of the device that owns the memory, and the +stream last used to access it, enabling the package to "do the right thing" when using that +memory** in kernels or with library functionality. This does **not** mean that CUDA.jl will +automatically switch the active device: We want to keep the user in control of that, as it +often makes sense to access memory from another device, if your system supports it. + +Let's break down what the implications are of this change. + +**1. Using multiple GPUs** + +If you have multiple GPUs, it may be possible that direct P2P access between devices is +possible (e.g., using NVLink, or just over PCIe). In this case, CUDA.jl will now +automatically configure the system to allow such access, making it possible to seamlessly +use memory allocated on one device in kernels executing on a different device: + +```julia +julia> # Allocate memory on device 0 + device!(0) +CuDevice(0): Tesla V100-PCIE-16GB +julia> a = CuArray([1]); + +julia> # Use on device 1 + device!(1) +CuDevice(1): Tesla V100S-PCIE-32GB +julia> a .+ 1; +``` + +If P2P access between devices is not possible, CUDA.jl will now raise an error instead of +throwing an illegal memory access error as it did before: + +```julia +julia> # Use on incompatible device 2 + device!(2) +CuDevice(5): NVIDIA GeForce GTX 1080 Ti +julia> a .+ 1 +ERROR: cannot take the GPU address of inaccessible device memory. + +You are trying to use memory from GPU 0 on GPU 2. +P2P access between these devices is not possible; +either switch to GPU 0 by calling `CUDA.device!(0)`, +or copy the data to an array allocated on device 2. +``` + +As the error message suggests, you can always copy memory between devices using the +`copyto!` function. In this case, CUDA.jl will fall back to staging the copy on the host +when P2P access is not possible. + +**2. Using multiple streams** + +Streams are used to order operations executing on a single GPU. In CUDA.jl, every Julia task +has its own stream, making it very easy to group independent operations together, and make +it possible for the GPU to potentially overlap execution of these operations. + +Before CUDA.jl v5.4, users had to be careful about synchronizing data used in multiple +tasks. It was recommended, for example, to end every data-producing task with an explicit +call to `synchronize()`, or alternatively make sure to `device_synchronize()` at the start +of a data-consuming task. Now that CUDA.jl keeps track of the stream used to last access +memory, it can automatically synchronize streams when needed: + +```julia +# Allocate some data +a = CUDA.zeros(4096, 4096) +b = CUDA.zeros(4096, 4096) +#synchronize() # No longer needed + +# Perform work on a task +t = @async begin + a * b + #synchronize() # No longer needed +end + +# Fetch the results +c = fetch(t) +``` + +**3. Using capturing APIs** + +All of the above is implemented by piggybacking on the function that converts memory objects +to pointers, in the assumption that this will be the final operation before the memory is +used. This is generally true, with one important exception: APIs that capture memory. For +example, when recording an operation using the CUDA graph APIs, a memory address may be +captured and used later without CUDA.jl being aware of it. + +CUDA.jl accounts for this by detecting conversions during stream capture, however, some APIs +may not covered yet. If you encounter issues with capturing APIs, let us know, and keep +using additional synchronization calls to ensure correctness. + + +## Unified memory iteration + +Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and +the GPU. We have now greatly **improved the performance of using unified memory with CPU +code that iterates over elements** of a `CuArray`. Although this is typically unwanted, +triggering the dreaded "scalar indexing" error when accessing device memory in such a way, +it can be useful when incrementaly porting code to the GPU. + +Concretely, accessing elements of a unified `CuArray` on the CPU is much faster now: + +```julia-repl +julia> # Reference + a = [1]; +julia> @btime $a[]; + 1.959 ns (0 allocations: 0 bytes) + +julia> b = cu(a; unified=true); + +julia> # Before + @btime $b[] + 2.617 μs (0 allocations: 0 bytes); + +julia> # After + @btime $b[]; + 4.140 ns (0 allocations: 0 bytes) +``` + +Notice the different unit! This has a massive impact on real-life performance, for example, +as demonstrated by calling `foldl` which does not have a GPU-optimized implementation: + +```julia-repl +julia> a = cu(rand(1024, 1024); unified=true); + +julia> # Before + @b foldl(+, a) +4.210 s (9 allocs: 208 bytes, without a warmup) + +julia> # After + @b foldl(+, a) +3.107 ms (9 allocs: 208 bytes) +``` + +For completeness, doing this with regular device memory triggers a scalar indexing error: + +```julia-repl +julia> a = cu(rand(1024, 1024)); + +julia> foldl(+, a) +ERROR: Scalar indexing is disallowed. +``` + +These changes should make it easier to port applications to the GPU by incrementally moving +parts of the codebase to the GPU without having to worry about the performance of accessing +memory from the CPU. The only requirement is to use unified memory, e.g., by calling `cu` +with `unified=true`, or setting the CUDA.jl preference `default_memory` to use unified +memory by default. However, as unified memory comes with a slight cost, and results in +synchronous allocation behavior, it is still recommended to switch back to regular device +memory when your application has been fully ported to the GPU. + + +## Other changes + +To keep this post from becoming even longer, a quick rundown of other changes: + +* [@wsmoses](https://github.com/wsmoses) introduced initial support for automatic + differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have + to differentiate through host and device code separately, and manually set up rules for + crossing the host/device boundary. Now, you can differentiate through entire applications + with ease; +* `CUDA.@profile` now [automatically detects external + profilers](https://github.com/JuliaGPU/CUDA.jl/pull/2339), so it should not be required to + specify `external=true` anymore when running under NSight; +* [Exception output has been improved](https://github.com/JuliaGPU/CUDA.jl/pull/2342), only + reporting a single error message instead of generating output on each thread, and better + forwarding the exception type; +* Cached handles from libraries [will now be + freed](https://github.com/JuliaGPU/CUDA.jl/pull/2352) when under memory pressure; +* Tegra devices [are now supported](https://github.com/JuliaGPU/CUDA.jl/pull/2374) by our + artifacts, obviating the use of a local toolkit; +* Support for [CUDA 12.5](https://github.com/JuliaGPU/CUDA.jl/pull/2392) has been added, as + well as [initial support for Julia 1.12](https://github.com/JuliaGPU/CUDA.jl/pull/2390).