Commit

Deploying to gh-pages from @ 602dee2 🚀
maleadt committed May 28, 2024
1 parent 3bf2ad9 commit 99e57ca
Showing 2 changed files with 20 additions and 16 deletions.
19 changes: 11 additions & 8 deletions previews/PR44/post/2024-05-28-cuda_5.4/index.html
@@ -171,7 +171,7 @@ <h1>CUDA.jl 5.4: Memory management mayhem</h1>
<p>The meat of this release is in the memory management improvements detailed below. These changes can have a significant impact on the performance of your application, so it's recommended to thoroughly test your application after upgrading!</p>
<h2 id="eager_garbage_collection"><a href="#eager_garbage_collection" class="header-anchor">Eager garbage collection</a></h2>
<p>Julia is a garbage collected language, which means that (GPU) allocations can fail because garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled this at the allocation site, detecting out-of-memory errors and triggering the GC. This was not ideal, as it could lead to significant pauses and bloated memory usage.</p>
<p>To improve this, CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that information to trigger the GC early at appropriate times, e.g., when waiting for a kernel to finish. This should lead to more predictable performance, both by distributing the cost of garbage collection over time and by potentially masking it behind other operations.</p>
<p>To improve this, <strong>CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that information to trigger the GC early at appropriate times</strong>, e.g., when waiting for a kernel to finish. This should lead to more predictable performance, both by distributing the cost of garbage collection over time and by potentially masking it behind other operations.</p>
<p>For example, the following toy model implemented with Flux.jl allocates a ton of memory:</p>
<pre><code class="language-julia">using CUDA, Flux
using MLUtils: DataLoader
@@ -244,7 +244,7 @@ <h2 id="eager_garbage_collection">
<p>Eager garbage collection is driven by a heuristic that considers the current memory pressure, how much memory was freed during previous collections, and how much time that took. It is possible that the current implementation is not optimal, so if you encounter performance issues, please file an issue.</p>
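<p>If you suspect the heuristic is misbehaving, it can help to inspect memory usage and trigger a collection by hand before filing that issue. A minimal sketch using standard CUDA.jl entry points (illustrative only; the exact report format varies between versions):</p>
<pre><code class="language-julia">using CUDA

a = CUDA.rand(Float32, 1024, 1024)   # allocate some GPU memory

CUDA.memory_status()                 # report current device/pool memory usage

# drop the reference, run the Julia GC, and return cached memory to the driver
a = nothing
GC.gc()
CUDA.reclaim()

CUDA.memory_status()</code></pre>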
<h2 id="tracked_memory_allocations"><a href="#tracked_memory_allocations" class="header-anchor">Tracked memory allocations</a></h2>
<p>When working with multiple GPUs, it is important to differentiate between the device that memory was allocated on, and the device used to execute code. Practically, this meant that users of CUDA.jl had to manually remember that allocating and using <code>CuArray</code> objects (typically) needed to happen with the same device active. The same is true for streams, which are used to order operations executing on a single GPU.</p>
<p>To improve this, CUDA.jl now keeps track of the device that owns the memory, and the stream last used to access it, enabling the package to "do the right thing" when using that memory in kernels or with library functionality. This does <strong>not</strong> mean that CUDA.jl will automatically switch the active device: We want to keep the user in control of that, as it often makes sense to access memory from another device, if your system supports it.</p>
<p>To improve this, <strong>CUDA.jl now keeps track of the device that owns the memory, and the stream last used to access it, enabling the package to "do the right thing" when using that memory</strong> in kernels or with library functionality. This does <strong>not</strong> mean that CUDA.jl will automatically switch the active device: We want to keep the user in control of that, as it often makes sense to access memory from another device, if your system supports it.</p>
<p>Let&#39;s break down what the implications are of this change.</p>
<p><strong>1. Using multiple GPUs</strong></p>
<p>If you have multiple GPUs, direct P2P access between devices may be possible (e.g., using NVLink, or just over PCIe). In this case, CUDA.jl will now automatically configure the system to allow such access, making it possible to seamlessly use memory allocated on one device in kernels executing on a different device:</p>
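<p>(The original example is collapsed in this diff.) As an illustrative sketch only, assuming at least two visible CUDA devices with P2P access between them, cross-device use can look like this:</p>
<pre><code class="language-julia">using CUDA

device!(0)
a = CUDA.rand(Float32, 1024)   # memory owned by device 0

device!(1)
b = a .+ 1                     # kernel runs on device 1, reading device 0's memory
synchronize()

Array(b)                       # copy the result back to the CPU</code></pre>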
@@ -289,8 +289,8 @@ <h2 id="tracked_memory_allocations">
<p>All of the above is implemented by piggybacking on the function that converts memory objects to pointers, on the assumption that this will be the final operation before the memory is used. This is generally true, with one important exception: APIs that capture memory. For example, when recording an operation using the CUDA graph APIs, a memory address may be captured and used later without CUDA.jl being aware of it.</p>
<p>CUDA.jl accounts for this by detecting conversions during stream capture; however, some APIs may not be covered yet. If you encounter issues with capturing APIs, let us know, and keep using additional synchronization calls to ensure correctness.</p>
<h2 id="unified_memory_iteration"><a href="#unified_memory_iteration" class="header-anchor">Unified memory iteration</a></h2>
<p>As part of these changes, we refactored how unified memory is tracked, improving performance when accessing <code>CuArray</code>s on the CPU. Although this is generally unwanted, triggering the dreaded "scalar iteration" error when accessing device memory like that, with unified memory it's a common pattern to use the same memory on both the CPU and GPU.</p>
<p>In CUDA.jl v5.4, iterating unified GPU memory on the CPU has been greatly optimized:</p>
<p>Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and the GPU. We have now greatly <strong>improved the performance of using unified memory with CPU code that iterates over elements</strong> of a <code>CuArray</code>. Although this is typically unwanted, triggering the dreaded "scalar indexing" error when accessing device memory in such a way, it can be useful when incrementally porting code to the GPU.</p>
<p>Concretely, accessing elements of a unified <code>CuArray</code> on the CPU is much faster now:</p>
<pre><code class="language-julia-repl">julia&gt; # Reference
a &#61; &#91;1&#93;;
julia&gt; @btime &#36;a&#91;&#93;;
1.959 ns (0 allocations: 0 bytes)

julia> b = cu(a; unified=true);

julia> # Before
@btime $b[];
2.617 μs (0 allocations: 0 bytes)

julia> # After (notice the different unit!)
julia> # After
@btime $b[];
4.140 ns (0 allocations: 0 bytes)</code></pre>
<p>This has a massive impact on real-life performance, for example, as demonstrated by calling <code>foldl</code>, which does not have a CUDA.jl implementation:</p>
<p>Notice the different unit! This has a massive impact on real-life performance, for example, as demonstrated by calling <code>foldl</code>, which does not have a GPU-optimized implementation:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;; unified&#61;true&#41;;

julia&gt; # Before
@b foldl(+, a)
4.210 s (9 allocs: 208 bytes, without a warmup)

julia> # After
@b foldl(+, a)
3.107 ms (9 allocs: 208 bytes)</code></pre>
<p>For completeness, doing this with regular device memory triggers a scalar indexing error:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;&#41;;

julia&gt; foldl&#40;&#43;, a&#41;
ERROR: Scalar indexing is disallowed.</code></pre>
<p>These changes should make it easier to port applications to the GPU by incrementally moving parts of the codebase over, without having to worry about the performance of accessing memory from the CPU. The only requirement is to use unified memory, e.g., by calling <code>cu</code> with <code>unified=true</code>, or by setting the CUDA.jl preference <code>default_memory</code> to use unified memory by default. However, as unified memory comes with a slight cost and results in synchronous allocation behavior, it is still recommended to switch back to regular device memory once your application has been fully ported to the GPU.</p>
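<p>As a small sketch of that incremental-porting workflow (the helper function and array contents here are arbitrary), unified memory lets plain CPU code and GPU operations share the same array:</p>
<pre><code class="language-julia">using CUDA

a = cu(rand(Float32, 1_000); unified=true)   # unified memory: visible to CPU and GPU

# plain CPU code can iterate the elements directly, without a scalar-indexing error
function cpu_sum(xs)
    total = 0.0f0
    for x in xs
        total += x
    end
    return total
end

cpu_sum(a) ≈ sum(a)   # sum(a) still runs on the GPU</code></pre>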
<h2 id="other_changes"><a href="#other_changes" class="header-anchor">Other changes</a></h2>
<p>To keep this post from becoming even longer, a quick rundown of other changes:</p>
<ul>
<li><p><a href="https://github.com/wsmoses">@wsmoses</a> introduced initial support for automatic differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have to differentiate through host and device code separately, and manually set up rules for crossing the host/device boundary. Now, you can differentiate through entire applications with ease;</p>
</li>
<li><p><code>CUDA.Mem</code> has been deprecated: <code>Mem.(Device|Unified|Host)</code> has been renamed to <code>CUDA.(Device|Unified|Host)Memory</code>, and other identifiers have been moved to the <code>CUDA</code> module. These changes are breaking, but covered by deprecation warnings;</p>
</li>
<li><p><code>CUDA.@profile</code> now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2339">automatically detects external profilers</a>, so it should not be required to specify <code>external=true</code> anymore when running under Nsight;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2342">Exception output has been improved</a>, only reporting a single error message instead of generating output on each thread, and better forwarding the exception type;</p>
17 changes: 9 additions & 8 deletions previews/PR44/post/index.xml
(The RSS feed mirrors the same content changes as the HTML post above; the duplicated diff is omitted.)