Deploying to gh-pages from @ 95dcd9c 🚀
maleadt committed Apr 26, 2024
1 parent 3ec5dff commit 4200894
Showing 146 changed files with 141 additions and 11,427 deletions.
File renamed without changes.
2 changes: 0 additions & 2 deletions post/2023-11-07-cuda_5.1/index.html
@@ -148,8 +148,6 @@ <h1>CUDA.jl 5.1: Unified memory and cooperative groups</h1>

<!-- Content appended here -->

<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h1 id="cudajl_51_unified_memory_and_cooperative_groups"><a href="#cudajl_51_unified_memory_and_cooperative_groups" class="header-anchor">CUDA.jl 5.1: Unified memory and cooperative groups</a></h1>
<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h2 id="unified_memory"><a href="#unified_memory" class="header-anchor">Unified memory</a></h2>
<p>Unified memory is a feature of CUDA that allows the programmer to <strong>access memory from both the CPU and GPU</strong>, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.</p>
@@ -4,12 +4,12 @@
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no">
<meta content="index, follow" name=robots>
<link rel="icon" href="/previews/PR42/assets/favicon.ico">
<link rel="alternate" type="application/rss+xml" href="/previews/PR42/post/index.xml" title="RSS Feed for JuliaGPU">
<link rel="icon" href="/assets/favicon.ico">
<link rel="alternate" type="application/rss+xml" href="/post/index.xml" title="RSS Feed for JuliaGPU">

<link rel="stylesheet" href="/previews/PR42/css/bootstrap.min.css">
<link rel="stylesheet" href="/css/bootstrap.min.css">

<link rel="stylesheet" href="/previews/PR42/libs/highlight/github.min.css">
<link rel="stylesheet" href="/libs/highlight/github.min.css">


<style>
@@ -109,21 +109,21 @@
<div id=nav-border class=container>
<nav class="navbar navbar-expand-lg navbar-light justify-content-center">
<ul class=navbar-nav>
<li class="nav-item "><a class=nav-link href="/previews/PR42/"><i data-feather=home></i>Home</a>
<li class="nav-item "><a class=nav-link href="/"><i data-feather=home></i>Home</a>
</li>
<li class="nav-item active"><a class=nav-link href="/previews/PR42/post/"><i data-feather=file-text></i>Blog</a>
<li class="nav-item active"><a class=nav-link href="/post/"><i data-feather=file-text></i>Blog</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/learn/"><i data-feather=book-open></i>Learn</a>
<li class="nav-item "><a class=nav-link href="/learn/"><i data-feather=book-open></i>Learn</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/cuda/">CUDA</a>
<li class="nav-item "><a class=nav-link href="/cuda/">CUDA</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/rocm/">ROCm</a>
<li class="nav-item "><a class=nav-link href="/rocm/">ROCm</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/oneapi/">oneAPI</a>
<li class="nav-item "><a class=nav-link href="/oneapi/">oneAPI</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/metal/">Metal</a>
<li class="nav-item "><a class=nav-link href="/metal/">Metal</a>
</li>
<li class="nav-item "><a class=nav-link href="/previews/PR42/other/">Other</a>
<li class="nav-item "><a class=nav-link href="/other/">Other</a>
</li>
</ul>
</nav>
@@ -269,7 +269,7 @@ <h2 id="future_releases"><a href="#future_releases" class="header-anchor">Future



<script src="/previews/PR42/libs/highlight/highlight.min.js"></script>
<script src="/libs/highlight/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();hljs.configure({tabReplace: ' '});</script>


@@ -279,7 +279,7 @@ <h2 id="future_releases"><a href="#future_releases" class="header-anchor">Future
</footer>

<!-- FEATHER -->
<script src="/previews/PR42/libs/feather/feather.min.js"></script>
<script src="/libs/feather/feather.min.js"></script>
<script>feather.replace()</script>

<!-- GOOGLE ANALYTICS -->
8 changes: 8 additions & 0 deletions post/index.html
@@ -141,6 +141,14 @@ <h1>
<a href=/post/index.xml><i data-feather=rss></i></a>
</h1>

<p>
<a class=font-125 href="/post/2024-04-26-cuda_5.2_5.3/">
CUDA.jl 5.2 and 5.3: Maintenance releases
</a><span>&nbsp;&#8599;</span>
<br>
<i data-feather=calendar></i>
<time datetime=2024-4-26>Apr 26, 2024</time><br>
</p>
<p>
<a class=font-125 href="/post/2023-11-07-cuda_5.1/">
CUDA.jl 5.1: Unified memory and cooperative groups
112 changes: 110 additions & 2 deletions post/index.xml
@@ -18,6 +18,116 @@
type="application/rss+xml" />


<item>
<title><![CDATA[CUDA.jl 5.2 and 5.3: Maintenance releases]]></title>
<link>https://juliagpu.org/post/2024-04-26-cuda_5.2_5.3/index.html</link>
<guid>https://juliagpu.org/2024-04-26-cuda_5.2_5.3/</guid>
<description><![CDATA[CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.]]></description>

<content:encoded><![CDATA[
<p>CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.</p>
<h2 id="profiler_improvements">Profiler improvements</h2>
<p>CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia GPU applications without having to use Nsight Systems or other external tools. The tool has seen continued development, mostly improving its robustness, but CUDA.jl now also provides a <code>@bprofile</code> equivalent that runs your application multiple times and reports on the time distribution of individual events:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@bprofile CuArray&#40;&#91;1&#93;&#41; .&#43; 1
Profiler ran for 1.0 s, capturing 1427349 events.

Host-side activity: calling CUDA APIs took 792.95 ms &#40;79.29&#37; of the trace&#41;
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│ 19.27&#37; │ 192.67 ms │ 109796 │ 1.75 µs ± 10.19 &#40; 0.95 ‥ 1279.83&#41; │ cuMemAllocFromPoolAsync │
│ 17.08&#37; │ 170.8 ms │ 54898 │ 3.11 µs ± 0.27 &#40; 2.15 ‥ 23.84&#41; │ cuLaunchKernel │
│ 16.77&#37; │ 167.67 ms │ 54898 │ 3.05 µs ± 0.24 &#40; 0.48 ‥ 16.69&#41; │ cuCtxSynchronize │
│ 14.11&#37; │ 141.12 ms │ 54898 │ 2.57 µs ± 0.79 &#40; 1.67 ‥ 70.57&#41; │ cuMemcpyHtoDAsync │
│ 1.70&#37; │ 17.04 ms │ 54898 │ 310.36 ns ± 132.89 &#40;238.42 ‥ 5483.63&#41; │ cuStreamSynchronize │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 87.38 ms &#40;8.74&#37; of the trace&#41;
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│ 6.66&#37; │ 66.61 ms │ 54898 │ 1.21 µs ± 0.16 &#40; 0.95 ‥ 1.67&#41; │ kernel │
│ 2.08&#37; │ 20.77 ms │ 54898 │ 378.42 ns ± 147.66 &#40;238.42 ‥ 1192.09&#41; │ &#91;copy to device&#93; │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│ 98.99&#37; │ 989.94 ms │ 54898 │ 18.03 µs ± 49.88 &#40; 15.26 ‥ 10731.22&#41; │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘</code></pre>
<p>By default, <code>CUDA.@bprofile</code> runs the application for 1 second, but this can be adjusted using the <code>time</code> keyword argument.</p>
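<p>For example, to gather statistics over a longer five-second window &#40;a sketch; the exact event counts and timings will differ per system&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@bprofile time&#61;5 CuArray&#40;&#91;1&#93;&#41; .&#43; 1
Profiler ran for 5.0 s, capturing ... events.
...</code></pre>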
<p>Display of the time distribution isn&#39;t limited to <code>CUDA.@bprofile</code>, and will also be used by <code>CUDA.@profile</code> when any operation is called more than once. For example, with the broadcasting example from above we allocate both the input <code>CuArray</code> and the broadcast result, which results in two calls to the allocator:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CuArray&#40;&#91;1&#93;&#41; .&#43; 1Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│ 99.92&#37; │ 99.42 ms │ 1 │ │ cuMemcpyHtoDAsync │
│ 0.02&#37; │ 21.22 µs │ 2 │ 10.61 µs ± 6.57 &#40; 5.96 ‥ 15.26&#41; │ cuMemAllocFromPoolAsync │
│ 0.02&#37; │ 17.88 µs │ 1 │ │ cuLaunchKernel │
│ 0.00&#37; │ 953.67 ns │ 1 │ │ cuStreamSynchronize │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘</code></pre>
<h2 id="kernel_launch_debugging">Kernel launch debugging</h2>
<p>A common issue in CUDA programming is that kernel launches may fail when certain resources, such as shared memory or registers, are exhausted. This typically results in a cryptic error message, but CUDA.jl will now try to diagnose launch failures and provide a more helpful one, as suggested by <a href="https://github.com/simonbyrne">@simonbyrne</a>.</p>
<p>For example, when using more parameter memory than allowed by the architecture:</p>
<pre><code class="language-julia-repl">julia&gt; kernel&#40;x&#41; &#61; nothing
julia&gt; @cuda kernel&#40;ntuple&#40;_-&gt;UInt64&#40;1&#41;, 2^13&#41;&#41;
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.</code></pre>
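<p>The reported size can be verified from the argument itself: the tuple holds 2^13 <code>UInt64</code> values, or exactly 64 KiB, with the remaining handful of bytes presumably taken up by hidden arguments such as the kernel state:</p>
<pre><code class="language-julia-repl">julia&gt; sizeof&#40;ntuple&#40;_-&gt;UInt64&#40;1&#41;, 2^13&#41;&#41; / 1024
64.0</code></pre>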
<p>Or when using an invalid launch configuration, violating a device limit:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;2000 identity&#40;nothing&#41;
ERROR: Number of threads in x-dimension exceeds device limit &#40;2000 &gt; 1024&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
<p>We also diagnose launch failures that involve kernel-specific limits, such as exceeding the number of threads that are allowed in a block &#40;e.g., because of register use&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;1024 heavy_kernel&#40;&#41;
ERROR: Number of threads per block exceeds kernel limit &#40;1024 &gt; 512&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
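<p>One way to stay within such kernel-specific limits is to compile the kernel without launching it and ask the driver for a valid configuration first. A minimal sketch using the occupancy API, with the hypothetical <code>heavy_kernel</code> from above:</p>
<pre><code class="language-julia"># compile without launching
k &#61; @cuda launch&#61;false heavy_kernel&#40;&#41;

# query a launch configuration this kernel supports
config &#61; launch_configuration&#40;k.fun&#41;

# launch with a thread count the kernel can handle
k&#40;; threads&#61;config.threads, blocks&#61;config.blocks&#41;</code></pre>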
<h2 id="sorting_improvements">Sorting improvements</h2>
<p>Thanks to <a href="https://github.com/xaellison">@xaellison</a>, our bitonic sorting implementation now supports sorting specific dimensions, making it possible to implement <code>sortperm</code> for multi-dimensional arrays:</p>
<pre><code class="language-julia-repl">julia&gt; A &#61; cu&#40;&#91;8 7; 5 6&#93;&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
8 7
5 6

julia&gt; sortperm&#40;A, dims &#61; 1&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
2 4
1 3

julia&gt; sortperm&#40;A, dims &#61; 2&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
3 1
2 4</code></pre>
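<p>Note that when sorting along a dimension, the permutation contains linear indices, so it can be used to index the array directly. A quick check, reusing the matrix from above:</p>
<pre><code class="language-julia-repl">julia&gt; A&#91;sortperm&#40;A, dims &#61; 1&#41;&#93; &#61;&#61; sort&#40;A, dims &#61; 1&#41;
true</code></pre>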
<p>The bitonic kernel is now used for all sorting operations, in favor of the often slower quicksort implementation:</p>
<pre><code class="language-julia-repl"># before &#40;quicksort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
2.760 ms &#40;30 allocations: 1.02 KiB&#41;

# after &#40;bitonic sort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
246.386 μs &#40;567 allocations: 13.66 KiB&#41;

# reference CPU time
julia&gt; @btime sort&#40;&#36;&#40;rand&#40;Float32, 1024, 1024&#41;&#41;; dims&#61;1&#41;
4.795 ms &#40;1030 allocations: 5.07 MiB&#41;</code></pre>
<h2 id="unified_memory_fixes">Unified memory fixes</h2>
<p>CUDA.jl 5.1 greatly improved support for unified memory, and this has continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting <code>CuArray</code>s we now correctly preserve the memory type of the input arrays. This means that if you broadcast a <code>CuArray</code> that is allocated as unified memory, the result will also be allocated as unified memory. In case of a conflict, e.g. broadcasting a unified <code>CuArray</code> with one backed by device memory, we will prefer unified memory:</p>
<pre><code class="language-julia-repl">julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; 1
1-element CuArray&#123;Int64, 1, Mem.HostBuffer&#125;:
2

julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; cu&#40;&#91;2&#93;; device&#61;true&#41;
1-element CuArray&#123;Int64, 1, Mem.UnifiedBuffer&#125;:
3</code></pre>
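<p>Unified memory can also be requested explicitly when allocating, after which the array is conveniently accessible from the CPU as well. A sketch, assuming <code>cu</code>&#39;s <code>unified</code> keyword:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;&#91;1, 2, 3&#93;; unified&#61;true&#41;
3-element CuArray&#123;Int64, 1, Mem.UnifiedBuffer&#125;:
 1
 2
 3

julia&gt; a&#91;2&#93;  # unified memory is paged to the CPU on demand
2</code></pre>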
<h2 id="software_updates">Software updates</h2>
<p>Finally, we also did routine updates of the software stack to support the latest and greatest from NVIDIA: <strong>CUDA 12.4</strong> &#40;Update 1&#41;, <strong>cuDNN 9</strong>, and <strong>cuTENSOR 2.0</strong>. The cuTENSOR release is noteworthy as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to follow this change. For more details, refer to NVIDIA's <a href="https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html">cuTENSOR 2 migration guide</a>.</p>
<p>Of course, cuTENSOR.jl also provides a high-level Julia API which has been mostly unaffected by these changes:</p>
<pre><code class="language-julia">using CUDA
A &#61; CUDA.rand&#40;7, 8, 3, 2&#41;
B &#61; CUDA.rand&#40;3, 2, 2, 8&#41;
C &#61; CUDA.rand&#40;3, 3, 7, 2&#41;

using cuTENSOR
tA &#61; CuTensor&#40;A, &#91;&#39;a&#39;, &#39;f&#39;, &#39;b&#39;, &#39;e&#39;&#93;&#41;
tB &#61; CuTensor&#40;B, &#91;&#39;c&#39;, &#39;e&#39;, &#39;d&#39;, &#39;f&#39;&#93;&#41;
tC &#61; CuTensor&#40;C, &#91;&#39;b&#39;, &#39;c&#39;, &#39;a&#39;, &#39;d&#39;&#93;&#41;

using LinearAlgebra
mul&#33;&#40;tC, tA, tB&#41;</code></pre>
<p>This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and have to adapt to the new API, now is a good time to consider improving the high-level interface instead&#33;</p>
<h2 id="future_releases">Future releases</h2>
<p>The next release of CUDA.jl is gearing up to be a much larger release, with significant changes to both the API and internals of the package. Although the intent is to keep these changes non-breaking, it is always possible that some code will be affected in unexpected ways, so we encourage users to test the upcoming release by running <code>&#93; add CUDA#master</code>, and to report any issues.</p>
]]></content:encoded>

<pubDate>Fri, 26 Apr 2024 00:00:00 +0000</pubDate>


<atom:author>
<atom:name>Tim Besard</atom:name>
</atom:author>

</item>

<item>
<title><![CDATA[CUDA.jl 5.1: Unified memory and cooperative groups]]></title>
<link>https://juliagpu.org/post/2023-11-07-cuda_5.1/index.html</link>
@@ -26,8 +136,6 @@

<content:encoded><![CDATA[
<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h1 id="cudajl_51_unified_memory_and_cooperative_groups">CUDA.jl 5.1: Unified memory and cooperative groups</h1>
<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h2 id="unified_memory">Unified memory</h2>
<p>Unified memory is a feature of CUDA that allows the programmer to <strong>access memory from both the CPU and GPU</strong>, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.</p>
<p>CUDA.jl did already support unified memory, but only for the most basic use cases. With CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that memory from the CPU:</p>
Empty file removed previews/PR42/.nojekyll
Empty file.
