From 7d29b4ede1ed8cfeb7659f8cf256e9712a8a93f0 Mon Sep 17 00:00:00 2001
From: Chris Abraham
Date: Thu, 21 Nov 2024 13:57:54 -0800
Subject: [PATCH] Fixed case on key words in blog posts (#1819)

Signed-off-by: Chris Abraham
---
 _posts/2024-10-17-pytorch2-5.md                    |  6 +++---
 _posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md | 14 +++++++-------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/_posts/2024-10-17-pytorch2-5.md b/_posts/2024-10-17-pytorch2-5.md
index 4aa18d841b61..f077d613cd30 100644
--- a/_posts/2024-10-17-pytorch2-5.md
+++ b/_posts/2024-10-17-pytorch2-5.md
@@ -3,7 +3,7 @@ layout: blog_detail
 title: "PyTorch 2.5 Release Blog"
 ---
 
-We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
+We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new cuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
 
 This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.
 
@@ -18,7 +18,7 @@ As well, please check out our new ecosystem projects releases with [TorchRec](ht
 
-CuDNN backend for SDPA
+cuDNN backend for SDPA
 
 FlexAttention
@@ -74,7 +74,7 @@ As well, please check out our new ecosystem projects releases with [TorchRec](ht
 
 ## BETA FEATURES
 
-### [Beta] CuDNN backend for SDPA
+### [Beta] cuDNN backend for SDPA
 
 The cuDNN "Fused Flash Attention" backend was landed for *torch.nn.functional.scaled_dot_product_attention*. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
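For reference, a minimal sketch of how the two features named in the patched paragraph above are exercised from Python, assuming PyTorch 2.5 on an H100-class GPU; the `Block` module below is a made-up stand-in for a repeated transformer layer, not code from the post:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Toy attention inputs: (batch, heads, seq_len, head_dim) in bf16 on GPU.
q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# On H100 or newer, SDPA can dispatch to the cuDNN fused attention backend by
# default; the context manager below only restricts dispatch to that backend.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

# Regional compilation: compile the repeated nn.Module (e.g. one transformer
# layer) instead of the whole model, so compiled code is reused across layers.
class Block(torch.nn.Module):  # hypothetical stand-in for a transformer layer
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = torch.nn.Sequential(*[Block() for _ in range(12)]).cuda()
for block in model:
    block.compile()  # in-place torch.compile() on each repeated block

y = model(torch.randn(32, 64, device="cuda"))
```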
diff --git a/_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md b/_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md
index 912b9552de68..4bbc3203072f 100644
--- a/_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md
+++ b/_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md
@@ -1,6 +1,6 @@
 ---
 layout: blog_detail
-title: "Deep Dive on Cutlass Ping-Pong GEMM Kernel"
+title: "Deep Dive on CUTLASS Ping-Pong GEMM Kernel"
 author: Less Wright, Adnan Hoque
 ---
 
@@ -10,7 +10,7 @@ author: Less Wright, Adnan Hoque
 
 ## Summary
 
-In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the cutlass Ping-Pong GEMM kernel.
+In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the CUTLASS Ping-Pong GEMM kernel.
 
 Ping-Pong is one of the fastest matmul (GEMM) kernel architectures available for the Hopper GPU architecture. Ping-Pong is a member of the Warp Group Specialized Persistent Kernels family, which includes both Cooperative and Ping-Pong variants. Relative to previous GPUs, Hopper’s substantial tensor core compute capability requires deep asynchronous software pipelining in order to achieve peak performance.
 
@@ -30,7 +30,7 @@ For Ping-Pong, each warp group takes on a specialized role of either Data produc
 
 The producer warp group focuses on producing data movement to fill the shared memory buffers (via TMA). Two other warp groups are dedicated consumers that process the math (MMA) portion with tensor cores, and then do any follow up work and write their results back to global memory (epilogue).
 
-Producer warp groups work with TMA (Tensor Memory Accelerator), and are deliberately kept as lightweight as possible. In fact, in Ping-Pong, they deliberately reduce their register resources to improve occupancy. Producers will reduce their max register counts by 40, vs consumers will increase their max register count by 232, an effect we can see in the cutlass source and corresponding SASS:
+Producer warp groups work with TMA (Tensor Memory Accelerator), and are deliberately kept as lightweight as possible. In fact, in Ping-Pong, they deliberately reduce their register resources to improve occupancy. Producers will reduce their max register counts by 40, vs consumers will increase their max register count by 232, an effect we can see in the CUTLASS source and corresponding SASS:
 
 ![source code](/assets/images/cutlass-ping-pong-gemm-kernel/fg2.png){:style="width:100%"}
 
@@ -76,13 +76,13 @@ To expand on TMA, or Tensor Memory Accelerator, TMA is a hardware component intr
 
 ## CUTLASS Asynchronous Pipeline Class
 
-This signaling between producers and consumers is coordinated via the new Asynchronous Pipeline Class which Cutlass describes as follows:
+This signaling between producers and consumers is coordinated via the new Asynchronous Pipeline Class which CUTLASS describes as follows:
 
 “Implementing a persistent GEMM algorithm calls for managing dozens of different kinds of asynchronously executing operations that synchronize using multiple barriers organized as a circular list.
 
 This complexity is too much for human programmers to manage by hand.
-As a result, we have developed [[Cutlass Pipeline Async Class](https://l.workplace.com/l.php?u=https%3A%2F%2Fgithub.com%2FNVIDIA%2Fcutlass%2Fblob%2Fmain%2Finclude%2Fcutlass%2Fpipeline%2Fsm90_pipeline.hpp&h=AT0Qy69t9mn_9VGkJlf1TkC_yCVPAQbYzHtS9it0ZVxTxVasGZfb6u-VHKReULm29NsLhp3DtuRfN4BHnzczniArsCFe8Uzj7izIx646Otyl4lEwl9jUHDhTcUq87KfS919MkadFMjq5i4qtkbe7QbgZEMbhFi0ARgvz3-u7_X0Hf3kHwQ&__tn__=-UK-R&c[0]=AT2Wep-mQJcJ7w2cBPcqoNcO9gLYx7_Qg9TGIcfKPSoo8kGdDtl70vKog1VICaOX45DhNP-Eu6pUbUl9TxGeGLQHgzyXWuxAgDQrdlOhhiOC3QRDMckh2vCi8RADkSCainRbZ5JoF7CERyij7CrhsSskOfVqQ_fvN-lKG6W2_TkvMFLe8UbKNPkzSqjzfdo)]…”
+As a result, we have developed [[CUTLASS Pipeline Async Class](https://l.workplace.com/l.php?u=https%3A%2F%2Fgithub.com%2FNVIDIA%2Fcutlass%2Fblob%2Fmain%2Finclude%2Fcutlass%2Fpipeline%2Fsm90_pipeline.hpp&h=AT0Qy69t9mn_9VGkJlf1TkC_yCVPAQbYzHtS9it0ZVxTxVasGZfb6u-VHKReULm29NsLhp3DtuRfN4BHnzczniArsCFe8Uzj7izIx646Otyl4lEwl9jUHDhTcUq87KfS919MkadFMjq5i4qtkbe7QbgZEMbhFi0ARgvz3-u7_X0Hf3kHwQ&__tn__=-UK-R&c[0]=AT2Wep-mQJcJ7w2cBPcqoNcO9gLYx7_Qg9TGIcfKPSoo8kGdDtl70vKog1VICaOX45DhNP-Eu6pUbUl9TxGeGLQHgzyXWuxAgDQrdlOhhiOC3QRDMckh2vCi8RADkSCainRbZ5JoF7CERyij7CrhsSskOfVqQ_fvN-lKG6W2_TkvMFLe8UbKNPkzSqjzfdo)]…”
 
 ## Barriers and synchronization within the Ping-Pong async pipeline
 
@@ -182,7 +182,7 @@ And translating that into a relative speedup chart of Ping-Pong vs cuBLAS and Tr
 
 **Figure 5, above: Relative speedup of Ping-Pong vs the two closest kernels.**
 
-The full source code for the Ping-Pong kernel is here (619 lines of deeply templated Cutlass code, or to paraphrase the famous turtle meme - "it's templates...all the way down! ):
+The full source code for the Ping-Pong kernel is here (619 lines of deeply templated CUTLASS code, or to paraphrase the famous turtle meme - "it's templates...all the way down! ):
 
 - [https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp)
 
@@ -190,7 +190,7 @@ In addition, we have implemented PingPong as a CPP extension to make it easy to
 
 - [https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/cutlass_gemm](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/cutlass_gemm)
 
-Finally, for continued learning, Nvidia has two GTC videos that dive into kernel design with Cutlass:
+Finally, for continued learning, Nvidia has two GTC videos that dive into kernel design with CUTLASS:
 
 - [Developing Optimal CUDA Kernels on Hopper Tensor Cores \| GTC Digital Spring 2023 \| NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51413/)
 - [CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores \| GTC 24 2024 \| NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/)
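For reference, a rough benchmarking sketch in the spirit of the kernel comparison described in the patched post, using a bf16 matmul as the cuBLAS baseline rather than the post's FP8 path; `pingpong_gemm` is a hypothetical name standing in for the CPP-extension entry point linked above, not a confirmed API:

```python
import torch
from torch.utils import benchmark

M, N, K = 8192, 8192, 8192
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

def bench_us(stmt, **env):
    # Median runtime in microseconds via PyTorch's benchmark utilities.
    return benchmark.Timer(stmt=stmt, globals=env).blocked_autorange().median * 1e6

cublas_us = bench_us("a @ b", a=a, b=b)
print(f"cuBLAS bf16 GEMM {M}x{N}x{K}: {cublas_us:.1f} us")

# A Ping-Pong comparison would time the extension's GEMM the same way, e.g.:
# pingpong_us = bench_us("pingpong_gemm(a, b)", pingpong_gemm=pingpong_gemm, a=a, b=b)
# print(f"speedup vs cuBLAS: {cublas_us / pingpong_us:.2f}x")
```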