Fixed case on key words in blog posts (#1819)
Signed-off-by: Chris Abraham <[email protected]>
cjyabraham authored Nov 21, 2024
1 parent (7acfae9), commit 7d29b4e
Showing 2 changed files with 10 additions and 10 deletions.
_posts/2024-10-17-pytorch2-5.md: 6 changes (3 additions & 3 deletions)
@@ -3,7 +3,7 @@ layout: blog_detail
title: "PyTorch 2.5 Release Blog"
---

- We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
+ We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new cuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
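
To make the regional compilation point concrete, here is a minimal sketch (ours, not from the release notes) of the pattern it describes: applying torch.compile to a repeated block rather than to the whole model, so the compiled artifact is reused across identical layers. The ToyBlock/ToyModel names and shapes below are hypothetical placeholders.

```python
# Minimal sketch of regional compilation: compile one repeated block
# (e.g. a transformer-style layer) instead of the whole model, so the
# compiled code is reused across identical layers without recompilation.
# ToyBlock/ToyModel are hypothetical stand-ins, not from the release notes.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

class ToyModel(nn.Module):
    def __init__(self, depth=8, dim=256):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = ToyModel()

# Regional compilation: compile the repeated block rather than the full model.
for block in model.blocks:
    block.compile()  # equivalent in spirit to torch.compile(block)

out = model(torch.randn(2, 16, 256))  # batch, sequence, hidden dim (illustrative)
```

Compiling per block trades a set of small compiled graphs for a much shorter cold start than compiling the full model in one shot.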

This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.

@@ -18,7 +18,7 @@ As well, please check out our new ecosystem projects releases with [TorchRec](ht
</td>
</tr>
<tr>
- <td>CuDNN backend for SDPA
+ <td>cuDNN backend for SDPA
</td>
<td>FlexAttention
</td>
@@ -74,7 +74,7 @@ As well, please check out our new ecosystem projects releases with [TorchRec](ht
## BETA FEATURES


- ### [Beta] CuDNN backend for SDPA
+ ### [Beta] cuDNN backend for SDPA

The cuDNN "Fused Flash Attention" backend was landed for *torch.nn.functional.scaled_dot_product_attention*. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
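
As a minimal illustration (shapes and dtypes are ours, not from the post), SDPA is called exactly as before and the backend is chosen automatically on supported GPUs; torch.nn.attention.sdpa_kernel can optionally pin a specific backend, e.g. for benchmarking.

```python
# Sketch of calling scaled_dot_product_attention; on H100-class GPUs in
# PyTorch 2.5 the cuDNN fused backend is used by default. Shapes are
# illustrative only.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# batch, heads, sequence length, head dim
q, k, v = (torch.randn(8, 16, 1024, 64, device=device, dtype=dtype) for _ in range(3))

out = F.scaled_dot_product_attention(q, k, v)  # backend selection is automatic

if device == "cuda":
    # Optionally restrict SDPA to the cuDNN backend, e.g. to compare backends.
    from torch.nn.attention import SDPBackend, sdpa_kernel
    with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
        out_cudnn = F.scaled_dot_product_attention(q, k, v)
```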

_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md: 14 changes (7 additions & 7 deletions)
@@ -1,6 +1,6 @@
---
layout: blog_detail
title: "Deep Dive on Cutlass Ping-Pong GEMM Kernel"
title: "Deep Dive on CUTLASS Ping-Pong GEMM Kernel"
author: Less Wright, Adnan Hoque
---

@@ -10,7 +10,7 @@ author: Less Wright, Adnan Hoque

## Summary

- In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the cutlass Ping-Pong GEMM kernel.
+ In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the CUTLASS Ping-Pong GEMM kernel.

Ping-Pong is one of the fastest matmul (GEMM) kernel architectures available for the Hopper GPU architecture. Ping-Pong is a member of the Warp Group Specialized Persistent Kernels family, which includes both Cooperative and Ping-Pong variants. Relative to previous GPUs, Hopper’s substantial tensor core compute capability requires deep asynchronous software pipelining in order to achieve peak performance.

@@ -30,7 +30,7 @@ For Ping-Pong, each warp group takes on a specialized role of either Data produc

The producer warp group focuses on producing data movement to fill the shared memory buffers (via TMA). Two other warp groups are dedicated consumers that process the math (MMA) portion with tensor cores, and then do any follow up work and write their results back to global memory (epilogue).

- Producer warp groups work with TMA (Tensor Memory Accelerator), and are deliberately kept as lightweight as possible. In fact, in Ping-Pong, they deliberately reduce their register resources to improve occupancy. Producers will reduce their max register counts by 40, vs consumers will increase their max register count by 232, an effect we can see in the cutlass source and corresponding SASS:
+ Producer warp groups work with TMA (Tensor Memory Accelerator), and are deliberately kept as lightweight as possible. In fact, in Ping-Pong, they deliberately reduce their register resources to improve occupancy. Producers will reduce their max register counts by 40, vs consumers will increase their max register count by 232, an effect we can see in the CUTLASS source and corresponding SASS:


![source code](/assets/images/cutlass-ping-pong-gemm-kernel/fg2.png){:style="width:100%"}
@@ -76,13 +76,13 @@ To expand on TMA, or Tensor Memory Accelerator, TMA is a hardware component intr

## CUTLASS Asynchronous Pipeline Class

- This signaling between producers and consumers is coordinated via the new Asynchronous Pipeline Class which Cutlass describes as follows:
+ This signaling between producers and consumers is coordinated via the new Asynchronous Pipeline Class which CUTLASS describes as follows:

“Implementing a persistent GEMM algorithm calls for managing dozens of different kinds of asynchronously executing operations that synchronize using multiple barriers organized as a circular list.

This complexity is too much for human programmers to manage by hand.

- As a result, we have developed [[Cutlass Pipeline Async Class](https://l.workplace.com/l.php?u=https%3A%2F%2Fgithub.com%2FNVIDIA%2Fcutlass%2Fblob%2Fmain%2Finclude%2Fcutlass%2Fpipeline%2Fsm90_pipeline.hpp&h=AT0Qy69t9mn_9VGkJlf1TkC_yCVPAQbYzHtS9it0ZVxTxVasGZfb6u-VHKReULm29NsLhp3DtuRfN4BHnzczniArsCFe8Uzj7izIx646Otyl4lEwl9jUHDhTcUq87KfS919MkadFMjq5i4qtkbe7QbgZEMbhFi0ARgvz3-u7_X0Hf3kHwQ&__tn__=-UK-R&c[0]=AT2Wep-mQJcJ7w2cBPcqoNcO9gLYx7_Qg9TGIcfKPSoo8kGdDtl70vKog1VICaOX45DhNP-Eu6pUbUl9TxGeGLQHgzyXWuxAgDQrdlOhhiOC3QRDMckh2vCi8RADkSCainRbZ5JoF7CERyij7CrhsSskOfVqQ_fvN-lKG6W2_TkvMFLe8UbKNPkzSqjzfdo)]…”
+ As a result, we have developed [[CUTLASS Pipeline Async Class](https://l.workplace.com/l.php?u=https%3A%2F%2Fgithub.com%2FNVIDIA%2Fcutlass%2Fblob%2Fmain%2Finclude%2Fcutlass%2Fpipeline%2Fsm90_pipeline.hpp&h=AT0Qy69t9mn_9VGkJlf1TkC_yCVPAQbYzHtS9it0ZVxTxVasGZfb6u-VHKReULm29NsLhp3DtuRfN4BHnzczniArsCFe8Uzj7izIx646Otyl4lEwl9jUHDhTcUq87KfS919MkadFMjq5i4qtkbe7QbgZEMbhFi0ARgvz3-u7_X0Hf3kHwQ&__tn__=-UK-R&c[0]=AT2Wep-mQJcJ7w2cBPcqoNcO9gLYx7_Qg9TGIcfKPSoo8kGdDtl70vKog1VICaOX45DhNP-Eu6pUbUl9TxGeGLQHgzyXWuxAgDQrdlOhhiOC3QRDMckh2vCi8RADkSCainRbZ5JoF7CERyij7CrhsSskOfVqQ_fvN-lKG6W2_TkvMFLe8UbKNPkzSqjzfdo)]…”
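
As a rough CPU-side analogy (ours, not CUTLASS code), the producer/consumer signaling described above behaves like a small ring of buffer slots guarded by per-slot "empty" and "full" barriers. The sketch below uses Python threads and semaphores in place of TMA copies, tensor-core math, and pipeline barriers.

```python
# CPU-side analogy of an asynchronous producer/consumer pipeline over a
# circular list of buffer slots. A producer thread fills slots and a consumer
# thread drains them; the semaphores stand in for the per-stage "buffer full"
# and "buffer empty" barriers described in the post. Sizes are illustrative.
import threading

NUM_STAGES = 4   # ring of buffer slots ("pipeline stages")
NUM_TILES = 16   # units of work the producer publishes

slots = [None] * NUM_STAGES
empty = [threading.Semaphore(1) for _ in range(NUM_STAGES)]  # slot free to be filled
full = [threading.Semaphore(0) for _ in range(NUM_STAGES)]   # slot ready to be consumed

def producer():
    for tile in range(NUM_TILES):
        stage = tile % NUM_STAGES
        empty[stage].acquire()         # wait until the consumer has released this slot
        slots[stage] = f"tile-{tile}"  # stands in for a TMA copy into shared memory
        full[stage].release()          # signal "buffer full" to the consumer

def consumer():
    results = []
    for tile in range(NUM_TILES):
        stage = tile % NUM_STAGES
        full[stage].acquire()          # wait for data before "doing the math"
        results.append(slots[stage])   # stands in for MMA work plus epilogue
        empty[stage].release()         # signal "buffer empty" back to the producer
    print(results)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
```

In the real kernel the slots are shared-memory stages, the producer side is the TMA warp group, and the consumer warp groups wait on the corresponding pipeline barrier before issuing tensor core instructions.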

## Barriers and synchronization within the Ping-Pong async pipeline

@@ -182,15 +182,15 @@ And translating that into a relative speedup chart of Ping-Pong vs cuBLAS and Tr

**Figure 5, above: Relative speedup of Ping-Pong vs the two closest kernels.**

- The full source code for the Ping-Pong kernel is here (619 lines of deeply templated Cutlass code, or to paraphrase the famous turtle meme - "it's templates...all the way down! ):
+ The full source code for the Ping-Pong kernel is here (619 lines of deeply templated CUTLASS code, or to paraphrase the famous turtle meme - "it's templates...all the way down! ):

- [https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp](https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp)

In addition, we have implemented PingPong as a CPP extension to make it easy to integrate into PyTorch here (along with a simple test script showing its usage):

- [https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/cutlass_gemm](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/cutlass_gemm)
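
For readers who want to experiment, below is a hedged sketch of how a CUTLASS GEMM CPP extension is typically built and called from PyTorch. The source file names and the `mm` entry point are hypothetical placeholders, not the actual API of the repository above, and the kernel discussed in the post targets FP8 inputs rather than the bf16 used here for simplicity.

```python
# Hypothetical wiring of a CUTLASS GEMM CPP extension into PyTorch via
# torch.utils.cpp_extension.load, which JIT-compiles and imports the module.
# File names and the `mm` entry point are placeholders, not the repo's API.
import torch
from torch.utils.cpp_extension import load

cutlass_gemm = load(
    name="cutlass_gemm",
    sources=["cutlass_gemm.cpp", "cutlass_gemm_kernel.cu"],  # placeholder paths
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

c = cutlass_gemm.mm(a, b)  # placeholder entry point exposed by the extension
ref = a @ b                # sanity check against PyTorch's built-in matmul
print(torch.allclose(c.float(), ref.float(), rtol=1e-2, atol=1e-1))
```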

- Finally, for continued learning, Nvidia has two GTC videos that dive into kernel design with Cutlass:
+ Finally, for continued learning, Nvidia has two GTC videos that dive into kernel design with CUTLASS:

- [Developing Optimal CUDA Kernels on Hopper Tensor Cores \| GTC Digital Spring 2023 \| NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51413/)
- [CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores \| GTC 24 2024 \| NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/)
