
Commit
[HGEMM] Update NVIDIA L20/4090 Perf plots (#126)
DefTruth authored Nov 8, 2024
1 parent a66cc2f commit b225a14
Showing 4 changed files with 10 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@

Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA and MMA)` implemented in this repo can achieve approximately `95%~98%` of its performance. Please check [hgemm benchmark](./hgemm) for more details.
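For context, a minimal sketch of how such a cuBLAS FP16 baseline with `CUBLAS_GEMM_DEFAULT_TENSOR_OP` is typically invoked (the function name and setup here are illustrative assumptions, not the repo's exact benchmark code):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A @ B for row-major FP16 matrices, using the Tensor Core algo hint.
// Sketch only; assumes handle, A, B, C are already created/allocated on device.
void cublas_hgemm_baseline(cublasHandle_t handle,
                           const half *A, const half *B, half *C,
                           int M, int N, int K) {
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
  // cuBLAS is column-major; swapping A/B and M/N yields row-major C = A @ B.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               N, M, K,
               &alpha,
               B, CUDA_R_16F, N,
               A, CUDA_R_16F, K,
               &beta,
               C, CUDA_R_16F, N,
               CUBLAS_COMPUTE_16F,
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```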

![L20](https://github.com/user-attachments/assets/a0039200-cd9e-4ae6-be13-422fff75dd2b)
![L20](./hgemm/NVIDIA_L20.png)

<!---
![4090](https://github.com/user-attachments/assets/c7d65fe5-9fb9-49a8-b962-a6c09bcc030a)
Binary file added hgemm/NVIDIA_GeForce_RTX_4090.png
Binary file added hgemm/NVIDIA_L20.png
9 changes: 9 additions & 0 deletions hgemm/README.md
@@ -72,7 +72,12 @@ python3 hgemm.py --mma-all --plot --topk 8

<div id="NV-L20"></div>


<!---
![L20](https://github.com/user-attachments/assets/a0039200-cd9e-4ae6-be13-422fff75dd2b)
--->

![L20](./NVIDIA_L20.png)

- WMMA: Up to 113.83 TFLOPS, 113.83/119.5=95.25% TFLOPS utilization, 113.83/116.25=97.91% cuBLAS performance.
- MMA: Up to 115.12 TFLOPS, 115.12/119.5=96.33% TFLOPS utilization, 115.12/116.25=99.03% cuBLAS performance.
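The percentages in the bullets above are simple ratios of measured throughput against the L20's 119.5 TFLOPS FP16 Tensor Core peak and the measured 116.25 TFLOPS cuBLAS baseline, e.g.:

```python
# Reproduce the L20 utilization ratios quoted above.
PEAK_TFLOPS = 119.5     # NVIDIA L20 FP16 Tensor Core peak
CUBLAS_TFLOPS = 116.25  # measured cuBLAS throughput on L20

def utilization(tflops: float, baseline: float) -> float:
    """Throughput as a percentage of a baseline, rounded to 2 decimals."""
    return round(tflops / baseline * 100, 2)

wmma_vs_peak = utilization(113.83, PEAK_TFLOPS)       # WMMA vs. peak
mma_vs_peak = utilization(115.12, PEAK_TFLOPS)        # MMA vs. peak
mma_vs_cublas = utilization(115.12, CUBLAS_TFLOPS)    # MMA vs. cuBLAS
```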
@@ -105,7 +110,11 @@ python3 hgemm.py --mma-all --wmma-all --cuda-all
### NVIDIA GeForce RTX 4090
On the NVIDIA RTX 4090 (330 TFLOPS of FP16 Tensor Core throughput), WMMA (m16n16k16) performs better than MMA (m16n8k16). For most MNK shapes, the implementations in this repo reach 95%~99% of cuBLAS performance, and some cases exceed cuBLAS. For this repo's implementations on the RTX 4090, WMMA is the better choice for large GEMMs (MNK >= 8192), while MMA is better for small GEMMs.

<!---
![4090](https://github.com/user-attachments/assets/c7d65fe5-9fb9-49a8-b962-a6c09bcc030a)
--->

![4090](./NVIDIA_GeForce_RTX_4090.png)
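The core of the WMMA (m16n16k16) path being compared here can be sketched as a naive one-warp-per-tile kernel; this is illustrative only, and the repo's actual kernels add shared-memory tiling, swizzling, and pipelining on top of this loop:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Naive WMMA m16n16k16 HGEMM sketch: one warp computes one 16x16 tile of C.
// Launch with 32 threads per block and a (N/16, M/16) grid; assumes
// M, N, K are multiples of 16 and A, B, C are row-major FP16.
__global__ void hgemm_wmma_naive(const half *A, const half *B, half *C,
                                 int M, int N, int K) {
  const int tile_row = blockIdx.y * 16;  // top row of this warp's C tile
  const int tile_col = blockIdx.x * 16;  // left column of this warp's C tile

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc_frag;
  wmma::fill_fragment(acc_frag, __float2half(0.0f));

  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_row * K + k, K);  // 16x16 A tile
    wmma::load_matrix_sync(b_frag, B + k * N + tile_col, N);  // 16x16 B tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // Tensor Core MAC
  }
  wmma::store_matrix_sync(C + tile_row * N + tile_col, acc_frag, N,
                          wmma::mem_row_major);
}
```

The MMA (m16n8k16) path uses the finer-grained `mma` PTX instruction instead, which trades a smaller tile shape for more scheduling flexibility.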

```bash
----------------------------------------------------------------------------------------------------------------------------------
```
