diff --git a/.github/workflows/issue.yml b/.github/workflows/issue.yml
index 0f662e74..47fd13b6 100644
--- a/.github/workflows/issue.yml
+++ b/.github/workflows/issue.yml
@@ -19,4 +19,4 @@ jobs:
       close-issue-message: "This issue was closed because it has been inactive for 7 days since being marked as stale."
       days-before-pr-stale: -1
       days-before-pr-close: -1
-      repo-token: ${{ secrets.GITHUB_TOKEN }}
+      repo-token: ${{ secrets.GITHUB_TOKEN }}
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 4b4ec220..22d4dcae 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,3 +16,5 @@ __pycache__
 *.ncu*
 *.sqlite*
 *.engine
+*.bin
+output
diff --git a/.gitmodules b/.gitmodules
index d16e9335..88481846 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,4 @@
 [submodule "third-party/cutlass"]
 	path = third-party/cutlass
 	url = https://github.com/NVIDIA/cutlass.git
+
diff --git a/LICENSE b/LICENSE
index e72bfdda..f288702d 100644
--- a/LICENSE
+++ b/LICENSE
@@ -671,4 +671,4 @@ into proprietary programs.  If your program is a subroutine library, you may
 consider it more useful to permit linking proprietary applications with
 the library.  If this is what you want to do, use the GNU Lesser General
 Public License instead of this License.  But first, please read
-.
\ No newline at end of file
+.
diff --git a/README.md b/README.md
index f7801238..457ba38b 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@
-📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖toy-hgemm library🔥🔥](./hgemm) which can achieve the performance of **cuBLAS**, check [📖HGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
+📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖toy-hgemm library🔥🔥](./kernels/hgemm) which can achieve the performance of **cuBLAS**, check [📖HGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
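Aside from the diff itself: the cuBLAS baseline that the HGEMM benchmarks compare against corresponds to calling `cublasGemmEx` with the `CUBLAS_GEMM_DEFAULT_TENSOR_OP` algorithm. A minimal host-side sketch (illustrative, not the repo's exact benchmark code; the helper name `hgemm_cublas` is made up here, and error checking is elided):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical helper: C = A * B in FP16 on Tensor Cores, using the
// default Tensor Core algorithm the README benchmarks against.
// A[M,K], B[K,N], C[M,N] are device pointers, row-major.
void hgemm_cublas(cublasHandle_t handle, const half *A, const half *B,
                  half *C, int M, int N, int K) {
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
  // cuBLAS is column-major; computing B * A with swapped dims yields
  // a row-major C = A * B without explicit transposes.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
               &alpha, B, CUDA_R_16F, N, A, CUDA_R_16F, K,
               &beta, C, CUDA_R_16F, N,
               CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Timing this call (after a warm-up) is the standard way to obtain the `orange`🟠 reference numbers mentioned above.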
@@ -25,7 +25,7 @@ -Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./hgemm) for more details. +Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details. |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread| |:---:|:---:|:---:|:---:| @@ -92,147 +92,147 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d |📖 cuda kernel| 📖 elem dtype| 📖 acc dtype| 📖 docs | 📖 level | |:---|:---|:---|:---|:---| -| ✔️ [nsys/ncu(timeline/ptx/sass)](./nvidia-nsight/)|/|/|[link](./nvidia-nsight/)|⭐️| -| ✔️ [elementwise_f32](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️| -| ✔️ [elementwise_f32x4](./elementwise/elementwise.cu)|f32|/|[link](./elementwise/)|⭐️| -| ✔️ [elementwise_f16](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️| -| ✔️ [elementwise_f16x2](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️| -| ✔️ [elementwise_f16x8](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️| -| ✔️ [elementwise_f16x8_pack](./elementwise/elementwise.cu)|f16|/|[link](./elementwise/)|⭐️⭐️| -| ✔️ [histogram_i32](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️| -| ✔️ [histogram_i32x4](./histogram/histogram.cu)|i32|/|[link](./histogram/)|⭐️| -| ✔️ [sigmoid_f32](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️| -| ✔️ [sigmoid_f32x4](./sigmoid/sigmoid.cu)|f32|/|[link](./sigmoid/)|⭐️| -| ✔️ [sigmoid_f16](./sigmoid/sigmoid.cu)|16|/|[link](./sigmoid/)|⭐️| -| ✔️ 
[sigmoid_f16x2](./sigmoid/sigmoid.cu)|f16|/|[link](./sigmoid/)|⭐️| -| ✔️ [sigmoid_f16x8](./sigmoid/sigmoid.cu)|f16|/|[link](./sigmoid/)|⭐️| -| ✔️ [sigmoid_f16x8_pack](./sigmoid/sigmoid.cu)|f16|/|[link](./sigmoid/)|⭐️⭐️| -| ✔️ [relu_f32](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️| -| ✔️ [relu_f32x4](./relu/relu.cu)|f32|/|[link](./relu/)|⭐️| -| ✔️ [relu_f16](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️| -| ✔️ [relu_f16x2](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️| -| ✔️ [relu_f16x8](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️| -| ✔️ [relu_f16x8_pack](./relu/relu.cu)|f16|/|[link](./relu/)|⭐️⭐️| -| ✔️ [gelu_f32](./gelu/gelu.cu)|f32|/|[link](./gelu/)|⭐️| -| ✔️ [gelu_f32x4](./gelu/gelu.cu)|f32|/|[link](./gelu/)|⭐️| -| ✔️ [gelu_f16](./gelu/gelu.cu)|f16|/|[link](./gelu/)|⭐️| -| ✔️ [gelu_f16x2](./gelu/gelu.cu)|f16|/|[link](./gelu/)|⭐️| -| ✔️ [gelu_f16x8](./gelu/gelu.cu)|f16|/|[link](./gelu/)|⭐️| -| ✔️ [gelu_f16x8_pack](./gelu/gelu.cu)|f16|/|[link](./gelu/)|⭐️⭐️| -| ✔️ [swish_f32](./swish/swish.cu)|f32|/|[link](./swish/)|⭐️| -| ✔️ [swish_f32x4](./swish/swish.cu)|f32|/|[link](./swish/)|⭐️| -| ✔️ [swish_f16](./swish/swish.cu)|f16|/|[link](./swish/)|⭐️| -| ✔️ [swish_f16x2](./swish/swish.cu)|f16|/|[link](./swish/)|⭐️| -| ✔️ [swish_f16x8](./swish/swish.cu)|f16|/|[link](./swish/)|⭐️| -| ✔️ [swish_f16x8_pack](./swish/swish.cu)|f16|/|[link](./swish/)|⭐️⭐️| -| ✔️ [embedding_f32](./embedding/embedding.cu)|f32|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f32x4](./embedding/embedding.cu)|f32|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f32x4_pack](./embedding/embedding.cu)|f32|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f16](./embedding/embedding.cu)|f16|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f16x2](./embedding/embedding.cu)|f16|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f16x8](./embedding/embedding.cu)|f16|/|[link](./embedding/)|⭐️| -| ✔️ [embedding_f16x8_pack](./embedding/embedding.cu)|f16|/|[link](./embedding/)|⭐️⭐️| -| ✔️ 
[mat_trans_f32_col2row{2d}](./mat-transpose/mat_transpose.cu)|f32|/|[link](./mat-transpose/)|⭐️| -| ✔️ [mat_trans_f32_row2col{2d}](./mat-transpose/mat_transpose.cu)|f32|/|[link](./mat-transpose/)|⭐️| -| ✔️ [mat_trans_f32_diagonal2d](./mat-transpose/mat_transpose.cu)|f32|/|[link](./mat-transpose/)|⭐️⭐️| -| ✔️ [mat_trans_f32x4_col2row{2d}](./mat-transpose/mat_transpose.cu)|f32|/|[link](./mat-transpose/)|⭐️⭐️| -| ✔️ [mat_trans_f32x4_row2col{2d}](./mat-transpose/mat_transpose.cu)|f32|/|[link](./mat-transpose/)|⭐️⭐️| -| ✔️ [warp_reduce_[all]](./reduce/block_all_reduce.cu)|all|all|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f32_f32](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f32x4_f32](./reduce/block_all_reduce.cu)|f32|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16_f16](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16_f32](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16x2_f16](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16x2_f32](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16x8_pack_f16](./reduce/block_all_reduce.cu)|f16|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_f16x8_pack_f32](./reduce/block_all_reduce.cu)|f16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16_bf16](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16_f32](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16x2_bf16](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16x2_f32](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16x8_pack_bf16](./reduce/block_all_reduce.cu)|bf16|bf16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_bf16x8_pack_f32](./reduce/block_all_reduce.cu)|bf16|f32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_fp8_e4m3_f16](./reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ 
[reduce_fp8_e5m2_f16](./reduce/block_all_reduce.cu)|fp8_e5m2|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_fp8_e4m3x16_pack_f16](./reduce/reduce.cu)|fp8_e4m3|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_fp8_e5m2x16_pack_f16](./reduce/reduce.cu)|fp8_e5m2|f16|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_i8_i32](./reduce/block_all_reduce.cu)|i8|i32|[link](./reduce/)|⭐️⭐️| -| ✔️ [reduce_i8x16_pack_i32](./reduce/block_all_reduce.cu)|i8|i32|[link](./reduce/)|⭐️⭐️| -| ✔️ [dot_product_f32](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️| -| ✔️ [dot_product_f32x4](./dot-product/dot_product.cu)|f32|f32|[link](./dot-product/)|⭐️⭐️| -| ✔️ [dot_product_f16_f32](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️| -| ✔️ [dot_product_f16x2_f32](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️| -| ✔️ [dot_product_f16x8_pack_f32](./dot-product/dot_product.cu)|f16|f32|[link](./dot-product/)|⭐️⭐️| -| ✔️ [softmax_f32(fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [softmax_f32x4(fence)](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [softmax_f32](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [softmax_f32x4](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [safe_softmax_f32](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [safe_softmax_f32x4](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [safe_softmax_f16_f32](./softmax/softmax.cu)|f16|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [safe_softmax_f16x2_f32](./softmax/softmax.cu)|f16|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [safe_softmax_f16x8_pack_f32](./softmax/softmax.cu)|f16|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [online_safe_softmax_f32](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [online_safe_softmax_f32x4_pack](./softmax/softmax.cu)|f32|f32|[link](./softmax/)|⭐️⭐️| -| ✔️ [rope_f32](./rope/rope.cu)|f32|f32|[link](./rope/)|⭐️⭐️| -| ✔️ [rope_f32x4_pack](./rope/rope.cu)|f32|f32|[link](./rope/)|⭐️⭐️| -| ✔️ 
[layer_norm_f32](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f32x4](./layer-norm/layer_norm.cu)|f32|f32|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16_f16](./layer-norm/layer_norm.cu)|f16|f16|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16x2_f16](./layer-norm/layer_norm.cu)|f16|f16|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16x8_f16](./layer-norm/layer_norm.cu)|f16|f16|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16x8_pack_f16](./layer-norm/layer_norm.cu)|f16|f16|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16x8_pack_f32](./layer-norm/layer_norm.cu)|f16|f32|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [layer_norm_f16_f32](./layer-norm/layer_norm.cu)|f16|f32|[link](./layer-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f32](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f32x4](./rms-norm/rms_norm.cu)|f32|f32|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16_f16](./rms-norm/rms_norm.cu)|f16|f16|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16x2_f16](./rms-norm/rms_norm.cu)|f16|f16|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16x8_f16](./rms-norm/rms_norm.cu)|f16|f16|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16x8_f32](./rms-norm/rms_norm.cu)|f16|f32|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16x8_pack_f16](./rms-norm/rms_norm.cu)|f16|f16|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16x8_pack_f32](./rms-norm/rms_norm.cu)|f16|f32|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [rms_norm_f16_f32](./rms-norm/rms_norm.cu)|f16|f32|[link](./rms-norm/)|⭐️⭐️| -| ✔️ [sgemm_naive_f32](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️| -| ✔️ [sgemm_sliced_k_f32](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_t_8x8_sliced_k_f32x4](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_t_8x8_sliced_k...bcf](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_t_8x8_sliced_k...dbuf](./sgemm/sgemm.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_t_8x8_sliced_k16...dbuf](./sgemm/sgemm_async.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| 
-| ✔️ [sgemm_t_8x8_sliced_k16...async](./sgemm/sgemm_async.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_wmma_m16n16k8...stages*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [sgemm_wmma_m16n16k8...swizzle*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_naive_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️| -| ✔️ [hgemm_sliced_k_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8x8_sliced_k_f16x4](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ 
[hgemm_mma_m16n8k16...stages*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️| -| ✔️ [hgemm_mma_cublas*](./hgemm/cublas/hgemm_cublas.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️| -| ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️| -| ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️| -| ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️| -| ✔️ [hgemv_k32_f16](./hgemv/hgemv.cu)|f16|f16|[link](./hgemv/)|⭐️⭐️⭐️| -| ✔️ [hgemv_k128_f16x4](./hgemv/hgemv.cu)|f16|f16|[link](./hgemv/)|⭐️⭐️⭐️| -| ✔️ [hgemv_k16_f16](./hgemv/hgemv.cu)|f16|f16|[link](./hgemv/)|⭐️⭐️⭐️| -| ✔️ [flash_attn_f32](./flash-attn/flash_attn.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️| -| ✔️ [flash_attn_mma_m16n8k16*](./flash-attn/flash_attn_mma.cu)|f16|f16|[link](./flash-attn)|⭐️⭐️⭐️| -| ✔️ [nms_f32](./nms/nms.cu)|f32|/|[link](./nms)|⭐️⭐️| -| ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|⭐️| +| ✔️ [nsys/ncu(timeline/ptx/sass)](./kernels/nvidia-nsight/)|/|/|[link](./kernels/nvidia-nsight/)|⭐️| +| ✔️ [elementwise_f32](./kernels/elementwise/elementwise.cu)|f32|/|[link](./kernels/elementwise/)|⭐️| +| ✔️ [elementwise_f32x4](./kernels/elementwise/elementwise.cu)|f32|/|[link](./kernels/elementwise/)|⭐️| +| ✔️ [elementwise_f16](./kernels/elementwise/elementwise.cu)|f16|/|[link](./kernels/elementwise/)|⭐️| +| ✔️ [elementwise_f16x2](./kernels/elementwise/elementwise.cu)|f16|/|[link](./kernels/elementwise/)|⭐️| +| ✔️ [elementwise_f16x8](./kernels/elementwise/elementwise.cu)|f16|/|[link](./kernels/elementwise/)|⭐️| +| ✔️ [elementwise_f16x8_pack](./kernels/elementwise/elementwise.cu)|f16|/|[link](./kernels/elementwise/)|⭐️⭐️| +| ✔️ [histogram_i32](./kernels/histogram/histogram.cu)|i32|/|[link](./kernels/histogram/)|⭐️| +| ✔️ 
[histogram_i32x4](./kernels/histogram/histogram.cu)|i32|/|[link](./kernels/histogram/)|⭐️| +| ✔️ [sigmoid_f32](./kernels/sigmoid/sigmoid.cu)|f32|/|[link](./kernels/sigmoid/)|⭐️| +| ✔️ [sigmoid_f32x4](./kernels/sigmoid/sigmoid.cu)|f32|/|[link](./kernels/sigmoid/)|⭐️| +| ✔️ [sigmoid_f16](./kernels/sigmoid/sigmoid.cu)|16|/|[link](./kernels/sigmoid/)|⭐️| +| ✔️ [sigmoid_f16x2](./kernels/sigmoid/sigmoid.cu)|f16|/|[link](./kernels/sigmoid/)|⭐️| +| ✔️ [sigmoid_f16x8](./kernels/sigmoid/sigmoid.cu)|f16|/|[link](./kernels/sigmoid/)|⭐️| +| ✔️ [sigmoid_f16x8_pack](./kernels/sigmoid/sigmoid.cu)|f16|/|[link](./kernels/sigmoid/)|⭐️⭐️| +| ✔️ [relu_f32](./kernels/relu/relu.cu)|f32|/|[link](./kernels/relu/)|⭐️| +| ✔️ [relu_f32x4](./kernels/relu/relu.cu)|f32|/|[link](./kernels/relu/)|⭐️| +| ✔️ [relu_f16](./kernels/relu/relu.cu)|f16|/|[link](./kernels/relu/)|⭐️| +| ✔️ [relu_f16x2](./kernels/relu/relu.cu)|f16|/|[link](./kernels/relu/)|⭐️| +| ✔️ [relu_f16x8](./kernels/relu/relu.cu)|f16|/|[link](./kernels/relu/)|⭐️| +| ✔️ [relu_f16x8_pack](./kernels/relu/relu.cu)|f16|/|[link](./kernels/relu/)|⭐️⭐️| +| ✔️ [gelu_f32](./kernels/gelu/gelu.cu)|f32|/|[link](./kernels/gelu/)|⭐️| +| ✔️ [gelu_f32x4](./kernels/gelu/gelu.cu)|f32|/|[link](./kernels/gelu/)|⭐️| +| ✔️ [gelu_f16](./kernels/gelu/gelu.cu)|f16|/|[link](./kernels/gelu/)|⭐️| +| ✔️ [gelu_f16x2](./kernels/gelu/gelu.cu)|f16|/|[link](./kernels/gelu/)|⭐️| +| ✔️ [gelu_f16x8](./kernels/gelu/gelu.cu)|f16|/|[link](./kernels/gelu/)|⭐️| +| ✔️ [gelu_f16x8_pack](./kernels/gelu/gelu.cu)|f16|/|[link](./kernels/gelu/)|⭐️⭐️| +| ✔️ [swish_f32](./kernels/swish/swish.cu)|f32|/|[link](./kernels/swish/)|⭐️| +| ✔️ [swish_f32x4](./kernels/swish/swish.cu)|f32|/|[link](./kernels/swish/)|⭐️| +| ✔️ [swish_f16](./kernels/swish/swish.cu)|f16|/|[link](./kernels/swish/)|⭐️| +| ✔️ [swish_f16x2](./kernels/swish/swish.cu)|f16|/|[link](./kernels/swish/)|⭐️| +| ✔️ [swish_f16x8](./kernels/swish/swish.cu)|f16|/|[link](./kernels/swish/)|⭐️| +| ✔️ 
[swish_f16x8_pack](./kernels/swish/swish.cu)|f16|/|[link](./kernels/swish/)|⭐️⭐️| +| ✔️ [embedding_f32](./kernels/embedding/embedding.cu)|f32|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f32x4](./kernels/embedding/embedding.cu)|f32|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f32x4_pack](./kernels/embedding/embedding.cu)|f32|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f16](./kernels/embedding/embedding.cu)|f16|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f16x2](./kernels/embedding/embedding.cu)|f16|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f16x8](./kernels/embedding/embedding.cu)|f16|/|[link](./kernels/embedding/)|⭐️| +| ✔️ [embedding_f16x8_pack](./kernels/embedding/embedding.cu)|f16|/|[link](./kernels/embedding/)|⭐️⭐️| +| ✔️ [mat_trans_f32_col2row{2d}](./kernels/mat-transpose/mat_transpose.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️| +| ✔️ [mat_trans_f32_row2col{2d}](./kernels/mat-transpose/mat_transpose.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️| +| ✔️ [mat_trans_f32_diagonal2d](./kernels/mat-transpose/mat_transpose.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️| +| ✔️ [mat_trans_f32x4_col2row{2d}](./kernels/mat-transpose/mat_transpose.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️| +| ✔️ [mat_trans_f32x4_row2col{2d}](./kernels/mat-transpose/mat_transpose.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️| +| ✔️ [warp_reduce_[all]](./kernels/reduce/block_all_reduce.cu)|all|all|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f32_f32](./kernels/reduce/block_all_reduce.cu)|f32|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f32x4_f32](./kernels/reduce/block_all_reduce.cu)|f32|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f16_f16](./kernels/reduce/block_all_reduce.cu)|f16|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f16_f32](./kernels/reduce/block_all_reduce.cu)|f16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f16x2_f16](./kernels/reduce/block_all_reduce.cu)|f16|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ 
[reduce_f16x2_f32](./kernels/reduce/block_all_reduce.cu)|f16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f16x8_pack_f16](./kernels/reduce/block_all_reduce.cu)|f16|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_f16x8_pack_f32](./kernels/reduce/block_all_reduce.cu)|f16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16_bf16](./kernels/reduce/block_all_reduce.cu)|bf16|bf16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16_f32](./kernels/reduce/block_all_reduce.cu)|bf16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16x2_bf16](./kernels/reduce/block_all_reduce.cu)|bf16|bf16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16x2_f32](./kernels/reduce/block_all_reduce.cu)|bf16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16x8_pack_bf16](./kernels/reduce/block_all_reduce.cu)|bf16|bf16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_bf16x8_pack_f32](./kernels/reduce/block_all_reduce.cu)|bf16|f32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_fp8_e4m3_f16](./kernels/reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_fp8_e5m2_f16](./kernels/reduce/block_all_reduce.cu)|fp8_e5m2|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_fp8_e4m3x16_pack_f16](./kernels/reduce/block_all_reduce.cu)|fp8_e4m3|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_fp8_e5m2x16_pack_f16](./kernels/reduce/block_all_reduce.cu)|fp8_e5m2|f16|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_i8_i32](./kernels/reduce/block_all_reduce.cu)|i8|i32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [reduce_i8x16_pack_i32](./kernels/reduce/block_all_reduce.cu)|i8|i32|[link](./kernels/reduce/)|⭐️⭐️| +| ✔️ [dot_product_f32](./kernels/dot-product/dot_product.cu)|f32|f32|[link](./kernels/dot-product/)|⭐️⭐️| +| ✔️ [dot_product_f32x4](./kernels/dot-product/dot_product.cu)|f32|f32|[link](./kernels/dot-product/)|⭐️⭐️| +| ✔️ [dot_product_f16_f32](./kernels/dot-product/dot_product.cu)|f16|f32|[link](./kernels/dot-product/)|⭐️⭐️| +| ✔️ 
[dot_product_f16x2_f32](./kernels/dot-product/dot_product.cu)|f16|f32|[link](./kernels/dot-product/)|⭐️⭐️| +| ✔️ [dot_product_f16x8_pack_f32](./kernels/dot-product/dot_product.cu)|f16|f32|[link](./kernels/dot-product/)|⭐️⭐️| +| ✔️ [softmax_f32(fence)](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [softmax_f32x4(fence)](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [softmax_f32](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [softmax_f32x4](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [safe_softmax_f32](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [safe_softmax_f32x4](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [safe_softmax_f16_f32](./kernels/softmax/softmax.cu)|f16|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [safe_softmax_f16x2_f32](./kernels/softmax/softmax.cu)|f16|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [safe_softmax_f16x8_pack_f32](./kernels/softmax/softmax.cu)|f16|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [online_safe_softmax_f32](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [online_safe_softmax_f32x4_pack](./kernels/softmax/softmax.cu)|f32|f32|[link](./kernels/softmax/)|⭐️⭐️| +| ✔️ [rope_f32](./kernels/rope/rope.cu)|f32|f32|[link](./kernels/rope/)|⭐️⭐️| +| ✔️ [rope_f32x4_pack](./kernels/rope/rope.cu)|f32|f32|[link](./kernels/rope/)|⭐️⭐️| +| ✔️ [layer_norm_f32](./kernels/layer-norm/layer_norm.cu)|f32|f32|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f32x4](./kernels/layer-norm/layer_norm.cu)|f32|f32|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f16_f16](./kernels/layer-norm/layer_norm.cu)|f16|f16|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f16x2_f16](./kernels/layer-norm/layer_norm.cu)|f16|f16|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ 
[layer_norm_f16x8_f16](./kernels/layer-norm/layer_norm.cu)|f16|f16|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f16x8_pack_f16](./kernels/layer-norm/layer_norm.cu)|f16|f16|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f16x8_pack_f32](./kernels/layer-norm/layer_norm.cu)|f16|f32|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [layer_norm_f16_f32](./kernels/layer-norm/layer_norm.cu)|f16|f32|[link](./kernels/layer-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f32](./kernels/rms-norm/rms_norm.cu)|f32|f32|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f32x4](./kernels/rms-norm/rms_norm.cu)|f32|f32|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16_f16](./kernels/rms-norm/rms_norm.cu)|f16|f16|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16x2_f16](./kernels/rms-norm/rms_norm.cu)|f16|f16|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16x8_f16](./kernels/rms-norm/rms_norm.cu)|f16|f16|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16x8_f32](./kernels/rms-norm/rms_norm.cu)|f16|f32|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16x8_pack_f16](./kernels/rms-norm/rms_norm.cu)|f16|f16|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16x8_pack_f32](./kernels/rms-norm/rms_norm.cu)|f16|f32|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [rms_norm_f16_f32](./kernels/rms-norm/rms_norm.cu)|f16|f32|[link](./kernels/rms-norm/)|⭐️⭐️| +| ✔️ [sgemm_naive_f32](./kernels/sgemm/sgemm.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️| +| ✔️ [sgemm_sliced_k_f32](./kernels/sgemm/sgemm.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_t_8x8_sliced_k_f32x4](./kernels/sgemm/sgemm.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_t_8x8_sliced_k...bcf](./kernels/sgemm/sgemm.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_t_8x8_sliced_k...dbuf](./kernels/sgemm/sgemm.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_t_8x8_sliced_k16...dbuf](./kernels/sgemm/sgemm_async.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ 
[sgemm_t_8x8_sliced_k16...async](./kernels/sgemm/sgemm_async.cu)|f32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_wmma_m16n16k8...stages*](./kernels/sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [sgemm_wmma_m16n16k8...swizzle*](./kernels/sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./kernels/sgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_naive_f16](./kernels/hgemm/naive/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️| +| ✔️ [hgemm_sliced_k_f16](./kernels/hgemm/naive/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8x8_sliced_k_f16x4](./kernels/hgemm/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./kernels/hgemm/naive/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./kernels/hgemm/naive/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./kernels/hgemm/naive/hgemm.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./kernels/hgemm/naive/hgemm_async.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_t_8/16x8...k16/32...async](./kernels/hgemm/naive/hgemm_async.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m16n16k16...naive*](./kernels/hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./kernels/hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./kernels/hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./kernels/hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./kernels/hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_wmma_m16n16k16...stages*](./kernels/hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ 
[hgemm_wmma_m16n16k16...swizzle*](./kernels/hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_m16n8k16...naive*](./kernels/hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./kernels/hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_m16n8k16...stages*](./kernels/hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_m16n8k16...swizzle*](./kernels/hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_stages{swizzle}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| +| ✔️ [hgemm_mma_cublas*](./kernels/hgemm/cublas/hgemm_cublas.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️| +| ✔️ [sgemv_k32_f32](./kernels/sgemv/sgemv.cu)|f32|f32|[link](./kernels/sgemv/)|⭐️⭐️⭐️| +| ✔️ [sgemv_k128_f32x4](./kernels/sgemv/sgemv.cu)|f32|f32|[link](./kernels/sgemv/)|⭐️⭐️⭐️| +| ✔️ [sgemv_k16_f32](./kernels/sgemv/sgemv.cu)|f32|f32|[link](./kernels/sgemv/)|⭐️⭐️⭐️| +| ✔️ [hgemv_k32_f16](./kernels/hgemv/hgemv.cu)|f16|f16|[link](./kernels/hgemv/)|⭐️⭐️⭐️| +| ✔️ [hgemv_k128_f16x4](./kernels/hgemv/hgemv.cu)|f16|f16|[link](./kernels/hgemv/)|⭐️⭐️⭐️| +| ✔️ [hgemv_k16_f16](./kernels/hgemv/hgemv.cu)|f16|f16|[link](./kernels/hgemv/)|⭐️⭐️⭐️| +| ✔️ [flash_attn_f32](./kernels/flash-attn/flash_attn.cu)|f32|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️| +| ✔️ [flash_attn_mma_m16n8k16*](./kernels/flash-attn/flash_attn_mma.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️| +| ✔️ [nms_f32](./kernels/nms/nms.cu)|f32|/|[link](./kernels/nms)|⭐️⭐️| +| ✔️ [notes v1(deprecated)](./kernels/notes-v1.cu)|f32|f32|/|⭐️| ## 📖 博客目录 diff --git a/hgemm/tools/clear.sh b/hgemm/tools/clear.sh deleted file mode 100644 index 7fd558bb..00000000 --- a/hgemm/tools/clear.sh +++ /dev/null @@ -1 +0,0 @@ -rm -rf __pycache__ build dist toy_hgemm.egg-info diff --git a/cuda-slides/.gitignore 
b/kernels/cutlass/.gitignore similarity index 100% rename from cuda-slides/.gitignore rename to kernels/cutlass/.gitignore diff --git a/cutlass/cute/.gitignore b/kernels/cutlass/cute/.gitignore similarity index 100% rename from cutlass/cute/.gitignore rename to kernels/cutlass/cute/.gitignore diff --git a/cutlass/cute/vector_add.cu b/kernels/cutlass/cute/vector_add.cu similarity index 100% rename from cutlass/cute/vector_add.cu rename to kernels/cutlass/cute/vector_add.cu diff --git a/cutlass/cutlass-3.x/.gitignore b/kernels/cutlass/cutlass-3.x/.gitignore similarity index 100% rename from cutlass/cutlass-3.x/.gitignore rename to kernels/cutlass/cutlass-3.x/.gitignore diff --git a/dot-product/.gitignore b/kernels/dot-product/.gitignore similarity index 100% rename from dot-product/.gitignore rename to kernels/dot-product/.gitignore diff --git a/dot-product/README.md b/kernels/dot-product/README.md similarity index 100% rename from dot-product/README.md rename to kernels/dot-product/README.md diff --git a/dot-product/dot_product.cu b/kernels/dot-product/dot_product.cu similarity index 100% rename from dot-product/dot_product.cu rename to kernels/dot-product/dot_product.cu diff --git a/dot-product/dot_product.py b/kernels/dot-product/dot_product.py similarity index 100% rename from dot-product/dot_product.py rename to kernels/dot-product/dot_product.py diff --git a/elementwise/.gitignore b/kernels/elementwise/.gitignore similarity index 100% rename from elementwise/.gitignore rename to kernels/elementwise/.gitignore diff --git a/elementwise/README.md b/kernels/elementwise/README.md similarity index 100% rename from elementwise/README.md rename to kernels/elementwise/README.md diff --git a/elementwise/elementwise.cu b/kernels/elementwise/elementwise.cu similarity index 100% rename from elementwise/elementwise.cu rename to kernels/elementwise/elementwise.cu diff --git a/elementwise/elementwise.py b/kernels/elementwise/elementwise.py similarity index 100% rename from 
elementwise/elementwise.py rename to kernels/elementwise/elementwise.py diff --git a/embedding/.gitignore b/kernels/embedding/.gitignore similarity index 100% rename from embedding/.gitignore rename to kernels/embedding/.gitignore diff --git a/embedding/README.md b/kernels/embedding/README.md similarity index 100% rename from embedding/README.md rename to kernels/embedding/README.md diff --git a/embedding/embedding.cu b/kernels/embedding/embedding.cu similarity index 100% rename from embedding/embedding.cu rename to kernels/embedding/embedding.cu diff --git a/embedding/embedding.py b/kernels/embedding/embedding.py similarity index 100% rename from embedding/embedding.py rename to kernels/embedding/embedding.py diff --git a/cutlass/.gitignore b/kernels/flash-attn/.gitignore similarity index 100% rename from cutlass/.gitignore rename to kernels/flash-attn/.gitignore diff --git a/flash-attn/README.md b/kernels/flash-attn/README.md similarity index 100% rename from flash-attn/README.md rename to kernels/flash-attn/README.md diff --git a/flash-attn/flash_attn.cc b/kernels/flash-attn/flash_attn.cc similarity index 100% rename from flash-attn/flash_attn.cc rename to kernels/flash-attn/flash_attn.cc diff --git a/flash-attn/flash_attn.cu b/kernels/flash-attn/flash_attn.cu similarity index 100% rename from flash-attn/flash_attn.cu rename to kernels/flash-attn/flash_attn.cu diff --git a/flash-attn/flash_attn.py b/kernels/flash-attn/flash_attn.py similarity index 100% rename from flash-attn/flash_attn.py rename to kernels/flash-attn/flash_attn.py diff --git a/flash-attn/flash_attn_mma.cu b/kernels/flash-attn/flash_attn_mma.cu similarity index 100% rename from flash-attn/flash_attn_mma.cu rename to kernels/flash-attn/flash_attn_mma.cu diff --git a/gelu/.gitignore b/kernels/gelu/.gitignore similarity index 100% rename from gelu/.gitignore rename to kernels/gelu/.gitignore diff --git a/gelu/README.md b/kernels/gelu/README.md similarity index 100% rename from gelu/README.md rename 
to kernels/gelu/README.md diff --git a/gelu/gelu.cu b/kernels/gelu/gelu.cu similarity index 100% rename from gelu/gelu.cu rename to kernels/gelu/gelu.cu diff --git a/gelu/gelu.py b/kernels/gelu/gelu.py similarity index 100% rename from gelu/gelu.py rename to kernels/gelu/gelu.py diff --git a/hgemm/.gitignore b/kernels/hgemm/.gitignore similarity index 100% rename from hgemm/.gitignore rename to kernels/hgemm/.gitignore diff --git a/hgemm/README.md b/kernels/hgemm/README.md similarity index 72% rename from hgemm/README.md rename to kernels/hgemm/README.md index e769d20f..3816e2e2 100755 --- a/hgemm/README.md +++ b/kernels/hgemm/README.md @@ -1,56 +1,65 @@ -# 🔥🔥Toy-HGEMM Library: Achieve the performance of cuBLAS +## 🔥🔥Toy-HGEMM Library: Achieve the performance of cuBLAS |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread| |:---:|:---:|:---:|:---:| |✔️|✔️|✔️|✔️| -|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST(128 bits)**|**SMEM Padding**| +|WMMA(m16n16k16)|MMA(m16n8k16)|Pack LDST(128 bits)|SMEM Padding| |✔️|✔️|✔️|✔️| -|**Copy Async**|**Tile MMA(More Threads)**|**Tile Warp(More Values)**|**Multi Stages**| +|Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages| |✔️|✔️|✔️|✔️| -|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shfl)**| +|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)| |✔️|✔️|✔️|✔️| -|**Row Major(NN)**|**Col Major(TN)**|**SGEMM TF32**|**SMEM Swizzle(CuTe)**| +|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| |✔️|✔️|✔️|✔️| -
- 🔑️ Click to view all supported HGEMM kernels! - -- [X] hgemm_sliced_k_f16_kernel -- [X] hgemm_t_8x8_sliced_k_f16x4_kernel(unpack) -- [X] hgemm_t_8x8_sliced_k_f16x4_pack_kernel(pack 16x4) -- [X] hgemm_t_8x8_sliced_k_f16x4_bcf_kernel(bank conflicts reduce) -- [X] hgemm_t_8x8_sliced_k_f16x4_pack_bcf_kernel(bank conflicts reduce, pack) -- [X] hgemm_t_8x8_sliced_k_f16x8_pack_bcf_kernel(bank conflicts reduce, pack) -- [X] hgemm_t_8x8_sliced_k_f16x8_pack_bcf_dbuf_kernel(bank conflicts reduce, pack, double buffers) -- [X] hgemm_t_8x8_sliced_k16/32_f16x8_pack_bcf_dbuf_kernel(pack, double buffers) -- [X] hgemm_t_8x8_sliced_k16/32_f16x8_pack_bcf_dbuf_async_kernel(pack, double buffers, copy async) -- [X] hgemm_wmma_m16n16k16_naive(WMMA) -- [X] hgemm_wmma_m16n16k16_mma4x2(WMMA, Tile MMA) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp2x4(WMMA, Tile MMA/Warp, pack) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp2x4_async(WMMA, Tile MMA/Warp, Copy Async) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp2x4_async_offset(WMMA, Tile MMA/Warp, Copy Async, Pad) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp2x4_dbuf_async(WMMA, Tile MMA/Warp, Copy Async, Double Buffers, Pad) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp2x4_stages(WMMA, Tile MMA/Warp, Copy Async, Stages, Pad, Block swizzle) -- [X] hgemm_wmma_m16n16k16_mma4x2_warp4x4_stages(WMMA, Tile MMA/Warp, Copy Async, Stages, Pad, Block swizzle) -- [X] hgemm_wmma_m16n16k16_mma4x4_warp4x4_stages(WMMA, Tile MMA/Warp, Copy Async, Stages, Pad, Block swizzle) -- [X] hgemm_wmma_m32n8k16_mma2x4_warp2x4_dbuf_async(WMMA, Tile MMA/Warp, Copy Async, Double Buffers, Pad) -- [X] hgemm_mma_m16n8k16_naive(MMA) -- [X] hgemm_mma_m16n8k16_mma2x4_warp4x4(MMA, Tile MMA/Warp, pack) -- [X] hgemm_mma_m16n8k16_mma2x4_warp4x4_stages(MMA, Tile MMA/Warp, Copy Async, Stages, Pad, Block swizzle) -- [X] hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages(MMA, Tile MMA/Warp, Copy Async, Stages, Pad, Block swizzle, Warp swizzle, Reg Double Buffers, Collective Store with Reg Reuse & Warp Shuffle) -- [X] 
hgemm_mma_stages_block_swizzle_tn_cute(MMA, Tile MMA/Warp, Copy Async, Stages, Block Swizzle, SMEM Swizzle, Collective Store with SMEM) -- [X] PyTorch bindings -
+## 📖 HGEMM CUDA Kernels in Toy-HGEMM Library 🎉🎉 + +```C++ +void hgemm_naive_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_sliced_k_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x4(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x4_pack(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x4_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x4_pack_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x8_pack_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k_f16x8_pack_bcf_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k16_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k16_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k32_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_8x8_sliced_k32_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_16x8_sliced_k32_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_t_16x8_sliced_k32_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_cublas_tensor_op_nn(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_cublas_tensor_op_tn(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_wmma_m16n16k16_naive(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_wmma_m16n16k16_mma4x2(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_wmma_m16n16k16_mma4x2_warp2x4(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_wmma_m16n16k16_mma4x2_warp2x4_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_wmma_m32n8k16_mma2x4_warp2x4_dbuf_async(torch::Tensor a, torch::Tensor b, 
torch::Tensor c); +void hgemm_wmma_m16n16k16_mma4x2_warp2x4_stages(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_wmma_m16n16k16_mma4x2_warp2x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_wmma_m16n16k16_mma4x2_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_wmma_m16n16k16_mma4x4_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_naive(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_mma_m16n8k16_mma2x4_warp4x4(torch::Tensor a, torch::Tensor b, torch::Tensor c); +void hgemm_mma_m16n8k16_mma2x4_warp4x4_stages(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_mma2x4_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_x4(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_rr(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_m16n8k16_mma2x4_warp4x4_stages_dsmem_tn(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +void hgemm_mma_stages_tn_cute(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride); +``` -## Installation -The HGEMM CUDA kernels in this repo can be installed as a python library, toy-hgemm, with the commands below. (optional) +## 📖 Installation +The HGEMM kernels in this repo can be used as a python library (optional) ```bash -git submodule update --init
--recursive --force -bash tools/install.sh # pip uninstall toy-hgemm to uninstall +git submodule update --init --recursive --force # update cutlass, required +python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl # pip uninstall toy-hgemm -y to uninstall ``` -## Test commands +## 📖 Testing **CUTLASS**: update the CUTLASS dependency ```bash @@ -114,7 +123,7 @@ M N K = 16128 16128 16128, Time = 0.07319142 0.07320709 0.07326925 s, A M N K = 16384 16384 16384, Time = 0.07668429 0.07669371 0.07670784 s, AVG Performance = 114.6912 Tflops ``` -## Current performance +## 📖 Current performance ### NVIDIA L20 @@ -132,8 +141,6 @@ M N K = 16384 16384 16384, Time = 0.07668429 0.07669371 0.07670784 s, A ![NVIDIA_L20_NN+TN+v2](https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99) -- WMMA: Up to 113.76 TFLOPS, 113.83/119.5=95.25% TFLOPS utilization, 113.83/116.25=97.91% cuBLAS performance. -- MMA: Up to 115.12 TFLOPS, 115.12/119.5=96.33% TFLOPS utilization, 115.12/116.25=99.03% cuBLAS performance. Full MNK sweep command (tip: benchmarking each MNK shape individually gives more accurate numbers) ```bash @@ -166,7 +173,7 @@ python3 hgemm.py --wmma-all --plot ``` -## Performance optimization notes +## 📖 Performance optimization notes ### PyTorch HGEMM Profile diff --git a/hgemm/bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png b/kernels/hgemm/bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png similarity index 100% rename from hgemm/bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png rename to kernels/hgemm/bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png diff --git a/hgemm/bench/NVIDIA_GeForce_RTX_4090.png b/kernels/hgemm/bench/NVIDIA_GeForce_RTX_4090.png similarity index 100% rename from hgemm/bench/NVIDIA_GeForce_RTX_4090.png rename to kernels/hgemm/bench/NVIDIA_GeForce_RTX_4090.png diff --git a/hgemm/bench/NVIDIA_L20.png b/kernels/hgemm/bench/NVIDIA_L20.png similarity index 100% rename from hgemm/bench/NVIDIA_L20.png rename to kernels/hgemm/bench/NVIDIA_L20.png diff --git a/hgemm/bench/prof.py b/kernels/hgemm/bench/prof.py similarity index 100% rename from hgemm/bench/prof.py rename to kernels/hgemm/bench/prof.py diff --git 
a/hgemm/cublas/hgemm_cublas.cu b/kernels/hgemm/cublas/hgemm_cublas.cu similarity index 100% rename from hgemm/cublas/hgemm_cublas.cu rename to kernels/hgemm/cublas/hgemm_cublas.cu diff --git a/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu b/kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu similarity index 100% rename from hgemm/cutlass/hgemm_mma_stage_tn_cute.cu rename to kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu diff --git a/hgemm/hgemm.py b/kernels/hgemm/hgemm.py similarity index 100% rename from hgemm/hgemm.py rename to kernels/hgemm/hgemm.py diff --git a/hgemm/makefile b/kernels/hgemm/makefile similarity index 78% rename from hgemm/makefile rename to kernels/hgemm/makefile index 97ec2860..d1a6624a 100644 --- a/hgemm/makefile +++ b/kernels/hgemm/makefile @@ -1,4 +1,4 @@ -INCLUDE_DIRS=-I ../ -I ./utils -I ../third-party/cutlass/include -I ../third-party/cutlass/tools/util/include +INCLUDE_DIRS=-I ./utils -I ../../third-party/cutlass/include -I ../../third-party/cutlass/tools/util/include default: nvcc cutlass/hgemm_mma_stage_tn_cute.cu -o hgemm_cute.bin -O2 -arch=sm_89 -std=c++17 $(INCLUDE_DIRS) --expt-relaxed-constexpr -lcublas nvcc cublas/hgemm_cublas.cu -o hgemm_cublas.bin -O2 -arch=sm_89 -std=c++17 $(INCLUDE_DIRS) --expt-relaxed-constexpr -lcublas diff --git a/hgemm/mma/hgemm_mma.cu b/kernels/hgemm/mma/hgemm_mma.cu similarity index 100% rename from hgemm/mma/hgemm_mma.cu rename to kernels/hgemm/mma/hgemm_mma.cu diff --git a/hgemm/mma/hgemm_mma_stage.cu b/kernels/hgemm/mma/hgemm_mma_stage.cu similarity index 100% rename from hgemm/mma/hgemm_mma_stage.cu rename to kernels/hgemm/mma/hgemm_mma_stage.cu diff --git a/hgemm/mma/hgemm_mma_stage_tn.cu b/kernels/hgemm/mma/hgemm_mma_stage_tn.cu similarity index 100% rename from hgemm/mma/hgemm_mma_stage_tn.cu rename to kernels/hgemm/mma/hgemm_mma_stage_tn.cu diff --git a/hgemm/naive/hgemm.cu b/kernels/hgemm/naive/hgemm.cu similarity index 100% rename from hgemm/naive/hgemm.cu rename to kernels/hgemm/naive/hgemm.cu 
diff --git a/hgemm/naive/hgemm_async.cu b/kernels/hgemm/naive/hgemm_async.cu similarity index 100% rename from hgemm/naive/hgemm_async.cu rename to kernels/hgemm/naive/hgemm_async.cu diff --git a/hgemm/pybind/hgemm.cc b/kernels/hgemm/pybind/hgemm.cc similarity index 100% rename from hgemm/pybind/hgemm.cc rename to kernels/hgemm/pybind/hgemm.cc diff --git a/hgemm/setup.py b/kernels/hgemm/setup.py similarity index 95% rename from hgemm/setup.py rename to kernels/hgemm/setup.py index e0f2d381..6572b28d 100644 --- a/hgemm/setup.py +++ b/kernels/hgemm/setup.py @@ -10,7 +10,7 @@ ) from tools.utils import (get_build_sources, get_build_cuda_cflags) -# package name managed by pip, which can be remove by `pip uninstall tiny_pkg` +# package name managed by pip, which can be removed by `pip uninstall toy-hgemm` PACKAGE_NAME = "toy-hgemm" ext_modules = [] diff --git a/kernels/hgemm/tools/clear.sh b/kernels/hgemm/tools/clear.sh new file mode 100644 index 00000000..aaa0d8d5 --- /dev/null +++ b/kernels/hgemm/tools/clear.sh @@ -0,0 +1,5 @@ +set -x + +rm -rf __pycache__ build dist toy_hgemm.egg-info *.bin + +set +x \ No newline at end of file diff --git a/hgemm/tools/install.sh b/kernels/hgemm/tools/install.sh similarity index 71% rename from hgemm/tools/install.sh rename to kernels/hgemm/tools/install.sh index 993dcd18..eeb7491d 100644 --- a/hgemm/tools/install.sh +++ b/kernels/hgemm/tools/install.sh @@ -1,4 +1,8 @@ +set -x + +git submodule update --init --recursive --force python3 -m pip uninstall toy-hgemm -y python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl && cd - rm -rf toy_hgemm.egg-info __pycache__ +set +x \ No newline at end of file diff --git a/hgemm/tools/utils.py b/kernels/hgemm/tools/utils.py similarity index 89% rename from hgemm/tools/utils.py rename to kernels/hgemm/tools/utils.py index 4e4bcdb8..76736147 100644 --- a/hgemm/tools/utils.py +++ b/kernels/hgemm/tools/utils.py @@ -31,8 +31,8 @@ def get_build_sources(): def get_project_dir(): - return 
os.path.dirname( - os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) + return os.path.dirname(os.path.dirname( + os.path.dirname(os.path.dirname(os.path.abspath(__file__))))) def get_build_cuda_cflags(build_pkg: bool = False): @@ -68,7 +68,7 @@ def get_build_cuda_cflags(build_pkg: bool = False): extra_cuda_cflags.append("-Xptxas -v") else: extra_cuda_cflags.append("--ptxas-options=-v") - extra_cuda_cflags.append("--ptxas-options=-O2") + extra_cuda_cflags.append("--ptxas-options=-O3") # extra cuda flags for cute hgemm project_dir = get_project_dir() extra_cuda_cflags.append('-DNO_MMA_HGEMM_BIN') @@ -76,14 +76,14 @@ def get_build_cuda_cflags(build_pkg: bool = False): extra_cuda_cflags.append('-DNO_CUTE_HGEMM_BIN') extra_cuda_cflags.append('-DNO_CUBLAS_HGEMM_BIN') # add cutlass headers and link cublas. - extra_cuda_cflags.append(f'-I {project_dir}') - extra_cuda_cflags.append(f'-I {project_dir}/utils') - extra_cuda_cflags.append(f'-I {project_dir}/naive') - extra_cuda_cflags.append(f'-I {project_dir}/wmma') - extra_cuda_cflags.append(f'-I {project_dir}/mma') - extra_cuda_cflags.append(f'-I {project_dir}/cutlass') - extra_cuda_cflags.append(f'-I {project_dir}/cublas') - extra_cuda_cflags.append(f'-I {project_dir}/pybind') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/utils') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/naive') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/wmma') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/mma') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/cutlass') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/cublas') + extra_cuda_cflags.append(f'-I {project_dir}/kernels/hgemm/pybind') extra_cuda_cflags.append(f'-I {project_dir}/third-party/cutlass/include') extra_cuda_cflags.append(f'-I {project_dir}/third-party/cutlass/tools/util/include') extra_cuda_cflags.append('-lcublas') diff --git 
a/hgemm/utils/utils.h b/kernels/hgemm/utils/utils.h similarity index 100% rename from hgemm/utils/utils.h rename to kernels/hgemm/utils/utils.h diff --git a/hgemm/wmma/hgemm_wmma.cu b/kernels/hgemm/wmma/hgemm_wmma.cu similarity index 100% rename from hgemm/wmma/hgemm_wmma.cu rename to kernels/hgemm/wmma/hgemm_wmma.cu diff --git a/hgemm/wmma/hgemm_wmma_stage.cu b/kernels/hgemm/wmma/hgemm_wmma_stage.cu similarity index 100% rename from hgemm/wmma/hgemm_wmma_stage.cu rename to kernels/hgemm/wmma/hgemm_wmma_stage.cu diff --git a/hgemv/.gitignore b/kernels/hgemv/.gitignore similarity index 100% rename from hgemv/.gitignore rename to kernels/hgemv/.gitignore diff --git a/hgemv/README.md b/kernels/hgemv/README.md similarity index 100% rename from hgemv/README.md rename to kernels/hgemv/README.md diff --git a/hgemv/hgemv.cu b/kernels/hgemv/hgemv.cu similarity index 100% rename from hgemv/hgemv.cu rename to kernels/hgemv/hgemv.cu diff --git a/hgemv/hgemv.py b/kernels/hgemv/hgemv.py similarity index 100% rename from hgemv/hgemv.py rename to kernels/hgemv/hgemv.py diff --git a/histogram/.gitignore b/kernels/histogram/.gitignore similarity index 100% rename from histogram/.gitignore rename to kernels/histogram/.gitignore diff --git a/histogram/README.md b/kernels/histogram/README.md similarity index 100% rename from histogram/README.md rename to kernels/histogram/README.md diff --git a/histogram/histogram.cu b/kernels/histogram/histogram.cu similarity index 100% rename from histogram/histogram.cu rename to kernels/histogram/histogram.cu diff --git a/histogram/histogram.py b/kernels/histogram/histogram.py similarity index 100% rename from histogram/histogram.py rename to kernels/histogram/histogram.py diff --git a/layer-norm/.gitignore b/kernels/layer-norm/.gitignore similarity index 100% rename from layer-norm/.gitignore rename to kernels/layer-norm/.gitignore diff --git a/layer-norm/README.md b/kernels/layer-norm/README.md similarity index 100% rename from 
layer-norm/README.md rename to kernels/layer-norm/README.md diff --git a/layer-norm/layer_norm.cu b/kernels/layer-norm/layer_norm.cu similarity index 100% rename from layer-norm/layer_norm.cu rename to kernels/layer-norm/layer_norm.cu diff --git a/layer-norm/layer_norm.py b/kernels/layer-norm/layer_norm.py similarity index 100% rename from layer-norm/layer_norm.py rename to kernels/layer-norm/layer_norm.py diff --git a/mat-transpose/.gitignore b/kernels/mat-transpose/.gitignore similarity index 100% rename from mat-transpose/.gitignore rename to kernels/mat-transpose/.gitignore diff --git a/mat-transpose/README.md b/kernels/mat-transpose/README.md similarity index 100% rename from mat-transpose/README.md rename to kernels/mat-transpose/README.md diff --git a/mat-transpose/mat_transpose.cu b/kernels/mat-transpose/mat_transpose.cu similarity index 100% rename from mat-transpose/mat_transpose.cu rename to kernels/mat-transpose/mat_transpose.cu diff --git a/mat-transpose/mat_transpose.py b/kernels/mat-transpose/mat_transpose.py similarity index 100% rename from mat-transpose/mat_transpose.py rename to kernels/mat-transpose/mat_transpose.py diff --git a/flash-attn/.gitignore b/kernels/nms/.gitignore similarity index 100% rename from flash-attn/.gitignore rename to kernels/nms/.gitignore diff --git a/nms/README.md b/kernels/nms/README.md similarity index 100% rename from nms/README.md rename to kernels/nms/README.md diff --git a/nms/nms.cc b/kernels/nms/nms.cc similarity index 100% rename from nms/nms.cc rename to kernels/nms/nms.cc diff --git a/nms/nms.cu b/kernels/nms/nms.cu similarity index 100% rename from nms/nms.cu rename to kernels/nms/nms.cu diff --git a/nms/nms.py b/kernels/nms/nms.py similarity index 100% rename from nms/nms.py rename to kernels/nms/nms.py diff --git a/notes-v1.cu b/kernels/notes-v1.cu similarity index 100% rename from notes-v1.cu rename to kernels/notes-v1.cu diff --git a/nvidia-nsight/.gitignore b/kernels/nvidia-nsight/.gitignore similarity 
index 100% rename from nvidia-nsight/.gitignore rename to kernels/nvidia-nsight/.gitignore diff --git a/nvidia-nsight/README.md b/kernels/nvidia-nsight/README.md similarity index 100% rename from nvidia-nsight/README.md rename to kernels/nvidia-nsight/README.md diff --git a/nvidia-nsight/elementwise.cu b/kernels/nvidia-nsight/elementwise.cu similarity index 100% rename from nvidia-nsight/elementwise.cu rename to kernels/nvidia-nsight/elementwise.cu diff --git a/nvidia-nsight/relu.cu b/kernels/nvidia-nsight/relu.cu similarity index 100% rename from nvidia-nsight/relu.cu rename to kernels/nvidia-nsight/relu.cu diff --git a/nms/.gitignore b/kernels/openai-triton/.gitignore similarity index 100% rename from nms/.gitignore rename to kernels/openai-triton/.gitignore diff --git a/openai-triton/README.md b/kernels/openai-triton/README.md similarity index 100% rename from openai-triton/README.md rename to kernels/openai-triton/README.md diff --git a/openai-triton/flash_attn_v2_fwd.py b/kernels/openai-triton/flash_attn_v2_fwd.py similarity index 100% rename from openai-triton/flash_attn_v2_fwd.py rename to kernels/openai-triton/flash_attn_v2_fwd.py diff --git a/openai-triton/prefix_prefill.py b/kernels/openai-triton/prefix_prefill.py similarity index 100% rename from openai-triton/prefix_prefill.py rename to kernels/openai-triton/prefix_prefill.py diff --git a/openai-triton/prefix_prefill_alibi.py b/kernels/openai-triton/prefix_prefill_alibi.py similarity index 100% rename from openai-triton/prefix_prefill_alibi.py rename to kernels/openai-triton/prefix_prefill_alibi.py diff --git a/openai-triton/requirements.txt b/kernels/openai-triton/requirements.txt similarity index 100% rename from openai-triton/requirements.txt rename to kernels/openai-triton/requirements.txt diff --git a/openai-triton/test_flash_attn_v2_fwd.py b/kernels/openai-triton/test_flash_attn_v2_fwd.py similarity index 100% rename from openai-triton/test_flash_attn_v2_fwd.py rename to 
kernels/openai-triton/test_flash_attn_v2_fwd.py diff --git a/openai-triton/test_prefix_prefill.py b/kernels/openai-triton/test_prefix_prefill.py similarity index 100% rename from openai-triton/test_prefix_prefill.py rename to kernels/openai-triton/test_prefix_prefill.py diff --git a/openai-triton/test_prefix_prefill_alibi.py b/kernels/openai-triton/test_prefix_prefill_alibi.py similarity index 100% rename from openai-triton/test_prefix_prefill_alibi.py rename to kernels/openai-triton/test_prefix_prefill_alibi.py diff --git a/reduce/.gitignore b/kernels/reduce/.gitignore similarity index 100% rename from reduce/.gitignore rename to kernels/reduce/.gitignore diff --git a/reduce/README.md b/kernels/reduce/README.md similarity index 100% rename from reduce/README.md rename to kernels/reduce/README.md diff --git a/reduce/block_all_reduce.cu b/kernels/reduce/block_all_reduce.cu similarity index 100% rename from reduce/block_all_reduce.cu rename to kernels/reduce/block_all_reduce.cu diff --git a/reduce/block_all_reduce.py b/kernels/reduce/block_all_reduce.py similarity index 100% rename from reduce/block_all_reduce.py rename to kernels/reduce/block_all_reduce.py diff --git a/relu/.gitignore b/kernels/relu/.gitignore similarity index 100% rename from relu/.gitignore rename to kernels/relu/.gitignore diff --git a/relu/README.md b/kernels/relu/README.md similarity index 100% rename from relu/README.md rename to kernels/relu/README.md diff --git a/relu/relu.cu b/kernels/relu/relu.cu similarity index 100% rename from relu/relu.cu rename to kernels/relu/relu.cu diff --git a/relu/relu.py b/kernels/relu/relu.py similarity index 100% rename from relu/relu.py rename to kernels/relu/relu.py diff --git a/rms-norm/.gitignore b/kernels/rms-norm/.gitignore similarity index 100% rename from rms-norm/.gitignore rename to kernels/rms-norm/.gitignore diff --git a/rms-norm/README.md b/kernels/rms-norm/README.md similarity index 100% rename from rms-norm/README.md rename to 
kernels/rms-norm/README.md diff --git a/rms-norm/rms_norm.cu b/kernels/rms-norm/rms_norm.cu similarity index 100% rename from rms-norm/rms_norm.cu rename to kernels/rms-norm/rms_norm.cu diff --git a/rms-norm/rms_norm.py b/kernels/rms-norm/rms_norm.py similarity index 100% rename from rms-norm/rms_norm.py rename to kernels/rms-norm/rms_norm.py diff --git a/rope/.gitignore b/kernels/rope/.gitignore similarity index 100% rename from rope/.gitignore rename to kernels/rope/.gitignore diff --git a/rope/README.md b/kernels/rope/README.md similarity index 100% rename from rope/README.md rename to kernels/rope/README.md diff --git a/rope/rope.cu b/kernels/rope/rope.cu similarity index 100% rename from rope/rope.cu rename to kernels/rope/rope.cu diff --git a/rope/rope.py b/kernels/rope/rope.py similarity index 100% rename from rope/rope.py rename to kernels/rope/rope.py diff --git a/sgemm/.gitignore b/kernels/sgemm/.gitignore similarity index 100% rename from sgemm/.gitignore rename to kernels/sgemm/.gitignore diff --git a/sgemm/README.md b/kernels/sgemm/README.md similarity index 100% rename from sgemm/README.md rename to kernels/sgemm/README.md diff --git a/sgemm/sgemm.cu b/kernels/sgemm/sgemm.cu similarity index 100% rename from sgemm/sgemm.cu rename to kernels/sgemm/sgemm.cu diff --git a/sgemm/sgemm.py b/kernels/sgemm/sgemm.py similarity index 100% rename from sgemm/sgemm.py rename to kernels/sgemm/sgemm.py diff --git a/sgemm/sgemm_async.cu b/kernels/sgemm/sgemm_async.cu similarity index 100% rename from sgemm/sgemm_async.cu rename to kernels/sgemm/sgemm_async.cu diff --git a/sgemm/sgemm_cublas.cu b/kernels/sgemm/sgemm_cublas.cu similarity index 100% rename from sgemm/sgemm_cublas.cu rename to kernels/sgemm/sgemm_cublas.cu diff --git a/sgemm/sgemm_wmma_tf32_stage.cu b/kernels/sgemm/sgemm_wmma_tf32_stage.cu similarity index 100% rename from sgemm/sgemm_wmma_tf32_stage.cu rename to kernels/sgemm/sgemm_wmma_tf32_stage.cu diff --git a/sgemv/.gitignore 
b/kernels/sgemv/.gitignore similarity index 100% rename from sgemv/.gitignore rename to kernels/sgemv/.gitignore diff --git a/sgemv/README.md b/kernels/sgemv/README.md similarity index 100% rename from sgemv/README.md rename to kernels/sgemv/README.md diff --git a/sgemv/sgemv.cu b/kernels/sgemv/sgemv.cu similarity index 100% rename from sgemv/sgemv.cu rename to kernels/sgemv/sgemv.cu diff --git a/sgemv/sgemv.py b/kernels/sgemv/sgemv.py similarity index 100% rename from sgemv/sgemv.py rename to kernels/sgemv/sgemv.py diff --git a/sigmoid/.gitignore b/kernels/sigmoid/.gitignore similarity index 100% rename from sigmoid/.gitignore rename to kernels/sigmoid/.gitignore diff --git a/sigmoid/README.md b/kernels/sigmoid/README.md similarity index 100% rename from sigmoid/README.md rename to kernels/sigmoid/README.md diff --git a/sigmoid/sigmoid.cu b/kernels/sigmoid/sigmoid.cu similarity index 100% rename from sigmoid/sigmoid.cu rename to kernels/sigmoid/sigmoid.cu diff --git a/sigmoid/sigmoid.py b/kernels/sigmoid/sigmoid.py similarity index 100% rename from sigmoid/sigmoid.py rename to kernels/sigmoid/sigmoid.py diff --git a/softmax/.gitignore b/kernels/softmax/.gitignore similarity index 100% rename from softmax/.gitignore rename to kernels/softmax/.gitignore diff --git a/softmax/README.md b/kernels/softmax/README.md similarity index 100% rename from softmax/README.md rename to kernels/softmax/README.md diff --git a/softmax/softmax.cu b/kernels/softmax/softmax.cu similarity index 100% rename from softmax/softmax.cu rename to kernels/softmax/softmax.cu diff --git a/softmax/softmax.py b/kernels/softmax/softmax.py similarity index 100% rename from softmax/softmax.py rename to kernels/softmax/softmax.py diff --git a/swish/.gitignore b/kernels/swish/.gitignore similarity index 100% rename from swish/.gitignore rename to kernels/swish/.gitignore diff --git a/swish/README.md b/kernels/swish/README.md similarity index 100% rename from swish/README.md rename to 
kernels/swish/README.md diff --git a/swish/swish.cu b/kernels/swish/swish.cu similarity index 100% rename from swish/swish.cu rename to kernels/swish/swish.cu diff --git a/swish/swish.py b/kernels/swish/swish.py similarity index 100% rename from swish/swish.py rename to kernels/swish/swish.py diff --git a/openai-triton/.gitignore b/kernels/transformer/.gitignore similarity index 100% rename from openai-triton/.gitignore rename to kernels/transformer/.gitignore diff --git a/pytorch/.gitignore b/others/pytorch/.gitignore similarity index 100% rename from pytorch/.gitignore rename to others/pytorch/.gitignore diff --git a/pytorch/custom_ops/.gitignore b/others/pytorch/custom_ops/.gitignore similarity index 100% rename from pytorch/custom_ops/.gitignore rename to others/pytorch/custom_ops/.gitignore diff --git a/pytorch/distributed/.gitignore b/others/pytorch/distributed/.gitignore similarity index 100% rename from pytorch/distributed/.gitignore rename to others/pytorch/distributed/.gitignore diff --git a/pytorch/distributed/test_all_gather.py b/others/pytorch/distributed/test_all_gather.py similarity index 100% rename from pytorch/distributed/test_all_gather.py rename to others/pytorch/distributed/test_all_gather.py diff --git a/pytorch/distributed/test_all_gather_objects.py b/others/pytorch/distributed/test_all_gather_objects.py similarity index 100% rename from pytorch/distributed/test_all_gather_objects.py rename to others/pytorch/distributed/test_all_gather_objects.py diff --git a/pytorch/distributed/test_all_reduce.py b/others/pytorch/distributed/test_all_reduce.py similarity index 100% rename from pytorch/distributed/test_all_reduce.py rename to others/pytorch/distributed/test_all_reduce.py diff --git a/pytorch/distributed/test_all_to_all.py b/others/pytorch/distributed/test_all_to_all.py similarity index 100% rename from pytorch/distributed/test_all_to_all.py rename to others/pytorch/distributed/test_all_to_all.py diff --git 
diff --git a/pytorch/distributed/test_broadcast.py b/others/pytorch/distributed/test_broadcast.py
similarity index 100%
rename from pytorch/distributed/test_broadcast.py
rename to others/pytorch/distributed/test_broadcast.py
diff --git a/pytorch/distributed/test_gather.py b/others/pytorch/distributed/test_gather.py
similarity index 100%
rename from pytorch/distributed/test_gather.py
rename to others/pytorch/distributed/test_gather.py
diff --git a/pytorch/distributed/test_p2p.py b/others/pytorch/distributed/test_p2p.py
similarity index 100%
rename from pytorch/distributed/test_p2p.py
rename to others/pytorch/distributed/test_p2p.py
diff --git a/pytorch/distributed/test_reduce.py b/others/pytorch/distributed/test_reduce.py
similarity index 100%
rename from pytorch/distributed/test_reduce.py
rename to others/pytorch/distributed/test_reduce.py
diff --git a/pytorch/distributed/test_reduce_scatter.py b/others/pytorch/distributed/test_reduce_scatter.py
similarity index 100%
rename from pytorch/distributed/test_reduce_scatter.py
rename to others/pytorch/distributed/test_reduce_scatter.py
diff --git a/pytorch/distributed/test_scatter.py b/others/pytorch/distributed/test_scatter.py
similarity index 100%
rename from pytorch/distributed/test_scatter.py
rename to others/pytorch/distributed/test_scatter.py
diff --git a/pytorch/slides/pytorch_2.pdf b/others/pytorch/slides/pytorch_2.pdf
similarity index 100%
rename from pytorch/slides/pytorch_2.pdf
rename to others/pytorch/slides/pytorch_2.pdf
diff --git a/tensorrt/README.md b/others/tensorrt/README.md
similarity index 100%
rename from tensorrt/README.md
rename to others/tensorrt/README.md
diff --git a/tensorrt/fmha/.gitignore b/others/tensorrt/fmha/.gitignore
similarity index 100%
rename from tensorrt/fmha/.gitignore
rename to others/tensorrt/fmha/.gitignore
diff --git a/tensorrt/fmha/README.md b/others/tensorrt/fmha/README.md
similarity index 100%
rename from tensorrt/fmha/README.md
rename to others/tensorrt/fmha/README.md
diff --git a/tensorrt/fmha/export_fmha.py b/others/tensorrt/fmha/export_fmha.py
similarity index 100%
rename from tensorrt/fmha/export_fmha.py
rename to others/tensorrt/fmha/export_fmha.py
diff --git a/tensorrt/fmha/fmha_pattern_match_ops.py b/others/tensorrt/fmha/fmha_pattern_match_ops.py
similarity index 100%
rename from tensorrt/fmha/fmha_pattern_match_ops.py
rename to others/tensorrt/fmha/fmha_pattern_match_ops.py
diff --git a/tensorrt/plugin/.gitignore b/others/tensorrt/plugin/.gitignore
similarity index 100%
rename from tensorrt/plugin/.gitignore
rename to others/tensorrt/plugin/.gitignore
diff --git a/tensorrt/plugin/README.md b/others/tensorrt/plugin/README.md
similarity index 100%
rename from tensorrt/plugin/README.md
rename to others/tensorrt/plugin/README.md
diff --git a/transformer/.gitignore b/slides/cuda-slides/.gitignore
similarity index 100%
rename from transformer/.gitignore
rename to slides/cuda-slides/.gitignore
diff --git a/cuda-slides/CUDA_C_Programming_Guide_125.pdf b/slides/cuda-slides/CUDA_C_Programming_Guide_125.pdf
similarity index 100%
rename from cuda-slides/CUDA_C_Programming_Guide_125.pdf
rename to slides/cuda-slides/CUDA_C_Programming_Guide_125.pdf
diff --git a/cuda-slides/CUTLASS/A41131 - CUTLASS_ Python API, Enhancements, and NVIDIA Hopper.pdf b/slides/cuda-slides/CUTLASS/A41131 - CUTLASS_ Python API, Enhancements, and NVIDIA Hopper.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/A41131 - CUTLASS_ Python API, Enhancements, and NVIDIA Hopper.pdf
rename to slides/cuda-slides/CUTLASS/A41131 - CUTLASS_ Python API, Enhancements, and NVIDIA Hopper.pdf
diff --git a/cuda-slides/CUTLASS/ACCELERATING A TRITON FUSED KERNEL FOR W4A16 QUANTIZED INFERENCE WITH SPLITK WORK DECOMPOSITION.pdf b/slides/cuda-slides/CUTLASS/ACCELERATING A TRITON FUSED KERNEL FOR W4A16 QUANTIZED INFERENCE WITH SPLITK WORK DECOMPOSITION.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/ACCELERATING A TRITON FUSED KERNEL FOR W4A16 QUANTIZED INFERENCE WITH SPLITK WORK DECOMPOSITION.pdf
rename to slides/cuda-slides/CUTLASS/ACCELERATING A TRITON FUSED KERNEL FOR W4A16 QUANTIZED INFERENCE WITH SPLITK WORK DECOMPOSITION.pdf
diff --git a/cuda-slides/CUTLASS/Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS_1647481957785001Hmln.pdf b/slides/cuda-slides/CUTLASS/Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS_1647481957785001Hmln.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS_1647481957785001Hmln.pdf
rename to slides/cuda-slides/CUTLASS/Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS_1647481957785001Hmln.pdf
diff --git a/cuda-slides/CUTLASS/Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks.pdf b/slides/cuda-slides/CUTLASS/Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks.pdf
rename to slides/cuda-slides/CUTLASS/Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks.pdf
diff --git a/cuda-slides/CUTLASS/GPU Load Balancing.pdf b/slides/cuda-slides/CUTLASS/GPU Load Balancing.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/GPU Load Balancing.pdf
rename to slides/cuda-slides/CUTLASS/GPU Load Balancing.pdf
diff --git a/cuda-slides/CUTLASS/Graphene-CUTE-CUTLASS-2023.pdf b/slides/cuda-slides/CUTLASS/Graphene-CUTE-CUTLASS-2023.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/Graphene-CUTE-CUTLASS-2023.pdf
rename to slides/cuda-slides/CUTLASS/Graphene-CUTE-CUTLASS-2023.pdf
diff --git a/cuda-slides/CUTLASS/How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance_ a Worklog.pdf b/slides/cuda-slides/CUTLASS/How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance_ a Worklog.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance_ a Worklog.pdf
rename to slides/cuda-slides/CUTLASS/How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance_ a Worklog.pdf
diff --git a/cuda-slides/CUTLASS/Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core.pdf b/slides/cuda-slides/CUTLASS/Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core.pdf
rename to slides/cuda-slides/CUTLASS/Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core.pdf
diff --git a/cuda-slides/CUTLASS/NVIDIA Tensor Core Programmability, Performance & Precision.pdf b/slides/cuda-slides/CUTLASS/NVIDIA Tensor Core Programmability, Performance & Precision.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/NVIDIA Tensor Core Programmability, Performance & Precision.pdf
rename to slides/cuda-slides/CUTLASS/NVIDIA Tensor Core Programmability, Performance & Precision.pdf
diff --git a/cuda-slides/CUTLASS/S31883_ManishGupta_Accelerating_Convolution_with_Tensor_Cores_in_CUTLASS_1617596903236001ZNDn.pdf b/slides/cuda-slides/CUTLASS/S31883_ManishGupta_Accelerating_Convolution_with_Tensor_Cores_in_CUTLASS_1617596903236001ZNDn.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/S31883_ManishGupta_Accelerating_Convolution_with_Tensor_Cores_in_CUTLASS_1617596903236001ZNDn.pdf
rename to slides/cuda-slides/CUTLASS/S31883_ManishGupta_Accelerating_Convolution_with_Tensor_Cores_in_CUTLASS_1617596903236001ZNDn.pdf
diff --git a/cuda-slides/CUTLASS/colfax-gemm-kernels-hopper.pdf b/slides/cuda-slides/CUTLASS/colfax-gemm-kernels-hopper.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/colfax-gemm-kernels-hopper.pdf
rename to slides/cuda-slides/CUTLASS/colfax-gemm-kernels-hopper.pdf
diff --git a/cuda-slides/CUTLASS/layout_algebra.pdf b/slides/cuda-slides/CUTLASS/layout_algebra.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/layout_algebra.pdf
rename to slides/cuda-slides/CUTLASS/layout_algebra.pdf
diff --git a/cuda-slides/CUTLASS/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf b/slides/cuda-slides/CUTLASS/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf
similarity index 100%
rename from cuda-slides/CUTLASS/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf
rename to slides/cuda-slides/CUTLASS/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf
diff --git a/cuda-slides/CWES52010- Connect with the ExpertsInter-GPU Communication Techniques and Libraries.pdf b/slides/cuda-slides/CWES52010- Connect with the ExpertsInter-GPU Communication Techniques and Libraries.pdf
similarity index 100%
rename from cuda-slides/CWES52010- Connect with the ExpertsInter-GPU Communication Techniques and Libraries.pdf
rename to slides/cuda-slides/CWES52010- Connect with the ExpertsInter-GPU Communication Techniques and Libraries.pdf
diff --git a/cuda-slides/Graphene-NV-Tensor-IR.pdf b/slides/cuda-slides/Graphene-NV-Tensor-IR.pdf
similarity index 100%
rename from cuda-slides/Graphene-NV-Tensor-IR.pdf
rename to slides/cuda-slides/Graphene-NV-Tensor-IR.pdf
diff --git a/cuda-slides/Hopper_Tuning_Guide.pdf b/slides/cuda-slides/Hopper_Tuning_Guide.pdf
similarity index 100%
rename from cuda-slides/Hopper_Tuning_Guide.pdf
rename to slides/cuda-slides/Hopper_Tuning_Guide.pdf
diff --git a/cuda-slides/Inline_PTX_Assembly.pdf b/slides/cuda-slides/Inline_PTX_Assembly.pdf
similarity index 100%
rename from cuda-slides/Inline_PTX_Assembly.pdf
rename to slides/cuda-slides/Inline_PTX_Assembly.pdf
diff --git a/cuda-slides/MULTI-GPU TRAINING WITHNCCL.pdf b/slides/cuda-slides/MULTI-GPU TRAINING WITHNCCL.pdf
similarity index 100%
rename from cuda-slides/MULTI-GPU TRAINING WITHNCCL.pdf
rename to slides/cuda-slides/MULTI-GPU TRAINING WITHNCCL.pdf
diff --git a/cuda-slides/NCCL 2.0.pdf b/slides/cuda-slides/NCCL 2.0.pdf
similarity index 100%
rename from cuda-slides/NCCL 2.0.pdf
rename to slides/cuda-slides/NCCL 2.0.pdf
diff --git a/cuda-slides/NVIDIA H100 Tensor Core GPU Architecture Overview.pdf b/slides/cuda-slides/NVIDIA H100 Tensor Core GPU Architecture Overview.pdf
similarity index 100%
rename from cuda-slides/NVIDIA H100 Tensor Core GPU Architecture Overview.pdf
rename to slides/cuda-slides/NVIDIA H100 Tensor Core GPU Architecture Overview.pdf
diff --git a/cuda-slides/NVIDIA-Torch-TensorRT.pdf b/slides/cuda-slides/NVIDIA-Torch-TensorRT.pdf
similarity index 100%
rename from cuda-slides/NVIDIA-Torch-TensorRT.pdf
rename to slides/cuda-slides/NVIDIA-Torch-TensorRT.pdf
diff --git a/cuda-slides/Nsight Systems - DL Profiling Argonne National Labs 2022-06-30.pdf b/slides/cuda-slides/Nsight Systems - DL Profiling Argonne National Labs 2022-06-30.pdf
similarity index 100%
rename from cuda-slides/Nsight Systems - DL Profiling Argonne National Labs 2022-06-30.pdf
rename to slides/cuda-slides/Nsight Systems - DL Profiling Argonne National Labs 2022-06-30.pdf
diff --git a/cuda-slides/ORT-Python-Docs.pdf b/slides/cuda-slides/ORT-Python-Docs.pdf
similarity index 100%
rename from cuda-slides/ORT-Python-Docs.pdf
rename to slides/cuda-slides/ORT-Python-Docs.pdf
diff --git "a/cuda-slides/S31880 \342\200\223 NCCL- HIGH-SPEEDINTER-GPU COMMUNICATIONFOR LARGE-SCALE TRAINING.pdf" "b/slides/cuda-slides/S31880 \342\200\223 NCCL- HIGH-SPEEDINTER-GPU COMMUNICATIONFOR LARGE-SCALE TRAINING.pdf"
similarity index 100%
rename from "cuda-slides/S31880 \342\200\223 NCCL- HIGH-SPEEDINTER-GPU COMMUNICATIONFOR LARGE-SCALE TRAINING.pdf"
rename to "slides/cuda-slides/S31880 \342\200\223 NCCL- HIGH-SPEEDINTER-GPU COMMUNICATIONFOR LARGE-SCALE TRAINING.pdf"
diff --git a/cuda-slides/S41784- FAST INTER-GPU COMMUNICATION WITH NCCL FORDEEP LEARNING TRAINING, AND MORE.pdf b/slides/cuda-slides/S41784- FAST INTER-GPU COMMUNICATION WITH NCCL FORDEEP LEARNING TRAINING, AND MORE.pdf
similarity index 100%
rename from cuda-slides/S41784- FAST INTER-GPU COMMUNICATION WITH NCCL FORDEEP LEARNING TRAINING, AND MORE.pdf
rename to slides/cuda-slides/S41784- FAST INTER-GPU COMMUNICATION WITH NCCL FORDEEP LEARNING TRAINING, AND MORE.pdf
diff --git a/cuda-slides/TORCH.FX.pdf b/slides/cuda-slides/TORCH.FX.pdf
similarity index 100%
rename from cuda-slides/TORCH.FX.pdf
rename to slides/cuda-slides/TORCH.FX.pdf
diff --git a/cuda-slides/TensorRT-API.pdf b/slides/cuda-slides/TensorRT-API.pdf
similarity index 100%
rename from cuda-slides/TensorRT-API.pdf
rename to slides/cuda-slides/TensorRT-API.pdf
diff --git a/cuda-slides/TensorRT-Developer-Guide 10.1.pdf b/slides/cuda-slides/TensorRT-Developer-Guide 10.1.pdf
similarity index 100%
rename from cuda-slides/TensorRT-Developer-Guide 10.1.pdf
rename to slides/cuda-slides/TensorRT-Developer-Guide 10.1.pdf
diff --git a/cuda-slides/TensorRT-Operators.pdf b/slides/cuda-slides/TensorRT-Operators.pdf
similarity index 100%
rename from cuda-slides/TensorRT-Operators.pdf
rename to slides/cuda-slides/TensorRT-Operators.pdf
diff --git a/cuda-slides/gtc22-whitepaper-hopper.pdf b/slides/cuda-slides/gtc22-whitepaper-hopper.pdf
similarity index 100%
rename from cuda-slides/gtc22-whitepaper-hopper.pdf
rename to slides/cuda-slides/gtc22-whitepaper-hopper.pdf
diff --git a/cuda-slides/nvidia-ampere-architecture-whitepaper.pdf b/slides/cuda-slides/nvidia-ampere-architecture-whitepaper.pdf
similarity index 100%
rename from cuda-slides/nvidia-ampere-architecture-whitepaper.pdf
rename to slides/cuda-slides/nvidia-ampere-architecture-whitepaper.pdf
diff --git a/cuda-slides/ptx_isa_8.5.pdf b/slides/cuda-slides/ptx_isa_8.5.pdf
similarity index 100%
rename from cuda-slides/ptx_isa_8.5.pdf
rename to slides/cuda-slides/ptx_isa_8.5.pdf
diff --git a/cuda-slides/pytorch-nsys-profiling.pdf b/slides/cuda-slides/pytorch-nsys-profiling.pdf
similarity index 100%
rename from cuda-slides/pytorch-nsys-profiling.pdf
rename to slides/cuda-slides/pytorch-nsys-profiling.pdf
diff --git a/cuda-slides/s21417-faster-transformer.pdf b/slides/cuda-slides/s21417-faster-transformer.pdf
similarity index 100%
rename from cuda-slides/s21417-faster-transformer.pdf
rename to slides/cuda-slides/s21417-faster-transformer.pdf
diff --git a/cuda-slides/volta-architecture-whitepaper.pdf b/slides/cuda-slides/volta-architecture-whitepaper.pdf
similarity index 100%
rename from cuda-slides/volta-architecture-whitepaper.pdf
rename to slides/cuda-slides/volta-architecture-whitepaper.pdf
diff --git a/vllm-slides/[Public] vLLM Project Update @ Second vLLM Meetup.pptx b/slides/vllm-slides/[Public] vLLM Project Update @ Second vLLM Meetup.pptx
similarity index 100%
rename from vllm-slides/[Public] vLLM Project Update @ Second vLLM Meetup.pptx
rename to slides/vllm-slides/[Public] vLLM Project Update @ Second vLLM Meetup.pptx
diff --git a/vllm-slides/blogs/README.md b/slides/vllm-slides/blogs/README.md
similarity index 100%
rename from vllm-slides/blogs/README.md
rename to slides/vllm-slides/blogs/README.md
diff --git a/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio b/slides/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio
similarity index 100%
rename from vllm-slides/blogs/vllm-automatic-prefix-caching.drawio
rename to slides/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio
diff --git a/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio.png b/slides/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio.png
similarity index 100%
rename from vllm-slides/blogs/vllm-automatic-prefix-caching.drawio.png
rename to slides/vllm-slides/blogs/vllm-automatic-prefix-caching.drawio.png
diff --git a/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel-tiling.png b/slides/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel-tiling.png
similarity index 100%
rename from vllm-slides/blogs/vllm-prefix-prefill-triton-kernel-tiling.png
rename to slides/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel-tiling.png
diff --git a/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel.drawio b/slides/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel.drawio
similarity index 100%
rename from vllm-slides/blogs/vllm-prefix-prefill-triton-kernel.drawio
rename to slides/vllm-slides/blogs/vllm-prefix-prefill-triton-kernel.drawio
diff --git a/vllm-slides/vLLM First SF Meetup Slides (Public).pptx b/slides/vllm-slides/vLLM First SF Meetup Slides (Public).pptx
similarity index 100%
rename from vllm-slides/vLLM First SF Meetup Slides (Public).pptx
rename to slides/vllm-slides/vLLM First SF Meetup Slides (Public).pptx
diff --git a/vllm-slides/vLLM Project Update @ Third vLLM Meetup (Public).pptx b/slides/vllm-slides/vLLM Project Update @ Third vLLM Meetup (Public).pptx
similarity index 100%
rename from vllm-slides/vLLM Project Update @ Third vLLM Meetup (Public).pptx
rename to slides/vllm-slides/vLLM Project Update @ Third vLLM Meetup (Public).pptx
diff --git a/third-party/.gitignore b/third-party/.gitignore
index eb33da95..cd7ae360 100644
--- a/third-party/.gitignore
+++ b/third-party/.gitignore
@@ -7,4 +7,6 @@ build
 *.whl
 tmp
+bin
+