This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel #386

Closed

Conversation

LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Jul 31, 2024

Notes

This PR is a work in progress and is based on vllm-project#6396, which will have to land before this one.

Description

This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and built on CUTLASS.

Motivation

The motivation for this kernel is threefold:

  1. Marlin (v1) uses mma instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, however, Nvidia released a set of new wgmma instructions that are required to hit the peak FLOPs reported by Nvidia; without them, i.e. using mma instructions, you can expect to achieve at best ~75% of peak [1, 2].
  2. Marlin (v1) uses a specific weight storage layout that is specialized for the mma instructions. We want to adopt a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. wgmma and any new instructions Blackwell introduces.
    • MarlinV2 achieves this by describing the weight storage scheme using CUTLASS and CuTe.
  3. Marlin (v1) does not support CUTLASS epilogues. We eventually plan to investigate sub-byte weight quantization combined with activation quantization; for activation quantization we'd like to leverage the great work done by @tlrmchlsmth, @varun-sundar-rabindranath, and @ProExpertProg on writing custom CUTLASS epilogues for fp8 and int8.
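As background for point 2: a CuTe-style layout is, at its core, a mapping from logical (row, col) coordinates to storage offsets, which lets a kernel describe tiled/permuted weight storage algebraically rather than hard-coding it. Below is a toy Python sketch of the idea using a hypothetical 4x4 tiling — this is NOT the actual Machete or Marlin layout, just an illustration of coordinate-to-offset remapping:

```python
# Toy illustration of a tiled weight layout, in the spirit of what CuTe
# layouts express algebraically. The 4x4 tile shape here is hypothetical,
# NOT the actual Machete/Marlin storage scheme.

def tiled_index(row, col, rows, cols, tile_r=4, tile_c=4):
    """Map a logical (row, col) to a storage offset where 4x4 tiles are
    stored contiguously: tile-major order, then row-major within a tile."""
    tiles_per_row = cols // tile_c
    tile_id = (row // tile_r) * tiles_per_row + (col // tile_c)
    within = (row % tile_r) * tile_c + (col % tile_c)
    return tile_id * (tile_r * tile_c) + within

def repack(weights):
    """Repack a row-major matrix (list of lists, dims divisible by 4)
    into the tiled storage order above."""
    rows, cols = len(weights), len(weights[0])
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[tiled_index(r, c, rows, cols)] = weights[r][c]
    return out
```

For example, repacking an 8x8 matrix of sequential values places elements (0,0)..(0,3) and then (1,0)..(1,3) contiguously, since they belong to the first 4x4 tile. Changing the layout for a new instruction shape then means changing only the index function, not the kernel body.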

TODO:

  • Choose a new name (candidates: wahoo, swordfish (kinda cutlass + marlin), non-fish names ...); edit: chose machete
  • Improve the heuristic, namely for 4096x4096
  • Improve BFloat16 performance (via bit shift or interleaving)
  • E2E integration (future PR)
  • Improve batch size < 32 performance (potentially a future PR, likely through improving the stream-k scheduler)
  • Investigate fp8 activation support (future PR)
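For context on the BFloat16 TODO item: mixed-precision kernels commonly dequantize 4-bit weights with a magic-number bit trick instead of integer-to-float conversion instructions. The sketch below shows the well-known fp16 variant in Python (illustrative only; bf16's 7-bit mantissa is what makes the same trick awkward there, hence the bit shift/interleaving options mentioned above):

```python
import struct

def fp16_bits_to_float(bits):
    """Interpret a 16-bit pattern as an IEEE fp16 value."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def dequant_u4_via_fp16(v):
    """Dequantize an unsigned 4-bit value by OR-ing it into the mantissa
    of fp16 1024.0 (bit pattern 0x6400), yielding exactly 1024 + v, then
    subtracting 1024. On GPU this replaces int->float conversion
    instructions with cheap bitwise ops and one subtract."""
    return fp16_bits_to_float(0x6400 | (v & 0xF)) - 1024.0
```

At fp16's 2^10 exponent the mantissa step is exactly 1, so all sixteen 4-bit values land on representable numbers and the subtraction is exact.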

Current Performance

Float16

(figure: graph_marlinv2_bench_float16)

BFloat16

(figure: graph_marlinv2_bench_bfloat16)


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@LucasWilkinson LucasWilkinson changed the title [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Marlin [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Mixed Precision Linear Kernel Jul 31, 2024
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from 775049e to 1d90d74 Compare July 31, 2024 04:44
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from 1d90d74 to 4e63ad1 Compare July 31, 2024 19:34
@LucasWilkinson LucasWilkinson changed the title [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Mixed Precision Linear Kernel [WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel Aug 1, 2024
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from e31dd1f to a926e67 Compare August 1, 2024 18:13
@LucasWilkinson
Collaborator Author

Migrated to: #401
