This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel #386

Closed

Conversation

LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Jul 31, 2024

Notes

This PR is a work in progress and is based on vllm-project#6396, which will have to land before this one.

Description

This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and built on CUTLASS.

Motivation

The motivation for this kernel is threefold:

  1. Marlin (v1) uses mma instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, however, Nvidia released a set of new wgmma instructions that are required to hit the peak FLOPs reported by Nvidia; without them, i.e. using mma instructions, you can expect to achieve at best ~75% of peak [1, 2].
  2. Marlin (v1) uses a specific weight storage layout that is specialized for the mma instructions. We want to adopt a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. wgmma and any new instructions Blackwell introduces.
    • MarlinV2 achieves this by describing the weight storage scheme using CUTLASS and CuTe.
  3. Marlin (v1) does not support CUTLASS epilogues. We eventually plan to investigate sub-byte weight quantization combined with activation quantization; for activation quantization we'd like to leverage the great work done by @tlrmchlsmth, @varun-sundar-rabindranath, and @ProExpertProg on writing custom CUTLASS epilogues for fp8 and int8.
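As background for point 2: a CuTe-style layout is, at its core, a mapping from logical (row, col) coordinates to storage offsets, which lets a kernel describe tiled/permuted weight storage algebraically rather than hard-coding it. Below is a toy Python sketch of the idea using a hypothetical 4x4 tiling — this is NOT the actual Machete or Marlin layout, just an illustration of coordinate-to-offset remapping:

```python
# Toy illustration of a tiled weight layout, in the spirit of what CuTe
# layouts express algebraically. The 4x4 tile shape here is hypothetical,
# NOT the actual Machete/Marlin storage scheme.

def tiled_index(row, col, rows, cols, tile_r=4, tile_c=4):
    """Map a logical (row, col) to a storage offset where 4x4 tiles are
    stored contiguously: tile-major order, then row-major within a tile."""
    tiles_per_row = cols // tile_c
    tile_id = (row // tile_r) * tiles_per_row + (col // tile_c)
    within = (row % tile_r) * tile_c + (col % tile_c)
    return tile_id * (tile_r * tile_c) + within

def repack(weights):
    """Repack a row-major matrix (list of lists, dims divisible by 4)
    into the tiled storage order above."""
    rows, cols = len(weights), len(weights[0])
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[tiled_index(r, c, rows, cols)] = weights[r][c]
    return out
```

For example, repacking an 8x8 matrix of sequential values places elements (0,0)..(0,3) and then (1,0)..(1,3) contiguously, since they belong to the first 4x4 tile. Changing the layout for a new instruction shape then means changing only the index function, not the kernel body.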

TODO:

  • Choose a new name (candidates: wahoo, swordfish (kinda cutlass + marlin), non-fish names ...); edit: chose machete
  • Improve the heuristic, namely for 4096x4096
  • Improve BFloat16 performance (via bit shift or interleaving)
  • E2E integration (future PR)
  • Improve batch size < 32 performance (potentially a future PR, likely through improving the stream-k scheduler)
  • Investigate fp8 activation support (future PR)
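For context on the BFloat16 TODO item: mixed-precision kernels commonly dequantize 4-bit weights with a magic-number bit trick instead of integer-to-float conversion instructions. The sketch below shows the well-known fp16 variant in Python (illustrative only; bf16's 7-bit mantissa is what makes the same trick awkward there, hence the bit shift/interleaving options mentioned above):

```python
import struct

def fp16_bits_to_float(bits):
    """Interpret a 16-bit pattern as an IEEE fp16 value."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def dequant_u4_via_fp16(v):
    """Dequantize an unsigned 4-bit value by OR-ing it into the mantissa
    of fp16 1024.0 (bit pattern 0x6400), yielding exactly 1024 + v, then
    subtracting 1024. On GPU this replaces int->float conversion
    instructions with cheap bitwise ops and one subtract."""
    return fp16_bits_to_float(0x6400 | (v & 0xF)) - 1024.0
```

At fp16's 2^10 exponent the mantissa step is exactly 1, so all sixteen 4-bit values land on representable numbers and the subtraction is exact.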

Current Performance

Float16

(figure: graph_marlinv2_bench_float16)

BFloat16

(figure: graph_marlinv2_bench_bfloat16)


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@LucasWilkinson LucasWilkinson changed the title [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Marlin [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Mixed Precision Linear Kernel Jul 31, 2024
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from 775049e to 1d90d74 Compare July 31, 2024 04:44
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from 1d90d74 to 4e63ad1 Compare July 31, 2024 19:34
@LucasWilkinson LucasWilkinson changed the title [WIP, Kernel] (1/N) MarlinV2 - Hopper Optimized Mixed Precision Linear Kernel [WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel Aug 1, 2024
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/scalar-type-cherrypick branch from e31dd1f to a926e67 Compare August 1, 2024 18:13
@LucasWilkinson
Collaborator Author

Migrated to: #401
