Conversation
```python
    return torch.stack(tensors, dim=0).to(dev)
...
def fused_marlin_moe(
```
This function does not need to adhere to the exact same interface as fused_moe. This function will be called on the hotpath: it should receive INT4 weights and scales and just call the Marlin MoE kernel directly.
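For illustration, a minimal sketch of the hotpath entry point being described, assuming hypothetical parameter names and a placeholder `marlin_moe_kernel` binding (not the actual PR code):

```python
import torch


def marlin_moe_kernel(*args):
    # Stand-in for the real C++/CUDA Marlin MoE binding.
    raise NotImplementedError


def fused_marlin_moe(hidden_states: torch.Tensor,
                     w1_q: torch.Tensor, w2_q: torch.Tensor,
                     w1_scales: torch.Tensor, w2_scales: torch.Tensor,
                     gating_output: torch.Tensor,
                     topk: int) -> torch.Tensor:
    """Hotpath sketch: weights arrive already packed in Marlin INT4 format."""
    # Routing: pick the top-k experts per token from the gating logits.
    topk_weights, topk_ids = torch.topk(
        torch.softmax(gating_output, dim=-1, dtype=torch.float32), topk, dim=-1)
    # Dispatch straight to the kernel; no quantization happens on the hotpath.
    return marlin_moe_kernel(hidden_states, w1_q, w2_q, w1_scales, w2_scales,
                             topk_weights, topk_ids)
```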
```python
    qweights1 = []
    scaless1 = []

    for i in range(w1.shape[0]):
```
This will not be called on the hotpath; rather, the quantized weights should be an input to this function.
see comments in code
So, a couple of things. vLLM is laid out in the following way:
- Models --> llama.py, which uses linear_layers like ColumnParallelLinear
- Each layer has a LinearMethod, which handles the representation of the weights and the forward pass
- Each LinearMethod exposes the following interface (roughly, create_weights plus an apply/forward method; see the sketch below)

So, we will eventually want to create a LinearMethod for MarlinMoE. As a result, the fused_moe_marlin kernel should receive already-quantized weights and just execute the computation; we should not be quantizing the weights inside of that function. For this PR, we should land the kernel + testing code for the kernel. We can work on adding the LinearMethod afterwards.
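For context, a simplified sketch of what such an interface roughly looks like (the exact method names and signatures in vLLM differ; `apply_weights` and the argument lists here are approximations). A MarlinMoE LinearMethod would register the pre-packed INT4 weights in `create_weights` and call `fused_marlin_moe` in its forward method:

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class LinearMethodBase(ABC):
    """Simplified sketch of vLLM's per-layer weight-handling interface."""

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, *weight_args,
                       **extra_weight_attrs) -> None:
        """Register the layer's (possibly quantized) parameters at load time."""
        ...

    @abstractmethod
    def apply_weights(self, layer: torch.nn.Module, x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Run the forward pass using whatever representation was created."""
        ...
```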
```diff
@@ -477,3 +478,342 @@ def fused_moe(
         out=hidden_states)
     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
                      dim=1)
```
Per my comment below, we will load the already-compressed weights via create_weights, so none of this will need to be called on the hotpath. As a result, all of this should be moved into testing utilities.
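A sketch of what such a testing utility could look like, assuming some single-matrix Marlin quantizer is available to the tests (the `quantize_fn` argument is a stand-in, not an existing helper):

```python
import torch


def quantize_experts_for_test(w: torch.Tensor, quantize_fn):
    """Test-only helper: pack each expert's fp16 weight into Marlin INT4 format.

    `w` is the expert-stacked weight tensor; `quantize_fn` should return
    (qweight, scales) for a single expert's weight matrix.
    """
    qweights, scales = [], []
    for e in range(w.shape[0]):
        qw, s = quantize_fn(w[e])
        qweights.append(qw)
        scales.append(s)
    return torch.stack(qweights, dim=0), torch.stack(scales, dim=0)
```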
tests/kernels/test_moe.py (outdated)
I think it would be good to add a test to make sure this works on other GPUs as well (we do this in the cutlass unit tests, if you want to replicate that here)
Do you mean testing on different devices, i.e. @pytest.mark.parametrize("device", CUDA_DEVICES)?
yes exactly
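A sketch of that pattern, modeled on how the existing tests enumerate devices (the exact `CUDA_DEVICES` definition and test body here are assumptions):

```python
import pytest
import torch

# Cover a second GPU when one is available (mirrors the cutlass tests).
CUDA_DEVICES = [
    f"cuda:{i}" for i in range(1 if torch.cuda.device_count() == 1 else 2)
]


@pytest.mark.parametrize("device", CUDA_DEVICES)
def test_fused_marlin_moe_devices(device: str):
    torch.set_default_device(device)
    # ... build inputs on `device`, run fused_marlin_moe, and compare
    # against the reference fused_moe output ...
```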
Doing anything on cuda:1 results in memory errors (illegal access) in moe_align_block_size_kernel, which I rely on but didn't modify. Should I look into it, or is it OK to leave it for now?
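Not a confirmed diagnosis, but one common cause of illegal accesses on a non-default GPU is a kernel being launched while a different device is current; a hedged sketch of guarding the launch on the Python side, as something easy to try:

```python
import torch


def launch_on_input_device(kernel, hidden_states: torch.Tensor, *args):
    # Make the input's device current before launching, so any allocations or
    # kernel launches inside `kernel` target cuda:N rather than cuda:0.
    with torch.cuda.device(hidden_states.device):
        return kernel(hidden_states, *args)
```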
```python
    # Check constraints.
    assert hidden_states.shape[0] == gating_output.shape[0], (
        "Number of tokens mismatch")
    assert hidden_states.shape[1] == w.shape[1] * 16, "Hidden size mismatch"
```
is 16 a hardcoded block size?
This is related to the Marlin format, where this value is hardcoded.
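As a readability tweak (not part of the PR), the factor could be given a name so the assumption is visible at the call site; this sketch assumes 16 is the Marlin tile size used when packing the K dimension:

```python
import torch

MARLIN_TILE_SIZE = 16  # Marlin packs the K dimension in 16-row tiles.


def check_hidden_size(hidden_states: torch.Tensor, w_q: torch.Tensor) -> None:
    # The packed qweight stores K / 16 rows, so the unpacked hidden size
    # is w_q.shape[1] * 16.
    assert hidden_states.shape[1] == w_q.shape[1] * MARLIN_TILE_SIZE, (
        "Hidden size mismatch")
```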
```python
    w1_s = self.experts[i].w1.get_parameter("scales").half()
    w3_s = self.experts[i].w3.get_parameter("scales").half()
    w2_qw = self.experts[i].w2.get_parameter("qweight").int()
    w2_s = self.experts[i].w2.get_parameter("scales").half()
```
are these guaranteed to be fp16?
From the PyTorch documentation: self.half() is equivalent to self.to(torch.float16). The scales are not necessarily fp16 when loaded, so the .half() call makes the conversion explicit.
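A small sketch of keeping that conversion in one place (illustrative only; assumes the kernel wants int32-packed weights and fp16 scales):

```python
import torch


def to_marlin_inputs(qweight: torch.Tensor, scales: torch.Tensor):
    # Convert explicitly instead of relying on whatever dtype the checkpoint
    # happened to be saved with.
    return qweight.to(torch.int32), scales.to(torch.float16)
```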
csrc/moe/marlin_moe_ops.cu (outdated)
How much of this file is copy-pasted from the original Marlin code? Could we factor out common functions? It would make it much easier to review if we could see what the new code is.
There is quite a bit of overlap, and many of the changes boil down to adding one variable or an extra condition here and there. I don't really want to refactor into common functions until act_order is done, because there might be more of these tiny modifications (or is it better to do the refactor now?). In any case, running a comparison of this file against csrc/quantization/gptq_marlin/gptq_marlin.cu helps to see what changed.
Edit: fixed file name
That's fair for things that may be changed by act_order, but any functions that are copied over unmodified should be factored out IMO.
Moved to the other repo.
Unit testing (requires uncommenting @pytest.mark.skip in test_moe.py).

End-to-end testing: run offline_inference.py with ... and ...