Given a quantized model (for example llama2-7B in NF4), vanilla inference dequantizes the model to FP16 or BF16 before computing. Does exllamav2 support inference without dequantization?
Not sure what you mean? If you use a quantized model, inference will be done using the quantized weights directly. They still have to be converted to FP16 at some point, but this happens in the matmul kernel on individual weights as they're being streamed from VRAM and applied. For long sequences and/or large batch sizes, weights are dequantized one (partial) matrix at a time since that's more efficient.
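Conceptually, the difference between dequantizing the whole model up front and dequantizing on the fly inside the matmul looks something like the sketch below. This is a minimal plain-PyTorch illustration, not exllamav2's actual CUDA kernels; the group size, scale layout, and function names (`dequant_group`, `matmul_dequant_streaming`, etc.) are assumptions made up for the example.

```python
# Conceptual sketch only -- not exllamav2's real kernels. The group size,
# scale layout, and function names are invented for illustration.
import torch

GROUP = 128  # hypothetical per-group quantization block size

def dequant_group(q_weight, scales, g, dtype):
    # Rebuild one group of rows of the weight matrix from its integer codes
    # and per-group scale. Real kernels do this in FP16 on the GPU.
    rows = q_weight[g * GROUP:(g + 1) * GROUP]
    return rows.to(dtype) * scales[g]

def matmul_dequant_upfront(x, q_weight, scales):
    # "Vanilla" approach: materialize the full dequantized matrix, then matmul.
    n_groups = q_weight.shape[0] // GROUP
    w = torch.cat([dequant_group(q_weight, scales, g, x.dtype)
                   for g in range(n_groups)])
    return x @ w

def matmul_dequant_streaming(x, q_weight, scales):
    # Fused approach: dequantize one small group at a time and accumulate its
    # partial product, so the full-precision matrix never exists all at once.
    n_groups = q_weight.shape[0] // GROUP
    out = torch.zeros(x.shape[0], q_weight.shape[1], dtype=x.dtype)
    for g in range(n_groups):
        w_g = dequant_group(q_weight, scales, g, x.dtype)
        out += x[:, g * GROUP:(g + 1) * GROUP] @ w_g
    return out

# Tiny self-check with random data (float32 here for portability).
in_f, out_f, batch = 256, 64, 4
q_w = torch.randint(-127, 128, (in_f, out_f), dtype=torch.int8)
scales = torch.rand(in_f // GROUP) * 0.01
x = torch.randn(batch, in_f)
assert torch.allclose(matmul_dequant_upfront(x, q_w, scales),
                      matmul_dequant_streaming(x, q_w, scales), atol=1e-3)
```

Both paths produce the same result; the point is only that the streaming version never needs the full FP16 weight matrix in memory, which is what "no-dequant inference" amounts to in practice.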