
[QUESTION] Does exllamav2 support no-dequant inference? #670

Open
AaronZLT opened this issue Nov 7, 2024 · 1 comment

Comments

AaronZLT commented Nov 7, 2024

Problem

Given a quantized model (for example, llama2-7B-nf4), vanilla inference dequantizes the weights to fp16 or bf16 before computing. Does exllamav2 support no-dequant inference?
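For reference, a minimal sketch of the dequantize-then-compute path described above, assuming a toy per-group 4-bit scheme (not the actual nf4 storage layout) and float32 compute so it runs on CPU; the whole weight matrix is expanded to floating point before the matmul:

```python
import torch

def dequant_then_matmul(x, q_weight, scales, group_size=64):
    # Toy 4-bit-style weight: int values in [-8, 7] plus per-group scales.
    # (Illustrative only -- not the real nf4 / exllamav2 format.)
    out_features, in_features = q_weight.shape
    # Dequantize the WHOLE weight matrix up front ("vanilla" inference)...
    w = q_weight.to(x.dtype).view(out_features, -1, group_size)
    w = (w * scales.unsqueeze(-1)).view(out_features, in_features)
    # ...then run an ordinary dense matmul on the dequantized copy.
    return x @ w.t()

x = torch.randn(1, 4096)                                         # activations
q_weight = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8)  # toy 4-bit ints
scales = torch.rand(4096, 4096 // 64)                            # per-group scales
y = dequant_then_matmul(x, q_weight, scales)                     # y: (1, 4096)
```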

@turboderp (Owner)

Not sure what you mean? If you use a quantized model, inference will be done using the quantized weights directly. They still have to be converted to FP16 at some point, but this happens in the matmul kernel on individual weights as they're being streamed from VRAM and applied. For long sequences and/or large batch sizes, weights are dequantized one (partial) matrix at a time since that's more efficient.
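To illustrate the difference, here is a rough sketch of dequantizing "one (partial) matrix at a time", using the same toy per-group scheme as above. The actual exllamav2 kernels are fused CUDA code that convert individual weights as they stream from VRAM, so this is only a conceptual approximation:

```python
import torch

def matmul_dequant_on_the_fly(x, q_weight, scales, group_size=64, tile=512):
    # The quantized weight stays resident; only a tile-sized slice is
    # expanded to floating point inside the matmul loop.
    out_features, in_features = q_weight.shape
    y = torch.empty(x.shape[0], out_features, dtype=x.dtype)
    for start in range(0, out_features, tile):
        end = min(start + tile, out_features)
        # Only this tile ever exists in floating point.
        w_tile = q_weight[start:end].to(x.dtype)
        s_tile = scales[start:end].repeat_interleave(group_size, dim=1)
        y[:, start:end] = x @ (w_tile * s_tile).t()
    return y

x = torch.randn(1, 4096)
q_weight = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8)
scales = torch.rand(4096, 4096 // 64)
y = matmul_dequant_on_the_fly(x, q_weight, scales)               # y: (1, 4096)
```

The point is that a full floating-point copy of the weight matrix never needs to be materialized; at most a tile-sized temporary exists, and in the fused per-weight path not even that.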
