Given a quantized model (for example llama2-7B in NF4), vanilla inference dequantizes the model to FP16 or BF16 before computing. Does exllamav2 support inference without dequantization?
Not sure what you mean? If you use a quantized model, inference will be done using the quantized weights directly. They still have to be converted to FP16 at some point, but this happens in the matmul kernel on individual weights as they're being streamed from VRAM and applied. For long sequences and/or large batch sizes, weights are dequantized one (partial) matrix at a time since that's more efficient.
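Conceptually, the difference between dequantizing the whole model up front and dequantizing on the fly inside the matmul looks something like the sketch below. This is a minimal plain-PyTorch illustration, not exllamav2's actual CUDA kernels; the group size, scale layout, and function names (`dequant_group`, `matmul_dequant_streaming`, etc.) are assumptions made up for the example.

```python
# Conceptual sketch only -- not exllamav2's real kernels. The group size,
# scale layout, and function names are invented for illustration.
import torch

GROUP = 128  # hypothetical per-group quantization block size

def dequant_group(q_weight, scales, g, dtype):
    # Rebuild one group of rows of the weight matrix from its integer codes
    # and per-group scale. Real kernels do this in FP16 on the GPU.
    rows = q_weight[g * GROUP:(g + 1) * GROUP]
    return rows.to(dtype) * scales[g]

def matmul_dequant_upfront(x, q_weight, scales):
    # "Vanilla" approach: materialize the full dequantized matrix, then matmul.
    n_groups = q_weight.shape[0] // GROUP
    w = torch.cat([dequant_group(q_weight, scales, g, x.dtype)
                   for g in range(n_groups)])
    return x @ w

def matmul_dequant_streaming(x, q_weight, scales):
    # Fused approach: dequantize one small group at a time and accumulate its
    # partial product, so the full-precision matrix never exists all at once.
    n_groups = q_weight.shape[0] // GROUP
    out = torch.zeros(x.shape[0], q_weight.shape[1], dtype=x.dtype)
    for g in range(n_groups):
        w_g = dequant_group(q_weight, scales, g, x.dtype)
        out += x[:, g * GROUP:(g + 1) * GROUP] @ w_g
    return out

# Tiny self-check with random data (float32 here for portability).
in_f, out_f, batch = 256, 64, 4
q_w = torch.randint(-127, 128, (in_f, out_f), dtype=torch.int8)
scales = torch.rand(in_f // GROUP) * 0.01
x = torch.randn(batch, in_f)
assert torch.allclose(matmul_dequant_upfront(x, q_w, scales),
                      matmul_dequant_streaming(x, q_w, scales), atol=1e-3)
```

Both paths produce the same result; the point is only that the streaming version never needs the full FP16 weight matrix in memory, which is what "no-dequant inference" amounts to in practice.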