What happens to bias during int8 quantization? #108
I see that the linear layer weights are replaced with quantized weights. However, I don't see what happens to the bias in the linear layers. Is it not needed anymore? Why?
I assume it should be something like this for a generic model that includes a bias as well:
Can this be further optimized?
Comments
You don't need to quantize it. The weight matrix is, say, 4096 x 4096; the bias is just another 4096 elements, so about 0.02% of the size.
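As a quick sanity check of that ratio, counting elements for a single 4096 x 4096 linear layer (the dimensions are just the example from the comment above):

```python
# Quick element-count comparison for one 4096 x 4096 linear layer.
dim = 4096
weight_elems = dim * dim        # 16,777,216 weight elements
bias_elems = dim                # 4,096 bias elements
print(f"bias / weight = {bias_elems / weight_elems:.4%}")  # ~0.0244%
```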
@Chillee Agreed, but the bias is a missing key when I try to quantize my own model.
@gchhablani I am relatively confident the following quantization code should do the trick:

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Module


class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor

    def __init__(
        self,
        in_features: int,
        out_features: int,
        device=None,
        dtype=None,
    ) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer(
            "weight", torch.empty((out_features, in_features), dtype=torch.int8)
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))
        # Initialize bias to zero in case the original model has no bias.
        # The bias has the same shape as scales (one value per output feature).
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.bfloat16))

    def forward(self, input: Tensor) -> Tensor:
        # Dequantize on the fly: rescale the int8 matmul result per output
        # channel, then add the bias in the input dtype.
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales + self.bias.to(dtype=input.dtype)
```
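For context, here is a minimal sketch of how existing `nn.Linear` layers could be swapped for the module above, assuming simple per-output-channel absmax scaling; the `replace_linear_int8` helper is hypothetical and not code from this repository:

```python
import torch
from torch import nn


@torch.no_grad()
def replace_linear_int8(module: nn.Module) -> None:
    """Recursively swap nn.Linear layers for WeightOnlyInt8Linear (sketch)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            q = WeightOnlyInt8Linear(child.in_features, child.out_features)
            # Per-output-channel absmax scaling of the float weight into int8.
            w = child.weight.data
            scales = w.abs().amax(dim=1).clamp(min=1e-8) / 127.0
            q.weight.copy_(
                (w / scales.unsqueeze(1)).round().clamp(-128, 127).to(torch.int8)
            )
            q.scales.copy_(scales.to(torch.bfloat16))
            # Copy the original bias if there is one; otherwise it stays zero.
            if child.bias is not None:
                q.bias.copy_(child.bias.data.to(torch.bfloat16))
            setattr(module, name, q)
        else:
            replace_linear_int8(child)
```

Because `bias` is registered as a buffer and either copied or left at zero, the resulting state dict keeps a `bias` entry for every quantized layer, which is one way around the missing-key error mentioned above.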