What happens to bias during int8 quantization? #108

Open

gchhablani opened this issue Feb 24, 2024 · 3 comments

gchhablani commented Feb 24, 2024

I see that the linear layers' weights are replaced with quantized weights.
However, I don't see what happens to the bias in the linear layers. Is it no longer needed?
Why?

I assume it should be something like this for a generic model that includes a bias as well:

import torch
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: torch.Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer("weight", torch.empty((out_features, in_features), dtype=torch.int8))
        # bias is one value per output channel, so its shape is (out_features,), not (out_features, in_features)
        self.register_buffer("bias", torch.empty(out_features, dtype=torch.bfloat16))
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales + self.bias.to(dtype=input.dtype)

Can this be further optimized?

Chillee (Contributor) commented Feb 25, 2024

You don't need to quantize it. The weight matrix is, say, 4096 x 4096. The bias is just another 4096 elements, so about 0.02% of the size.

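To make that concrete (illustrative numbers only, assuming the 4096 x 4096 example above):

# Element-count comparison for a single 4096 x 4096 linear layer.
weight_elems = 4096 * 4096        # 16,777,216 int8 weight values
bias_elems = 4096                 # one bias value per output channel
print(bias_elems / weight_elems)  # ~0.00024, i.e. roughly 0.02% of the weight count
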
gchhablani (Author) commented

@Chillee Agreed, but the bias shows up as a missing key when I try to quantize my own model.

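A minimal sketch of how a per-channel int8 quantization pass could carry the bias through unchanged, so the key is present at load time (the helper names below are illustrative, not gpt-fast's actual quantize.py API):

import torch

def quantize_int8_per_channel(weight: torch.Tensor):
    # Symmetric per-output-channel quantization: scale each row so its
    # largest magnitude maps to 127, then round to int8.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q_weight = torch.round(weight / scales).to(torch.int8)
    return q_weight, scales.squeeze(1).to(torch.bfloat16)

def create_quantized_state_dict(model: torch.nn.Module) -> dict:
    # Hypothetical helper: build a state dict that matches WeightOnlyInt8Linear,
    # keeping each layer's bias (or a zero bias) so no keys go missing on load.
    new_sd = {}
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            q_weight, scales = quantize_int8_per_channel(mod.weight.detach())
            new_sd[f"{name}.weight"] = q_weight
            new_sd[f"{name}.scales"] = scales
            bias = mod.bias.detach() if mod.bias is not None else torch.zeros(mod.out_features)
            new_sd[f"{name}.bias"] = bias.to(torch.bfloat16)
    return new_sd
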
michaelfeil (Contributor) commented

@gchhablani I am relatively confident the following quantization code should do the trick.

import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Module

class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor

    def __init__(
        self,
        in_features: int,
        out_features: int,
        device=None,
        dtype=None,
    ) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer(
            "weight", torch.empty((out_features, in_features), dtype=torch.int8)
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))
        # Initialize bias to zero in case the original model has no bias;
        # the bias has the same shape as scales (one value per output channel).
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.bfloat16))

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales + self.bias.to(dtype=input.dtype)

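And a quick usage sketch (layer size and values are illustrative) for swapping in the module and filling its buffers from an existing nn.Linear:

# Hypothetical usage: replace an nn.Linear with the int8 module and
# populate its buffers from the original float weights.
linear = torch.nn.Linear(4096, 4096, bias=True)

int8_linear = WeightOnlyInt8Linear(linear.in_features, linear.out_features)
scales = linear.weight.detach().abs().amax(dim=1) / 127.0
int8_linear.weight.copy_(torch.round(linear.weight.detach() / scales[:, None]).to(torch.int8))
int8_linear.scales.copy_(scales.to(torch.bfloat16))
if linear.bias is not None:
    int8_linear.bias.copy_(linear.bias.detach().to(torch.bfloat16))

x = torch.randn(2, 4096)
print(int8_linear(x).shape)  # torch.Size([2, 4096])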