What happens to bias during int8 quantization? #108

Open

gchhablani opened this issue Feb 24, 2024 · 3 comments

gchhablani commented Feb 24, 2024

I see that the linear layers' weights are replaced with quantized weights.
However, I don't see what happens to the bias in the linear layers. Is it no longer needed?
Why?

I assume it should be something like this for a generic model that includes a bias as well:

import torch
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: torch.Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer("weight", torch.empty((out_features, in_features), dtype=torch.int8))
        # bias is one value per output channel, so its shape is (out_features,), not (out_features, in_features)
        self.register_buffer("bias", torch.empty(out_features, dtype=torch.bfloat16))
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales + self.bias.to(dtype=input.dtype)

Can this be further optimized?

Chillee (Contributor) commented Feb 25, 2024

You don't need to quantize it. The weight matrix is, say, 4096 x 4096. The bias is just another 4096 elements, so about 0.02% of the size.

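To make that concrete (illustrative numbers only, assuming the 4096 x 4096 example above):

# Element-count comparison for a single 4096 x 4096 linear layer.
weight_elems = 4096 * 4096        # 16,777,216 int8 weight values
bias_elems = 4096                 # one bias value per output channel
print(bias_elems / weight_elems)  # ~0.00024, i.e. roughly 0.02% of the weight count
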
gchhablani (Author) commented

@Chillee Agreed, but the bias shows up as a missing key when I try to quantize my own model.

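A minimal sketch of how a per-channel int8 quantization pass could carry the bias through unchanged, so the key is present at load time (the helper names below are illustrative, not gpt-fast's actual quantize.py API):

import torch

def quantize_int8_per_channel(weight: torch.Tensor):
    # Symmetric per-output-channel quantization: scale each row so its
    # largest magnitude maps to 127, then round to int8.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q_weight = torch.round(weight / scales).to(torch.int8)
    return q_weight, scales.squeeze(1).to(torch.bfloat16)

def create_quantized_state_dict(model: torch.nn.Module) -> dict:
    # Hypothetical helper: build a state dict that matches WeightOnlyInt8Linear,
    # keeping each layer's bias (or a zero bias) so no keys go missing on load.
    new_sd = {}
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            q_weight, scales = quantize_int8_per_channel(mod.weight.detach())
            new_sd[f"{name}.weight"] = q_weight
            new_sd[f"{name}.scales"] = scales
            bias = mod.bias.detach() if mod.bias is not None else torch.zeros(mod.out_features)
            new_sd[f"{name}.bias"] = bias.to(torch.bfloat16)
    return new_sd
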
michaelfeil (Contributor) commented

@gchhablani I am relatively confident the following quantization code should do the trick.

import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Module

class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor

    def __init__(
        self,
        in_features: int,
        out_features: int,
        device=None,
        dtype=None,
    ) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer(
            "weight", torch.empty((out_features, in_features), dtype=torch.int8)
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))
        # Initialize bias to zero in case the original model has no bias;
        # the bias has the same shape as scales (one value per output channel).
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.bfloat16))

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales + self.bias.to(dtype=input.dtype)

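And a quick usage sketch (layer size and values are illustrative) for swapping in the module and filling its buffers from an existing nn.Linear:

# Hypothetical usage: replace an nn.Linear with the int8 module and
# populate its buffers from the original float weights.
linear = torch.nn.Linear(4096, 4096, bias=True)

int8_linear = WeightOnlyInt8Linear(linear.in_features, linear.out_features)
scales = linear.weight.detach().abs().amax(dim=1) / 127.0
int8_linear.weight.copy_(torch.round(linear.weight.detach() / scales[:, None]).to(torch.int8))
int8_linear.scales.copy_(scales.to(torch.bfloat16))
if linear.bias is not None:
    int8_linear.bias.copy_(linear.bias.detach().to(torch.bfloat16))

x = torch.randn(2, 4096)
print(int8_linear(x).shape)  # torch.Size([2, 4096])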