question: [Quantization] Which files to change to make inference faster for Q8BERT? #221
Labels: question (further information is requested)
I know from previous issues that Q8BERT was described as just an experiment to measure the accuracy of a quantized BERT model. But, given that the accuracy is good, what changes would need to be made to the torch.nn.quantization file to replace the FP32 operations with INT8 operations?
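For context, this is the kind of FP32-to-INT8 swap I mean, using stock PyTorch post-training dynamic quantization on a toy FP32 module (the module and layer sizes below are just placeholders for illustration, not the actual Q8BERT code):

```python
import torch
from torch.quantization import quantize_dynamic

# Placeholder FP32 module standing in for a BERT feed-forward block;
# nothing here is taken from the Q8BERT implementation itself.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
).eval()

# Post-training dynamic quantization: weights are stored as INT8 and the
# matmuls run on INT8 kernels, while activations are quantized on the fly.
int8_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(int8_model)
```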
Replacing the FP32 Linear layers with torch.nn.quantized.Linear should theoretically work, since that module has optimized INT8 kernels, but in practice it doesn't (roughly as sketched below). The same goes for the other layers.
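Concretely, this is roughly the per-layer replacement I have in mind (a sketch only; the layer size and the use of the dynamic variant are my assumptions, not anything from the Q8BERT code):

```python
import torch
import torch.nn as nn
from torch.nn.quantized.dynamic import Linear as DynamicQuantizedLinear

# Take a single FP32 Linear and swap it for the INT8 dynamically
# quantized counterpart.
fp32_linear = nn.Linear(768, 768)
fp32_linear.qconfig = torch.quantization.default_dynamic_qconfig
int8_linear = DynamicQuantizedLinear.from_float(fp32_linear)

x = torch.randn(1, 128, 768)
# Small numerical drift from INT8 weights is expected here.
print((fp32_linear(x) - int8_linear(x)).abs().max())
```

As far as I understand, the static torch.nn.quantized.Linear expects already-quantized input tensors, which may be part of why simply dropping it into the FP32 model fails; the dynamic variant above at least accepts FP32 inputs.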
If someone could just point out how to improve the inference speed (hints, tips, directions, code, anything), it would be very helpful, since the model's accuracy is really good and I would like to use it for downstream tasks. I would even be happy to open a PR once those changes are done so they can be merged into the main repo.
Thank you!