This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] Support RTN int2 weight #178

Merged
merged 23 commits into main from int2 on Apr 9, 2024

Conversation

luoyu-intel (Contributor) commented Mar 19, 2024

Type of Change

  • add RTN INT2 sym&asym quantization
  • INT3 AVX2 kernels
  • INT4 optimization for MTL

The INT4 optimization for MTL will be planned as a new feature: vector kernel development. The INT2 kernels for the RTN INT2 sym & asym quantization item will also be delivered as part of that feature.

The INT2 work is not finished yet; here we only preview its text generation results. The vector kernels should also cover INT2 weights.
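
For readers unfamiliar with RTN: it simply rounds each weight to the nearest point on a per-group grid, with either a symmetric scale or an asymmetric scale plus zero-point. Below is a minimal numpy sketch of the idea, assuming group_size divides the tensor evenly; it is illustrative only, not BesTLA's actual kernel code.

```python
import numpy as np

def rtn_quant(w, bits=2, group_size=16, sym=True):
    """Groupwise round-to-nearest quantization of a 1-D fp32 tensor."""
    qmax = (1 << (bits - 1)) - 1            # int2 sym: +1
    qmin = -(1 << (bits - 1))               # int2 sym: -2
    g = w.reshape(-1, group_size)
    if sym:
        scale = np.abs(g).max(axis=1, keepdims=True) / max(qmax, 1)
        scale = np.where(scale == 0, 1.0, scale)   # guard all-zero groups
        zero = np.zeros_like(scale)
    else:
        wmin = g.min(axis=1, keepdims=True)
        wmax = g.max(axis=1, keepdims=True)
        scale = (wmax - wmin) / (2**bits - 1)
        scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
        zero = qmin - wmin / scale                 # float zero-point
    q = np.clip(np.round(g / scale + zero), qmin, qmax)
    return q.astype(np.int8), scale, zero

def rtn_dequant(q, scale, zero):
    """Reconstruct fp32 values from quantized groups."""
    return (q.astype(np.float32) - zero) * scale

# Round-trip example: int2 asym with group_size=16, as in this PR's demo.
w = np.random.randn(4096).astype(np.float32)
q, s, z = rtn_quant(w, bits=2, group_size=16, sym=False)
w_hat = rtn_dequant(q, s, z).ravel()
```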

luoyu-intel (Contributor, Author) commented Mar 22, 2024

Sample output with RTN, weight_dtype=int2, alg=asym, group_size=16, compute_dtype=fp32:

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have the freedom to do whatever she desired without any barriers that would hinder her path like the world outside of this small village where she was born and raised.
She had read or heard about it before when she was at home in her crib the usual "Once upon a time" story for all her internal
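
For context on the footprint, int2 stores four weights per byte. Here is a hedged sketch of one possible packing, assuming the earliest value occupies the low-order bits; BesTLA's real layout may differ.

```python
import numpy as np

def pack_int2(q):
    """Pack int2 values (-2..1) into bytes, 4 per byte, low bits first."""
    u = (q.astype(np.int8) + 2).astype(np.uint8).reshape(-1, 4)  # shift to 0..3
    return (u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)).astype(np.uint8)

def unpack_int2(b):
    """Inverse of pack_int2: recover the signed int2 values."""
    u = np.stack([(b >> s) & 0x3 for s in (0, 2, 4, 6)], axis=1)
    return u.reshape(-1).astype(np.int8) - 2
```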

bil-ash commented Apr 8, 2024

This seems to support only LLMs. Are there any plans for 2-bit and 3-bit support in Whisper inference? (whisper.cpp supports 2-bit and 3-bit quantization, although inference quality is horrible at 2-bit.) In many cases having a small model is essential.

Also, please add support for NLLB (with 2-bit and 3-bit quantization). I opened an issue with this request two months ago; it was closed and there has been no recent update, so I am making this request once again.

luoyu-intel (Contributor, Author) replied:

@bil-ash 2-bit and 3-bit are both experimental; they are only kernel-ready. They have not been tested on as many models as int4 quantization has. As you can see, 2-bit cannot be applied to all weights; it requires a model-specific quantization configuration.
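
To make that last point concrete, one hypothetical heuristic for building such a model-specific configuration is to measure the int2 RTN round-trip error per tensor and fall back to int4 where it is too large. The function, threshold, and config format below are all illustrative assumptions, not Neural Speed API.

```python
import numpy as np

def int2_rtn_error(w, group_size=16):
    """Relative round-trip error of asymmetric int2 RTN on one tensor."""
    g = w.ravel().reshape(-1, group_size)
    wmin, wmax = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.where(wmax > wmin, (wmax - wmin) / 3.0, 1.0)  # 4 levels: 0..3
    deq = np.clip(np.round((g - wmin) / scale), 0, 3) * scale + wmin
    return np.linalg.norm(deq - g) / np.linalg.norm(g)

def choose_bits(weights, threshold=0.05):
    """Hypothetical per-tensor dtype map: int2 where tolerable, else int4."""
    return {name: ("int2" if int2_rtn_error(w) < threshold else "int4")
            for name, w in weights.items()}
```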

luoyu-intel (Contributor, Author) commented Apr 8, 2024

Performance of the INT3 AVX2 kernels with LLaMA2-7B on an Intel Core i9-12900K:

weight_dtype=int3, group_size=128, compute_dtype=int8, 2733 MB:
model_print_timings: prompt eval time = 421.94 ms / 32 tokens ( 13.19 ms per token)
model_print_timings: eval time = 7466.54 ms / 127 runs ( 58.79 ms per token)

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have a lot of fun. One day she set off on her own and came to a vast city.
She was surprised by the number of people that were in the city at any given time. In contrast to her town where there were only a few houses. The girl was amazed by how many buildings and houses there were in this new place. And as far as she could see and hear. She saw so many people going about their business.
She walked into one of the shops. A big black and red door. In front of it stood a large man. With a black mustache who looked angry. He grunted

weight_dtype=int4, group_size=128, compute_dtype=int8, 3522 MB:
model_print_timings: prompt eval time = 417.83 ms / 32 tokens ( 13.06 ms per token)
model_print_timings: eval time = 8342.10 ms / 127 runs ( 65.69 ms per token)

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have exciting experiences. But her parents were always too busy and stressed, and they couldn't take her on trips like she wanted. So the little girl decided to take matters into her own hands.
She started by looking for clues around her house that might lead her to a magical portal, which would transport her to amazing places. She searched high and low, until one day, she finally found it hidden behind an old bookshelf in the basement. The portal was shimmering and glowing, and the little girl could feel its magic calling to her.
Excited and a bit

int3 is ~10% faster than the int4 weight.
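
For context, the ~10% claim follows directly from the eval timings above; a quick check:

```python
# Derived from the printed eval times: int3 = 58.79 ms/token, int4 = 65.69 ms/token.
int3_ms, int4_ms = 58.79, 65.69
print(f"speedup: {int4_ms / int3_ms - 1:.1%}")   # ~11.7% faster per token
print(f"memory:  {1 - 2733 / 3522:.1%} smaller") # 2733 MB vs 3522 MB, ~22.4%
```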

luoyu-intel marked this pull request as ready for review on April 8, 2024 at 02:47
Review thread on bestla/bestla/kernel_jit.h (outdated, resolved)
bil-ash commented Apr 9, 2024

> @bil-ash 2-bit and 3-bit are both experimental; they are only kernel-ready. They have not been tested on as many models as int4 quantization has. As you can see, 2-bit cannot be applied to all weights; it requires a model-specific quantization configuration.

Okay, understood. So basically int2 will take some months. However, since the int3 (avx512f & avx2) implementation is almost complete, please add int3 support for Whisper now; I would like to compare whisper.cpp and neural-speed.

And in the long run, please also add support for 3-bit and 2-bit quantized NLLB.

luoyu-intel (Contributor, Author) replied:

@bil-ash The priority of audio models is decided by the project manager; it is outside my scope.

VincyZhang merged commit da8e2b8 into main on Apr 9, 2024
12 checks passed
luoyu-intel deleted the int2 branch on April 9, 2024 at 05:08