Conversation
RTN, weight_dtype= — preview sample: "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have the freedom to do whatever she desired without any barriers that would hinder her path like the world outside of this small village where she was born and raised."
This seems to support only LLMs. Are there any plans for 2-bit and 3-bit support for Whisper inference? (whisper.cpp supports 2- and 3-bit quantization, although inference quality is terrible at 2-bit.) In many cases having a small model is essential. Please also add support for NLLB with 2- and 3-bit quantization. I opened an issue with this request two months ago; it was closed two months ago and there has been no update since, so I am making the request again.
@bil-ash 2-bit and 3-bit are both experimental; only the kernels are ready. They have not been tested on as many models as INT4 quantization. As you can see, 2-bit cannot be applied to all weights; it requires a model-specific quantization configuration.
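To illustrate why a model-specific configuration is needed, here is a minimal NumPy sketch (not Neural Speed's actual API; the function names and the error tolerance are illustrative) that measures per-layer round-to-nearest quantization error and keeps a layer at INT2 only when that error is small, falling back to INT4 otherwise:

```python
import numpy as np

def quant_error(w, bits):
    """Mean-squared error of symmetric round-to-nearest (RTN) quantization."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return float(np.mean((w - q * scale) ** 2))

def pick_bits_per_layer(layers, low_bits=2, fallback_bits=4, tol=1e-2):
    """Hypothetical mixed-precision config: a layer stays at INT2 only if
    its RTN error is below `tol`; sensitive layers fall back to INT4."""
    return {name: (low_bits if quant_error(w, low_bits) < tol else fallback_bits)
            for name, w in layers.items()}

# A near-binary weight survives INT2; a Gaussian weight does not.
layers = {
    "easy": np.array([[1.0, -1.0, 1.0, -1.0]]),
    "hard": np.random.default_rng(0).normal(size=(8, 8)),
}
print(pick_bits_per_layer(layers))
```

The tolerance and the INT4 fallback here are assumptions for the sketch; in practice the sensitive-layer list is tuned per model, which is exactly why INT2 cannot be applied blindly to all weights.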
INT3 AVX2 kernel performance for LLaMa2-7B on a Core 12900K
"Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have a lot of fun. One day she set off on her own and came to a vast city."

"Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have exciting experiences. But her parents were always too busy and stressed, and they couldn't take her on trips like she wanted. So the little girl decided to take matters into her own hands."
Okay, understood. So basically INT2 will take some months. However, since the INT3 (AVX512F & AVX2) implementation is almost complete, please add INT3 support for Whisper now; I would like to compare whisper.cpp and Neural Speed. And in the long run, please also add support for 3-bit and 2-bit quantized NLLB.
@bil-ash The priority of audio models is decided by the project manager; it is not in my scope.
Type of Change
INT4 optimization for MTL will be planned as a new feature (vector kernel development). The INT2 kernels from this "add RTN INT2 sym&asym quantization" work will also be covered by that feature.
The INT2 work is not finished yet; we are only previewing its text generation results. The vector kernels should also cover INT2 weights.
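For readers unfamiliar with the sym/asym distinction in the PR title, here is a minimal NumPy sketch of RTN weight quantization in both modes (an illustration of the general technique, not the project's kernel code): symmetric uses a single scale with the zero-point fixed at 0, while asymmetric maps the full [min, max] range onto the integer grid with a zero-point, which helps skewed weight distributions.

```python
import numpy as np

def rtn_quantize(w, bits=2, sym=True):
    """Round-to-nearest fake-quantization with a per-row scale.
    Returns the dequantized weights so the error can be inspected."""
    if sym:
        # Symmetric: scale from max magnitude, zero-point = 0.
        # For INT2 the representable levels are {-2, -1, 0, 1} * scale.
        max_abs = np.max(np.abs(w), axis=-1, keepdims=True)
        scale = max_abs / (2 ** (bits - 1) - 1)
        q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q * scale
    # Asymmetric: use the full [min, max] range plus a zero-point.
    qmax = 2 ** bits - 1
    wmin = np.min(w, axis=-1, keepdims=True)
    wmax = np.max(w, axis=-1, keepdims=True)
    scale = (wmax - wmin) / qmax
    zp = np.round(-wmin / scale)
    q = np.clip(np.round(w / scale) + zp, 0, qmax)
    return (q - zp) * scale
```

With only four levels at INT2, reconstruction error is much larger than at INT4, which is consistent with the preview-only status of the INT2 results above.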