This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] Support RTN int2 weight #178

Merged
merged 23 commits into main from int2 on Apr 9, 2024

Conversation

luoyu-intel (Contributor) commented Mar 19, 2024

Type of Change

  • add RTN INT2 sym&asym quantization
  • INT3 AVX2 kernels
  • INT4 optimization for MTL

The INT4 optimization for MTL will be planned as a new feature: vector kernel development. The INT2 kernels for the RTN INT2 sym & asym quantization item will also be delivered as part of that feature.

The INT2 work is not finished yet; here we only preview its text generation results. The vector kernels should also cover INT2 weights.
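
For readers unfamiliar with RTN: it simply rounds each weight to the nearest point on a per-group grid, with either a symmetric scale or an asymmetric scale plus zero-point. Below is a minimal numpy sketch of the idea, assuming group_size divides the tensor evenly; it is illustrative only, not BesTLA's actual kernel code.

```python
import numpy as np

def rtn_quant(w, bits=2, group_size=16, sym=True):
    """Groupwise round-to-nearest quantization of a 1-D fp32 tensor."""
    qmax = (1 << (bits - 1)) - 1            # int2 sym: +1
    qmin = -(1 << (bits - 1))               # int2 sym: -2
    g = w.reshape(-1, group_size)
    if sym:
        scale = np.abs(g).max(axis=1, keepdims=True) / max(qmax, 1)
        scale = np.where(scale == 0, 1.0, scale)   # guard all-zero groups
        zero = np.zeros_like(scale)
    else:
        wmin = g.min(axis=1, keepdims=True)
        wmax = g.max(axis=1, keepdims=True)
        scale = (wmax - wmin) / (2**bits - 1)
        scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
        zero = qmin - wmin / scale                 # float zero-point
    q = np.clip(np.round(g / scale + zero), qmin, qmax)
    return q.astype(np.int8), scale, zero

def rtn_dequant(q, scale, zero):
    """Reconstruct fp32 values from quantized groups."""
    return (q.astype(np.float32) - zero) * scale

# Round-trip example: int2 asym with group_size=16, as in this PR's demo.
w = np.random.randn(4096).astype(np.float32)
q, s, z = rtn_quant(w, bits=2, group_size=16, sym=False)
w_hat = rtn_dequant(q, s, z).ravel()
```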

luoyu-intel (Contributor, Author) commented Mar 22, 2024

Sample output with RTN, weight_dtype=int2, alg=asym, group_size=16, compute_dtype=fp32:

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have the freedom to do whatever she desired without any barriers that would hinder her path like the world outside of this small village where she was born and raised.
She had read or heard about it before when she was at home in her crib the usual "Once upon a time" story for all her internal
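
For context on the footprint, int2 stores four weights per byte. Here is a hedged sketch of one possible packing, assuming the earliest value occupies the low-order bits; BesTLA's real layout may differ.

```python
import numpy as np

def pack_int2(q):
    """Pack int2 values (-2..1) into bytes, 4 per byte, low bits first."""
    u = (q.astype(np.int8) + 2).astype(np.uint8).reshape(-1, 4)  # shift to 0..3
    return (u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)).astype(np.uint8)

def unpack_int2(b):
    """Inverse of pack_int2: recover the signed int2 values."""
    u = np.stack([(b >> s) & 0x3 for s in (0, 2, 4, 6)], axis=1)
    return u.reshape(-1).astype(np.int8) - 2
```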

bil-ash commented Apr 8, 2024

This seems to support only LLMs. Are there any plans for 2-bit and 3-bit support in Whisper inference? (whisper.cpp supports 2-bit and 3-bit quantization, although inference quality is horrible at 2-bit.) In many cases having a small model is essential.

Also, please add support for NLLB (with 2-bit and 3-bit quantization). I opened an issue with this request two months ago; it was closed and there has been no recent update, so I am making this request once again.

luoyu-intel (Contributor, Author) replied:

@bil-ash 2-bit and 3-bit are both experimental; they are only kernel-ready. They have not been tested on as many models as int4 quantization has. As you can see, 2-bit cannot be applied to all weights; it requires a model-specific quantization configuration.
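
To make that last point concrete, one hypothetical heuristic for building such a model-specific configuration is to measure the int2 RTN round-trip error per tensor and fall back to int4 where it is too large. The function, threshold, and config format below are all illustrative assumptions, not Neural Speed API.

```python
import numpy as np

def int2_rtn_error(w, group_size=16):
    """Relative round-trip error of asymmetric int2 RTN on one tensor."""
    g = w.ravel().reshape(-1, group_size)
    wmin, wmax = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.where(wmax > wmin, (wmax - wmin) / 3.0, 1.0)  # 4 levels: 0..3
    deq = np.clip(np.round((g - wmin) / scale), 0, 3) * scale + wmin
    return np.linalg.norm(deq - g) / np.linalg.norm(g)

def choose_bits(weights, threshold=0.05):
    """Hypothetical per-tensor dtype map: int2 where tolerable, else int4."""
    return {name: ("int2" if int2_rtn_error(w) < threshold else "int4")
            for name, w in weights.items()}
```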

luoyu-intel (Contributor, Author) commented Apr 8, 2024

Performance of the INT3 AVX2 kernels with LLaMA2-7B on an Intel Core i9-12900K:

weight_dtype=int3, group_size=128, compute_dtype=int8, 2733 MB:
model_print_timings: prompt eval time = 421.94 ms / 32 tokens ( 13.19 ms per token)
model_print_timings: eval time = 7466.54 ms / 127 runs ( 58.79 ms per token)

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have a lot of fun. One day she set off on her own and came to a vast city.
She was surprised by the number of people that were in the city at any given time. In contrast to her town where there were only a few houses. The girl was amazed by how many buildings and houses there were in this new place. And as far as she could see and hear. She saw so many people going about their business.
She walked into one of the shops. A big black and red door. In front of it stood a large man. With a black mustache who looked angry. He grunted

weight_dtype=int4, group_size=128, compute_dtype=int8, 3522 MB:
model_print_timings: prompt eval time = 417.83 ms / 32 tokens ( 13.06 ms per token)
model_print_timings: eval time = 8342.10 ms / 127 runs ( 65.69 ms per token)

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have exciting experiences. But her parents were always too busy and stressed, and they couldn't take her on trips like she wanted. So the little girl decided to take matters into her own hands.
She started by looking for clues around her house that might lead her to a magical portal, which would transport her to amazing places. She searched high and low, until one day, she finally found it hidden behind an old bookshelf in the basement. The portal was shimmering and glowing, and the little girl could feel its magic calling to her.
Excited and a bit

int3 is ~10% faster than the int4 weight.
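
For context, the ~10% claim follows directly from the eval timings above; a quick check:

```python
# Derived from the printed eval times: int3 = 58.79 ms/token, int4 = 65.69 ms/token.
int3_ms, int4_ms = 58.79, 65.69
print(f"speedup: {int4_ms / int3_ms - 1:.1%}")   # ~11.7% faster per token
print(f"memory:  {1 - 2733 / 3522:.1%} smaller") # 2733 MB vs 3522 MB, ~22.4%
```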

luoyu-intel marked this pull request as ready for review on April 8, 2024 at 02:47
Review thread on bestla/bestla/kernel_jit.h (outdated, resolved)
bil-ash commented Apr 9, 2024

> @bil-ash 2-bit and 3-bit are both experimental; they are only kernel-ready. They have not been tested on as many models as int4 quantization has. As you can see, 2-bit cannot be applied to all weights; it requires a model-specific quantization configuration.

Okay, understood. So basically int2 will take some months. However, since the int3 (avx512f & avx2) implementation is almost complete, please add int3 support for Whisper now; I would like to compare whisper.cpp and neural-speed.

And in the long run, please also add support for 3-bit and 2-bit quantized NLLB.

luoyu-intel (Contributor, Author) replied:

@bil-ash The priority of audio models is decided by the project manager; it is outside my scope.

VincyZhang merged commit da8e2b8 into main on Apr 9, 2024
12 checks passed
luoyu-intel deleted the int2 branch on April 9, 2024 at 05:08