
[BesTLA] Improve RTN quantization accuracy of int4 and int3 #172

Merged
airMeng merged 29 commits into main from opt_int4_quant on Mar 18, 2024

Conversation

luoyu-intel (Contributor) commented Mar 13, 2024

Type of Change

Improve the quantization accuracy of BesTLA's weight-packing (packweight) quantization API.

  • Introduce auto-fullrange for NBits quantization (see the sketch after this list)
  • Add int3 rounding (round-to-nearest) conversion
  • Optimize int4 decompression on client CPUs
  • Remove S4_Fullrange, as it is already covered by auto-fullrange
  • Root-cause the Float4 performance issue on hybrid CPUs (20%+ speedup)
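
The description doesn't spell out the algorithm, so here is a minimal C++ sketch of the idea behind auto-fullrange RTN under my own assumptions (rtn_error and quantize_group_int4 are illustrative names, not BesTLA's actual API): for each quantization group, evaluate both the symmetric int4 range [-7, 7] and the full range [-8, 7], and keep whichever scale yields the lower reconstruction error.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Sum of squared reconstruction errors for a given scale and clip range.
static float rtn_error(const float* w, size_t n, float scale, int qmin, int qmax) {
  float err = 0.f;
  for (size_t i = 0; i < n; ++i) {
    int q = std::clamp(static_cast<int>(std::nearbyint(w[i] / scale)), qmin, qmax);
    float d = w[i] - static_cast<float>(q) * scale;
    err += d * d;
  }
  return err;
}

// Quantize one group of weights to signed int4 codes; returns the chosen scale.
float quantize_group_int4(const float* w, size_t n, int8_t* q) {
  float absmax = 0.f;
  for (size_t i = 0; i < n; ++i) absmax = std::max(absmax, std::fabs(w[i]));
  if (absmax == 0.f) absmax = 1.f;    // avoid division by zero on all-zero groups
  const float s_sym = absmax / 7.f;   // classic symmetric range [-7, 7]
  const float s_full = absmax / 8.f;  // fullrange: finer grid, but +absmax clips to 7
  const bool use_full =
      rtn_error(w, n, s_full, -8, 7) < rtn_error(w, n, s_sym, -7, 7);
  const float scale = use_full ? s_full : s_sym;
  const int qmin = use_full ? -8 : -7;
  for (size_t i = 0; i < n; ++i)
    q[i] = static_cast<int8_t>(std::clamp(
        static_cast<int>(std::nearbyint(w[i] / scale)), qmin, 7));
  return scale;
}
```

The trade-off: the fullrange scale absmax/8 gives a finer grid everywhere except at +absmax, which clips to 7; whether that wins depends on each group's value distribution, hence the per-group error check.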

@luoyu-intel luoyu-intel marked this pull request as draft March 13, 2024 08:47
luoyu-intel (Contributor, Author) commented Mar 13, 2024

Text generation comparison (weight_dtype=int3, group_size=128)
prompt: 'Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. '
This PR:

And every day that same little girl would put on her best dress and pear

model_print_timings:        load time =    73.59 ms
model_print_timings:      sample time =     8.07 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =    73.56 ms /    34 tokens (    2.16 ms per token)
model_print_timings:        eval time =   285.11 ms /    15 runs   (   19.01 ms per token)
model_print_timings:       total time =   370.79 ms
========== eval time log of each prediction ==========
prediction   0, time: 73.56ms
prediction   1, time: 19.65ms
prediction   2, time: 19.07ms
prediction   3, time: 18.98ms
prediction   4, time: 19.01ms
prediction   5, time: 19.08ms

Main:

Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија
model_print_timings:        load time =    73.01 ms
model_print_timings:      sample time =     7.62 ms /    16 runs   (    0.48 ms per token)
model_print_timings: prompt eval time =    72.98 ms /    34 tokens (    2.15 ms per token)
model_print_timings:        eval time =   289.00 ms /    15 runs   (   19.27 ms per token)
model_print_timings:       total time =   373.55 ms
========== eval time log of each prediction ==========
prediction   0, time: 72.98ms
prediction   1, time: 19.92ms
prediction   2, time: 19.21ms
prediction   3, time: 19.20ms
prediction   4, time: 19.23ms
prediction   5, time: 19.45ms

@luoyu-intel luoyu-intel marked this pull request as ready for review March 13, 2024 09:18
Review thread on bestla/bestla/kernel_ref.h (resolved)
@luoyu-intel luoyu-intel requested a review from airMeng March 13, 2024 09:56
luoyu-intel (Contributor, Author) commented:
@kevinintel @hshen14 INT3 RTN quantization can now generate reasonable text. This PR adds 'int3' to the supported quantization weight_dtype options.
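
The PR doesn't show how the 3-bit codes are stored, so below is a hypothetical packing scheme for illustration only (pack_int3x8/unpack_int3x8 are invented names; BesTLA's actual int3 layout may differ): eight 3-bit codes, biased from the signed range [-4, 3] to unsigned [0, 7], fit exactly into 24 bits (3 bytes).

```cpp
#include <cstdint>

// Pack eight signed 3-bit codes (range [-4, 3]) into 3 bytes.
void pack_int3x8(const int8_t* q, uint8_t* out) {
  uint32_t bits = 0;
  for (int i = 0; i < 8; ++i)
    bits |= (static_cast<uint32_t>(q[i] + 4) & 0x7u) << (3 * i);  // bias to [0, 7]
  out[0] = static_cast<uint8_t>(bits & 0xFFu);
  out[1] = static_cast<uint8_t>((bits >> 8) & 0xFFu);
  out[2] = static_cast<uint8_t>((bits >> 16) & 0xFFu);
}

// Inverse of pack_int3x8: recover the eight signed codes.
void unpack_int3x8(const uint8_t* in, int8_t* q) {
  const uint32_t bits = static_cast<uint32_t>(in[0]) |
                        (static_cast<uint32_t>(in[1]) << 8) |
                        (static_cast<uint32_t>(in[2]) << 16);
  for (int i = 0; i < 8; ++i)
    q[i] = static_cast<int8_t>((bits >> (3 * i)) & 0x7u) - 4;  // unbias to [-4, 3]
}
```

Dequantization would then multiply each recovered code by its per-group scale, just as for int4.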

@luoyu-intel luoyu-intel changed the title [BesTLA] Improve quantization accuracy of int4 and int3 [BesTLA] Improve RTN quantization accuracy of int4 and int3 Mar 13, 2024
hshen14 (Contributor) left a comment:

What's our INT3 GEMM perf vs. llama.cpp's INT3 GEMM perf?

luoyu-intel (Contributor, Author) replied:

> What's our INT3 GEMM perf vs. llama.cpp's INT3 GEMM perf?

@hshen14 llama.cpp Q3_K_S performance:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But she was always too scared to leave her house or talk to strangers
llama_print_timings:        load time =     154.97 ms
llama_print_timings:      sample time =       3.68 ms /    16 runs   (    0.23 ms per token,  4345.46 tokens per second)
llama_print_timings: prompt eval time =     382.78 ms /    34 tokens (   11.26 ms per token,    88.82 tokens per second)
llama_print_timings:        eval time =     576.09 ms /    15 runs   (   38.41 ms per token,    26.04 tokens per second)
llama_print_timings:       total time =     968.28 ms /    49 tokens

19 ms vs. 38 ms per-token eval time (this PR's int3 vs. llama.cpp Q3_K_S), roughly a 2x speedup.

@airMeng airMeng merged commit a90aea7 into main Mar 18, 2024
12 checks passed
@zhewang1-intc zhewang1-intc deleted the opt_int4_quant branch May 6, 2024 07:24