4-bit KV Cache #5932
Replies: 8 comments 18 replies
- That is huge! Hope you can implement it in llama.cpp :)
- Here are some benchmarks and more information: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
- Very interesting. What do @ggerganov @JohannesGaessler and @slaren think about these results? The current consensus is that a 4-bit KV cache isn't worth it because the increase in perplexity would be too severe. However, that doesn't seem to be the case with Turboderp's implementation. I wonder what llama.cpp can learn from it.
- An integral part of the good performance of turboderp's KV-cache quantization is a Hadamard transform that smooths the K/V distribution before quantization.
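For anyone wondering what that buys you, here is a minimal, self-contained sketch of the idea (this is not exllamav2's code; the shapes, group size, and the scipy Hadamard helper are purely illustrative): rotate the cache tensor with a normalized Hadamard matrix, quantize group-wise to 4 bits, and rotate back after dequantization. The rotation spreads outlier channels across the head dimension, so the per-group scales waste less precision.

```python
# Illustrative sketch only -- not exllamav2's implementation.
# Idea: an orthogonal Hadamard rotation spreads outlier channels across the
# head dimension, so group-wise 4-bit scales lose less precision on them.
import numpy as np
from scipy.linalg import hadamard

def quantize_q4(x, group_size=32):
    """Symmetric 4-bit quantization per group of `group_size` values."""
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_q4(codes, scales, shape):
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 128                                   # must be a power of two for hadamard()
k = rng.normal(size=(256, head_dim)).astype(np.float32)  # toy K cache: (tokens, head_dim)
k[:, 3] *= 20.0                                  # simulate one outlier channel

H = hadamard(head_dim).astype(np.float32) / np.sqrt(head_dim)  # orthogonal: H @ H.T == I

# Plain 4-bit quantization vs. quantize-after-rotation; compare reconstruction error.
k_plain = dequantize_q4(*quantize_q4(k), k.shape)
k_rot   = dequantize_q4(*quantize_q4(k @ H), k.shape) @ H.T    # rotate back after dequant

print("rmse plain  :", np.sqrt(np.mean((k - k_plain) ** 2)))
print("rmse rotated:", np.sqrt(np.mean((k - k_rot) ** 2)))
```

Running it should show a noticeably lower reconstruction error for the rotated path, which is exactly the effect the Hadamard transform exploits.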
- Now that #5021 is merged, I'd love to see this; we have a lot of users who are eager for a 4-bit KV cache to help save VRAM.
- This might be useful:
- Howdy 👋, has anyone measured the perplexity / performance of llama.cpp's q4_0 / q8_0 K/V cache and could share their results? I've been running it with q8_0 for the past day and the VRAM savings really are excellent: it brings memory usage in line with ExllamaV2. While I haven't noticed any quality degradation, I've only been using newer models I'm not that familiar with, so I'd love to see some perplexity and performance measurements if you've taken any. Sorry if I missed this in the docs somewhere 😅 (I've got a PR up to Ollama to add support for this.)
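If anyone wants to take those measurements themselves, llama.cpp's perplexity tool accepts the same KV-cache-type options as the other examples, so a run along these lines should work. Treat it as a sketch rather than a verified recipe: the binary name differs between builds (`perplexity` vs. `llama-perplexity`), the model and dataset paths below are placeholders, and as far as I know a quantized V cache requires flash attention.

```bash
# Hedged example: perplexity on wikitext-2 with a quantized KV cache.
# -ctk / -ctv select the K and V cache types; -fa enables flash attention,
# which is needed for a quantized V cache. Paths are placeholders.
./llama-perplexity -m ./model.gguf -f ./wikitext-2-raw/wiki.test.raw \
    -fa -ctk q8_0 -ctv q8_0
```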
- On Qwen2-7B, q4_0 produces weird results. q8_0 is ok, but I haven't compared its performance against f16.
- Turboderp, the developer of Exllama V2, has made a breakthrough: a 4-bit KV cache that seemingly performs on par with FP16. In his words:
  "I'm working on some benchmarks at the moment, but they're taking a while to run. Preliminary results show the Q4 cache mode is more precise overall than FP8, and comparable to full precision. HumanEval tests are still running."
  https://www.reddit.com/r/LocalLLaMA/comments/1b9571u/80k_context_possible_with_cache_4bit/
  This is huge. Sadly, llama.cpp doesn't even have a full 8-bit cache right now (only the K cache), so in that respect there's a lot of potential for improvement.