4-bit KV Cache #5932
Replies: 8 comments 18 replies
- That is huge! Hope you can implement it in llama.cpp :)
- Here are some benchmarks and more information: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
- Very interesting. What do @ggerganov @JohannesGaessler and @slaren think about these results? The current consensus is that a 4-bit KV cache isn't worth it because the increase in perplexity would be too severe. However, that doesn't seem to be the case with Turboderp's implementation. I wonder what llama.cpp can learn from it.
- An integral part of the good performance of turboderp's KV-cache quantization is a Hadamard transform that smooths the K/V distribution before quantization.
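For anyone wondering what that buys you, here is a minimal, self-contained sketch of the idea (this is not exllamav2's code; the shapes, group size, and the scipy Hadamard helper are purely illustrative): rotate the cache tensor with a normalized Hadamard matrix, quantize group-wise to 4 bits, and rotate back after dequantization. The rotation spreads outlier channels across the head dimension, so the per-group scales waste less precision.

```python
# Illustrative sketch only -- not exllamav2's implementation.
# Idea: an orthogonal Hadamard rotation spreads outlier channels across the
# head dimension, so group-wise 4-bit scales lose less precision on them.
import numpy as np
from scipy.linalg import hadamard

def quantize_q4(x, group_size=32):
    """Symmetric 4-bit quantization per group of `group_size` values."""
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_q4(codes, scales, shape):
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 128                                   # must be a power of two for hadamard()
k = rng.normal(size=(256, head_dim)).astype(np.float32)  # toy K cache: (tokens, head_dim)
k[:, 3] *= 20.0                                  # simulate one outlier channel

H = hadamard(head_dim).astype(np.float32) / np.sqrt(head_dim)  # orthogonal: H @ H.T == I

# Plain 4-bit quantization vs. quantize-after-rotation; compare reconstruction error.
k_plain = dequantize_q4(*quantize_q4(k), k.shape)
k_rot   = dequantize_q4(*quantize_q4(k @ H), k.shape) @ H.T    # rotate back after dequant

print("rmse plain  :", np.sqrt(np.mean((k - k_plain) ** 2)))
print("rmse rotated:", np.sqrt(np.mean((k - k_rot) ** 2)))
```

Running it should show a noticeably lower reconstruction error for the rotated path, which is exactly the effect the Hadamard transform exploits.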
- Now that #5021 is merged, I'd love to see this; we have a lot of users who are eager for a 4-bit KV cache to help save VRAM.
- This might be useful:
- Howdy 👋, has anyone measured the perplexity / performance of llama.cpp's q4_0 / q8_0 K/V cache and could share their results? I've been running it with q8_0 for the past day and the VRAM savings really are excellent: it brings memory usage in line with ExllamaV2. While I haven't noticed any quality degradation, I've only been using newer models I'm not that familiar with, so I'd love to see some perplexity and performance measurements if you've taken any. Sorry if I missed this in the docs somewhere 😅 (I've got a PR up to Ollama to add support for this.)
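If anyone wants to take those measurements themselves, llama.cpp's perplexity tool accepts the same KV-cache-type options as the other examples, so a run along these lines should work. Treat it as a sketch rather than a verified recipe: the binary name differs between builds (`perplexity` vs. `llama-perplexity`), the model and dataset paths below are placeholders, and as far as I know a quantized V cache requires flash attention.

```bash
# Hedged example: perplexity on wikitext-2 with a quantized KV cache.
# -ctk / -ctv select the K and V cache types; -fa enables flash attention,
# which is needed for a quantized V cache. Paths are placeholders.
./llama-perplexity -m ./model.gguf -f ./wikitext-2-raw/wiki.test.raw \
    -fa -ctk q8_0 -ctv q8_0
```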
- On Qwen2-7B, q4_0 produces weird results. q8_0 is ok, but I haven't compared its performance against f16.
- Turboderp, the developer of Exllama V2, has made a breakthrough: a 4-bit KV cache that seemingly performs on par with FP16. In his words:
  "I'm working on some benchmarks at the moment, but they're taking a while to run. Preliminary results show the Q4 cache mode is more precise overall than FP8, and comparable to full precision. HumanEval tests are still running."
  https://www.reddit.com/r/LocalLLaMA/comments/1b9571u/80k_context_possible_with_cache_4bit/
  This is huge. Sadly, llama.cpp doesn't even have a full 8-bit cache right now (only the K cache), so in that respect there's a lot of potential for improvement.