-
I read the paper but it's still not clear to me whether L²QER changes anything at inference time. Please correct me if I'm wrong, but my understanding is that we first do LQER by running SVD on the quantization error to get a LoRA, then further optimize it via L²QER (around page 6 of the paper). Roughly like the sketch below.
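For what it's worth, here is how I read the LQER step as a minimal NumPy sketch. `W_q` stands for the dequantized weights after the quant/dequant round-trip of whatever scheme is used; the function name and shapes are my own assumptions, not the paper's code:

```python
import numpy as np

def lqer_decompose(W: np.ndarray, W_q: np.ndarray, rank: int):
    """Approximate the quantization error E = W - W_q with a rank-r product A @ B.

    W   : original float weights, shape (d_out, d_in)
    W_q : dequantized weights (after the quant/dequant round-trip), same shape
    """
    E = W - W_q                      # quantization error
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]       # (d_out, r), singular values folded into A
    B = Vt[:rank, :]                 # (r, d_in)
    return A, B

# At inference the corrected weight is W_q + A @ B, applied as two small
# matmuls like a LoRA adapter: y = x @ W_q.T + (x @ B.T) @ A.T
```

So if that reading is right, inference does change, but only in the same cheap way any LoRA adapter does.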
Anyway, FYI: someone has already done a Python implementation of the GGML quants in the HF transformers library. Hopefully this can help your implementation.
I'm really interested in this, because control vectors could benefit from SVD (we're currently using PCA).
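Not sure it transfers directly, but as a sketch of the difference, assuming you have a matrix `D` of per-sample hidden-state differences (one row per sample), the two approaches only differ by centering:

```python
import numpy as np

def direction_via_pca(D: np.ndarray) -> np.ndarray:
    """Top principal component of the rows of D, shape (n_samples, d_model)."""
    Dc = D - D.mean(axis=0)              # PCA centers the data first
    cov = Dc.T @ Dc / (len(Dc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]                # eigenvector of the largest eigenvalue

def direction_via_svd(D: np.ndarray) -> np.ndarray:
    """Top right-singular vector of D; same as PCA when D is already centered."""
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[0]
```

i.e. the practical difference is just whether the mean is subtracted before taking the top direction.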
Yes, I think it would be useful to know if we should call … The current …
-
If it's any use, I tidied up the Mergekit code to extract LoRAs from fine-tuned models. It works on all but the weirdest models now (eg: …). Might be useful to make a reference version using PyTorch and then compare it against a native C/C++ reimplementation.

EDIT: The only part of Mergekit it's using is the "lazy tensor loading"; other than that it's mostly based on https://github.com/thomasgauthier/LoRD (I'm not sure how it ended up part of Mergekit).
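For reference, the core of that kind of extraction (as I understand LoRD) boils down to an SVD of the weight delta. A minimal PyTorch sketch of what a reference version could look like; the function name and the A/B convention are assumptions, not the actual Mergekit/LoRD code:

```python
import torch

def extract_lora(W_base: torch.Tensor, W_ft: torch.Tensor, rank: int):
    """Extract a rank-r LoRA (A, B) from a fine-tuned weight delta.

    Reconstructs W_ft ≈ W_base + B @ A, using the usual LoRA convention
    where A is (r, d_in) and B is (d_out, r).
    """
    delta = (W_ft - W_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B = U[:, :rank] * S[:rank]           # (d_out, r), singular values folded in
    A = Vh[:rank, :]                     # (r, d_in)
    return A, B
```

A C/C++ reimplementation could then be checked against this by comparing `B @ A` to the reference output on the same tensors.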
-
Interesting work, it will be cool to try implementing this approach and see how the perplexity improves for different ranks. I'm looking at some of the results in the paper and I'm not sure how to interpret Appendix B: based on the graph, it seems L²QER performs worse (i.e. higher error) than LQER, while the text states the opposite. Am I reading it wrong?
-
Since the recent LoRA refactor by @ngxson in #8332, I think it should be possible to improve existing quantization schemes with Low-Rank Quantization Error Reconstruction (see https://arxiv.org/abs/2402.02446).

It would only need two things:

- …
- …

Python (de)quantization of the quant types is available in gguf-py/gguf/quants.py to make this easier. And also, I think L²QER could be implemented with the existing imatrix files.
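A rough sketch of what I mean by that last point. My reading of the paper is that L²QER only changes plain LQER by taking the SVD on the error scaled per input channel by activation statistics, which is exactly the kind of data an imatrix holds. This assumes the imatrix entries are mean squared activations per column of W; the function name and the epsilon are mine:

```python
import numpy as np

def l2qer_decompose(W: np.ndarray, W_q: np.ndarray,
                    imatrix: np.ndarray, rank: int):
    """LQER with activation-aware scaling (my reading of L²QER).

    imatrix : per-input-channel activation statistics, shape (d_in,),
              assumed here to be mean squared activations.
    """
    s = np.sqrt(imatrix) + 1e-8          # per-column scale, avoid divide-by-zero
    E = W - W_q                          # quantization error, shape (d_out, d_in)
    # SVD of the scaled error, so channels with large activations are
    # approximated more accurately
    U, S, Vt = np.linalg.svd(E * s, full_matrices=False)
    A = U[:, :rank] * S[:rank]           # (d_out, r)
    B = Vt[:rank, :] / s                 # (r, d_in), undo the scaling on the way out
    return A, B                          # W ≈ W_q + A @ B
```

If that's right, no new calibration pass would be needed; existing imatrix files would supply `imatrix` directly.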