Replies: 6 comments 10 replies
-
IMO, if this approach could be applied to the GPU code without loss of performance, it would be worth updating the data format. Pinging @ikawrakow
-
So, I didn't put too much effort into optimizing the CPU code (the thinking being that one uses these low-bit quantization types mainly because one wants to fit a model into the VRAM of the GPU one has available). But I did try the bit-to-byte conversion trick on AVX2 and, if I remember correctly, did not see a significant difference in performance.

@PeterReid What is the model size for the timings you are reporting?

@ggerganov I don't see how the approach applies to the GPU (CUDA or Metal). Doesn't one need to implement all kernels on all supported platforms first, show performance gains (or at least no performance degradation) for all of them, and only then decide to change the data format? There are quite a few.
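For readers who haven't seen it, one common form of such a bit-to-byte conversion is to broadcast the packed sign byte and compare it against per-lane bit selectors. This is only an illustrative sketch of the general pattern (shown at SSE width for brevity; the function and variable names are mine, not from llama.cpp), not necessarily the code that was benchmarked:

```c
#include <immintrin.h>
#include <stdint.h>

// Expand an 8-bit sign mask into 0x00/0xFF bytes, entirely in-register
// (the same 8-byte result is duplicated across both halves of the register).
// An AVX2 version is identical in spirit, just on 32 lanes with _mm256_* intrinsics.
static inline __m128i expand_sign_bits(uint8_t mask) {
    const __m128i m    = _mm_set1_epi8((char)mask);                        // broadcast the mask byte
    const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, (char)128,
                                       1, 2, 4, 8, 16, 32, 64, (char)128); // one selector bit per lane
    return _mm_cmpeq_epi8(_mm_and_si128(m, bits), bits);                   // 0xFF where the bit is set
}
```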
-
IQ2_XS is kind of slow on my P40, but it's still quite a bit faster than Q2_K_S on my 6700XT and 7900XTX.
-
I have been working on the CUDA vec_dot_iq2_xs_q8_1 that uses this formula and otherwise keeps things vectorized. The types are not right, because I am actually just doing this in regular C, but I think the idea is right, and the functionality is right, at least in my C model of CUDA. This removes the inner loop over each sign bit, which I think should make things a bit faster.
(The existing implementation is here for comparison.)
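As a rough illustration of the idea (a plain-C sketch with made-up names, not the actual kernel): the table lookup and the per-sign-bit branch disappear, and in the CUDA/AVX2 versions the eight lanes below would be processed in parallel rather than in a scalar loop. This assumes the re-ordered sign encoding discussed elsewhere in the thread, where `x ^ (x << 1)` directly yields the sign byte.

```c
#include <stdint.h>

// Dot product of 8 quantized weights with 8 int8 activations, applying signs
// without the ksigns table and without a branch per sign bit.
// 'grid' holds the 8 unsigned magnitudes, 'q8' the activations, and 'code' is the
// 7-bit sign index from the block (all names here are assumptions for the sketch).
static inline int dot8_signed(const uint8_t *grid, const int8_t *q8, uint8_t code) {
    const uint8_t s = code ^ (uint8_t)(code << 1); // sign byte: bit j set => weight j is negative
    int sumi = 0;
    for (int j = 0; j < 8; ++j) {
        const int neg = (s >> j) & 1;
        const int w   = ((int)grid[j] ^ -neg) + neg; // branch-free conditional negation
        sumi += q8[j] * w;
    }
    return sumi;
}
```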
-
Running Mistral-7B quantized with IQ2_XS:
So: a nice speedup for a small number of threads, and a negligible performance gain once the calculation becomes memory bound at a sufficient number of threads.
-
@PeterReid Can you try #5187 on your CPUs? Thanks.
-
Hi,
I have been working on accelerating IQ2_XS's dot product and have gotten eval time down to about 75% of what it was on my machine. The commit showing my work is here: PeterReid@52b2738.
My idea is to vectorize the computation of s2_1 and s2_2 in ggml_vec_dot_iq2_xs_q8_K. Before this commit, building those required picking the XMM registers apart into regular registers in 8 separate pieces, doing a memory reference for each of those pieces into the table that maps sign encodings to their meanings, and then assembling those pieces back together, one at a time, into YMM registers.
After the commit, the computation stays in the AVX2 registers. Rather than doing lookups in the table, it uses the fact that the bytes 0..127 (inputs to the sign lookup table) can be mapped to bytes with an even number of bits set (outputs of the sign table) with the formula `x ^ (x << 1)`.

Unfortunately, the mapping doesn't have the same order as the original sign table, so I have to modify the .gguf file for it to work after this patch. The ordering of the sign table is basically arbitrary, but it would obviously be bad to break existing .gguf files. That may make this a non-starter, but I figured it was worth showing anyway. (The commented-out nonsense in llama.cpp that you may see in the commit is doing the conversion.)
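A quick standalone check of the parity claim (my own snippet, not part of the patch): every 7-bit input maps to a byte with an even number of set bits, and the 128 outputs are all distinct, so the formula really is just a re-ordering of the 128 even-parity sign bytes.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int seen[256] = {0};
    for (int x = 0; x < 128; ++x) {
        const uint8_t y = (uint8_t)(x ^ (x << 1));
        // __builtin_popcount is GCC/Clang-specific; any popcount works here.
        if (__builtin_popcount(y) & 1) { printf("odd parity at x=%d\n", x); return 1; }
        if (seen[y]++)                 { printf("duplicate output at x=%d\n", x); return 1; }
    }
    printf("all 128 outputs have even parity and are distinct\n");
    return 0;
}
```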
The machine I'm working on is an elderly i5-4300U, so your results may be different.
What I've done so far is specific to AVX2, but it seems like the same idea could be used on most platforms, and for IQ2_XXS.
My timings: