Replies: 6 comments 10 replies
-
IMO, if this approach could be applied to the GPU code without loss of performance, it would be worth updating the data format. Pinging @ikawrakow
-
So, I didn't put too much effort into optimizing the CPU code (the thinking being that one uses these low-bit quantization types mainly because one wants to fit a model into the VRAM of the GPU one has available). But I did try the bit-to-byte conversion trick on AVX2 and, if I remember correctly, did not see a significant difference in performance.

@PeterReid What is the model size for the timings you are reporting?

@ggerganov I don't see how the approach applies to the GPU (CUDA or Metal). Doesn't one need to implement all kernels on all supported platforms first, show performance gains (or at least no performance degradation) for all of them, and only then decide to change the data format? There are quite a few.
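For readers who haven't seen it, one common form of such a bit-to-byte conversion is to broadcast the packed sign byte and compare it against per-lane bit selectors. This is only an illustrative sketch of the general pattern (shown at SSE width for brevity; the function and variable names are mine, not from llama.cpp), not necessarily the code that was benchmarked:

```c
#include <immintrin.h>
#include <stdint.h>

// Expand an 8-bit sign mask into 0x00/0xFF bytes, entirely in-register
// (the same 8-byte result is duplicated across both halves of the register).
// An AVX2 version is identical in spirit, just on 32 lanes with _mm256_* intrinsics.
static inline __m128i expand_sign_bits(uint8_t mask) {
    const __m128i m    = _mm_set1_epi8((char)mask);                        // broadcast the mask byte
    const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, (char)128,
                                       1, 2, 4, 8, 16, 32, 64, (char)128); // one selector bit per lane
    return _mm_cmpeq_epi8(_mm_and_si128(m, bits), bits);                   // 0xFF where the bit is set
}
```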
-
IQ2_XS is kind of slow on my P40, but it's still quite a bit faster than Q2_K_S on my 6700XT and 7900XTX.
-
I have been working on the CUDA vec_dot_iq2_xs_q8_1 that uses this formula and otherwise keeps things vectorized. The types are not right, because I am actually just doing this in regular C, but I think the idea is right, and the functionality is right, at least in my C model of CUDA. This removes the inner loop over each sign bit, which I think should make things a bit faster.
(The existing implementation is here for comparison.)
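As a rough illustration of the idea (a plain-C sketch with made-up names, not the actual kernel): the table lookup and the per-sign-bit branch disappear, and in the CUDA/AVX2 versions the eight lanes below would be processed in parallel rather than in a scalar loop. This assumes the re-ordered sign encoding discussed elsewhere in the thread, where `x ^ (x << 1)` directly yields the sign byte.

```c
#include <stdint.h>

// Dot product of 8 quantized weights with 8 int8 activations, applying signs
// without the ksigns table and without a branch per sign bit.
// 'grid' holds the 8 unsigned magnitudes, 'q8' the activations, and 'code' is the
// 7-bit sign index from the block (all names here are assumptions for the sketch).
static inline int dot8_signed(const uint8_t *grid, const int8_t *q8, uint8_t code) {
    const uint8_t s = code ^ (uint8_t)(code << 1); // sign byte: bit j set => weight j is negative
    int sumi = 0;
    for (int j = 0; j < 8; ++j) {
        const int neg = (s >> j) & 1;
        const int w   = ((int)grid[j] ^ -neg) + neg; // branch-free conditional negation
        sumi += q8[j] * w;
    }
    return sumi;
}
```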
-
Running Mistral-7B quantized with IQ2_XS:
So: a nice speedup for a small number of threads, and a negligible performance gain once the calculation becomes memory bound at a sufficient number of threads.
-
@PeterReid Can you try #5187 on your CPUs? Thanks.
-
Hi,
I have been working on accelerating IQ2_XS's dot product and have gotten eval time down to about 75% of what it was on my machine. The commit showing my work is here: PeterReid@52b2738.
My idea is to vectorize the computation of s2_1 and s2_2 in ggml_vec_dot_iq2_xs_q8_K. Before this commit, building those required picking the XMM registers apart into regular registers in 8 separate pieces, doing a memory reference for each of those pieces into the table that maps sign encodings to their meanings, and then assembling those pieces back together, one at a time, into YMM registers.
After the commit, the computation stays in the AVX2 registers. Rather than doing lookups in the table, it uses the fact that the bytes 0..127 (inputs to the sign lookup table) can be mapped to bytes with an even number of bits set (outputs of the sign table) with the formula `x ^ (x << 1)`.

Unfortunately, the mapping doesn't have the same order as the original sign table, so I have to modify the .gguf file for it to work after this patch. The ordering of the sign table is basically arbitrary, but it would obviously be bad to break existing .gguf files. That may make this a non-starter, but I figured it was worth showing anyway. (The commented-out nonsense in llama.cpp that you may see in the commit is doing the conversion.)
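A quick standalone check of the parity claim (my own snippet, not part of the patch): every 7-bit input maps to a byte with an even number of set bits, and the 128 outputs are all distinct, so the formula really is just a re-ordering of the 128 even-parity sign bytes.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int seen[256] = {0};
    for (int x = 0; x < 128; ++x) {
        const uint8_t y = (uint8_t)(x ^ (x << 1));
        // __builtin_popcount is GCC/Clang-specific; any popcount works here.
        if (__builtin_popcount(y) & 1) { printf("odd parity at x=%d\n", x); return 1; }
        if (seen[y]++)                 { printf("duplicate output at x=%d\n", x); return 1; }
    }
    printf("all 128 outputs have even parity and are distinct\n");
    return 0;
}
```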
The machine I'm working on is an elderly i5-4300U, so your results may be different.
What I've done so far is specific to AVX2, but it seems like the same idea could be used on most platforms, and for IQ2_XXS.
My timings: