What is the purpose of `GGML_F32_STEP` and `GGML_F16_STEP`? #386

abitofevrything · 2023-01-07T20:50:57Z

I can tell they're used to compute GGML_F32_ARR, which is then used to batch calls to the SIMD methods. But what is the purpose of this?
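
For readers following along, the relationship between the three macros can be sketched as below. The `DEMO_` names are hypothetical stand-ins, and the concrete values (4 floats per register, a step of 32) are assumptions chosen to resemble a 128-bit SIMD build, not values taken from this thread.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the ggml macros; the values are assumed. */
#define DEMO_F32_EPR  4                              /* elements per register */
#define DEMO_F32_STEP 32                             /* elements consumed per outer-loop iteration */
#define DEMO_F32_ARR  (DEMO_F32_STEP / DEMO_F32_EPR) /* registers used per outer-loop iteration */

/* Offset of the first element loaded into register j
   during the outer iteration that starts at element i. */
static size_t demo_load_offset(size_t i, size_t j) {
    return i + j * DEMO_F32_EPR;
}
```

Under these assumed values, each outer iteration fills 8 registers, and register 3 of the iteration starting at element 32 loads from offset 44.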

As an example, let's look at ggml_vec_dot_f32, with GGML_SIMD enabled:

inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);

            sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}

For starters, we can flatten the two main loops into a single one and simplify the index computation:

inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_EPR) {
        // i advances in elements, so convert it to a register count before wrapping
        int j = (i / GGML_F32_EPR) % GGML_F32_ARR;

        ax[j] = GGML_F32_VEC_LOAD(x + i);
        ay[j] = GGML_F32_VEC_LOAD(y + i);

        sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
    }

    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}
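
When the nested loops are flattened, the accumulator slot has to be derived from how many registers have been processed, not from the raw element offset. A minimal sketch of that mapping, using hypothetical `DEMO_SLOT_` constants with assumed values:

```c
/* Assumed values for illustration only. */
#define DEMO_SLOT_EPR 4  /* elements per register */
#define DEMO_SLOT_ARR 8  /* number of accumulator registers */

/* Accumulator slot for flat loop index i, where i advances by DEMO_SLOT_EPR. */
static int demo_slot(int i) {
    return (i / DEMO_SLOT_EPR) % DEMO_SLOT_ARR;
}
```

With these values the slots cycle 0, 1, ..., 7 and wrap back to 0 at `i = 32`, which matches the `j` index of the original nested loop.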

Now, it looks like we don't really need ax, ay or sum to be arrays; single variables will do. We also need a new way to reduce that single variable into sumf, which I've just hacked together by padding a temporary array with zeros. We can improve this later.
We can also compute np without relying on GGML_F32_STEP: it's simply n - (n % GGML_F32_EPR).

inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = n - (n % GGML_F32_EPR);

    GGML_F32_VEC sum = GGML_F32_VEC_ZERO;

    for (int i = 0; i < np; i += GGML_F32_EPR) {
        GGML_F32_VEC ax = GGML_F32_VEC_LOAD(x + i);
        GGML_F32_VEC ay = GGML_F32_VEC_LOAD(y + i);

        sum = GGML_F32_VEC_FMA(sum, ax, ay);
    }

    // temporary hack: pad with zero vectors so the array-based reduce macro can be reused
    GGML_F32_VEC __temp_for_sum[GGML_F32_ARR]  = { GGML_F32_VEC_ZERO };
    __temp_for_sum[0] = sum;
    GGML_F32_VEC_REDUCE(sumf, __temp_for_sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}
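
The two ways of computing `np` that appear in this post can be compared in isolation. This is a small sketch with hypothetical helper names; the bitmask form only works when the step is a power of two, which is assumed here.

```c
/* Round n down to a multiple of step using a bitmask.
   Only valid when step is a power of two (assumed). */
static int demo_round_down_mask(int n, int step) {
    return n & ~(step - 1);
}

/* Round n down to a multiple of step using modulo.
   Works for any positive step. */
static int demo_round_down_mod(int n, int step) {
    return n - (n % step);
}
```

For `n = 100`, rounding to a step of 32 leaves 4 elements for the scalar leftover loop, while rounding to a register width of 4 leaves none, so the single-register version hands fewer elements to the scalar tail.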

Now ggml_vec_dot_f32 no longer depends on GGML_F32_STEP (except indirectly through GGML_F32_ARR in the reduction, but that's a temporary hack). We can repeat this process for all other functions that use GGML_F32_STEP, GGML_F16_STEP, GGML_F32_ARR or GGML_F16_ARR.

The results?

Before

$ time ./main -f samples/jfk.wav -m models/ggml-tiny.bin
...
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
...

whisper_print_timings:     load time =   628.13 ms
whisper_print_timings:      mel time =   194.54 ms
whisper_print_timings:   sample time =    12.82 ms
whisper_print_timings:   encode time = 14896.71 ms / 3724.18 ms per layer
whisper_print_timings:   decode time =  2720.81 ms / 680.20 ms per layer
whisper_print_timings:    total time = 18456.29 ms
./main -f samples/jfk.wav -m models/ggml-tiny.bin  56.26s user 1.12s system 309% cpu 18.536 total

After

$ time ./main -f samples/jfk.wav -m models/ggml-tiny.bin
...
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
...

whisper_print_timings:     load time =   481.98 ms
whisper_print_timings:      mel time =   315.16 ms
whisper_print_timings:   sample time =    12.57 ms
whisper_print_timings:   encode time = 14513.76 ms / 3628.44 ms per layer
whisper_print_timings:   decode time =  2240.87 ms / 560.22 ms per layer
whisper_print_timings:    total time = 17567.01 ms
./main -f samples/jfk.wav -m models/ggml-tiny.bin  61.11s user 0.89s system 351% cpu 17.663 total

So, using the GGML_*_STEP doesn't seem to have an impact on performance on my system (the slight speed increase isn't consistent). However, it does make the code more complicated to understand and therefore optimise.

So why don't we remove it? Is there a performance reason behind it that isn't visible on my system? I'm running an Intel Celeron N4120 with SSE3 and BLAS.

I'd appreciate it if someone could test this on a PC that performs better than my potato. A version of the code with the changes I made above can be found at https://github.com/abitofevrything/whisper.cpp/tree/remove_step. Note that I have not made the changes necessary for POWER9, as I couldn't find enough documentation online on how to reimplement GGML_F16_VEC_LOAD without the i parameter.

RndyP · 2023-01-07T21:46:03Z

I was working on ggml_vec_dot_f16() this morning. Did not get a significant improvement by flattening the nested loop. Here's what I came up with, but my reduction piece is not ready for prime time.

inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    ggml_float sumf = 0.0;
    int i = 0;

#if defined(GGML_SIMD)
    GGML_F16_VEC sum = GGML_F16_VEC_ZERO;

    for (i = 0; i <= n - GGML_F16_EPR; i += GGML_F16_EPR) {
        sum = GGML_F16_VEC_FMA(sum, GGML_F16_VEC_LOAD(x + i, 0), GGML_F16_VEC_LOAD(y + i, 0));
    }

    // reduce (m256_f32 is an MSVC-specific member of __m256, so this part is not portable)
    if (i > 0) {
        for (int j = 0; j < GGML_F16_EPR; j++) {
            sumf += sum.m256_f32[j];
        }
    }
#endif

    // leftovers
    for (; i < n; i++) {
        sumf += GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]);
    }

    *s = sumf;
}

1 reply

abitofevrything Jan 7, 2023
Author

Yeah, flattening the loop doesn't really have any performance benefit (TBH the compiler probably does it anyways), but it does highlight the fact that ax and ay really don't have to be arrays nor does GGML_F16_STEP have much use :)

abitofevrything · 2023-01-08T19:09:13Z

I've just added support back for POWER9 (I think).

@ggerganov, hope you don't mind the mention, but do you have any explanation for GGML_F32_STEP and GGML_F16_STEP?

1 reply

ggerganov Jan 9, 2023
Maintainer

The way I understand this (I'm not an expert on SIMD) is that this way we utilise all available registers for the corresponding instruction set (AVX, NEON, etc). Not sure why you don't see a difference when using SSE3. Maybe try disabling BLAS so that all matrix multiplications go through the ggml routines.

On my M1, your version is 2x slower compared to master in the Encoder.
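
One common reading of "utilise all available registers" is that several independent accumulators let a pipelined CPU overlap multiply-add operations, whereas a single accumulator forces each addition to wait for the previous one. The sketch below illustrates that idea in plain scalar C; it is not ggml code, the `demo_` names are hypothetical, and the accumulator count of 4 is an assumption. Whether it explains the M1 result above is not confirmed in this thread.

```c
#include <stddef.h>

#define DEMO_DOT_ARR 4  /* number of independent accumulators (assumed) */

/* One accumulator: every addition depends on the result of the previous one. */
static float demo_dot_single(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* DEMO_DOT_ARR accumulators: the dependency chains are independent of each
   other, so the CPU may overlap them. They are combined once at the end. */
static float demo_dot_multi(const float *x, const float *y, size_t n) {
    float sum[DEMO_DOT_ARR] = { 0.0f };
    const size_t np = n - (n % DEMO_DOT_ARR);

    for (size_t i = 0; i < np; i += DEMO_DOT_ARR) {
        for (size_t j = 0; j < DEMO_DOT_ARR; ++j) {
            sum[j] += x[i + j] * y[i + j];
        }
    }

    float total = 0.0f;
    for (size_t j = 0; j < DEMO_DOT_ARR; ++j) {
        total += sum[j];
    }
    for (size_t i = np; i < n; ++i) {
        total += x[i] * y[i];  /* leftovers */
    }
    return total;
}
```

Both functions compute the same dot product, although on general floating-point input the results can differ in the last bits because the additions happen in a different order. Timing the two on a given machine would be one way to check whether the extra accumulators matter there.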
