Allows to register OPs in the CPU backend. #10350

Djip007 · 2024-11-17T02:05:32Z


Creating a full backend is a lot of work. Some existing backends only need to implement a few optimised OPs, like BLAS, AMX, ... or tinyBLAS.

I have run some tests using the RDNA3 iGPU of an AMD CPU (AMD Ryzen 9 7940HS w/ Radeon 780M Graphics).
I want to create a high-speed GEMM for FP8 on CPU, but for good speed we need to use a more "classic" matmul kernel, like the 5-level BLIS structure.
When the XDNA driver is available on Linux, I'd like to have a look at this NPU.

A full backend is nice for a discrete accelerator, but too much work (copies) for an integrated accelerator that uses CPU memory.

Next, there is the "idea" of building multiple CPU backends and using the "best" one supported by the current CPU.

So my feeling is that it may be good to have the possibility to "register" OPs, and to select at runtime those which are possible and best for the current computation. Maybe let the user choose some of them...
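To make the idea concrete, here is a minimal sketch of what such a runtime OP registry could look like. Every name in it (`op_impl`, `register_op`, `select_op`, the kernel signature) is hypothetical and not part of ggml; it only illustrates "register several implementations, keep the ones usable on this machine, pick the best, optionally let the user force one".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical kernel signature: C[m x n] = A[m x k] * B[k x n], row-major floats.
typedef void (*matmul_fn)(const float *a, const float *b, float *c, int m, int n, int k);

// One registered implementation of an OP (illustrative, not a ggml struct).
typedef struct {
    const char *name;                 // label the user could select by
    int         priority;             // higher wins when several are usable
    bool      (*is_supported)(void);  // runtime check: CPU features, driver, ...
    matmul_fn   run;
} op_impl;

#define MAX_IMPLS 16
static op_impl g_impls[MAX_IMPLS];
static size_t  g_n_impls = 0;

// Add an implementation to the registry; returns false when the table is full.
bool register_op(op_impl impl) {
    if (g_n_impls == MAX_IMPLS) return false;
    g_impls[g_n_impls++] = impl;
    return true;
}

// Return the usable implementation with the highest priority, or NULL.
// If `forced` is non-NULL, only an implementation with that name is considered.
const op_impl *select_op(const char *forced) {
    const op_impl *best = NULL;
    for (size_t i = 0; i < g_n_impls; i++) {
        const op_impl *cur = &g_impls[i];
        if (!cur->is_supported()) continue;
        if (forced != NULL && strcmp(cur->name, forced) != 0) continue;
        if (best == NULL || cur->priority > best->priority) best = cur;
    }
    return best;
}

// Reference kernel that works everywhere and serves as the fallback.
void matmul_generic(const float *a, const float *b, float *c, int m, int n, int k) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Trivial support checks used to demonstrate the selection logic.
bool always(void) { return true; }
bool never(void)  { return false; }
```

An accelerated kernel would be registered with a higher priority and a real `is_supported` check; when the check fails, `select_op(NULL)` simply falls back to the generic entry.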

slaren · 2024-11-17T02:12:27Z

Collaborator

The way we handle this is with ggml_backend_sched. That's how the BLAS and AMX backends work despite only implementing matrix multiplication.

13 replies

slaren Nov 17, 2024
Collaborator

> In case, are there any plans to add all Dynamic types in ggml_type or are there plans to move it somewhere else?

I don't know if it would be worth adding new ggml_types for the AMX repacked tensor layouts. I don't have a strong opinion, but at the moment I think it would be simpler to avoid adding new types until there is a clear reason to do that.

Djip007 Nov 17, 2024
Author

OK, I think it is too much for me to start with AMX.
I will try to add a simple bf16 matmul with repacking.

I totally agree with not adding any new type.
It is used here:

llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c

Lines 410 to 417 in ce2e59b

    
           [GGML_TYPE_Q4_0_4_4] = { 
        
               .from_float               = NULL, 
        
               .vec_dot                  = NULL, 
        
               .vec_dot_type             = GGML_TYPE_Q8_0, 
        
               .nrows                    = 1, 
        
               .ncols                    = 4, 
        
               .gemv                     = ggml_gemv_q4_0_4x4_q8_0, 
        
               .gemm                     = ggml_gemm_q4_0_4x4_q8_0,

Do you have any plans for that?

Same "question" for this: wouldn't it be better to move it to the CPU backend?

llama.cpp/ggml/src/ggml-common.h

Lines 206 to 210 in ce2e59b

    
           typedef struct { 
        
               ggml_half d[4];        // deltas for 4 q4_0 blocks 
        
               uint8_t qs[QK4_0 * 2]; // nibbles / quants for 4 q4_0 blocks 
        
           } block_q4_0x4; 
        
           static_assert(sizeof(block_q4_0x4) == 4 * sizeof(ggml_half) + QK4_0 * 2, "wrong q4_0x4 block size/padding");

What do you think if I move the gemv/gemm to another array and use some sort of "CPU_TENSOR_TYPE"?

Note: to me, this gemv/gemm looks like a way to register some "OP".
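Something like the following is what I have in mind, as a rough sketch only. The enum `cpu_tensor_type`, the struct `cpu_tensor_traits`, the helper functions and the simplified kernel signature are all made up for illustration; the real gemv/gemm signatures in ggml are different, and the demo kernels just write a marker value.

```c
#include <stddef.h>

// Hypothetical CPU-only layout identifiers, kept out of the public type enum.
typedef enum {
    CPU_TENSOR_TYPE_NONE = 0,
    CPU_TENSOR_TYPE_Q4_0_4X4,
    CPU_TENSOR_TYPE_COUNT
} cpu_tensor_type;

// Simplified, illustrative kernel signature (not the one ggml uses).
typedef void (*cpu_kernel_fn)(int n, float *dst, const void *lhs, const void *rhs);

// Per-layout kernels, stored in a table private to the CPU backend.
typedef struct {
    int           nrows;
    int           ncols;
    cpu_kernel_fn gemv;  // used when the activation has a single row
    cpu_kernel_fn gemm;  // used for multi-row activations
} cpu_tensor_traits;

// Placeholder kernels: they only write a marker so the dispatch can be observed.
void demo_gemv(int n, float *dst, const void *lhs, const void *rhs) {
    (void) n; (void) lhs; (void) rhs;
    dst[0] = 1.0f;
}

void demo_gemm(int n, float *dst, const void *lhs, const void *rhs) {
    (void) n; (void) lhs; (void) rhs;
    dst[0] = 2.0f;
}

static const cpu_tensor_traits g_cpu_traits[CPU_TENSOR_TYPE_COUNT] = {
    [CPU_TENSOR_TYPE_Q4_0_4X4] = {
        .nrows = 1,
        .ncols = 4,
        .gemv  = demo_gemv,
        .gemm  = demo_gemm,
    },
};

// Look up the traits of a CPU layout; NULL for NONE or out-of-range values.
const cpu_tensor_traits *cpu_get_traits(cpu_tensor_type t) {
    if (t <= CPU_TENSOR_TYPE_NONE || t >= CPU_TENSOR_TYPE_COUNT) return NULL;
    return &g_cpu_traits[t];
}

// Choose gemv for a single activation row, gemm otherwise.
cpu_kernel_fn cpu_pick_kernel(cpu_tensor_type t, int n_act_rows) {
    const cpu_tensor_traits *tr = cpu_get_traits(t);
    if (tr == NULL) return NULL;
    return n_act_rows == 1 ? tr->gemv : tr->gemm;
}
```

With a table like this, the gemv/gemm pointers would no longer need to live in the type traits indexed by ggml_type, and adding a new repacked layout would only touch the CPU backend.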

slaren Nov 17, 2024
Collaborator

I am not sure. Maybe it would be good to phase out the Q4_0_x_x file types, remove them from llama-quantize, and eventually make them available only via online repacking of Q4_0 models. There are some advantages to having a Q4_0_x_x model file that does not require conversion: it allows using mmap and reduces the load time, but I am not sure if that is significant enough to keep them. If you want to refactor the way the CPU backend handles this, with a new struct or some other way, go for it; anything that improves the code quality is welcome.
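For readers unfamiliar with what "online repacking" means here, the sketch below shows the general shape of the operation: four Q4_0 blocks are packed at load time into one `block_q4_0x4` with their quant bytes interleaved. It is a simplified illustration, not the ggml implementation: `half_bits` is a stand-in for `ggml_half`, the function name `repack_q4_0x4` and the `chunk` parameter are assumptions, and the real repacking code may transform the nibbles in addition to interleaving them.

```c
#include <stdint.h>
#include <string.h>

#define QK4_0 32              // values per Q4_0 block

typedef uint16_t half_bits;   // stand-in for ggml_half (raw fp16 bits)

// Plain Q4_0 block: one scale and 32 4-bit quants packed into 16 bytes.
typedef struct {
    half_bits d;
    uint8_t   qs[QK4_0 / 2];
} block_q4_0;

// Packed block holding 4 Q4_0 blocks, mirroring the struct quoted above.
typedef struct {
    half_bits d[4];
    uint8_t   qs[QK4_0 * 2];
} block_q4_0x4;

// Interleave the quant bytes of 4 blocks, `chunk` bytes at a time.
// `chunk` must divide QK4_0 / 2 (for example 4 or 8).
block_q4_0x4 repack_q4_0x4(const block_q4_0 in[4], size_t chunk) {
    block_q4_0x4 out;

    // Scales are copied side by side.
    for (int i = 0; i < 4; i++) out.d[i] = in[i].d;

    // Output chunk i comes from block (i % 4), at offset (i / 4) * chunk.
    const size_t n_chunks = (QK4_0 * 2) / chunk;
    for (size_t i = 0; i < n_chunks; i++) {
        const size_t src_block = i % 4;
        const size_t src_off   = (i / 4) * chunk;
        memcpy(&out.qs[i * chunk], &in[src_block].qs[src_off], chunk);
    }
    return out;
}
```

Because this step rewrites the tensor data in memory, a model repacked this way cannot be used directly from the mmap'd file, which is the load-time trade-off mentioned above.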

Djip007 Nov 18, 2024
Author

OK, I'll try something and see if it works.

Djip007 Nov 21, 2024
Author

Created a draft #10446 with my current testing...
More needs to be done 😅
