Optimization to remove PCIe bandwidth limitations for large matrix multiplications on consumer GPU cards #5893
Replies: 2 comments 4 replies
-
Crickets? Am I barking up the wrong tree? Is this perhaps already implemented in llama.cpp?
-
@JohannesGaessler I am not quite sure I follow you; it might be that there are optimizations in llama.cpp that I am not aware of. The implementation I know of is the standard Transformer implementation, as in GPT-2. There, token generation is a direct part of context processing, whereas it sounds like your implementation somehow preprocesses the prompt and then just does token generation?

In a standard GPT-2 implementation, the context consists of many embedding vectors (modified through position embedding, layer norm, etc.). Since you have many vectors, you can still use the strip-mining optimization, because you need to run all of the context token vectors through the various weight matrices (Q, K and V, projection and MLP) during a single token generation. With many context token vectors you get many matrix-vector multiplications for each weight matrix, so during generation of a single output token, each row of the weight matrices gets re-used once for each context token.

If you can transfer just some of the rows (usually at least as many as your cache-line length), and run them against some of the token vectors (again, at least a cache-line's worth) until every transferred weight-matrix row has been combined with every transferred context token vector, you can meanwhile transfer the next batch of matrix rows or context vectors. As long as you have transferred a sufficiently large portion of the weight matrix and have enough context token vectors to multiply against it, the transfer of further weight-matrix rows can proceed while you are doing the dot products between weight-matrix rows and token vectors, re-using each weight-matrix row once per token vector.

Did this make sense? Am I lacking some important information about how llama.cpp is implemented? If so, can you tell me how it works at a high level, or is there some documentation?
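To make the access pattern I mean concrete, here is a minimal CPU-side sketch (my own naming and layout, not llama.cpp's): W is one weight matrix with d_out rows, X holds the n_ctx context vectors, and W is walked in chunks of rows so that each transferred chunk is re-used once per context vector before the next chunk is needed.

```cpp
#include <cstddef>

// Illustrative sketch (my own naming, not llama.cpp's): apply one weight
// matrix W (d_out x d_in, row-major) to all n_ctx context vectors stored
// contiguously in X (each of length d_in), writing n_ctx output vectors of
// length d_out into Y. Every row in the current chunk is re-used once per
// context vector, which is the window in which the *next* chunk could be
// transferred over PCIe.
void apply_weights_chunked(const float *W, const float *X, float *Y,
                           int d_out, int d_in, int n_ctx, int rows_per_chunk) {
    for (int r0 = 0; r0 < d_out; r0 += rows_per_chunk) {
        const int r1 = (r0 + rows_per_chunk < d_out) ? r0 + rows_per_chunk : d_out;
        // In the real scheme, the asynchronous transfer of the rows after r1
        // would be started here, overlapping with the dot products below.
        for (int r = r0; r < r1; ++r) {           // rows of W in this chunk
            for (int t = 0; t < n_ctx; ++t) {     // every context token re-uses row r
                float acc = 0.0f;
                for (int k = 0; k < d_in; ++k) {
                    acc += W[(std::size_t)r * d_in + k] * X[(std::size_t)t * d_in + k];
                }
                Y[(std::size_t)t * d_out + r] = acc;
            }
        }
    }
}
```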
-
Matrix multiplication does O(n³) operations on O(n²) data. Given sufficiently large matrices, this means that matrix multiplication can potentially be implemented without bandwidth limitations. I used to work in HPC, and the common technique for achieving this was an optimization called "strip mining":
You split the target matrix, i.e. the resulting matrix to be computed, into blocks. The target block size depends on the size of the fastest local memory available: you want the largest target block whose source rows, source columns and destination still fit in that fast memory. You transfer to that fastest memory only the rows of matrix A and the columns of matrix B that are needed to compute the target block. Now you can compute the whole blockHeight x blockWidth target block, but you only transferred blockHeight rows of A plus blockWidth columns of B instead of the whole of A and B. Within the block, each row of A is used as many times as there are columns of B, and vice versa. With large enough blocks, this gives you ample time to pre-transfer the next set of input data while the first block is being computed (and you can save even more bandwidth here, because you already have e.g. the rows of matrix A needed for the second block).
When implemented with loops, this results in two extra loop levels for stepping along the blocks. Since the method is effective for each stage of memory in your system (L1 cache, L2 cache, L3 cache, on-board memory, main system RAM, local disk, remote MPI machines, remote SAN), this subdivision into blocks can usefully be done quite deeply, resulting in a loop nest that looks like a strip mine, hence the name of the optimization. (Tip: doing it recursively looks a bit tidier and doesn't hard-code the number of levels, but even a single hard-coded level of strip mining may be enough to solve the PCIe bandwidth issue.) Your innermost memory level needs space for at least the input rows and columns and the target space needed to compute a 2x2 target block to gain any advantage, but larger target blocks give a bigger benefit.
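Here is a minimal single-level sketch of that loop structure (plain C++, row-major; the block sizes BM and BN are names I'm introducing, chosen so that BM rows of A, BN columns of B and the BM x BN block of C fit in the memory level being targeted):

```cpp
#include <algorithm>
#include <cstddef>

// One level of strip mining: C (MxN) = A (MxK) * B (KxN), row-major.
// Only BM rows of A and BN columns of B are touched per target block.
void gemm_strip_mined(const float *A, const float *B, float *C,
                      std::size_t M, std::size_t N, std::size_t K,
                      std::size_t BM, std::size_t BN) {
    for (std::size_t i0 = 0; i0 < M; i0 += BM) {        // step over target blocks (rows)
        const std::size_t iMax = std::min(i0 + BM, M);
        for (std::size_t j0 = 0; j0 < N; j0 += BN) {    // step over target blocks (cols)
            const std::size_t jMax = std::min(j0 + BN, N);
            // Only rows i0..iMax of A and columns j0..jMax of B are needed
            // in fast memory to finish this whole block of C.
            for (std::size_t i = i0; i < iMax; ++i) {
                for (std::size_t j = j0; j < jMax; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < K; ++k) {
                        acc += A[i * K + k] * B[k * N + j];
                    }
                    C[i * N + j] = acc;
                }
            }
        }
    }
}
```

The two outer loops are the extra "strip" levels; blocking for the next memory tier just wraps two more loops (or a recursive call) around these.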
Using BLAS libraries, you don't implement the matrix multiply yourself, but you can still subdivide the target matrix into blocks and feed BLAS only the rows and columns needed for the current block. The decomposition can also be used to support multiple GPUs, where you simply calculate different target blocks of the same input matrices on different GPUs.
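As a sketch of what that could look like with cuBLAS (a hypothetical helper, column-major as cuBLAS expects; the block coordinates i0/j0 and sizes mb/nb are my own names, and error checking is omitted), computing one target block transfers only the mb rows of A and nb columns of B it needs:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sketch: compute one mb x nb target block of C = A * B with
// cuBLAS, transferring only the rows of A and columns of B that block needs.
// A, B, C are host matrices in column-major storage (M x K, K x N, M x N).
void gemm_block_cublas(cublasHandle_t handle,
                       const float *A, const float *B, float *C,
                       int M, int N, int K,
                       int i0, int j0, int mb, int nb) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * (size_t)mb * K);   // mb rows of A
    cudaMalloc(&dB, sizeof(float) * (size_t)K  * nb);  // nb columns of B
    cudaMalloc(&dC, sizeof(float) * (size_t)mb * nb);  // mb x nb block of C

    // Rows i0..i0+mb of A are strided in column-major storage, so use a 2D copy.
    cudaMemcpy2D(dA, sizeof(float) * mb,
                 A + i0, sizeof(float) * M,
                 sizeof(float) * mb, K, cudaMemcpyHostToDevice);
    // Columns j0..j0+nb of B are contiguous in column-major storage.
    cudaMemcpy(dB, B + (size_t)j0 * K, sizeof(float) * (size_t)K * nb,
               cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                mb, nb, K, &alpha, dA, mb, dB, K, &beta, dC, mb);

    // Copy the finished C block back into the full host matrix.
    cudaMemcpy2D(C + i0 + (size_t)j0 * M, sizeof(float) * M,
                 dC, sizeof(float) * mb,
                 sizeof(float) * mb, nb, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```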
I checked with NVIDIA: cuBLAS does not do this matrix decomposition itself, but you can of course do it manually in your calls to cuBLAS (and other BLAS libraries). They did confirm that transferring data across the PCIe bridge while computing works fine, so doing this with the large matrices of LLMs might allow us to pre-transfer matrix blocks across the PCIe bridge and run models like Llama-70B on something like an 8 GB gaming GPU at full GPU utilization. Since the matrices are large, it might even be worth experimenting with pulling in data all the way from disk (with an added level of strip mining for the disk level), to see whether we can gain a speedup for machines with e.g. 16 GB or 32 GB of system RAM.
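A rough sketch of how the overlap could be arranged with two CUDA streams (again my own naming, not an existing llama.cpp code path), assuming B is already resident on the device and the host copy of A is in pinned memory so the copies can be truly asynchronous: while cuBLAS works on one row panel of A in one stream, the next panel's copy is in flight in the other.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical double-buffering sketch: stream row panels of A over PCIe
// while cuBLAS works on the previous panel. Column-major storage throughout;
// error checking omitted for brevity.
void gemm_streamed_panels(cublasHandle_t handle,
                          const float *A_host,  // pinned host memory, M x K
                          const float *dB,      // device, K x N
                          float *dC,            // device, M x N
                          int M, int N, int K, int mb /* panel height */) {
    cudaStream_t stream[2];
    float *dA[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dA[s], sizeof(float) * (size_t)mb * K);
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i0 = 0, buf = 0; i0 < M; i0 += mb, buf ^= 1) {
        const int rows = (i0 + mb <= M) ? mb : (M - i0);
        // The async copy of this panel overlaps with the GEMM still running
        // on the other stream; the GEMM below waits only for its own copy.
        cudaMemcpy2DAsync(dA[buf], sizeof(float) * rows,
                          A_host + i0, sizeof(float) * M,
                          sizeof(float) * rows, K,
                          cudaMemcpyHostToDevice, stream[buf]);
        cublasSetStream(handle, stream[buf]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, N, K, &alpha,
                    dA[buf], rows, dB, K, &beta, dC + i0, M);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaFree(dA[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

With large enough panels, the copy time hides behind the GEMM time, which is the same arithmetic-intensity argument as above.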