Optimization to remove PCIe bandwidth limitations for large matrix multiplications on consumer GPU cards #5893
Replies: 2 comments 4 replies
-
Crickets? Am I barking up the wrong tree? Is this perhaps already implemented in llama.cpp?
-
@JohannesGaessler I am not quite sure I follow you; it might be that there are optimizations in llama.cpp that I am not aware of. The implementation I know of is the standard Transformer implementation, as in GPT-2. There, token generation is a direct part of context processing, whereas it sounds like your implementation somehow preprocesses the prompt and then just does token generation?

In a standard GPT-2 implementation, the context consists of many embedding vectors (modified through position embedding, layer norm, etc.). Since you have many vectors, you can still use the strip-mining optimization, because you need to run all of the context token vectors through the various weight matrices (Q, K and V, projection and MLP) during a single token generation. With many context token vectors you get many matrix-vector multiplications for each weight matrix, so during generation of a single output token, each row of the weight matrices gets re-used once for each context token.

If you can transfer just some of the rows (usually at least as many as your cache-line length), and run them against some of the token vectors (again, at least a cache-line's worth) until every transferred weight-matrix row has been combined with every transferred context token vector, you can meanwhile transfer the next batch of matrix rows or context vectors. As long as you have transferred a sufficiently large portion of the weight matrix and have enough context token vectors to multiply against it, the transfer of further weight-matrix rows can proceed while you are doing the dot products between weight-matrix rows and token vectors, re-using each weight-matrix row once per token vector.

Did this make sense? Am I lacking some important information about how llama.cpp is implemented? If so, can you tell me how it works at a high level, or is there some documentation?
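To make the access pattern I mean concrete, here is a minimal CPU-side sketch (my own naming and layout, not llama.cpp's): W is one weight matrix with d_out rows, X holds the n_ctx context vectors, and W is walked in chunks of rows so that each transferred chunk is re-used once per context vector before the next chunk is needed.

```cpp
#include <cstddef>

// Illustrative sketch (my own naming, not llama.cpp's): apply one weight
// matrix W (d_out x d_in, row-major) to all n_ctx context vectors stored
// contiguously in X (each of length d_in), writing n_ctx output vectors of
// length d_out into Y. Every row in the current chunk is re-used once per
// context vector, which is the window in which the *next* chunk could be
// transferred over PCIe.
void apply_weights_chunked(const float *W, const float *X, float *Y,
                           int d_out, int d_in, int n_ctx, int rows_per_chunk) {
    for (int r0 = 0; r0 < d_out; r0 += rows_per_chunk) {
        const int r1 = (r0 + rows_per_chunk < d_out) ? r0 + rows_per_chunk : d_out;
        // In the real scheme, the asynchronous transfer of the rows after r1
        // would be started here, overlapping with the dot products below.
        for (int r = r0; r < r1; ++r) {           // rows of W in this chunk
            for (int t = 0; t < n_ctx; ++t) {     // every context token re-uses row r
                float acc = 0.0f;
                for (int k = 0; k < d_in; ++k) {
                    acc += W[(std::size_t)r * d_in + k] * X[(std::size_t)t * d_in + k];
                }
                Y[(std::size_t)t * d_out + r] = acc;
            }
        }
    }
}
```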
-
Matrix multiplication does O(n³) operations on O(n²) data. Given sufficiently large matrices, this means that matrix multiplication can potentially be implemented without bandwidth limitations. I used to work in HPC, and the common technique for achieving this was an optimization called "strip mining":
You split the target matrix, i.e. the resulting matrix to be computed, into blocks. The target block size depends on the size of the fastest local memory available: you want the largest target block whose source rows, source columns and destination still fit in that fast memory. You transfer to that fastest memory only the rows of matrix A and the columns of matrix B that are needed to compute the target block. Now you can compute the whole blockHeight x blockWidth target block, but you only transferred blockHeight rows of A plus blockWidth columns of B instead of the whole of A and B. Within the block, each row of A is used as many times as there are columns of B, and vice versa. With large enough blocks, this gives you ample time to pre-transfer the next set of input data while the first block is being computed (and you can save even more bandwidth here, because you already have e.g. the rows of matrix A needed for the second block).
When implemented with loops, this results in two extra loop levels for stepping along the blocks. Since the method is effective for each stage of memory in your system (L1 cache, L2 cache, L3 cache, on-board memory, main system RAM, local disk, remote MPI machines, remote SAN), this subdivision into blocks can usefully be done quite deeply, resulting in a loop nest that looks like a strip mine, hence the name of the optimization. (Tip: doing it recursively looks a bit tidier and doesn't hard-code the number of levels, but even a single hard-coded level of strip mining may be enough to solve the PCIe bandwidth issue.) Your innermost memory level needs space for at least the input rows and columns and the target space needed to compute a 2x2 target block to gain any advantage, but larger target blocks give a bigger benefit.
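Here is a minimal single-level sketch of that loop structure (plain C++, row-major; the block sizes BM and BN are names I'm introducing, chosen so that BM rows of A, BN columns of B and the BM x BN block of C fit in the memory level being targeted):

```cpp
#include <algorithm>
#include <cstddef>

// One level of strip mining: C (MxN) = A (MxK) * B (KxN), row-major.
// Only BM rows of A and BN columns of B are touched per target block.
void gemm_strip_mined(const float *A, const float *B, float *C,
                      std::size_t M, std::size_t N, std::size_t K,
                      std::size_t BM, std::size_t BN) {
    for (std::size_t i0 = 0; i0 < M; i0 += BM) {        // step over target blocks (rows)
        const std::size_t iMax = std::min(i0 + BM, M);
        for (std::size_t j0 = 0; j0 < N; j0 += BN) {    // step over target blocks (cols)
            const std::size_t jMax = std::min(j0 + BN, N);
            // Only rows i0..iMax of A and columns j0..jMax of B are needed
            // in fast memory to finish this whole block of C.
            for (std::size_t i = i0; i < iMax; ++i) {
                for (std::size_t j = j0; j < jMax; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < K; ++k) {
                        acc += A[i * K + k] * B[k * N + j];
                    }
                    C[i * N + j] = acc;
                }
            }
        }
    }
}
```

The two outer loops are the extra "strip" levels; blocking for the next memory tier just wraps two more loops (or a recursive call) around these.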
Using BLAS libraries, you don't implement the matrix multiply yourself, but you can still subdivide the target matrix into blocks and feed BLAS only the rows and columns needed for the current block. The decomposition can also be used to support multiple GPUs, where you simply calculate different target blocks of the same input matrices on different GPUs.
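As a sketch of what that could look like with cuBLAS (a hypothetical helper, column-major as cuBLAS expects; the block coordinates i0/j0 and sizes mb/nb are my own names, and error checking is omitted), computing one target block transfers only the mb rows of A and nb columns of B it needs:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sketch: compute one mb x nb target block of C = A * B with
// cuBLAS, transferring only the rows of A and columns of B that block needs.
// A, B, C are host matrices in column-major storage (M x K, K x N, M x N).
void gemm_block_cublas(cublasHandle_t handle,
                       const float *A, const float *B, float *C,
                       int M, int N, int K,
                       int i0, int j0, int mb, int nb) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * (size_t)mb * K);   // mb rows of A
    cudaMalloc(&dB, sizeof(float) * (size_t)K  * nb);  // nb columns of B
    cudaMalloc(&dC, sizeof(float) * (size_t)mb * nb);  // mb x nb block of C

    // Rows i0..i0+mb of A are strided in column-major storage, so use a 2D copy.
    cudaMemcpy2D(dA, sizeof(float) * mb,
                 A + i0, sizeof(float) * M,
                 sizeof(float) * mb, K, cudaMemcpyHostToDevice);
    // Columns j0..j0+nb of B are contiguous in column-major storage.
    cudaMemcpy(dB, B + (size_t)j0 * K, sizeof(float) * (size_t)K * nb,
               cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                mb, nb, K, &alpha, dA, mb, dB, K, &beta, dC, mb);

    // Copy the finished C block back into the full host matrix.
    cudaMemcpy2D(C + i0 + (size_t)j0 * M, sizeof(float) * M,
                 dC, sizeof(float) * mb,
                 sizeof(float) * mb, nb, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```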
I checked with NVIDIA: cuBLAS does not do this matrix decomposition itself, but you can of course do it manually in your calls to cuBLAS (and other BLAS libraries). They did confirm that transferring data across the PCIe bridge while computing works fine, so doing this with the large matrices of LLMs might allow us to pre-transfer matrix blocks across the PCIe bridge and run models like Llama-70B on something like an 8 GB gaming GPU at full GPU utilization. Since the matrices are large, it might even be worth experimenting with pulling in data all the way from disk (with an added level of strip mining for the disk level), to see whether we can gain a speedup for machines with e.g. 16 GB or 32 GB of system RAM.
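A rough sketch of how the overlap could be arranged with two CUDA streams (again my own naming, not an existing llama.cpp code path), assuming B is already resident on the device and the host copy of A is in pinned memory so the copies can be truly asynchronous: while cuBLAS works on one row panel of A in one stream, the next panel's copy is in flight in the other.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical double-buffering sketch: stream row panels of A over PCIe
// while cuBLAS works on the previous panel. Column-major storage throughout;
// error checking omitted for brevity.
void gemm_streamed_panels(cublasHandle_t handle,
                          const float *A_host,  // pinned host memory, M x K
                          const float *dB,      // device, K x N
                          float *dC,            // device, M x N
                          int M, int N, int K, int mb /* panel height */) {
    cudaStream_t stream[2];
    float *dA[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dA[s], sizeof(float) * (size_t)mb * K);
    }

    const float alpha = 1.0f, beta = 0.0f;
    for (int i0 = 0, buf = 0; i0 < M; i0 += mb, buf ^= 1) {
        const int rows = (i0 + mb <= M) ? mb : (M - i0);
        // The async copy of this panel overlaps with the GEMM still running
        // on the other stream; the GEMM below waits only for its own copy.
        cudaMemcpy2DAsync(dA[buf], sizeof(float) * rows,
                          A_host + i0, sizeof(float) * M,
                          sizeof(float) * rows, K,
                          cudaMemcpyHostToDevice, stream[buf]);
        cublasSetStream(handle, stream[buf]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, N, K, &alpha,
                    dA[buf], rows, dB, K, &beta, dC + i0, M);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaFree(dA[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

With large enough panels, the copy time hides behind the GEMM time, which is the same arithmetic-intensity argument as above.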