Speculative decoding potential for running big LLMs on consumer grade GPUs efficiently #10466
steampunque started this conversation in Ideas
I recently added an efficient greedy-only spec decode to my downstream server patch (a completely different implementation than the current spec decode PR). I then evaluated tg performance for two cases: 1) solve the first humaneval problem with a coding model, and 2) solve the goldcoin problem with a general model. I used Qwen 14B for the target and 0.5B, 1.5B, and 3B for the drafts, and measured tg vs. draft token length on a 4070 with target and draft weights fully offloaded, where the target is an IQ4_XS quant and the drafts are Q6_K quants.
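For reference, the core loop is the usual greedy draft-and-verify scheme. The sketch below is only an illustration with made-up interfaces (`Model`, `greedy_next`, `greedy_batch` are hypothetical, not the actual patch or the llama.cpp API): the draft proposes K tokens greedily, the target checks them in one batched forward pass, and the longest agreeing prefix plus one target token is kept.

```cpp
#include <cstdint>
#include <vector>

using token = int32_t;

// Hypothetical model handle: greedy_next() returns the argmax token after a
// context; greedy_batch() runs one forward pass over ctx + proposal and
// returns K+1 argmax tokens, one for each proposed position plus the "bonus"
// position after the full proposal.
struct Model {
    token greedy_next(const std::vector<token> &ctx);
    std::vector<token> greedy_batch(const std::vector<token> &ctx,
                                    const std::vector<token> &proposal);
};

// One speculation round: returns the tokens actually emitted (1..K+1).
std::vector<token> speculate_round(Model &draft, Model &target,
                                   std::vector<token> &ctx, int K) {
    // 1. Draft K tokens greedily with the small model.
    std::vector<token> proposal, tmp = ctx;
    for (int i = 0; i < K; ++i) {
        token t = draft.greedy_next(tmp);
        proposal.push_back(t);
        tmp.push_back(t);
    }

    // 2. Verify with a single target forward pass over the whole proposal.
    std::vector<token> target_out = target.greedy_batch(ctx, proposal);

    // 3. Accept the longest prefix where draft and target agree, then emit
    //    the target's own token at the first mismatch (or the bonus token if
    //    everything matched), so each round yields at least one target token.
    std::vector<token> accepted;
    size_t i = 0;
    while (i < proposal.size() && target_out[i] == proposal[i]) {
        accepted.push_back(proposal[i]);
        ++i;
    }
    accepted.push_back(target_out[i]);

    ctx.insert(ctx.end(), accepted.begin(), accepted.end());
    return accepted;
}
```

Because both sides decode greedily, every emitted token is exactly what the target would have produced on its own, so only the speed changes, not the output.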
HUMANEVAL first problem:
TARGET Qwen2.5-Coder-14B-Instruct
DRAFTS Qwen2.5-Coder-0.5B-Instruct, Qwen2.5-Coder-1.5B-Instruct, Qwen2.5-Coder-3B-Instruct
TPS vs draft tokens:
GOLDCOIN
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? Use step-by-step reasoning to solve this problem.
TARGET Qwen2.5-14B-Instruct
DRAFTS Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct
TPS vs draft tokens:
TARGET Llama 3.1 8B Instruct
DRAFT Llama 3.2 1B Instruct
TPS vs draft tokens:
Results summary:
Coding shows a max speedup of 2.5x tg at 10 draft tokens speculated using the 0.5B model. With the 1.5B draft the max speedup is 1.63x at 4 draft tokens, and with the 3B draft it is 1.33x at 4 draft tokens. The efficiency crossover (where draft+target is the same speed as no draft) is >32 draft tokens for 0.5B, >16 draft tokens for 1.5B, and 11 draft tokens for 3B.
Goldcoin shows a max speedup of 1.4x tg at 4 draft tokens speculated using the 0.5B model. With the 1.5B draft the max speedup is 1.17x at 4 draft tokens, and with the 3B draft it is 1.08x at 1 draft token. The efficiency crossover is 12 tokens for 0.5B, 6 tokens for 1.5B, and 3 tokens for 3B.
With Llama 3.1 8B Instruct drafted by Llama 3.2 1B Instruct, a token gen speedup of 1.83x is found at 5 draft tokens.
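For a rough mental model of these numbers (an idealized cost model, not something measured in the runs above): let $K$ be the number of drafted tokens per round, $\bar a$ the average number of them the target accepts, $t_\text{draft}$ the draft's per-token time, $t_\text{target}$ the target's normal per-token time, and $t_\text{verify}(K{+}1)$ the time for one target forward pass over the $K{+}1$-token verification batch. Each round yields $\bar a + 1$ tokens, so

$$
\text{speedup}(K) \approx \frac{(\bar a + 1)\,t_\text{target}}{K\,t_\text{draft} + t_\text{verify}(K{+}1)},
$$

and the efficiency crossover quoted above is just the $K$ at which this ratio falls back to 1.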
Conclusions and potential for running big LLMs on consumer grade GPUs:
A small draft model is needed (sine qua non); the 0.5B size seems to work well. Any model of about 8B or above can benefit by distilling a 0.5B draft and speculating against it. Returns fall off rapidly as the draft gets bigger: already questionable at 1.5B and not really useful at a 3B draft. Coding benefits far more from speculation than general text gen. The Qwen 2.5 series is perfect for exploiting the potential of speculation.
For running big LLMs on consumer grade GPUs with limited memory, it is desirable to avoid storing all model weights and attention layers in VRAM, because there is not enough room. Most of the model weights sit idle most of the time: in a 32-layer model, 31 layers' worth of weights occupy VRAM doing nothing 31/32 of the time. To get around this, layers must be dynamically swapped into VRAM from CPU RAM (which normally has much higher capacity) as they are needed.

If the draft length at the efficiency crossover is big enough, there may be (emphasis on may, it needs to be investigated for feasibility) enough time to compute the target verification batch (say 8 to 10 samples) while simultaneously transferring the next layer into the GPU. The GPU then needs one working-layer allocation and one transfer allocation (two model layers total, ping-ponged between compute and transfer) plus a fully offloaded speculator; KV caches for both speculator and target should also be in GPU memory. Even if it is necessary to go above the efficiency crossover, dynamic layer loading to the GPU can still beat offloading layers to the CPU, which is an immediate 10x or higher slowdown due to memory bandwidth limits. A rough sketch of the ping-pong scheme is shown below.
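A minimal sketch of that ping-pong scheme, assuming pinned host copies of the layer weights and a hypothetical `compute_layer()` that enqueues one layer's kernels on a given CUDA stream (none of this exists in llama.cpp today; it only illustrates how the copy of layer i+1 could overlap the compute of layer i):

```cpp
#include <cuda_runtime.h>
#include <vector>

struct LayerWeights {
    const void *host;   // pinned (cudaHostAlloc) copy of this layer's weights
    size_t      bytes;
};

// Hypothetical: enqueue all kernels for one transformer layer, reading its
// weights from dev_weights, on the given stream.
void compute_layer(int layer, const void *dev_weights, cudaStream_t stream);

void forward_streamed(const std::vector<LayerWeights> &layers, size_t max_bytes) {
    void *buf[2];                          // two layer-sized VRAM buffers
    cudaStream_t compute, copy;            // compute and transfer overlap
    cudaEvent_t uploaded[2], consumed[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&buf[b], max_bytes);
        cudaEventCreate(&uploaded[b]);
        cudaEventCreate(&consumed[b]);
    }
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Prefetch layer 0 into buffer 0.
    cudaMemcpyAsync(buf[0], layers[0].host, layers[0].bytes,
                    cudaMemcpyHostToDevice, copy);
    cudaEventRecord(uploaded[0], copy);

    for (size_t i = 0; i < layers.size(); ++i) {
        const int cur = i & 1, nxt = cur ^ 1;

        // Run layer i as soon as its weights have landed in buf[cur].
        cudaStreamWaitEvent(compute, uploaded[cur], 0);
        compute_layer((int) i, buf[cur], compute);
        cudaEventRecord(consumed[cur], compute);   // buf[cur] reusable after this

        // Meanwhile, stream layer i+1 into the other buffer, but only once
        // layer i-1 (the previous user of buf[nxt]) has finished with it.
        if (i + 1 < layers.size()) {
            if (i >= 1) cudaStreamWaitEvent(copy, consumed[nxt], 0);
            cudaMemcpyAsync(buf[nxt], layers[i + 1].host, layers[i + 1].bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(uploaded[nxt], copy);
        }
    }
    cudaStreamSynchronize(compute);
    // Cleanup of buffers, streams and events omitted for brevity.
}
```

Whether the copy can actually be hidden comes down to PCIe bandwidth versus the per-layer compute time of the verification batch, which is exactly the feasibility question raised above.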