I have thought for a while that we should have a library of "llama.cpp extras" with ready-to-use algorithms such as speculative decoding, and now lookahead decoding and prompt lookup decoding. Beam search should also be moved to this library. The way we currently do this is by adding an example and telling application developers to re-implement it based on that example, and it is not really reasonable to dump all of this complexity on them. Another alternative would be to include all of this in the core llama.cpp library.
---
Support for ngram speculation just landed in the latest HF Transformers, with an implementation by the same author as the repo linked above!
(via https://twitter.com/joao_gante/status/1747322418425741550) It would be great to have this implemented in llama.cpp! I'll also note their implementation is seemingly model-agnostic, which seems desirable.
---
### Description
This idea was prompted by a recently proposed approach for speculative decoding: Prompt Lookup Decoding
In short, we draft tokens from the prompt by matching against the last `N ~ 3` generated tokens. With a large prompt and repetitive text (code, summarization, etc.) this can trivially yield a significant inference speed-up. An obvious extension is that we can search for draft tokens not just in the prompt but in a larger corpus of data, if we had one[^1]. Additionally, the corpus could be updated dynamically over time based on the specific generations that occur locally. Maintaining such a corpus can obviously be done individually by each 3rd party project, but I'm wondering if it would be a good idea to create a basic implementation that ships with `llama.cpp` and can be used directly.
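To make the idea concrete, here is a minimal sketch of the lookup step, assuming the tokens are held in a plain vector; `prompt_lookup_draft` and its parameters are illustrative names, not an existing `llama.cpp` API:

```cpp
// Minimal sketch of the lookup step: find the most recent earlier occurrence
// of the last `ngram_size` generated tokens and propose the tokens that
// followed it as the draft. Illustrative code, not an existing llama.cpp API.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t; // matches the llama.cpp token type

std::vector<llama_token> prompt_lookup_draft(
        const std::vector<llama_token> & tokens, // prompt + tokens generated so far
        size_t ngram_size = 3,                   // the "N ~ 3" from the description
        size_t n_draft    = 8) {                 // maximum draft length
    std::vector<llama_token> draft;
    if (tokens.size() < ngram_size + 1) {
        return draft;
    }
    const size_t start_ngram = tokens.size() - ngram_size;
    // scan backwards so that the most recent match wins
    for (size_t i = start_ngram; i-- > 0; ) {
        if (std::equal(tokens.begin() + i, tokens.begin() + i + ngram_size,
                       tokens.begin() + start_ngram)) {
            const size_t begin = i + ngram_size;
            const size_t end   = std::min(begin + n_draft, tokens.size());
            draft.assign(tokens.begin() + begin, tokens.begin() + end);
            break;
        }
    }
    return draft; // empty -> no match, fall back to regular decoding
}
```

When verification accepts the drafted tokens, several tokens are produced per target-model evaluation, which is where the speed-up comes from.

### API proposal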
This is a work in progress - suggestions are welcome:
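As a strawman to discuss against, the interface could look something like the C-style header below; every name in it is a placeholder, not an existing `llama.cpp` symbol:

```cpp
// Hypothetical interface for a speculative lookup cache shipped with
// llama.cpp. All names below are placeholders for discussion, not an
// existing API.
#include <stddef.h>
#include <stdint.h>

typedef int32_t llama_token;

// opaque handle; internally maps n-grams -> observed continuations
struct llama_spec_cache;

// create / destroy a cache that matches on `ngram_size` tokens
struct llama_spec_cache * llama_spec_cache_init(size_t ngram_size);
void                      llama_spec_cache_free(struct llama_spec_cache * cache);

// feed observed tokens (prompts and accepted generations) into the cache,
// so it keeps improving based on the generations that occur locally
void llama_spec_cache_update(struct llama_spec_cache * cache,
                             const llama_token * tokens, size_t n_tokens);

// write up to `n_draft` draft tokens continuing `last_tokens` into `draft`;
// returns the number of tokens drafted (0 -> no match, decode normally)
size_t llama_spec_cache_draft(const struct llama_spec_cache * cache,
                              const llama_token * last_tokens, size_t n_last,
                              llama_token * draft, size_t n_draft);

// persist / restore the cache so it can be reused across sessions
int                       llama_spec_cache_save(const struct llama_spec_cache * cache, const char * path);
struct llama_spec_cache * llama_spec_cache_load(const char * path);
```

Keeping the handle opaque would leave room to change the internal data structure (hash map of n-grams, suffix automaton, etc.) without breaking callers.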
#### Sample usage
Without speculative cache:
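A possible shape for the no-cache path, reusing the hypothetical `prompt_lookup_draft` from the sketch above on a toy token stream:

```cpp
// Hypothetical usage without the cache: draft straight from the current
// context using prompt_lookup_draft() from the sketch above.
#include <cstdio>
#include <vector>

int main() {
    // toy context ending in the 3-gram (7, 8, 9), which also occurred earlier
    std::vector<llama_token> ctx_tokens = {1, 2, 7, 8, 9, 10, 11, 3, 4, 7, 8, 9};

    // the earlier (7, 8, 9) was followed by (10, 11, 3, 4) -> that is the draft
    std::vector<llama_token> draft = prompt_lookup_draft(ctx_tokens, 3, 4);

    for (llama_token t : draft) {
        printf("draft token: %d\n", t); // prints 10, 11, 3, 4
    }

    // in a real loop, the target model verifies the draft in a single batched
    // decode and accepts the longest matching prefix, exactly as in the
    // existing speculative example
    return 0;
}
```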
With speculative cache:
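And a possible shape for the cached path, building on the hypothetical API above; verification against the target model is omitted since it works the same as in the existing speculative example:

```cpp
// Hypothetical usage with the cache API sketched above; verification against
// the target model is omitted. Assumes ctx_tokens.size() >= 3.
#include <vector>

void generate_with_cache(std::vector<llama_token> & ctx_tokens) {
    struct llama_spec_cache * cache = llama_spec_cache_init(/*ngram_size =*/ 3);

    // seed the cache with the prompt
    llama_spec_cache_update(cache, ctx_tokens.data(), ctx_tokens.size());

    // ask for a draft continuing the last 3 generated tokens
    llama_token draft[8];
    const size_t n_draft = llama_spec_cache_draft(
            cache, ctx_tokens.data() + ctx_tokens.size() - 3, 3, draft, 8);

    // ... verify `draft`, append the accepted tokens to ctx_tokens, then feed
    // them back so the cache learns from local generations (here we pretend
    // the whole draft was accepted) ...
    llama_spec_cache_update(cache, draft, n_draft);

    // persist so future sessions benefit from this one
    llama_spec_cache_save(cache, "spec_cache.bin");
    llama_spec_cache_free(cache);
}
```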
### Implementation considerations
### Footnotes

[^1]: https://twitter.com/mzh1024/status/1728978863890518158 (@wsxiaoys)