I have thought for a while that we should have a library of "llama.cpp extras" with ready-to-use algorithms such as speculative decoding, and now lookahead decoding and prompt lookup decoding. Beam search should also be moved to this library. The way we currently do this is by adding an example and telling application developers to re-implement it based on that example, and it is not really reasonable to dump all of this complexity on them. Another alternative would be to include all of this in the core llama.cpp library.
---
Support for ngram speculation just landed in the latest HF Transformers, with an implementation by the same author as the repo linked above!
(via https://twitter.com/joao_gante/status/1747322418425741550) It would be great to have this implemented in llama.cpp! I'll also note their implementation is seemingly model-agnostic, which seems desirable.
---
### Description
This idea was prompted by a recently proposed approach for speculative decoding: Prompt Lookup Decoding
In short, we draft tokens from the prompt by matching against the last `N ~ 3` generated tokens. With a large prompt and repetitive text (code, summarization, etc.) this can trivially yield a significant inference speed-up. An obvious extension is that we can search for draft tokens not just in the prompt but in a larger corpus of data, if we had one[^1]. Additionally, the corpus could be updated dynamically over time based on the specific generations that occur locally. Maintaining such a corpus can obviously be done individually by each 3rd party project, but I'm wondering if it would be a good idea to create a basic implementation that ships with `llama.cpp` and can be used directly.
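To make the idea concrete, here is a minimal sketch of the lookup step, assuming the tokens are held in a plain vector; `prompt_lookup_draft` and its parameters are illustrative names, not an existing `llama.cpp` API:

```cpp
// Minimal sketch of the lookup step: find the most recent earlier occurrence
// of the last `ngram_size` generated tokens and propose the tokens that
// followed it as the draft. Illustrative code, not an existing llama.cpp API.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t; // matches the llama.cpp token type

std::vector<llama_token> prompt_lookup_draft(
        const std::vector<llama_token> & tokens, // prompt + tokens generated so far
        size_t ngram_size = 3,                   // the "N ~ 3" from the description
        size_t n_draft    = 8) {                 // maximum draft length
    std::vector<llama_token> draft;
    if (tokens.size() < ngram_size + 1) {
        return draft;
    }
    const size_t start_ngram = tokens.size() - ngram_size;
    // scan backwards so that the most recent match wins
    for (size_t i = start_ngram; i-- > 0; ) {
        if (std::equal(tokens.begin() + i, tokens.begin() + i + ngram_size,
                       tokens.begin() + start_ngram)) {
            const size_t begin = i + ngram_size;
            const size_t end   = std::min(begin + n_draft, tokens.size());
            draft.assign(tokens.begin() + begin, tokens.begin() + end);
            break;
        }
    }
    return draft; // empty -> no match, fall back to regular decoding
}
```

When verification accepts the drafted tokens, several tokens are produced per target-model evaluation, which is where the speed-up comes from.

### API proposal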
This is a work in progress - suggestions are welcome:
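As a strawman to discuss against, the interface could look something like the C-style header below; every name in it is a placeholder, not an existing `llama.cpp` symbol:

```cpp
// Hypothetical interface for a speculative lookup cache shipped with
// llama.cpp. All names below are placeholders for discussion, not an
// existing API.
#include <stddef.h>
#include <stdint.h>

typedef int32_t llama_token;

// opaque handle; internally maps n-grams -> observed continuations
struct llama_spec_cache;

// create / destroy a cache that matches on `ngram_size` tokens
struct llama_spec_cache * llama_spec_cache_init(size_t ngram_size);
void                      llama_spec_cache_free(struct llama_spec_cache * cache);

// feed observed tokens (prompts and accepted generations) into the cache,
// so it keeps improving based on the generations that occur locally
void llama_spec_cache_update(struct llama_spec_cache * cache,
                             const llama_token * tokens, size_t n_tokens);

// write up to `n_draft` draft tokens continuing `last_tokens` into `draft`;
// returns the number of tokens drafted (0 -> no match, decode normally)
size_t llama_spec_cache_draft(const struct llama_spec_cache * cache,
                              const llama_token * last_tokens, size_t n_last,
                              llama_token * draft, size_t n_draft);

// persist / restore the cache so it can be reused across sessions
int                       llama_spec_cache_save(const struct llama_spec_cache * cache, const char * path);
struct llama_spec_cache * llama_spec_cache_load(const char * path);
```

Keeping the handle opaque would leave room to change the internal data structure (hash map of n-grams, suffix automaton, etc.) without breaking callers.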
#### Sample usage
Without speculative cache:
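A possible shape for the no-cache path, reusing the hypothetical `prompt_lookup_draft` from the sketch above on a toy token stream:

```cpp
// Hypothetical usage without the cache: draft straight from the current
// context using prompt_lookup_draft() from the sketch above.
#include <cstdio>
#include <vector>

int main() {
    // toy context ending in the 3-gram (7, 8, 9), which also occurred earlier
    std::vector<llama_token> ctx_tokens = {1, 2, 7, 8, 9, 10, 11, 3, 4, 7, 8, 9};

    // the earlier (7, 8, 9) was followed by (10, 11, 3, 4) -> that is the draft
    std::vector<llama_token> draft = prompt_lookup_draft(ctx_tokens, 3, 4);

    for (llama_token t : draft) {
        printf("draft token: %d\n", t); // prints 10, 11, 3, 4
    }

    // in a real loop, the target model verifies the draft in a single batched
    // decode and accepts the longest matching prefix, exactly as in the
    // existing speculative example
    return 0;
}
```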
With speculative cache:
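And a possible shape for the cached path, building on the hypothetical API above; verification against the target model is omitted since it works the same as in the existing speculative example:

```cpp
// Hypothetical usage with the cache API sketched above; verification against
// the target model is omitted. Assumes ctx_tokens.size() >= 3.
#include <vector>

void generate_with_cache(std::vector<llama_token> & ctx_tokens) {
    struct llama_spec_cache * cache = llama_spec_cache_init(/*ngram_size =*/ 3);

    // seed the cache with the prompt
    llama_spec_cache_update(cache, ctx_tokens.data(), ctx_tokens.size());

    // ask for a draft continuing the last 3 generated tokens
    llama_token draft[8];
    const size_t n_draft = llama_spec_cache_draft(
            cache, ctx_tokens.data() + ctx_tokens.size() - 3, 3, draft, 8);

    // ... verify `draft`, append the accepted tokens to ctx_tokens, then feed
    // them back so the cache learns from local generations (here we pretend
    // the whole draft was accepted) ...
    llama_spec_cache_update(cache, draft, n_draft);

    // persist so future sessions benefit from this one
    llama_spec_cache_save(cache, "spec_cache.bin");
    llama_spec_cache_free(cache);
}
```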
### Implementation considerations
### Footnotes

[^1]: https://twitter.com/mzh1024/status/1728978863890518158 (@wsxiaoys)