Replies: 3 comments 2 replies
-
Yes, it makes sense to extend the API in some way to simplify this. At the moment, you have to keep track of the tokens in your app. We can store the actual tokens in the …
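A minimal sketch of that app-side bookkeeping, assuming the application mirrors every token it submits to `llama_decode` into its own per-sequence log (the struct and method names below are made up for illustration, not part of the llama.cpp API):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// App-side mirror of what each sequence in the KV cache currently holds.
// llama.cpp does not expose this today, so the application maintains it itself.
using token_id = int32_t; // stands in for llama_token
using seq_id_t = int32_t; // stands in for llama_seq_id

struct SeqTokenLog {
    std::unordered_map<seq_id_t, std::vector<token_id>> cached;

    // Call right after a successful llama_decode that added these tokens to the cache.
    void on_decoded(seq_id_t seq, const std::vector<token_id> & toks) {
        auto & log = cached[seq];
        log.insert(log.end(), toks.begin(), toks.end());
    }

    // Call after clearing positions [pos, end) of a sequence from the KV cache
    // (e.g. with a call along the lines of llama_kv_cache_seq_rm(ctx, seq, pos, -1)).
    void truncate(seq_id_t seq, std::size_t pos) {
        auto it = cached.find(seq);
        if (it != cached.end() && it->second.size() > pos) {
            it->second.resize(pos);
        }
    }
};
```

The only point of the sketch is that the source of truth for what is currently cached has to live in the application.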
-
From a library design perspective, it probably makes sense to maximize generality and flexibility rather than ease-of-use. This also aligns with the existing interface, whose …
-
From the wording of @ggerganov's comment, I cannot tell whether there is a plan to implement this or whether it is just a "good idea, in case somebody wants to submit a PR for it". I looked at the code today to see if I could do it myself, but there is a lot I do not understand, both about what the KV cache does in general and about the current implementation in llama.cpp. …
-
Let's say I want to use llama.cpp as a shared library to build a service that other applications can make requests to. When this service gets a request, it feeds it to the model via `llama_decode`. The tokens that make up the request are processed and added to the internal KV cache. Now, when the next request arrives, I need to decide which prefix of the request is already cached and therefore does not need to be processed again. From what I understand, the KV cache does not store the actual tokens, so I have no way of knowing which part of the cache needs to be cleared and which of the request tokens need to be fed to the model.
As far as I can tell, I have two options:
It seems like llama.cpp offers a stateful interface for interacting with the model/context, but in some places it lacks ways to inspect the state the model/context is currently in, which makes it awkward to work with.
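For illustration, the prefix reuse described above amounts to comparing the new request against the tokens the application believes are cached, clearing everything past the common prefix, and decoding only the remainder. A rough sketch, with the llama.cpp calls only indicated in comments since their exact signatures vary between versions:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Length of the shared prefix between the tokens we believe are cached
// for a sequence and the tokens of the incoming request.
static std::size_t common_prefix_len(const std::vector<int32_t> & cached,
                                     const std::vector<int32_t> & request) {
    const std::size_t n = std::min(cached.size(), request.size());
    std::size_t i = 0;
    while (i < n && cached[i] == request[i]) {
        ++i;
    }
    return i;
}

// On a new request for sequence `seq`:
//   1. n_keep = common_prefix_len(cached_tokens, request_tokens)
//   2. drop the stale tail of the cache, e.g. llama_kv_cache_seq_rm(ctx, seq, n_keep, -1)
//   3. feed only request_tokens[n_keep:] to llama_decode
//   4. append those tokens to the app-side log so it matches the cache again
int main() {
    const std::vector<int32_t> cached  = {1, 15043, 29892, 920};      // illustrative ids
    const std::vector<int32_t> request = {1, 15043, 29892, 825, 338}; // illustrative ids
    std::printf("reusable prefix: %zu tokens\n", common_prefix_len(cached, request));
}
```

None of this is hard, but every consumer of the library has to reimplement it, which is the awkwardness mentioned above.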
Would it not make sense for the KV cache structure inside of llama.cpp to keep track of which tokens (and which `seq_id`s) are currently in the cache and make this information available to users of the library? From what I can tell, the actual tokens would take up a vanishingly small amount of memory compared to the tensors the cache already stores.
Disclaimer: I only started using llama.cpp a few weeks ago, so I might be misunderstanding something. But if this is considered a good idea, maybe I should make an issue for it?
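If the cache did track this, the accessor could be as small as the following. To be clear, this is purely hypothetical: the function below does not exist in llama.h, and its name and signature are invented here only to make the idea concrete.

```cpp
#include "llama.h" // for llama_context, llama_seq_id, llama_token

// HYPOTHETICAL sketch, not an existing llama.cpp function.
// Would copy up to n_max token ids currently cached for seq_id into tokens,
// in position order, and return the number of positions the sequence occupies.
int32_t llama_kv_cache_seq_get_tokens(
        struct llama_context * ctx,
        llama_seq_id           seq_id,
        llama_token          * tokens,
        int32_t                n_max);
```

On the memory point: a token id is 4 bytes per cached position, while the K/V data for that position is on the order of 2 * n_layer * n_embd * 2 bytes at f16, i.e. roughly half a MiB for a 7B-class model without grouped-query attention, so the relative overhead of storing the ids really does look negligible.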