Replies: 3 comments 2 replies
-
Yes, it makes sense to extend the API in some way to simplify this. At the moment, you have to keep track of the tokens in your app. We can store the actual tokens in the …
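A minimal sketch of that app-side bookkeeping, assuming the application mirrors every token it submits to `llama_decode` into its own per-sequence log (the struct and method names below are made up for illustration, not part of the llama.cpp API):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// App-side mirror of what each sequence in the KV cache currently holds.
// llama.cpp does not expose this today, so the application maintains it itself.
using token_id = int32_t; // stands in for llama_token
using seq_id_t = int32_t; // stands in for llama_seq_id

struct SeqTokenLog {
    std::unordered_map<seq_id_t, std::vector<token_id>> cached;

    // Call right after a successful llama_decode that added these tokens to the cache.
    void on_decoded(seq_id_t seq, const std::vector<token_id> & toks) {
        auto & log = cached[seq];
        log.insert(log.end(), toks.begin(), toks.end());
    }

    // Call after clearing positions [pos, end) of a sequence from the KV cache
    // (e.g. with a call along the lines of llama_kv_cache_seq_rm(ctx, seq, pos, -1)).
    void truncate(seq_id_t seq, std::size_t pos) {
        auto it = cached.find(seq);
        if (it != cached.end() && it->second.size() > pos) {
            it->second.resize(pos);
        }
    }
};
```

The only point of the sketch is that the source of truth for what is currently cached has to live in the application.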
-
From a library design perspective, it probably makes sense to maximize generality and flexibility rather than ease-of-use. This also aligns with the existing interface, whose …
-
From the wording of @ggerganov's comment, I cannot tell whether there is a plan to implement this or whether it is just a "good idea, in case somebody wants to submit a PR for it". I looked at the code today to see if I could do it myself, but there is a lot I do not understand, both about what the KV cache does in general and about the current implementation in llama.cpp. …
-
Let's say I want to use llama.cpp as a shared library to build a service that other applications can make requests to. When this service gets a request, it feeds it to the model via `llama_decode`. The tokens that make up the request are processed and added to the internal KV cache. Now, when the next request arrives, I need to decide which prefix of the request is already cached and therefore does not need to be processed again. From what I understand, the KV cache does not store the actual tokens, so I have no way of knowing which part of the cache needs to be cleared and which of the request tokens need to be fed to the model.
As far as I can tell, I have two options:
It seems like llama.cpp offers a stateful interface for interacting with the model/context, but in some places it lacks ways to inspect the state the model/context is currently in, which makes it awkward to work with.
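For illustration, the prefix reuse described above amounts to comparing the new request against the tokens the application believes are cached, clearing everything past the common prefix, and decoding only the remainder. A rough sketch, with the llama.cpp calls only indicated in comments since their exact signatures vary between versions:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Length of the shared prefix between the tokens we believe are cached
// for a sequence and the tokens of the incoming request.
static std::size_t common_prefix_len(const std::vector<int32_t> & cached,
                                     const std::vector<int32_t> & request) {
    const std::size_t n = std::min(cached.size(), request.size());
    std::size_t i = 0;
    while (i < n && cached[i] == request[i]) {
        ++i;
    }
    return i;
}

// On a new request for sequence `seq`:
//   1. n_keep = common_prefix_len(cached_tokens, request_tokens)
//   2. drop the stale tail of the cache, e.g. llama_kv_cache_seq_rm(ctx, seq, n_keep, -1)
//   3. feed only request_tokens[n_keep:] to llama_decode
//   4. append those tokens to the app-side log so it matches the cache again
int main() {
    const std::vector<int32_t> cached  = {1, 15043, 29892, 920};      // illustrative ids
    const std::vector<int32_t> request = {1, 15043, 29892, 825, 338}; // illustrative ids
    std::printf("reusable prefix: %zu tokens\n", common_prefix_len(cached, request));
}
```

None of this is hard, but every consumer of the library has to reimplement it, which is the awkwardness mentioned above.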
Would it not make sense for the KV cache structure inside of llama.cpp to keep track of which tokens (and which `seq_id`s) are currently in the cache and make this information available to users of the library? From what I can tell, the actual tokens would take up a vanishingly small amount of memory compared to the tensors the cache already stores.
Disclaimer: I only started using llama.cpp a few weeks ago, so I might be misunderstanding something. But if this is considered a good idea, maybe I should make an issue for it?
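If the cache did track this, the accessor could be as small as the following. To be clear, this is purely hypothetical: the function below does not exist in llama.h, and its name and signature are invented here only to make the idea concrete.

```cpp
#include "llama.h" // for llama_context, llama_seq_id, llama_token

// HYPOTHETICAL sketch, not an existing llama.cpp function.
// Would copy up to n_max token ids currently cached for seq_id into tokens,
// in position order, and return the number of positions the sequence occupies.
int32_t llama_kv_cache_seq_get_tokens(
        struct llama_context * ctx,
        llama_seq_id           seq_id,
        llama_token          * tokens,
        int32_t                n_max);
```

On the memory point: a token id is 4 bytes per cached position, while the K/V data for that position is on the order of 2 * n_layer * n_embd * 2 bytes at f16, i.e. roughly half a MiB for a 7B-class model without grouped-query attention, so the relative overhead of storing the ids really does look negligible.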