Reset kv cache after each query and infinite inference features #2560
This PR introduces two enhancements to the voice assistant functionality in talk-llama.cpp, aimed at improving the user experience during conversations with the assistant:
Previously, the voice assistant was limited by the context length of the KV cache. For example, with a context length of 16 tokens, once the model had generated enough tokens to fill the available context space, it would exit with a "failed to decode" message. This exit and message come from llama.cpp, from the function "llama_kv_cache_find_slot" on line 3592 (at the time of writing this post).
Let's say you have a prompt: 1 2 3 4 and your context length is 16. You could say the context looks like this initially:
|................|
Then the prompt gets fed to the model:
|1234............|
Then the model starts generating tokens:
|1234ABCD........|
until it reaches the end of available context space:
|1234ABCDEFGHIJKL|
When the cache reached its limit, the assistant would terminate the conversation and return a "failed to decode" message (from llama.cpp), roughly as sketched below.
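For context, this is approximately where the old behaviour gave up. This is a minimal sketch, not the exact code in this PR; the variable names `ctx_llama`, `batch`, `embd`, and `n_past` are assumed stand-ins for the corresponding state in talk-llama.cpp:

```cpp
// n_past counts tokens already stored in the KV cache; llama_n_ctx() is its capacity.
// Once no free slot is left, llama_kv_cache_find_slot() fails inside llama_decode(),
// which then returns a non-zero status and the generation loop exits.
if (llama_decode(ctx_llama, batch) != 0) {
    fprintf(stderr, "%s : failed to decode\n", __func__);
    return 1; // without -inf the conversation ends here
}
n_past += (int) embd.size();
```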
With the new -inf flag, the voice assistant can now dynamically manage the KV cache, enabling seamless, infinite conversations. Even after reaching the context limit, the assistant handles the cache overflow and continues generating responses instead of exiting. This is done by preserving the original prompt (k_prompt_llama) and then shifting half of whatever remains so that it sits directly after the original prompt. Once the cache is full and adjusted, it would look something like this:
|1234GHIJKL......|
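A minimal sketch of how such a shift can be expressed with the KV-cache sequence API of the bundled llama.cpp (llama_kv_cache_seq_rm / llama_kv_cache_seq_add). The variable names (`ctx_llama`, `embd`, `embd_inp`, `n_past`) are assumptions for illustration, not necessarily the exact code in this PR:

```cpp
const int n_ctx     = llama_n_ctx(ctx_llama);
const int n_keep    = (int) embd_inp.size();     // original prompt (k_prompt_llama) tokens
const int n_discard = (n_past - n_keep) / 2;     // drop the older half of what follows

if (n_past + (int) embd.size() > n_ctx) {
    // erase the oldest generated tokens: positions [n_keep, n_keep + n_discard)
    llama_kv_cache_seq_rm (ctx_llama, 0, n_keep,             n_keep + n_discard);

    // slide the surviving tokens left so they sit right after the prompt
    llama_kv_cache_seq_add(ctx_llama, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```

The prompt stays pinned at the front of the cache and only the oldest generated tokens are discarded, which is the same trick the llama.cpp main example uses for infinite text generation.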
For users who prefer resetting the context after each query, the -reset flag offers a practical solution. When enabled, the KV-cache clears automatically after every user question. This allows the model to process each query as an independent request, ideal for use cases where maintaining conversational history isn’t necessary.
Note:
While this feature improves memory management, it comes with a trade-off: because the cache is reset, the assistant cannot refer back to previous questions or answers, which is why it is best suited to use cases where maintaining conversational history isn't necessary.
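One way to implement the reset with the same llama.cpp API is to drop everything after the original prompt once an answer is finished. Again, this is a hedged sketch with assumed variable names, not necessarily the exact code in this PR:

```cpp
// keep the original prompt cached, discard every question/answer token after it
const int n_keep = (int) embd_inp.size();

// p1 = -1 means "to the end of the sequence"
llama_kv_cache_seq_rm(ctx_llama, 0, n_keep, -1);
n_past = n_keep;
```

A full `llama_kv_cache_clear(ctx_llama)` would also work, at the cost of re-evaluating the prompt before the next question.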