Reset kv cache after each query and infinite inference features #2560

Open
wants to merge 4 commits into master

Conversation

pseelam02

This PR introduces two enhancements to the voice assistant functionality in talk-llama.cpp, aimed at improving the user experience during conversations with the assistant:

  1. Infinite Conversation Mode (-inf Flag)
    Previously, the voice assistant was limited by the context length of the KV-cache. For example, with a context length of 16 tokens, once the model generated enough tokens to fill the available context space, it would exit and return a "failed to decode" message. This exit originates in llama.cpp, in the function "llama_kv_cache_find_slot" (around line 3592 at the time of writing).

Let's say you have a prompt: 1 2 3 4 and your context length is 16. You could say the context looks like this initially:
|................|

Then the prompt gets fed to the model:
|1234............|

Then the model starts generating tokens:
|1234ABCD........|

until it reaches the end of available context space:
|1234ABCDEFGHIJKL|

When the cache reached its limit, the assistant would terminate the conversation and return a "failed to decode" message (from llama.cpp).
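
For reference, here is a minimal sketch of how that exit typically surfaces in the calling code. This is not code from this PR; it only assumes llama.cpp's llama_decode, which returns a non-zero value when llama_kv_cache_find_slot cannot place the batch:

```cpp
#include <cstdio>

#include "llama.h"

// Sketch only: llama_decode() returns 0 on success and a non-zero value when
// no free KV-cache slot can be found for the batch, which is what produced the
// "failed to decode" exit once the context filled up.
static bool decode_or_fail(llama_context * ctx, llama_batch batch) {
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "failed to decode\n");
        return false;
    }
    return true;
}
```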

With the new -inf flag, the voice assistant can now dynamically manage the KV-cache, enabling seamless, effectively infinite conversations. Even after reaching the context limit, the assistant handles the cache overflow and keeps generating responses instead of exiting. This is done by preserving the original prompt (k_prompt_llama) and then shifting half of the remaining tokens so that they sit directly after the original prompt (a sketch of this shift is included after the diagram below). Once the cache is full and then adjusted, it would look something like this:
|1234EFGH........|
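
To make the shift concrete, here is a minimal sketch based on the standard context-shifting pattern used elsewhere in llama.cpp. It is not the exact code from this PR; n_keep (the length of the preserved k_prompt_llama) and n_past are assumed bookkeeping variables, and the llama_kv_cache_seq_rm / llama_kv_cache_seq_add calls may be named differently depending on the llama.cpp version:

```cpp
#include "llama.h"

// Sketch only: when the cache is full, keep the first n_keep tokens (the
// original prompt), discard the oldest half of everything generated after it,
// and slide the surviving tokens down so they sit right after the prompt.
// n_past is the number of tokens currently stored for sequence 0.
static void shift_kv_cache(llama_context * ctx, int n_keep, int & n_past) {
    const int n_left    = n_past - n_keep;
    const int n_discard = n_left / 2;

    // remove the oldest non-prompt tokens ...
    llama_kv_cache_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
    // ... and shift the remaining ones left by n_discard positions
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```

Pinning the prompt at the front of the cache means the assistant keeps its original instructions, while the oldest part of the conversation is what gets dropped when space runs out.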

  2. Resettable KV-Cache (-reset Flag)
    For users who prefer resetting the context after each query, the -reset flag offers a practical solution. When enabled, the KV-cache is cleared automatically after every user question, so the model processes each query as an independent request. This is ideal for use cases where maintaining conversational history isn't necessary (see the sketch after the note below).

Note:
While this feature improves memory management, it comes with a trade-off: because the cache is reset, the assistant cannot refer back to previous questions or answers, which is why it is best suited to use cases where conversational history isn't needed.
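
A minimal sketch of what the reset amounts to, assuming llama.cpp's llama_kv_cache_clear (not necessarily the exact call used in this PR):

```cpp
#include "llama.h"

// Sketch only: clear the entire KV-cache between queries so that each user
// question is processed as an independent request (the -reset behaviour).
static void reset_context(llama_context * ctx, int & n_past) {
    llama_kv_cache_clear(ctx);
    n_past = 0; // the next query starts from an empty context
}
```

After the clear, the base prompt (k_prompt_llama) would presumably need to be evaluated again before the next question is answered.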
