Improve KV cache by allowing it to extend to larger sequence lengths #12

Open

rchan26 opened this issue Aug 16, 2024 · 0 comments
rchan26 commented Aug 16, 2024

Currently, the KV cache implementation only allows generation of sequences up to the maximum context length. However, to generate beyond this, it's possible to just impose a cut-off and sequentially generate each new token by "forgetting" (i.e. not inputting) the earliest tokens in the sequence. In other words, with a maximum context length of 64, to generate the 65th token we could drop the first token, and so on. For every generation step after the maximum context length, we are effectively always generating the token at the last position in the sequence (re-using that final token position).
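
Without a KV cache, the "forgetting" approach is just a matter of cropping the prompt. A minimal sketch of what that could look like is below, assuming a hypothetical `model` that maps a `(1, T)` tensor of token ids to `(1, T, vocab_size)` logits; `model`, `max_seq_len`, and the greedy sampling are illustrative assumptions, not this repo's actual API:

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens, max_seq_len):
    # tokens: (1, T) tensor of token ids
    for _ in range(max_new_tokens):
        # keep only the most recent max_seq_len tokens as context
        # ("forget" anything earlier once we exceed the context length)
        context = tokens[:, -max_seq_len:]
        logits = model(context)  # (1, T', vocab_size)
        # greedily pick the next token from the last position's logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```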

This is what the original implementation / fork does when not using KV caching.

This requires either dynamically increasing the size of the cache once we've reached the end of the model's context length, or a clever way to shift the cache positions so that we're pulling the right previous key-values from the cache.
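
For the second option, a rough sketch of a fixed-size "rolling" cache is below. It is only an assumption about how this could be structured (the class name, shapes, and `update` signature are all hypothetical), and it ignores the fact that cached keys were computed with their original position encodings, so with absolute or rotary positions simply shifting entries is not exactly equivalent to recomputing them at shifted positions:

```python
import torch

class RollingKVCache:
    """Hypothetical fixed-size KV cache that shifts out the oldest entry
    once the model's context length is reached (sliding-window style)."""

    def __init__(self, batch, n_heads, max_seq_len, head_dim, dtype=torch.float32):
        self.max_seq_len = max_seq_len
        self.k = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=dtype)
        self.length = 0  # number of valid positions currently stored

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_heads, 1, head_dim) for a single new token
        if self.length < self.max_seq_len:
            pos = self.length
            self.length += 1
        else:
            # cache is full: shift everything left by one, dropping the oldest
            # key/value, and write the new token into the last slot
            self.k[:, :, :-1] = self.k[:, :, 1:].clone()
            self.v[:, :, :-1] = self.v[:, :, 1:].clone()
            pos = self.max_seq_len - 1
        self.k[:, :, pos] = k_new.squeeze(2)
        self.v[:, :, pos] = v_new.squeeze(2)
        # return the valid slice to attend over
        return self.k[:, :, :self.length], self.v[:, :, :self.length]
```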
