Improve KV cache by allowing it to extend to larger sequence lengths #12

Open

rchan26 opened this issue Aug 16, 2024 · 0 comments
rchan26 commented Aug 16, 2024

Currently, the KV cache implementation only allows generation of sequences up to the maximum context length. However, to generate beyond this, it's possible to just impose a cut-off and sequentially generate each new token by "forgetting" (i.e. not inputting) the earliest tokens in the sequence. In other words, with a maximum context length of 64, to generate the 65th token we could drop the first token, and so on. For every generation step after the maximum context length, we are effectively always generating the token at the last position in the sequence (re-using that final token position).
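
Without a KV cache, the "forgetting" approach is just a matter of cropping the prompt. A minimal sketch of what that could look like is below, assuming a hypothetical `model` that maps a `(1, T)` tensor of token ids to `(1, T, vocab_size)` logits; `model`, `max_seq_len`, and the greedy sampling are illustrative assumptions, not this repo's actual API:

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens, max_seq_len):
    # tokens: (1, T) tensor of token ids
    for _ in range(max_new_tokens):
        # keep only the most recent max_seq_len tokens as context
        # ("forget" anything earlier once we exceed the context length)
        context = tokens[:, -max_seq_len:]
        logits = model(context)  # (1, T', vocab_size)
        # greedily pick the next token from the last position's logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```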

This is what the original implementation / fork does when not using KV caching.

This requires either dynamically increasing the size of the cache once we've reached the end of the model's context length, or a clever way to shift the cache positions so that we're pulling the right previous key-values from the cache.
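
For the second option, a rough sketch of a fixed-size "rolling" cache is below. It is only an assumption about how this could be structured (the class name, shapes, and `update` signature are all hypothetical), and it ignores the fact that cached keys were computed with their original position encodings, so with absolute or rotary positions simply shifting entries is not exactly equivalent to recomputing them at shifted positions:

```python
import torch

class RollingKVCache:
    """Hypothetical fixed-size KV cache that shifts out the oldest entry
    once the model's context length is reached (sliding-window style)."""

    def __init__(self, batch, n_heads, max_seq_len, head_dim, dtype=torch.float32):
        self.max_seq_len = max_seq_len
        self.k = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, n_heads, max_seq_len, head_dim, dtype=dtype)
        self.length = 0  # number of valid positions currently stored

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_heads, 1, head_dim) for a single new token
        if self.length < self.max_seq_len:
            pos = self.length
            self.length += 1
        else:
            # cache is full: shift everything left by one, dropping the oldest
            # key/value, and write the new token into the last slot
            self.k[:, :, :-1] = self.k[:, :, 1:].clone()
            self.v[:, :, :-1] = self.v[:, :, 1:].clone()
            pos = self.max_seq_len - 1
        self.k[:, :, pos] = k_new.squeeze(2)
        self.v[:, :, pos] = v_new.squeeze(2)
        # return the valid slice to attend over
        return self.k[:, :, :self.length], self.v[:, :, :self.length]
```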
