As discussed in ggerganov/llama.cpp#71 (comment)
The idea is to achieve a naive implementation of infinite output generation using a strategy that simply clears the context window (the original prompt can be kept around) and starts adding new tokens again.
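A minimal, self-contained sketch of what that bookkeeping could look like (no real model involved; `fake_next_token`, the token values, and the tiny `N_CTX` are all made up for illustration, and the actual llama-rs API would of course look different):

```rust
// Self-contained sketch of the naive "clear and restart" bookkeeping.
// There is no real model here: `fake_next_token` stands in for sampling,
// so only the context-window management is shown.

const N_CTX: usize = 8; // tiny window so the reset is easy to see

fn fake_next_token(context: &[u32]) -> u32 {
    // Placeholder for the model's sampling step.
    context.last().copied().unwrap_or(0) + 1
}

fn main() {
    let prompt: Vec<u32> = vec![1, 2, 3];
    let mut context = prompt.clone(); // tokens currently in the window

    for _ in 0..20 {
        if context.len() >= N_CTX {
            // Window is full: forget everything except the original prompt.
            // In a real implementation the model would have to re-evaluate
            // `prompt` here; the bigger cost is that all other history is gone.
            context = prompt.clone();
            println!("-- context cleared, kept prompt only --");
        }

        let token = fake_next_token(&context);
        context.push(token);
        println!("generated token {token}, window size {}", context.len());
    }
}
```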
This is a hack that doesn't properly leverage the advantages of the attention mechanism: when the context window gets full, the transformer's hidden state holds information about more than just the last 2048 tokens, because that information is indirectly embedded in the outputs of the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, a lot of information about those tokens will still be encoded at position 25 even after tokens 10 and 12 fall outside the context window.
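As a toy illustration of that last point, here is a single-head attention computation with made-up 2-dimensional keys and values; the only thing it is meant to show is that the output at position 25 is a weighted sum that includes the value vectors of positions 10 and 12, so information about them survives in that output:

```rust
// Toy single-head attention for one query position (dim 2, made-up numbers),
// just to make the "information is carried forward" point concrete.

fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn main() {
    // Pretend these are key/value vectors for positions 0..=25.
    let keys: Vec<[f32; 2]> = (0..26).map(|i| [i as f32 * 0.1, 1.0]).collect();
    let values: Vec<[f32; 2]> = (0..26).map(|i| [i as f32, -(i as f32)]).collect();
    let query_25 = [1.0_f32, 0.5];

    // Attention scores of position 25 against every position up to it.
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query_25[0] * k[0] + query_25[1] * k[1])
        .collect();
    let weights = softmax(&scores);

    // The output at position 25 is a weighted sum over *all* value vectors,
    // so it keeps a contribution from positions 10 and 12 even if those
    // tokens are later dropped from the context window.
    let mut out_25 = [0.0_f32; 2];
    for (w, v) in weights.iter().zip(&values) {
        out_25[0] += w * v[0];
        out_25[1] += w * v[1];
    }
    println!("weight on pos 10: {:.4}, on pos 12: {:.4}", weights[10], weights[12]);
    println!("output at pos 25: {:?}", out_25);
}
```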
A solution that slides the context window would give a gradually "fading" context instead of one where the transformer 100% forgets a word the moment its token falls outside the context. I have some reason to suspect systems like ChatGPT rely on a mechanism like this, based on their ability to consistently recall parts of the conversation that occurred well before the token window was exceeded. However, I'm not knowledgeable enough to figure out whether there's a way to actually make this work, given that the positional encoding used in LLaMA (RoPE) is absolute, not relative.
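For reference, this is roughly what the RoPE rotation looks like for one pair of dimensions (a toy sketch with a made-up frequency, not the model's actual code); the point is only that the rotation angle is a function of the absolute position index, which is what makes a cheap window slide non-obvious:

```rust
// Toy RoPE rotation of a single 2-D pair, to show where the absolute
// position index enters. In the real model each pair of dimensions gets
// its own frequency theta_i = 10000^(-2i/d); here one made-up theta is used.

fn rope_rotate(pair: [f32; 2], pos: usize, theta: f32) -> [f32; 2] {
    let angle = pos as f32 * theta; // angle depends on the absolute position
    let (sin, cos) = angle.sin_cos();
    [
        pair[0] * cos - pair[1] * sin,
        pair[0] * sin + pair[1] * cos,
    ]
}

fn main() {
    let key = [0.3_f32, -0.7];
    // The same key vector gets a different rotation at each position, so a
    // cached key is tied to the index it was computed at; naively shifting
    // every token's index down (to "slide" the window) would leave the
    // cached keys rotated for the wrong positions.
    for pos in [10usize, 12, 2058] {
        println!("pos {pos}: {:?}", rope_rotate(key, pos, 0.01));
    }
}
```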
With the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike because the last few tokens have to be reprocessed. So this is very much non-ideal. However, since llama.cpp has recently implemented this, I feel we should at least add this naive version too until someone figures out a real solution.