As discussed in ggerganov/llama.cpp#71 (comment)
The idea is to achieve a naive implementation of infinite output generation using a strategy that simply clears the context window (the original prompt can be kept around) and starts adding new tokens again.
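A minimal, self-contained sketch of what that bookkeeping could look like (no real model involved; `fake_next_token`, the token values, and the tiny `N_CTX` are all made up for illustration, and the actual llama-rs API would of course look different):

```rust
// Self-contained sketch of the naive "clear and restart" bookkeeping.
// There is no real model here: `fake_next_token` stands in for sampling,
// so only the context-window management is shown.

const N_CTX: usize = 8; // tiny window so the reset is easy to see

fn fake_next_token(context: &[u32]) -> u32 {
    // Placeholder for the model's sampling step.
    context.last().copied().unwrap_or(0) + 1
}

fn main() {
    let prompt: Vec<u32> = vec![1, 2, 3];
    let mut context = prompt.clone(); // tokens currently in the window

    for _ in 0..20 {
        if context.len() >= N_CTX {
            // Window is full: forget everything except the original prompt.
            // In a real implementation the model would have to re-evaluate
            // `prompt` here; the bigger cost is that all other history is gone.
            context = prompt.clone();
            println!("-- context cleared, kept prompt only --");
        }

        let token = fake_next_token(&context);
        context.push(token);
        println!("generated token {token}, window size {}", context.len());
    }
}
```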
This is a hack that doesn't properly leverage the advantages of the attention mechanism: when the context window gets full, the transformer's hidden state holds information about more than just the last 2048 tokens, because that information is indirectly embedded in the outputs of the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, a lot of information about those tokens will still be encoded at position 25 even after tokens 10 and 12 fall outside the context window.
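As a toy illustration of that last point, here is a single-head attention computation with made-up 2-dimensional keys and values; the only thing it is meant to show is that the output at position 25 is a weighted sum that includes the value vectors of positions 10 and 12, so information about them survives in that output:

```rust
// Toy single-head attention for one query position (dim 2, made-up numbers),
// just to make the "information is carried forward" point concrete.

fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn main() {
    // Pretend these are key/value vectors for positions 0..=25.
    let keys: Vec<[f32; 2]> = (0..26).map(|i| [i as f32 * 0.1, 1.0]).collect();
    let values: Vec<[f32; 2]> = (0..26).map(|i| [i as f32, -(i as f32)]).collect();
    let query_25 = [1.0_f32, 0.5];

    // Attention scores of position 25 against every position up to it.
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query_25[0] * k[0] + query_25[1] * k[1])
        .collect();
    let weights = softmax(&scores);

    // The output at position 25 is a weighted sum over *all* value vectors,
    // so it keeps a contribution from positions 10 and 12 even if those
    // tokens are later dropped from the context window.
    let mut out_25 = [0.0_f32; 2];
    for (w, v) in weights.iter().zip(&values) {
        out_25[0] += w * v[0];
        out_25[1] += w * v[1];
    }
    println!("weight on pos 10: {:.4}, on pos 12: {:.4}", weights[10], weights[12]);
    println!("output at pos 25: {:?}", out_25);
}
```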
A solution that slides the context window would give a gradually "fading" context instead of one where the transformer 100% forgets a word the moment its token falls outside the context. I have some reason to suspect systems like ChatGPT rely on a mechanism like this, based on their ability to consistently recall parts of the conversation that occurred well before the token window was exceeded. However, I'm not knowledgeable enough to figure out whether there's a way to actually make this work, given that the positional encoding used in LLaMA (RoPE) is absolute, not relative.
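For reference, this is roughly what the RoPE rotation looks like for one pair of dimensions (a toy sketch with a made-up frequency, not the model's actual code); the point is only that the rotation angle is a function of the absolute position index, which is what makes a cheap window slide non-obvious:

```rust
// Toy RoPE rotation of a single 2-D pair, to show where the absolute
// position index enters. In the real model each pair of dimensions gets
// its own frequency theta_i = 10000^(-2i/d); here one made-up theta is used.

fn rope_rotate(pair: [f32; 2], pos: usize, theta: f32) -> [f32; 2] {
    let angle = pos as f32 * theta; // angle depends on the absolute position
    let (sin, cos) = angle.sin_cos();
    [
        pair[0] * cos - pair[1] * sin,
        pair[0] * sin + pair[1] * cos,
    ]
}

fn main() {
    let key = [0.3_f32, -0.7];
    // The same key vector gets a different rotation at each position, so a
    // cached key is tied to the index it was computed at; naively shifting
    // every token's index down (to "slide" the window) would leave the
    // cached keys rotated for the wrong positions.
    for pos in [10usize, 12, 2058] {
        println!("pos {pos}: {:?}", rope_rotate(key, pos, 0.01));
    }
}
```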
With the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike because the last few tokens have to be reprocessed. So this is very much non-ideal. However, since llama.cpp has recently implemented this, I feel we should at least add this naive version too until someone figures out a real solution.