I would like to suggest an enhancement for the Transformer architecture by incorporating methodologies from two recent papers: "Fast Transformer Decoding: One Write-Head is All You Need" and "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
The "Fast Transformer Decoding" paper presents a variant of multi-head attention where keys and values are shared across different attention heads. This method greatly reduces the size of these tensors, leading to less memory bandwidth requirements for incremental decoding. This results in faster model decoding with only minor quality degradation from the baseline.
The "FlashAttention" paper addresses the issue of Transformers being slow and memory-hungry on long sequences. It proposes an IO-aware attention algorithm that uses tiling to reduce the number of memory reads/writes between different levels of GPU memory. This method yields an approximate attention algorithm that outperforms any existing approximate attention method, enabling faster training of Transformer models and higher quality models due to longer context sequences.
Incorporating these methods could lead to a significant improvement in the performance and efficiency of the Transformer architecture in this repository.
This issue is related to "Sampling is slow due to repeated calculations #4"
Links to papers:
Fast Transformer Decoding: One Write-Head is All You Need: https://arxiv.org/abs/1911.02150
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: https://arxiv.org/abs/2205.14135