I would like to suggest an enhancement for the Transformer architecture by incorporating methodologies from two recent papers: "Fast Transformer Decoding: One Write-Head is All You Need" and "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
The "Fast Transformer Decoding" paper presents a variant of multi-head attention where keys and values are shared across different attention heads. This method greatly reduces the size of these tensors, leading to less memory bandwidth requirements for incremental decoding. This results in faster model decoding with only minor quality degradation from the baseline.
The "FlashAttention" paper addresses the issue of Transformers being slow and memory-hungry on long sequences. It proposes an IO-aware attention algorithm that uses tiling to reduce the number of memory reads/writes between different levels of GPU memory. This method yields an approximate attention algorithm that outperforms any existing approximate attention method, enabling faster training of Transformer models and higher quality models due to longer context sequences.
Incorporating these methods could lead to a significant improvement in the performance and efficiency of the Transformer architecture in this repository.
This issue is related to "Sampling is slow due to repeated calculations #4"
Links to papers:
Fast Transformer Decoding: One Write-Head is All You Need: https://arxiv.org/abs/1911.02150
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: https://arxiv.org/abs/2205.14135