
Enhancement Suggestion: Incorporate Fast Transformer Decoding and Flash Attention into Transformer Architecture #12

Open
davidfitzek opened this issue Jun 5, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@davidfitzek
Collaborator

I would like to suggest an enhancement for the Transformer architecture by incorporating methodologies from two recent papers: "Fast Transformer Decoding: One Write-Head is All You Need" and "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".

The "Fast Transformer Decoding" paper presents a variant of multi-head attention where keys and values are shared across different attention heads. This method greatly reduces the size of these tensors, leading to less memory bandwidth requirements for incremental decoding. This results in faster model decoding with only minor quality degradation from the baseline.

The "FlashAttention" paper addresses the issue of Transformers being slow and memory-hungry on long sequences. It proposes an IO-aware attention algorithm that uses tiling to reduce the number of memory reads/writes between different levels of GPU memory. This method yields an approximate attention algorithm that outperforms any existing approximate attention method, enabling faster training of Transformer models and higher quality models due to longer context sequences.

Incorporating these methods could lead to a significant improvement in the performance and efficiency of the Transformer architecture in this repository.

This issue is related to "Sampling is slow due to repeated calculations #4"
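
As a rough illustration of how the two issues connect, the sketch below (hypothetical helper and names, multi-query layout as above) caches keys and values so each decoding step only computes them for the newest token, avoiding the repeated calculations described in #4; with the shared K/V head the cache is `n_heads` times smaller than under standard multi-head attention.

```python
# Hypothetical incremental-decoding helper: append the newest token's shared
# K/V to a running cache instead of recomputing K/V for the whole prefix.
import torch

@torch.no_grad()
def append_kv(x_t, k_proj, v_proj, cache_k, cache_v):
    """x_t: (b, 1, d_model), newest token only; caches: (b, 1, t_so_far, d_head)."""
    k_t = k_proj(x_t).unsqueeze(1)              # (b, 1, 1, d_head), shared by all heads
    v_t = v_proj(x_t).unsqueeze(1)
    cache_k = torch.cat([cache_k, k_t], dim=2)  # reuse past keys, compute only the new one
    cache_v = torch.cat([cache_v, v_t], dim=2)
    return cache_k, cache_v
```

With standard multi-head attention the cached tensors would instead have shape `(b, n_heads, t, d_head)`, which is exactly the memory-bandwidth cost the "Fast Transformer Decoding" paper targets.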

Links to papers:

  1. Fast Transformer Decoding: One Write-Head is All You Need: https://arxiv.org/abs/1911.02150
  2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: https://arxiv.org/abs/2205.14135
@davidfitzek added the enhancement (New feature or request) label on Jun 5, 2023