Reduce memory use through better attention? #73

Open
johnml1135 opened this issue Nov 27, 2023 · 5 comments
@johnml1135
Collaborator

https://pytorch.org/blog/out-of-the-box-acceleration/

Are we utilizing the "fused kernels from FlashAttention and memory-efficient attention"? Can we? We may be able to achieve significant speedups or save significant memory that way.
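
For reference, this is roughly what opting into the fused kernels looks like at the PyTorch level (a minimal sketch, assuming torch >= 2.0/2.1 on a CUDA GPU; the tensor shapes are illustrative):

```python
# Minimal sketch (assumes torch >= 2.0 and a CUDA GPU). PyTorch routes
# scaled_dot_product_attention to the FlashAttention or memory-efficient
# fused kernels when the inputs allow it.
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim), half precision.
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the fused kernels so we can confirm they are actually
# being used; this errors out if neither FlashAttention nor the
# memory-efficient kernel can handle the inputs.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True, enable_math=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```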

@johnml1135 johnml1135 added this to Serval Dec 2, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Dec 2, 2023
@johnml1135 johnml1135 added this to the Serval API 1.2 milestone Dec 2, 2023
@johnml1135 johnml1135 moved this from 🆕 New to 🔖 Ready in Serval Dec 2, 2023
@mshannon-sil
Collaborator

This seems like it would be pretty useful for our purposes since we're training large models. The PyTorch blog says they're seeing GPU memory savings of 20%-110% during training, speedups of 10%-70% during training, and speedups of 5%-20% during inference. I'd like to test this in SILNLP, but it's still using torch 1.10 rather than torch 2.0, which is required for the accelerated attention mechanism in BetterTransformer. I think it would be best for me to make a new branch in SILNLP with torch 2.0 and test BetterTransformer there. If it does provide a significant speed/memory improvement for our models, I'd imagine we'd want to upgrade the master branch of SILNLP to torch 2.0 and BetterTransformer too.
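
For context, a sketch of what that test could look like (assuming torch >= 2.0 and the `optimum` package are installed; the checkpoint name is illustrative, and whether M2M100 is covered depends on the transformers/optimum versions):

```python
# Sketch of enabling BetterTransformer for an HF seq2seq model.
# Assumes torch >= 2.0 and `pip install optimum`; the checkpoint name
# is illustrative and M2M100 coverage depends on the library versions.
from transformers import AutoModelForSeq2SeqLM
from optimum.bettertransformer import BetterTransformer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")

# Swap the attention/encoder layers for the torch 2.0
# scaled_dot_product_attention fast path.
model = BetterTransformer.transform(model, keep_original_model=False)
```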

@mshannon-sil
Collaborator

The upgrades from BetterTransformer will be incorporated into future versions of transformers and should eventually cover the M2M100 model that we use.

@johnml1135
Collaborator Author

This should happen automatically once it is supported natively in Transformers.

@johnml1135
Collaborator Author

@mshannon-sil - is this resolved, if I understand correctly? Are we using FlashAttention, or did something else get in the way of using it?

@ddaspit
Contributor

ddaspit commented Dec 5, 2024

We haven't updated to the version of HF transformers that supports SDPA yet.
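
For reference, once we are on a new enough transformers release, opting into SDPA would look roughly like this (a sketch only; the minimum required version and the checkpoint name are assumptions):

```python
# Sketch only: requires a transformers release in which M2M100 supports SDPA.
# The checkpoint name is illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M",
    torch_dtype=torch.float16,   # fused kernels are most effective in fp16/bf16
    attn_implementation="sdpa",  # route attention through torch's scaled_dot_product_attention
)
```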
