Reduce memory use through better attention? #73
Comments
This seems like it would be pretty useful for our purposes since we're training large models. The PyTorch blog reports GPU memory savings of 20%-110% and speedups of 10%-70% during training, and speedups of 5%-20% during inference. I'd like to test this in SILNLP, but SILNLP is still on torch 1.10 rather than torch 2.0, which is required for the accelerated attention mechanism in BetterTransformer. I think it would be best for me to make a new branch of SILNLP with torch 2.0 and test BetterTransformer there. If it provides a significant speed/memory improvement for our models, I'd imagine we'd want to upgrade the master branch of SILNLP to torch 2.0 and BetterTransformer as well.
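For the branch experiment, the conversion would look roughly like this (a sketch, assuming torch >= 2.0, the `optimum` package, and that M2M100 is covered by the installed BetterTransformer version; the checkpoint name is just illustrative):

```python
# Sketch: converting a Hugging Face model to BetterTransformer via optimum.
# Assumes torch >= 2.0 and that the architecture is supported by BetterTransformer.
from transformers import AutoModelForSeq2SeqLM
from optimum.bettertransformer import BetterTransformer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M")  # illustrative checkpoint

# Swap the vanilla attention modules for the BetterTransformer fast-path kernels.
model = BetterTransformer.transform(model, keep_original_model=False)

# ... train or run inference as usual ...

# Convert back to the original layout before saving the checkpoint.
# model = BetterTransformer.reverse(model)
```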
The upgrades from BetterTransformer will be incorporated into future versions of transformers and should eventually cover the M2M100 model that we use.
This should happen automatically once it is integrated natively into Transformers.
@mshannon-sil - am I correct that this is resolved? Are we using flash attention (or did something else get in the way of using it)?
We haven't updated to the version of HF transformers that supports SDPA yet.
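Once we do upgrade, enabling it should be a one-line change at load time (a sketch, assuming transformers >= 4.36, where the `attn_implementation` argument exists, and that M2M100 has SDPA support in that release; the checkpoint name is illustrative):

```python
# Sketch: loading a model with SDPA-based attention in newer transformers releases.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M",       # illustrative checkpoint
    torch_dtype=torch.float16,
    attn_implementation="sdpa",   # route attention through torch.nn.functional.scaled_dot_product_attention
)
```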
https://pytorch.org/blog/out-of-the-box-acceleration/
Are we utilizing the "fused kernels from FlashAttention and Memory-efficient attention"? Can we? We may be able to get significant speedups or save significant memory that way. A rough sketch of what those kernels look like is below.
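For reference, this is roughly what the blog post is describing (a minimal sketch with illustrative shapes/dtypes, assuming a CUDA device and torch 2.0, where `scaled_dot_product_attention` and the `sdp_kernel` context manager are available):

```python
# Sketch: PyTorch 2.0 fused attention kernels.
# scaled_dot_product_attention dispatches to FlashAttention or the memory-efficient
# kernel when the inputs allow it; the context manager restricts which backends are eligible.
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) - illustrative sizes; fp16 on CUDA enables the fused paths.
q = torch.randn(8, 16, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn(8, 16, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.randn(8, 16, 1024, 64, dtype=torch.float16, device="cuda")

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True, enable_math=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```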