large memory usage #23
Comments
The flash attention results are somewhat amazing... I'll keep an eye on this.
Thanks, I'll check it.
After reading the code, I found that ring attention expects already-chunked qkv rather than the whole qkv. That is, qkv should be split into local chunks before being fed into ring attention. This might explain the difference. I'm not entirely certain this is true, though.
@LzhinFdu @GeneZC Yeah, you need to shard the sequence yourself before feeding it into ring-flash-attention, e.g. as in the sketch below.
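A minimal sketch of that per-rank sharding step, assuming the `ring_flash_attn_qkvpacked_func` interface mentioned in this issue; the import path and keyword arguments are assumptions and may differ from the actual API:

```python
# Per-rank sequence sharding before calling ring attention.
# Import path and kwargs are assumptions, not the confirmed API.
import torch
import torch.distributed as dist

from ring_flash_attn import ring_flash_attn_qkvpacked_func  # assumed import path

dist.init_process_group("nccl")  # launched via torchrun
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

batch, seqlen, nheads, headdim = 1, 8192, 32, 128
assert seqlen % world_size == 0

# Full packed qkv: (batch, seqlen, 3, nheads, headdim).
qkv = torch.randn(batch, seqlen, 3, nheads, headdim,
                  dtype=torch.bfloat16, device=f"cuda:{rank}")

# Keep only this rank's local chunk of the sequence -- this is the
# "split into local qkv" step discussed above.
local_qkv = qkv.chunk(world_size, dim=1)[rank].contiguous()
local_qkv.requires_grad_()

out = ring_flash_attn_qkvpacked_func(local_qkv, causal=True)
print(f"rank {rank}: local qkv {tuple(local_qkv.shape)}, out {tuple(out.shape)}")
```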
That's right. Therefore, when comparing memory usage, the context length used for flash attention should be doubled (under ring attention each of the two GPUs only holds its local half of the sequence). Despite this, flash attention still maintains a significant lead.
Thanks for sharing this excellent implementation of ring attention.
Here are my test results on 2×A100 (with NVLink). Judging from the results, the memory usage of ring attention (ring_flash_attn_qkvpacked_func) seems to be very large, which is not what I expected. Are there any possible problems?
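For what it's worth, a minimal way to record the per-rank peak memory behind numbers like these, using standard torch.cuda statistics; `measure_peak_memory` is a hypothetical helper, not part of either library:

```python
# Hypothetical helper to record per-rank peak GPU memory around a call.
import torch

def measure_peak_memory(fn, *args, **kwargs):
    torch.cuda.reset_peak_memory_stats()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return out, peak_gb

# e.g., with local_qkv from the sharding sketch above:
# out, peak = measure_peak_memory(ring_flash_attn_qkvpacked_func,
#                                 local_qkv, causal=True)
# print(f"peak memory on this rank: {peak:.2f} GB")
```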