Get 70B in our fork working with pp4, tp4, dp>1 #19
Comments
A simple mod in pp split strategy. #20
I ran a few tests. With acc=16 it does not OOM when using ZeRO-1 for dp>1; however, with acc=128 it does OOM for dp=2 and dp=4, but for dp=8 it works again...
It sounds like some activation memory is not released in time. Can we run a small-scale experiment, e.g. TP=1, PP=4, DP=2,4,..., and get a memory snapshot from PyTorch? Unfortunately, this functionality is only available on x86 machines.
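For reference, a minimal sketch of capturing such a snapshot with PyTorch's built-in allocator recorder (assuming PyTorch >= 2.1 on an x86 node, as noted above; `train_step` here is a placeholder, not our actual training loop):

```python
import torch

def train_step():
    # Placeholder for one real TP=1, PP=4, DP=2 iteration; here it just
    # allocates and frees a large tensor so the snapshot has some content.
    x = torch.randn(1024, 1024, 128, device="cuda")
    del x

# Start recording allocator events (with stack traces for each alloc/free).
torch.cuda.memory._record_memory_history(max_entries=100_000)

for _ in range(5):
    train_step()

# Dump a snapshot that can be inspected at https://pytorch.org/memory_viz,
# then stop recording.
torch.cuda.memory._dump_snapshot("tp1_pp4_dp2_memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```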
No memory profile because of the x86 requirement, but I agree that the most likely cause is an activation-memory bug: in the DP=1 case the max memory is 50.2 GB and each GPU has almost 100 GB, so switching to DP=2 shouldn't double the memory.
I'll try to get a memory snapshot on Bristen, since we should have x86 machines there.
Any update on this? Has someone committed to it? I saw this PR, @C-TC, does it solve this issue?
Using our launcher and the latest pull of our pretrain repo you can run a Llama3 70B model as follows. Thanks to @AleHD for getting activation recompute and async working.
(The use of the shell variables is to not have to set them twice and ensure config always matches the name.)
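The exact launch command isn't reproduced here, but as a purely hypothetical sketch of the shell-variable idea (define the sizes once, then derive both the run name and the config from them so they can never drift apart), written in Python rather than shell and assuming nanotron-style config keys:

```python
# Hypothetical sketch only: not our actual launcher CLI or its flags.
# Set the parallelism sizes once; reuse them for the run name and the config.
TP, PP, DP = 4, 4, 1
ACC = 64  # batch_accumulation_per_replica (example value)

run_name = f"llama3-70b-tp{TP}-pp{PP}-dp{DP}-acc{ACC}"
overrides = {
    "parallelism.tp": TP,
    "parallelism.pp": PP,
    "parallelism.dp": DP,
    "tokens.batch_accumulation_per_replica": ACC,
    "general.run": run_name,
}
print(run_name)
print(overrides)
```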
This should work and reach about 780 tokens per second per GPU. Throughput will be higher with batch_accumulation_per_replica=64 and lower with batch_accumulation_per_replica=16, which we can use to trade off wall time against efficiency.
The problem is that 70B just barely fits into four Tödi nodes with dp=1, pp=4 and tp=4. It's so close that it OOMs with dp>1.
Looking at memory usage, we can see that each pipeline stage from first to last uses about 10 GB less memory than the previous one, so the OOM happens on the node hosting the first PP stage. In nanotron, the splitting of the layers across stages is done automatically in nanotron/src/nanotron/models/llama.py, line 787 (commit 67332b4).
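For context, here is a hypothetical sketch (not the code referenced above, and not PR #20) of the idea behind an uneven split: bias a few transformer blocks away from the memory-heavy first stage toward the last one.

```python
def split_layers(num_layers: int, pp_size: int, first_stage_relief: int = 2):
    """Hypothetical uneven pipeline split: the first stage gets a few fewer
    transformer blocks because it also holds the most activation memory."""
    base, rem = divmod(num_layers, pp_size)
    counts = [base] * pp_size
    # Hand any remainder to the later stages.
    for i in range(rem):
        counts[-(i + 1)] += 1
    # Move `first_stage_relief` blocks from the first stage to the last.
    shift = min(first_stage_relief, counts[0] - 1)
    counts[0] -= shift
    counts[-1] += shift
    assert sum(counts) == num_layers
    return counts

# Llama3 70B has 80 transformer blocks; with PP=4 this gives [18, 20, 20, 22].
print(split_layers(80, 4))
```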