Fixed memory issues during forward #5
Conversation
Commits:
- Added 1-sqrt function for cooldown phase
- add rope_theta to hf conversion script
- …og_ci feat(ci): add trufflehog secrets detection
- …rentiable distributed operations
- Update README.md
- Update README.md
- Add layer-wise activation recomputation to llama model
I reviewed hf's comments and made appropriate adjustments. See: huggingface#203 (comment). Should be ready for review :).

Tried to run it but I get:
Fix _RowLinearAsyncCommunication
Might have to do with the version of the main branch. I will update everything on my end and make sure that it works properly before opening the PR again. And yes, there are some seemingly unrelated changes: my fork is from upstream nanotron, so those are commits pushed to the upstream main branch. We should update our fork to make it easier to review.
Nanotron seems to consume disproportionately more memory for its activations than Megatron. This is due to at least the following factors:

- In the differentiable distributed operations (`DifferentiableAllGather` and `DifferentiableReduceScatterSum`), a new tensor is allocated via `torch.empty`. This tensor is cached through the entire forward pass until the backward pass. However, this cache is unnecessary, as these tensors are not used at all later (rather, they are reconstructed via communication during the backward). Getting rid of these allocations provides significant memory gains.

To fix this, this PR introduces a global memory buffer (a `MemoryBuffer` singleton) that recycles the allocated space, similar to Megatron.

Attached: memory traces of the default nanotron implementation (which OOMs), the current PR implementation, and Megatron. The memory traces represent the first rank of a tp8 pp4 dp1 llama70b run.
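A minimal sketch of how such a recycled buffer could be wired into a differentiable all-gather, assuming a Megatron-style `get_global_memory_buffer()` accessor. The names, signatures, and the `DifferentiableAllGatherSketch` class below are illustrative assumptions, not the PR's actual code:

```python
import torch
import torch.distributed as dist


class MemoryBuffer:
    """Illustrative global buffer: hands out reusable tensors keyed by
    (name, dtype), growing a slot only when a larger shape is requested.
    Sketch only; the PR's real implementation may differ."""

    def __init__(self):
        self._buffers = {}

    def get(self, name, shape, dtype, device):
        numel = 1
        for dim in shape:
            numel *= dim
        key = (name, dtype)
        buf = self._buffers.get(key)
        if buf is None or buf.numel() < numel:
            # Allocate (or grow) the backing storage once; later calls reuse it.
            buf = torch.empty(numel, dtype=dtype, device=device)
            self._buffers[key] = buf
        return buf[:numel].view(*shape)


_GLOBAL_MEMORY_BUFFER = MemoryBuffer()


def get_global_memory_buffer():
    return _GLOBAL_MEMORY_BUFFER


class DifferentiableAllGatherSketch(torch.autograd.Function):
    """Hypothetical all-gather that writes into the recycled buffer
    instead of allocating a fresh torch.empty on every forward call."""

    @staticmethod
    def forward(ctx, tensor, group):
        ctx.group = group
        world_size = dist.get_world_size(group)
        out_shape = (world_size * tensor.shape[0], *tensor.shape[1:])
        # Reused storage: nothing from this call is saved for backward, and the
        # consumer is expected not to keep this tensor alive (it re-gathers in
        # backward if it needs the full tensor again).
        output = get_global_memory_buffer().get(
            "all_gather", out_shape, tensor.dtype, tensor.device
        )
        dist.all_gather_into_tensor(output, tensor.contiguous(), group=group)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        group = ctx.group
        world_size = dist.get_world_size(group)
        grad_input = torch.empty(
            (grad_output.shape[0] // world_size, *grad_output.shape[1:]),
            dtype=grad_output.dtype,
            device=grad_output.device,
        )
        # The gradient of an all-gather is a reduce-scatter of the incoming
        # gradients; nothing cached from the forward pass is needed here.
        dist.reduce_scatter_tensor(grad_input, grad_output.contiguous(), group=group)
        return grad_input, None
```

The point this sketch tries to mirror from the description above: nothing allocated in `forward` has to survive until `backward`, so every layer's all-gather can reuse the same buffer slot, and the backward recreates whatever it needs via communication rather than via cached activations.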