
Fixed memory issues during forward #5

Merged
merged 41 commits on Sep 2, 2024

Conversation

@AleHD commented Jun 27, 2024

Nanotron seems to consume disproportionately more memory for its activations than megatron. This is due to at least the following factors:

  • The GLU activation, which is not fused, allocates two tensors: the activation itself and the result of the element-wise multiplication. Fusing this operation provides (relatively small) memory gains; a sketch of such a fusion follows this list.
  • During the differentiable operations (specifically, DifferentiableAllGather and DifferentiableReduceScatterSum), a new tensor is allocated via torch.empty. This tensor is cached for the entire forward pass until the backward pass. However, the cache is unnecessary, as these tensors are never used in backward (they are reconstructed via communication instead). Getting rid of these allocations provides significant memory gains. To fix this, this PR introduces a global memory buffer (a MemoryBuffer singleton) that recycles the allocated space, similar to megatron; see the buffer sketch below the list.
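
To illustrate the first point, here is a minimal sketch of a fused GLU, assuming the SwiGLU variant (silu(gate) * up) used by llama; the class name and fusion strategy are illustrative, not necessarily what this PR implements. The idea is to save only the two input halves for backward and recompute the activation there, so the intermediate activation tensor is not kept alive through the forward pass:

```python
import torch
import torch.nn.functional as F

class FusedSiLUGLU(torch.autograd.Function):
    """Sketch of a fused SwiGLU: only the inputs are saved for backward,
    so the SiLU output is recomputed instead of being cached until backward."""

    @staticmethod
    def forward(ctx, gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(gate, up)
        # The SiLU result is a short-lived temporary; only the product survives.
        return F.silu(gate) * up

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        gate, up = ctx.saved_tensors
        sig = torch.sigmoid(gate)
        silu = gate * sig                     # recompute silu(gate)
        # d silu(x)/dx = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        grad_gate = grad_out * up * sig * (1 + gate * (1 - sig))
        grad_up = grad_out * silu
        return grad_gate, grad_up
```

A jit- or Triton-compiled kernel would additionally fuse the temporary away, but most of the (relatively small) gain comes from not caching the intermediate activation for backward.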

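For the second point, a minimal sketch of a global memory buffer in the spirit of megatron's, assuming a get(name, shape, dtype, device) interface; the actual MemoryBuffer in this PR may differ. Repeated forward passes reuse the same pre-allocated storage instead of calling torch.empty each time:

```python
import operator
from functools import reduce

import torch

class MemoryBuffer:
    """Singleton that hands out reusable scratch tensors, so communication
    ops (all-gather / reduce-scatter) do not allocate a fresh tensor that
    autograd then keeps alive for the whole forward pass."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.buffers = {}
        return cls._instance

    def get(self, name: str, shape, dtype, device) -> torch.Tensor:
        required = reduce(operator.mul, shape, 1)
        key = (name, dtype)
        buf = self.buffers.get(key)
        if buf is None or buf.numel() < required:
            # Allocate (or grow) once; subsequent calls reuse this storage.
            buf = torch.empty(required, dtype=dtype, device=device)
            self.buffers[key] = buf
        return buf[:required].view(*shape)
```

A differentiable all-gather could then write into MemoryBuffer().get("allgather", out_shape, x.dtype, x.device) instead of a fresh torch.empty. Overwriting the buffer on the next call is safe precisely because, as noted above, these tensors are not reused in backward but reconstructed via communication.
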
Attached: memory traces of the default nanotron implementation (which OOMs), this PR's implementation, and megatron. The traces represent the first rank of a tp8 pp4 dp1 llama70b run.

[Memory trace screenshots: default nanotron, this PR, megatron]

@AleHD AleHD marked this pull request as ready for review June 27, 2024 14:44
@AleHD AleHD marked this pull request as draft July 8, 2024 07:47
@AleHD (Author) commented Jul 17, 2024

I reviewed HF's comments and made appropriate adjustments. See: huggingface#203 (comment). Should be ready for review :).

@AleHD AleHD marked this pull request as ready for review July 17, 2024 13:48
@ischlag commented Jul 29, 2024

Tried to run it, but I get:

AttributeError: 'dict' object has no attribute '_is_using_mup'
    self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)
Note: I used the 70B config from the pretrain repo, but I don't see any meaningful differences in the config. Also, there are a lot of other changes in this PR that seem unrelated to the topic..?

@AleHD AleHD marked this pull request as draft July 30, 2024 08:42
@AleHD (Author) commented Jul 30, 2024

This might have to do with the version of the main branch. I will update everything on my end and make sure it works properly before reopening the PR. And yes, there are some seemingly unrelated changes: my fork is based on upstream nanotron, so those are commits that were pushed to the upstream main branch. We should update our fork to make this easier to review.

@AleHD AleHD marked this pull request as ready for review July 31, 2024 15:44
@ischlag ischlag merged commit 4eb520f into swiss-ai:main Sep 2, 2024
1 of 3 checks passed