
Fixed memory issues during forward #5

Merged
merged 41 commits on Sep 2, 2024

Conversation

@AleHD commented Jun 27, 2024

Nanotron seems to consume disproportionately more memory for its activations than megatron. This is due to at least the following factors:

  • The GLU activation, which is not fused, allocates two tensors: the activation itself and the result of the element-wise multiplication. Fusing this operation provides (relatively small) memory gains; a sketch of such a fusion follows this list.
  • During the differentiable operations (specifically, DifferentiableAllGather and DifferentiableReduceScatterSum), a new tensor is allocated via torch.empty. This tensor is cached for the entire forward pass until the backward pass. However, the cache is unnecessary, as these tensors are never used in backward (they are reconstructed via communication instead). Getting rid of these allocations provides significant memory gains. To fix this, this PR introduces a global memory buffer (a MemoryBuffer singleton) that recycles the allocated space, similar to megatron; see the buffer sketch below the list.
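
To illustrate the first point, here is a minimal sketch of a fused GLU, assuming the SwiGLU variant (silu(gate) * up) used by llama; the class name and fusion strategy are illustrative, not necessarily what this PR implements. The idea is to save only the two input halves for backward and recompute the activation there, so the intermediate activation tensor is not kept alive through the forward pass:

```python
import torch
import torch.nn.functional as F

class FusedSiLUGLU(torch.autograd.Function):
    """Sketch of a fused SwiGLU: only the inputs are saved for backward,
    so the SiLU output is recomputed instead of being cached until backward."""

    @staticmethod
    def forward(ctx, gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(gate, up)
        # The SiLU result is a short-lived temporary; only the product survives.
        return F.silu(gate) * up

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        gate, up = ctx.saved_tensors
        sig = torch.sigmoid(gate)
        silu = gate * sig                     # recompute silu(gate)
        # d silu(x)/dx = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        grad_gate = grad_out * up * sig * (1 + gate * (1 - sig))
        grad_up = grad_out * silu
        return grad_gate, grad_up
```

A jit- or Triton-compiled kernel would additionally fuse the temporary away, but most of the (relatively small) gain comes from not caching the intermediate activation for backward.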

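For the second point, a minimal sketch of a global memory buffer in the spirit of megatron's, assuming a get(name, shape, dtype, device) interface; the actual MemoryBuffer in this PR may differ. Repeated forward passes reuse the same pre-allocated storage instead of calling torch.empty each time:

```python
import operator
from functools import reduce

import torch

class MemoryBuffer:
    """Singleton that hands out reusable scratch tensors, so communication
    ops (all-gather / reduce-scatter) do not allocate a fresh tensor that
    autograd then keeps alive for the whole forward pass."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.buffers = {}
        return cls._instance

    def get(self, name: str, shape, dtype, device) -> torch.Tensor:
        required = reduce(operator.mul, shape, 1)
        key = (name, dtype)
        buf = self.buffers.get(key)
        if buf is None or buf.numel() < required:
            # Allocate (or grow) once; subsequent calls reuse this storage.
            buf = torch.empty(required, dtype=dtype, device=device)
            self.buffers[key] = buf
        return buf[:required].view(*shape)
```

A differentiable all-gather could then write into MemoryBuffer().get("allgather", out_shape, x.dtype, x.device) instead of a fresh torch.empty. Overwriting the buffer on the next call is safe precisely because, as noted above, these tensors are not reused in backward but reconstructed via communication.
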
Attached: memory traces of the default nanotron implementation (which OOMs), this PR's implementation, and megatron. The traces represent the first rank of a tp8 pp4 dp1 llama70b run.

[Memory trace screenshots: default nanotron, this PR, megatron]

@AleHD AleHD marked this pull request as ready for review June 27, 2024 14:44
@AleHD AleHD marked this pull request as draft July 8, 2024 07:47
@AleHD (Author) commented Jul 17, 2024

I reviewed HF's comments and made appropriate adjustments. See: huggingface#203 (comment). Should be ready for review :).

@AleHD AleHD marked this pull request as ready for review July 17, 2024 13:48
@ischlag commented Jul 29, 2024

Tried to run it, but I get:

AttributeError: 'dict' object has no attribute '_is_using_mup'
    self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)
Note: I used the 70B config from the pretrain repo, but I don't see any meaningful differences in the config. Also, there are a lot of other changes in this PR that seem unrelated to the topic..?

@AleHD AleHD marked this pull request as draft July 30, 2024 08:42
@AleHD (Author) commented Jul 30, 2024

This might have to do with the version of the main branch. I will update everything on my end and make sure it works properly before reopening the PR. And yes, there are some seemingly unrelated changes: my fork is based on upstream nanotron, so those are commits that were pushed to the upstream main branch. We should update our fork to make this easier to review.

@AleHD AleHD marked this pull request as ready for review July 31, 2024 15:44
@ischlag ischlag merged commit 4eb520f into swiss-ai:main Sep 2, 2024
1 of 3 checks passed