docs/source/features/optimizations/activation_recomputation.rst

Activation Recomputation
========================

The input activations of network layers are stored in device memory and are used to compute gradients during back-propagation. When training an LLM with a long sequence length or a large micro-batch size, these input activations can quickly saturate device memory. Checkpointing a few activations and recomputing the rest is a common technique to reduce device memory usage.

Transformer Layer Recomputation
-------------------------------

NeMo supports transformer layer recomputation, which checkpoints the input of each transformer layer and recomputes the activations for the remaining layers. This technique significantly reduces activation memory usage. However, it increases the per-layer computation cost by roughly 30%, which comes from re-executing each layer's entire forward computation.
NeMo also supports partial transformer layer recomputation, which is beneficial when recomputing only a few transformer layers frees enough GPU memory for the model to fit. This approach avoids recomputing the remaining layers.

Transformer layer recomputation is enabled by setting ``activations_checkpoint_granularity=full``.
The number of transformer layers to recompute can be set using ``activations_checkpoint_num_layers`` along with ``activations_checkpoint_method=block``.
If you set ``activations_checkpoint_num_layers`` to the total number of layers, the inputs of all transformer layers are checkpointed and recomputed.
When training with pipeline parallelism, ``activations_checkpoint_num_layers`` indicates the number of layers per pipeline stage.
When using virtual pipelining, it specifies the number of layers per virtual pipeline stage.
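The flags above can be sketched as follows. This is an illustrative sketch only: the flag names come from the text, but the dict is not NeMo's actual config schema, and the helper is a hypothetical illustration of how the per-stage count is interpreted.

```python
# Hypothetical config fragment; flag names follow the text above,
# the surrounding structure is illustrative, not NeMo's real schema.
full_recompute_cfg = {
    "activations_checkpoint_granularity": "full",
    "activations_checkpoint_method": "block",
    # Under pipeline parallelism this counts layers *per pipeline stage*.
    "activations_checkpoint_num_layers": 4,
}

def recomputed_layers_on_stage(cfg: dict, layers_on_stage: int) -> int:
    """Number of layer inputs checkpointed on one pipeline stage,
    clamped to the layers that stage actually holds (assumption)."""
    return min(cfg["activations_checkpoint_num_layers"], layers_on_stage)
```

For example, with 8 layers on a stage this configuration checkpoints 4 of them; with only 2 layers on a stage, it checkpoints both.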
NeMo also supports checkpointing the input to a block of multiple consecutive transformer layers, meaning that a block of transformer layers becomes the recomputation granularity. This approach can further save activation memory at the cost of increased recomputation buffer memory. Thus, it is only beneficial when the model has many transformer layers or when the intermediate layers of a transformer layer hold relatively small activation stores.
This recomputation mode can be enabled by setting ``activations_checkpoint_method=uniform``, with the number of transformer layers per recomputation block set using ``activations_checkpoint_num_layers``.
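The difference between the two methods can be sketched as follows. This is not NeMo code: which specific layers the ``block`` method picks is an illustrative assumption (here, the first ``n`` layers); the sketch only shows the granularity difference described above.

```python
def checkpointed_blocks(num_layers: int, method: str, n: int):
    """Illustrative sketch of which transformer-layer inputs are
    checkpointed, returned as (first_layer, last_layer) spans.

    - 'block':   checkpoint the input of `n` individual layers
                 (assumed here to be the first `n`); the rest are
                 not recomputed.
    - 'uniform': group all layers into blocks of `n` consecutive
                 layers and checkpoint only each block's input.
    """
    if method == "block":
        return [(i, i) for i in range(n)]
    if method == "uniform":
        return [(start, min(start + n, num_layers) - 1)
                for start in range(0, num_layers, n)]
    raise ValueError(f"unknown method: {method}")
```

With 8 layers, `method="block", n=2` checkpoints two single-layer spans, while `method="uniform", n=4` covers all 8 layers in two 4-layer blocks, trading a larger recomputation buffer for fewer checkpoints.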
Self-attention Recomputation
----------------------------

NeMo supports self-attention recomputation, which checkpoints the inputs of each self-attention block and recomputes the intermediate activations.
This cost-efficient method achieves high memory savings with minimal recomputation cost.
The intermediate layers of the self-attention block account for the majority of the activation memory.
This is because the input sizes of the softmax, dropout, and QKV dot-product attention layers grow quadratically with the sequence length.
However, their recomputation cost is small compared with that of the linear projection layers, which grows quadratically with the hidden size.
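The quadratic-versus-linear scaling can be made concrete with a back-of-the-envelope element count; the functions below are an illustration of the asymptotics, not NeMo's memory accounting.

```python
def attention_scores_elems(batch: int, heads: int, seq_len: int) -> int:
    # Softmax/dropout inputs in self-attention: one (seq_len x seq_len)
    # score matrix per head per sequence -> O(seq_len^2) memory.
    return batch * heads * seq_len * seq_len

def linear_proj_elems(batch: int, seq_len: int, hidden: int) -> int:
    # A linear projection's activation is (seq_len x hidden) per
    # sequence -> memory grows only linearly with seq_len.
    return batch * seq_len * hidden
```

Doubling the sequence length quadruples the attention-score activations but only doubles the projection activations, which is why recomputing just the attention internals recovers most of the memory at little cost.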

Self-attention recomputation is hard-enabled when using FlashAttention, which is supported in Transformer Engine.
You can also use self-attention recomputation without FlashAttention by setting ``activations_checkpoint_granularity=selective``.

Scheme of full and selective checkpointing granularity:

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-2.jpg
    :align: center
    :alt: activation-recomputation-example-2
    :scale: 50%

Scheme of uniform and block checkpointing method (full checkpointing granularity):

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-1.jpg
    :align: center
    :alt: activation-recomputation-example-1
    :scale: 50%