diff --git a/docs/source/features/memory_optimizations.rst b/docs/source/features/memory_optimizations.rst
index 4d363670fedf..1fe8215864a9 100644
--- a/docs/source/features/memory_optimizations.rst
+++ b/docs/source/features/memory_optimizations.rst
@@ -105,3 +105,71 @@ Implement MQA or GQA
 NeMo's support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class:
 
 Check implementation details from Attention Class in Megatron Core Repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49
+
+
+CPU Offloading
+--------------
+
+Overview
+^^^^^^^^
+
+CPU offloading in NeMo reduces peak GPU memory usage by moving activations and inactive weights to CPU storage. Offloading operates at the transformer-layer level: users specify how many transformer layers of their language model to offload. During the forward pass, NeMo offloads activations at the optimal time and reloads them as needed during the backward pass; a generic sketch of this technique closes this section.
+
+Features
+^^^^^^^^
+
+* Supports training models with long sequence lengths by managing activation memory efficiently.
+* Enables large batch sizes per GPU by offloading activation memory.
+* Overlaps computation with host-to-device and device-to-host data transfers during offloading and reloading.
+
+Usage
+^^^^^
+
+* Set ``cpu_offloading`` to ``True`` to enable CPU offloading.
+* Set ``cpu_offloading_num_layers`` to a value between 0 and the total number of layers in the model minus one.
+* Set ``cpu_offloading_activations`` and ``cpu_offloading_weights`` to offload activations only, weights only, or both; a configuration sketch follows this list.
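+
+As an illustration, the fragment below sketches how these options might be set with OmegaConf, the configuration library NeMo builds on. The option names come from the list above; wrapping them in a standalone ``model_cfg`` object and choosing ``11`` offloaded layers are illustrative assumptions, not prescriptions.
+
+.. code-block:: python
+
+   from omegaconf import OmegaConf
+
+   # Sketch of the relevant model-config fragment. The number of offloaded
+   # layers (11) is an arbitrary example value; it must stay between 0 and
+   # the total number of layers in the model minus one.
+   model_cfg = OmegaConf.create(
+       {
+           "cpu_offloading": True,              # enable the feature
+           "cpu_offloading_num_layers": 11,     # how many layers to offload
+           "cpu_offloading_activations": True,  # offload activations
+           "cpu_offloading_weights": True,      # offload inactive weights
+       }
+   )
+   print(OmegaConf.to_yaml(model_cfg))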
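+
+For readers who want a feel for the underlying technique, the self-contained PyTorch sketch below offloads saved activations to the CPU during the forward pass and reloads them during the backward pass. It illustrates the general mechanism only; it is not NeMo's implementation, which works per transformer layer and overlaps transfers with compute.
+
+.. code-block:: python
+
+   import torch
+
+   def pack_to_cpu(tensor):
+       # Forward pass: each tensor saved for backward is stashed in CPU
+       # memory, together with the device it came from.
+       return tensor.device, tensor.to("cpu")
+
+   def unpack_from_cpu(packed):
+       # Backward pass: reload the activation onto its original device.
+       device, tensor = packed
+       return tensor.to(device)
+
+   model = torch.nn.Linear(1024, 1024).cuda()
+   x = torch.randn(8, 1024, device="cuda", requires_grad=True)
+
+   # Every activation saved inside this context is offloaded to the CPU
+   # and reloaded on demand during backward. PyTorch also ships this
+   # pattern as torch.autograd.graph.save_on_cpu(pin_memory=True).
+   with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
+       loss = model(x).sum()
+   loss.backward()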