Merge branch 'r1.23.0' into jenkins_for_mistral_mixtral/akoumparouli

NVIDIA · Feb 5, 2024 · 9df57ec
2 parents 568ca20 + a592517
commit 9df57ec

Showing 48 changed files with 3,140 additions and 1,340 deletions.
diff --git a/README.rst b/README.rst
@@ -101,7 +101,7 @@ Key Features
             * Hybrid Transducer/CTC
             * NeMo Original `Multi-blank Transducers <https://arxiv.org/abs/2211.03541>`_ and `Token-and-Duration Transducers (TDT) <https://arxiv.org/abs/2304.06795>`_
         * Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
-        * `Cache-aware Streaming Conformer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_ with multiple lookaheads.
+        * `Cache-aware Streaming Conformer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_ with multiple lookaheads (including microphone streaming `tutorial <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_).
         * Beam Search decoding
         * `Language Modelling for ASR (CTC and RNNT) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
         * `Support of long audios for Conformer with memory efficient local attention <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio>`_
@@ -125,7 +125,7 @@ Key Features
     * `Information retrieval <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/information_retrieval.html>`_
     * `Entity Linking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/entity_linking.html>`_
     * `Dialogue State Tracking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/dialogue.html>`_
-    * `Prompt Learning <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html>`_
+    * `Parameter Efficient Finetuning (PEFT) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/peft/landing_page.html>`_
     * `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
     * `Synthetic Tabular Data Generation <https://developer.nvidia.com/blog/generating-synthetic-data-with-transformers-a-solution-for-enterprise-data-challenges/>`_
 * Text-to-Speech Synthesis (TTS):

diff --git a/docs/source/asr/intro.rst b/docs/source/asr/intro.rst
@@ -108,9 +108,7 @@ See more information about LM decoding :doc:`here <./asr_language_modeling>`.
 Use real-time transcription
 ---------------------------
 
-It is possible to use NeMo to transcribe speech in real-time. You can find an example of how to do 
-this in the following `notebook tutorial <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo.ipynb>`_.
-
+It is possible to use NeMo to transcribe speech in real-time. We provide tutorial notebooks for `Cache Aware Streaming <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_ and `Buffered Streaming <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb>`_.
 
 Try different ASR models
 ------------------------

diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst
@@ -159,6 +159,8 @@ You may find more examples under ``<NeMo_git_root>/examples/asr/conf/fastconform
 Cache-aware Streaming Conformer
 -------------------------------
 
+Try real-time ASR with the Cache-aware Streaming Conformer `tutorial notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_.
+
 Buffered streaming uses overlapping chunks to make an offline ASR model to be used for streaming with reasonable accuracy. However, it uses significant amount of duplication in computations due to the overlapping chunks.
 Also there is a accuracy gap between the offline model and the streaming one as there is inconsistency between how we train the model and how we perform inference for streaming.
 The Cache-aware Streaming Conformer models would tackle and address these disadvantages. These streaming Conformers are trained with limited right context that it would make it possible to match how the model is being used in both the training and inference.

diff --git a/docs/source/nlp/nemo_megatron/peft/landing_page.rst b/docs/source/nlp/nemo_megatron/peft/landing_page.rst
@@ -12,14 +12,14 @@ fraction of the computational and storage costs.
 NeMo supports four PEFT methods which can be used with various
 transformer-based models.
 
-==================== ===== ===== ========= ==
-\                    GPT 3 NvGPT LLaMa 1/2 T5
-==================== ===== ===== ========= ==
-Adapters (Canonical) ✅    ✅    ✅        ✅
-LoRA                 ✅    ✅    ✅        ✅
-IA3                  ✅    ✅    ✅        ✅
-P-Tuning             ✅    ✅    ✅        ✅
-==================== ===== ===== ========= ==
+==================== ===== ======== ========= ====== ==
+\                    GPT 3 Nemotron LLaMa 1/2 Falcon T5
+==================== ===== ======== ========= ====== ==
+LoRA                  ✅    ✅      ✅        ✅     ✅
+P-Tuning              ✅    ✅      ✅        ✅     ✅
+Adapters (Canonical)  ✅    ✅      ✅               ✅
+IA3                   ✅    ✅      ✅               ✅
+==================== ===== ======== ========= ====== ==
 
 Learn more about PEFT in NeMo with the :ref:`peftquickstart` which provides an overview on how PEFT works
 in NeMo. Read about the supported PEFT methods

diff --git a/docs/source/nlp/nemo_megatron/peft/quick_start.rst b/docs/source/nlp/nemo_megatron/peft/quick_start.rst
@@ -62,7 +62,7 @@ Base model classes
 PEFT in NeMo is built with a mix-in class that does not belong to any
 model in particular. This means that the same interface is available to
 different NeMo models. Currently, NeMo supports PEFT for GPT-style
-models such as GPT 3, NvGPT, LLaMa 1/2 (``MegatronGPTSFTModel``), as
+models such as GPT 3, Nemotron, LLaMa 1/2 (``MegatronGPTSFTModel``), as
 well as T5 (``MegatronT5SFTModel``).
 
 Full finetuning vs PEFT
@@ -78,11 +78,13 @@ PEFT.
    trainer = MegatronTrainerBuilder(config).create_trainer()
    model_cfg = MegatronGPTSFTModel.merge_cfg_with(config.model.restore_from_path, config)
 
+   ### Training API ###
    model = MegatronGPTSFTModel.restore_from(restore_path, model_cfg, trainer) # restore from pretrained ckpt
-   + peft_cfg = LoRAPEFTConfig(model_cfg)
+   + peft_cfg = LoraPEFTConfig(model_cfg)
    + model.add_adapter(peft_cfg) 
    trainer.fit(model)  # saves adapter weights only
 
+   ### Inference API ###
    # Restore from base then load adapter API 
    model = MegatronGPTSFTModel.restore_from(restore_path, trainer, model_cfg)
    + model.load_adapters(adapter_save_path, peft_cfg)