Cache Aware Streaming tutorial notebook (#8296)

* add notebook
* rename old notebook to Buffered_Streaming
* call setup_streaming_params in set_default_att_context_size method
* update links in docs
* update links to tutorials in docs
* remove hard-coding
* rename var

Signed-off-by: Elena Rastorgueva <[email protected]>
NVIDIA · Feb 1, 2024 · d3bad4b
1 parent 40da002
commit d3bad4b
Showing 7 changed files with 444 additions and 6 deletions.
diff --git a/README.rst b/README.rst
@@ -101,7 +101,7 @@ Key Features
             * Hybrid Transducer/CTC
             * NeMo Original `Multi-blank Transducers <https://arxiv.org/abs/2211.03541>`_ and `Token-and-Duration Transducers (TDT) <https://arxiv.org/abs/2304.06795>`_
         * Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
-        * `Cache-aware Streaming Conformer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_ with multiple lookaheads.
+        * `Cache-aware Streaming Conformer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_ with multiple lookaheads (including microphone streaming `tutorial <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_).
         * Beam Search decoding
         * `Language Modelling for ASR (CTC and RNNT) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
         * `Support of long audios for Conformer with memory efficient local attention <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio>`_

diff --git a/docs/source/asr/intro.rst b/docs/source/asr/intro.rst
@@ -108,9 +108,7 @@ See more information about LM decoding :doc:`here <./asr_language_modeling>`.
 Use real-time transcription
 ---------------------------
 
-It is possible to use NeMo to transcribe speech in real-time. You can find an example of how to do 
-this in the following `notebook tutorial <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo.ipynb>`_.
-
+It is possible to use NeMo to transcribe speech in real-time. We provide tutorial notebooks for `Cache Aware Streaming <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_ and `Buffered Streaming <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb>`_.
 
 Try different ASR models
 ------------------------

diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst
@@ -159,6 +159,8 @@ You may find more examples under ``<NeMo_git_root>/examples/asr/conf/fastconform
 Cache-aware Streaming Conformer
 -------------------------------
 
+Try real-time ASR with the Cache-aware Streaming Conformer `tutorial notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_.
+
 Buffered streaming uses overlapping chunks to make an offline ASR model to be used for streaming with reasonable accuracy. However, it uses significant amount of duplication in computations due to the overlapping chunks.
 Also there is a accuracy gap between the offline model and the streaming one as there is inconsistency between how we train the model and how we perform inference for streaming.
 The Cache-aware Streaming Conformer models would tackle and address these disadvantages. These streaming Conformers are trained with limited right context that it would make it possible to match how the model is being used in both the training and inference.
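Note on the context paragraph above: the duplicated computation in buffered streaming can be estimated with a rough frame count. The sketch below is illustrative only and is not taken from NeMo. The function names, and the simplifying assumption that every step runs the encoder over one full (padded) buffer, are hypothetical.

```python
def buffered_frames_processed(total_frames: int, chunk: int, buffer: int) -> int:
    """Frames fed to the encoder when each new `chunk` of frames is decoded
    inside a longer window of `buffer` frames.

    Simplifying assumption (not NeMo's actual bookkeeping): every step runs
    the encoder over exactly one full buffer, padded at the edges.
    """
    if chunk <= 0 or buffer < chunk:
        raise ValueError("need 0 < chunk <= buffer")
    n_steps = -(-total_frames // chunk)  # ceiling division
    return n_steps * buffer


def cache_aware_frames_processed(total_frames: int) -> int:
    """With cached activations, each frame is encoded once in this toy model."""
    return total_frames


total, chunk, buffer = 1000, 100, 400
overhead = buffered_frames_processed(total, chunk, buffer) / cache_aware_frames_processed(total)
print(f"buffered streaming encodes {overhead:.1f}x as many frames")
```

Under these assumed sizes the overhead is simply `buffer / chunk`, so a larger buffer buys more context for each chunk at the cost of proportionally more repeated work.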

diff --git a/docs/source/starthere/tutorials.rst b/docs/source/starthere/tutorials.rst
@@ -47,8 +47,11 @@ To run a tutorial:
      - Offline ASR Inference with Beam Search and External Language Model Rescoring
      - `Offline ASR <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Offline_ASR.ipynb>`_
    * - ASR
-     - Online ASR inference with Microphone
-     - `Online ASR Microphone <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_ASR_Microphone_Demo.ipynb>`_
+     - Online ASR inference with Microphone (Cache-Aware Streaming)
+     - `Online ASR Microphone Cache Aware Streaming <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_
+   * - ASR
+     - Online ASR inference with Microphone (Buffered Streaming)
+     - `Online ASR Microphone Buffered Streaming <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb>`_
    * - ASR
      - Fine-tuning CTC Models on New Languages
      - `ASR CTC Language Fine-Tuning <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb>`_

diff --git a/nemo/collections/asr/modules/conformer_encoder.py b/nemo/collections/asr/modules/conformer_encoder.py
@@ -786,6 +786,8 @@ def set_default_att_context_size(self, att_context_size):
         if att_context_size is not None:
             self.att_context_size = att_context_size
 
+        self.setup_streaming_params()
+
     def setup_streaming_params(
         self,
         chunk_size: int = None,
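Note on the hunk above: the added `self.setup_streaming_params()` call presumably keeps the derived streaming settings in sync after the attention context changes. The toy class below sketches that kind of stale-state problem. It is a hypothetical stand-in, not NeMo's `ConformerEncoder`: the `refresh` flag and the rule `chunk_size = right_context + 1` are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamingEncoder:
    """Hypothetical stand-in for an encoder with derived streaming settings."""

    att_context_size: list  # [left_context, right_context]
    chunk_size: int = field(init=False)

    def __post_init__(self):
        self.setup_streaming_params()

    def setup_streaming_params(self):
        # Toy rule (an assumption): one current frame plus the right context.
        self.chunk_size = self.att_context_size[1] + 1

    def set_default_att_context_size(self, att_context_size, refresh=True):
        if att_context_size is not None:
            self.att_context_size = att_context_size
        # refresh=False mimics the behaviour before the fix: derived values go stale.
        if refresh:
            self.setup_streaming_params()


enc = ToyStreamingEncoder([70, 13])
enc.set_default_att_context_size([70, 1], refresh=False)
stale = enc.chunk_size  # still derived from the old right context of 13
enc.set_default_att_context_size([70, 1])
fresh = enc.chunk_size  # recomputed from the new right context of 1
print(stale, fresh)
```

Recomputing inside the setter means callers who switch lookahead at inference time do not have to remember a second call.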

diff --git a/...ials/asr/Online_ASR_Microphone_Demo.ipynb → ..._Microphone_Demo_Buffered_Streaming.ipynb b/...ials/asr/Online_ASR_Microphone_Demo.ipynb → ..._Microphone_Demo_Buffered_Streaming.ipynb