
Commit

[Doc fixes] update file names, installation instructions, bad links (#11045)

* rename eval_beamsearch_ngram.py to eval_beamsearch_ngram_ctc.py in docs

Signed-off-by: Elena Rastorgueva <[email protected]>

* replace out of date installation instructions with pointer to NeMo README installation section

Signed-off-by: Elena Rastorgueva <[email protected]>

* point to user guide instead of readme

Signed-off-by: Elena Rastorgueva <[email protected]>

* some link updates

Signed-off-by: Elena Rastorgueva <[email protected]>

* update more links

Signed-off-by: Elena Rastorgueva <[email protected]>

---------

Signed-off-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Elena Rastorgueva <[email protected]>
erastorgueva-nv authored Nov 12, 2024
1 parent 098aa18 commit 5670706
Showing 13 changed files with 32 additions and 130 deletions.
34 changes: 17 additions & 17 deletions docs/source/asr/asr_language_modeling_and_customization.rst
@@ -99,15 +99,15 @@ Evaluate by Beam Search Decoding and N-gram LM

NeMo's beam search decoders are capable of using KenLM's N-gram models to find the best candidates.
The script to evaluate an ASR model with beam search decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__.

This script has a large number of possible argument overrides; therefore, it is recommended that you use ``python eval_beamsearch_ngram.py --help`` to see the full list of arguments.
This script has a large number of possible argument overrides; therefore, it is recommended that you use ``python eval_beamsearch_ngram_ctc.py --help`` to see the full list of arguments.

You can evaluate an ASR model using the following:

.. code-block::
python eval_beamsearch_ngram.py nemo_model_file=<path to the .nemo file of the model> \
python eval_beamsearch_ngram_ctc.py nemo_model_file=<path to the .nemo file of the model> \
input_manifest=<path to the evaluation JSON manifest file> \
kenlm_model_file=<path to the binary KenLM model> \
beam_width=[<list of the beam widths, separated with commas>] \
@@ -118,18 +118,18 @@ You can evaluate an ASR model using the following:
decoding_mode=beamsearch_ngram \
decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"
It can evaluate a model in the following three modes by setting the argument `--decoding_mode`:
It can evaluate a model in the following three modes by setting the argument ``--decoding_mode``:

* greedy: Just greedy decoding is done and no beam search decoding is performed.
* beamsearch: The beam search decoding is done, but without using the N-gram language model. Final results are equivalent to setting the weight of LM (beam_beta) to zero.
* beamsearch_ngram: The beam search decoding is done with N-gram LM.
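A minimal sketch for comparing the three modes, assuming the same Hydra-style overrides as the example above (all other arguments unchanged; ``...`` stands for the arguments already shown):

.. code-block::

# greedy decoding only
python eval_beamsearch_ngram_ctc.py ... decoding_mode=greedy

# beam search without the N-gram LM
python eval_beamsearch_ngram_ctc.py ... decoding_mode=beamsearch

# beam search fused with the N-gram LM
python eval_beamsearch_ngram_ctc.py ... \
kenlm_model_file=<path to the binary KenLM model> \
decoding_mode=beamsearch_ngram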

In `beamsearch` mode, the evaluation is performed using beam search decoding without any language model. The performance is reported in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover, when the best candidate is selected among the candidates, it is also reported as the best WER/CER. This can serve as an indicator of the quality of the predicted candidates.
In ``beamsearch`` mode, the evaluation is performed using beam search decoding without any language model. The performance is reported in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover, when the best candidate is selected among the candidates, it is also reported as the best WER/CER. This can serve as an indicator of the quality of the predicted candidates.


The script initially loads the ASR model and predicts the outputs of the model's encoder as log probabilities. This part is computed in batches on a device specified by --device, which can be either a CPU (`--device=cpu`) or a single GPU (`--device=cuda:0`).
The batch size for this part is specified by `--acoustic_batch_size`. Using the largest feasible batch size can speed up the calculation of log probabilities. Additionally, you can use `--use_amp` to accelerate the calculation and allow for larger --acoustic_batch_size values.
Currently, multi-GPU support is not available for calculating log probabilities. However, using `--probs_cache_file` can help. This option stores the log probabilities produced by the model’s encoder in a pickle file, allowing you to skip the first step in future runs.
The batch size for this part is specified by ``--acoustic_batch_size``. Using the largest feasible batch size can speed up the calculation of log probabilities. Additionally, you can use `--use_amp` to accelerate the calculation and allow for larger --acoustic_batch_size values.
Currently, multi-GPU support is not available for calculating log probabilities. However, using ``--probs_cache_file`` can help. This option stores the log probabilities produced by the model’s encoder in a pickle file, allowing you to skip the first step in future runs.
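Putting these performance options together, a minimal sketch (assuming Hydra-style spellings of the flags described above, i.e. ``device``, ``acoustic_batch_size``, ``use_amp``, and ``probs_cache_file``):

.. code-block::

# compute encoder log probabilities once on a single GPU and cache them;
# later runs pointing at the same cache file skip this step
python eval_beamsearch_ngram_ctc.py ... \
device=cuda:0 \
acoustic_batch_size=64 \
use_amp=true \
probs_cache_file=/tmp/encoder_logprobs.pkl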

The following is the list of the important arguments for the evaluation script:

@@ -167,7 +167,7 @@ The following is the list of the important arguments for the evaluation script:
| decoding_strategy | str | beam | String argument for type of decoding strategy for the model. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding | Dict | BeamCTC | Subdict of beam search configs. Values found via |
| | Config | InferConfig | python eval_beamsearch_ngram.py --help |
| | Config | InferConfig | python eval_beamsearch_ngram_ctc.py --help |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
| text_processing.do_lowercase | bool | ``False`` | Whether to make the training text all lower case. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+
@@ -178,11 +178,11 @@
| text_processing.separate_punctuation | bool | ``True`` | Whether to separate punctuation with the previous word by space. |
+--------------------------------------+----------+------------------+-------------------------------------------------------------------------+

The width of the beam search (`--beam_width`) specifies the number of top candidates or predictions the beam search decoder will consider. Larger beam widths result in more accurate but slower predictions.
The width of the beam search (``--beam_width``) specifies the number of top candidates or predictions the beam search decoder will consider. Larger beam widths result in more accurate but slower predictions.

.. note::

The ``eval_beamsearch_ngram.py`` script contains the entire subconfig used for CTC Beam Decoding.
The ``eval_beamsearch_ngram_ctc.py`` script contains the entire subconfig used for CTC Beam Decoding.
Therefore it is possible to forward arguments for various beam search libraries such as ``flashlight``
and ``pyctcdecode`` via the ``decoding`` subconfig.

@@ -223,14 +223,14 @@ It supports several advanced features, such as lexicon-based decoding, lexicon-free decoding, and word boosting.
.. code-block::
# Lexicon-based decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
decoding.beam.flashlight_cfg.beam_size_token=32 \
decoding.beam.flashlight_cfg.beam_threshold=25.0
# Lexicon-free decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.beam_size_token=32 \
decoding.beam.flashlight_cfg.beam_threshold=25.0
@@ -256,7 +256,7 @@ It has advanced features, such as word boosting, which can be useful for transcript customization.
.. code-block::
# PyCTCDecoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="pyctcdecode" \
decoding.beam.pyctcdecode_cfg.beam_prune_logp=-10. \
decoding.beam.pyctcdecode_cfg.token_min_logp=-5. \
@@ -273,7 +273,7 @@ For example, the following set of parameters would result in 2*1*2=4 beam search decodings

.. code-block::
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
beam_width=[64,128] \
beam_alpha=[1.0] \
beam_beta=[1.0,0.5]
@@ -330,7 +330,7 @@ Given a trained TransformerLMModel `.nemo` file or a pretrained HF model, the script
can be used to re-score beams obtained with an ASR model. You need the `.tsv` file containing the candidates produced
by the acoustic model and the beam search decoding to use this script. The candidates can be the result of just the beam
search decoding or the result of fusion with an N-gram LM. You can generate this file by specifying `--preds_output_folder` for
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__.

The neural rescorer rescores the beams/candidates using two parameters, `rescorer_alpha` and `rescorer_beta`, as follows:
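The formula itself sits in the collapsed lines of this hunk. As a hedged sketch inferred from the two parameter names (``beam_search_score``, ``neural_lm_score``, and ``sequence_length`` are assumed names, not taken from the source):

.. code-block::

# hedged sketch -- an inferred form, not verbatim from the collapsed source:
final_score = beam_search_score + rescorer_alpha * neural_lm_score + rescorer_beta * sequence_length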

@@ -345,7 +345,7 @@ Use the following steps to evaluate a neural LM:
#. Obtain a `.tsv` file with beams and their corresponding scores (a sketch of the layout follows these steps). Scores can come from a regular beam search decoder or
from fusion with an N-gram LM. For a given beam size `beam_size` and a number of examples
for evaluation `num_eval_examples`, it should contain (`num_eval_examples` x `beam_size`) lines of
form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__
form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`__

#. Rescore the candidates by `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__.
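A hedged sketch of the `.tsv` layout described in step 1, for ``beam_size=2`` and two evaluation examples (the candidate texts and scores are made up; the separator is a tab character):

.. code-block::

it was the best of times	-4.71
it was the best of tines	-7.02
call me later today	-3.15
call me late her today	-6.48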

@@ -439,7 +439,7 @@ You can then pass this file to your Flashlight config object during decoding:
.. code-block::
# Lexicon-based decoding
python eval_beamsearch_ngram.py ... \
python eval_beamsearch_ngram_ctc.py ... \
decoding_strategy="flashlight" \
decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
decoding.beam.flashlight_cfg.boost_path='/path/to/my_boost_file.boost' \
4 changes: 2 additions & 2 deletions docs/source/asr/intro.rst
@@ -127,8 +127,8 @@ You can get a good improvement in transcription accuracy even using a simple N-gram LM.

After :ref:`training <train-ngram-lm>` an N-gram LM, you can use it for transcribing audio as follows:

1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using the `install_beamsearch_decoders script <scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh>`_.
2. Perform transcription using the `eval_beamsearch_ngram script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`_:
1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using the `install_beamsearch_decoders script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh>`_.
2. Perform transcription using the `eval_beamsearch_ngram script <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py>`_:

.. code-block:: bash
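# (the original command is collapsed in this diff view)
# minimal sketch, reusing the Hydra-style overrides from the evaluation example above:
python eval_beamsearch_ngram_ctc.py nemo_model_file=<path to the .nemo file> \
input_manifest=<path to the JSON manifest of audio to transcribe> \
kenlm_model_file=<path to the binary KenLM model> \
decoding_mode=beamsearch_ngram \
preds_output_folder=<folder to save predictions>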
4 changes: 2 additions & 2 deletions docs/source/core/core.rst
@@ -294,8 +294,8 @@ CLI
With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots
of experiments on compute clusters or for quickly testing parameters during development.

All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ come with instructions on how to
run the training/inference script from the command-line (see `here <https://github.com/NVIDIA/NeMo/blob/4e9da75f021fe23c9f49404cd2e7da4597cb5879/examples/asr/asr_ctc/speech_to_text_ctc.py#L24>`__
All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/stable/examples>`_ come with instructions on how to
run the training/inference script from the command-line (e.g. see `here <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py>`__
for an example).

With Hydra, arguments are set using the ``=`` operator:
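The example that follows is collapsed in this diff view; a hedged sketch of the pattern (the config paths shown are illustrative, not taken from the collapsed lines):

.. code-block:: bash

python examples/asr/asr_ctc/speech_to_text_ctc.py \
model.train_ds.manifest_filepath=<path to train manifest> \
trainer.max_epochs=50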
4 changes: 2 additions & 2 deletions docs/source/multimodal/mllm/configs.rst
@@ -5,14 +5,14 @@ This section provides a detailed overview of the NeMo configuration file setup s

Within the configuration files of the NeMo Multimodal Language Model, details concerning dataset(s), augmentation, optimization parameters, and model architectural specifications are central. This page explores each of these aspects.

Discover exemplary configuration files for all NeMo Multimodal Language Model scripts in the `config directory of the examples <https://TODOURL>`_.
Discover exemplary configuration files for all NeMo Multimodal Language Model scripts in the `config directory of the examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/multimodal/multimodal_llm/neva/conf>`_.

Dataset Configuration
---------------------

The NeMo multimodal language model currently supports a conversation data format, inspired by and adapted from https://github.com/haotian-liu/LLaVA/tree/main. To explore a sample dataset, visit https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md.

The configuration file allows setting any initialization parameter accepted by the Dataset class used in the experiment. For a comprehensive list of Datasets and their parameters, visit the `Datasets <./api.html#Datasets>`__ section of the API.
The configuration file allows setting any initialization parameter accepted by the Dataset class used in the experiment. For a comprehensive list of Datasets and their parameters, visit the :doc:`Datasets <./datasets>` section of the API.

A typical training configuration is as follows:

2 changes: 1 addition & 1 deletion docs/source/multimodal/text2img/imagen.rst
@@ -31,7 +31,7 @@ Imagen has two types of UNet: Regular UNet and EfficientUNet.
Regular UNet
~~~~~~~~~~~~
The regular UNet is used for the Imagen base64 model. You can also use the regular UNet for SR models
(see example config file `sr256-400m-edm.yaml <http://TODOURL>`_), but this typically
(see example config file `sr256-400m-edm.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/multimodal/text_to_image/imagen/conf/sr256-400m-edm.yaml>`__), but this typically
results in a larger memory footprint during training for the same model size.

Recommended UNet sizes for the base64 and SR256 models are listed below:
2 changes: 1 addition & 1 deletion docs/source/multimodal/vlm/configs.rst
@@ -5,7 +5,7 @@ This section provides a detailed overview of the NeMo configuration file setup s

Within the configuration files of the NeMo Multimodal Language Model, details concerning dataset(s), augmentation, optimization parameters, and model architectural specifications are central. This page explores each of these aspects.

Discover exemplary configuration files for all NeMo Multimodal Language Model scripts in the `config directory of the examples <http://TODOURL>`_.
Discover exemplary configuration files for all NeMo Multimodal Language Model scripts in the `config directories of the examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/multimodal/vision_language_foundation/clip/conf>`__.

Dataset Configuration
=====================
2 changes: 1 addition & 1 deletion docs/source/multimodal/vlm/datasets.rst
@@ -32,4 +32,4 @@ For webdatasets already downloaded locally, sub-stages 4-6 can be used to precache
For models that encode image and text on-the-fly, only sub-stages 1-3 need to be run.

Instruction for configuring each sub-stage is provided as a comment next to each field in
`download_multimodal.yaml <http://TODOURL>`_
`download_multimodal.yaml <https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/data_preparation/multimodal/download_multimodal.yaml>`__.
@@ -87,7 +87,7 @@ Data upsampling
---------------
Data upsampling is an effective way to increase the training data for better model performance, especially on the long tail of semiotic tokens.
We used upsampling for training an English text normalization model, see `data/en/upsampling.py <https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/duplex_text_normalization/data/en/upsampling.py>`__.
We used upsampling for training an English text normalization model, see `data/en/upsampling.py <https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/duplex_text_normalization/data/en/upsample.py>`__.
Currently this script only upsamples a few classes that are diverse in semiotic tokens but at the same time underrepresented in the training data.
Of all the input files in the `train` folder created by `data/data_split.py <https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/duplex_text_normalization/data/data_split.py>`__, this script takes the first file and detects the class patterns that occur in it.
For those that are underrepresented, defined as occurring fewer than `min_number` times, the other files are scanned for sentences that contain the missing patterns.
@@ -38,5 +38,5 @@ WFST TN/ITN resources could be found in :doc:`here <wfst_resources>`.

Riva resources
--------------
- `Riva Text Normalization customization for TTS <https://riva-builder-01.nvidia.com/main/tts/tts-custom.html#custom-text-normalization>`_.
- `Riva ASR/Inverse Text Normalization customization <https://riva-builder-01.nvidia.com/main/asr/asr-customizing.html>`_.
- `Riva Text Normalization customization for TTS <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html#custom-text-normalization>`_.
- `Riva ASR/Inverse Text Normalization customization <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html>`_.