
fastconformer hybrid recipe reports strange val_WER with nemo:24.07 and nemo:dev #10299

Open
itzsimpl opened this issue Aug 29, 2024 · 15 comments · May be fixed by #11087
Labels: bug Something isn't working

Comments

@itzsimpl (Contributor)

Describe the bug

Running a basic fastconformer hybrid recipe breaks with image nemo:24.07 and newer; more specifically, the reported RNNT WER numbers are all over the place, whereas the CTC WER numbers decrease normally. Tested with a known-good setup and the images nemo:24.01.01.framework, nemo:24.05.01, nemo:24.07 and nemo:dev, running with 8 GPUs on two machine types: a DGX-H100 and a no-name system with 8x A100 80GB PCIe. The driver version is 550.90.07.

Since checkpoints are saved based on the RNNT WER, this breaks training completely.

Up to nemo:24.05.01, the (RNNT) WER and CTC WER charts are:

[Image: (RNNT) WER and CTC WER validation charts, up to nemo:24.05.01]

With nemo:24.07 and nemo:dev, the (RNNT) WER and CTC WER charts are:

[Image: (RNNT) WER and CTC WER validation charts, nemo:24.07 and nemo:dev]

The charts for training_batch_wer and training_batch_ctc_wer show no such anomaly. To me this all points to the newly introduced CUDA graphs. In the logs of nemo:24.07 I have noticed messages saying that CUDA graphs get disabled during training but enabled during validation, even though the driver does not support CUDA toolkit 12.6.

Epoch 0:   0%|          | 0/443 [00:00<?, ?it/s] [NeMo I 2024-08-28 19:28:28 optional_cuda_graphs:53] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.hybrid_rnnt_ctc_bpe_models.EncDecHybridRNNTCTCBPEModel'>.decoding.decoding
[NeMo I 2024-08-28 19:28:28 optional_cuda_graphs:53] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>joint._wer.decoding.decoding

[NeMo W 2024-08-28 19:37:18 rnnt_loop_labels_computer:270] No conditional node support for Cuda.
    Cuda graphs with while loops are disabled, decoding speed will be slower
    Reason: Driver supports cuda toolkit version 12.4, but the driver needs to support at least 12,6. Please update your cuda driver.

Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-08-28 19:37:18 optional_cuda_graphs:79] Enabled CUDA graphs for module <class 'nemo.collections.asr.models.hybrid_rnnt_ctc_bpe_models.EncDecHybridRNNTCTCBPEModel'>.decoding.decoding
[NeMo I 2024-08-28 19:37:18 optional_cuda_graphs:79] Enabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>joint._wer.decoding.decoding
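For reference, the toolkit version the driver reports (the value the warning above compares against 12.6) can be checked with nvidia-smi (the "CUDA Version" field in its header) or, as a minimal sketch assuming the cuda-python package available in the container, with:

# Prints a (CUresult, version) tuple, e.g. (CUDA_SUCCESS, 12040) for a driver that supports CUDA 12.4.
python -c "from cuda import cuda; print(cuda.cuDriverGetVersion())"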

Steps/Code to reproduce bug

Run a basic training with the fastconformer hybrid recipe.

Expected behavior

Training progresses normally as it did up to nemo:24.05.01.

Environment overview (please complete the following information)

  • Environment location: Bare-metal, Slurm
  • Method of NeMo install: Docker images nemo:24.01.01.framework, nemo:24.05.01, nemo:24.07 and nemo:dev
@itzsimpl itzsimpl added the bug Something isn't working label Aug 29, 2024
@artbataev (Collaborator)

I can reproduce the problem and am working on a fix.

github-actions bot commented Oct 3, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 3, 2024
@itzsimpl (Contributor, Author) commented Oct 3, 2024

@artbataev any news on this?

@github-actions github-actions bot removed the stale label Oct 4, 2024
@artbataev (Collaborator)

@itzsimpl I'm still investigating; I'm seeing multiple issues with CUDA graphs.
As a temporary solution, please disable CUDA graphs in the model config by adding ++model.decoding.greedy.use_cuda_graph_decoder=false.
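For reference, a full training command with that override appended could look like the sketch below. The script path, config name and placeholder manifest/tokenizer paths follow the standard NeMo examples layout and are assumptions for illustration, not details taken from this issue.

# Hypothetical hybrid FastConformer training run; only the last line is the workaround.
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
 --config-path=../conf/fastconformer/hybrid_transducer_ctc \
 --config-name=fastconformer_hybrid_transducer_ctc_bpe \
 model.train_ds.manifest_filepath=<train_manifest.json> \
 model.validation_ds.manifest_filepath=<val_manifest.json> \
 model.tokenizer.dir=<tokenizer_dir> \
 ++model.decoding.greedy.use_cuda_graph_decoder=false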

@titu1994 (Collaborator) commented Oct 4, 2024

I think it might be worth disabling it for training explicitly until we can make sure it doesn't affect model training.

@artbataev (Collaborator)

@titu1994 Let me check one thing. If that does not work, I will make a PR disabling CUDA graphs during training.

@itzsimpl (Contributor, Author)

@artbataev @titu1994 Just to let you know, the issue is also present when doing inference with RNNT, e.g. with examples/asr/transcribe_speech_parallel.py decoder_type=rnnt on a hybrid model. The model works fine (i.e. returns valid transcriptions) if I force the default use_cuda_graph_decoder to False in https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/rnnt_decoding.py and https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/rnnt_beam_decoding.py. The logs, however, still report CUDA graphs as enabled.

@artbataev (Collaborator)

@itzsimpl Are you using RNN-T or TDT?

I tried a hybrid TDT-CTC model and cannot reproduce the issue in the nemo:24.07 container with examples/asr/transcribe_speech_parallel.py decoder_type=rnnt.

Also, beam search (rnnt_beam_decoding.py) does not use CUDA graphs and does not use the use_cuda_graph_decoder parameter.

@artbataev (Collaborator) commented Oct 29, 2024

I also do not see the difference with a hybrid RNNT-CTC model (I use the nvcr.io/nvidia/nemo:24.07 container, 2 GPUs):

python examples/asr/transcribe_speech_parallel.py \
 model=stt_en_fastconformer_hybrid_large_pc \
 predict_ds.manifest_filepath=<...>/manifests/librispeech/test_other.jsonl \
 predict_ds.batch_size=16 \
 output_path=test_other_decoded_1.jsonl \
 decoder_type="rnnt" \
 rnnt_decoding.strategy="greedy_batch" \
 rnnt_decoding.greedy.use_cuda_graph_decoder=false 

Results are consistent with rnnt_decoding.greedy.use_cuda_graph_decoder=true and look OK.

@artbataev (Collaborator)

@itzsimpl Can you please try reproducing the issue with transcribe_speech_parallel.py and the stt_en_fastconformer_hybrid_large_pc checkpoint?

@artbataev artbataev linked a pull request Oct 29, 2024 that will close this issue
@itzsimpl (Contributor, Author)

@artbataev I'm using a non-English RNNT-CTC hybrid model, 1 GPU, and consistently getting a difference. It will take me a bit more time to also test with 2 GPUs and with an English model, as I do not have any test datasets ready. In addition, I'm experiencing a few other, not necessarily related, issues: a) setting the parameter rnnt_decoding.greedy.use_cuda_graph_decoder=false, as you did, has no effect for me; the log still displays

[NeMo I 2024-10-29 21:12:31 hybrid_rnnt_ctc_bpe_models:457] Changed decoding strategy of the RNNT decoder to 
...
    greedy:
...
      use_cuda_graph_decoder: true

And b) when running on Slurm in interactive mode, the second of two consecutive runs of transcribe_speech_parallel.py gets stuck right before

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
..

With use_cuda_graph_decoder: true the WER I get is

[NeMo I 2024-10-29 21:13:23 transcribe_speech_parallel:206] WER for all predictions is 23.3026.

and predictions contain texts like

... "pred_text": "rkarrkarkkarrrrukukkukkukaokjeokjeorrjerkkaoorkkjelakalakalakalakalarrrr"}

However, if I run the following two sed replacements

sed -i /opt/Nemo/nemo/collections/asr/parts/submodules/rnnt_decoding.py -e "s/'use_cuda_graph_decoder', True/'use_cuda_graph_decoder', False/g"
sed -i /opt/Nemo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py -e "s/use_cuda_graph_decoder: bool = True/use_cuda_graph_decoder: bool = False/g"

the final WER is

[NeMo I 2024-10-29 21:27:00 transcribe_speech_parallel:206] WER for all predictions is 0.0752.

@artbataev (Collaborator)

@itzsimpl Thank you! I will check transcribe_speech_parallel.py with Slurm. No need to test the English model for now.

I would also appreciate it if you could check the code from #11087.

@itzsimpl (Contributor, Author)

@artbataev I can confirm that with the PR I get correct WER results.

I have also discovered some more info with respect to the script getting stuck on the second of multiple consecutive runs. I've opened a new issue (#11105) for that.

@artbataev (Collaborator)

@itzsimpl Thanks a lot!

@gabitza-tech (Contributor)

Hello! @artbataev

I apologize if the following question is dumb, but here is a very short summary of what I am trying to do:

My main repo uses NeMo 1.23.0, and until now I have been training a hybrid model with it. However, since the Lhotse update was implemented a couple of months ago, I have been thinking of speeding up training by using the current dev branch. I got a similar behaviour to @itzsimpl and solved it by disabling CUDA graphs as you mentioned (thanks!).

My question: will I be able to use a model trained on the current dev branch (with CUDA graphs disabled) with the code from the 1.23.0 release?
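A minimal way to sanity-check this, assuming the dev-branch model has been exported to a .nemo file, would be to load and decode it with the stock transcription script inside the 1.23.0 container; the file names below are placeholders, and this sketch does not confirm that the checkpoint will actually load:

# Hypothetical compatibility check inside the 1.23.0 container; model.nemo and
# test_manifest.json are placeholders, not files from this thread.
python examples/asr/transcribe_speech.py \
 model_path=model.nemo \
 dataset_manifest=test_manifest.json \
 decoder_type=rnnt \
 output_filename=predictions.json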
