
fastconformer hybrid recipe reports strange val_WER with nemo:24.07 and nemo:dev #10299

Open
itzsimpl opened this issue Aug 29, 2024 · 15 comments · May be fixed by #11087
Labels: bug Something isn't working

Comments

@itzsimpl (Contributor)

Describe the bug

Running a basic fastconformer hybrid recipe breaks with image nemo:24.07 and newer; more specifically, the reported RNNT WER numbers are all over the place, whereas the CTC WER numbers decrease normally. Tested with a known-good setup and the images nemo:24.01.01.framework, nemo:24.05.01, nemo:24.07 and nemo:dev, running with 8 GPUs on two machine types: a DGX-H100 and a no-name system with 8x A100 80GB PCIe. The driver version is 550.90.07.

Since checkpoints are saved based on the RNNT WER, this breaks training completely.

Up to nemo:24.05.01, the (RNNT) WER and CTC WER charts are:

[Image: (RNNT) WER and CTC WER validation charts, up to nemo:24.05.01]

With nemo:24.07 and nemo:dev, the (RNNT) WER and CTC WER charts are:

[Image: (RNNT) WER and CTC WER validation charts, nemo:24.07 and nemo:dev]

The charts for training_batch_wer and training_batch_ctc_wer show no such anomaly. To me this all points to the newly introduced CUDA graphs. In the logs of nemo:24.07 I have noticed messages saying that CUDA graphs get disabled during training but enabled during validation, even though the driver does not support CUDA toolkit 12.6.

Epoch 0:   0%|          | 0/443 [00:00<?, ?it/s] [NeMo I 2024-08-28 19:28:28 optional_cuda_graphs:53] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.hybrid_rnnt_ctc_bpe_models.EncDecHybridRNNTCTCBPEModel'>.decoding.decoding
[NeMo I 2024-08-28 19:28:28 optional_cuda_graphs:53] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>joint._wer.decoding.decoding

[NeMo W 2024-08-28 19:37:18 rnnt_loop_labels_computer:270] No conditional node support for Cuda.
    Cuda graphs with while loops are disabled, decoding speed will be slower
    Reason: Driver supports cuda toolkit version 12.4, but the driver needs to support at least 12,6. Please update your cuda driver.

Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-08-28 19:37:18 optional_cuda_graphs:79] Enabled CUDA graphs for module <class 'nemo.collections.asr.models.hybrid_rnnt_ctc_bpe_models.EncDecHybridRNNTCTCBPEModel'>.decoding.decoding
[NeMo I 2024-08-28 19:37:18 optional_cuda_graphs:79] Enabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>joint._wer.decoding.decoding
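For reference, the toolkit version the driver reports (the value the warning above compares against 12.6) can be checked with nvidia-smi (the "CUDA Version" field in its header) or, as a minimal sketch assuming the cuda-python package available in the container, with:

# Prints a (CUresult, version) tuple, e.g. (CUDA_SUCCESS, 12040) for a driver that supports CUDA 12.4.
python -c "from cuda import cuda; print(cuda.cuDriverGetVersion())"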

Steps/Code to reproduce bug

Run a basic training with the fastconformer hybrid recipe.

Expected behavior

Training progresses normally as it did up to nemo:24.05.01.

Environment overview (please complete the following information)

  • Environment location: Bare-metal, Slurm
  • Method of NeMo install: Docker images nemo:24.01.01.framework, nemo:24.05.01, nemo:24.07 and nemo:dev
@itzsimpl itzsimpl added the bug Something isn't working label Aug 29, 2024
@artbataev (Collaborator)

I can reproduce the problem and am working on a fix.

github-actions bot commented Oct 3, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 3, 2024
@itzsimpl (Contributor, Author) commented Oct 3, 2024

@artbataev any news on this?

@github-actions github-actions bot removed the stale label Oct 4, 2024
@artbataev (Collaborator)

@itzsimpl I'm still investigating; I'm seeing multiple issues with CUDA graphs.
As a temporary solution, please disable CUDA graphs in the model config by adding ++model.decoding.greedy.use_cuda_graph_decoder=false.
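For reference, a full training command with that override appended could look like the sketch below. The script path, config name and placeholder manifest/tokenizer paths follow the standard NeMo examples layout and are assumptions for illustration, not details taken from this issue.

# Hypothetical hybrid FastConformer training run; only the last line is the workaround.
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
 --config-path=../conf/fastconformer/hybrid_transducer_ctc \
 --config-name=fastconformer_hybrid_transducer_ctc_bpe \
 model.train_ds.manifest_filepath=<train_manifest.json> \
 model.validation_ds.manifest_filepath=<val_manifest.json> \
 model.tokenizer.dir=<tokenizer_dir> \
 ++model.decoding.greedy.use_cuda_graph_decoder=false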

@titu1994 (Collaborator) commented Oct 4, 2024

I think it might be worth disabling it for training explicitly until we can make sure it doesn't affect model training.

@artbataev (Collaborator)

@titu1994 Let me check one thing. If that does not work, I will make a PR disabling CUDA graphs during training.

@itzsimpl (Contributor, Author)

@artbataev @titu1994 Just to let you know, the issue is also present when doing inference with RNNT, e.g. with examples/asr/transcribe_speech_parallel.py decoder_type=rnnt on a hybrid model. The model works fine (i.e. returns valid transcriptions) if I force the default use_cuda_graph_decoder to False in https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/rnnt_decoding.py and https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/rnnt_beam_decoding.py. The logs, however, still report CUDA graphs as enabled.

@artbataev (Collaborator)

@itzsimpl Are you using RNN-T or TDT?

I tried a hybrid TDT-CTC model and cannot reproduce the issue in the nemo:24.07 container with examples/asr/transcribe_speech_parallel.py decoder_type=rnnt.

Also, beam search (rnnt_beam_decoding.py) does not use CUDA graphs and does not use the use_cuda_graph_decoder parameter.

@artbataev (Collaborator) commented Oct 29, 2024

I also do not see the difference with a hybrid RNNT-CTC model (I use the nvcr.io/nvidia/nemo:24.07 container, 2 GPUs):

python examples/asr/transcribe_speech_parallel.py \
 model=stt_en_fastconformer_hybrid_large_pc \
 predict_ds.manifest_filepath=<...>/manifests/librispeech/test_other.jsonl \
 predict_ds.batch_size=16 \
 output_path=test_other_decoded_1.jsonl \
 decoder_type="rnnt" \
 rnnt_decoding.strategy="greedy_batch" \
 rnnt_decoding.greedy.use_cuda_graph_decoder=false 

Results are consistent with rnnt_decoding.greedy.use_cuda_graph_decoder=true and look OK.

@artbataev (Collaborator)

@itzsimpl Can you please try reproducing the issue with transcribe_speech_parallel.py and the stt_en_fastconformer_hybrid_large_pc checkpoint?

@artbataev artbataev linked a pull request Oct 29, 2024 that will close this issue
@itzsimpl (Contributor, Author)

@artbataev I'm using a non-English RNNT-CTC hybrid model, 1 GPU, and consistently getting a difference. It will take me a bit more time to also test with 2 GPUs and with an English model, as I do not have any test datasets ready. In addition, I'm experiencing a few other, not necessarily related, issues: a) setting the parameter rnnt_decoding.greedy.use_cuda_graph_decoder=false, as you did, has no effect for me; the log still displays

[NeMo I 2024-10-29 21:12:31 hybrid_rnnt_ctc_bpe_models:457] Changed decoding strategy of the RNNT decoder to 
...
    greedy:
...
      use_cuda_graph_decoder: true

And b) when running on Slurm in interactive mode, the second of two consecutive runs of transcribe_speech_parallel.py gets stuck right before

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
..

With use_cuda_graph_decoder: true the WER I get is

[NeMo I 2024-10-29 21:13:23 transcribe_speech_parallel:206] WER for all predictions is 23.3026.

and predictions contain texts like

... "pred_text": "rkarrkarkkarrrrukukkukkukaokjeokjeorrjerkkaoorkkjelakalakalakalakalarrrr"}

However, if I run the following two sed replacements

sed -i /opt/Nemo/nemo/collections/asr/parts/submodules/rnnt_decoding.py -e "s/'use_cuda_graph_decoder', True/'use_cuda_graph_decoder', False/g"
sed -i /opt/Nemo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py -e "s/use_cuda_graph_decoder: bool = True/use_cuda_graph_decoder: bool = False/g"

the final WER is

[NeMo I 2024-10-29 21:27:00 transcribe_speech_parallel:206] WER for all predictions is 0.0752.

@artbataev (Collaborator)

@itzsimpl Thank you! I will check transcribe_speech_parallel.py with Slurm. No need to test the English model for now.

I would also appreciate it if you could check the code from #11087.

@itzsimpl (Contributor, Author)

@artbataev I can confirm that with the PR I get correct WER results.

I have also discovered some more info with respect to the script getting stuck on the second of multiple consecutive runs. I've opened a new issue (#11105) for that.

@artbataev (Collaborator)

@itzsimpl Thanks a lot!

@gabitza-tech (Contributor)

Hello! @artbataev

I apologize if the following question is dumb, but here is a very short summary of what I am trying to do:

My main repo uses NeMo 1.23.0, and until now I have been training a hybrid model with it. However, since the Lhotse update was implemented a couple of months ago, I have been thinking of speeding up training by using the current dev branch. I got a similar behaviour to @itzsimpl and solved it by disabling CUDA graphs as you mentioned (thanks!).

My question: will I be able to use a model trained on the current dev branch (with CUDA graphs disabled) with the code from the 1.23.0 release?
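A minimal way to sanity-check this, assuming the dev-branch model has been exported to a .nemo file, would be to load and decode it with the stock transcription script inside the 1.23.0 container; the file names below are placeholders, and this sketch does not confirm that the checkpoint will actually load:

# Hypothetical compatibility check inside the 1.23.0 container; model.nemo and
# test_manifest.json are placeholders, not files from this thread.
python examples/asr/transcribe_speech.py \
 model_path=model.nemo \
 dataset_manifest=test_manifest.json \
 decoder_type=rnnt \
 output_filename=predictions.json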
