fastconformer hybrid recipe reports strange val_WER with nemo:24.07 and nemo:dev #10299
Comments
I can reproduce the problem, working on fixing it.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
@artbataev any news on this?
@itzsimpl I'm still investigating, seeing multiple issues with CUDA graphs.
I think it might be worth disabling it for training explicitly until we can make sure it doesn't affect model training. |
@titu1994 Let me check one thing. If that doesn't work, I will make a PR disabling CUDA graphs in training.
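A minimal sketch of what that explicit disable could look like as a Hydra override on a hybrid training run; the script/config placeholders and the `model.decoding.greedy.use_cuda_graph_decoder` key path are assumptions (mirroring the `rnnt_decoding.greedy.use_cuda_graph_decoder` flag used for transcription later in this thread), not a confirmed API:

```bash
# Hypothetical workaround sketch: force-disable the CUDA graph greedy decoder
# for the RNNT branch of a hybrid RNNT-CTC training run. '++' sets the key
# even if the example YAML does not declare it. Script/config names are
# placeholders; the config key path is an assumption.
python <hybrid_training_script>.py \
    --config-name=<fastconformer_hybrid_config> \
    ++model.decoding.greedy.use_cuda_graph_decoder=false
```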
@artbataev @titu1994 just to let you know, the issue is also present when doing inference with RNNT, e.g. with …
@itzsimpl Are you using RNN-T or TDT? I tried a hybrid TDT-CTC model and cannot reproduce the issue in … Also, beam search (…).
I also do not see the difference with a hybrid RNNT-CTC model. I use:

```bash
python examples/asr/transcribe_speech_parallel.py \
    model=stt_en_fastconformer_hybrid_large_pc \
    predict_ds.manifest_filepath=<...>/manifests/librispeech/test_other.jsonl \
    predict_ds.batch_size=16 \
    output_path=test_other_decoded_1.jsonl \
    decoder_type="rnnt" \
    rnnt_decoding.strategy="greedy_batch" \
    rnnt_decoding.greedy.use_cuda_graph_decoder=false
```

Results are consistent with …
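As a hedged aside for anyone reproducing this comparison: the natural A/B counterpart is the same invocation with the final override flipped; only the output filename below is an invented placeholder, everything else mirrors the command above.

```bash
# Counterpart run with the CUDA graph greedy decoder enabled, so the two
# output manifests / WERs can be diffed directly. Output filename is a
# placeholder.
python examples/asr/transcribe_speech_parallel.py \
    model=stt_en_fastconformer_hybrid_large_pc \
    predict_ds.manifest_filepath=<...>/manifests/librispeech/test_other.jsonl \
    predict_ds.batch_size=16 \
    output_path=test_other_decoded_cudagraphs.jsonl \
    decoder_type="rnnt" \
    rnnt_decoding.strategy="greedy_batch" \
    rnnt_decoding.greedy.use_cuda_graph_decoder=true
```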
@itzsimpl can you please try reproducing the issue with …
@artbataev I'm using a non-English RNNT-CTC hybrid model on 1 GPU and am consistently getting a difference. It will take me a bit more time to also test with 2 GPUs and with an English model, as I do not have any test datasets ready. In addition, I'm experiencing a few other, not necessarily related, issues: a) setting the parameter …; and b) running on Slurm in interactive mode, the second of two consecutive runs of … gets stuck. With … the predictions contain texts like … However, if I run the following two sed replacements … the final WER is …
@artbataev I can confirm that with the PR I get correct WER results. I have also discovered some more info with respect to the script getting stuck on the second of multiple consecutive runs. I've opened a new issue (#11105) for that.
@itzsimpl Thanks a lot!
Hello! @artbataev I apologize if the following question might be dumb, but here is a very short summary of what I am trying to do: my main repo uses NeMo 1.23.0, and until now I was training a hybrid model with it. However, since the lhotse update was implemented a couple of months ago, I was thinking of speeding up training by using the current dev branch. I ran into the same behaviour as @itzsimpl and solved it by disabling CUDA graphs as you mentioned (thanks!!). My question: will I be able to use a model trained on the current dev branch with CUDA graphs disabled together with the code from the 1.23.0 release?
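One way to answer that empirically (a sketch, not a statement about cross-version compatibility): restore the resulting `.nemo` file inside a 1.23.0 environment and see whether it loads. `ASRModel.restore_from` is the standard NeMo restore entry point; the checkpoint path is a placeholder.

```bash
# Inside a NeMo 1.23.0 environment: try restoring the checkpoint trained on
# the dev branch. If this succeeds, decoding/fine-tuning can be tested next.
python -c "
from nemo.collections.asr.models import ASRModel
model = ASRModel.restore_from('<path_to_trained_model>.nemo')
print(type(model).__name__)
"
```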
Describe the bug
Running a basic fastconformer hybrid recipe fails with image `nemo:24.07` and newer; more specifically, the reported RNNT WER numbers are all over the place, whereas the CTC WER numbers decrease normally. Tested with a known good setup and images `nemo:24.01.01.framework`, `nemo:24.05.01`, `nemo:24.07` and `nemo:dev`, running with 8 GPUs on two machine types, a DGX-H100 and a no-name system with 8x A100 80GB PCIe. The drivers are 550.90.07. Since the checkpoints are saved based on the RNNT WER, this messes up training completely.
Up to `nemo:24.05.01` the (RNNT) WER and CTC WER charts are: [chart screenshot]

With `nemo:24.07` and `nemo:dev` the (RNNT) WER and CTC WER charts are: [chart screenshot]

The charts for `training_batch_wer` and `training_batch_ctc_wer` show no such anomaly. To me this all points to the newly introduced CUDA graphs. In the logs of `nemo:24.07` I have noticed messages that CUDA graphs get Disabled during training but Enabled during validation, even though the driver does not support CUDA toolkit 12.6.

Steps/Code to reproduce bug
Run a basic training with the fastconformer hybrid recipe.
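For concreteness, a minimal sketch of such a run; the script and config locations assume the stock NeMo examples layout (they are not stated in this report), and the manifest/tokenizer paths are placeholders.

```bash
# Basic hybrid FastConformer (Transducer + CTC) training run on 8 GPUs.
# Script/config paths assume the standard NeMo examples tree.
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
    --config-path=../conf/fastconformer/hybrid_transducer_ctc \
    --config-name=fastconformer_hybrid_transducer_ctc_bpe \
    model.train_ds.manifest_filepath=<train_manifest>.json \
    model.validation_ds.manifest_filepath=<val_manifest>.json \
    model.tokenizer.dir=<tokenizer_dir> \
    model.tokenizer.type=bpe \
    trainer.devices=8 \
    trainer.accelerator=gpu
```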
Expected behavior
Training progresses normally, as it did up to `nemo:24.05.01`.

Environment overview
- Docker images: `nemo:24.01.01.framework`, `nemo:24.05.01`, `nemo:24.07` and `nemo:dev`
- Install method: `docker pull` & `docker run` commands used