Highlights

LLM and MM

Models

Megatron Core RETRO
- Pre-training
- Zero-shot Evaluation
Pretraining, conversion, evaluation, SFT, and PEFT for:
- Mixtral 8X22B
- Llama 3
- SpaceGemma
Embedding Models Fine Tuning
- Mistral
- BERT
BERT models
- Context Parallel
- Distributed checkpoint
Video capabilities with NeVA

Performance

Distributed Checkpointing
- Torch native backend
- Parallel read/write
- Async write
Multimodal LLM (LLaVA/NeVA)
- Pipeline Parallelism support
- Sequence packing support

Export

Integration of Export & Deploy Modules into NeMo Framework container
- Upgrade to TRT-LLM 0.9

Speech (ASR & TTS)

Models

AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
Multimodal Domain - Speech LLM supporting SALM Model
Parakeet-tdt_ctc-1.1b Model - RTFx > 1500 (can transcribe more than 1500 seconds of audio in 1 second)
Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
- mel_codec_22khz_medium
- mel_codec_44khz_medium
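
The RTFx figure quoted for Parakeet is the inverse real-time factor: seconds of audio processed per second of wall-clock time. A minimal sketch of the arithmetic follows; the function names `rtfx` and `rtf` are illustrative and are not part of the NeMo API.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio seconds handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: compute seconds spent per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 1500 seconds of audio transcribed in 1 second of wall-clock time.
print(rtfx(1500.0, 1.0))
```

Higher RTFx is better and lower RTF is better, so a 2-3x RTF improvement means the same audio needs one half to one third of the compute time.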

Perf Improvements

transcribe() upgrade - Enables one-line transcription of files, tensors, and data loaders
Frame-looping algorithm for faster RNN-T decoding - Improves Real Time Factor (RTF) by 2-3x
CUDA Graphs + Label-Looping algorithm for RNN-T and TDT decoding - Transducer greedy decoding at an RTFx of over 1500, on par with CTC non-autoregressive models
Semi-sorted batching support - External user contribution that speeds up training by 15-30%.
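
Semi-sorted batching groups utterances of similar length to cut padding while keeping some randomness in batch composition. The sketch below shows the general idea only and is not NeMo's implementation; `semi_sorted_batches`, `padding_fraction`, and the `noise` parameter are hypothetical names chosen for this example.

```python
import random


def semi_sorted_batches(lengths, batch_size, noise=0.2, seed=0):
    """Return batches of indices grouped by randomly perturbed length."""
    rng = random.Random(seed)
    # Perturb each length by up to +/- `noise` (relative), so the resulting
    # order is only approximately sorted and differs between seeds.
    keyed = [(n * (1.0 + rng.uniform(-noise, noise)), i)
             for i, n in enumerate(lengths)]
    keyed.sort()
    order = [i for _, i in keyed]
    # Cut the roughly sorted order into consecutive batches.
    batches = [order[k:k + batch_size]
               for k in range(0, len(order), batch_size)]
    # Shuffle whole batches so training does not see lengths in ascending order.
    rng.shuffle(batches)
    return batches


def padding_fraction(lengths, batches):
    """Fraction of padded elements when each batch is padded to its longest item."""
    padded = sum(max(lengths[i] for i in b) * len(b) for b in batches)
    return 1.0 - sum(lengths) / padded


lengths = list(range(1, 101))  # toy utterance lengths (hypothetical data)
batches = semi_sorted_batches(lengths, batch_size=10)
print(len(batches), padding_fraction(lengths, batches))
```

Setting `noise=0` gives fully sorted batches with the least padding; larger values trade some padding for more varied batch composition.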

Customization

Context biasing for CTC word stamping - Improves accuracy for custom vocabulary and pronunciation
Longform Inference
- Longform inference support for AED models
Transcription of multi-channel audio for AED models

Misc

Upgraded webdataset - Speech and LLM / Multimodal unified container

Detailed Changelogs

ASR

Changelog

Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
TDT confidence fix by @GNroy :: PR: #8982
Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
[ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
Add ASR latest news by @titu1994 :: PR: #9073
Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
Update branch for notebooks and ci in release by @ericharper :: PR: #9189
Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
rename paths2audiofiles to audio by @nithinraok :: PR: #9209
Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
Cherrypick: Support dataloader as input to audio for transcription (#9201) by @titu1994 :: PR: #9235
Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
Fix loading github raw images on notebook by @nithinraok :: PR: #9282
typos by @nithinraok :: PR: #9314
Re-enable cuda graphs in training modes. by @galv :: PR: #9338
add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380

TTS

Changelog

[TTS] Add tutorial for training audio codecs by @rlangman :: PR: #8723
Update radtts.py by @blisc :: PR: #9097
[Nemo CICD] RADTTS test optional by @pablo-garay :: PR: #9112
Remove Radtts CI test by @blisc :: PR: #9144
Fix T5 G2P Input and Output Types by @blisc :: PR: #9224

LLM and MM

Changelog

Rachitg/dpa by @rachitgarg91 :: PR: #8911
Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908
Huvu/mcore retro by @huvunvidia :: PR: #8861
fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947
Fix memory leak at loss func by @minitu :: PR: #8868
change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965
Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991
[NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987
Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905
Gemma bug by @cuichenx :: PR: #8962
[NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995
Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859
add geglu to mlp swap by @JRD971000 :: PR: #8999
add timeout for new_group by @acphile :: PR: #8998
Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941
Added fusion for squared relu by @sanandaraj5597 :: PR: #8963
Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026
[NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011
Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022
docs and simplification of cmd args by @arendu :: PR: #8979
[NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057
Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957
Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070
unfused lora by @arendu :: PR: #9004
Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075
Alit/griffin by @JRD971000 :: PR: #9021
Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016
Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095
HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060
mcore ds updates by @dimapihtar :: PR: #8951
Alit/griffin perf by @JRD971000 :: PR: #9107
Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110
Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869
Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118
Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034
Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128
scripts to convert HF lora to nemo by @arendu :: PR: #9102
Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015
add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142
Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088
Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145
CUDA memory profile by @erhoo82 :: PR: #9096
Fix missing func for T5 model by @gdengk :: PR: #9141
Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125
Revert rope fusion defaults by @cuichenx :: PR: #9238
Update nemo.export module for quantized models by @janekl :: PR: #9250
Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287
neva media_type + text generation default fix by @paul-gibbons :: PR: #9257
fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290
add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208
Fix P-tuning for Llama based models by @apanteleev :: PR: #9297
add deprecation warnings by @pablo-garay :: PR: #9266
move pooler under post_process by @dimapihtar :: PR: #9328
add deprecation note for nmt by @dimapihtar :: PR: #9342
Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204
fix fp16 precision issue by @dimapihtar :: PR: #9376
Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877

Export

Changelog

Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873
Mingyuanm/sdxl export by @Victor49152 :: PR: #8926
Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866
Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974
TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863

General Improvements

Changelog

Update package info by @ericharper :: PR: #8793
[Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917
Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895
Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942
Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883
Update te install for jenkins by @ericharper :: PR: #8954
[Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959
Minor quantization pipeline updates by @janekl :: PR: #8924
Fix External CLIP Converter by @yaoyu-33 :: PR: #8960
PP support in LoRA merge script by @cuichenx :: PR: #8934
Update PR template by @ericharper :: PR: #8978
Update Latest News by @shashank3959 :: PR: #8837
Fix incorrect link to latest news in README by @shashank3959 :: PR: #8985
Update dependency install for LLM and MM by @ericharper :: PR: #8990
Temporarily remove mcore dep by @ericharper :: PR: #9010
[Nemo CICD] further specialize runners for more parallelism by @pablo-garay :: PR: #9036
Update mm dataprep notebook based on feedback by @cuichenx :: PR: #9029
Fix import in lora merge script by @cuichenx :: PR: #9032
[Nemo CICD] Run when labeled:Run CICD by @pablo-garay :: PR: #9044
[Nemo CICD] Add tag/label for 1-gpu runner by @pablo-garay :: PR: #9046
[Nemo CICD] checkout v4 by @pablo-garay :: PR: #9048
[Nemo CICD] Remove temp test change by @pablo-garay :: PR: #9049
remove in-place addition for dreambooth train with text encoder by @Victor49152 :: PR: #8825
Mingyuanm/sdxl quantization notebook by @Victor49152 :: PR: #9042
[Nemo CICD] Trigger on comment issued by @pablo-garay :: PR: #9062
zarr ckpt to torch_dist ckpt converter by @dimapihtar :: PR: #8842
Restore PTQ tests for Llama2 (reopened) by @janekl :: PR: #9064
add clip H config by @JRD971000 :: PR: #9082
[NeMo-UX] Add mixed-precision plugin by @marcromeyn :: PR: #9065
Comment baichuan test and update pr template by @ericharper :: PR: #9085
Add safe extraction of nemo tar files by @athitten :: PR: #8976
Improved shard_id parsing in LazyNemoTarredIterator, enables AIS dataloading by @pzelasko :: PR: #9077
[NeMo-UX] Add mistral-7b model by @marcromeyn :: PR: #9066
Llama3 Conversion Script Update by @suiyoubi :: PR: #9089
dehardcode test string by @JimmyZhang12 :: PR: #8865
[Nemo CICD] Try trigger cicd run on comment by @pablo-garay :: PR: #9111
Lhotse dataloading: RIR augmentation and nemo/tarred input support for RIR and noise aug by @pzelasko :: PR: #9109
mixtral evaluation PR by @Slyne :: PR: #8989
[Nemo CICD] Revert: run GHA cicd on comment by @pablo-garay :: PR: #9119
[Nemo CICD] Comment out flaky test: running too long by @pablo-garay :: PR: #9123
[Nemo CICD] Add timeout to unit tests by @pablo-garay :: PR: #9132
[Nemo CICD] Indicate optional test in name (prefix) by @pablo-garay :: PR: #9139
video neva null image+video folder path fix by @paul-gibbons :: PR: #9116
[NeMo-UX] Add data module by @cuichenx :: PR: #9133
NeMo Inference Requirements by @oyilmaz-nvidia :: PR: #9093
Remove debug print by @maanug-nv :: PR: #9074
Remove legacy CI by @pablo-garay :: PR: #9149
Update support for push_to_hf_hub() by @titu1994 :: PR: #9159
[Nemo CICD] comment out flaky PTQ tests by @pablo-garay :: PR: #9160
Update branch by @ericharper :: PR: #9211
dist adam transpose fix by @dimapihtar :: PR: #9239
[Nemo CICD] Increase time limit for Speech_Checkpoints_tests (#9186) by @pablo-garay :: PR: #9247
Pin transformers by @ericharper :: PR: #9261
Fix typo in HF tutorial by @titu1994 :: PR: #9302


NVIDIA Neural Modules 2.0.0rc0
