
NVIDIA Neural Modules 2.0.0rc0

@ericharper released this 06 Jun 05:46

Highlights

LLM and MM

Models

  • Megatron Core RETRO

    • Pre-training
    • Zero-shot Evaluation
  • Pretraining, conversion, evaluation, SFT, and PEFT for:

    • Mixtral 8X22B
    • Llama 3
    • SpaceGemma
  • Embedding Model Fine-Tuning

    • Mistral
    • BERT
  • BERT models

    • Context Parallel
    • Distributed checkpoint
  • Video capabilities with NeVA
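
Fine-tuning embedding models usually relies on a contrastive objective with in-batch negatives. These notes do not say which loss NeMo uses, so the following is only a minimal pure-Python sketch of an InfoNCE-style loss. The cosine similarity, the `temperature` default, and the function names are assumptions chosen for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(queries: List[List[float]], passages: List[List[float]],
                  temperature: float = 0.05) -> float:
    """Mean InfoNCE loss (hypothetical helper).

    passages[i] is the positive for queries[i]; every other passage in
    the batch acts as an in-batch negative.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(logit - m) for logit in logits))
        total += log_sum_exp - logits[i]  # -log softmax probability of the positive
    return total / len(queries)
```

Aligned query/passage pairs drive the loss toward zero, while swapping the positives makes it large, which is the signal that pulls matching embeddings together.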

Performance

  • Distributed Checkpointing

    • Torch native backend
    • Parallel read/write
    • Async write
  • Multimodal LLM (LLaVA/NeVA)

    • Pipeline Parallelism support
    • Sequence packing support
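
Sequence packing combines several variable-length samples into one fixed-length row to cut down on padding. The notes do not describe NeMo's packing algorithm; the sketch below assumes a simple first-fit-decreasing strategy, a hypothetical `max_seq_len` budget, and samples given only by their token lengths.

```python
from typing import List


def pack_sequences(lengths: List[int], max_seq_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing (illustrative sketch).

    Returns bins as lists of sample indices; the lengths in each bin
    sum to at most max_seq_len.
    """
    # Place the longest samples first so short ones can fill the gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin
    for i in order:
        if lengths[i] > max_seq_len:
            raise ValueError(f"sample {i} is longer than max_seq_len")
        for b, capacity in enumerate(free):
            if lengths[i] <= capacity:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([i])
            free.append(max_seq_len - lengths[i])
    return bins
```

In practice a packed row also needs per-sample boundaries (for example, cumulative sequence lengths) so that attention does not cross from one sample into the next.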

Export

  • Integration of Export & Deploy modules into the NeMo Framework container
  • Upgrade to TensorRT-LLM (TRT-LLM) 0.9

Speech (ASR & TTS)

Models

  • AED Multi-Task Models (Canary) - multi-task, multilingual speech recognition and speech translation model
  • Multimodal Domain - Speech LLM support, including the SALM model
  • Parakeet-tdt_ctc-1.1b Model - RTFx of over 1500 (can transcribe more than 1500 seconds of audio in 1 second)
  • Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
    • mel_codec_22khz_medium
    • mel_codec_44khz_medium
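
RTFx, as used above, is the inverse real-time factor: seconds of audio transcribed per second of processing time. RTF is the reciprocal, so lower RTF means faster. A small worked example follows; the helper names are illustrative only.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# 1500 s of audio transcribed in 1 s gives RTFx = 1500 (RTF = 1/1500).
speed = rtfx(1500.0, 1.0)
# At that speed, one hour of audio takes 3600 / 1500 = 2.4 s.
hour_seconds = 3600.0 / speed
```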

Performance Improvements

  • transcribe() upgrade - enables one-line transcription of audio files, tensors, and data loaders
  • Frame-looping algorithm for faster RNN-T decoding - improves Real Time Factor (RTF) by 2-3x
  • CUDA Graphs + label-looping algorithm for RNN-T and TDT decoding - transducer greedy decoding at over 1500x RTFx, on par with non-autoregressive CTC models
  • Semi-sorted batching support - external user contribution that speeds up training by 15-30%
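
Semi-sorted batching groups utterances of similar duration to reduce padding while keeping some randomness between epochs. The notes do not describe the contributed implementation; this sketch follows the general idea of sorting by a randomly perturbed duration, cutting the result into batches, and shuffling whole batches. The `noise` width and the function name are assumptions.

```python
import random
from typing import List, Optional


def semi_sorted_batches(durations: List[float], batch_size: int,
                        noise: float, seed: Optional[int] = None) -> List[List[int]]:
    """Return batches of sample indices whose durations are similar but not identical."""
    rng = random.Random(seed)
    # Perturb each duration by uniform noise so the sort is only approximate.
    keys = [d + rng.uniform(-noise, noise) for d in durations]
    order = sorted(range(len(durations)), key=lambda i: keys[i])
    # Cut the semi-sorted order into consecutive batches.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle batch order so training does not always see short batches first.
    rng.shuffle(batches)
    return batches
```

With `noise=0` this degenerates to fully sorted batching; larger `noise` trades more padding for more randomness in batch composition.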

Customization

  • Context biasing for CTC word spotting - improves accuracy for custom vocabulary and pronunciation
  • Longform Inference
    • Longform inference support for AED models
  • Transcription of multi-channel audio for AED models
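
A common way to run inference on long recordings is to split the audio into overlapping windows, transcribe each window, and merge the results. The notes do not say how NeMo implements longform inference for AED models; the sketch below only computes overlapping window boundaries in samples, and `chunk_seconds` and `overlap_seconds` are hypothetical parameters.

```python
from typing import List, Tuple


def chunk_boundaries(num_samples: int, sample_rate: int,
                     chunk_seconds: float, overlap_seconds: float) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into windows of chunk_seconds that overlap by overlap_seconds."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk must be positive and longer than the overlap")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break  # the last window reaches the end of the audio
        start += step
    return bounds
```

Merging the per-window transcripts (for example, deduplicating words in the overlap) is a separate step not shown here.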

Misc

  • Upgraded webdataset - unified container for Speech and LLM/Multimodal

Detailed Changelogs

ASR

Changelog
  • Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
  • TDT confidence fix by @GNroy :: PR: #8982
  • Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
  • NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
  • Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
  • Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
  • [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
  • Add ASR latest news by @titu1994 :: PR: #9073
  • Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
  • PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
  • RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
  • Fix #8891 by supporting GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
  • Update branch for notebooks and ci in release by @ericharper :: PR: #9189
  • Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
  • rename paths2audiofiles to audio by @nithinraok :: PR: #9209
  • Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
  • Cherrypick: Support dataloader as input to audio for transcription (#9201) by @titu1994 :: PR: #9235
  • Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
  • Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
  • Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
  • Fix loading github raw images on notebook by @nithinraok :: PR: #9282
  • typos by @nithinraok :: PR: #9314
  • Re-enable cuda graphs in training modes. by @galv :: PR: #9338
  • add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
  • Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
  • Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
  • Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
  • Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380
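
Several entries above concern batched greedy CTC decoding. Greedy CTC decoding takes the most likely token at each frame, collapses consecutive repeats, and then drops the blank token. This pure-Python sketch runs over a padded batch; it is not NeMo's `GreedyBatchedCTCInfer`. The blank index of 0, the nested-list input, and treating `lengths=None` as "use every frame" are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode_batch(
    logits: Sequence[Sequence[Sequence[float]]],  # [batch][time][vocab]
    lengths: Optional[Sequence[int]] = None,
    blank: int = 0,
) -> List[List[int]]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    if lengths is None:
        # Assumption: no lengths means every sequence uses all of its frames.
        lengths = [len(seq) for seq in logits]
    hyps = []
    for seq, length in zip(logits, lengths):
        # Argmax over the vocabulary for each valid (unpadded) frame.
        best = [max(range(len(frame)), key=frame.__getitem__) for frame in seq[:length]]
        tokens = []
        prev = None
        for t in best:
            # Keep a token only when it differs from the previous frame and is not blank.
            if t != prev and t != blank:
                tokens.append(t)
            prev = t
        hyps.append(tokens)
    return hyps
```

A blank between two identical tokens keeps both, which is how CTC represents genuinely repeated labels.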

TTS

Changelog

LLM and MM

Changelog

Export

Changelog

General Improvements

Changelog