From 47b1553c948ac25ef27eb8710f57ec3ab0f946c2 Mon Sep 17 00:00:00 2001 From: "He Huang (Steve)" <105218074+stevehuang52@users.noreply.github.com> Date: Sat, 11 May 2024 10:24:11 +0800 Subject: [PATCH 01/36] Add SpeechLM to main (#8741) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * update package info Signed-off-by: ericharper * fix the mpt chatbot (#6957) Signed-off-by: Yi Dong * Remove `compute_on_step` from metrics (#6979) * Remove `compute_on_step` from metrics Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove confusing log message Signed-off-by: smajumdar * Update tests Signed-off-by: smajumdar --------- Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Hybrid conformer export (#6983) * Implemented generic kv-pair setting of export_config from args Signed-off-by: Boris Fomitchev * Hybrid conformer export Signed-off-by: Boris Fomitchev * Hybrid decoder export Signed-off-by: Boris Fomitchev * Cleanup Signed-off-by: Boris Fomitchev * Changed from **kwargs Signed-off-by: Boris Fomitchev * Docstring Signed-off-by: Boris Fomitchev * Docs added Signed-off-by: Boris Fomitchev * Stringify args Signed-off-by: Boris Fomitchev * Added docs for ASR export configs Signed-off-by: Boris Fomitchev * lowercase ctc Signed-off-by: Boris Fomitchev --------- Signed-off-by: Boris Fomitchev * Cache handling without input tensors mutation (#6980) * Cache handling without input tensors mutation Signed-off-by: Boris Fomitchev * Cleanup Signed-off-by: Boris Fomitchev * Cleanup#2 Signed-off-by: Boris Fomitchev * Cleanup#3 Signed-off-by: Boris Fomitchev --------- Signed-off-by: Boris Fomitchev Co-authored-by: Somshubra Majumdar * fixes for spellmapper (#6994) Signed-off-by: Alexandra Antonova * Fixing an issue with confidence ensembles (#6987) * Bug fix for the confidence ensembles Signed-off-by: Igor Gitman * Relax constraints for the test Signed-off-by: Igor Gitman --------- Signed-off-by: Igor Gitman * [TTS] Append pretrained FastPitch & SpectrogamEnhancer pair to available models (#7012) * [TTS] fastpitch: add english libritts model with asr stft parameters (25 ms 10 ms) Signed-off-by: Roman Korostik * [TTS] enhancer: add pretrained model intended for asr finetuning Signed-off-by: Roman Korostik --------- Signed-off-by: Roman Korostik * Add ASR with TTS Tutorial. Fix enhancer usage. 
(#6955) * Add ASR with TTS Tutorial * Fix enhancer usage Signed-off-by: Vladimir Bataev * install_bs (#7019) Signed-off-by: Nikolay Karpov * fix tab text gen (#7022) Signed-off-by: Yi Dong * TE bug fix (#7027) Signed-off-by: Dmytro Pykhtar * Add support for Numba FP16 RNNT Loss (#6991) (#7038) * Force working space memory to always be in fp32 Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Fix cost calculation by upcasting to fp32 Signed-off-by: smajumdar * Fix cost calculation by upcasting to fp32 Signed-off-by: smajumdar * Add support to check if numba fp16 is available Signed-off-by: smajumdar * add RNN-T loss implemented by PyTorch and test code (#5312) * Fix the bugs in cache-aware streaming Conformer (#5032) Signed-off-by: Vahid Signed-off-by: Hainan Xu * IA3 support for GPT and T5 (#4909) * init commit for ia3 adapter training in GPT Signed-off-by: arendu * ia3 adapter training in GPT, models and adapter classes Signed-off-by: arendu * reshape to operate even on non-contiguous tensors Signed-off-by: arendu * configs Signed-off-by: arendu * fixed none init Signed-off-by: arendu * adding adapter and ia3 support for T5 based models Signed-off-by: arendu * style fix Signed-off-by: arendu * config update and t5 model adapter and ia3 Signed-off-by: arendu * removed unused imports Signed-off-by: arendu * predict step for inference Signed-off-by: arendu * style fix Signed-off-by: arendu * style fix Signed-off-by: arendu * adapter inference for t5 Signed-off-by: arendu * style fix Signed-off-by: arendu * fixed bug with micro and global batch size in eval Signed-off-by: arendu * minor edit Signed-off-by: arendu * aggressive truncation in test examples if no truncation field is given Signed-off-by: arendu * corrected for language_model_path name changes in main Signed-off-by: arendu * removed unused import Signed-off-by: arendu * name change for language_model_path Signed-off-by: arendu * include inter_attention to IA3 Signed-off-by: arendu * minor fix in config Signed-off-by: arendu * minor fixes Signed-off-by: arendu * removed unused flag Signed-off-by: arendu * addressing PR comments Signed-off-by: arendu * address PR comments Signed-off-by: arendu * minor fix Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * style fix Signed-off-by: arendu * CI test Signed-off-by: arendu * minor fix in jenkinsfile Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Bug fix - Limit val batches set to 1.0 (#5023) * Bug fix Signed-off-by: shanmugamr1992 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Addressed Sandeep's comments * Fixing limit val batches support in bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixing limit val batches support in bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: shanmugamr1992 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [bug_fix] kv_channels is used when available (#5066) * fix bug s.t. kv_channels is used when available 
Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * P&C Docs (#5068) (#5069) Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Co-authored-by: Matvei Novikov Signed-off-by: Hainan Xu * Add spe_split_by_unicode_script arg (#5072) * Add spe_split_by_unicode_script arg Signed-off-by: Anas * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Anas Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * probabilites -> probabilities (#5078) (#5079) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: Hainan Xu * increase PR and Issue sweep quantity and active close PRs. (#5073) * increase PR and Issue sweep quantity and active close PRs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * update with stricter rules, 30 days to be stale and 7 days to be closed for both Issues and PRs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * [TTS] added missing German phoneme tokenizer. (#5070) (#5074) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * rename to match prompt leanring (#5076) Signed-off-by: arendu Signed-off-by: arendu Signed-off-by: Hainan Xu * Missing fixes from r1.11.0 to T5 finetuning eval (#5054) (#5061) * Fixes to seq2seq eval Signed-off-by: MaximumEntropy * Style Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Notebook bug fixes (#5084) (#5085) * Notebook bug fixes Signed-off-by: Virginia Adams * Turned nemo install back on Signed-off-by: Virginia Adams * reverted notebook Signed-off-by: Virginia Adams * Updated one line in entity linking nb Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Co-authored-by: Eric Harper Signed-off-by: Virginia Adams Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * update strategy in notebook from ddp_fork to dp (#5088) (#5089) Co-authored-by: Zhilin Wang Signed-off-by: Hainan Xu * Fix bug in Squeezeformer Conv block (#5011) (#5024) * Fix bug in Squeezeformer Conv block Signed-off-by: smajumdar * Fix kernel context Signed-off-by: smajumdar * Fix access mixin Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * fixed megatron lm conversion bug (PTL related) (#5038) (#5063) Signed-off-by: David Mosallanezhad Signed-off-by: David Mosallanezhad Co-authored-by: David Mosallanezhad Signed-off-by: David Mosallanezhad Co-authored-by: David Co-authored-by: David Mosallanezhad Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * Fix Unhashable type list for Numba Cuda spec 
augment kernel (#5093) (#5094) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * Fix numba (#5098) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: Hainan Xu * Make it possible to specify output_filename in normalize_with_audio.py (#5092) Signed-off-by: Elena Rastorgueva Signed-off-by: Elena Rastorgueva Signed-off-by: Hainan Xu * Greedy decoding confidence for CTC and RNNT (#4931) * rnnt confidence draft Signed-off-by: Aleksandr Laptev * word confidence Signed-off-by: Aleksandr Laptev * advanced entropies added Signed-off-by: Aleksandr Laptev * refactoring Signed-off-by: Aleksandr Laptev * oops forgot a file Signed-off-by: Aleksandr Laptev * metrics and benchmarking script added Signed-off-by: Aleksandr Laptev * style fix Signed-off-by: Aleksandr Laptev * texterrors installation added Signed-off-by: Aleksandr Laptev * lgtm and bug fix Signed-off-by: Aleksandr Laptev * fix comments Signed-off-by: Aleksandr Laptev * fix typos Signed-off-by: Aleksandr Laptev * add missing import after rebase Signed-off-by: Aleksandr Laptev Signed-off-by: Aleksandr Laptev Co-authored-by: Aleksandr Laptev Signed-off-by: Hainan Xu * [Add] SLURP models and examples (#4668) * add model, util and loss Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * refactor annd update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * update available models Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * refactor data processing Signed-off-by: stevehuang52 * fix typo Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * refactor and update Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * move transformer to asr.modules Signed-off-by: stevehuang52 * move transformer to asr.modules Signed-off-by: stevehuang52 * get rid of jsonlines Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * revert changes to nlp Signed-off-by: stevehuang52 Signed-off-by: stevehuang52 Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com> Signed-off-by: Hainan Xu * only optimize params that are part of the adapter modules (#5086) Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Signed-off-by: Hainan Xu * Pipeline Parallel T5 Prompt Learning (#4956) * Added pre process flag checks and pipeline parallel in fwd Signed-off-by: Virginia Adams * Added rank check for pipeline parallel Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * T5 prompt learning works! Signed-off-by: Virginia Adams * IA3 passing CI Signed-off-by: Virginia Adams * Fixed typo Signed-off-by: Virginia Adams * removed optimizer setup so Adi's change will not conflict Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Signed-off-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * [TTS] remove phonemizer.py (#5090) remove phonemizer.py and convert code block to markdown in the tutorial. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * T5 Decoding with PP > 2 fix (#5091) (#5103) * set sequence lengths in the pipeline properly Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [TTS] fixed wrong val loss for epoch 0 and inconsistent metrics names (#5087) (#5102) * fixed hifigan configs as well * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Fix and refactor consumed samples save/restore for Megatron models. (#5077) * Fixes and refactor Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * Remove unused imports Signed-off-by: MaximumEntropy * Empty Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: Hainan Xu * RIR corpus generator tool (#4927) Signed-off-by: Ante Jukić Signed-off-by: Ante Jukić Signed-off-by: Hainan Xu * Multiprocessing fix (#5106) (#5107) Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Co-authored-by: Matvei Novikov Signed-off-by: Hainan Xu * [Bug fix] PC lexical + audio (#5109) (#5110) * training running Signed-off-by: ekmb * revert Signed-off-by: ekmb * revert Signed-off-by: ekmb Signed-off-by: ekmb Signed-off-by: ekmb Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: Hainan Xu * [Fix] schedulers with no max_steps param (#4564) * fix schedulers Signed-off-by: stevehuang52 * update to use python inspect module Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 Signed-off-by: stevehuang52 Signed-off-by: Hainan Xu * T5 prompt learning fixes missing from r1.11.0 merge (#5075) (#5101) * Fix special tokens Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * Empty Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: David Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Co-authored-by: David Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * [TTS] Add NeMo TTS Primer Tutorial (#4933) * [TTS] Add NeMo TTS Primer Tutorial Signed-off-by: Ryan Signed-off-by: Hainan Xu * Add Squeezeformer CTC model checkpoints on Librispeech (#5121) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: Hainan Xu * adding loss normalization options to rnnt joint (#4829) * adding normalization options to rnnt joint loss * moving the param to joint * moving loss normalization to rnnt loss config * style * cleaning up * fixing sum reduction in joint Signed-off-by: Dima Rekesh 
* moving reduction into RNNT loss class * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring * typos Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Co-authored-by: Dima Rekesh Co-authored-by: Oleksii Kuchaiev Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Asr concat dataloader (#5108) * forced precision * typo * initial commit Signed-off-by: Dima Rekesh * typos and bugs Signed-off-by: Dima Rekesh * reverting conformer encoder Signed-off-by: Dima Rekesh * additional checks Signed-off-by: Dima Rekesh * adding support to CTC models as well * reverting conformer_encoder Signed-off-by: Dima Rekesh * typo Signed-off-by: Dima Rekesh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring Signed-off-by: Dima Rekesh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring Signed-off-by: Dima Rekesh * merging Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Co-authored-by: Dima Rekesh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * fix blossom ci unittests Signed-off-by: Oleksii Kuchaiev Signed-off-by: Hainan Xu * bugfix: pybtex.database.InvalidNameString: Too many commas in author field. (#5112) (#5115) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * Update container version to 22.09 (#5105) * update container version Signed-off-by: ericharper * pin click Signed-off-by: ericharper * pin click 8.0.2 Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: Hainan Xu * Remove unsupported arguments from MegatronNMT (#5065) * Fixes Signed-off-by: MaximumEntropy * Fixes Signed-off-by: MaximumEntropy * Style Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * More fixes Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: Hainan Xu * pp2 support for T5 IA3 learning and T5 Adapters learning (#5116) * enabling pp2 Signed-off-by: arendu * optimizer update Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * T5 pp>1 support for adapters and ia3 Signed-off-by: arendu * fix bug with missing adapter_tuning Signed-off-by: arendu * inference error fixed, pp=2 Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Signed-off-by: Hainan Xu * T5 Prompt Learning Fixes for Pipeline Parallel (#5120) * Initial fixes Signed-off-by: MaximumEntropy * Added back validation acc Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Put num workers back Signed-off-by: Virginia Adams * added relative encoding if statement Signed-off-by: Virginia Adams * Added back val loss only validation Signed-off-by: Virginia Adams * Revert "Added back val loss only validation" This reverts commit 86d8f4806fe30335c40c3716ce18259939df500f. 
* Removed val acc for PP > 1 Signed-off-by: Virginia Adams * Removed enc_seq_len if statement Signed-off-by: Virginia Adams * Added back validation acc calc Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Co-authored-by: Virginia Adams Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Virginia Adams Signed-off-by: Hainan Xu * add doc info (#4721) Signed-off-by: Yang Zhang Signed-off-by: Yang Zhang Signed-off-by: Hainan Xu * [TTS] Add SpanishCharsTokenizer (#5135) * [TTS] Add SpanishCharsTokenizer Signed-off-by: Ryan Signed-off-by: Hainan Xu * Update megatron interface to dialogue (#4936) * fix style formatting Signed-off-by: Zhilin Wang * update template to include description of intent Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * changes based on requests in review Signed-off-by: Zhilin Wang * add compatibility with assistant dataset Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * remove dialogue_state_tracking Signed-off-by: Zhilin Wang * update huggingface utils for dialogue Signed-off-by: Zhilin Wang * rename dialogue_state_tracking_hybrid to dialogue_state_tracking_sgdqa Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * fix style Signed-off-by: Zhilin Wang * style fix nemo/collections/nlp/models/dialogue_state_tracking_sgdqa/__init__.py Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * fix typo Signed-off-by: Zhilin Wang * add docstrings for assistant data processsor Signed-off-by: Zhilin Wang * update Jenkins for SGDGEN local checkpoint Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * use local vocab file for Jenkinsfile Signed-off-by: Zhilin Wang * patch for Jenkins CI using local file Signed-off-by: Zhilin Wang * add slot filling prediction and metrics Signed-off-by: Zhilin Wang * remove unused code Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * refactor metrics code out of Dialogue GPT Model Signed-off-by: Zhilin Wang * integrate backward compatible support for IntentSlotClassificationModel (bert model) Signed-off-by: Zhilin Wang * save prediction file for IntentSlotClassification Signed-off-by: Zhilin Wang * update dialogue gpt model training for megatron gpt Signed-off-by: Zhilin Wang * remove batch generate for HF GPT2, which causes lower performance Signed-off-by: Zhilin Wang * add few shot capability to dialogue gpt model Signed-off-by: Zhilin Wang * update Jenkinsfile and remove unused import Signed-off-by: Zhilin Wang * update code description and clarity Signed-off-by: Zhilin Wang * address PR comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * integrate compatibility with ZeroShotIntentModel Signed-off-by: Zhilin Wang * rename folder to dialogue due to increased scope and further refactor for clarity Signed-off-by: Zhilin Wang * added dialogue GPT for sequence generation task (e.g. 
answer extender) Signed-off-by: Zhilin Wang * add CI test for DialogueGPTGenerationModel Signed-off-by: Zhilin Wang * integrate DialogueS2SGenerationModel for generation task (e.g. answer extender) Signed-off-by: Zhilin Wang * modify huggingface utils to support HF t5/BART models Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * remove unused imports Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update bleu metric Signed-off-by: Zhilin Wang * fix bleu metric style Signed-off-by: Zhilin Wang * debug bleu metric Signed-off-by: Zhilin Wang * debug bleu metric Signed-off-by: Zhilin Wang * update based on PR #3893 Signed-off-by: Zhilin Wang * update 2 based on PR #3893 Signed-off-by: Zhilin Wang * update 3 based on PR #3893 Signed-off-by: Zhilin Wang * integrate sgd generation based on user user utterance and system slot-values to generate system utterance Signed-off-by: Zhilin Wang * add validation model saving capabilities Signed-off-by: Zhilin Wang * cleaned up code for SGD Based Answer extender Signed-off-by: Zhilin Wang * update Dialogue Generation CI Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * fix Jenkins CI issue" Signed-off-by: Zhilin Wang * add support for design dataset Signed-off-by: Zhilin Wang * remove unnecessary imports Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * support megatron for dialogue_s2s_generation_model Signed-off-by: Zhilin Wang * reduce loaded samples in MSMarcoDataProcessor to 64 when cfg.model.dataset.debug_mode=True Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update CI Signed-off-by: Zhilin Wang * update checkpoint and predictions filename to include epoch number Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * integrate HF BART MNLI into zero shot intent model Signed-off-by: Zhilin Wang * integrate Dialogue Nearest Neighbour Model Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * refactor Dialogue SGD Data Processor to make interface for models cleaner Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update Dialogue S2S Generation model for DialogueSGDDataProcessor interface Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * support sgd and drive thru datasets by zero shot model and nearest neighbour model Signed-off-by: Zhilin Wang * add prediction saving code to nearest neighbour and zero shot intent models Signed-off-by: Zhilin Wang * fix typo in sgd data processor Signed-off-by: Zhilin Wang * integrate Dialogue Mellon QA Data Processor Signed-off-by: Zhilin Wang * update mellon qa Signed-off-by: Zhilin Wang * update dialogue.py to remove outdated info Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update dialogue_config.yaml Signed-off-by: Zhilin Wang * update dialogue_config.yaml Signed-off-by: Zhilin Wang * add dialogue docs Signed-off-by: Zhilin Wang * address review comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix 
Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix for cfg Signed-off-by: Zhilin Wang * make dependency on apex optional Signed-off-by: Zhilin Wang * change NLPDDPPlugin calling logic to make it possible to run without apex Signed-off-by: Zhilin Wang * add first draft of tutorial Signed-off-by: Zhilin Wang * reduce ms marco size by removing lines without wellFormedAnswers Signed-off-by: Zhilin Wang * address pr comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update colab tutorial link in dialogue docs Signed-off-by: Zhilin Wang * include unit test and some refactor to facilitate unit test Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * address pr issues Signed-off-by: Zhilin Wang * remove typos in dialogue tutorial Signed-off-by: Zhilin Wang * support larger files for question answering Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * remove unnecessary artifacts to reduce memory use Signed-off-by: Zhilin Wang * put 0 tensor to device Signed-off-by: Zhilin Wang * update link within dialogue tutorial Signed-off-by: Zhilin Wang * restore previously deleted files Signed-off-by: Zhilin Wang * update error handling when loss = nan Signed-off-by: Zhilin Wang * update nan handling Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update spanning loss func Signed-off-by: Zhilin Wang * update spanning loss Signed-off-by: Zhilin Wang * fix type error raised in qa_dataset.py Signed-off-by: Zhilin Wang * add error checking message Signed-off-by: Zhilin Wang * revert back to float32 Signed-off-by: Zhilin Wang * revert back to float32 Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update exp logging Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update loading of large file from pickle to json Signed-off-by: Zhilin Wang * update loading of large file from pickle to json Signed-off-by: Zhilin Wang * limit number of negative samples Signed-off-by: Zhilin Wang * revert post processing Signed-off-by: Zhilin Wang * revert post processing Signed-off-by: Zhilin Wang * remove unused methods and style fix Signed-off-by: Zhilin Wang * add more documentation Signed-off-by: Zhilin Wang * remove unused imports Signed-off-by: Zhilin Wang * changes based on PR review Signed-off-by: Zhilin Wang * set wandb logger false by default Signed-off-by: Zhilin Wang * update interface with megatron gpt prompt learning Signed-off-by: Zhilin Wang * update inline documentation Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update prompt_ids Signed-off-by: Zhilin Wang * update error msg Signed-off-by: Zhilin Wang * update config Signed-off-by: Zhilin Wang * update config Signed-off-by: Zhilin Wang * set inference = False for dialogue prompt learning during training Signed-off-by: Zhilin Wang * set inference = False for dialogue prompt learning during training Signed-off-by: Zhilin Wang * remove unused code Signed-off-by: Zhilin Wang * update config yaml Signed-off-by: Zhilin Wang * fix bug for megatron gpt prompt 
learning Signed-off-by: Zhilin Wang * remove unused import Signed-off-by: Zhilin Wang * address comments in PR Signed-off-by: Zhilin Wang * address comments in PR Signed-off-by: Zhilin Wang * address typo Signed-off-by: Zhilin Wang * add megatron t5 inference Signed-off-by: Zhilin Wang * fix bug due to bert tokenizer not being space-aware Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * update IntentSlotModel onnx export test Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * update exportable Signed-off-by: Zhilin Wang * address PR comments Signed-off-by: Zhilin Wang * replace functools.cached_property with functools.lru_cache to maintain python 3.7 compatibility Signed-off-by: Zhilin Wang * improve speed of rank_candidates and support for p tuning Signed-off-by: Zhilin Wang * update dialogue.py Signed-off-by: Zhilin Wang * fix megatron prompt learning saving bug Signed-off-by: Zhilin Wang * update generate_candidate method Signed-off-by: Zhilin Wang * remove repeated init text ids and invert attention masks Signed-off-by: Zhilin Wang * update typo Signed-off-by: Zhilin Wang * custom collate fn to remove excess padding in batch Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update complete method to mitigate issue when max seq len is low Signed-off-by: Zhilin Wang * address pr comments Signed-off-by: Zhilin Wang * update generation interface Signed-off-by: Zhilin Wang Signed-off-by: Zhilin Wang Co-authored-by: Zhilin Wang Co-authored-by: Oleksii Kuchaiev Co-authored-by: Yang Zhang Co-authored-by: Eric Harper Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * Added save inference ready .nemo file with every checkpoint (#5055) * Added save inference ready .nemo file with every checkpoint Signed-off-by: Virginia Adams * Python style fix Signed-off-by: Virginia Adams * addressed Adi's comment Signed-off-by: Virginia Adams * Added ptuning check in model checkpoint saving Signed-off-by: Virginia Adams * Changed save_nemo_on_validation default to False Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changes global batch size of adapter CI Signed-off-by: Virginia Adams * Changed num workers to 0 Signed-off-by: Virginia Adams * added first stage of pipeline check Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Fixes for docs/typos + remove max_utts parameter from tarred datasets as it causes hang in training (#5118) * Remove ; from jupyter notebook cells Signed-off-by: Igor Gitman * Fix typos in documentation/code Signed-off-by: Igor Gitman * Fix output message to have 'or equal' Signed-off-by: Igor Gitman * Link formatting fixes Signed-off-by: Igor Gitman * Add error if max_utts is used in tarred datasets Signed-off-by: Igor Gitman * Remove max_utts parameter from tarred datasets Signed-off-by: Igor Gitman * Fix max_utts removal in tests Signed-off-by: Igor Gitman * Fix typo if -> is Signed-off-by: Igor Gitman Signed-off-by: Igor Gitman Signed-off-by: Hainan Xu * Merge r1.12.0 main (#5139) * update branch Signed-off-by: 
ericharper * Add cherry-pick action (#4958) * add cherry-pick action Signed-off-by: ericharper * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Sean Naren * upper bound transformers Signed-off-by: ericharper * remove duplicate transformers requirement Signed-off-by: ericharper * Release SOTA Lang ID model (#5080) * add pretrained lang id model ambernet Signed-off-by: fayejf * update doc and style fix Signed-off-by: fayejf Signed-off-by: fayejf * update branch and package info Signed-off-by: ericharper * remove upper bounds on lightning and transformers Signed-off-by: ericharper * remove transformers offline from ci Signed-off-by: ericharper * upper bound transformers Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: SeanNaren Signed-off-by: fayejf Co-authored-by: Sean Naren Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Hainan Xu * Added ASR model comparison to SDE (#5043) SDE: Added ASR model comparison tool to SDE transcribe speech: Added support for many predictions in one file, as well as custom field names Signed-off-by: George Zelenfroynd Signed-off-by: Hainan Xu * fix nmt eval sampler (#5154) Signed-off-by: Abhinav Khattar Signed-off-by: Abhinav Khattar Signed-off-by: Hainan Xu * Fix Global init steps (#5143) * move global step to base Signed-off-by: Yi Dong * fix fused softmax Signed-off-by: Yi Dong * add the missing file Signed-off-by: Yi Dong * update the fused kernel Signed-off-by: Yi Dong * fix import error Signed-off-by: Yi Dong * fix import again Signed-off-by: Yi Dong Signed-off-by: Yi Dong Signed-off-by: Yi Dong Co-authored-by: Yi Dong Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [TTS] bug fix - sample rate was being ignored in vocoder dataset (#4518) * bug fix - sample rate was being ignored in vocoder dataset when not loading mel * handled n segments for a different sampling rate than original sampling rate * Added case for n_segments 0, warning for n_segments greater than file length Signed-off-by: Paarth Neekhara Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Jocelyn Signed-off-by: Hainan Xu * Add EMA support to NeMo (#4764) * Added Base files Signed-off-by: SeanNaren * Some refactors, swap to using MNIST Lnet Signed-off-by: SeanNaren * Add a few more tests, allow the callback to be set via the exp manager Signed-off-by: SeanNaren * Actually run validation for testing Signed-off-by: SeanNaren * Run isort Signed-off-by: SeanNaren * Add test for saving state/fix saving state Signed-off-by: SeanNaren * Use dummy model Signed-off-by: SeanNaren * Fix test Signed-off-by: SeanNaren * Add copyright Signed-off-by: SeanNaren * Support saving separate EMA weight module Signed-off-by: SeanNaren * Add standalone functionality/logging Signed-off-by: SeanNaren * Expose more parameters Signed-off-by: SeanNaren * Modify to allow option to replace validation Signed-off-by: SeanNaren * Add jenkins test, formatting Signed-off-by: SeanNaren * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version 
Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren * Add cherry-pick action (#4958) (#4961) * add cherry-pick action Signed-off-by: ericharper * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Sean Naren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Eric Harper Co-authored-by: Sean Naren Signed-off-by: SeanNaren * Fix changelog builder (#4962) (#4963) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: SeanNaren * fix cherry pick workflow (#4964) (#4965) Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: ericharper Co-authored-by: Eric Harper Signed-off-by: SeanNaren * reorder model check (#4959) (#4967) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: SeanNaren * check for active conda environment (#4970) (#4971) Signed-off-by: SeanNaren * [TTS] fix broken tutorial for MixerTTS. (#4949) (#4976) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: SeanNaren * Checkpoint averaging class fix (#4946) * 1. Added args.class_path to provide it externally. Signed-off-by: Micha Livne * 1. Fixed style. Signed-off-by: Micha Livne Signed-off-by: Micha Livne Signed-off-by: SeanNaren * Add ability to give separate datasets for test, train and validation (#4798) * Add ability to give separate datasets for test, train and validation * Addressed Sandeep's comments * Addressed Sandeep's comments * Add ability to give separate datasets for test, train and validation * Add ability to give separate datasets for test, train and validation * Addressed review comments * Bug fix for common dataset utils * Add CI tests Signed-off-by: shanmugamr1992 * Reformat code Signed-off-by: shanmugamr1992 * Bug fix Signed-off-by: shanmugamr1992 * Bug fix * Bug Fix * Bug Fix * Update Jenkinsfile * Addressed comments * Addressed Erik's comments. 
* Addressed Sandeep * Update Jenkinsfile * Update Jenkinsfile * Update dataset_utils.py * Update Jenkinsfile * Update Jenkinsfile * Use GPT CI config Signed-off-by: MaximumEntropy Signed-off-by: shanmugamr1992 Signed-off-by: MaximumEntropy Co-authored-by: MaximumEntropy Signed-off-by: SeanNaren * fix label models restoring issue from weighted cross entropy (#4968) (#4975) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: SeanNaren * Add simple pre-commit file (#4983) * Add simple pre-commit file Signed-off-by: SeanNaren * Exclude docs folder Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: SeanNaren * Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" This reverts commit 053bd5ba579537a5f311b431871c21f3381b43eb. Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: SeanNaren Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: SeanNaren * Import pycuda.autoprimaryctx or pycuda.autoinit to init pycuda execution environment (#4951) Signed-off-by: Jin Li Signed-off-by: Jin Li Co-authored-by: Somshubra Majumdar Signed-off-by: SeanNaren * Adding speaker embedding conditioning in fastpitch (#4986) Signed-off-by: subhankar-ghosh Signed-off-by: subhankar-ghosh Signed-off-by: SeanNaren * Fix ASR issues (#4984) (#4991) * Fix ASR issues Signed-off-by: smajumdar * Revert fix Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: SeanNaren * Fix current tests Signed-off-by: SeanNaren * More test coverage Signed-off-by: SeanNaren * Address reviews Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address review Signed-off-by: SeanNaren * Drop bf16 test Signed-off-by: SeanNaren * Address review Signed-off-by: SeanNaren * remove print Signed-off-by: SeanNaren * Add bf16 Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: smajumdar Signed-off-by: nithinraok Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Micha Livne Signed-off-by: shanmugamr1992 Signed-off-by: MaximumEntropy Signed-off-by: Jin Li Signed-off-by: subhankar-ghosh Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Somshubra Majumdar Co-authored-by: Nithin Rao Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Micha Livne Co-authored-by: shanmugamr1992 <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: liji-nv <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Subhankar Ghosh Signed-off-by: Hainan Xu * Fix BF16 test (#5162) Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Hainan Xu * Fix errors in speaker diarization nemo docs (#5153) * fix docs and docstrings for MSDD Signed-off-by: Taejin Park * fix nemo docs errors Signed-off-by: Taejin Park * reflected review comments Signed-off-by: Taejin Park Signed-off-by: Taejin Park Signed-off-by: Hainan Xu * Add interleaved pipeline schedule to GPT (#5025) * add virtual pipeline 
size to config Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * update for list of modules Signed-off-by: ericharper * add virtual to init Signed-off-by: ericharper * update first last stage embedding all reduce Signed-off-by: ericharper * update sequence parallel all reduce for virtual models Signed-off-by: ericharper * runs but we get an error Signed-off-by: ericharper * set virtual rank 0 after looping Signed-off-by: ericharper * account for virtual when determining first and last pipeline stages Signed-off-by: ericharper * checkpointing for virtual models in progress Signed-off-by: ericharper * add checkpoint hooks Signed-off-by: ericharper * working on validation when resuming Signed-off-by: ericharper * skip sanity val steps by default in config Signed-off-by: ericharper * remove comment Signed-off-by: ericharper * log number of params Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * style Signed-off-by: ericharper * check if self.model is a list Signed-off-by: ericharper * make virtual pipeline default size None on init Signed-off-by: ericharper * make virtual pipeline default to None in config Signed-off-by: ericharper * remove ensure_divisibility call Signed-off-by: ericharper * fix lgtm alerts Signed-off-by: ericharper * remove num_sanity_val_steps from config Signed-off-by: ericharper * default virtual pipeline size to none Signed-off-by: ericharper * check for list Signed-off-by: ericharper * update assert to make sure we are only doing virtual for gpt Signed-off-by: ericharper * revert change to get_params_for_weight_decay Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * init var Signed-off-by: ericharper * add import guard for set virtual model parallel world size Signed-off-by: ericharper * use import guard Signed-off-by: ericharper * update calls to fake init in eval scripts Signed-off-by: ericharper * add _get_fwd_bwd_function Signed-off-by: ericharper * log all total model parameters Signed-off-by: ericharper * remove unused import Signed-off-by: ericharper Signed-off-by: ericharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * reduced to 14 inactive days to be stale for PRs. (#5165) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * refactor TTS documentation organization and add new contents. (#5137) * refactor TTS documentation organization and add new contents. * fix asr api bug. * fix broken links. * fix unexpected indentation errors. * fixed unexpected indentation. * fixed broken paper reference. * fixed cross-reference and typos. * fixed toctree errors. * revert to 'Augmentors' * reordered TTS tutorial list in starthere. * ordered api classes alphabetically for each Section. * fixed underscore typo for fastpitch checkpoint. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * upcase 'Tuning' Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed typo for RAD-TTS Aligner Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * reorder aligner section after mel-gen and vocoders in models.rst. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * clarify Mixer-TTS-X and reorder model descriptions alphabetically. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed some typos and formats. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * removed old megatron.rst. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed block quote ends without a blank line warnings. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * remove duplicate reference; fixed missing key nlp-megatron-shoeybi2019megatron Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Revert "removed old megatron.rst." This reverts commit c5ea1dc3f23272eecfe8040e3abfa54fa122cf73. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * removed Russian, a hyphen, and add a note about G2P in tts/… * Remove pyyaml (#7052) Signed-off-by: smajumdar * Fix typo and branch in tutorial (#7048) Signed-off-by: Vladimir Bataev * Refined export_config (#7053) * Refined export_config Signed-off-by: Boris Fomitchev * Rolling back hierarchy change Signed-off-by: Boris Fomitchev --------- Signed-off-by: Boris Fomitchev * fix pos id - hf update (#7075) * fix pos id - hf update Signed-off-by: Evelina * add missing import Signed-off-by: Evelina * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Evelina Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix documentation for Numba (#7065) * Fix documentation for Numba Signed-off-by: smajumdar * Update force float32 flag dynamically Signed-off-by: smajumdar * Update force float32 flag dynamically Signed-off-by: smajumdar * Fix nemo version Signed-off-by: smajumdar --------- Signed-off-by: smajumdar Co-authored-by: Eric Harper * small Bugfix (#7079) * fix branch Signed-off-by: fayejf * fix typo Signed-off-by: fayejf * fix link Signed-off-by: fayejf --------- Signed-off-by: fayejf * Fix caching bug in causal convolutions for cache-aware ASR models (#7034) * Adding docs and models for multiple lookahead cache-aware ASR (#7067) * added docs on multiple look-ahead. Signed-off-by: vnoroozi * added docs on multiple look-ahead. Signed-off-by: vnoroozi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added models. Signed-off-by: vnoroozi * added models. Signed-off-by: vnoroozi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added models. Signed-off-by: vnoroozi * added models. 
Signed-off-by: vnoroozi --------- Signed-off-by: vnoroozi Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix syntax error introduced in PR-7079 (#7102) * fix syntax error introduced in PR-7079 Signed-off-by: Alexandra Antonova * fixes for pr review Signed-off-by: Alexandra Antonova --------- Signed-off-by: Alexandra Antonova * fix links for TN (#7117) Signed-off-by: Evelina * Add updated fc ctc and rnnt xxl models (#7128) * add updated fc xxl ctc and rnnt models Signed-off-by: Nithin Rao Koluguri * add to docs Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * update branch (#7135) Signed-off-by: ericharper * Fixed main and merging this to r1.20 (#7127) * Fixed main and merging this to r1.20 Signed-off-by: Taejin Park * Update vad_utils.py Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> --------- Signed-off-by: Taejin Park Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * fix default attention size (#7141) Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * Update evaluator.py (#7151) reflecting changes in https://github.com/NVIDIA/NeMo/pull/7150 Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Eagerly accumulate embedding grads into fp32 buffer (#6958) Signed-off-by: Tim Moon * Modular SpeechLLM implementation for Sept. 2023 submission (SALM) (#7634) * add initial impl of ModularizedSpeechGPTModel and integration test * fix typo in the test name (#1) approve the nit change * clean a initial version of example config; make sure it works by test (#2) approve as no need to review * add the test for training_step and fix the code correspondingly (test passed now) (#3) * add test for validation_step (#4) * mv audio and text emb concat to prepare_llm_input so as to write test to guard the llm input * Merge heh and zhehuai's initial version of frozen am+llm (#5) * Merge heh and zhehuai's initial version of frozen am+llm The previous differences are summarized here: https://docs.google.com/document/d/1zNI4hC6vJtUfcHbrUSPaMuYWRBQdN_36H0P2NiBiuPY/edit This PR includes 1. Finish merging the model, dataset, and config code 2. Previous tests are still enabled and passed (prepare_llm_input, training_step, validation_step) 3. 
the example training script with LS960 has been run to make sure the training pipeline works The major remaining works are listed here https://docs.google.com/document/d/1o0AM7v4gcTQkPZjE0Vl9TTX4vYnGTrbXEFGWh0UhGlk/edit#bookmark=id.pzvdadt5oxyw --------- Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * fix a nit init bug broke test (#6) Signed-off-by: zhehuaichen * Clean up implementation for SALM paper and sync to NEMO v1.20.0 (#18) * wip Signed-off-by: zhehuaichen * fix data Signed-off-by: zhehuaichen * fix consumed_samples Signed-off-by: zhehuaichen * fix the training restart problem by storing adapter+perception model and init them from the ckpt Signed-off-by: zhehuaichen * refix state dict Signed-off-by: zhehuaichen * support wer and inf Signed-off-by: zhehuaichen * nan guard Signed-off-by: zhehuaichen * reimpl inf and bug fix Signed-off-by: zhehuaichen * multi loader Signed-off-by: zhehuaichen * unfreeze lm Signed-off-by: zhehuaichen * flag for load am Signed-off-by: zhehuaichen * tokenizer Signed-off-by: zhehuaichen * overwrite vocab size Signed-off-by: zhehuaichen * support bpe dropout Signed-off-by: zhehuaichen * add tarred datasets Signed-off-by: stevehuang52 * fix sample_alpha Signed-off-by: stevehuang52 * fix bpe dropout bugs in the mismatched context in tokenization Signed-off-by: zhehuaichen * add bleu metric Signed-off-by: stevehuang52 * update metrics Signed-off-by: stevehuang52 * support inference and fix a bug in wer calculation Signed-off-by: zhehuaichen * fix bucketing dataset Signed-off-by: stevehuang52 * fix bleu implementation Signed-off-by: zhehuaichen * support question set file per dataset/data loader in preparation for multitask understanding; also fix bleu implementation Signed-off-by: zhehuaichen * support simple random context for word boosting Signed-off-by: zhehuaichen * use sacrebleu.corpus_bleu to be consistent with the rest Signed-off-by: zhehuaichen * make audio_file optional in the data loader Signed-off-by: zhehuaichen * add a tool to materialize mt and text data Signed-off-by: zhehuaichen * compatible with tar dataset Signed-off-by: zhehuaichen * temp fix for metric and speed up materialization Signed-off-by: zhehuaichen * make num of context configurable Signed-off-by: zhehuaichen * val_check_interval fix; make manifest dumping consistent with speech models Signed-off-by: zhehuaichen * random_context_positive_ratio configurable to control precision Signed-off-by: zhehuaichen * bug fix: freeze_llm flag is not passed to the model cfg Signed-off-by: zhehuaichen * overwrite tensor_model_parallel_size Signed-off-by: zhehuaichen * support both stt and ssl models for loading audio encoder Signed-off-by: zhehuaichen * fix the inference config so as to use sampling; allow inference config update in training Signed-off-by: zhehuaichen * refactorize and clean up code for preprocessing collections, dataset interface, model inference and rename some classes to be consistent with salm paper. 
also make sure test passed Signed-off-by: zhehuaichen * Undo changes in megatron_gpt_peft_models.py and move them to speechllm_models.py; make sure the correctness by test_speechllm_models.py::TestModularizedAudioGPTModel::test_predict_step Signed-off-by: zhehuaichen * update default inference config and test golden value accordingly Signed-off-by: zhehuaichen * integration test and minor fix Signed-off-by: zhehuaichen * nit bug fix on manifest_filepath introduced by code cleanup Signed-off-by: zhehuaichen * update workspace/ files; consider moving to examples later Signed-off-by: zhehuaichen * further remove unnecessary stuff in the inference implementation Signed-off-by: zhehuaichen * revert the update in default end_string to be compatible with legacy models Signed-off-by: zhehuaichen --------- Signed-off-by: zhehuaichen Signed-off-by: stevehuang52 Co-authored-by: stevehuang52 Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * rename 'ModularizedAudioGPTModel' to 'ModularAudioGPTLoRAModel'; move speechllm stuff under nemo/collections/multimodal/speechllm Signed-off-by: zhehuaichen * update copyright; remove workspace/scripts and workspace/tools folders since the main branch has LLaMA support Signed-off-by: zhehuaichen --------- Signed-off-by: zhehuaichen Signed-off-by: stevehuang52 Co-authored-by: Zhehuai Chen Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: stevehuang52 * Add few-shot in-context learning and MLP modality adapter (#7705) * add few-shot in-context learning and MLP modality adapter Signed-off-by: stevehuang52 * add init and copyright Signed-off-by: stevehuang52 * update and refactor fsl Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 --------- Signed-off-by: stevehuang52 * update for mlp modality adapter (#7715) * add few-shot in-context learning and MLP modality adapter Signed-off-by: stevehuang52 * add init and copyright Signed-off-by: stevehuang52 * update and refactor fsl Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * update for mlp modality adapter Signed-off-by: stevehuang52 --------- Signed-off-by: stevehuang52 * fix speechllm few-shot inference (#7732) * add few-shot in-context learning and MLP modality adapter Signed-off-by: stevehuang52 * add init and copyright Signed-off-by: stevehuang52 * update and refactor fsl Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * update for mlp modality adapter Signed-off-by: stevehuang52 * fix few-shot inference Signed-off-by: stevehuang52 --------- Signed-off-by: stevehuang52 * Add training support for multiple audios in a sample (#7796) * add few-shot in-context learning and MLP modality adapter Signed-off-by: stevehuang52 * add init and copyright Signed-off-by: stevehuang52 * update and refactor fsl Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * update for mlp modality adapter Signed-off-by: stevehuang52 * fix few-shot inference Signed-off-by: stevehuang52 * fix to allow num_workers > 0 Signed-off-by: stevehuang52 * add training with multiple audios Signed-off-by: stevehuang52 --------- Signed-off-by: stevehuang52 * Create README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update Signed-off-by: stevehuang52 * 
rename Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong Co-authored-by: Eric Harper * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui * make script work for llama2 models Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui * fix script for llama2 model Signed-off-by: Chen Cui * remove commented code Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Fix bug in ConditionalInput: cat along 
the feature dim, not the batch dim (#7785) Signed-off-by: anferico * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Signed-off-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu * typo Signed-off-by: arendu --------- Signed-off-by: arendu * add training with multiple audios Signed-off-by: stevehuang52 * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Abhishree * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree * Snake act (#7736) Signed-off-by: Abhishree * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao Co-authored-by: Sandeep Subramanian Signed-off-by: Abhishree --------- Signed-off-by: Abhishree Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Nithin Rao Co-authored-by: Xin Yao Co-authored-by: Sandeep Subramanian * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico * Update configuration files Signed-off-by: anferico * add informative comment in config files Signed-off-by: anferico * sample random index for reference audio selection Signed-off-by: anferico * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 * Configure MCore logger (#7781) Signed-off-by: Mikołaj Błaż * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan --------- Signed-off-by: Ryan * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar Signed-off-by: eharper Co-authored-by: Eric Harper Co-authored-by: Abhinav Khattar * fix typo Signed-off-by: stevehuang52 * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Chen Cui * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina * fix typo Signed-off-by: Evelina * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina --------- Signed-off-by: Evelina Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper * update Signed-off-by: eharper * add cd Signed-off-by: eharper --------- Signed-off-by: eharper * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * add extrac hf text and update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * move dataset dependency to common Signed-off-by: stevehuang52 * add docstring Signed-off-by: stevehuang52 * Add to Dics Signed-off-by: Nithin Rao Koluguri * add ci test Signed-off-by: Nithin Rao Koluguri * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri * reduce max steps Signed-off-by: Nithin Rao Koluguri * jenkins test Signed-off-by: Nithin Rao Koluguri * add bs=2 Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: stevehuang52 Signed-off-by: Nithin Rao Koluguri Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. 
(#7352) Signed-off-by: Micha Livne * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper * set vp size to none if it is 1 Signed-off-by: ericharper * add TransformerConfig Signed-off-by: ericharper * start updating to TransformerConfig Signed-off-by: ericharper * add todo Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper * revert Signed-off-by: ericharper * remove import Signed-off-by: ericharper * small clean up Signed-off-by: ericharper * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper * add config obj to flash attention tests Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to test Signed-off-by: ericharper * get hidden_size from config Signed-off-by: ericharper * add try except Signed-off-by: ericharper * use default Signed-off-by: ericharper * update config with hidden size Signed-off-by: ericharper * remove arg Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper * revert import Signed-off-by: ericharper * build transformer config Signed-off-by: ericharper * add model to provider func Signed-off-by: ericharper * update forward and float16 wrapper Signed-off-by: ericharper * instantiate model parallel config after init model parallel Signed-off-by: ericharper * set virtual rank Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan * Verify mcore is enabled when using GQA Signed-off-by: jasonwan --------- Signed-off-by: jasonwan * revert Signed-off-by: ericharper * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu Signed-off-by: jasonwan * fix config Signed-off-by: jasonwan * add inference param. 
update TP/PP script to support mcore gpt Signed-off-by: jasonwan * p-tuning Signed-off-by: jasonwan * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan * ckpt conversion use relative path for config Signed-off-by: jasonwan * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * update args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper * set vp size to none if it is 1 Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper * start updating to TransformerConfig Signed-off-by: ericharper * add todo Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * remove imports Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper * small clean up Signed-off-by: ericharper * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper * update module args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper * remove args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to test Signed-off-by: ericharper * get hidden_size from config Signed-off-by: ericharper * add try except Signed-off-by: ericharper * use default Signed-off-by: ericharper * update config with hidden size Signed-off-by: ericharper * remove arg Signed-off-by: ericharper * comment out jenkins test Signed-off-by: ericharper * revert import Signed-off-by: ericharper * remove optimizer_idx Signed-off-by: eharper * prefetch num microbatches Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * update args Signed-off-by: ericharper * fix for p-tuning sequence parallel Signed-off-by: jasonwan * support SFT/distOpt mcore (#7207) * add inference param. 
update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * change layer names for SFT Signed-off-by: Hongbin Liu * fix bug in SFT Signed-off-by: Hongbin Liu --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Co-authored-by: Hongbin Liu Co-authored-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * remove imports Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * build transformer config Signed-off-by: ericharper * add model to provider func Signed-off-by: ericharper * update forward and float16 wrapper Signed-off-by: ericharper * instantiate model parallel config after init model parallel Signed-off-by: ericharper * set virtual rank Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan * Verify mcore is enabled when using GQA Signed-off-by: jasonwan --------- Signed-off-by: jasonwan * revert Signed-off-by: ericharper * remove import Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan * update for dist adam Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan * ptl2.0 patch for llama config Signed-off-by: jasonwan * add plugins to trainer in scripts Signed-off-by: jasonwan * fix activation checkpointing mcore Signed-off-by: jasonwan * fix variable names Signed-off-by: jasonwan * overwrite normalization type for mcore/te Signed-off-by: jasonwan * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan * small fix for lora and ptuning Signed-off-by: jasonwan * support layerwise peft Signed-off-by: jasonwan * support multiple target layers Signed-off-by: jasonwan * support lora GQA Signed-off-by: jasonwan * support amp O2 Signed-off-by: jasonwan * revert & more O2 fix Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan * support lora weight tying Signed-off-by: jasonwan * add copyright header Signed-off-by: jasonwan * rollback ptuning name change. full string match mcore target Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove comment Signed-off-by: jasonwan --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * clean up config Signed-off-by: jasonwan * Sync llama branch (#7297) * add inference param. 
update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * change layer names for SFT Signed-off-by: Hongbin Liu * fix bug in SFT Signed-off-by: Hongbin Liu * fix bug: cpu initialization is not really enabled Signed-off-by: Hongbin Liu * add use_cpu_initialization to TransformerConfig Signed-off-by: Hongbin Liu * fix bug: wrong config path when using relative cjpt path Signed-off-by: Hongbin Liu * revert mcore config change Signed-off-by: Jason Wang --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Signed-off-by: Jason Wang Co-authored-by: Hongbin Liu * clean up ckpt conversion script Signed-off-by: jasonwan * rollback git merge errors Signed-off-by: jasonwan * update mcore, add check for mcore+te Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * formatting Signed-off-by: jasonwan * make sft test dataset optional. fix indentation in config Signed-off-by: jasonwan * one more fix for optional test set Signed-off-by: jasonwan * support merging lora weights in mcore Signed-off-by: jasonwan * update mcore for cpu init Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion for code llama Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add seq_len_interpolation_factor support for long-context llama ckpts (#7312) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * add seq_len_interpolation_factor Signed-off-by: Hongbin Liu --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Co-authored-by: jasonwan Co-authored-by: Hongbin Liu * fix old ptuning model, update mcore to support seq_len_interpolation_factor Signed-off-by: jasonwan * support fused layernorm linear, fix ptuning O2 Signed-off-by: jasonwan * drop loss mask for mcore for now Signed-off-by: jasonwan * disable dist ckpt in peft Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix loading non dist ckpt Signed-off-by: jasonwan * add ckpt conversion to CI Signed-off-by: jasonwan * update CI Signed-off-by: jasonwan * mcore_mixin docstring Signed-off-by: jasonwan * minor change in mcore peft error message Signed-off-by: jasonwan * fix amp o2 in lora weight tying Signed-off-by: jasonwan * correct mcore fp8 config Signed-off-by: jasonwan * add TE installation Signed-off-by: jasonwan * support mcore adapter tuning Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out new CI test. rollback docker image Signed-off-by: jasonwan * ignore FA tests, try new CI on 23.08 Signed-off-by: jasonwan * mark new CI as L2, put to beginning to test Signed-off-by: jasonwan * minor fix for prompt learning Signed-off-by: jasonwan * rollback to 23.06. 
comment out CI Signed-off-by: jasonwan * minor fix ckpt conversion script Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor rollback gpt model change Signed-off-by: jasonwan --------- Signed-off-by: ericharper Signed-off-by: jasonwan Signed-off-by: eharper Signed-off-by: Hongbin Liu Signed-off-by: Jason Wang Co-authored-by: ericharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: eharper Co-authored-by: Hongbin Liu Co-authored-by: Kelvin Liu * Hiddens modules documentation (#7303) * 1. Changed hiddens transformations module from `transformations` to `hiddens`. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Finished doc. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne --------- Signed-off-by: Micha Livne Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Support for flash attention 2.0 (#7063) * Add flash attn 2 Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add FA2 feature Signed-off-by: Cheng-Ping Hsieh * Remove debugging Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: MaximumEntropy Signed-off-by: Cheng-Ping Hsieh Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: Cheng-Ping Hsieh * lora merge fix for O2 names (#7325) * wip Signed-off-by: arendu * adjust key names based on O2 Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * minor Signed-off-by: arendu --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * multiple fields can form a context (#7147) * list of context fields and flexible prompt template Signed-off-by: arendu * list of fields for context Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * Fix bug Signed-off-by: Cheng-Ping Hsieh * Add multiple truncation fields and middle truncation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Compatible to old ckpt Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix tokenize detokenize issue Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * Remove detokenization, add truncation augmentation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Resolve comments Signed-off-by: Cheng-Ping Hsieh * Remove unused import Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert eos Signed-off-by: Cheng-Ping Hsieh * Add tokenizer space_sensitive attribute Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix error Signed-off-by: Cheng-Ping Hsieh * Fix erorr and use re Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * Change assert logic Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Follow adi suggestion Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove merge function Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add example and comment Signed-off-by: Cheng-Ping Hsieh * Remove context_key and add comment Signed-off-by: Cheng-Ping Hsieh * Remove random truncation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix template none Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: arendu Signed-off-by: Cheng-Ping Hsieh Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Cheng-Ping Hsieh Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> * Load buffers in checkpoint (#7357) Signed-off-by: Jason Wang * Add migration guide for lightning 2.0 upgrade (#7360) * Add lightning 2.0 migration guide in NeMo docs Signed-off-by: Abhishree * Add remaining guide for lightning 2.0 upgrade Signed-off-by: Abhishree * Remove line spill over and continue in next line Signed-off-by: Abhishree * Add missing dataloader_iter in the guide Signed-off-by: Abhishree * Fix minor typo Signed-off-by: Abhishree --------- Signed-off-by: Abhishree * adding bias_dropout_add_fusion option for BERT (#7332) Signed-off-by: Alexander Jipa Co-authored-by: Alexander Jipa * [TTS] Change audio codec token type to TokenIndex (#7356) Signed-off-by: Ryan * enable selective unfreeze (#7326) * wip Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid PTL method conflicts Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix typos (#7361) * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> --------- Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * pin numba=0.57.1 to fix reinstall.sh error (#7366) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update new conversion script for converting safetensors. * Upgrade pytorch container to 23.08 (#7353) * upgrade pytorch container Signed-off-by: eharper * use mcore Signed-off-by: eharper * revert test change Signed-off-by: eharper * pleasefixme Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for ampere Signed-off-by: eharper * comment test temporarily Signed-off-by: eharper --------- Signed-off-by: eharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * enable fp32 optimizer for output_layer in mcore (#7355) Signed-off-by: lhb8125 * revert comment (#7368) Signed-off-by: eharper * Update to core 23.08 branch ToT (#7371) Signed-off-by: Abhinav Khattar * upper bounding ptl (#7370) Signed-off-by: eharper * fix pipeline parallel inference (#7367) * fix pp inference Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix for peft tied weights (#7372) Signed-off-by: arendu * fixed trainer.strategy=auto from None. 
(#7369) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add O2 option in gpt eval (#7358) * add O2 option in eval Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add doc for O2 config Signed-off-by: jasonwan * add to llama inference config Signed-off-by: jasonwan --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Move model precision copy (#7336) * move cfg precision set to megatron base model Signed-off-by: Maanu Grover * remove copy from other models Signed-off-by: Maanu Grover * modify attribute not arg Signed-off-by: Maanu Grover * fix gpt model test for ptl 2.0 Signed-off-by: Maanu Grover * rename function and add docstring Signed-off-by: Maanu Grover * replace precision to dtype conditionals with func call Signed-off-by: Maanu Grover * unnecessary function and cfg reset Signed-off-by: Maanu Grover * set default value Signed-off-by: Maanu Grover * fix precision lookup in a few more places Signed-off-by: Maanu Grover * rename mapping function Signed-off-by: Maanu Grover * ununsed import Signed-off-by: Maanu Grover * save torch datatype to model Signed-off-by: Maanu Grover * set weights precision wrt amp o2 Signed-off-by: Maanu Grover * Revert "set weights precision wrt amp o2" This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c. Signed-off-by: Maanu Grover * revert half precision at inference attempt Signed-off-by: Maanu Grover * move autocast dtype to base model Signed-off-by: Maanu Grover * move params dtype to base model, enable fp16 O2 inf Signed-off-by: Maanu Grover * unused imports Signed-off-by: Maanu Grover --------- Signed-off-by: Maanu Grover * Fix PEFT checkpoint loading (#7388) * Fix PEFT checkpoint loading Signed-off-by: Jason Wang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Use distributed optimizer support for multiple dtypes (#7359) * Update distopt wrapper with multiple dtype support Remove manual handling of separate FP32 optimizer. 
Signed-off-by: Tim Moon * Use distopt support for contiguous buffers with multiple dtypes Signed-off-by: Tim Moon * Fix typo Signed-off-by: Tim Moon * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Separate distopt buckets for first GPT layer and non-overlapped params Signed-off-by: Tim Moon * Add distopt logic for int dtypes Signed-off-by: Tim Moon * Update Apex commit Signed-off-by: Tim Moon * Remove unused variables Signed-off-by: Tim Moon * Update Apex commit in README and Jenkensfile Signed-off-by: Tim Moon * Debug Dockerfile and Jenkinsfile Signed-off-by: Tim Moon --------- Signed-off-by: Tim Moon Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * minor fix for llama ckpt conversion script (#7387) * minor fix for llama ckpt conversion script Signed-off-by: Jason Wang * Update Jenkinsfile Signed-off-by: Jason Wang * remove fast_swiglu configuration Signed-off-by: Jason Wang --------- Signed-off-by: Jason Wang Co-authored-by: Eric Harper * Fix wrong calling of librosa.get_duration() in notebook (#7376) Signed-off-by: Robin Dong Co-authored-by: Somshubra Majumdar * [PATCH] PEFT import mcore (#7393) * [PATCH] PEFT import mcore Signed-off-by: Jason Wang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Added a callback for logging initial data (#7384) Signed-off-by: Ante Jukić * Update Core Commit (#7402) * Update Core Commit Signed-off-by: Abhinav Khattar * update commit Signed-off-by: Abhinav Khattar --------- Signed-off-by: Abhinav Khattar * Use cfg attribute in bert (#7394) * use cfg attribute instead of arg Signed-off-by: Maanu Grover * use torch_dtype in place of cfg.precision Signed-off-by: Maanu Grover * move precision copy before super constructor Signed-off-by: Maanu Grover * use trainer arg Signed-off-by: Maanu Grover --------- Signed-off-by: Maanu Grover * Add support for bias conversion in Swiglu models (#7386) * Add support for bias conversion in Swiglu models Signed-off-by: smajumdar * Add support for auto extracting tokenizer model Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support for auto extracting tokenizer model Signed-off-by: smajumdar * Fix issue with missing tokenizer Signed-off-by: smajumdar * Refactor Signed-off-by: smajumdar * Refactor Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update save_to and restore_from for dist checkpointing (#7343) * add dist ckpt to save to, in progress Signed-off-by: eharper * move dist ckpt Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * clean up Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update restore from, need to figure out how to initialize distributed Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * launch distrib if needed when restoring dist ckpt Signed-off-by: eharper * when using 
mcore we can change tp pp on the fly Signed-off-by: eharper * add load_from_checkpoint support for dist ckpt Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update llama convert script to save dist .nemo Signed-off-by: eharper * fix load dist ckpt Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * setup TE TP groups if needed Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * setup te tp groups if needed Signed-off-by: eharper * remove import Signed-off-by: eharper --------- Signed-off-by: eharper Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: jasonwan * fix forward for with mcore=false (#7403) Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang * Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374) * Add CustomProgressBar class to exp_manager and trainer callbacks Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the progress bar to reflect total microbatch cnt Signed-off-by: Abhishree * Modify CustomProgressBar class 1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch 2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder Signed-off-by: Abhishree * Add CustomProgressBar callback to tuning files Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Set Activation Checkpointing Defaults (#7404) * Set Activation Checkpointing Defaults Signed-off-by: Abhinav Khattar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for None Signed-off-by: Abhinav Khattar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhinav Khattar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * make loss mask default to false (#7407) Signed-off-by: eharper * Add dummy userbuffer config files (#7408) Signed-off-by: Sangkug Lym * add missing ubconf files (#7412) Signed-off-by: Abhinav Khattar * New tutorial on Speech Data Explorer (#7405) * Added Google Colab based tutorial on Speech Data Explorer Signed-off-by: George Zelenfroynd * Update ptl training ckpt conversion script to work with dist ckpt (#7416) * update ptl convert script Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * don't break legacy Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: eharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Allow disabling sanity checking when num_sanity_val_steps=0 (#7413) * Allow disabling sanity checking when num_sanity_val_steps=0 Signed-off-by: Abhishree * Update num_sanity_val_steps to be a multiple of num_microbatches Signed-off-by: Abhishree Thittenamane 
<47577437+athitten@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add comprehensive error messages (#7261) Signed-off-by: Anton Peganov * check NEMO_PATH (#7418) Signed-off-by: Nikolay Karpov * layer selection for ia3 (#7417) * layer selection for ia3 Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix missing pip package 'einops' (#7397) Signed-off-by: Robin Dong * Fix failure of pyaudio in Google Colab (#7396) Signed-off-by: Robin Dong * Update README.md: output_path --> output_manifest_filepath (#7442) Signed-off-by: Samuele Cornell * Updating FlashAttention API to match FlashAttentionV2 * Multiple fixes for mm * Fix CI inductor issue and update to torch compile * Remove suppress error * Fix when conversion config uses fp16 and it complains about precision plugin * Fixing FAv2 API usage * Initial release of content filtering model * Added synthetic dataloader for precached and online mode * Mingyuanm/dreambooth opt * Add llama2 support in neva training * Fix sampler length * Fix all precision issues in nemo multimodal * Add rope dynamic linear scaling (#7437) * Add dynamic linear scaling Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang * Fix None dataloader issue in PTL2.0 (#7455) * Fix None dataloader issue in PTL2.0 Signed-off-by: KunalDhawan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: KunalDhawan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [ASR] Confidence measure -> method renames (#7434) * measure -> method Signed-off-by: Aleksandr Laptev * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aleksandr Laptev Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add steps for document of getting dataset 'SF Bilingual Speech' (#7378) * Add steps for document of getting dataset 'SF Bilingual Speech' Signed-off-by: Robin Dong * Update datasets.rst added a link from a tutorial 
demonstrating detailed data prep steps. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Robin Dong Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * RNN-T confidence and alignment bugfix (#7381) * new frame_confidence and alignments lists are now always created after the while loop Signed-off-by: Aleksandr Laptev * tests added Signed-off-by: Aleksandr Laptev --------- Signed-off-by: Aleksandr Laptev * Fix resume from checkpoint in exp_manager (#7424) (#7426) Signed-off-by: Abhishree Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Eric Harper * Fix checking of cuda/cpu device for inputs of Decoder (#7444) * Fix checking of cuda/cpu device for inputs of Decoder Signed-off-by: Robin Dong * Update tacotron2.py Signed-off-by: Jason --------- Signed-off-by: Robin Dong Signed-off-by: Jason Co-authored-by: Jason * Fix failure of ljspeech's get_data.py (#7430) * Fix failure of ljspeech's get_data.py Signed-off-by: Robin Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Robin Dong Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Fix audio codec type checks (#7373) * [TTS] Fix audio codec type checks Signed-off-by: Ryan * [TTS] Fix audio codec tests Signed-off-by: Ryan --------- Signed-off-by: Ryan * [TTS] Add dataset to path of logged artifacts (#7462) * [TTS] Add dataset to path of logged artifacts Signed-off-by: Ryan * [TTS] Revert axis name back to Audio Frames Signed-off-by: Ryan --------- Signed-off-by: Ryan * Fix sft dataset truncation (#7464) * Add fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330) * striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling Signed-off-by: mburchi * transpose conv1d inputs Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, s… * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update speechllm (#8486) * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong Co-authored-by: Eric Harper * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui * add one more sanity check to make sure there is no unexpected keys in state dict 
Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui * make script work for llama2 models Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui * fix script for llama2 model Signed-off-by: Chen Cui * remove commented code Signed-off-by: Chen Cui * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Signed-off-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu * typo Signed-off-by: arendu --------- Signed-off-by: arendu * add training with multiple audios Signed-off-by: stevehuang52 * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Abhishree * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree * Snake act (#7736) Signed-off-by: Abhishree * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao Co-authored-by: Sandeep Subramanian Signed-off-by: Abhishree --------- Signed-off-by: Abhishree Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> 
Co-authored-by: Nithin Rao Co-authored-by: Xin Yao Co-authored-by: Sandeep Subramanian * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico * Update configuration files Signed-off-by: anferico * add informative comment in config files Signed-off-by: anferico * sample random index for reference audio selection Signed-off-by: anferico * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 * Configure MCore logger (#7781) Signed-off-by: Mikołaj Błaż * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan --------- Signed-off-by: Ryan * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar Signed-off-by: eharper Co-authored-by: Eric Harper Co-authored-by: Abhinav Khattar * fix typo Signed-off-by: stevehuang52 * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Chen Cui * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina * fix typo Signed-off-by: Evelina * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina --------- Signed-off-by: Evelina Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper * update Signed-off-by: eharper * add cd Signed-off-by: eharper --------- Signed-off-by: eharper * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * add extrac hf text and update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * move dataset dependency to common Signed-off-by: stevehuang52 * add docstring Signed-off-by: stevehuang52 * Add to Dics Signed-off-by: Nithin Rao Koluguri * add ci test Signed-off-by: Nithin Rao Koluguri * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri * reduce max steps Signed-off-by: Nithin Rao Koluguri * jenkins test Signed-off-by: Nithin Rao Koluguri * add bs=2 Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: stevehuang52 Signed-off-by: Nithin Rao Koluguri Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. 
(#7352) Signed-off-by: Micha Livne * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper * set vp size to none if it is 1 Signed-off-by: ericharper * add TransformerConfig Signed-off-by: ericharper * start updating to TransformerConfig Signed-off-by: ericharper * add todo Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper * revert Signed-off-by: ericharper * remove import Signed-off-by: ericharper * small clean up Signed-off-by: ericharper * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper * add config obj to flash attention tests Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to test Signed-off-by: ericharper * get hidden_size from config Signed-off-by: ericharper * add try except Signed-off-by: ericharper * use default Signed-off-by: ericharper * update config with hidden size Signed-off-by: ericharper * remove arg Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper * revert import Signed-off-by: ericharper * build transformer config Signed-off-by: ericharper * add model to provider func Signed-off-by: ericharper * update forward and float16 wrapper Signed-off-by: ericharper * instantiate model parallel config after init model parallel Signed-off-by: ericharper * set virtual rank Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan * Verify mcore is enabled when using GQA Signed-off-by: jasonwan --------- Signed-off-by: jasonwan * revert Signed-off-by: ericharper * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu Signed-off-by: jasonwan * fix config Signed-off-by: jasonwan * add inference param. 
update TP/PP script to support mcore gpt Signed-off-by: jasonwan * p-tuning Signed-off-by: jasonwan * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan * ckpt conversion use relative path for config Signed-off-by: jasonwan * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * update args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper * set vp size to none if it is 1 Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper * start updating to TransformerConfig Signed-off-by: ericharper * add todo Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * remove imports Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper * small clean up Signed-off-by: ericharper * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper * update module args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper * remove args Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * update args Signed-off-by: ericharper * add config to test Signed-off-by: ericharper * get hidden_size from config Signed-off-by: ericharper * add try except Signed-off-by: ericharper * use default Signed-off-by: ericharper * update config with hidden size Signed-off-by: ericharper * remove arg Signed-off-by: ericharper * comment out jenkins test Signed-off-by: ericharper * revert import Signed-off-by: ericharper * remove optimizer_idx Signed-off-by: eharper * prefetch num microbatches Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper * set model parallel config Signed-off-by: ericharper * use model parallel config object Signed-off-by: ericharper * update args Signed-off-by: ericharper * fix for p-tuning sequence parallel Signed-off-by: jasonwan * support SFT/distOpt mcore (#7207) * add inference param. 
update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * change layer names for SFT Signed-off-by: Hongbin Liu * fix bug in SFT Signed-off-by: Hongbin Liu --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Co-authored-by: Hongbin Liu Co-authored-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper * revert to model parallel config Signed-off-by: ericharper * add hidden_size to model_parallel_config Signed-off-by: ericharper * remove imports Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper * add config to self Signed-off-by: ericharper * build transformer config Signed-off-by: ericharper * add model to provider func Signed-off-by: ericharper * update forward and float16 wrapper Signed-off-by: ericharper * instantiate model parallel config after init model parallel Signed-off-by: ericharper * set virtual rank Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan * Verify mcore is enabled when using GQA Signed-off-by: jasonwan --------- Signed-off-by: jasonwan * revert Signed-off-by: ericharper * remove import Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan * update for dist adam Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan * ptl2.0 patch for llama config Signed-off-by: jasonwan * add plugins to trainer in scripts Signed-off-by: jasonwan * fix activation checkpointing mcore Signed-off-by: jasonwan * fix variable names Signed-off-by: jasonwan * overwrite normalization type for mcore/te Signed-off-by: jasonwan * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan * small fix for lora and ptuning Signed-off-by: jasonwan * support layerwise peft Signed-off-by: jasonwan * support multiple target layers Signed-off-by: jasonwan * support lora GQA Signed-off-by: jasonwan * support amp O2 Signed-off-by: jasonwan * revert & more O2 fix Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan * support lora weight tying Signed-off-by: jasonwan * add copyright header Signed-off-by: jasonwan * rollback ptuning name change. full string match mcore target Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove comment Signed-off-by: jasonwan --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * clean up config Signed-off-by: jasonwan * Sync llama branch (#7297) * add inference param. 
update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * change layer names for SFT Signed-off-by: Hongbin Liu * fix bug in SFT Signed-off-by: Hongbin Liu * fix bug: cpu initialization is not really enabled Signed-off-by: Hongbin Liu * add use_cpu_initialization to TransformerConfig Signed-off-by: Hongbin Liu * fix bug: wrong config path when using relative cjpt path Signed-off-by: Hongbin Liu * revert mcore config change Signed-off-by: Jason Wang --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Signed-off-by: Jason Wang Co-authored-by: Hongbin Liu * clean up ckpt conversion script Signed-off-by: jasonwan * rollback git merge errors Signed-off-by: jasonwan * update mcore, add check for mcore+te Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * formatting Signed-off-by: jasonwan * make sft test dataset optional. fix indentation in config Signed-off-by: jasonwan * one more fix for optional test set Signed-off-by: jasonwan * support merging lora weights in mcore Signed-off-by: jasonwan * update mcore for cpu init Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion for code llama Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add seq_len_interpolation_factor support for long-context llama ckpts (#7312) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan * add seq_len_interpolation_factor Signed-off-by: Hongbin Liu --------- Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Co-authored-by: jasonwan Co-authored-by: Hongbin Liu * fix old ptuning model, update mcore to support seq_len_interpolation_factor Signed-off-by: jasonwan * support fused layernorm linear, fix ptuning O2 Signed-off-by: jasonwan * drop loss mask for mcore for now Signed-off-by: jasonwan * disable dist ckpt in peft Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix loading non dist ckpt Signed-off-by: jasonwan * add ckpt conversion to CI Signed-off-by: jasonwan * update CI Signed-off-by: jasonwan * mcore_mixin docstring Signed-off-by: jasonwan * minor change in mcore peft error message Signed-off-by: jasonwan * fix amp o2 in lora weight tying Signed-off-by: jasonwan * correct mcore fp8 config Signed-off-by: jasonwan * add TE installation Signed-off-by: jasonwan * support mcore adapter tuning Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out new CI test. rollback docker image Signed-off-by: jasonwan * ignore FA tests, try new CI on 23.08 Signed-off-by: jasonwan * mark new CI as L2, put to beginning to test Signed-off-by: jasonwan * minor fix for prompt learning Signed-off-by: jasonwan * rollback to 23.06. 
comment out CI Signed-off-by: jasonwan * minor fix ckpt conversion script Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor rollback gpt model change Signed-off-by: jasonwan --------- Signed-off-by: ericharper Signed-off-by: jasonwan Signed-off-by: eharper Signed-off-by: Hongbin Liu Signed-off-by: Jason Wang Co-authored-by: ericharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: eharper Co-authored-by: Hongbin Liu Co-authored-by: Kelvin Liu * Hiddens modules documentation (#7303) * 1. Changed hiddens transformations module from `transformations` to `hiddens`. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Finished doc. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne --------- Signed-off-by: Micha Livne Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Support for flash attention 2.0 (#7063) * Add flash attn 2 Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add FA2 feature Signed-off-by: Cheng-Ping Hsieh * Remove debugging Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: MaximumEntropy Signed-off-by: Cheng-Ping Hsieh Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: Cheng-Ping Hsieh * lora merge fix for O2 names (#7325) * wip Signed-off-by: arendu * adjust key names based on O2 Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * minor Signed-off-by: arendu --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * multiple fields can form a context (#7147) * list of context fields and flexible prompt template Signed-off-by: arendu * list of fields for context Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * Fix bug Signed-off-by: Cheng-Ping Hsieh * Add multiple truncation fields and middle truncation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Compatible to old ckpt Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix tokenize detokenize issue Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * Remove detokenization, add truncation augmentation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Resolve comments Signed-off-by: Cheng-Ping Hsieh * Remove unused import Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert eos Signed-off-by: Cheng-Ping Hsieh * Add tokenizer space_sensitive attribute Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix error Signed-off-by: Cheng-Ping Hsieh * Fix erorr and use re Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * Change assert logic Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Follow adi suggestion Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove merge function Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add example and comment Signed-off-by: Cheng-Ping Hsieh * Remove context_key and add comment Signed-off-by: Cheng-Ping Hsieh * Remove random truncation Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix template none Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: arendu Signed-off-by: Cheng-Ping Hsieh Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Cheng-Ping Hsieh Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> * Load buffers in checkpoint (#7357) Signed-off-by: Jason Wang * Add migration guide for lightning 2.0 upgrade (#7360) * Add lightning 2.0 migration guide in NeMo docs Signed-off-by: Abhishree * Add remaining guide for lightning 2.0 upgrade Signed-off-by: Abhishree * Remove line spill over and continue in next line Signed-off-by: Abhishree * Add missing dataloader_iter in the guide Signed-off-by: Abhishree * Fix minor typo Signed-off-by: Abhishree --------- Signed-off-by: Abhishree * adding bias_dropout_add_fusion option for BERT (#7332) Signed-off-by: Alexander Jipa Co-authored-by: Alexander Jipa * [TTS] Change audio codec token type to TokenIndex (#7356) Signed-off-by: Ryan * enable selective unfreeze (#7326) * wip Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid PTL method conflicts Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix typos (#7361) * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typos Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * fix typo Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> --------- Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> * pin numba=0.57.1 to fix reinstall.sh error (#7366) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update new conversion script for converting safetensors. * Upgrade pytorch container to 23.08 (#7353) * upgrade pytorch container Signed-off-by: eharper * use mcore Signed-off-by: eharper * revert test change Signed-off-by: eharper * pleasefixme Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for ampere Signed-off-by: eharper * comment test temporarily Signed-off-by: eharper --------- Signed-off-by: eharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * enable fp32 optimizer for output_layer in mcore (#7355) Signed-off-by: lhb8125 * revert comment (#7368) Signed-off-by: eharper * Update to core 23.08 branch ToT (#7371) Signed-off-by: Abhinav Khattar * upper bounding ptl (#7370) Signed-off-by: eharper * fix pipeline parallel inference (#7367) * fix pp inference Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix for peft tied weights (#7372) Signed-off-by: arendu * fixed trainer.strategy=auto from None. 
(#7369) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add O2 option in gpt eval (#7358) * add O2 option in eval Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add doc for O2 config Signed-off-by: jasonwan * add to llama inference config Signed-off-by: jasonwan --------- Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * Move model precision copy (#7336) * move cfg precision set to megatron base model Signed-off-by: Maanu Grover * remove copy from other models Signed-off-by: Maanu Grover * modify attribute not arg Signed-off-by: Maanu Grover * fix gpt model test for ptl 2.0 Signed-off-by: Maanu Grover * rename function and add docstring Signed-off-by: Maanu Grover * replace precision to dtype conditionals with func call Signed-off-by: Maanu Grover * unnecessary function and cfg reset Signed-off-by: Maanu Grover * set default value Signed-off-by: Maanu Grover * fix precision lookup in a few more places Signed-off-by: Maanu Grover * rename mapping function Signed-off-by: Maanu Grover * ununsed import Signed-off-by: Maanu Grover * save torch datatype to model Signed-off-by: Maanu Grover * set weights precision wrt amp o2 Signed-off-by: Maanu Grover * Revert "set weights precision wrt amp o2" This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c. Signed-off-by: Maanu Grover * revert half precision at inference attempt Signed-off-by: Maanu Grover * move autocast dtype to base model Signed-off-by: Maanu Grover * move params dtype to base model, enable fp16 O2 inf Signed-off-by: Maanu Grover * unused imports Signed-off-by: Maanu Grover --------- Signed-off-by: Maanu Grover * Fix PEFT checkpoint loading (#7388) * Fix PEFT checkpoint loading Signed-off-by: Jason Wang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Use distributed optimizer support for multiple dtypes (#7359) * Update distopt wrapper with multiple dtype support Remove manual handling of separate FP32 optimizer. 
Signed-off-by: Tim Moon * Use distopt support for contiguous buffers with multiple dtypes Signed-off-by: Tim Moon * Fix typo Signed-off-by: Tim Moon * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Separate distopt buckets for first GPT layer and non-overlapped params Signed-off-by: Tim Moon * Add distopt logic for int dtypes Signed-off-by: Tim Moon * Update Apex commit Signed-off-by: Tim Moon * Remove unused variables Signed-off-by: Tim Moon * Update Apex commit in README and Jenkensfile Signed-off-by: Tim Moon * Debug Dockerfile and Jenkinsfile Signed-off-by: Tim Moon --------- Signed-off-by: Tim Moon Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * minor fix for llama ckpt conversion script (#7387) * minor fix for llama ckpt conversion script Signed-off-by: Jason Wang * Update Jenkinsfile Signed-off-by: Jason Wang * remove fast_swiglu configuration Signed-off-by: Jason Wang --------- Signed-off-by: Jason Wang Co-authored-by: Eric Harper * Fix wrong calling of librosa.get_duration() in notebook (#7376) Signed-off-by: Robin Dong Co-authored-by: Somshubra Majumdar * [PATCH] PEFT import mcore (#7393) * [PATCH] PEFT import mcore Signed-off-by: Jason Wang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Added a callback for logging initial data (#7384) Signed-off-by: Ante Jukić * Update Core Commit (#7402) * Update Core Commit Signed-off-by: Abhinav Khattar * update commit Signed-off-by: Abhinav Khattar --------- Signed-off-by: Abhinav Khattar * Use cfg attribute in bert (#7394) * use cfg attribute instead of arg Signed-off-by: Maanu Grover * use torch_dtype in place of cfg.precision Signed-off-by: Maanu Grover * move precision copy before super constructor Signed-off-by: Maanu Grover * use trainer arg Signed-off-by: Maanu Grover --------- Signed-off-by: Maanu Grover * Add support for bias conversion in Swiglu models (#7386) * Add support for bias conversion in Swiglu models Signed-off-by: smajumdar * Add support for auto extracting tokenizer model Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support for auto extracting tokenizer model Signed-off-by: smajumdar * Fix issue with missing tokenizer Signed-off-by: smajumdar * Refactor Signed-off-by: smajumdar * Refactor Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update save_to and restore_from for dist checkpointing (#7343) * add dist ckpt to save to, in progress Signed-off-by: eharper * move dist ckpt Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * clean up Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update restore from, need to figure out how to initialize distributed Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * launch distrib if needed when restoring dist ckpt Signed-off-by: eharper * when using 
mcore we can change tp pp on the fly Signed-off-by: eharper * add load_from_checkpoint support for dist ckpt Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update llama convert script to save dist .nemo Signed-off-by: eharper * fix load dist ckpt Signed-off-by: jasonwan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * setup TE TP groups if needed Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * setup te tp groups if needed Signed-off-by: eharper * remove import Signed-off-by: eharper --------- Signed-off-by: eharper Signed-off-by: jasonwan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: jasonwan * fix forward for with mcore=false (#7403) Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang * Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374) * Add CustomProgressBar class to exp_manager and trainer callbacks Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the progress bar to reflect total microbatch cnt Signed-off-by: Abhishree * Modify CustomProgressBar class 1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch 2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder Signed-off-by: Abhishree * Add CustomProgressBar callback to tuning files Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Set Activation Checkpointing Defaults (#7404) * Set Activation Checkpointing Defaults Signed-off-by: Abhinav Khattar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for None Signed-off-by: Abhinav Khattar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhinav Khattar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * make loss mask default to false (#7407) Signed-off-by: eharper * Add dummy userbuffer config files (#7408) Signed-off-by: Sangkug Lym * add missing ubconf files (#7412) Signed-off-by: Abhinav Khattar * New tutorial on Speech Data Explorer (#7405) * Added Google Colab based tutorial on Speech Data Explorer Signed-off-by: George Zelenfroynd * Update ptl training ckpt conversion script to work with dist ckpt (#7416) * update ptl convert script Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * don't break legacy Signed-off-by: eharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: eharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Allow disabling sanity checking when num_sanity_val_steps=0 (#7413) * Allow disabling sanity checking when num_sanity_val_steps=0 Signed-off-by: Abhishree * Update num_sanity_val_steps to be a multiple of num_microbatches Signed-off-by: Abhishree Thittenamane 
<47577437+athitten@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add comprehensive error messages (#7261) Signed-off-by: Anton Peganov * check NEMO_PATH (#7418) Signed-off-by: Nikolay Karpov * layer selection for ia3 (#7417) * layer selection for ia3 Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix missing pip package 'einops' (#7397) Signed-off-by: Robin Dong * Fix failure of pyaudio in Google Colab (#7396) Signed-off-by: Robin Dong * Update README.md: output_path --> output_manifest_filepath (#7442) Signed-off-by: Samuele Cornell * Updating FlashAttention API to match FlashAttentionV2 * Multiple fixes for mm * Fix CI inductor issue and update to torch compile * Remove suppress error * Fix when conversion config uses fp16 and it complains about precision plugin * Fixing FAv2 API usage * Initial release of content filtering model * Added synthetic dataloader for precached and online mode * Mingyuanm/dreambooth opt * Add llama2 support in neva training * Fix sampler length * Fix all precision issues in nemo multimodal * Add rope dynamic linear scaling (#7437) * Add dynamic linear scaling Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang * Fix None dataloader issue in PTL2.0 (#7455) * Fix None dataloader issue in PTL2.0 Signed-off-by: KunalDhawan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: KunalDhawan Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [ASR] Confidence measure -> method renames (#7434) * measure -> method Signed-off-by: Aleksandr Laptev * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aleksandr Laptev Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add steps for document of getting dataset 'SF Bilingual Speech' (#7378) * Add steps for document of getting dataset 'SF Bilingual Speech' Signed-off-by: Robin Dong * Update datasets.rst added a link from a tutorial 
demonstrating detailed data prep steps. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Robin Dong Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * RNN-T confidence and alignment bugfix (#7381) * new frame_confidence and alignments lists are now always created after the while loop Signed-off-by: Aleksandr Laptev * tests added Signed-off-by: Aleksandr Laptev --------- Signed-off-by: Aleksandr Laptev * Fix resume from checkpoint in exp_manager (#7424) (#7426) Signed-off-by: Abhishree Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Eric Harper * Fix checking of cuda/cpu device for inputs of Decoder (#7444) * Fix checking of cuda/cpu device for inputs of Decoder Signed-off-by: Robin Dong * Update tacotron2.py Signed-off-by: Jason --------- Signed-off-by: Robin Dong Signed-off-by: Jason Co-authored-by: Jason * Fix failure of ljspeech's get_data.py (#7430) * Fix failure of ljspeech's get_data.py Signed-off-by: Robin Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Robin Dong Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Fix audio codec type checks (#7373) * [TTS] Fix audio codec type checks Signed-off-by: Ryan * [TTS] Fix audio codec tests Signed-off-by: Ryan --------- Signed-off-by: Ryan * [TTS] Add dataset to path of logged artifacts (#7462) * [TTS] Add dataset to path of logged artifacts Signed-off-by: Ryan * [TTS] Revert axis name back to Audio Frames Signed-off-by: Ryan --------- Signed-off-by: Ryan * Fix sft dataset truncation (#7464) * Add fix Signed-off-by: Cheng-Ping Hsieh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh --------- Signed-off-by: Cheng-Ping Hsieh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330) * striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling Signed-off-by: mburchi * transpose conv1d inputs Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: mburchi * Update subsampling.py change striding_conv1d_k5 to striding_conv1d Signed-off-by: Maxime Burchi <60737204+burchim@users.noreply.github.com> * cv branch Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * video manifest Signed-off-by: mburchi * add collection classes Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add test_step_outputs Signed-off-by: mburchi * correct manifest bug when having only audio or only videos Signed-off-by: mburchi * correct manifest bug when having only audio or only videos Signed-off-by: mburchi * clean references Signed-off-by: mburchi * freeze unfreeze transcribe cv models Signed-off-by: mburchi * correct manifest get_full_path bug Signed-off-by: mburchi * update for PR Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * guard torchvision Signed-off-by: mburchi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more 
information, see https://pre-commit.ci * Update nemo/collections/cv/data/video_to_text_dataset.py Co-aut… * clean up Signed-off-by: stevehuang52 * update doc and infer Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * minor update Signed-off-by: stevehuang52 * fix import Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * fix pretrained info Signed-off-by: stevehuang52 * update dockerfile Signed-off-by: stevehuang52 * update for merging main Signed-off-by: stevehuang52 * fix for merge main Signed-off-by: stevehuang52 * clean up docs Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * fix speechlm test Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * fix multi-layer feat Signed-off-by: stevehuang52 * update for webdataset Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * force str to avoid bugs with implicit conversion of str to bool type Signed-off-by: stevehuang52 * Update examples/multimodal/speech_llm/README.md Co-authored-by: Nithin Rao Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update examples/multimodal/speech_llm/README.md Co-authored-by: Nithin Rao Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * refactor Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * update for saving nemo Signed-off-by: stevehuang52 * update eval and ngc ckpt Signed-off-by: stevehuang52 * Update nemo/collections/multimodal/speech_llm/data/audio_text_qa_dataset.py Co-authored-by: Nithin Rao Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_utils.py Co-authored-by: Nithin Rao Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update tests/collections/multimodal/test_speechllm_models.py Co-authored-by: Nithin Rao Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * refactor and remove nlp adapter mixin assert Signed-off-by: stevehuang52 * remove random context augmentation Signed-off-by: stevehuang52 * fix docstring Signed-off-by: stevehuang52 * add docstring Signed-off-by: stevehuang52 * minor refactor Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * refactor and fix missing import Signed-off-by: stevehuang52 * major refactor on input format and minor update Signed-off-by: stevehuang52 * fix codeQL Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * update for NGC ckpt and refactor Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * skip speechlm test until data moved to CI machines Signed-off-by: stevehuang52 * refactor and update to avoid changing nlp_adapter_mixin Signed-off-by: stevehuang52 * Apply isort and black reformatting Signed-off-by: stevehuang52 * minor fix Signed-off-by: stevehuang52 * Apply isort and black reformatting Signed-off-by: stevehuang52 --------- Signed-off-by: 
ericharper Signed-off-by: Yi Dong Signed-off-by: smajumdar Signed-off-by: Boris Fomitchev Signed-off-by: Alexandra Antonova Signed-off-by: Igor Gitman Signed-off-by: Roman Korostik Signed-off-by: Vladimir Bataev Signed-off-by: Nikolay Karpov Signed-off-by: Dmytro Pykhtar Signed-off-by: Vahid Signed-off-by: Hainan Xu Signed-off-by: arendu Signed-off-by: shanmugamr1992 Signed-off-by: Matvei Novikov Signed-off-by: Anas … Signed-off-by: Evelina Signed-off-by: fayejf Signed-off-by: vnoroozi Signed-off-by: Nithin Rao Koluguri Signed-off-by: Taejin Park Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Tim Moon Signed-off-by: zhehuaichen Signed-off-by: stevehuang52 Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Signed-off-by: Jean-Louis Queguiner Signed-off-by: Robin Dong Signed-off-by: Chen Cui Signed-off-by: anferico Signed-off-by: Somshubra Majumdar Signed-off-by: arendu Signed-off-by: Cheng-Ping Hsieh Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao Signed-off-by: Zhilin Wang Signed-off-by: Mikołaj Błaż Signed-off-by: Ryan Signed-off-by: Abhinav Khattar Signed-off-by: eharper Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Ante Jukić Signed-off-by: Gerald Shen Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Micha Livne Signed-off-by: jasonwan Signed-off-by: Hongbin Liu Signed-off-by: Jason Wang Signed-off-by: MaximumEntropy Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Signed-off-by: Alexander Jipa Signed-off-by: omahs <73983677+omahs@users.noreply.github.com> Signed-off-by: lhb8125 Signed-off-by: Maanu Grover Signed-off-by: Jimmy Zhang Signed-off-by: Sangkug Lym Signed-off-by: George Zelenfroynd Signed-off-by: Anton Peganov Signed-off-by: Samuele Cornell Signed-off-by: KunalDhawan Signed-off-by: Aleksandr Laptev Signed-off-by: Jason Signed-off-by: mburchi Signed-off-by: Maxime Burchi <60737204+burchim@users.noreply.github.com> Signed-off-by: Jan Lasek Signed-off-by: Tamerlan Tabolov Signed-off-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com> Signed-off-by: Stas Bekman Signed-off-by: Jocelyn Huang Signed-off-by: GiacomoLeoneMaria Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Signed-off-by: hkelly33 <58792115+hkelly33@users.noreply.github.com> Signed-off-by: Adi Renduchintala Signed-off-by: BestJuly Signed-off-by: Elena Rastorgueva Signed-off-by: dimapihtar Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: Mehadi Hasan Menon Signed-off-by: Sasha Meister Signed-off-by: Sasha Meister <117230141+ssh-meister@users.noreply.github.com> Signed-off-by: Jan Baczek Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Signed-off-by: Seonghun Noh Signed-off-by: Seonghun Signed-off-by: Eric Harper Signed-off-by: David Mosallanezhad Signed-off-by: Selvaraj Anandaraj Signed-off-by: dimapihtar Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Signed-off-by: Valerie Sarge Signed-off-by: Xiaowei Ren Signed-off-by: yaoyu-33 Signed-off-by: Daniel Egert Signed-off-by: Faith Wenyi Nchifor <52848633+Faith-Nchifor@users.noreply.github.com> Signed-off-by: Nikolay Karpov Signed-off-by: Martin Signed-off-by: Oren Amsalem Signed-off-by: yaoyu-33 
<54727607+yaoyu-33@users.noreply.github.com> Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Vivian Signed-off-by: Vivian chen Signed-off-by: Vivian Chen <140748220+xuanzic@users.noreply.github.com> Signed-off-by: Vivian Chen Signed-off-by: Selvaraj Anandaraj Signed-off-by: Shantanu Acharya Signed-off-by: Piotr Żelasko Signed-off-by: Agoniii <815244047@qq.com> Signed-off-by: Stephen Signed-off-by: Travis Bartley Signed-off-by: popcornell Signed-off-by: Michal Futrega Signed-off-by: xren Signed-off-by: Iztok Lebar Bajec Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: Piotr Żelasko Signed-off-by: Pablo Garay Signed-off-by: Harishankar G Signed-off-by: jiemingz Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Signed-off-by: Alexandros Koumparoulis Signed-off-by: HuiyingLi Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: Krishna Puvvada Signed-off-by: Jacek Bieniusiewicz Signed-off-by: andrusenkoau Signed-off-by: Huiying Li Signed-off-by: Huiying Li Signed-off-by: stevehuang52 Co-authored-by: ericharper Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com> Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Boris Fomitchev Co-authored-by: bene-ges Co-authored-by: Igor Gitman Co-authored-by: Roman Korostik Co-authored-by: Vladimir Bataev Co-authored-by: Nikolay Karpov Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Vahid Noroozi Co-authored-by: Nithin Rao Co-authored-by: Taejin Park Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: zhehuaichen <139396994+zhehuaichen@users.noreply.github.com> Co-authored-by: Zhehuai Chen Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Robin Dong Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Jean-Louis Queguiner Co-authored-by: Chen Cui Co-authored-by: Francesco Cariaggi Co-authored-by: Adi Renduchintala Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: Yang Zhang Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xin Yao Co-authored-by: Sandeep Subramanian Co-authored-by: Zhilin Wang Co-authored-by: mikolajblaz Co-authored-by: Ryan Langman Co-authored-by: Abhinav Khattar Co-authored-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Co-authored-by: anteju <108555623+anteju@users.noreply.github.com> Co-authored-by: Gerald Shen <119401249+gshennvm@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Mingyuan Ma Co-authored-by: Yu Yao Co-authored-by: Alexandre Milesi Co-authored-by: Ao Tang Co-authored-by: Bobby Chen Co-authored-by: Maanu Grover Co-authored-by: Shanmugam Ramasamy Co-authored-by: Mateusz Sieniawski Co-authored-by: Micha Livne Co-authored-by: Jason Wang Co-authored-by: eharper Co-authored-by: Hongbin Liu Co-authored-by: Kelvin Liu Co-authored-by: Oleksii Kuchaiev Co-authored-by: Cheng-Ping Hsieh Co-authored-by: Alexander Jipa Co-authored-by: Alexander Jipa Co-authored-by: omahs 
<73983677+omahs@users.noreply.github.com> Co-authored-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com> Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang Co-authored-by: Sangkug Lym Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: PeganovAnton Co-authored-by: Samuele Cornell Co-authored-by: Parth Mannan Co-authored-by: Lukasz Pierscieniewski Co-authored-by: Kunal Dhawan Co-authored-by: Aleksandr Laptev Co-authored-by: Jason Co-authored-by: Maxime Burchi <60737204+burchim@users.noreply.github.com> Co-authored-by: Igor Gitman Co-authored-by: Jan Lasek Co-authored-by: Tamerlan Tabolov Co-authored-by: Xuesong Yang <16880-xueyang@users.noreply.gitlab-master.nvidia.com> Co-authored-by: Stas Bekman Co-authored-by: Jocelyn Co-authored-by: Giacomo Leone Maria Cavallini <72698188+GiacomoLeoneMaria@users.noreply.github.com> Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Co-authored-by: meatybobby Co-authored-by: Marc Romeyn Co-authored-by: hkelly33 <58792115+hkelly33@users.noreply.github.com> Co-authored-by: Yuanzhe Dong Co-authored-by: Li Tao Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: Mehadi Hasan Menon Co-authored-by: Ahmad Kiswani Co-authored-by: Sasha Meister <117230141+ssh-meister@users.noreply.github.com> Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com> Co-authored-by: Seonghun Noh Co-authored-by: David Co-authored-by: Selvaraj Anandaraj Co-authored-by: Selvaraj Anandaraj Co-authored-by: Valerie Sarge Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy Co-authored-by: trias702 <25867060+trias702@users.noreply.github.com> Co-authored-by: Faith Wenyi Nchifor <52848633+Faith-Nchifor@users.noreply.github.com> Co-authored-by: Nikolay Karpov Co-authored-by: Martin Co-authored-by: Oren Amsalem Co-authored-by: Szymon Mikler Co-authored-by: Vivian Chen <140748220+xuanzic@users.noreply.github.com> Co-authored-by: Huiying Li Co-authored-by: HuiyingLi Co-authored-by: Selvaraj Anandaraj Co-authored-by: Shantanu Acharya Co-authored-by: Oren Amsalem Co-authored-by: Piotr Żelasko Co-authored-by: Cathy <815244047@qq.com> Co-authored-by: Stephen Co-authored-by: tbartley94 <90423858+tbartley94@users.noreply.github.com> Co-authored-by: Terry Kong Co-authored-by: Michal Futrega Co-authored-by: Iztok Lebar Bajec Co-authored-by: Pablo Garay Co-authored-by: Zhuoyao Wang Co-authored-by: Szymon Mikler Co-authored-by: Marek Wawrzos Co-authored-by: Chia-Chih Chen Co-authored-by: Ali Taghibakhshi Co-authored-by: Harishankar G Co-authored-by: Layali R <31741533+layalir@users.noreply.github.com> Co-authored-by: Hainan Xu Co-authored-by: Hainan Xu Co-authored-by: akoumpa <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada Co-authored-by: jbieniusiewi <152396322+jbieniusiewi@users.noreply.github.com> Co-authored-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com> Co-authored-by: stevehuang52 --- examples/multimodal/speech_llm/README.md | 189 ++ .../conf/modular_audio_gpt_config_eval.yaml | 128 ++ .../conf/modular_audio_gpt_config_peft.yaml | 327 ++++ 
.../conf/modular_audio_gpt_config_sft.yaml | 299 ++++ ...dular_audio_gpt_multi_enc_config_peft.yaml | 307 ++++ .../speech_llm/conf/salm/salm_config.yaml | 339 ++++ .../speech_llm/modular_audio_gpt_eval.py | 118 ++ .../speech_llm/modular_audio_gpt_train.py | 70 + .../asr/modules/conformer_encoder.py | 121 +- .../asr/parts/mixins/transcription.py | 10 +- nemo/collections/common/data/dataset.py | 14 +- nemo/collections/common/metrics/__init__.py | 6 +- .../metrics/metric_string_to_torchmetric.py | 10 +- .../common/parts/preprocessing/collections.py | 344 +++- .../tokenizers/sentencepiece_tokenizer.py | 9 +- .../multimodal/speech_llm/__init__.py | 15 + .../multimodal/speech_llm/data/__init__.py | 13 + .../speech_llm/data/audio_text_dataset.py | 1327 ++++++++++++++ .../multimodal/speech_llm/models/__init__.py | 15 + .../speech_llm/models/modular_models.py | 1563 +++++++++++++++++ .../multimodal/speech_llm/modules/__init__.py | 20 + .../common/audio_text_generation_strategy.py | 175 ++ .../common/audio_text_generation_utils.py | 698 ++++++++ .../speech_llm/modules/modality_adapters.py | 134 ++ .../speech_llm/modules/perception_modules.py | 431 +++++ .../multimodal/speech_llm/parts/__init__.py | 21 + .../speech_llm/parts/mixins/__init__.py | 13 + .../speech_llm/parts/mixins/adapter_mixin.py | 75 + .../speech_llm/parts/utils/__init__.py | 13 + .../speech_llm/parts/utils/data_utils.py | 157 ++ .../language_modeling/megatron_gpt_model.py | 171 +- .../megatron_gpt_sft_model.py | 17 +- .../nlp/modules/common/megatron/utils.py | 54 +- nemo/core/classes/common.py | 15 +- .../convert_to_tarred_audio_dataset.py | 23 +- .../multimodal/test_speechllm_models.py | 266 +++ 36 files changed, 7370 insertions(+), 137 deletions(-) create mode 100644 examples/multimodal/speech_llm/README.md create mode 100644 examples/multimodal/speech_llm/conf/modular_audio_gpt_config_eval.yaml create mode 100644 examples/multimodal/speech_llm/conf/modular_audio_gpt_config_peft.yaml create mode 100644 examples/multimodal/speech_llm/conf/modular_audio_gpt_config_sft.yaml create mode 100644 examples/multimodal/speech_llm/conf/modular_audio_gpt_multi_enc_config_peft.yaml create mode 100644 examples/multimodal/speech_llm/conf/salm/salm_config.yaml create mode 100644 examples/multimodal/speech_llm/modular_audio_gpt_eval.py create mode 100644 examples/multimodal/speech_llm/modular_audio_gpt_train.py create mode 100644 nemo/collections/multimodal/speech_llm/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/data/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py create mode 100644 nemo/collections/multimodal/speech_llm/models/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/models/modular_models.py create mode 100644 nemo/collections/multimodal/speech_llm/modules/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_strategy.py create mode 100644 nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_utils.py create mode 100644 nemo/collections/multimodal/speech_llm/modules/modality_adapters.py create mode 100644 nemo/collections/multimodal/speech_llm/modules/perception_modules.py create mode 100644 nemo/collections/multimodal/speech_llm/parts/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/parts/mixins/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/parts/mixins/adapter_mixin.py create mode 100644 
nemo/collections/multimodal/speech_llm/parts/utils/__init__.py create mode 100644 nemo/collections/multimodal/speech_llm/parts/utils/data_utils.py create mode 100644 tests/collections/multimodal/test_speechllm_models.py diff --git a/examples/multimodal/speech_llm/README.md b/examples/multimodal/speech_llm/README.md new file mode 100644 index 000000000000..b6a9c7486331 --- /dev/null +++ b/examples/multimodal/speech_llm/README.md @@ -0,0 +1,189 @@ +# Modular SpeechLLM + +This directory contains example scripts to train and evaluate modular SpeechLLMs (e.g., SALM [1]). + +## Requirements +You will need to install this specific branch of NeMo, or use the provided Dockerfile in the root directory of this repository to build a Docker image with all the necessary dependencies. + +## Architecture + +In general, a modular SpeechLLM consists of three main components: +- An audio encoder that processes the input audio and produces a sequence of audio embeddings. +- A modality adapter that processes the audio embeddings and produces a sequence of embeddings in the same latent space as the token embeddings of a pretrained large language model (LLM). +- A pretrained LLM that processes the embeddings from the modality adapter together with the token embeddings of the input prompt, and produces the text output. The audio embeddings and text token embeddings are concatenated along the time dimension before being fed into the LLM. + +## Usage + +### Input Format + +You'll need to prepare data in the NeMo manifest format, where each line is a Python dictionary with some keys, for example: +``` +{ + "audio_filepath": "path/to/audio.wav", + "offset": 0.0, # offset of the audio in seconds, this is an optional field + "duration": 10.0, # duration of the audio in seconds, can be set to `None` to load the whole audio + "context": "what is the transcription of the audio?", # text prompt for the audio, see below for more details + "answer": "the transcription of the audio", # optional for inference, defaults to "na" in the dataloader +} +``` + +The `context` field in the manifest is optional; you can instead put a list of contexts in a context file (one context per line) and set `++model.data.train_ds.context_file=<path/to/context_file>` to ask the dataloader to randomly pick a context from the file for each audio sample. This is useful for training with multiple prompts for the same task. If neither the `context` field nor `context_file` is provided, the dataloader will use the default context `what does the audio mean?` for all audios. During inference, it is recommended to have the `context` field in the manifest. + +#### **Customizing the fields to use** + +You can also use other fields in the manifest to replace the `context` and `answer` fields, but you'll also need to change the `prompt_template` to use the new field names. For example, if you want to use the fields `input_text` and `output_text`, you need to set: +```bash +++model.data.train_ds.context_key=input_text \ +++model.data.train_ds.answer_key=output_text \ +++model.data.train_ds.prompt_template="'Q: {input_text}\nA: {output_text}'" +``` +Note that there are single quotes around the prompt template (to avoid hydra errors), and the field names are wrapped in curly braces.
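+
+For illustration, a manifest entry matching the custom keys above might look like the following (the audio path and text values are just placeholders reused from the earlier example):
+```
+{
+    "audio_filepath": "path/to/audio.wav",
+    "duration": 10.0,
+    "input_text": "what is the transcription of the audio?",
+    "output_text": "the transcription of the audio"
+}
+```
+With the `prompt_template` above, this sample would be rendered as `Q: what is the transcription of the audio?\nA: the transcription of the audio` (with `\n` becoming a newline) before tokenization.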
+ +#### **Customizing the input format** + +If you would like to use multiple audios, you can set the `audio_filepath` to be a list of audio file paths, and specify the location of each audio by using a special `audio_locator` string in the context. The choice of `audio_locator` should also be passed into the config. For example, if you have a manifest item like this: +``` +{ + "audio_filepath": ["path/to/audio1.wav", "path/to/audio2.wav"], + "context": "what is the transcription of the [audio] and [audio]?", # text prompt for the audio, see below for more details + "answer": "the transcription of the audio1 and audio2", # optional for inference, default to "na" in dataloader +} +``` +You can set the `audio_locator` to be `[audio]` in the config: +```bash +++model.data.train_ds.audio_locator='[audio]' +``` + +By using `audio_locator`, the dataloader will replace the `audio_locator` in the context with the corresponding audio features extracted for each audio. You need to make sure that the number of audio locators in the context matches the number of audio files in the `audio_filepath` field. + +### Training + +There are several configs for training a SpeechLLM: +- `conf/modular_audio_gpt_config_peft.yaml`: a config for training a SpeechLLM with PEFT (e.g., LoRA), where you don't want to tune the whole LLM but still want to adapt the LLM to your needs. +- `conf/modular_audio_gpt_config_sft.yaml`: a config for training a SpeechLLM without PEFT, where you might want to tune the whole LLM or simply freeze it and use as is. +- `conf/modular_audio_gpt_multi_enc_config_peft.yaml`: a config for training a SpeechLLM with multiple audio encoders and PEFT, where you can add speaker embeddings to the audio embeddings. Currently only TitaNet is supported as the speaker encoder. + +With any config, you can set the following flags to control which components to train or freeze: +- `model.freeze_llm` # Generally set to `True` unless you want to fine-tune the whole LLM. +- `model.freeze_audio_encoder` # Generally set to `False` unless you want to freeze the audio encoder. +- `model.freeze_modality_adapter` # Generally set to `False` since we want to train the modality adapter. + +In addition to the config file, you will also need to prepare the audio encoder and the LLM as `*.nemo` files. + +To train a SpeechLLM that uses LoRA, you can run the following script: +```bash +MEGATRON_MODEL=/path/to/megatron-model.nemo +ASR_MODEL=/path/to/audio-model.nemo # only the encoder part will be loaded. 
e.g., stt_en_fastconformer_transducer_large.nemo + +TRAIN_MANIFESTS="[/data/train_1.json,/data/train_2.json]" +VAL_MANIFESTS="[/data/dev_1.json,/data/dev_2.json]" +VAL_NAMES="[dev-1,dev-2]" # names to display when logging validation results for each dataset + +# global_batch_size = micro_batch_size * num_gpus_per_node * num_nodes * accumulate_grad_batches +# micro_batch_size = batch size per GPU +CUDA_VISIBLE_DEVICES="0,1" python modular_audio_gpt_train.py --config-path="./conf" --config-name "modular_audio_gpt_config_peft" \ + trainer.devices=-1 \ + model.freeze_audio_encoder=True \ + model.freeze_llm=True \ + model.global_batch_size=4 \ + model.micro_batch_size=2 \ + model.pretrained_audio_model=$ASR_MODEL \ + model.restore_from_path=$MEGATRON_MODEL \ + model.data.train_ds.manifest_filepath=$TRAIN_MANIFESTS \ + model.data.validation_ds.manifest_filepath=$VAL_MANIFESTS \ + ++model.data.validation_ds.names=$VAL_NAMES +``` + +You can also use tarred datasets for faster training by converting normal NeMo datasets to tarred datasets with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py) and following the same dataset settings as shown in the script. Also, `accumulate_grad_batches` is automatically set by the model based on `global_batch_size` and `micro_batch_size`, so there's no need to manually calculate and set `trainer.accumulate_grad_batches`. + + +#### **Multi-task Training** + +To use a context file, you can set `++model.data.train_ds.context_file=<path/to/context_file>` in the command line, or use multiple context files with `++model.data.train_ds.context_file=[<path/to/context_file1>,<path/to/context_file2>,...]`. If the number of context files is equal to the number of provided datasets, the dataloader will assign each context file to a dataset. Otherwise, the dataloader will randomly pick a context file from all provided context files for each audio sample. Using multiple context files is useful for training with multiple tasks, where each task has its own set of prompts. In addition, you can control the weights for different tasks/datasets by using concatenated tarred datasets, where you can assign weights to datasets by setting: +``` +++model.data.train_ds.is_tarred=True \ +++model.data.train_ds.is_concat=True \ +++model.data.train_ds.manifest_filepath=[/path/to/data1/tarred_audio_manifest.json,/path/to/data2/tarred_audio_manifest.json] \ +++model.data.train_ds.tarred_audio_filepaths=[/path/to/data1/audio__OP_0..1023_CL_.tar,/path/to/data2/audio__OP_0..1023_CL_.tar] \ +++model.data.train_ds.concat_sampling_technique='random' \ +++model.data.train_ds.concat_sampling_probabilities=[0.4,0.6] \ +``` + +#### **Available Audio Encoders** + +Currently, all NeMo ASR models are supported; other models may also work if they have an `encoder` attribute that returns a sequence of audio embeddings, and a `preprocessor` that takes raw audio and returns a sequence of features for the encoder. The model should also have a `cfg` attribute that returns an `omegaconf.DictConfig` object with the model configuration. In addition to a local model, you can also set `pretrained_audio_model` to a model from NGC (e.g., `stt_en_fastconformer_transducer_large`) or Hugging Face (e.g., `nvidia/parakeet-rnnt-1.1b`), and the script will download the model and use it for training.
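+
+For example, instead of pointing `pretrained_audio_model` to a local `.nemo` file, you can pass one of the model names mentioned above (pick whichever encoder fits your setup):
+```bash
+# use an NGC model name; the model is downloaded automatically
+model.pretrained_audio_model=stt_en_fastconformer_transducer_large
+# or use a Hugging Face model id
+model.pretrained_audio_model=nvidia/parakeet-rnnt-1.1b
+```
+As noted in the training example above, only the encoder (and preprocessor) of the ASR model is used.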
+ + +### Inference + +The script you need to perform inference is `modular_audio_gpt_eval.py`, and the corresponding config file is `conf/modular_audio_gpt_config_eval.yaml`, where you mainly need to set the `model.data.test_ds` fields as well as the paths to the checkpoints. + +#### **Inference with Intermediate Checkpoints** + +If you want to perform inference with intermediate checkpoints, where there's no single NeMo checkpoint file that contains all the model parameters, you can use the following script to load each component from its own checkpoint file and perform inference: + +```bash +MEGATRON_CKPT=/path/to/megatron-llm.nemo +ALM_DIR=/path/to/nemo_experiments/job_name +# below is the path to the config used during training +ALM_YAML=$ALM_DIR/version_0/hparams.yaml +# this checkpoint file only contains the trainable params, the backslashes are used to avoid hydra parsing errors +ALM_CKPT="$ALM_DIR/checkpoints/AudioGPT--validation_wer\=0.2-step\=100000-epoch\=0-last.ckpt" + +TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]" +TEST_NAMES="[test-1,test-2]" + +CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \ + model.restore_from_path=$MEGATRON_CKPT \ + model.peft.restore_from_path=$ALM_CKPT \ + model.peft.restore_from_hparams_path=$ALM_YAML \ + model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \ + model.data.test_ds.names=$TEST_NAMES \ + model.data.test_ds.metric.name="bleu" \ + model.data.test_ds.global_batch_size=8 \ + model.data.test_ds.micro_batch_size=8 \ + model.data.test_ds.tokens_to_generate=256 \ + ++inference.greedy=False \ + ++inference.top_k=50 \ + ++inference.top_p=0.95 \ + ++inference.temperature=0.4 \ + ++inference.repetition_penalty=1.2 \ + ++model.data.test_ds.output_dir=${ALM_DIR} +``` + +If you froze the audio encoder during training, you will also need to add the following line to the above script: +```bash +++model.pretrained_audio_model=/path/to/audio/model.nemo +``` + +If you want to save the intermediate checkpoints to a single NeMo checkpoint file, you can add the following line to the above script (this corresponds to the `save_as_nemo` field in the eval config): +```bash +++save_as_nemo=/path/to/save/model.nemo +``` + +#### **Inference with Complete SpeechLLM Checkpoints** + +If you want to load a trained SpeechLLM from the cloud, you can use the following script: +```bash +TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]" +TEST_NAMES="[test-1,test-2]" + +CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \ + model.from_pretrained="speechllm_fc_llama2_7b" \ + model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \ + model.data.test_ds.names=$TEST_NAMES \ + model.data.test_ds.global_batch_size=8 \ + model.data.test_ds.micro_batch_size=8 \ + model.data.test_ds.tokens_to_generate=256 \ + ++inference.greedy=False \ + ++inference.top_k=50 \ + ++inference.top_p=0.95 \ + ++inference.temperature=0.4 \ + ++inference.repetition_penalty=1.2 \ + ++model.data.test_ds.output_dir="./test_outputs" +``` + +If you have a local `.nemo` file, you can use `model.restore_from_path=/path/to/model.nemo` to replace the line `model.from_pretrained="speechllm_fc_llama2_7b"` in the above example. + + +## Reference +[1] Chen, Z.\*, Huang, H.\*, Andrusenko, A., Hrinchuk, O., Puvvada, K.C., Li, J., Ghosh, S., Balam, J. and Ginsburg, B., 2023. SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation. ICASSP'24.
\ No newline at end of file diff --git a/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_eval.yaml b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_eval.yaml new file mode 100644 index 000000000000..e2ef61a8046d --- /dev/null +++ b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_eval.yaml @@ -0,0 +1,128 @@ +# this config is used to perform inference on SpeechLLM checkpoints +name: megatron_audio_gpt_eval + +trainer: + devices: 1 + accelerator: gpu + num_nodes: 1 + precision: bf16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 1 + max_steps: 1000000 + log_every_n_steps: 10 # frequency with which training steps are logged + val_check_interval: 1.0 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch + gradient_clip_val: 1.0 + +exp_manager: + explicit_log_dir: null + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: validation_${model.data.validation_ds.metric.name} + save_top_k: 1 + mode: min + save_nemo_on_train_end: True + filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}' + model_parallel_size: ${model.tensor_model_parallel_size} + always_save_nemo: True + save_best_model: False + +model: + from_pretrained: null # pretrained model name on NGC or HF + restore_from_path: null # Path to an existing .nemo model you wish to add new tasks to or run inference with + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + pretrained_audio_model: null # Path to a .nemo model for audio encoder + + seed: 1234 + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + + global_batch_size: 1 + micro_batch_size: 1 + sync_batch_comm: False + megatron_amp_O2: False + + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. + sequence_parallel: False + + ## Activation Checkpoint + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null # not used with 'selective' + activations_checkpoint_layers_per_pipeline: null + answer_only_loss: False # not used right now + gradient_as_bucket_view: False + + hidden_dropout: 0.0 + attention_dropout: 0.0 + ffn_dropout: 0.0 + + peft: # keep these basic params for reusing in both sft and peft SpeechLMs + restore_from_path: null + restore_from_hparams_path: null + restore_from_ckpt: + checkpoint_name: null + checkpoint_dir: null + + + data: + test_ds: + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. 
Data format is identical to train_ds. + names: null # Names of the corresponding datasets used to log metrics. + global_batch_size: 1 + micro_batch_size: 1 + shuffle: False + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: False + end_string: ${data.train_ds.end_string} # don't change, let hydra resolve from saved config + context_key: ${data.train_ds.context_key} # don't change, let hydra resolve from saved config + answer_key: ${data.train_ds.answer_key} # don't change, let hydra resolve from saved config + add_eos: ${data.train_ds.add_eos} # don't change, let hydra resolve from saved config + add_sep: ${data.train_ds.add_sep} # don't change, let hydra resolve from saved config + add_bos: ${data.train_ds.add_bos} # don't change, let hydra resolve from saved config + separate_prompt_and_response_with_newline: ${data.train_ds.separate_prompt_and_response_with_newline} + write_predictions_to_file: True + output_file_path_prefix: "preds" # Prefix of the file to write predictions to. + truncation_field: ${data.train_ds.truncation_field} # don't change, let hydra resolve from saved config + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: ${data.train_ds.prompt_template} # don't change, let hydra resolve from saved config + tokens_to_generate: 512 + log_every_n_steps: 1 + sample_rate: ${data.train_ds.sample_rate} # don't change, let hydra resolve from saved config + audio_locator: null # set it to allow multiple audios in a sample, e.g. '|audio|', and use it in the context field of manifest to specify the locations of audios (`audio_filepath` is a list of audios). + + metric: + name: "bleu" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss', 'wer', 'bleu', 'rouge'] + average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + num_classes: null + +save_as_nemo: null # optional string, set to save the whole model into a single nemo file + +inference: + greedy: True # Whether or not to use sampling ; use greedy decoding otherwise + top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering. + top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. + temperature: 1.0 # sampling temperature + all_probs: False # whether return the log prob for all the tokens in vocab + repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty. + min_tokens_to_generate: 0 # The minimum length of the sequence to be generated. 
+ compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False + outfile_path: output.txt + compute_attention_mask: True diff --git a/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_peft.yaml b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_peft.yaml new file mode 100644 index 000000000000..172a8f37cf1c --- /dev/null +++ b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_peft.yaml @@ -0,0 +1,327 @@ +name: megatron_audio_gpt_peft + +trainer: + devices: 1 + accelerator: gpu + num_nodes: 1 + precision: 16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 1000 # used to keep epoch logging correctly, but training will stop based on max_steps + max_steps: 1000000 # 1M steps + log_every_n_steps: 10 # frequency with which training steps are logged + val_check_interval: 3000 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch + gradient_clip_val: 1.0 + accumulate_grad_batches: 1 + +exp_manager: + # explicit_log_dir: null + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: validation_${model.data.validation_ds.metric.name} + save_top_k: 1 + mode: min + save_nemo_on_train_end: True + filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{epoch}' + model_parallel_size: ${model.tensor_model_parallel_size} + always_save_nemo: False + save_best_model: True + create_early_stopping_callback: False + early_stopping_callback_params: + monitor: "val_loss" + mode: "min" + min_delta: 0.001 + patience: 10 + verbose: True + strict: False # Should be False to avoid a runtime error where EarlyStopping says monitor is unavailable, which sometimes happens with resumed training. + + +model: + seed: 1234 + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + + pretrained_audio_model: ??? + freeze_llm: True + freeze_audio_encoder: False + freeze_modality_adapter: False + + global_batch_size: 128 + micro_batch_size: 4 + restore_from_path: ??? # Path to an existing .nemo model you wish to add new tasks to or run inference with + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training. + sync_batch_comm: False + megatron_amp_O2: False + + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. 
+ sequence_parallel: False + + ## Activation Checkpoint + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null # not used with 'selective' + activations_checkpoint_layers_per_pipeline: null + answer_only_loss: True + gradient_as_bucket_view: False + + hidden_dropout: 0.0 + attention_dropout: 0.0 + ffn_dropout: 0.0 + + peft: + peft_scheme: "lora" # can be either lora, adapter, ia3 or ptuning + restore_from_path: null + + # Used for adapter peft training + adapter_tuning: + type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter' + adapter_dim: 32 + adapter_dropout: 0.0 + norm_position: 'pre' # This can be set to 'pre', 'post' or null, 'pre' is normally what is used. + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + norm_type: 'mixedfusedlayernorm' # IGNORED if layer_adapter is used, options are ['layernorm', 'mixedfusedlayernorm'] + layer_selection: null # selects in which layers to add adapters, e.g. [1,12] will add adapters to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + lora_tuning: + target_modules: ['attention_qkv','attention_dense','mlp_fc1','mlp_fc2'] # this can either be 'attention_qkv','attention_dense','mlp_fc1','mlp_fc2', attention (qkv & dense), mlp (fc1 & fc2) + adapter_dim: 32 + alpha: ${model.peft.lora_tuning.adapter_dim} + adapter_dropout: 0.0 + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + layer_selection: null # selects in which layers to add lora adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + # Used for p-tuning peft training + p_tuning: + virtual_tokens: 10 # The number of virtual tokens the prompt encoder should add at the start of the sequence + bottleneck_dim: 1024 # the size of the prompt encoder mlp bottleneck + embedding_dim: 1024 # the size of the prompt encoder embeddings + init_std: 0.023 + + ia3_tuning: + layer_selection: null # selects in which layers to add ia3 adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. 
null will apply adapters to all layers + + selective_tuning: + tunable_base_param_names: ["self_attention", "word_embeddings"] # TODO: regex support @adithyre + + + perception: + use_multi_layer_feat: false # whether to extract multi-layer features, only supports conformer encoder + multi_layer_feat: + layer_idx_list: [0,16] # layer indices to extract features from + aggregator: + mode: "cat" # ways to combine features from different layers, choices=['cat','sum','mean', 'max', 'min'], default to concat ('cat') + pooling: "avg" # ways to pool features if they have different temporal lengths and align_mode=min, choices=['mean', 'max', 'min'] + align_mode: "min" # if features have different temporal lengths, set `min` to pool to the shortest length or `max` to repeat to the longest. + + modality_adapter: + _target_: nemo.collections.asr.modules.ConformerEncoder + feat_in: 1024 + feat_out: -1 # you may set it if you need different output size other than the default d_model + n_layers: 2 + d_model: 512 + + # Sub-sampling parameters + subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding + subsampling_factor: 8 # must be power of 2 for striding and vggnet + subsampling_conv_channels: 256 # set to -1 to make it equal to the d_model + causal_downsampling: false + + # Reduction parameters: Can be used to add another subsampling layer at a given position. + # Having a 2x reduction will speedup the training and inference speech while keeping similar WER. + # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup. + reduction: null # pooling, striding, or null + reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder + reduction_factor: 1 + + # Feed forward module's params + ff_expansion_factor: 4 + + # Multi-headed Attention Module's params + self_attention_model: rel_pos # rel_pos or abs_pos + n_heads: 8 # may need to be lower for smaller d_models + # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention + att_context_size: [-1, -1] # -1 means unlimited context + att_context_style: regular # regular or chunked_limited + xscaling: true # scales up the input embeddings by sqrt(d_model) + untie_biases: true # unties the biases of the TransformerXL layers + pos_emb_max_len: 5000 + + # Convolution module's params + conv_kernel_size: 9 + conv_norm_type: 'batch_norm' # batch_norm or layer_norm or groupnormN (N specifies the number of groups) + # conv_context_size can be"causal" or a list of two integers while conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size + # null means [(kernel_size-1)//2, (kernel_size-1)//2], and 'causal' means [(kernel_size-1), 0] + conv_context_size: null + + ### regularization + dropout: 0.1 # The dropout used in most of the Conformer Modules + dropout_pre_encoder: 0.1 # The dropout used before the encoder + dropout_emb: 0.0 # The dropout used for embeddings + dropout_att: 0.1 # The dropout for multi-headed attention modules + + # set to non-zero to enable stochastic depth + stochastic_depth_drop_prob: 0.0 + stochastic_depth_mode: linear # linear or uniform + stochastic_depth_start_layer: 1 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + # the following are read from the pretrained audio encoder: + # output_dim: null + # encoder: null + # 
preprocessor: null + + data: + end_string: "[EOG]" + train_ds: + # Example of how to specify paths to multiple datasets + # manifest_filepath: + # - /path/to/squad.jsonl + # - /path/to/mnli.jsonl + # - /path/to/boolq.jsonl + # Example of how each dataset is formatted + # {'audio_filepath': 'audio1.wav', 'offset': 0.0, 'duration': 12.3, 'context': 'transcribe this audio', 'answer': 'I have a dream...'} + # the 'answer' field can also be 'text', and a default 'context' field is added if missing in manigests, so as to work with ASR manifests + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: True + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: True + # Notably, the data weights are controlled by either bucketing_weights + # or concat_sampling_probabilities depending on the dataset type (tar and + # non-tar). + # See audio_text_qa_dataset.py for details. + concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random' + context_key: 'context' + answer_key: 'answer' + add_eos: True + # add_eos: False + end_string: ${model.data.end_string} + add_sep: False + add_bos: False + separate_prompt_and_response_with_newline: False + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: "Q: {context}\nA: {answer}" # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + max_duration: 24 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "fully_randomized" + bucketing_batch_size: null + sample_alpha: null + audio_locator: null + + validation_ds: + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: False + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: False + context_key: ${model.data.train_ds.context_key} + answer_key: ${model.data.train_ds.answer_key} + add_eos: ${model.data.train_ds.add_eos} + end_string: ${model.data.end_string} + add_sep: ${model.data.train_ds.add_sep} + add_bos: ${model.data.train_ds.add_bos} + separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + write_predictions_to_file: False + output_file_path_prefix: null # Prefix of the file to write predictions to. + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: ${model.data.train_ds.prompt_template} # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + tokens_to_generate: 128 + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + audio_locator: ${model.data.train_ds.audio_locator} + + log_every_n_steps: 10 + metric: + name: "loss" # Name of the evaluation metric to use. 
Options: ['exact_string_match', 'loss', 'wer', 'bleu', 'rouge'] + average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + num_classes: null + + # test_ds: + # manifest_filepath: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + # names: null # Names of the corresponding datasets used to log metrics. + # global_batch_size: ${model.global_batch_size} + # micro_batch_size: ${model.micro_batch_size} + # shuffle: False + # num_workers: 4 + # pin_memory: True + # max_seq_length: 2048 + # min_seq_length: 1 + # drop_last: False + # context_key: 'context' + # answer_key: 'answer' + # add_eos: ${model.data.train_ds.add_eos} + # end_string: ${model.data.end_string} + # add_sep: ${model.data.train_ds.add_sep} + # add_bos: ${model.data.train_ds.add_bos} + # separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + # write_predictions_to_file: False + # output_file_path_prefix: null # Prefix of the file to write predictions to. + # truncation_field: "context" # Options: ['context', 'answer'] + # index_mapping_dir: null # Path to a directory to write index mapping files. + # prompt_template: ${model.data.train_ds.prompt_template} + # # ASR configs + # sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + + # metric: + # name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss'] + # average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + # num_classes: null + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0.001 + betas: + - 0.9 + - 0.98 + sched: + name: CosineAnnealing + warmup_steps: 5000 + min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1 + constant_steps: 0 # Constant steps should also be 0 when min_lr=0 + monitor: val_loss + reduce_on_plateau: false diff --git a/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_sft.yaml b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_sft.yaml new file mode 100644 index 000000000000..7f8512fbb19e --- /dev/null +++ b/examples/multimodal/speech_llm/conf/modular_audio_gpt_config_sft.yaml @@ -0,0 +1,299 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +name: megatron_audio_gpt_sft + +trainer: + devices: 1 + accelerator: gpu + num_nodes: 1 + precision: 16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 1000 # used to keep epoch logging correctly, but training will stop based on max_steps + max_steps: 1000000 # 1M steps + log_every_n_steps: 10 # frequency with which training steps are logged + val_check_interval: 3000 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch + gradient_clip_val: 1.0 + accumulate_grad_batches: 1 + +exp_manager: + # explicit_log_dir: null + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: validation_${model.data.validation_ds.metric.name} + save_top_k: 1 + mode: min + save_nemo_on_train_end: True + filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{epoch}' + model_parallel_size: ${model.tensor_model_parallel_size} + always_save_nemo: False + save_best_model: True + create_early_stopping_callback: False + early_stopping_callback_params: + monitor: "val_loss" + mode: "min" + min_delta: 0.001 + patience: 10 + verbose: True + strict: False # Should be False to avoid a runtime error where EarlyStopping says monitor is unavailable, which sometimes happens with resumed training. + + +model: + seed: 1234 + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + + pretrained_audio_model: ??? + freeze_llm: True + freeze_audio_encoder: True + freeze_modality_adapter: False + + global_batch_size: 128 + micro_batch_size: 4 + restore_from_path: ??? # Path to an existing .nemo model you wish to add new tasks to or run inference with + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training. + sync_batch_comm: False + megatron_amp_O2: False + + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. 
+ sequence_parallel: False + + ## Activation Checkpoint + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null # not used with 'selective' + activations_checkpoint_layers_per_pipeline: null + answer_only_loss: True + gradient_as_bucket_view: False + + hidden_dropout: 0.0 + attention_dropout: 0.0 + ffn_dropout: 0.0 + + perception: + use_multi_layer_feat: false + multi_layer_feat: + layer_idx_list: [0,16] + aggregator: + mode: "cat" # ways to combine features from different layers, choices=['cat','sum','mean', 'max', 'min'], default to concat ('cat') + pooling: "avg" # ways to pool features if they have different temporal lengths and align_mode=min, choices=['mean', 'max', 'min'] + align_mode: "min" # if features have different temporal lengths, set `min` to pool to the shortest length or `max` to repeat to the longest. + + modality_adapter: + _target_: nemo.collections.asr.modules.ConformerEncoder + feat_in: 1024 + feat_out: -1 # you may set it if you need different output size other than the default d_model + n_layers: 2 + d_model: 512 + + # Sub-sampling parameters + subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding + subsampling_factor: 8 # must be power of 2 for striding and vggnet + subsampling_conv_channels: 256 # set to -1 to make it equal to the d_model + causal_downsampling: false + + # Reduction parameters: Can be used to add another subsampling layer at a given position. + # Having a 2x reduction will speedup the training and inference speech while keeping similar WER. + # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup. 
+ reduction: null # pooling, striding, or null + reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder + reduction_factor: 1 + + # Feed forward module's params + ff_expansion_factor: 4 + + # Multi-headed Attention Module's params + self_attention_model: rel_pos # rel_pos or abs_pos + n_heads: 8 # may need to be lower for smaller d_models + # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention + att_context_size: [-1, -1] # -1 means unlimited context + att_context_style: regular # regular or chunked_limited + xscaling: true # scales up the input embeddings by sqrt(d_model) + untie_biases: true # unties the biases of the TransformerXL layers + pos_emb_max_len: 5000 + + # Convolution module's params + conv_kernel_size: 9 + conv_norm_type: 'batch_norm' # batch_norm or layer_norm or groupnormN (N specifies the number of groups) + # conv_context_size can be"causal" or a list of two integers while conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size + # null means [(kernel_size-1)//2, (kernel_size-1)//2], and 'causal' means [(kernel_size-1), 0] + conv_context_size: null + + ### regularization + dropout: 0.1 # The dropout used in most of the Conformer Modules + dropout_pre_encoder: 0.1 # The dropout used before the encoder + dropout_emb: 0.0 # The dropout used for embeddings + dropout_att: 0.1 # The dropout for multi-headed attention modules + + # set to non-zero to enable stochastic depth + stochastic_depth_drop_prob: 0.0 + stochastic_depth_mode: linear # linear or uniform + stochastic_depth_start_layer: 1 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + # the following are read from the pretrained audio encoder: + # output_dim: null + # encoder: null + # preprocessor: null + + data: + end_string: "[EOG]" + train_ds: + # Example of how to specify paths to multiple datasets + # manifest_filepath: + # - /path/to/squad.jsonl + # - /path/to/mnli.jsonl + # - /path/to/boolq.jsonl + # Example of how each dataset is formatted + # {'audio_filepath': 'audio1.wav', 'offset': 0.0, 'duration': 12.3, 'context': 'transcribe this audio', 'answer': 'I have a dream...'} + # the 'answer' field can also be 'text', and a default 'context' field is added if missing in manigests, so as to work with ASR manifests + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: True + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: True + # Notably, the data weights are controlled by either bucketing_weights + # or concat_sampling_probabilities depending on the dataset type (tar and + # non-tar). + # See audio_text_qa_dataset.py for details. + concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random' + context_key: 'context' + answer_key: 'answer' + add_eos: True + # add_eos: False + end_string: ${model.data.end_string} + add_sep: False + add_bos: False + separate_prompt_and_response_with_newline: False + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. 
+ prompt_template: "Q: {context}\nA: {answer}" # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + max_duration: 24 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "fully_randomized" + bucketing_batch_size: null + sample_alpha: null + audio_locator: null + + validation_ds: + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: False + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: False + context_key: ${model.data.train_ds.context_key} + answer_key: ${model.data.train_ds.answer_key} + add_eos: ${model.data.train_ds.add_eos} + end_string: ${model.data.end_string} + add_sep: ${model.data.train_ds.add_sep} + add_bos: ${model.data.train_ds.add_bos} + separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + write_predictions_to_file: False + output_file_path_prefix: null # Prefix of the file to write predictions to. + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: ${model.data.train_ds.prompt_template} # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + tokens_to_generate: 128 + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + audio_locator: ${model.data.train_ds.audio_locator} + + log_every_n_steps: 10 + metric: + name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss', 'wer', 'bleu', 'rouge'] + average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + num_classes: null + + # test_ds: + # manifest_filepath: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + # names: null # Names of the corresponding datasets used to log metrics. + # global_batch_size: ${model.global_batch_size} + # micro_batch_size: ${model.micro_batch_size} + # shuffle: False + # num_workers: 4 + # pin_memory: True + # max_seq_length: 2048 + # min_seq_length: 1 + # drop_last: False + # context_key: 'context' + # answer_key: 'answer' + # add_eos: ${model.data.train_ds.add_eos} + # end_string: ${model.data.end_string} + # add_sep: ${model.data.train_ds.add_sep} + # add_bos: ${model.data.train_ds.add_bos} + # separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + # write_predictions_to_file: False + # output_file_path_prefix: null # Prefix of the file to write predictions to. + # truncation_field: "context" # Options: ['context', 'answer'] + # index_mapping_dir: null # Path to a directory to write index mapping files. + # prompt_template: ${model.data.train_ds.prompt_template} + # # ASR configs + # sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + + # metric: + # name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss'] + # average: null # Average the metric over the dataset. 
Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + # num_classes: null + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0.001 + betas: + - 0.9 + - 0.98 + sched: + name: CosineAnnealing + warmup_steps: 5000 + min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1 + constant_steps: 0 # Constant steps should also be 0 when min_lr=0 + monitor: val_loss + reduce_on_plateau: false diff --git a/examples/multimodal/speech_llm/conf/modular_audio_gpt_multi_enc_config_peft.yaml b/examples/multimodal/speech_llm/conf/modular_audio_gpt_multi_enc_config_peft.yaml new file mode 100644 index 000000000000..656e7df287f1 --- /dev/null +++ b/examples/multimodal/speech_llm/conf/modular_audio_gpt_multi_enc_config_peft.yaml @@ -0,0 +1,307 @@ +name: megatron_audio_gpt_multi_enc_peft_tuning + +trainer: + devices: 1 + accelerator: gpu + num_nodes: 1 + precision: 16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 10000 # used to keep epoch logging correctly, but training will stop based on max_steps + max_steps: 1000000 # 1M steps + log_every_n_steps: 10 # frequency with which training steps are logged + val_check_interval: 3000 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch + gradient_clip_val: 1.0 + accumulate_grad_batches: 1 + +exp_manager: + # explicit_log_dir: null + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: validation_${model.data.validation_ds.metric.name} + save_top_k: 1 + mode: min + save_nemo_on_train_end: True + filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{epoch}' + model_parallel_size: ${model.tensor_model_parallel_size} + always_save_nemo: False + save_best_model: True + create_early_stopping_callback: False + early_stopping_callback_params: + monitor: "val_loss" + mode: "min" + min_delta: 0.001 + patience: 10 + verbose: True + strict: False # Should be False to avoid a runtime error where EarlyStopping says monitor is unavailable, which sometimes happens with resumed training. + + +model: + seed: 1234 + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + + freeze_llm: True + freeze_audio_encoder: True + freeze_modality_adapter: False + + global_batch_size: 128 + micro_batch_size: 4 + restore_from_path: ??? # Path to an existing .nemo model you wish to add new tasks to or run inference with + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training. + sync_batch_comm: False + megatron_amp_O2: False + + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. 
+ sequence_parallel: False + + ## Activation Checkpoint + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null # not used with 'selective' + activations_checkpoint_layers_per_pipeline: null + answer_only_loss: True + gradient_as_bucket_view: False + + hidden_dropout: 0.0 + attention_dropout: 0.0 + ffn_dropout: 0.0 + + peft: + peft_scheme: "lora" # can be either adapter,ia3, or ptuning + restore_from_path: null + + # Used for adapter peft training + adapter_tuning: + type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter' + adapter_dim: 32 + adapter_dropout: 0.0 + norm_position: 'pre' # This can be set to 'pre', 'post' or null, 'pre' is normally what is used. + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + norm_type: 'mixedfusedlayernorm' # IGNORED if layer_adapter is used, options are ['layernorm', 'mixedfusedlayernorm'] + layer_selection: null # selects in which layers to add adapters, e.g. [1,12] will add adapters to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + lora_tuning: + target_modules: ['attention_qkv','attention_dense','mlp_fc1','mlp_fc2'] # this can either be 'attention_qkv','attention_dense','mlp_fc1','mlp_fc2', attention (qkv & dense), mlp (fc1 & fc2) + adapter_dim: 32 + alpha: ${model.peft.lora_tuning.adapter_dim} + adapter_dropout: 0.0 + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + layer_selection: null # selects in which layers to add lora adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + # Used for p-tuning peft training + p_tuning: + virtual_tokens: 10 # The number of virtual tokens the prompt encoder should add at the start of the sequence + bottleneck_dim: 1024 # the size of the prompt encoder mlp bottleneck + embedding_dim: 1024 # the size of the prompt encoder embeddings + init_std: 0.023 + + ia3_tuning: + layer_selection: null # selects in which layers to add ia3 adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. 
null will apply adapters to all layers + + selective_tuning: + tunable_base_param_names: ["self_attention", "word_embeddings"] # TODO: regex support @adithyre + + + perception: + modality_adapter: + _target_: nemo.collections.multimodal.speech_llm.modules.PoolingMLPConnectors + hidden_dim: 512 + pooling: 'cat' + pooling_factor: 2 + num_layers: 4 + input_dim: -1 + output_dim: -1 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + encoders: + # use `target` instead of `_target_` to avoid auto initialization by hydra, need to do manual instantiation + asr_model: + target: nemo.collections.asr.models.EncDecRNNTBPEModel + model_dim_key: d_model + freeze: True + pretrained_model: stt_en_fastconformer_transducer_large + ssl_model: + target: nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel + model_dim_key: d_model + freeze: True + pretrained_model: ssl_en_conformer_large + use_multi_layer_feat: True + multi_layer_feat: + layer_idx_list: [0,16] + aggregator: + mode: "cat" + pooling: "avg" + rounding: "floor" + + speaker_model: + segment_length_in_secs: 0.4 + freeze: True + pretrained_model: titanet_large + + ref_model: asr_model + aggregator: + mode: "cat" + pooling: "mean" + rounding: "floor" + + # the following are read from the pretrained audio encoder: + # output_dim: null + # encoder: null + # preprocessor: null + + data: + end_string: "[EOG]" + train_ds: + # Example of how to specify paths to multiple datasets + # manifest_filepath: + # - /path/to/squad.jsonl + # - /path/to/mnli.jsonl + # - /path/to/boolq.jsonl + # Example of how each dataset is formatted + # {'audio_filepath': 'audio1.wav', 'offset': 0.0, 'duration': 12.3, 'context': 'transcribe this audio', 'answer': 'I have a dream...'} + # the 'answer' field can also be 'text', and a default 'context' field is added if missing in manigests, so as to work with ASR manifests + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: True + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: True + # Notably, the data weights are controlled by either bucketing_weights + # or concat_sampling_probabilities depending on the dataset type (tar and + # non-tar). + # See audio_text_qa_dataset.py for details. + concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random' + context_key: 'context' + answer_key: 'answer' + # add_eos: True + add_eos: False + end_string: ${model.data.end_string} + add_sep: False + add_bos: False + separate_prompt_and_response_with_newline: False + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: "Q: {context}\nA: {answer}" # fstring to use for assistant prompt. 
Example: "Q: {input}\nA: {output}" + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + max_duration: 24 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + sample_alpha: null + audio_locator: null + + validation_ds: + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: False + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: False + context_key: ${model.data.train_ds.context_key} + answer_key: ${model.data.train_ds.answer_key} + add_eos: ${model.data.train_ds.add_eos} + end_string: ${model.data.end_string} + add_sep: ${model.data.train_ds.add_sep} + add_bos: ${model.data.train_ds.add_bos} + separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + write_predictions_to_file: False + output_file_path_prefix: null # Prefix of the file to write predictions to. + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: ${model.data.train_ds.prompt_template} # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + tokens_to_generate: 128 + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + audio_locator: ${model.data.train_ds.audio_locator} + + log_every_n_steps: 20 + metric: + name: "wer" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss'] + average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + num_classes: null + + # test_ds: + # manifest_filepath: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + # names: null # Names of the corresponding datasets used to log metrics. + # global_batch_size: ${model.global_batch_size} + # micro_batch_size: ${model.micro_batch_size} + # shuffle: False + # num_workers: 4 + # pin_memory: True + # max_seq_length: 2048 + # min_seq_length: 1 + # drop_last: False + # context_key: 'context' + # answer_key: 'answer' + # add_eos: ${model.data.train_ds.add_eos} + # end_string: ${model.data.end_string} + # add_sep: ${model.data.train_ds.add_sep} + # add_bos: ${model.data.train_ds.add_bos} + # separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + # write_predictions_to_file: False + # output_file_path_prefix: null # Prefix of the file to write predictions to. + # truncation_field: "context" # Options: ['context', 'answer'] + # index_mapping_dir: null # Path to a directory to write index mapping files. + # prompt_template: ${model.data.train_ds.prompt_template} + # # ASR configs + # sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + + # metric: + # name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss'] + # average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. 
Refer to torchmetrics for metrics where this is supported. + # num_classes: null + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0.001 + betas: + - 0.9 + - 0.98 + sched: + name: CosineAnnealing + warmup_steps: 5000 + min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1 + constant_steps: 0 # Constant steps should also be 0 when min_lr=0 + monitor: val_loss + reduce_on_plateau: false diff --git a/examples/multimodal/speech_llm/conf/salm/salm_config.yaml b/examples/multimodal/speech_llm/conf/salm/salm_config.yaml new file mode 100644 index 000000000000..c49e335c8d66 --- /dev/null +++ b/examples/multimodal/speech_llm/conf/salm/salm_config.yaml @@ -0,0 +1,339 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +name: salm_fastconformer_gpt_lora_tuning + +trainer: + devices: 1 + accelerator: gpu + num_nodes: 1 + precision: 16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 100 + max_steps: 1000000 # 1M steps + log_every_n_steps: 10 # frequency with which training steps are logged + val_check_interval: 3000 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch + gradient_clip_val: 1.0 + accumulate_grad_batches: 1 + +exp_manager: + # explicit_log_dir: null + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: validation_${model.data.validation_ds.metric.name} + save_top_k: 1 + mode: min + save_nemo_on_train_end: True + filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{epoch}' + model_parallel_size: ${model.tensor_model_parallel_size} + always_save_nemo: False + save_best_model: True + create_early_stopping_callback: False + early_stopping_callback_params: + monitor: "val_loss" + mode: "min" + min_delta: 0.001 + patience: 10 + verbose: True + strict: False # Should be False to avoid a runtime error where EarlyStopping says monitor is unavailable, which sometimes happens with resumed training. + + +model: + seed: 1234 + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + + pretrained_audio_model: stt_en_fastconformer_transducer_large + freeze_llm: True + freeze_audio_encoder: False + freeze_modality_adapter: False + + global_batch_size: 128 + micro_batch_size: 4 + restore_from_path: ??? # Path to an existing .nemo model you wish to add new tasks to or run inference with + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. 
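The YAML above leans heavily on OmegaConf/Hydra interpolation: values such as ${model.global_batch_size} and the checkpoint monitor validation_${model.data.validation_ds.metric.name} are resolved when the config is loaded rather than hard-coded. The following is a minimal sketch (not part of the patch) of how those interpolations resolve, assuming only that the omegaconf package used elsewhere in these scripts is installed; the config fragment is a trimmed-down stand-in for the files above.

from omegaconf import OmegaConf

# Stand-in for the configs above; only the keys needed to show interpolation.
cfg = OmegaConf.create(
    """
    model:
      global_batch_size: 128
      micro_batch_size: 4
      data:
        validation_ds:
          metric:
            name: loss
          global_batch_size: ${model.global_batch_size}
    exp_manager:
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
    """
)

# Interpolations resolve lazily on access (or eagerly via resolve=True).
assert cfg.model.data.validation_ds.global_batch_size == 128
assert cfg.exp_manager.checkpoint_callback_params.monitor == "validation_loss"
print(OmegaConf.to_yaml(cfg, resolve=True))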
+ save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training. + sync_batch_comm: False + megatron_amp_O2: False + + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. + sequence_parallel: False + + ## Activation Checkpoint + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null # not used with 'selective' + activations_checkpoint_layers_per_pipeline: null + answer_only_loss: True + gradient_as_bucket_view: False + + hidden_dropout: 0.0 + attention_dropout: 0.0 + ffn_dropout: 0.0 + + peft: + peft_scheme: "lora" # can be either lora, adapter, ia3 or ptuning + restore_from_path: null + + # Used for adapter peft training + adapter_tuning: + type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter' + adapter_dim: 32 + adapter_dropout: 0.0 + norm_position: 'pre' # This can be set to 'pre', 'post' or null, 'pre' is normally what is used. + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + norm_type: 'mixedfusedlayernorm' # IGNORED if layer_adapter is used, options are ['layernorm', 'mixedfusedlayernorm'] + layer_selection: null # selects in which layers to add adapters, e.g. [1,12] will add adapters to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + lora_tuning: + target_modules: ['attention_qkv','attention_dense','mlp_fc1','mlp_fc2'] # this can either be 'attention_qkv','attention_dense','mlp_fc1','mlp_fc2', attention (qkv & dense), mlp (fc1 & fc2) + adapter_dim: 32 + alpha: ${model.peft.lora_tuning.adapter_dim} + adapter_dropout: 0.0 + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + layer_selection: null # selects in which layers to add lora adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + + # Used for p-tuning peft training + p_tuning: + virtual_tokens: 10 # The number of virtual tokens the prompt encoder should add at the start of the sequence + bottleneck_dim: 1024 # the size of the prompt encoder mlp bottleneck + embedding_dim: 1024 # the size of the prompt encoder embeddings + init_std: 0.023 + + ia3_tuning: + layer_selection: null # selects in which layers to add ia3 adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. 
null will apply adapters to all layers + + selective_tuning: + tunable_base_param_names: ["self_attention", "word_embeddings"] # TODO: regex support @adithyre + + + perception: + use_multi_layer_feat: false # whether to extract multi-layer features, only supports conformer encoder + multi_layer_feat: + layer_idx_list: [0,16] # layer indices to extract features from + aggregator: + mode: "cat" # ways to combine features from different layers, choices=['cat','sum','mean', 'max', 'min'], default to concat ('cat') + pooling: "avg" # ways to pool features if they have different temporal lengths and align_mode=min, choices=['mean', 'max', 'min'] + align_mode: "min" # if features have different temporal lengths, set `min` to pool to the shortest length or `max` to repeat to the longest. + + modality_adapter: + _target_: nemo.collections.asr.modules.ConformerEncoder + feat_in: 1024 + feat_out: -1 # you may set it if you need different output size other than the default d_model + n_layers: 2 + d_model: 512 + + # Sub-sampling parameters + subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding + subsampling_factor: 8 # must be power of 2 for striding and vggnet + subsampling_conv_channels: 256 # set to -1 to make it equal to the d_model + causal_downsampling: false + + # Reduction parameters: Can be used to add another subsampling layer at a given position. + # Having a 2x reduction will speedup the training and inference speech while keeping similar WER. + # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup. + reduction: null # pooling, striding, or null + reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder + reduction_factor: 1 + + # Feed forward module's params + ff_expansion_factor: 4 + + # Multi-headed Attention Module's params + self_attention_model: rel_pos # rel_pos or abs_pos + n_heads: 8 # may need to be lower for smaller d_models + # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention + att_context_size: [-1, -1] # -1 means unlimited context + att_context_style: regular # regular or chunked_limited + xscaling: true # scales up the input embeddings by sqrt(d_model) + untie_biases: true # unties the biases of the TransformerXL layers + pos_emb_max_len: 5000 + + # Convolution module's params + conv_kernel_size: 9 + conv_norm_type: 'batch_norm' # batch_norm or layer_norm or groupnormN (N specifies the number of groups) + # conv_context_size can be"causal" or a list of two integers while conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size + # null means [(kernel_size-1)//2, (kernel_size-1)//2], and 'causal' means [(kernel_size-1), 0] + conv_context_size: null + + ### regularization + dropout: 0.1 # The dropout used in most of the Conformer Modules + dropout_pre_encoder: 0.1 # The dropout used before the encoder + dropout_emb: 0.0 # The dropout used for embeddings + dropout_att: 0.1 # The dropout for multi-headed attention modules + + # set to non-zero to enable stochastic depth + stochastic_depth_drop_prob: 0.0 + stochastic_depth_mode: linear # linear or uniform + stochastic_depth_start_layer: 1 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + # the following are read from the pretrained audio encoder: + # output_dim: null + # encoder: null + # 
preprocessor: null + + data: + end_string: "[EOG]" + train_ds: + # Example of how to specify paths to multiple datasets + # manifest_filepath: + # - /path/to/squad.jsonl + # - /path/to/mnli.jsonl + # - /path/to/boolq.jsonl + # Example of how each dataset is formatted + # {'audio_filepath': 'audio1.wav', 'offset': 0.0, 'duration': 12.3, 'question': 'transcribe this audio', 'answer': 'I have a dream...'} + # the 'answer' field can also be 'text', and a default 'question' field is added if missing in manigests, so as to work with ASR manifests + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: True + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: True + # Notably, the data weights are controlled by either bucketing_weights + # or concat_sampling_probabilities depending on the dataset type (tar and + # non-tar). + # See audio_text_qa_dataset.py for details. + concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random' + context_key: 'context' + answer_key: 'answer' + add_eos: True + # add_eos: False + end_string: ${model.data.end_string} + add_sep: False + add_bos: False + separate_prompt_and_response_with_newline: False + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: "Q: {context}\nA: {answer}" # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + max_duration: 24 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "fully_randomized" + bucketing_batch_size: null + # sample_alpha: 0.1 + + validation_ds: + manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + global_batch_size: ${model.global_batch_size} + micro_batch_size: ${model.micro_batch_size} + shuffle: False + num_workers: 0 + pin_memory: True + max_seq_length: 2048 + min_seq_length: 1 + drop_last: False + context_key: ${model.data.train_ds.context_key} + answer_key: ${model.data.train_ds.answer_key} + add_eos: ${model.data.train_ds.add_eos} + end_string: ${model.data.end_string} + add_sep: ${model.data.train_ds.add_sep} + add_bos: ${model.data.train_ds.add_bos} + separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + write_predictions_to_file: False + output_file_path_prefix: null # Prefix of the file to write predictions to. + truncation_field: "context" # Options: ['context', 'answer'] + index_mapping_dir: null # Path to a directory to write index mapping files. + prompt_template: ${model.data.train_ds.prompt_template} # fstring to use for assistant prompt. Example: "Q: {input}\nA: {output}" + tokens_to_generate: 128 + # ASR configs + sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + + log_every_n_steps: 10 + metric: + name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss', 'wer', 'bleu', 'rouge'] + average: null # Average the metric over the dataset. Options: ['macro', 'micro']. 
Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + num_classes: null + + # test_ds: + # manifest_filepath: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds. + # names: null # Names of the corresponding datasets used to log metrics. + # global_batch_size: ${model.global_batch_size} + # micro_batch_size: ${model.micro_batch_size} + # shuffle: False + # num_workers: 4 + # pin_memory: True + # max_seq_length: 2048 + # min_seq_length: 1 + # drop_last: False + # context_key: 'input' + # answer_key: 'output' + # add_eos: ${model.data.train_ds.add_eos} + # end_string: ${model.data.end_string} + # add_sep: ${model.data.train_ds.add_sep} + # add_bos: ${model.data.train_ds.add_bos} + # separate_prompt_and_response_with_newline: ${model.data.train_ds.separate_prompt_and_response_with_newline} + # write_predictions_to_file: False + # output_file_path_prefix: null # Prefix of the file to write predictions to. + # truncation_field: "context" # Options: ['context', 'answer'] + # index_mapping_dir: null # Path to a directory to write index mapping files. + # prompt_template: ${model.data.train_ds.prompt_template} + # # ASR configs + # sample_rate: 16000 #${model.audio_encoder.preprocessor.sample_rate} + + # metric: + # name: "loss" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss'] + # average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported. + # num_classes: null + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0.001 + betas: + - 0.9 + - 0.98 + sched: + name: CosineAnnealing + warmup_steps: 2000 + min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1 + constant_steps: 0 # Constant steps should also be 0 when min_lr=0 + monitor: val_loss + reduce_on_plateau: false diff --git a/examples/multimodal/speech_llm/modular_audio_gpt_eval.py b/examples/multimodal/speech_llm/modular_audio_gpt_eval.py new file mode 100644 index 000000000000..d76e479829fa --- /dev/null +++ b/examples/multimodal/speech_llm/modular_audio_gpt_eval.py @@ -0,0 +1,118 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from pathlib import Path + +import torch.multiprocessing as mp +from omegaconf.omegaconf import OmegaConf + +from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel +from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerBuilder +from nemo.core.config import hydra_runner +from nemo.utils import logging + +mp.set_start_method("spawn", force=True) + +""" +This is the script to run inference with a ModularAudioGPTModel. 
+ +If you want to evaluate an ModularAudioGPTModel: + +MEGATRON_CKPT=/path/to/megatron-llm.nemo +ALM_DIR=/path/to/nemo_experiments/job_name +ALM_YAML=$ALM_DIR/version_0/hparams.yaml +ALM_CKPT="$ALM_DIR/checkpoints/AudioGPT--validation_wer\=0.5-step\=103-epoch\=0-last.ckpt" + +VAL_MANIFESTS="[/data/libri-test-other.json,/data/MCV_7.1_test.json,/data/wsj-test.json]" +VAL_NAMES="[ls-test-other,mcv7.1-test,wsj-test]" + +HYDRA_FULL_ERROR=1 \ +CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \ + model.restore_from_path=$MEGATRON_CKPT \ + model.peft.restore_from_path=$ALM_CKPT \ + model.peft.restore_from_hparams_path=$ALM_YAML \ + model.data.test_ds.manifest_filepath=$VAL_MANIFESTS \ + model.data.test_ds.names=$VAL_NAMES \ + model.data.test_ds.global_batch_size=8 \ + model.data.test_ds.micro_batch_size=8 \ + model.data.test_ds.tokens_to_generate=256 \ + ++inference.greedy=False \ + ++inference.top_k=50 \ + ++inference.top_p=0.95 \ + ++inference.temperature=0.4 \ + ++inference.repetition_penalty=1.2 \ + ++model.data.test_ds.output_dir=${ALM_DIR} +""" + + +@hydra_runner(config_path="conf", config_name="modular_audio_gpt_config_eval") +def main(cfg) -> None: + logging.info("\n\n************** Experiment configuration ***********") + logging.info(f"\n{OmegaConf.to_yaml(cfg)}") + logging.info("**************************************************\n\n") + + trainer = MegatronTrainerBuilder(cfg).create_trainer() + + if cfg.model.from_pretrained: + # Load model from NGC or HuggingFace + logging.info(f"Loading model from cloud: {cfg.model.from_pretrained}") + model_cfg = ModularAudioGPTModel.from_pretrained( + cfg.model.from_pretrained, trainer=trainer, return_config=True + ) + model_cfg = ModularAudioGPTModel.merge_inference_cfg(cfg, trainer, model_cfg) + model_file = ModularAudioGPTModel.from_pretrained( + cfg.model.from_pretrained, trainer=trainer, return_model_file=True + ) + model = ModularAudioGPTModel.restore_from( + restore_path=model_file, + trainer=trainer, + override_config_path=model_cfg, + strict=False, + map_location="cpu", + ) + if "peft" in model_cfg and model_cfg.peft.get("peft_scheme", None): + # need this due to the way that MegatronGPTSFTModel doesn't load adapters in model initialization + model.load_adapters(model_file, map_location="cpu") + else: + # Load model from a local file + model_cfg = ModularAudioGPTModel.merge_inference_cfg(cfg, trainer) + model = ModularAudioGPTModel.restore_from( + restore_path=cfg.model.restore_from_path, + trainer=trainer, + override_config_path=model_cfg, + strict=False, + map_location="cpu", + ) + model = ModularAudioGPTModel.load_adapters_for_inference(cfg, model_cfg, model) + model = ModularAudioGPTModel.load_audio_encoder_for_inference(cfg, model_cfg, model) + + model.freeze() + if cfg.get("save_as_nemo", None): + model.setup("predict") # need to call setup() to load adapters and prepare for saving + model.save_to(cfg.save_as_nemo) + logging.info(f"Model saved to {Path(cfg.save_as_nemo).absolute()}, exiting...") + exit(0) + + if not cfg.model.get('use_flash_attention', False): + cfg.inference.compute_attention_mask = True + config = OmegaConf.to_container(cfg.inference, resolve=True) + model.set_inference_config(config) + + # run inference + trainer.test(model) + + +if __name__ == "__main__": + main() diff --git a/examples/multimodal/speech_llm/modular_audio_gpt_train.py b/examples/multimodal/speech_llm/modular_audio_gpt_train.py new file mode 100644 index 000000000000..04bff37e7a3f --- /dev/null +++ 
b/examples/multimodal/speech_llm/modular_audio_gpt_train.py @@ -0,0 +1,70 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch.multiprocessing as mp +from omegaconf.omegaconf import OmegaConf, open_dict + +from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel +from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronLMPPTrainerBuilder +from nemo.core.config import hydra_runner +from nemo.utils import logging +from nemo.utils.exp_manager import exp_manager + +mp.set_start_method("spawn", force=True) + +""" +MEGATRON_CKPT=/path/to/megatron-llm.nemo +ASR_MODEL=/path/to/asr-model.nemo + +TRAIN_MANIFESTS="[/data/train_1.json,/data/train_2.json]" +VAL_MANIFESTS="[/data/dev_1.json,/data/dev_2.json]" +VAL_NAMES="[dev-1,dev-2]" + +CUDA_VISIBLE_DEVICES="0,1" python modular_audio_gpt_train.py --config-path="./conf" --config-name "modular_audio_gpt_config_peft" \ + trainer.devices=-1 \ + model.freeze_audio_encoder=True \ + model.freeze_llm=True \ + model.global_batch_size=4 \ + model.micro_batch_size=2 \ + model.pretrained_audio_model=$ASR_MODEL \ + model.restore_from_path=$MEGATRON_MODEL \ + model.data.train_ds.manifest_filepath=$TRAIN_MANIFESTS \ + model.data.validation_ds.manifest_filepath=$VAL_MANIFESTS \ + ++model.data.validation_ds.names=$VAL_NAMES \ +""" + + +@hydra_runner(config_path="conf", config_name="modular_audio_gpt_config_peft") +def main(cfg) -> None: + logging.info("\n\n************** Experiment configuration ***********") + logging.info(f'\n{OmegaConf.to_yaml(cfg)}') + # hydra interpolation does not work here as the interpolation key is lost when PTL saves hparams + with open_dict(cfg): + cfg.model.precision = cfg.trainer.precision + + precision = cfg.trainer.precision + trainer = MegatronLMPPTrainerBuilder(cfg).create_trainer() + cfg.trainer.precision = precision + + exp_manager(trainer, cfg.exp_manager) + # update resume from checkpoint found by exp_manager + logging.info(f'Resuming training from checkpoint: {trainer.ckpt_path}') + + model = ModularAudioGPTModel.restore_from_pretrained_models(cfg, trainer=trainer) + + trainer.fit(model) + + +if __name__ == '__main__': + main() diff --git a/nemo/collections/asr/modules/conformer_encoder.py b/nemo/collections/asr/modules/conformer_encoder.py index b9642b3ea5dc..d0e014e42a37 100644 --- a/nemo/collections/asr/modules/conformer_encoder.py +++ b/nemo/collections/asr/modules/conformer_encoder.py @@ -16,7 +16,7 @@ import random from collections import OrderedDict from dataclasses import dataclass -from typing import List, Optional, Set +from typing import List, Optional, Set, Tuple import torch import torch.distributed @@ -356,7 +356,9 @@ def __init__( if reduction and reduction_factor > 1: assert reduction_position >= -1 and reduction_position < n_layers self.reduction_subsampling = SubsamplingReductionModule( - reduction=reduction, d_model=d_model, reduction_factor=reduction_factor, + reduction=reduction, + 
d_model=d_model, + reduction_factor=reduction_factor, ) self.reduction_position = reduction_position else: @@ -804,15 +806,15 @@ def setup_streaming_params( max_context: int = 10000, ): """ - This function sets the needed values and parameters to perform streaming. The configuration would be stored in self.streaming_cfg. - The streaming configuration is needed to simulate streaming inference. - - Args: - chunk_size (int): overrides the chunk size - shift_size (int): overrides the shift size for chunks - left_chunks (int): overrides the number of left chunks visible to each chunk - max_context (int): the value used for the cache size of last_channel layers if left context is set to infinity (-1) - Defaults to -1 (means feat_out is d_model) + This function sets the needed values and parameters to perform streaming. The configuration would be stored in self.streaming_cfg. + The streaming configuration is needed to simulate streaming inference. + + Args: + chunk_size (int): overrides the chunk size + shift_size (int): overrides the shift size for chunks + left_chunks (int): overrides the number of left chunks visible to each chunk + max_context (int): the value used for the cache size of last_channel layers if left context is set to infinity (-1) + Defaults to -1 (means feat_out is d_model) """ streaming_cfg = CacheAwareStreamingConfig() @@ -903,12 +905,19 @@ def get_initial_cache_state(self, batch_size=1, dtype=torch.float32, device=None create_tensor = torch.zeros last_time_cache_size = self.conv_context_size[0] cache_last_channel = create_tensor( - (len(self.layers), batch_size, self.streaming_cfg.last_channel_cache_size, self.d_model,), + ( + len(self.layers), + batch_size, + self.streaming_cfg.last_channel_cache_size, + self.d_model, + ), device=device, dtype=dtype, ) cache_last_time = create_tensor( - (len(self.layers), batch_size, self.d_model, last_time_cache_size), device=device, dtype=dtype, + (len(self.layers), batch_size, self.d_model, last_time_cache_size), + device=device, + dtype=dtype, ) if max_dim > 0: cache_last_channel_len = torch.randint( @@ -934,7 +943,6 @@ def change_attention_model( update_config: bool = True, device: torch.device = None, ): - """ Update the self_attention_model which changes the positional encoding and attention layers. @@ -1053,7 +1061,7 @@ def change_attention_model( def change_subsampling_conv_chunking_factor(self, subsampling_conv_chunking_factor: int): """ - Update the conv_chunking_factor (int) + Update the conv_chunking_factor (int) Default is 1 (auto) Set it to -1 (disabled) or to a specific value (power of 2) if you OOM in the conv subsampling layers @@ -1098,7 +1106,9 @@ def _update_adapter_cfg_input_dim(self, cfg: DictConfig): cfg = adapter_utils.update_adapter_cfg_input_dim(self, cfg, module_dim=self.d_model) return cfg - def get_accepted_adapter_types(self,) -> Set[type]: + def get_accepted_adapter_types( + self, + ) -> Set[type]: types = super().get_accepted_adapter_types() if len(types) == 0: @@ -1113,6 +1123,85 @@ def get_accepted_adapter_types(self,) -> Set[type]: return types +class ConformerMultiLayerFeatureExtractor(NeuralModule, Exportable, AccessMixin): + """ + A wrapper module that extracts features from multiple layers of a ConformerEncoder, + by reusing existing mechanisim for interctc loss. + To use it, set `layer_idx_list` to specify the indices of layers to extract from. + Also, you can specify an `aggretator` module to aggregate the features from different layers, default not aggregating. 
+ """ + + def __init__( + self, + encoder: ConformerEncoder, + layer_idx_list: List[int], + aggregator: NeuralModule = None, + detach: bool = False, + convert_to_cpu: bool = False, + ): + super().__init__() + self.encoder = encoder + self.layer_idx_list = [int(l) for l in layer_idx_list] + for x in self.layer_idx_list: + if x < 0 or x >= len(encoder.layers): + raise ValueError(f"layer index {x} out of range [0, {len(encoder.layers)})") + self.enc_access_cfg = { + "interctc": { + "capture_layers": self.layer_idx_list, + }, + "detach": detach, + "convert_to_cpu": convert_to_cpu, + } + self.aggregator = aggregator + + def forward( + self, audio_signal, length, cache_last_channel=None, cache_last_time=None, cache_last_channel_len=None + ) -> Tuple[torch.Tensor, torch.Tensor]: + old_access_flag = self.is_access_enabled(guid=getattr(self, "model_guid", None)) + self.update_access_cfg(self.enc_access_cfg, guid=getattr(self, "model_guid", None)) + self.set_access_enabled(access_enabled=True, guid=getattr(self, "model_guid", None)) + + _ = self.encoder( + audio_signal=audio_signal, + length=length, + cache_last_channel=cache_last_channel, + cache_last_time=cache_last_time, + cache_last_channel_len=cache_last_channel_len, + ) + + ### chunk of code adapted from ConformerEncoder.forward_internal() + total_registry = {} + for module_registry in self.get_module_registry(self.encoder).values(): + for key in module_registry: + if key.startswith("interctc/") and key in total_registry: + raise RuntimeError(f"layer {key} has been logged multiple times!") + total_registry.update(module_registry) + + encoded_list = [] + encoded_len_list = [] + for layer_idx in self.layer_idx_list: + try: + layer_outputs = total_registry[f"interctc/layer_output_{layer_idx}"] + layer_lengths = total_registry[f"interctc/layer_length_{layer_idx}"] + except KeyError: + raise RuntimeError( + f"Intermediate layer {layer_idx} was not captured! Check the layer index and the number of ConformerEncoder layers." + ) + if len(layer_outputs) > 1 or len(layer_lengths) > 1: + raise RuntimeError("Make sure encoder.forward is called exactly one time") + encoded_list.append(layer_outputs[0]) # [B, D, T] + encoded_len_list.append(layer_lengths[0]) # [B] + + self.encoder.reset_registry() + self.set_access_enabled(access_enabled=old_access_flag, guid=getattr(self, "model_guid", None)) + ### end of adapted chunk + + if self.aggregator is not None: + return self.aggregator(encoded_list, encoded_len_list) # Tensor[B,D*L,T], Tensor[B] + else: + return encoded_list, encoded_len_list # List[Tensor[B,D,T]], List[Tensor[B]] + + """ Register any additional information """ diff --git a/nemo/collections/asr/parts/mixins/transcription.py b/nemo/collections/asr/parts/mixins/transcription.py index 5a71679607be..c252d498dc08 100644 --- a/nemo/collections/asr/parts/mixins/transcription.py +++ b/nemo/collections/asr/parts/mixins/transcription.py @@ -67,18 +67,18 @@ class TranscribeConfig: _internal: Optional[InternalTranscribeConfig] = None -def move_to_device(batch, device): +def move_to_device(batch, device, non_blocking=False): """ Recursively move all tensors in `batch` to `device`. 
""" if isinstance(batch, torch.Tensor): - return batch.to(device) + return batch.to(device, non_blocking=non_blocking) elif isinstance(batch, (list, tuple)): - return [move_to_device(x, device) for x in batch] + return [move_to_device(x, device, non_blocking) for x in batch] elif isinstance(batch, dict): - return {k: move_to_device(v, device) for k, v in batch.items()} + return {k: move_to_device(v, device, non_blocking) for k, v in batch.items()} else: - raise TypeError(f"Unsupported type: {type(batch)}") + return batch # do nothing if not supported type def get_value_from_transcription_config(trcfg, key, default): diff --git a/nemo/collections/common/data/dataset.py b/nemo/collections/common/data/dataset.py index c2c29b54f7f6..71220dd9d5f2 100644 --- a/nemo/collections/common/data/dataset.py +++ b/nemo/collections/common/data/dataset.py @@ -26,12 +26,12 @@ class ConcatDataset(IterableDataset): """ - A dataset that accepts as argument multiple datasets and then samples from them based on the specified + A dataset that accepts as argument multiple datasets and then samples from them based on the specified sampling technique. Args: datasets (list): A list of datasets to sample from. - shuffle (bool): Whether to shuffle individual datasets. Only works with non-iterable datasets. + shuffle (bool): Whether to shuffle individual datasets. Only works with non-iterable datasets. Defaults to True. sampling_technique (str): Sampling technique to choose which dataset to draw a sample from. Defaults to 'temperature'. Currently supports 'temperature', 'random' and 'round-robin'. @@ -73,7 +73,9 @@ def __init__( self.sampling_kwargs['seed'] = seed elif sampling_technique == 'random': self.index_generator = ConcatDataset.random_generator - self.sampling_kwargs['p'] = sampling_probabilities + self.sampling_kwargs['p'] = ( + sampling_probabilities if sampling_probabilities else [1 / len(datasets)] * len(datasets) + ) self.sampling_kwargs['seed'] = seed elif sampling_technique == 'round-robin': self.index_generator = ConcatDataset.round_robin_generator @@ -200,7 +202,7 @@ def random_generator(datasets, **kwargs): class ConcatMapDataset(Dataset): """ - A dataset that accepts as argument multiple datasets and then samples from them based on the specified + A dataset that accepts as argument multiple datasets and then samples from them based on the specified sampling technique. Args: @@ -300,7 +302,7 @@ class CodeSwitchedDataset(IterableDataset): Args: datasets (list): A list of datasets lang_probs (list): A list of probabilities (which must sum to 1) corresponding to the sampling probability for each dataset - shuffle (bool): Whether to shuffle individual datasets. Only works with non-iterable datasets. + shuffle (bool): Whether to shuffle individual datasets. Only works with non-iterable datasets. Defaults to True. min_duration (int): the minimum duration (secs) of each synthetic code-switched sample. Will draw randomly until this is hit. 
Defaults to 4 @@ -535,7 +537,7 @@ def build_single_CS_sample(self): wav = np.trim_zeros(wav) # normalise to provided DB level - wav_norm = wav * (10.0 ** (self.db_norm / 20.0) / np.maximum(0.01, (wav ** 2).mean(axis=0) ** 0.5)) + wav_norm = wav * (10.0 ** (self.db_norm / 20.0) / np.maximum(0.01, (wav**2).mean(axis=0) ** 0.5)) # this part appends the normed waveform to the existing waveform, and inserts pause_join amount of silence # if necessary, otherwise just a straight append diff --git a/nemo/collections/common/metrics/__init__.py b/nemo/collections/common/metrics/__init__.py index 322e62214ead..9e21d93816a9 100644 --- a/nemo/collections/common/metrics/__init__.py +++ b/nemo/collections/common/metrics/__init__.py @@ -14,5 +14,9 @@ from nemo.collections.common.metrics.classification_accuracy import TopKClassificationAccuracy from nemo.collections.common.metrics.global_average_loss_metric import GlobalAverageLossMetric -from nemo.collections.common.metrics.metric_string_to_torchmetric import MetricStringToTorchMetric +from nemo.collections.common.metrics.metric_string_to_torchmetric import ( + ClassificationMetricsSet, + MetricStringToTorchMetric, + TextMetricsSet, +) from nemo.collections.common.metrics.perplexity import Perplexity diff --git a/nemo/collections/common/metrics/metric_string_to_torchmetric.py b/nemo/collections/common/metrics/metric_string_to_torchmetric.py index b38047b576cc..f91c915309f2 100644 --- a/nemo/collections/common/metrics/metric_string_to_torchmetric.py +++ b/nemo/collections/common/metrics/metric_string_to_torchmetric.py @@ -13,11 +13,13 @@ # limitations under the License. from torchmetrics import Accuracy, AveragePrecision, F1Score, MatthewsCorrCoef, PearsonCorrCoef, SpearmanCorrCoef +from torchmetrics.text import SacreBLEUScore from torchmetrics.text.rouge import ROUGEScore +from torchmetrics.text.wer import WordErrorRate from nemo.collections.common.metrics.classification_accuracy import ExactStringMatchMetric, TokenF1Score -__all__ = ['MetricStringToTorchMetric'] +__all__ = ['MetricStringToTorchMetric', 'TextMetricsSet', 'ClassificationMetricsSet'] # Dictionary that maps a metric string name to its corresponding torchmetric class. @@ -31,4 +33,10 @@ 'matthews_corr_coef': MatthewsCorrCoef, 'exact_string_match': ExactStringMatchMetric, 'rouge': ROUGEScore, + 'wer': WordErrorRate, + 'bleu': SacreBLEUScore, } + +TextMetricsSet = set(['rouge', 'wer', 'bleu']) + +ClassificationMetricsSet = set(['accuracy', 'average_precision', 'f1', 'exact_string_match']) diff --git a/nemo/collections/common/parts/preprocessing/collections.py b/nemo/collections/common/parts/preprocessing/collections.py index 66def034400f..24ca6cffe458 100644 --- a/nemo/collections/common/parts/preprocessing/collections.py +++ b/nemo/collections/common/parts/preprocessing/collections.py @@ -17,11 +17,11 @@ import os from itertools import combinations from typing import Any, Dict, Iterable, List, Optional, Union - +import numpy as np import pandas as pd from nemo.collections.common.parts.preprocessing import manifest, parsers -from nemo.utils import logging +from nemo.utils import logging, logging_mode class _Collection(collections.UserList): @@ -320,7 +320,13 @@ def __init__(self, manifests_files: Union[str, List[str]], *args, **kwargs): **kwargs: Kwargs to pass to `AudioText` constructor. 
""" - ids, audio_files, durations, texts, offsets, = ( + ( + ids, + audio_files, + durations, + texts, + offsets, + ) = ( [], [], [], @@ -343,6 +349,19 @@ def __init__(self, manifests_files: Union[str, List[str]], *args, **kwargs): ) +class SpeechLLMAudioTextEntity(object): + def __init__(self, sid, audio_file, duration, context, answer, offset, speaker, orig_sr, lang) -> None: + self.id = sid + self.audio_file = audio_file + self.duration = duration + self.context = context + self.answer = answer + self.offset = offset + self.speaker = speaker + self.orig_sr = orig_sr + self.lang = lang + + class ASRVideoText(VideoText): """`VideoText` collector from cv structured json files.""" @@ -356,7 +375,13 @@ def __init__(self, manifests_files: Union[str, List[str]], *args, **kwargs): **kwargs: Kwargs to pass to `VideoText` constructor. """ - ids, video_files, durations, texts, offsets, = ( + ( + ids, + video_files, + durations, + texts, + offsets, + ) = ( [], [], [], @@ -379,10 +404,272 @@ def __init__(self, manifests_files: Union[str, List[str]], *args, **kwargs): ) +class SpeechLLMAudioText(object): + """List of audio-transcript text correspondence with preprocessing. + + All of the audio, duration, context, answer are optional. + If answer is not present, text is treated as the answer. + """ + + def __init__( + self, + ids: List[int], + audio_files: List[str], + durations: List[float], + context_list: List[str], + answers: List[str], + offsets: List[str], + speakers: List[Optional[int]], + orig_sampling_rates: List[Optional[int]], + langs: List[Optional[str]], + min_duration: Optional[float] = None, + max_duration: Optional[float] = None, + max_number: Optional[int] = None, + do_sort_by_duration: bool = False, + index_by_file_id: bool = False, + max_num_samples: Optional[int] = None, + ): + """Instantiates audio-context-answer manifest with filters and preprocessing. + + + Args: + ids: List of examples positions. + audio_files: List of audio files. + durations: List of float durations. + context_list: List of raw text transcripts. + answers: List of raw text transcripts. + offsets: List of duration offsets or None. + speakers: List of optional speakers ids. + orig_sampling_rates: List of original sampling rates of audio files. + langs: List of language ids, one for eadh sample, or None. + min_duration: Minimum duration to keep entry with (default: None). + max_duration: Maximum duration to keep entry with (default: None). + max_number: Maximum number of samples to collect. + do_sort_by_duration: True if sort samples list by duration. Not compatible with index_by_file_id. + index_by_file_id: If True, saves a mapping from filename base (ID) to index in data. + """ + + data, duration_filtered, num_filtered, total_duration = [], 0.0, 0, 0.0 + if index_by_file_id: + self.mapping = {} + + for id_, audio_file, duration, offset, context, answer, speaker, orig_sr, lang in zip( + ids, audio_files, durations, offsets, context_list, answers, speakers, orig_sampling_rates, langs + ): + # Duration filters. 
+ if duration is not None: + curr_min_dur = min(duration) if isinstance(duration, list) else duration + curr_max_dur = max(duration) if isinstance(duration, list) else duration + curr_sum_dur = sum(duration) if isinstance(duration, list) else duration + if min_duration is not None and curr_min_dur < min_duration: + duration_filtered += curr_sum_dur + num_filtered += 1 + continue + + if max_duration is not None and curr_max_dur > max_duration: + duration_filtered += curr_sum_dur + num_filtered += 1 + continue + total_duration += curr_sum_dur + + if answer is None: + duration_filtered += curr_sum_dur + num_filtered += 1 + continue + + data.append( + SpeechLLMAudioTextEntity(id_, audio_file, duration, context, answer, offset, speaker, orig_sr, lang) + ) + if index_by_file_id and audio_file is not None: + file_id, _ = os.path.splitext(os.path.basename(audio_file)) + if file_id not in self.mapping: + self.mapping[file_id] = [] + self.mapping[file_id].append(len(data) - 1) + + # Max number of entities filter. + if len(data) == max_number: + break + + if max_num_samples is not None and not index_by_file_id: + if max_num_samples <= len(data): + logging.info(f"Subsampling dataset from {len(data)} to {max_num_samples} samples") + data = data[:max_num_samples] + else: + logging.info(f"Oversampling dataset from {len(data)} to {max_num_samples} samples") + data = data * (max_num_samples // len(data)) + res_num = max_num_samples % len(data) + res_data = [data[idx] for idx in np.random.choice(len(data), res_num, replace=False)] + data.extend(res_data) + elif max_num_samples is not None and index_by_file_id: + logging.warning("Tried to subsample dataset by max_num_samples, but cannot since index_by_file_id is set.") + + if do_sort_by_duration: + if index_by_file_id: + logging.warning("Tried to sort dataset by duration, but cannot since index_by_file_id is set.") + else: + data.sort(key=lambda entity: entity.duration) + + logging.info("Dataset loaded with %d files totalling %.2f hours", len(data), total_duration / 3600) + logging.info("%d files were filtered totalling %.2f hours", num_filtered, duration_filtered / 3600) + + self.data = data + + def __getitem__(self, idx): + if idx < 0 or idx > len(self.data): + raise ValueError(f"index out of range [0,{len(self.data)}), got {idx} instead") + return self.data[idx] + + def __len__(self): + return len(self.data) + + +class SpeechLLMAudioTextCollection(SpeechLLMAudioText): + """`SpeechLLMAudioText` collector from SpeechLLM json files. + + This collector also keeps backward compatibility with SpeechLLMAudioText. + """ + + def __init__( + self, + manifests_files: Union[str, List[str]], + context_file: Optional[Union[List[str], str]] = None, + context_key: str = "context", + answer_key: str = "answer", + *args, + **kwargs, + ): + """Parse lists of audio files, durations and transcripts texts. + + Args: + manifests_files: Either single string file or list of such - + manifests to yield items from. + *args: Args to pass to `AudioText` constructor. + **kwargs: Kwargs to pass to `AudioText` constructor. 
+ """ + self.context_key = context_key + self.answer_key = answer_key + + ( + ids, + audio_files, + durations, + context_list, + answers, + offsets, + ) = ( + [], + [], + [], + [], + [], + [], + ) + speakers, orig_srs, langs = ( + [], + [], + [], + ) + if context_file is not None: + question_file_list = context_file.split(",") if isinstance(context_file, str) else context_file + self.context_list = [] + for filepath in question_file_list: + with open(filepath, 'r') as f: + for line in f.readlines(): + line = line.strip() + if line: + self.context_list.append(line) + logging.info(f"Use random text context from {context_file} for {manifests_files}") + else: + self.context_list = None + + for item in manifest.item_iter(manifests_files, parse_func=self.__parse_item): + ids.append(item['id']) + audio_files.append(item['audio_file']) + durations.append(item['duration']) + context_list.append(item['context']) + answers.append(item['answer']) + offsets.append(item['offset']) + speakers.append(item['speaker']) + orig_srs.append(item['orig_sr']) + langs.append(item['lang']) + super().__init__( + ids, audio_files, durations, context_list, answers, offsets, speakers, orig_srs, langs, *args, **kwargs + ) + + def __parse_item(self, line: str, manifest_file: str) -> Dict[str, Any]: + item = json.loads(line) + + # Audio file + if 'audio_filename' in item: + item['audio_file'] = item.pop('audio_filename') + elif 'audio_filepath' in item: + item['audio_file'] = item.pop('audio_filepath') + elif 'audio_file' not in item: + item['audio_file'] = None + + # If the audio path is a relative path and does not exist, + # try to attach the parent directory of manifest to the audio path. + # Revert to the original path if the new path still doesn't exist. + # Assume that the audio path is like "wavs/xxxxxx.wav". + if item['audio_file'] is not None: + item['audio_file'] = manifest.get_full_path(audio_file=item['audio_file'], manifest_file=manifest_file) + + # Duration. + if 'duration' not in item: + item['duration'] = None + + # Answer. + if self.answer_key in item: + item['answer'] = item.pop(self.answer_key) + elif 'text' in item: + # compatability with ASR manifests that uses 'text' as answer key + item['answer'] = item.pop('text') + elif 'text_filepath' in item: + with open(item.pop('text_filepath'), 'r') as f: + item['answer'] = f.read() + else: + item['answer'] = "na" + + # context. 
+ if self.context_key in item: + item['context'] = item.pop(self.context_key) + elif 'context_filepath' in item: + with open(item.pop('context_filepath'), 'r') as f: + item['context'] = f.read() + elif self.context_list is not None: + context = np.random.choice(self.context_list).strip() + item['context'] = context + elif 'question' in item: + # compatability with old manifests that uses 'question' as context key + logging.warning( + f"Neither `{self.context_key}` is found nor `context_file` is set, but found `question` in item: {item}", + mode=logging_mode.ONCE, + ) + item['context'] = item.pop('question') + else: + # default context if nothing is found + item['context'] = "what does this audio mean" + + item = dict( + audio_file=item['audio_file'], + duration=item['duration'], + context=str(item['context']), + answer=str(item['answer']), + offset=item.get('offset', None), + speaker=item.get('speaker', None), + orig_sr=item.get('orig_sample_rate', None), + lang=item.get('lang', None), + ) + return item + + class SpeechLabel(_Collection): """List of audio-label correspondence with preprocessing.""" - OUTPUT_TYPE = collections.namedtuple(typename='SpeechLabelEntity', field_names='audio_file duration label offset',) + OUTPUT_TYPE = collections.namedtuple( + typename='SpeechLabelEntity', + field_names='audio_file duration label offset', + ) def __init__( self, @@ -532,7 +819,10 @@ def __parse_item(self, line: str, manifest_file: str) -> Dict[str, Any]: class FeatureSequenceLabel(_Collection): """List of feature sequence of label correspondence with preprocessing.""" - OUTPUT_TYPE = collections.namedtuple(typename='FeatureSequenceLabelEntity', field_names='feature_file seq_label',) + OUTPUT_TYPE = collections.namedtuple( + typename='FeatureSequenceLabelEntity', + field_names='feature_file seq_label', + ) def __init__( self, @@ -614,9 +904,11 @@ class ASRFeatureSequenceLabel(FeatureSequenceLabel): """`FeatureSequenceLabel` collector from asr structured json files.""" def __init__( - self, manifests_files: Union[str, List[str]], max_number: Optional[int] = None, index_by_file_id: bool = False, + self, + manifests_files: Union[str, List[str]], + max_number: Optional[int] = None, + index_by_file_id: bool = False, ): - """Parse lists of feature files and sequences of labels. Args: @@ -655,7 +947,10 @@ def _parse_item(self, line: str, manifest_file: str) -> Dict[str, Any]: f"Manifest file has invalid json line " f"structure: {line} without proper seq_label key." ) - item = dict(feature_file=item['feature_file'], seq_label=item['seq_label'],) + item = dict( + feature_file=item['feature_file'], + seq_label=item['seq_label'], + ) return item @@ -759,7 +1054,8 @@ def __init__( data.sort(key=lambda entity: entity.duration) logging.info( - "Filtered duration for loading collection is %f.", duration_filtered, + "Filtered duration for loading collection is %f.", + duration_filtered, ) logging.info(f"Total {len(data)} session files loaded accounting to # {len(audio_files)} audio clips") @@ -937,8 +1233,7 @@ def __parse_item_rttm(self, line: str, manifest_file: str) -> Dict[str, Any]: class Audio(_Collection): - """Prepare a list of all audio items, filtered by duration. - """ + """Prepare a list of all audio items, filtered by duration.""" OUTPUT_TYPE = collections.namedtuple(typename='Audio', field_names='audio_files duration offset text') @@ -999,11 +1294,14 @@ def __init__( class AudioCollection(Audio): - """List of audio files from a manifest file. 
- """ + """List of audio files from a manifest file.""" def __init__( - self, manifest_files: Union[str, List[str]], audio_to_manifest_key: Dict[str, str], *args, **kwargs, + self, + manifest_files: Union[str, List[str]], + audio_to_manifest_key: Dict[str, str], + *args, + **kwargs, ): """Instantiates a list of audio files loaded from a manifest file. @@ -1045,6 +1343,7 @@ def __parse_item(self, line: str, manifest_file: str) -> Dict[str, Any]: Returns: Dictionary with audio_files, duration, and offset. """ + # Local utility function def get_audio_file(item: Dict, manifest_key: Union[str, List[str]]): """Get item[key] if key is string, or a list @@ -1117,7 +1416,10 @@ def get_audio_file(item: Dict, manifest_key: Union[str, List[str]]): class FeatureLabel(_Collection): """List of feature sequence and their label correspondence with preprocessing.""" - OUTPUT_TYPE = collections.namedtuple(typename='FeatureLabelEntity', field_names='feature_file label duration',) + OUTPUT_TYPE = collections.namedtuple( + typename='FeatureLabelEntity', + field_names='feature_file label duration', + ) def __init__( self, @@ -1194,7 +1496,6 @@ def __init__( *args, **kwargs, ): - """Parse lists of feature files and sequences of labels. Args: @@ -1383,7 +1684,14 @@ def __init__(self, manifests_files: Union[str, List[str]], *args, **kwargs): **kwargs: Kwargs to pass to `AudioText` constructor. """ - ids, feature_files, rttm_files, durations, texts, offsets, = ( + ( + ids, + feature_files, + rttm_files, + durations, + texts, + offsets, + ) = ( [], [], [], diff --git a/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py b/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py index b686322c0882..aed05673f6fa 100644 --- a/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py +++ b/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py @@ -28,7 +28,7 @@ class SentencePieceTokenizer(TokenizerSpec): """ Sentencepiecetokenizer https://github.com/google/sentencepiece. - + Args: model_path: path to sentence piece tokenizer model. To create the model use create_spt_model() special_tokens: either list of special tokens or dictionary of token name to token value @@ -87,7 +87,7 @@ def text_to_tokens(self, text): return self.tokenizer.encode_as_pieces(text) - def text_to_ids(self, text): + def text_to_ids(self, text, sample_alpha=None): if self.legacy: ids = [] idx = 0 @@ -115,7 +115,10 @@ def text_to_ids(self, text): ids.extend(self.tokenizer.encode_as_ids(text[idx:])) return ids - return self.tokenizer.encode_as_ids(text) + if sample_alpha is not None: + return self.tokenizer.encode_as_ids(text, enable_sampling=True, alpha=sample_alpha, nbest_size=-1) + else: + return self.tokenizer.encode_as_ids(text) def tokens_to_text(self, tokens): if isinstance(tokens, np.ndarray): diff --git a/nemo/collections/multimodal/speech_llm/__init__.py b/nemo/collections/multimodal/speech_llm/__init__.py new file mode 100644 index 000000000000..f0c19a3eebb9 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from nemo.collections.multimodal.speech_llm import models, modules diff --git a/nemo/collections/multimodal/speech_llm/data/__init__.py b/nemo/collections/multimodal/speech_llm/data/__init__.py new file mode 100644 index 000000000000..d9155f923f18 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/data/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py b/nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py new file mode 100644 index 000000000000..7d0ee6afbfa2 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/data/audio_text_dataset.py @@ -0,0 +1,1327 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
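The SentencePieceTokenizer.text_to_ids change above threads an optional sample_alpha through to SentencePiece's subword-regularization sampling, which is what the sample_alpha field in the dataset configs controls. Below is a minimal sketch (not part of the patch) of the difference between deterministic and sampled encoding using the raw sentencepiece API; the model path is a placeholder for any trained SentencePiece model.

import sentencepiece as spm

# Placeholder path: substitute any trained SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "transcribe this audio"

# Deterministic segmentation, i.e. what text_to_ids() returns when sample_alpha is None.
print(sp.encode(text, out_type=int))

# Sampled segmentations, i.e. sample_alpha=0.1: each call may tokenize the same text
# differently, which acts as light augmentation on the text side during training.
for _ in range(3):
    print(sp.encode(text, out_type=int, enable_sampling=True, alpha=0.1, nbest_size=-1))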
+import copy +import io +import os +from typing import Dict, List, Optional, Union + +import numpy as np +import torch +import webdataset as wds +from omegaconf import DictConfig, ListConfig, open_dict + +from nemo.collections.asr.data.audio_to_text import ( + VALID_FILE_FORMATS, + cache_datastore_manifests, + expand_sharded_filepaths, + shard_manifests_if_needed, +) +from nemo.collections.asr.data.audio_to_text_dataset import ConcatDataset, convert_to_config_list, get_chain_dataset +from nemo.collections.asr.parts.preprocessing.features import WaveformFeaturizer +from nemo.collections.asr.parts.utils.audio_utils import ChannelSelectorType +from nemo.collections.common.parts.preprocessing import collections +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import ( + ceil_to_nearest, + get_num_samples_from_files, + maybe_cast_to_list, +) +from nemo.collections.nlp.data.language_modeling.megatron.base_dataset_utils import ( + get_datasets_weights_and_num_samples, +) +from nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset import BlendableDataset +from nemo.core.classes import Dataset, IterableDataset +from nemo.utils import logging, logging_mode +from nemo.utils.distributed import webdataset_split_by_workers + +try: + from megatron.core import parallel_state + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + HAVE_MEGATRON_CORE = False + +__all__ = [ + 'AudioTextDataset', + 'TarredAudioTextDataset', + 'get_tarred_audio_text_dataset_from_config', + 'get_audio_text_dataset_from_config', +] + + +def _audio_collate_fn(audio_signals, audio_lengths): + """collate batch of audio sig, audio len, tokens, tokens len + Args: + audio_signals: List[Tensor] + audio_lengths: List[Tensor] + """ + + max_audio_len = 0 + has_audio = audio_lengths[0] is not None + if has_audio: + max_audio_len = max(audio_lengths).item() + + audio_signals_padded = [] + for sig, sig_len in zip(audio_signals, audio_lengths): + if has_audio: + sig_len = sig_len.item() + if sig_len < max_audio_len: + pad = (0, max_audio_len - sig_len) + sig = torch.nn.functional.pad(sig, pad) + audio_signals_padded.append(sig) + + if has_audio: + audio_signals_padded = torch.stack(audio_signals_padded) + audio_lengths = torch.stack(audio_lengths) + else: + audio_signals_padded, audio_lengths = None, None + + return audio_signals_padded, audio_lengths + + +def _build_loss_mask(processed_example: Dict, answer_only_loss: bool = True): + """Pad input_ids in batch to max batch length while building loss mask""" + # function copied from nemo/collections/nlp/data/language_modelling/megatron/gpt_sft_dataset.py + input_ids = processed_example['input_ids'] + answer_start_idx = processed_example['answer_start_idx'] + if answer_only_loss: + loss_mask = [float(idx >= answer_start_idx) for idx in range(len(input_ids))] + else: + loss_mask = [1.0] * len(input_ids) + + return loss_mask + + +def _collate_item(item: Union[torch.Tensor, np.ndarray, List], max_length: int, pad_id: int = 0): + # function copied from nemo/collections/nlp/data/language_modelling/megatron/gpt_sft_dataset.py + item = maybe_cast_to_list(item) + # max_length = max([len(x) for x in item]) if item else 0 + # here [0] should be tokenizer.pad_id + item = [x + [pad_id] * (max_length - len(x)) for x in item] + return item + + +def _speechllm_audio_text_collate_fn( + batch: Dict, + tokens_to_generate: int, + pad_to_max_length: bool, + max_seq_length: int, + text_pad_id: int, +): + sample_ids = [x["idx"] for x in batch] + sample_ids = 
torch.tensor(sample_ids, dtype=torch.int32) + + audio_signal = [x["audio_signal"] for x in batch] + audio_lengths = [x["audio_length"] for x in batch] + audio_signal, audio_lengths = _audio_collate_fn(audio_signal, audio_lengths) + + input_ids = [item['input_ids'][:-1] for item in batch] + labels = [item['input_ids'][1:] for item in batch] + contexts = [item['context_ids'] for item in batch] + context_lengths = torch.LongTensor([item['context_length'] for item in batch]) + answers = [item['answer_ids'] for item in batch] + + loss_mask = [_build_loss_mask(item)[1:] for item in batch] + + max_length = max([len(x) for x in input_ids]) + tokens_to_generate + # increase max length to nearest multiple of 4 or 8 + if pad_to_max_length: + max_length = max_seq_length + else: + max_length = min(max_seq_length, ceil_to_nearest(max_length, 8)) + assert max_length <= max_seq_length + + position_ids = [list(range(max_length)) for _ in batch] + position_ids = torch.LongTensor(position_ids) + input_ids = torch.LongTensor(_collate_item(input_ids, max_length=max_length, pad_id=text_pad_id)) + input_length = torch.LongTensor([len(x) for x in input_ids]) + labels = torch.LongTensor(_collate_item(labels, max_length=max_length, pad_id=text_pad_id)) + loss_mask = torch.LongTensor(_collate_item(loss_mask, max_length=max_length, pad_id=0)) + contexts = torch.LongTensor(_collate_item(contexts, max_length=max_length, pad_id=text_pad_id)) + answers = torch.LongTensor(_collate_item(answers, max_length=max_length, pad_id=text_pad_id)) + + batch = { + 'sample_ids': sample_ids, + 'audio_signal': audio_signal, + 'audio_signal_length': audio_lengths, + 'tokens': input_ids, + 'tokens_length': input_length, + 'labels': labels, + 'loss_mask': loss_mask, + 'position_ids': position_ids, + 'contexts': contexts, + 'context_lengths': context_lengths, + 'answers': answers, + 'max_length': torch.LongTensor(max_length), + 'metadata': [x['metadata'] for x in batch], + } + + return batch + + +def _speechllm_multi_audio_text_collate_fn( + batch: Dict, + tokens_to_generate: int, + pad_to_max_length: bool, + max_seq_length: int, + text_pad_id: int, +): + """Collate function for multi audio case.""" + context_start_idx = [item['context_start_idx'] for item in batch] + + audio_signals = [x["audio_signal"] for x in batch] + audio_lengths = [x["audio_length"] for x in batch] + num_audios = [len(x) for x in audio_signals] + + # put all audios from all samples in one batch + audio_signals_merged = [item for audio_list in audio_signals for item in audio_list] + audio_lengths_merged = [item for length_list in audio_lengths for item in length_list] + audio_signals_merged, audio_lengths_merged = _audio_collate_fn(audio_signals_merged, audio_lengths_merged) + + for i in range(len(batch)): + # create dummy audio_signal and audio_length for _speechllm_audio_text_collate_fn() + batch[i]["audio_signal"] = audio_signals[i][0] + batch[i]["audio_length"] = audio_lengths[i][0] + + batch = _speechllm_audio_text_collate_fn(batch, tokens_to_generate, pad_to_max_length, max_seq_length, text_pad_id) + + # add multi audio specific fields + batch['context_start_idx'] = list(context_start_idx) + batch['num_audios'] = torch.LongTensor(num_audios) + batch['audio_signal'] = audio_signals_merged + batch['audio_signal_length'] = audio_lengths_merged + + return batch + + +class TextProcessing(object): + """ + Text processing pipeline for AudioTextDataset and TarredAudioTextDataset. 
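+    Given a (context, answer) pair from the manifest, it produces the token id sequences
+    (input_ids, context_ids, answer_ids) and the bookkeeping indices (answer_start_idx,
+    context_start_idx) that the collate functions above consume.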
+ This class is adapted from the one used in nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py + """ + + def __init__( + self, + tokenizer: 'nemo.collections.common.tokenizers.TokenizerSpec', + max_seq_length: int = 1024, + min_seq_length: int = 1, + add_bos: bool = False, + add_eos: bool = True, + add_sep: bool = False, + sep_id: Optional[int] = None, + seed: int = 1234, + separate_prompt_and_response_with_newline: bool = False, + answer_only_loss: bool = True, + truncation_field: str = "answer", + pad_to_max_length: bool = False, # (@adithyare) allows for much faster training especially in PEFT settings. + prompt_template: str = None, + virtual_tokens: int = 0, + tokens_to_generate: int = 0, + context_key: str = 'context', + answer_key: str = 'answer', + end_string: Optional[str] = None, + sample_alpha: Optional[float] = None, + audio_locator: Optional[str] = None, + ): + self.context_key = context_key + self.answer_key = answer_key + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.min_seq_length = min_seq_length + self.seed = seed + self.separate_prompt_and_response_with_newline = separate_prompt_and_response_with_newline + self.answer_only_loss = answer_only_loss + self.truncation_field = truncation_field + self.pad_to_max_length = pad_to_max_length + self.prompt_template = prompt_template + self.virtual_tokens = virtual_tokens + self.tokens_to_generate = tokens_to_generate + self.add_bos = add_bos + self.add_eos = add_eos + self.add_sep = add_sep + self.end_string = end_string + self.sample_alpha = sample_alpha + self.audio_locator = audio_locator + + if add_bos and hasattr(tokenizer, "bos_id") and tokenizer.bos_id > 0: + self.bos_id = tokenizer.bos_id + else: + self.bos_id = None + + if add_eos and hasattr(tokenizer, "eos_id") and tokenizer.eos_id > 0: + self.eos_id = tokenizer.eos_id + else: + self.eos_id = None + + if hasattr(tokenizer, "pad_id") and tokenizer.pad_id > 0: + self.pad_id = tokenizer.pad_id + else: + self.pad_id = self.eos_id if self.eos_id is not None else 0 + + self.sep_id = sep_id if add_sep else None + + if self.prompt_template is not None: + # When providing things like newlines in the prompt template via the CLI, they are escaped. This line unescapes them. + self.prompt_template = self.prompt_template.encode('utf-8').decode('unicode_escape') + assert self.truncation_field in ["answer", "context"] + + def _process_example(self, context: str, output: str): + """ + Create an example by concatenating text and answer. + Truncation is carried out when needed, but it is performed only on the prompt side. + BOS, EOS, and SEP, are added if specified. 
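+        For example (illustrative values), with prompt_template="Q: {context}\nA: {answer}",
+        context="what is 1+1?" and output="2", the assembled text is "Q: what is 1+1?\nA: 2",
+        where everything up to and including "A:" is tokenized as the context and the remainder
+        (" 2") as the answer.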
+ + function copied from nemo/collections/nlp/data/language_modelling/megatron/gpt_sft_dataset.py + """ + if self.prompt_template is not None: + if self.context_key not in self.prompt_template or self.answer_key not in self.prompt_template: + if "input" in self.prompt_template and "output" in self.prompt_template: + logging.warning( + f"Using 'input' and 'output' as context and answer keys, since given ones ({self.context_key}, {self.answer_key}) are not found in the prompt template: {self.prompt_template}.", + mode=logging_mode.ONCE, + ) + self.context_key = "input" + self.answer_key = "output" + assert f'{{{self.context_key}}}' in self.prompt_template + assert f'{{{self.answer_key}}}' in self.prompt_template + # Make sure that '{output}' always occurs at the end of the prompt template string + assert self.prompt_template.index(f'{{{self.answer_key}}}') == len(self.prompt_template) - len( + f'{{{self.answer_key}}}' + ) + # Get the context by replacing only the input + original_context = context + context = ( + self.prompt_template.replace(f'{{{self.context_key}}}', context) + .replace(f'{{{self.answer_key}}}', '') + .strip(' ') + ) + # Replace the input and output placeholders with the actual input and output + text = self.prompt_template.replace(f'{{{self.context_key}}}', original_context).replace( + f'{{{self.answer_key}}}', output + ) + + elif self.separate_prompt_and_response_with_newline: + text = context + '\n' + output + else: + text = context + ' ' + output + + if self.virtual_tokens: + # (@adithyare) we are going to insert "pad/eos" tokens in the beginning of the text and context + # these pad/eos tokens are placeholders for virtual tokens + pre_pad = [self.tokenizer.eos_id] * self.virtual_tokens + else: + pre_pad = [] + answer_text = text[len(context) :] + answer_ids = pre_pad + self.tokenizer.text_to_ids(answer_text, self.sample_alpha) + if self.end_string: + answer_ids += self.tokenizer.text_to_ids(self.end_string) + + if self.audio_locator is None: + # signle audio case + context_ids = self.tokenizer.text_to_ids(context) + context_start_idx = [0] + else: + # multiple audio case + context_ids = [] + context_start_idx = [] + for context_seg in context.split(self.audio_locator): + context_start_idx.append(len(context_ids)) + context_ids.extend(self.tokenizer.text_to_ids(context_seg)) + context_ids = pre_pad + context_ids + context_start_idx = [x + len(pre_pad) for x in context_start_idx] + + # for the long context cases, collate_fn includes self.tokens_to_generate for padding + total_ids = len(context_ids) + max(len(answer_ids), self.tokens_to_generate) + if self.add_bos: + total_ids += 1 + if self.add_sep: + total_ids += 1 + # Only training need to consider eos token + if self.add_eos and self.tokens_to_generate == 0: + total_ids += 1 + + # If the total number of token is greater than the max, we will try to truncate the answer + if total_ids > self.max_seq_length: + truncation_length = total_ids - self.max_seq_length + if self.truncation_field == "answer": + answer_ids = answer_ids[: -min(truncation_length, len(answer_ids))] + elif self.truncation_field == "context": + context_ids = context_ids[: -min(truncation_length, len(context_ids))] + + input_ids = context_ids + answer_start_idx = len(input_ids) + + # Adds bos token in the start + if self.add_bos: + context_ids = [self.tokenizer.bos_id] + context_ids + input_ids = [self.tokenizer.bos_id] + input_ids + answer_start_idx += 1 + + # Adds sep token between text/prompt and answer + if self.add_sep: + context_ids = context_ids 
+ [self.sep_id]
+            input_ids = input_ids + [self.sep_id]
+            answer_start_idx += 1
+
+        input_ids = input_ids + answer_ids
+
+        # Only training needs to consider the eos token
+        if self.add_eos and self.tokens_to_generate == 0:
+            input_ids = input_ids + [self.tokenizer.eos_id]
+
+        if len(input_ids) > self.max_seq_length:
+            logging.warning(f'Input ids length {len(input_ids)} exceeds max sequence length {self.max_seq_length}')
+            input_ids = input_ids[: self.max_seq_length]
+
+        processed_example = {
+            'input_ids': input_ids,
+            'answer_start_idx': answer_start_idx,
+            'context_ids': context_ids,
+            'context_length': len(context_ids),
+            'answer_ids': answer_ids,
+            'context_start_idx': context_start_idx,
+        }
+
+        return processed_example
+
+
+class AudioTextDataset(TextProcessing, Dataset):
+    """
+    Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds).
+    Each new line is a different sample. Example below:
+    {"audio_filepath": "1.wav", "duration": 1.12, "question": "what is the capital of France?", "answer": "Paris"}
+    {"audio_filepath": "2.wav", "duration": 2.15, "question": "what is the capital of Italy?", "answer": "Rome"}
+    Args:
+        manifest_filepath: Path to manifest json as described above. Can be comma-separated paths.
+        tokenizer: text tokenizer object
+        sample_rate (int): Sample rate to resample loaded audio to
+        int_values (bool): If true, load samples as 32-bit integers. Defaults to False.
+        augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor): An AudioAugmentor object used to augment loaded
+            audio
+        max_duration: If audio exceeds this length, do not include in dataset
+        min_duration: If audio is less than this length, do not include in dataset
+        max_utts: Limit number of utterances
+        trim: Whether to trim silence. Defaults to False
+        channel_selector (int | Iterable[int] | str): select a single channel or a subset of channels from multi-channel audio. If set to `'average'`, it performs averaging across channels. Disabled if set to `None`. Defaults to `None`. Uses zero-based indexing.
+        --------- NLP SPECIFIC ARGS -------------
+        max_seq_length (int): maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
+        min_seq_length (int): min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.
+        add_bos (bool): Whether to add a beginning of sentence token to each data example
+        add_eos (bool): Whether to add an end of sentence token to each data example
+        add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer)
+        tokens_to_generate (int): (inference only) Number of tokens to generate during inference
+        seed: Random seed for data shuffling. Defaults to 1234.
+        max_num_samples: Maximum number of samples to load. This can be > dataset length if you want to oversample data. If None, all samples will be loaded.
+        context_key: Key to use for the context in your JSONL file
+        answer_key: Key to use for the label in your JSONL file
+        separate_prompt_and_response_with_newline: Adds a newline between prompt and response.
+        answer_only_loss: If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.
+        truncation_field: Field to be used for truncation if the combined length exceeds the max sequence length. (Options: "answer", "context")
+ pad_to_max_length: Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch. + prompt_template: Prompt template to inject via an fstring. Formatted like Q: {input}\n\nA: {output} + end_string: Optional[str] = None, if not None, add this string to the end of the answer. + --------------- additional args for misc purposes ---------------- + context_file: Optional[Union[List[str], str]] = None, if provided, will use this file to load random questions from, if question is not in manifest. + sample_alpha: Optional[float] = None, for SPE subword sampling + audio_locator: Optional[str] = None, a special string to split the context into multiple audio segments. + """ + + def __init__( + self, + manifest_filepath: str, + tokenizer: 'nemo.collections.common.tokenizers.TokenizerSpec', + sample_rate: int, + int_values: bool = False, + augmentor: 'nemo.collections.asr.parts.perturb.AudioAugmentor' = None, + max_duration: Optional[int] = None, + min_duration: Optional[int] = None, + max_utts: int = 0, + trim: bool = False, + channel_selector: Optional[ChannelSelectorType] = None, + max_seq_length: int = 1024, + min_seq_length: int = 1, + add_bos: bool = False, + add_eos: bool = True, + add_sep: bool = False, + sep_id: Optional[int] = None, + max_num_samples: Optional[int] = None, + seed: int = 1234, + separate_prompt_and_response_with_newline: bool = False, + answer_only_loss: bool = True, + truncation_field: str = "answer", + pad_to_max_length: bool = False, # (@adithyare) allows for much faster training especially in PEFT settings. + prompt_template: str = None, + virtual_tokens: int = 0, + tokens_to_generate: int = 0, + index_by_file_id: bool = False, + context_key: str = 'context', + answer_key: str = 'answer', + end_string: Optional[str] = None, + context_file: Optional[Union[List[str], str]] = None, + sample_alpha: Optional[float] = None, + audio_locator: Optional[str] = None, + ): + super().__init__( + tokenizer=tokenizer, + max_seq_length=max_seq_length, + min_seq_length=min_seq_length, + add_bos=add_bos, + add_eos=add_eos, + add_sep=add_sep, + sep_id=sep_id, + seed=seed, + separate_prompt_and_response_with_newline=separate_prompt_and_response_with_newline, + answer_only_loss=answer_only_loss, + truncation_field=truncation_field, + pad_to_max_length=pad_to_max_length, + prompt_template=prompt_template, + virtual_tokens=virtual_tokens, + tokens_to_generate=tokens_to_generate, + context_key=context_key, + answer_key=answer_key, + end_string=end_string, + sample_alpha=sample_alpha, + audio_locator=audio_locator, + ) + + if isinstance(manifest_filepath, str): + manifest_filepath = manifest_filepath.split(",") + + # If necessary, cache manifests and audio from object store + cache_datastore_manifests(manifest_filepaths=manifest_filepath, cache_audio=True) + + self.collection = collections.SpeechLLMAudioTextCollection( + manifests_files=manifest_filepath, + min_duration=min_duration, + max_duration=max_duration, + max_number=max_utts, + index_by_file_id=index_by_file_id, + max_num_samples=max_num_samples, + context_file=context_file, + context_key=context_key, + answer_key=answer_key, + ) + + self.featurizer = WaveformFeaturizer(sample_rate=sample_rate, int_values=int_values, augmentor=augmentor) + self.trim = trim + self.channel_selector = channel_selector + + def get_manifest_sample(self, sample_id): + return self.collection[sample_id] + + def __getitem__(self, index): + output = {"idx": index} + sample = self.collection[index] + offset = 
sample.offset + + if offset is None: + offset = 0 + + if sample.audio_file is not None: + features = self.featurizer.process( + sample.audio_file, + offset=offset, + duration=sample.duration, + trim=self.trim, + orig_sr=sample.orig_sr, + channel_selector=self.channel_selector, + ) + f, fl = features, torch.tensor(features.shape[0]).long() + output["audio_signal"] = f + output["audio_length"] = fl + else: + # dummy features + output["audio_signal"] = torch.zeros([80]) + # accomodates normalize_batch + output["audio_length"] = torch.tensor(80) + + text_data = self._process_example(context=sample.context, output=sample.answer) + + output.update(text_data) + output['metadata'] = { + 'audio_filepath': sample.audio_file, + 'offset': offset, + 'duration': sample.duration, + } + return output + + def __len__(self): + return len(self.collection) + + def _collate_fn(self, batch): + return _speechllm_audio_text_collate_fn( + batch=batch, + tokens_to_generate=self.tokens_to_generate, + pad_to_max_length=self.pad_to_max_length, + max_seq_length=self.max_seq_length, + text_pad_id=self.pad_id, + ) + + def collate_fn(self, batch): + # override collate_fn to skip type checking + return self._collate_fn(batch) + + +class MultiAudioTextDataset(AudioTextDataset): + """ + Dataset for having multi audios per sample, for example in few-shot in-context learning. + To use this dataset, you need to specify the `audio_locator` field in the dataset config, + and use that to specify the locations of the audio files in your manifest. In this case, + the `audio_filepath` field in the manifest is a list of audio filepaths, and the `duration` + field is a list of durations, one for each audio file. The `offset` field is optional, and + if not specified, it is assumed to be 0.0. The `offset` field is also a list of offsets if specified. + + Example manifest item for audio_locator='|audio|': + { + "audio_filepath": ["1.wav","2.wav","3.wav"], + "duration": [1.05,1.05,2.0], + "answer": "this was her dream as nearly as she could recall it", + "question": "Following are examples of speech audios and their transcriptions. + Example 1: audio is |audio|, transcription is 'I have a dream'. + Example 2: audio is |audio|, transcription is ' I don't have a dream'. + Given the following audio |audio|, transcribe the audio into words." 
+ } + """ + + def __init__( + self, + *args, + **kwargs, + ): + super().__init__(*args, **kwargs) + + def _collate_fn(self, batch): + return _speechllm_multi_audio_text_collate_fn( + batch=batch, + tokens_to_generate=self.tokens_to_generate, + pad_to_max_length=self.pad_to_max_length, + max_seq_length=self.max_seq_length, + text_pad_id=self.pad_id, + ) + + def __getitem__(self, index): + output = {"idx": index} + sample = self.collection[index] + offsets = sample.offset if sample.offset else 0.0 + durations = sample.duration if sample.duration else 0.0 + num_audios = 0 + output["audio_signal"] = [] + output["audio_length"] = [] + if sample.audio_file is not None: + audio_list = sample.audio_file + if isinstance(sample.audio_file, str): + audio_list = [sample.audio_file] + if not isinstance(audio_list, list): + raise ValueError( + f"The field `audio_file` must be either a str or a list of str, but got type {type(sample.audio_file)} instead" + ) + + num_audios = len(audio_list) + if isinstance(durations, list) and len(durations) != num_audios: + raise ValueError( + f"The number of durations ({len(durations)}) must match the number of audio clips ({num_audios})" + ) + if isinstance(offsets, list) and len(offsets) != num_audios: + raise ValueError( + f"The number of offsets ({len(offsets)}) must match the number of audio clips ({num_audios})" + ) + + for i, audio_file in enumerate(audio_list): + duration = durations[i] if isinstance(durations, list) else 0 + offset = offsets[i] if isinstance(offsets, list) else 0 + features = self.featurizer.process( + audio_file, + offset=offset, + duration=duration, + trim=self.trim, + orig_sr=sample.orig_sr, + channel_selector=self.channel_selector, + ) + f, fl = features, torch.tensor(features.shape[0]).long() + output["audio_signal"].append(f) + output["audio_length"].append(fl) + else: + # dummy features + output["audio_signal"] = [torch.zeros([8])] + # accomodates normalize_batch + output["audio_length"] = [torch.tensor(8)] + + text_data = self._process_example(context=sample.context, output=sample.answer) + + if isinstance(output["audio_signal"], list) and len(output["audio_signal"]) + 1 != len( + text_data['context_start_idx'] + ): + raise ValueError( + f"The number of text segments ({len(text_data['context_start_idx'])}) must be one more than number of audios ({len(output['audio_signal'])})" + ) + + output.update(text_data) + output['metadata'] = { + 'audio_filepath': sample.audio_file, + 'offset': offsets, + 'duration': sample.duration, + } + return output + + +class TarredAudioFilter: + def __init__(self, collection, iterator): + self.iterator = iterator + self.collection = collection + + def __iter__(self): + return self + + def __next__(self): + while True: + audio_bytes, audio_filename = next(self.iterator) + file_id, _ = os.path.splitext(os.path.basename(audio_filename)) + if file_id in self.collection.mapping: + return audio_bytes, audio_filename + + +class TarredAudioLoopOffsets: + def __init__(self, collection, iterator): + self.iterator = iterator + self.collection = collection + self.current_fn = None + self.current_bytes = None + self.offset_id = 0 + + def __iter__(self): + return self + + def __next__(self): + if self.current_fn is None: + self.current_bytes, self.current_fn = next(self.iterator) + self.offset_id = 0 + else: + offset_list = self.collection.mapping[self.current_fn] + if len(offset_list) == self.offset_id + 1: + self.current_bytes, self.current_fn = next(self.iterator) + self.offset_id = 0 + else: + self.offset_id += 1 + + 
return self.current_bytes, self.current_fn, self.offset_id + + +class TarredAudioTextDataset(TextProcessing, IterableDataset): + """ + A similar Dataset to the AudioTextDataset, but which loads tarred audio files. + + Accepts a single comma-separated JSON manifest file (in the same style as for the AudioTextDataset), + as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should + contain the information for one audio file, including at least the transcript and name of the audio + file within the tarball. + + Valid formats for the audio_tar_filepaths argument include: + (1) a single string that can be brace-expanded, e.g. 'path/to/audio.tar' or 'path/to/audio_{1..100}.tar.gz', or + (2) a list of file paths that will not be brace-expanded, e.g. ['audio_1.tar', 'audio_2.tar', ...]. + + Note: For brace expansion in (1), there may be cases where `{x..y}` syntax cannot be used due to shell interference. + This occurs most commonly inside SLURM scripts. Therefore we provide a few equivalent replacements. + Supported opening braces - { <=> (, [, < and the special tag _OP_. + Supported closing braces - } <=> ), ], > and the special tag _CL_. + For SLURM based tasks, we suggest the use of the special tags for ease of use. + + See the WebDataset documentation for more information about accepted data and input formats. + + If using multiple workers the number of shards should be divisible by world_size to ensure an + even split among workers. If it is not divisible, logging will give a warning but training will proceed. + In addition, if using mutiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering + is applied. We currently do not check for this, but your program may hang if the shards are uneven! + + Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest + after filtering. An incorrect manifest length may lead to some DataLoader issues down the line. + + Args: + audio_tar_filepaths: Either a list of audio tarball filepaths, or a + string (can be brace-expandable). + manifest_filepath (str): Path to the manifest. + parser (callable): A callable which is used to pre-process the text output. + sample_rate (int): Sample rate to resample loaded audio to + int_values (bool): If true, load samples as 32-bit integers. Defauts to False. + augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor): An AudioAugmentor + object used to augment loaded audio + shuffle_n (int): How many samples to look ahead and load to be shuffled. + See WebDataset documentation for more details. + Defaults to 0. + min_duration (float): Dataset parameter. + All training files which have a duration less than min_duration + are dropped. Note: Duration is read from the manifest JSON. + Defaults to 0.1. + max_duration (float): Dataset parameter. + All training files which have a duration more than max_duration + are dropped. Note: Duration is read from the manifest JSON. + Defaults to None. + blank_index (int): Blank character index, defaults to -1. + unk_index (int): Unknown character index, defaults to -1. + normalize (bool): Dataset parameter. + Whether to use automatic text cleaning. + It is highly recommended to manually clean text for best results. + Defaults to True. + trim (bool): Whether to use trim silence from beginning and end + of audio signal using librosa.effects.trim(). + Defaults to False. + bos_id (id): Dataset parameter. + Beginning of string symbol id used for seq2seq models. + Defaults to None. 
+        eos_id (id): Dataset parameter.
+            End of string symbol id used for seq2seq models.
+            Defaults to None.
+        pad_id (id): Token used to pad when collating samples in batches.
+            If this is None, pads using 0s.
+            Defaults to None.
+        shard_strategy (str): Tarred dataset shard distribution strategy chosen as a str value during ddp.
+            -   `scatter`: The default shard strategy applied by WebDataset, where each node gets
+                a unique set of shards, which are permanently pre-allocated and never changed at runtime.
+            -   `replicate`: Optional shard strategy, where each node gets all of the set of shards
+                available in the tarred dataset, which are permanently pre-allocated and never changed at runtime.
+                The benefit of replication is that it allows each node to sample data points from the entire
+                dataset independently of other nodes, and reduces dependence on the value of `shuffle_n`.
+
+                .. warning::
+                    Replicated strategy allows every node to sample the entire set of available tarfiles,
+                    and therefore more than one node may sample the same tarfile, and even sample the same
+                    data points! As such, there is no guarantee that all samples in the dataset will be
+                    sampled at least once during 1 epoch. Scattered strategy, on the other hand, on specific
+                    occasions (when the number of shards is not divisible by ``world_size``), will not sample
+                    the entire dataset. For these reasons it is not advisable to use tarred datasets as validation
+                    or test datasets.
+        shard_manifests (bool): Whether to shard manifests, if possible. Defaults to False.
+        global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.
+        world_size (int): Total number of processes, used for partitioning shards. Defaults to 0.
+        --------- NLP SPECIFIC ARGS -------------
+        max_seq_length (int): maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
+        min_seq_length (int): min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.
+        add_bos (bool): Whether to add a beginning of sentence token to each data example
+        add_eos (bool): Whether to add an end of sentence token to each data example
+        add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer)
+        tokens_to_generate (int): (inference only) Number of tokens to generate during inference
+        seed: Random seed for data shuffling. Defaults to 1234.
+        context_key: Key to use for the context in your JSONL file
+        answer_key: Key to use for the label in your JSONL file
+        separate_prompt_and_response_with_newline: Adds a newline between prompt and response.
+        answer_only_loss: If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.
+        truncation_field: Field to be used for truncation if the combined length exceeds the max sequence length. (Options: "answer", "context")
+        pad_to_max_length: Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.
+        prompt_template: Prompt template to inject via an fstring. Formatted like Q: {input}\n\nA: {output}
+        end_string: Optional[str] = None, if not None, add this string to the end of the answer.
+ --------------- additional args for misc purposes ---------------- + context_file: Optional[Union[List[str], str]] = None, if provided, will use this file to load random questions from, if question is not in manifest. + sample_alpha: Optional[float] = None, for SPE subword sampling + """ + + def __init__( + self, + audio_tar_filepaths: Union[str, List[str]], + manifest_filepath: str, + tokenizer: 'nemo.collections.common.tokenizers.TokenizerSpec', + sample_rate: int, + int_values: bool = False, + augmentor: Optional['nemo.collections.asr.parts.perturb.AudioAugmentor'] = None, + shuffle_n: int = 0, + min_duration: Optional[float] = None, + max_duration: Optional[float] = None, + trim: bool = False, + shard_strategy: str = "scatter", + shard_manifests: bool = False, + global_rank: int = 0, + world_size: int = 0, + max_seq_length: int = 1024, + min_seq_length: int = 1, + add_bos: bool = False, + add_eos: bool = True, + add_sep: bool = False, + sep_id: int = None, + seed: int = 1234, + separate_prompt_and_response_with_newline: bool = False, + answer_only_loss: bool = True, + truncation_field: str = "answer", # choices=["answer", "context"] + pad_to_max_length: bool = False, # (@adithyare) allows for much faster training especially in PEFT settings. + prompt_template: str = None, + virtual_tokens: int = 0, + tokens_to_generate: int = 0, + context_key: str = 'context', + answer_key: str = 'answer', + end_string: Optional[str] = None, + context_file: Optional[Union[List[str], str]] = None, + sample_alpha: Optional[float] = None, + ): + super().__init__( + tokenizer=tokenizer, + max_seq_length=max_seq_length, + min_seq_length=min_seq_length, + add_bos=add_bos, + add_eos=add_eos, + add_sep=add_sep, + sep_id=sep_id, + seed=seed, + separate_prompt_and_response_with_newline=separate_prompt_and_response_with_newline, + answer_only_loss=answer_only_loss, + truncation_field=truncation_field, + pad_to_max_length=pad_to_max_length, + prompt_template=prompt_template, + virtual_tokens=virtual_tokens, + tokens_to_generate=tokens_to_generate, + context_key=context_key, + answer_key=answer_key, + end_string=end_string, + sample_alpha=sample_alpha, + ) + self.is_megatron_iterable = True + self.shard_manifests = shard_manifests + + # Shard manifests if necessary and possible and then expand the paths + manifest_filepath = shard_manifests_if_needed( + shard_manifests=shard_manifests, + shard_strategy=shard_strategy, + manifest_filepaths=manifest_filepath, + world_size=world_size, + global_rank=global_rank, + ) + + # If necessary, cache manifests from object store + cache_datastore_manifests(manifest_filepaths=manifest_filepath) + + self.collection = collections.SpeechLLMAudioTextCollection( + manifests_files=manifest_filepath, + min_duration=min_duration, + max_duration=max_duration, + index_by_file_id=True, + context_file=context_file, + context_key=context_key, + answer_key=answer_key, + ) + + self.len = self._compute_len() + + self.featurizer = WaveformFeaturizer(sample_rate=sample_rate, int_values=int_values, augmentor=augmentor) + self.trim = trim + + audio_tar_filepaths = expand_sharded_filepaths( + sharded_filepaths=audio_tar_filepaths, + shard_strategy=shard_strategy, + world_size=world_size, + global_rank=global_rank, + ) + + # Put together WebDataset + self._dataset = wds.WebDataset(urls=audio_tar_filepaths, nodesplitter=None) + + if shuffle_n == 0: + logging.info("WebDataset will not shuffle files within the tar files.") + + # Put together WebDataset pipeline + self._dataset = wds.DataPipeline( + 
wds.SimpleShardList(urls=audio_tar_filepaths), + webdataset_split_by_workers, + wds.shuffle(shuffle_n), + wds.tarfile_to_samples(), + wds.rename(audio=VALID_FILE_FORMATS, key='__key__'), + wds.to_tuple('audio', 'key'), + self._filter, + self._loop_offsets, + wds.map(self._build_sample), + ) + + def _filter(self, iterator): + """This function is used to remove samples that have been filtered out by ASRAudioText already. + Otherwise, we would get a KeyError as _build_sample attempts to find the manifest entry for a sample + that was filtered out (e.g. for duration). + Note that if using multi-GPU training, filtering may lead to an imbalance in samples in each shard, + which may make your code hang as one process will finish before the other. + """ + return TarredAudioFilter(self.collection, iterator) + + def _loop_offsets(self, iterator): + """This function is used to iterate through utterances with different offsets for each file.""" + return TarredAudioLoopOffsets(self.collection, iterator) + + def _collate_fn(self, batch): + return _speechllm_audio_text_collate_fn( + batch=batch, + tokens_to_generate=self.tokens_to_generate, + pad_to_max_length=self.pad_to_max_length, + max_seq_length=self.max_seq_length, + text_pad_id=self.pad_id, + ) + + def collate_fn(self, batch): + # override collate_fn to skip type checking + return self._collate_fn(batch) + + def _build_sample(self, tup): + """Builds the training sample by combining the data from the WebDataset with the manifest info.""" + audio_bytes, audio_filename, offset_id = tup + + if audio_filename is not None: + # Grab manifest entry from self.manifest_preprocessor.collection + file_id, _ = os.path.splitext(os.path.basename(audio_filename)) + manifest_idx = self.collection.mapping[file_id][offset_id] + manifest_entry = self.collection[manifest_idx] + + # init output dict + output = {"idx": manifest_idx} + + offset = manifest_entry.offset + if offset is None: + offset = 0 + # Convert audio bytes to IO stream for processing (for SoundFile to read) + audio_filestream = io.BytesIO(audio_bytes) + features = self.featurizer.process( + audio_filestream, + offset=offset, + duration=manifest_entry.duration, + trim=self.trim, + orig_sr=manifest_entry.orig_sr, + ) + audio_filestream.close() + + # Audio features + output["audio_signal"] = features + output["audio_length"] = torch.tensor(features.shape[0]).long() + else: + # dummy features + output["audio_signal"] = torch.zeros([80]) + # accomodates normalize_batch + output["audio_length"] = torch.tensor(80) + + # Text features + text_data = self._process_example(context=manifest_entry.context, output=manifest_entry.answer) + + output.update(text_data) + + output['metadata'] = { + 'audio_filepath': audio_filename, + 'offset': offset, + 'duration': manifest_entry.duration, + } + return output + + def get_manifest_sample(self, sample_id): + return self.collection[sample_id] + + def __iter__(self): + return self._dataset.__iter__() + + def _compute_len(self): + # TODO: need to figure out why here needs to be divided by world_size, while in ASR we don't need to. 
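+        # With sharded manifests, each rank only sees its own subset of the manifest, so the local
+        # lengths are summed across ranks via all_reduce and then divided by the data-parallel world
+        # size to get a per-rank length; without sharding, the full manifest length is divided directly.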
+ if self.shard_manifests and torch.distributed.is_available() and torch.distributed.is_initialized(): + my_len = torch.tensor(len(self.collection), dtype=torch.int32).cuda() + torch.distributed.all_reduce(my_len) + my_len = my_len.int() // parallel_state.get_data_parallel_world_size() + logging.info(f'Sharded manifests: Total length: {my_len}') + else: + my_len = len(self.collection) // parallel_state.get_data_parallel_world_size() + + return my_len + + def __len__(self): + return self.len + + +def get_tarred_audio_text_dataset( + config, + tokenizer, + augmentor, + global_rank=0, + world_size=1, + shuffle_n=0, + sep_id=None, + answer_only_loss=True, + virtual_tokens=0, +): + tarred_audio_filepaths = config['tarred_audio_filepaths'] + manifest_filepaths = config['manifest_filepath'] + datasets = [] + tarred_audio_filepaths = convert_to_config_list(tarred_audio_filepaths) + manifest_filepaths = convert_to_config_list(manifest_filepaths) + + bucketing_weights = config.get('bucketing_weights', None) # For upsampling buckets + if bucketing_weights: + for idx, weight in enumerate(bucketing_weights): + if not isinstance(weight, int) or weight <= 0: + raise ValueError(f"bucket weights must be positive integers") + + if len(manifest_filepaths) != len(tarred_audio_filepaths): + raise ValueError( + f"manifest_filepaths (length={len(manifest_filepaths)}) and tarred_audio_filepaths (length={len(tarred_audio_filepaths)}) need to have the same number of buckets." + ) + + if 'labels' not in config: + logging.warning(f"dataset does not have explicitly defined labels") + + if 'max_utts' in config: + raise ValueError('"max_utts" parameter is not supported for tarred datasets') + + for dataset_idx, (tarred_audio_filepath, manifest_filepath) in enumerate( + zip(tarred_audio_filepaths, manifest_filepaths) + ): + if len(tarred_audio_filepath) == 1: + tarred_audio_filepath = tarred_audio_filepath[0] + if len(manifest_filepath) == 1: + manifest_filepath = manifest_filepath[0] + + dataset = TarredAudioTextDataset( + audio_tar_filepaths=tarred_audio_filepath, + manifest_filepath=manifest_filepath, + tokenizer=tokenizer, + sample_rate=config['sample_rate'], + int_values=config.get('int_values', False), + augmentor=augmentor, + shuffle_n=shuffle_n, + max_duration=config.get('max_duration', None), + min_duration=config.get('min_duration', None), + trim=config.get('trim_silence', False), + shard_strategy=config.get('tarred_shard_strategy', 'scatter'), + shard_manifests=config.get('shard_manifests', False), + global_rank=global_rank, + world_size=world_size, + max_seq_length=config.max_seq_length, + min_seq_length=config.min_seq_length, + add_bos=config.get('add_bos', False), + add_eos=config.get('add_eos', True), + add_sep=config.get('add_sep', False), + sep_id=sep_id, + separate_prompt_and_response_with_newline=config.get('separate_prompt_and_response_with_newline', True), + answer_only_loss=answer_only_loss, + truncation_field=config.get('truncation_field', 'context'), + pad_to_max_length=False, + prompt_template=config.get('prompt_template', None), + virtual_tokens=virtual_tokens, + tokens_to_generate=config.get( + 'tokens_to_generate', 0 + ), # used at inference time to allocate tensor positions for tokens that will be generated by inf procedure. 
+ context_key=config.get('context_key', 'context'), + answer_key=config.get('answer_key', 'answer'), + end_string=config.get('end_string', None), + sample_alpha=config.get('sample_alpha', None), + context_file=config.get('context_file', None), + ) + + if bucketing_weights: + [datasets.append(dataset) for _ in range(bucketing_weights[dataset_idx])] + else: + datasets.append(dataset) + + with open_dict(config): # patch for bucketing tarred datasets + config['batch_size'] = config.get("micro_batch_size", 1) + return get_chain_dataset(datasets=datasets, ds_config=config, rank=global_rank) + + +def get_concat_tarred_audio_text_dataset( + config, + tokenizer, + augmentor, + global_rank=0, + world_size=1, + shuffle_n=0, + sep_id=None, + answer_only_loss=True, + virtual_tokens=0, +): + tarred_audio_filepaths = config['tarred_audio_filepaths'] + manifest_filepaths = config['manifest_filepath'] + datasets = [] + for dataset_idx, (tarred_audio_filepath, manifest_filepath) in enumerate( + zip(tarred_audio_filepaths, manifest_filepaths) + ): + conf = copy.deepcopy(config) + conf['manifest_filepath'] = manifest_filepath + conf['tarred_audio_filepaths'] = tarred_audio_filepath + context_files = config.get('context_file', None) + if isinstance(context_files, ListConfig) and len(context_files) == len(manifest_filepaths): + conf['context_file'] = context_files[dataset_idx] + else: + conf['context_file'] = context_files + dataset = get_tarred_audio_text_dataset( + config=conf, + tokenizer=tokenizer, + shuffle_n=shuffle_n, + global_rank=global_rank, + world_size=world_size, + augmentor=augmentor, + sep_id=sep_id, + answer_only_loss=answer_only_loss, + virtual_tokens=virtual_tokens, + ) + datasets.append(dataset) + + concat_sampling_probabilities = config.get('concat_sampling_probabilities', None) + if not isinstance(concat_sampling_probabilities, ListConfig) or len(concat_sampling_probabilities) != len( + datasets + ): + logging.info( + f"concat_sampling_probabilities is not provided or is not of the same size as datasets, using uniform sampling." + ) + concat_sampling_probabilities = [1.0 / len(datasets)] * len(datasets) + + dataset = ConcatDataset( + datasets, + sampling_technique=config.get('concat_sampling_technique', 'temperature'), + sampling_temperature=config.get('concat_sampling_temperature', 5), + sampling_scale=config.get('concat_sampling_scale', 1), + sampling_probabilities=concat_sampling_probabilities, + shuffle=config.get('concat_shuffle', True), + seed=config.get('concat_sampling_seed', None), + global_rank=global_rank, + world_size=world_size, + ) + return dataset + + +def get_tarred_audio_text_dataset_from_config( + config: DictConfig, + tokenizer, + augmentor, + global_rank: int = 0, + world_size: int = 1, + sep_id: Optional[int] = None, + answer_only_loss: bool = True, + virtual_tokens: int = 0, +): + is_concat = config.get('is_concat', False) + if is_concat: + if 'concat_sampling_technique' in config and config['concat_sampling_technique'] is None: + logging.warning( + f"Concat dataset requires `concat_sampling_technique` but it was not provided. 
Config: {config}" + ) + return None + + data_parallel_size = parallel_state.get_data_parallel_world_size() + num_micro_batches = config.global_batch_size // (config.micro_batch_size * data_parallel_size) + global_batch_size_on_this_data_parallel_rank = num_micro_batches * config.micro_batch_size + shuffle = config['shuffle'] + shuffle_n = config.get('shuffle_n', 4 * global_batch_size_on_this_data_parallel_rank) if shuffle else 0 + if is_concat: + dataset = get_concat_tarred_audio_text_dataset( + config=config, + tokenizer=tokenizer, + augmentor=augmentor, + shuffle_n=shuffle_n, + global_rank=global_rank, + world_size=world_size, + sep_id=sep_id, + answer_only_loss=answer_only_loss, + virtual_tokens=virtual_tokens, + ) + else: + dataset = get_tarred_audio_text_dataset( + config=config, + tokenizer=tokenizer, + augmentor=augmentor, + shuffle_n=shuffle_n, + global_rank=global_rank, + world_size=world_size, + sep_id=sep_id, + answer_only_loss=answer_only_loss, + virtual_tokens=virtual_tokens, + ) + return dataset + + +def get_audio_text_dataset_from_config( + manifest_filepath: str, + config: DictConfig, + tokenizer, + augmentor, + is_train, + sep_id: Optional[int] = None, + answer_only_loss: bool = True, + virtual_tokens: int = 0, +): + if isinstance(config.manifest_filepath, str): + manifest_filepath = config.manifest_filepath.split(',') + else: + manifest_filepath = config.manifest_filepath + + data_cls = MultiAudioTextDataset if config.get('audio_locator', None) else AudioTextDataset + datasets = [] + if is_train: + # Construct the data prefix list for `get_datasets_weights_and_num_samples()` + # that is of the format [weight1,file_name1,weight2,file_name2,...] + concat_sampling_probabilities = config.get('concat_sampling_probabilities', None) + if concat_sampling_probabilities is None: + concat_sampling_probabilities = [1.0 / len(manifest_filepath)] * len(manifest_filepath) + elif len(config.get('concat_sampling_probabilities', None)) != len(manifest_filepath): + raise ValueError( + ( + f"concat_sampling_probabilities must be of the same size as manifest_filepath.", + f"Provided size {len(config.concat_sampling_probabilities)}, number of datasets {len(manifest_filepath)}", + ) + ) + data_prefix = [] + for weight, prefix in zip(concat_sampling_probabilities, manifest_filepath): + data_prefix.append(weight) + data_prefix.append(prefix) + + num_samples_per_dataset = get_num_samples_from_files(manifest_filepath) + num_train_samples = [len(manifest_filepath) * max(num_samples_per_dataset)] + _, _, num_train_samples_per_dataset = get_datasets_weights_and_num_samples(data_prefix, num_train_samples) + num_train_samples_after_blend = sum([x[0] for x in num_train_samples_per_dataset]) + else: + num_train_samples_per_dataset = [[None]] * len(manifest_filepath) + + for dataset_idx, (file_path, num_samples) in enumerate(zip(manifest_filepath, num_train_samples_per_dataset)): + context_file = config.get('context_file', None) + if isinstance(context_file, ListConfig) and len(context_file) == len(manifest_filepath): + context_file = context_file[dataset_idx] + dataset = data_cls( + manifest_filepath=file_path, + tokenizer=tokenizer, + sample_rate=config.sample_rate, + int_values=config.get('int_values', False), + augmentor=augmentor, + max_duration=getattr(config, 'max_duration', None), + min_duration=getattr(config, 'min_duration', None), + max_utts=getattr(config, 'max_utts', -1), + trim=getattr(config, 'trim_silence', False), + channel_selector=getattr(config, 'channel_selector', None), + 
max_seq_length=config.max_seq_length, + min_seq_length=config.min_seq_length, + add_bos=config.get('add_bos', False), + add_eos=config.get('add_eos', True), + add_sep=config.get('add_sep', False), + sep_id=sep_id, + max_num_samples=num_samples[0], + seed=config.get('seed', 1234), + separate_prompt_and_response_with_newline=config.get('separate_prompt_and_response_with_newline', True), + answer_only_loss=answer_only_loss, + truncation_field=config.get('truncation_field', 'context'), + pad_to_max_length=config.get('pad_to_max_length', False), + prompt_template=config.get('prompt_template', None), + virtual_tokens=virtual_tokens, + tokens_to_generate=config.get( + 'tokens_to_generate', 0 + ), # used at inference time to allocate tensor positions for tokens that will be generated by inf procedure. + context_key=config.get('context_key', 'context'), + answer_key=config.get('answer_key', 'answer'), + end_string=config.get('end_string', None), + sample_alpha=config.get('sample_alpha', None), + context_file=context_file, + audio_locator=config.get('audio_locator', None), + ) + datasets.append(dataset) + + if is_train: + dataset = BlendableDataset( + datasets=datasets, weights=concat_sampling_probabilities, size=num_train_samples_after_blend + ) + return dataset + else: + return datasets diff --git a/nemo/collections/multimodal/speech_llm/models/__init__.py b/nemo/collections/multimodal/speech_llm/models/__init__.py new file mode 100644 index 000000000000..ec188828ec87 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/models/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel diff --git a/nemo/collections/multimodal/speech_llm/models/modular_models.py b/nemo/collections/multimodal/speech_llm/models/modular_models.py new file mode 100644 index 000000000000..39bc37c33e56 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/models/modular_models.py @@ -0,0 +1,1563 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
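+"""Modular SpeechLLM models that couple a pretrained audio perception module with a Megatron GPT language model."""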
+ +import itertools +import json +import os +from typing import List, Optional, Union + +import hydra +import sacrebleu +import torch +from hydra.utils import get_class +from omegaconf import ListConfig +from omegaconf.dictconfig import DictConfig +from omegaconf.omegaconf import OmegaConf, open_dict +from pytorch_lightning.trainer.trainer import Trainer +from pytorch_lightning.utilities import rank_zero_only + +from nemo.collections.asr.models import ASRModel, EncDecSpeakerLabelModel +from nemo.collections.asr.parts.mixins.transcription import move_to_device +from nemo.collections.asr.parts.preprocessing.perturb import process_augmentations +from nemo.collections.asr.parts.utils.eval_utils import remove_punctuations +from nemo.collections.common.metrics import MetricStringToTorchMetric, TextMetricsSet +from nemo.collections.multimodal.speech_llm.data.audio_text_dataset import ( + get_audio_text_dataset_from_config, + get_tarred_audio_text_dataset_from_config, +) +from nemo.collections.multimodal.speech_llm.modules.common.audio_text_generation_utils import generate +from nemo.collections.multimodal.speech_llm.modules.perception_modules import ( + AudioPerceptionModule, + MultiAudioPerceptionModule, +) +from nemo.collections.multimodal.speech_llm.parts.mixins.adapter_mixin import SpeechLLMAdapterMixin +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import get_nested_dict_value +from nemo.collections.nlp.data.language_modeling.megatron.blendable_dataset import BlendableDataset +from nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers import ( + MegatronPretrainingBatchSampler, +) +from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel +from nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model import MegatronGPTSFTModel +from nemo.collections.nlp.modules.common.megatron.utils import ( + average_losses_across_data_parallel_group, + build_position_ids, +) +from nemo.collections.nlp.modules.common.text_generation_utils import get_computeprob_response +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP +from nemo.collections.nlp.parts.utils_funcs import get_last_rank +from nemo.core.classes import ModelPT +from nemo.core.classes.common import PretrainedModelInfo +from nemo.core.classes.mixins import adapter_mixins +from nemo.utils import AppState, logging +from nemo.utils.model_utils import inject_model_parallel_rank + +try: + from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator, get_num_microbatches + + HAVE_APEX = True +except (ImportError, ModuleNotFoundError): + HAVE_APEX = False + +try: + from megatron.core import InferenceParams, parallel_state, tensor_parallel + from megatron.core.models.gpt import GPTModel as MCoreGPTModel + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + HAVE_MEGATRON_CORE = False + + +__all__ = ["ModularAudioGPTModel"] + + +default_inference_config = {'tokens_to_generate': 30} + + +class ModularAudioGPTModel(SpeechLLMAdapterMixin, MegatronGPTSFTModel): + """Modularized speech GPT model.""" + + def __init__(self, cfg: DictConfig, trainer: Trainer): + self.cfg = cfg + super().__init__(cfg, trainer) + + self.perception = ( + AudioPerceptionModule(cfg=cfg.perception) + if "encoders" not in cfg.perception + else MultiAudioPerceptionModule(cfg=cfg.perception) + ) + # print out params in more details + self.summarize(max_depth=2) + + def parameters(self): + # override the same method in MegatronGPT 
model to include parameters outside of LM
+        all_names = []
+        all_params = []
+        for name, param in self.named_parameters(recurse=True):
+            all_names.append(name)
+            all_params.append(param)
+
+        if isinstance(self.model, list):
+            for module in self.model:
+                for name, param in module.named_parameters(recurse=True):
+                    all_names.append(name)
+                    all_params.append(param)
+
+        return itertools.chain(all_params)
+
+    def setup_optimizer_param_groups(self):
+        """
+        Override parent method to set up optimizer groups for training/freezing different parts of the model.
+        """
+        known_groups = []
+        if self.cfg.get('freeze_llm', True):
+            for param in self.model.parameters():
+                param.requires_grad = False
+            known_groups.append('model.')
+
+        if self.cfg.get('freeze_audio_encoder', False):
+            # freeze the speaker model if there is one
+            if self.cfg.perception.get("speaker_model", None) is not None:
+                if self.cfg.perception.speaker_model.get("freeze", False):
+                    self.perception.speaker_model.freeze()
+                    known_groups.append('perception.speaker_model.')
+            # freeze other audio encoders
+            if self.cfg.perception.get("encoders", None) is not None:
+                # multiple audio encoders
+                for key, enc_cfg in self.cfg.perception.encoders.items():
+                    if enc_cfg.get("freeze", False):
+                        self.perception.encoders[key].freeze()
+                        known_groups.append(f'perception.encoders.{key}.')
+            else:
+                # single audio encoder
+                self.perception.encoder.freeze()
+                known_groups.append('perception.encoder.')
+
+        if self.cfg.get('freeze_modality_adapter', False):
+            # freeze modality adapter
+            self.perception.modality_adapter.freeze()
+            known_groups.append('perception.modality_adapter.')
+
+        opt_params = []
+        for _, module in self.named_modules():
+            if isinstance(module, adapter_mixins.AdapterModuleMixin) and module.is_adapter_available():
+                # add adapters to the optimizer
+                module.set_enabled_adapters(enabled=True)
+                module.unfreeze_enabled_adapters()  # selectively unfreeze the adapter modules.
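+                # collect the adapter parameters explicitly so that they are still optimized even
+                # when the surrounding LLM / encoder parameter groups above are frozen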
+ opt_params += [p for p in module.parameters()] + + # add param groups with specified args, if any + param_groups = [] + if "optim_param_groups" in self.cfg: + param_groups_cfg = self.cfg.optim_param_groups + for group, group_cfg in param_groups_cfg.items(): + module = getattr(self, group, None) + if module is None: + raise ValueError(f"{group} not found in model.") + elif hasattr(module, "parameters"): + known_groups.append(f"{group}.") + new_group = {"params": module.parameters()} + for k, v in group_cfg.items(): + new_group[k] = v + param_groups.append(new_group) + else: + raise ValueError(f"{group} does not have parameters.") + + # add other trainable params + for n, p in self.named_parameters(): + is_unknown = True + for group in known_groups: + if n.startswith(group): + is_unknown = False + if is_unknown: + opt_params.append(p) + + param_groups = [{"params": opt_params}] + param_groups + + self._optimizer_param_groups = param_groups + logging.info(f"Optimizer groups set:\n{self.summarize(max_depth=2)}") + + def _create_attention_mask(self, encoder_input: torch.Tensor): + # Create causal attention mask for whole input + batch_size = encoder_input.shape[0] + max_len = encoder_input.shape[1] + attention_mask = torch.tril(torch.ones((batch_size, max_len, max_len), device=encoder_input.device)).view( + batch_size, 1, max_len, max_len + ) + # Convert attention mask from float to bool + attention_mask = attention_mask < 0.5 + return attention_mask + + def _concat_features(self, embs1, emb1_lens, embs2, emb2_lens): + """Concatenate two sets of embeddings and their lengths.""" + concat_emb = [] + concat_len = [] + for emb1, emb1_len, emb2, emb2_len in zip(embs1, emb1_lens, embs2, emb2_lens): + new_len = emb1_len + emb2_len + new_emb = torch.concat([emb1[:emb1_len], emb2[:emb2_len]], axis=0) + padded_new_emb = torch.zeros(emb1.shape[0] + emb2.shape[0], emb1.shape[-1], device=emb1.device) + padded_new_emb[:new_len, ...] 
= new_emb + concat_emb.append(padded_new_emb) + concat_len.append(new_len) + concat_emb = torch.stack(concat_emb, dim=0) + concat_len = torch.stack(concat_len, dim=0) + return concat_emb, concat_len + + def _concat_multi_features( + self, + encoded: List[torch.Tensor], + encoded_len: List[torch.Tensor], + input_embeds: torch.Tensor, + input_length: torch.Tensor, + context_start_idx: List[List[int]], + ): + """Concatenate multiple audio features with text segments.""" + encoder_input_list, encoder_length_list = [], [] + batch_size = input_embeds.size(0) + max_length = 0 + for i in range(batch_size): + start_idx_list_i = context_start_idx[i] + [ + input_embeds.size(1) + ] # use input_embeds instead of input_length to handle tokens_to_generate in inference + input_len_list = [start_idx_list_i[j + 1] - start_idx_list_i[j] for j in range(len(start_idx_list_i) - 1)] + input_emb_list = input_embeds[i].split(input_len_list) + encoder_input_i = [input_emb_list[0]] + for j in range(1, len(input_emb_list)): + encoder_input_i.append(encoded[i][j - 1][: encoded_len[i][j - 1]]) + encoder_input_i.append(input_emb_list[j]) + encoder_input_i = torch.cat(encoder_input_i) # T, C + encoder_length_i = encoded_len[i].sum() + input_length[i] # total length of audio and text features + max_length = max(max_length, encoder_input_i.size(0)) + encoder_input_list.append(encoder_input_i) + encoder_length_list.append(encoder_length_i) + + encoder_input = torch.stack( + [torch.nn.functional.pad(f, (0, 0, 0, max_length - f.size(0))) for f in encoder_input_list] + ) + encoder_length = torch.LongTensor(encoder_length_list).to(encoder_input.device) + return encoder_input, encoder_length + + def inject_perception_input( + self, + encoded: Union[torch.Tensor, List[torch.Tensor]], + encoded_len: Union[torch.Tensor, List[torch.Tensor]], + input_ids: torch.Tensor, + input_length: torch.Tensor, + context_start_idx: Optional[List[List[int]]] = None, + ): + """Inject audio features into the text input and return the final input embeddings to LLM.""" + # [b, t, c] + lm_embedding = ( + self.model.language_model.embedding if hasattr(self.model, 'language_model') else self.model.embedding + ) + input_embeds = lm_embedding.word_embeddings(input_ids) + if isinstance(encoded, torch.Tensor): + # single audio + encoder_input, encoder_length = self._concat_features(encoded, encoded_len, input_embeds, input_length) + else: + # concat multiple audios with text segments + encoder_input, encoder_length = self._concat_multi_features( + encoded, encoded_len, input_embeds, input_length, context_start_idx + ) + + attention_mask = self._create_attention_mask(encoder_input) + position_ids = build_position_ids(encoder_input[:, :, 0]) + + # Add position embeddings + if ( + getattr(lm_embedding, "position_embeddings", None) is not None + and lm_embedding.position_embedding_type == 'learned_absolute' + ): + position_embeddings = lm_embedding.position_embeddings(position_ids) + encoder_input = encoder_input + position_embeddings + + encoder_max_length = encoder_input.shape[1] + if not hasattr(lm_embedding, 'transpose_batch_sequence') or lm_embedding.transpose_batch_sequence: + encoder_input = encoder_input.transpose(0, 1).contiguous() + if self.cfg.get("sequence_parallel", False): + encoder_input = tensor_parallel.mappings.scatter_to_sequence_parallel_region(encoder_input) + return encoder_input, attention_mask, encoder_length, position_ids, encoder_max_length + + def _shift_labels_by_emb_len(self, labels, label_lens, emb_lens, max_len, pad_token=0): + 
"""Shift labels to the right by the length of the audio embeddings.""" + shifted_labels = [] + for label, label_len, emb_len in zip(labels, label_lens, emb_lens): + shifted_label = torch.full([max_len], pad_token, device=label.device) + shifted_label[emb_len : emb_len + label_len] = label[:label_len] + shifted_labels.append(shifted_label) + shifted_labels = torch.stack(shifted_labels, dim=0) + return shifted_labels + + def _get_text_embeddings(self, text_tokens, position_ids): + """Get text embeddings for the input text tokens.""" + lm_embedding = ( + self.model.language_model.embedding if hasattr(self.model, 'language_model') else self.model.embedding + ) + text_embeddings = lm_embedding.word_embeddings(text_tokens) # (batch_size, seq_len, hidden_size) + if hasattr(lm_embedding, 'position_embeddings'): + position_embeddings = lm_embedding.position_embeddings(position_ids) + text_embeddings = text_embeddings + position_embeddings + return text_embeddings.transpose(0, 1) + + def prepare_llm_input(self, audio_batch): + """Prepare input for the LLM.""" + input_signal = audio_batch['audio_signal'] + input_signal_length = audio_batch['audio_signal_length'] + + input_ids, input_length, labels, loss_mask = ( + audio_batch['tokens'], + audio_batch['tokens_length'], + audio_batch['labels'], + audio_batch['loss_mask'], + ) + + num_audios = audio_batch.get("num_audios", None) + context_start_idx = audio_batch.get("context_start_idx", None) + + # [b, t, c] + encoded, encoded_len = self.perception( + input_signal=input_signal, + input_signal_length=input_signal_length, + processed_signal=None, + processed_signal_length=None, + ) + + if num_audios is not None: + # split the encoded and encoded_len by num_audios, used when there're multiple audio files per sample + encoded = encoded.split(num_audios.tolist()) + encoded_len = encoded_len.split(num_audios.tolist()) + + encoder_input, attention_mask, encoder_length, _, encoder_max_length = self.inject_perception_input( + encoded, encoded_len, input_ids, input_length, context_start_idx + ) + if num_audios is not None: + # sum up the audio_feat_lens for each sample in the batch + encoded_len = torch.stack([torch.sum(lens) for lens in encoded_len]) + + # Shift labels to the right + labels = self._shift_labels_by_emb_len(labels, input_length, encoded_len, encoder_max_length, pad_token=0) + # Loss mask where answer tokens are 1.0 and all other tokens are 0.0 + loss_mask = self._shift_labels_by_emb_len( + loss_mask, input_length, encoded_len, encoder_max_length, pad_token=0 + ) + + return encoder_input, attention_mask, labels, loss_mask, encoder_length + + def forward( + self, + audio_batch, + checkpoint_activations_all_layers, + ): + """ + Forward pass of the model. We prepend audio embeddings to the instruction and label text tokens as the LLM input. 
+ """ + encoder_input, attention_mask, labels, loss_mask, _ = self.prepare_llm_input(audio_batch) + if self.mcore_gpt: + output = self.model( + input_ids=None, + position_ids=None, + decoder_input=encoder_input, + attention_mask=attention_mask, + labels=labels, + ) + else: + output = self.model( + input_ids=None, + position_ids=None, + encoder_input=encoder_input, + attention_mask=attention_mask, + labels=labels, + checkpoint_activations_all_layers=checkpoint_activations_all_layers, + ) + + return output, loss_mask + + def get_forward_output_only_func(self): + def fwd_output_only_func(dataloader_iter, model): + batch = next(dataloader_iter) + extra_arg = {} + # take the batch produced by prepare_batch_at_step + ( + tokens, + input_embeddings, + attention_mask, + position_ids, + set_inference_key_value_memory, + inference_max_sequence_len, + ) = batch + tokens = tokens.cuda() + + if attention_mask is not None: + attention_mask = attention_mask.cuda() + attention_mask = attention_mask[0:1] + if self.mcore_gpt: + # if first step, then clear KV cache, otherwise reuse inference_paarms + if set_inference_key_value_memory[0].item(): + self.inference_params = InferenceParams( + max_batch_size=tokens.size(0), max_sequence_length=inference_max_sequence_len[0].item() + ) + extra_arg['inference_params'] = self.inference_params + else: + extra_arg['set_inference_key_value_memory'] = set_inference_key_value_memory[0].item() + extra_arg['inference_max_sequence_len'] = inference_max_sequence_len[0].item() + + # Currently for all MCore transformer layer specs causal attention mask + # is used so we can delegate creating it to MCore/TE and pass None below + if ( + isinstance(model, MCoreGPTModel) + or hasattr(model, "module") + and isinstance(model.module, MCoreGPTModel) + ): + attention_mask = None + + output_tensor = model( + input_ids=None, + position_ids=None, + decoder_input=input_embeddings, + attention_mask=attention_mask, + **extra_arg, + ) + + # Advance inference sequence offset. 
+ if self.inference_params: + # if last stage, then (final) output is [b, s, h], otherwise it's [s, b, h] + if parallel_state.is_pipeline_last_stage(): + self.inference_params.sequence_len_offset += output_tensor.size(1) + else: + self.inference_params.sequence_len_offset += output_tensor.size(0) + + def id_func(output_tensor): + return output_tensor, {'logits': output_tensor} + + return output_tensor, id_func + + return fwd_output_only_func + + def get_forward_output_and_loss_func(self, validation_step=False, tuning=False): + def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_layers=None): + batch = next(dataloader_iter) + + # Transfer needed data to GPU + required_keys = set() + if parallel_state.get_pipeline_model_parallel_world_size() == 1: + required_keys.update(batch.keys()) + else: + required_keys.add('attention_mask') + if parallel_state.is_pipeline_first_stage(): + required_keys.update(('tokens', 'position_ids')) + if parallel_state.is_pipeline_last_stage(): + required_keys.update(('labels', 'loss_mask')) + if self.get_attention_mask_from_fusion and 'attention_mask' in required_keys: + required_keys.remove('attention_mask') + + batch = move_to_device(batch, self.device) + batch = self.get_batch_on_this_context_parallel_rank(batch) + + if not self.mcore_gpt: + batch['checkpoint_activations_all_layers'] = checkpoint_activations_all_layers + + output_tensor, loss_mask = self.forward( + batch, checkpoint_activations_all_layers=checkpoint_activations_all_layers + ) + batch['loss_mask'] = loss_mask + + def loss_func(output_tensor): + # Loss for a micro-batch (ub) + loss_for_ub = self.loss_func(batch['loss_mask'], batch['num_valid_tokens_in_ub'], output_tensor) + cp_size = self.cfg.get('context_parallel_size', 1) + if self.cfg.data.get( + "return_output_tensors", False + ): # TODO: need a better way to check if loss_func is returning more stuff than just loss... 
(@adithyare) + loss_for_ub, q_hs, d_hs, pos_cs, neg_cs, diff_cs = loss_for_ub + reduced_loss = average_losses_across_data_parallel_group([loss_for_ub]) + pos_cs = average_losses_across_data_parallel_group([pos_cs]) + neg_cs = average_losses_across_data_parallel_group([neg_cs]) + diff_cs = average_losses_across_data_parallel_group([diff_cs]) + return ( + loss_for_ub * cp_size, + { + 'avg': reduced_loss, + 'query_hs': q_hs, + 'doc_hs': d_hs, + 'avg_pos_cs': pos_cs, + 'avg_neg_cs': neg_cs, + 'diff_cs': diff_cs, + }, + ) + elif validation_step and not self.cfg.data.get('validation_drop_last', True): + num_valid_tokens_in_ub = batch['num_valid_tokens_in_ub'] + if loss_for_ub.isnan(): + assert batch['loss_mask'].count_nonzero() == 0, 'Got NaN loss with non-empty input' + loss_sum_for_ub = torch.zeros_like(num_valid_tokens_in_ub) + else: + loss_sum_for_ub = num_valid_tokens_in_ub * loss_for_ub + + loss_sum_and_ub_size_all_gpu = torch.cat( + [ + loss_sum_for_ub.clone().detach().view(1), + torch.tensor([num_valid_tokens_in_ub]).cuda().clone().detach(), + ] + ) + # Could potentially reduce num_valid_samples_in_microbatch and use that to aggregate instead of len(self._validation_ds) + torch.distributed.all_reduce( + loss_sum_and_ub_size_all_gpu, group=parallel_state.get_data_parallel_group() + ) + return loss_for_ub * cp_size, {'loss_sum_and_ub_size': loss_sum_and_ub_size_all_gpu} + else: + reduced_loss = average_losses_across_data_parallel_group([loss_for_ub]) + return loss_for_ub * cp_size, {'avg': reduced_loss} + + return output_tensor, loss_func + + return fwd_output_and_loss_func + + def _build_dataset(self, data_cfg, is_train=True): + if 'augmentor' in data_cfg: + augmentor = process_augmentations( + data_cfg['augmentor'], global_rank=self.global_rank, world_size=self.world_size + ) + else: + augmentor = None + + # Check dataset max_seq_legnth and max_position_embeddings size + if ( + self.cfg.get('position_embedding_type', None) in [None, 'learned_absolute'] + and data_cfg.max_seq_length > self.cfg.max_position_embeddings + ): + logging.warning( + f"Set dataset max_seq_length to max_position_embeddings {self.cfg.max_position_embeddings} if using learned_absolute position embedding" + ) + data_cfg.max_seq_length = self.cfg.max_position_embeddings + + # Notably, the data weights are controlled by either bucketing_weights + # or concat_sampling_probabilities depending on the dataset type. 
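To make the max_seq_length clamping in _build_dataset above concrete, here is a minimal self-contained sketch using OmegaConf; the config values are illustrative, not defaults from this patch:

from omegaconf import OmegaConf

# illustrative values only; the real fields come from the model and dataset YAML configs
model_cfg = OmegaConf.create({"position_embedding_type": "learned_absolute", "max_position_embeddings": 4096})
data_cfg = OmegaConf.create({"max_seq_length": 8192})

# learned absolute position embeddings cannot index past the embedding table,
# so the dataset max_seq_length is clamped down to max_position_embeddings
if (
    model_cfg.get("position_embedding_type", None) in [None, "learned_absolute"]
    and data_cfg.max_seq_length > model_cfg.max_position_embeddings
):
    data_cfg.max_seq_length = model_cfg.max_position_embeddings

print(data_cfg.max_seq_length)  # 4096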
+ if data_cfg.get('is_tarred', False): + return get_tarred_audio_text_dataset_from_config( + config=data_cfg, + tokenizer=self.tokenizer, + augmentor=augmentor, + sep_id=self.sep_id, + answer_only_loss=self.cfg.get('answer_only_loss', True), + virtual_tokens=self.virtual_tokens, + global_rank=parallel_state.get_data_parallel_rank(), + world_size=parallel_state.get_data_parallel_world_size(), + ) + else: + return get_audio_text_dataset_from_config( + manifest_filepath=data_cfg.manifest_filepath, + config=data_cfg, + tokenizer=self.tokenizer, + augmentor=augmentor, + is_train=is_train, + sep_id=self.sep_id, + answer_only_loss=self.cfg.get('answer_only_loss', True), + virtual_tokens=self.virtual_tokens, + ) + + def build_data_loader(self, dataset, data_cfg, consumed_samples=0, is_predict=False): + """Buld dataloader given an input dataset.""" + logging.info(f'Building dataloader with consumed samples: {consumed_samples}') + if isinstance(dataset, BlendableDataset): + collate_fn = dataset.datasets[0].collate_fn + elif hasattr(dataset, 'collate_fn'): + collate_fn = dataset.collate_fn + elif hasattr(dataset.datasets[0], 'collate_fn'): + # support datasets that are lists of entries + collate_fn = dataset.datasets[0].collate_fn + else: + # support datasets that are lists of lists + collate_fn = dataset.datasets[0].datasets[0].collate_fn + + if isinstance(dataset, torch.utils.data.IterableDataset): + data_parallel_size = parallel_state.get_data_parallel_world_size() + num_micro_batches = data_cfg.global_batch_size // (data_cfg.micro_batch_size * data_parallel_size) + global_batch_size_on_this_data_parallel_rank = num_micro_batches * data_cfg.micro_batch_size + + dataloader = torch.utils.data.DataLoader( + dataset, + collate_fn=collate_fn, + shuffle=False, + batch_size=global_batch_size_on_this_data_parallel_rank, + drop_last=True, + num_workers=data_cfg.num_workers, + pin_memory=data_cfg.pin_memory, + ) + return dataloader + + if is_predict: + # MegatronPretrainingBatchSampler doesn't work with trainer.predict() + dataloader = torch.utils.data.DataLoader( + dataset, + collate_fn=collate_fn, + batch_size=data_cfg.micro_batch_size, + num_workers=data_cfg.num_workers, + pin_memory=data_cfg.pin_memory, + ) + return dataloader + + batch_sampler = MegatronPretrainingBatchSampler( + total_samples=len(dataset), + consumed_samples=consumed_samples, + micro_batch_size=data_cfg.micro_batch_size, + global_batch_size=data_cfg.global_batch_size, + data_parallel_rank=parallel_state.get_data_parallel_rank(), + data_parallel_size=parallel_state.get_data_parallel_world_size(), + drop_last=data_cfg.drop_last, + pad_samples_to_global_batch_size=not data_cfg.drop_last, + ) + + dataloader = torch.utils.data.DataLoader( + dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=data_cfg.num_workers, + pin_memory=data_cfg.pin_memory, + persistent_workers=True if data_cfg.num_workers > 0 else False, + ) + return dataloader + + @classmethod + def _modify_audio_encoder_config(cls, gpt_cfg, audio_cfg, speaker_cfg=None): + """load the ecoder configs from the pretrained audio models and updating the model's config.""" + with open_dict(gpt_cfg): + use_multi_encoder = gpt_cfg.perception.get("encoders", None) is not None + if not use_multi_encoder: + gpt_cfg.perception.preprocessor = audio_cfg.preprocessor + gpt_cfg.perception.encoder = audio_cfg.encoder + else: + for key in gpt_cfg.perception.encoders: + model_key = gpt_cfg.perception.encoders[key].get("model_key", "encoder") + 
gpt_cfg.perception.encoders[key]["model"] = audio_cfg[key][model_key] + if "preprocessor" in audio_cfg[key]: + gpt_cfg.perception.encoders[key]['preprocessor'] = audio_cfg[key].preprocessor + if speaker_cfg is not None: + gpt_cfg.perception.speaker_model.model = speaker_cfg + + gpt_cfg.perception.output_dim = gpt_cfg.hidden_size + modality_adapter_cfg = gpt_cfg.perception.modality_adapter + if 'output_dim' in modality_adapter_cfg: + modality_adapter_cfg.output_dim = gpt_cfg.hidden_size + if not use_multi_encoder: + model_dim_key = gpt_cfg.perception.get("model_dim_key", "d_model") + encoder_dim = get_nested_dict_value(audio_cfg.encoder, model_dim_key) + input_dim = encoder_dim + if ( + gpt_cfg.perception.get('use_multi_layer_feat', False) + and gpt_cfg.perception.multi_layer_feat.aggregator.get("mode", "cat") == "cat" + ): + input_dim = encoder_dim * len(gpt_cfg.perception.multi_layer_feat.layer_idx_list) + else: + input_dim = 0 + if speaker_cfg is not None: + input_dim += speaker_cfg.decoder.emb_sizes + for enc_cfg in gpt_cfg.perception.encoders.values(): + encoder_dim = get_nested_dict_value(enc_cfg.model, enc_cfg.get("model_dim_key", "d_model")) + if ( + enc_cfg.get('use_multi_layer_feat', False) + and enc_cfg.multi_layer_feat.aggregator.get("mode", "cat") == "cat" + ): + input_dim += encoder_dim * len(enc_cfg.multi_layer_feat.layer_idx_list) + else: + input_dim += encoder_dim + + if 'feat_in' in modality_adapter_cfg: + modality_adapter_cfg.feat_in = input_dim + elif 'input_dim' in modality_adapter_cfg: + modality_adapter_cfg.input_dim = input_dim + + @classmethod + def _modify_config(cls, gpt_cfg, cfg, audio_cfg, add_cfg_to_tree=False, speaker_cfg=None): + """ + This function modifies the original gpt pre-training config (gpt_cfg) with attributes from the finetuning config (cfg). + The `add_cfg_to_tree` arg adds `cfg` to the top of the yaml tree which is needed for all `hparams.yaml` files when passed as an arg to `load_from_checkpoint()`. + """ + OmegaConf.set_struct(gpt_cfg, True) + OmegaConf.resolve(cfg) + with open_dict(gpt_cfg): + # for AudioGPTLoRAModel + gpt_cfg.target = f"{cls.__module__}.{cls.__name__}" + gpt_cfg.perception = cfg.model.perception + # inject audio encoder configs into the target config (gpt_cfg) + cls._modify_audio_encoder_config(gpt_cfg, audio_cfg, speaker_cfg) + + # inject the sample rate from the audio encoder into the gpt config + if isinstance(audio_cfg, (ListConfig, list)): + sample_rate = [_cfg.preprocessor.sample_rate for _cfg in audio_cfg] + if not all([sr == sample_rate[0] for sr in sample_rate]): + raise ValueError("All audio encoders must have the same sample rate.") + gpt_cfg.data.train_ds.sample_rate = sample_rate[0] + gpt_cfg.data.validation_ds.sample_rate = sample_rate[0] + else: + sample_rate = audio_cfg.preprocessor.sample_rate + gpt_cfg.data.train_ds.sample_rate = sample_rate + gpt_cfg.data.validation_ds.sample_rate = sample_rate + + # This is needed when modifying a hparam file directly to load `.ckpt` files. + # This is not needed to modify the cfg in `.nemo` files. 
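The input_dim bookkeeping above reduces to simple arithmetic: each encoder contributes its d_model (multiplied by the number of aggregated layers when the multi-layer aggregator mode is "cat"), and an optional speaker embedding width is added on top. A small sketch with made-up dimensions:

# made-up dimensions for illustration; they do not come from any released checkpoint
encoder_dims = {"asr": 512, "ssl": 512}      # d_model of each audio encoder
cat_layers = {"asr": 1, "ssl": 4}            # layers concatenated when aggregator mode == "cat"
speaker_emb_dim = 192                        # added when a speaker model is configured

# mirrors the accumulation above: feat_in of the modality adapter is the sum of all encoder widths
feat_in = sum(dim * cat_layers[name] for name, dim in encoder_dims.items()) + speaker_emb_dim
print(feat_in)  # 512*1 + 512*4 + 192 = 2752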
+ if add_cfg_to_tree: + OmegaConf.resolve(gpt_cfg) + gpt_cfg.cfg = gpt_cfg + + return gpt_cfg + + @classmethod + def get_pretraind_audio_model(cls, encoder_cfg: DictConfig) -> ModelPT: + """load pretrained audio model from a given config""" + if encoder_cfg.get("_target_", None) is not None: + encoder_cls = get_class(encoder_cfg.get("_target_")) + elif encoder_cfg.get("target", None) is not None: + encoder_cls = get_class(encoder_cfg.get("target")) + else: + encoder_cls = ASRModel + + pretrained_model = encoder_cfg.get('pretrained_model', None) + if pretrained_model is None: + return None + if encoder_cls is None: + raise ValueError( + f"Must specify a valid encoder class in the via the `_target_` field in the config: {encoder_cfg}" + ) + + if pretrained_model.endswith('.nemo'): + logging.info(f'Loading pretrained audio model from local file: {pretrained_model}') + audio_model = encoder_cls.restore_from(pretrained_model, map_location='cpu') + else: + logging.info(f'Loading pretrained audio model from NGC: {pretrained_model}') + audio_model = encoder_cls.from_pretrained(pretrained_model, map_location='cpu') + return audio_model + + @classmethod + def get_speaker_model_and_config(cls, cfg): + """load speaker embedding model and config if present in the config.""" + if 'speaker_model' in cfg.model.perception: + if cfg.model.get("_target_", None) is not None: + model_cls = get_class(cfg.model.get("_target_")) + elif cfg.model.get("target", None) is not None: + model_cls = get_class(cfg.model.get("target")) + else: + model_cls = EncDecSpeakerLabelModel + + speaker_cfg = cfg.model.perception.speaker_model + if speaker_cfg.get('pretrained_model', None) is not None: + if speaker_cfg.pretrained_model.endswith('.nemo'): + logging.info(f'Loading pretrained speaker model from local file: {speaker_cfg.pretrained_model}') + speaker_model = model_cls.restore_from(speaker_cfg.pretrained_model, map_location='cpu') + else: + logging.info(f'Loading pretrained speaker model from NGC: {speaker_cfg.pretrained_model}') + speaker_model = model_cls.from_pretrained(speaker_cfg.pretrained_model, map_location='cpu') + return speaker_model, speaker_model.cfg + return None, None + else: + return None, None + + @classmethod + def get_audio_encoder_models_and_configs(cls, cfg): + if 'encoders' in cfg.model.perception: + audio_encoders = {} + audio_enc_cfgs = {} + for key, encoder_cfg in cfg.model.perception.encoders.items(): + audio_encoders[key] = cls.get_pretraind_audio_model(encoder_cfg) + audio_enc_cfgs[key] = audio_encoders[key].cfg + return audio_encoders, audio_enc_cfgs + else: + pretrained_audio_model = cfg.model.get("pretrained_audio_model", None) + pretrained_audio_model_class = cfg.model.get( + "pretrained_audio_model_target", "nemo.collections.asr.models.ASRModel" + ) + + model_class = hydra.utils.get_class(pretrained_audio_model_class) + if pretrained_audio_model.endswith('.nemo'): + logging.info(f'Loading pretrained audio model from local file: {pretrained_audio_model}') + audio_model = model_class.restore_from(pretrained_audio_model, map_location='cpu') + else: + logging.info(f'Loading pretrained audio model from NGC: {pretrained_audio_model}') + audio_model = model_class.from_pretrained(pretrained_audio_model, map_location='cpu') + return audio_model, audio_model.cfg + + @classmethod + def load_pretrained_audio_weights( + cls, cfg, model, audio_model, speaker_model: Optional[EncDecSpeakerLabelModel] = None + ): + use_multi_encoder = cfg.model.perception.get("encoders", None) is not None + if not 
use_multi_encoder: + if cfg.model.perception.get("use_multi_layer_feat", False): + model.perception.encoder.encoder.load_state_dict(audio_model.encoder.state_dict(), strict=True) + else: + model.perception.encoder.load_state_dict(audio_model.encoder.state_dict(), strict=True) + logging.info(f'Loaded pretrained audio model weights from {cfg.model.pretrained_audio_model}') + if cfg.model.get('use_am_tokenizer', False): + model.tokenizer = audio_model.tokenizer + logging.info(f'Use AM tokenizer: {audio_model.tokenizer}') + return model + else: + for key, enc_cfg in cfg.model.perception.encoders.items(): + if enc_cfg.get("use_multi_layer_feat", False): + model.perception.encoders[key].encoder.load_state_dict( + audio_model[key].encoder.state_dict(), strict=True + ) + else: + model.perception.encoders[key].load_state_dict(audio_model[key].encoder.state_dict(), strict=True) + logging.info(f'Loaded pretrained audio model weights for {key}') + if speaker_model is not None: + model.perception.speaker_model.load_state_dict(speaker_model.state_dict(), strict=True) + logging.info(f'Loaded pretrained speaker model weights') + return model + + @classmethod + def restore_from_pretrained_models( + cls, + cfg: Optional[Union[OmegaConf, str]] = None, + trainer: Optional[Trainer] = None, + ): + """ + load pretrained LLM and audio encoders, and maybe add adapters, used for training. + Args: + cfg: input yaml config, with trainer, model, exp_manager, etc. + trainer: trainer object + """ + if ( + cfg.model.get("pretrained_audio_model", None) is None + and cfg.model.perception.get("encoders", None) is None + ): + raise RuntimeError("PEFT training needs at least one pretrained audio model present.") + + if not cfg.model.restore_from_path: + raise RuntimeError("PEFT training needs a trained base model present.") + + base_model_cfg = MegatronGPTSFTModel.merge_cfg_with(cfg.model.restore_from_path, cfg) + audio_model, audio_model_cfg = cls.get_audio_encoder_models_and_configs(cfg) + speaker_model, speaker_cfg = cls.get_speaker_model_and_config(cfg) + model_cfg = cls._modify_config( + base_model_cfg, cfg, audio_model_cfg, add_cfg_to_tree=False, speaker_cfg=speaker_cfg + ) + + # load llm + model = cls.restore_from( + restore_path=cfg.model.restore_from_path, + trainer=trainer, + override_config_path=model_cfg, + strict=False, + map_location="cpu", + ) + + if "peft" in cfg.model: + peft_cfg_cls = PEFT_CONFIG_MAP[cfg.model.peft.peft_scheme] + if cfg.model.peft.restore_from_path is not None: + # initialize peft weights from a checkpoint instead of randomly + # This is not the same as resume training because optimizer states are not restored. 
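The weight injection in load_pretrained_audio_weights above relies on torch's load_state_dict with strict=True, which fails loudly if the perception encoder and the pretrained encoder disagree on any parameter name or shape. A minimal sketch with toy modules; the Linear layers are illustrative stand-ins for the encoders:

import torch
from torch import nn

# toy stand-ins for audio_model.encoder (pretrained) and model.perception.encoder (target)
pretrained_encoder = nn.Linear(80, 512)
perception_encoder = nn.Linear(80, 512)

# strict=True guarantees an exact parameter-name/shape match, so partial or silent loads cannot happen
result = perception_encoder.load_state_dict(pretrained_encoder.state_dict(), strict=True)
print(result)  # <All keys matched successfully>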
+ logging.info("PEFT Weights will be loaded from", cfg.model.peft.restore_from_path) + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(model_cfg), map_location="cpu") + elif peft_cfg_cls is not None: + logging.info("Adding adapter weights to the model for PEFT") + model.add_adapter(peft_cfg_cls(model_cfg)) + else: + raise ValueError(f"PEFT scheme not not found in PEFT_CONFIG_MAP: {cfg.model.peft.peft_scheme}") + else: + logging.info(f"Running full finetuning since no peft scheme is given.\n{model.summarize()}") + + # load audio model weights + model = cls.load_pretrained_audio_weights(cfg, model, audio_model, speaker_model) + + if 'inference' in cfg: + inference_cfg = OmegaConf.to_container(cfg.inference, resolve=True) + model.set_inference_config(inference_cfg) + return model + + @classmethod + def load_audio_encoder_for_inference(cls, cfg: DictConfig, model_cfg: DictConfig, model: ModelPT) -> ModelPT: + """ + Maybe load audio encoders for inference, if they were not tunable during training. + Args: + cfg: inference config + model_cfg: model config + model: model object + Returns: + model: model object with audio encoder weights loaded + """ + if model_cfg.freeze_audio_encoder and model_cfg.get("pretrained_audio_model", None) is not None: + with open_dict(cfg): + cfg.model.perception = model_cfg.perception + + audio_model, _ = cls.get_audio_encoder_models_and_configs(cfg) + speaker_model, _ = cls.get_speaker_model_and_config(cfg) + model = cls.load_pretrained_audio_weights(cfg, model, audio_model, speaker_model) + return model + + @classmethod + def merge_inference_cfg( + cls, cfg: DictConfig, trainer: Trainer, pretrained_model_cfg: DictConfig = None + ) -> DictConfig: + """ + Merge the inference config with the model config, used for inference only. + if no pretrained_model_cfg is given, it will be loaded from the checkpoint specified in cfg. + Args: + cfg: inference config + trainer: trainer object + pretrained_model_cfg: a pre-loaded SpeechLLM model config + Returns: + model_cfg: merged model config + """ + if pretrained_model_cfg: + model_cfg = pretrained_model_cfg + elif cfg.model.peft.restore_from_path: + if cfg.model.peft.restore_from_path.endswith(".nemo"): + model_cfg = ModularAudioGPTModel.restore_from( + restore_path=cfg.model.peft.restore_from_path, + trainer=trainer, + return_config=True, + ) + elif cfg.model.peft.restore_from_hparams_path: # not a .nemo model we expect a hparams.yaml file + model_cfg = OmegaConf.to_container(OmegaConf.load(cfg.model.peft.restore_from_hparams_path).cfg) + model_cfg = OmegaConf.create(model_cfg) + # extract dict inside cfg key and convert it to DictConfig + # this allows interpolation to work the same way as config from the .restore_from method + else: + raise RuntimeError( + "This script requires a .nemo peft model or path to hparams.yaml (and a ckpt path)." 
+ ) + else: + model_cfg = MegatronGPTSFTModel.restore_from( + restore_path=cfg.model.restore_from_path, + trainer=trainer, + return_config=True, + ) + + if hasattr(model_cfg, 'peft') and model_cfg.peft.peft_scheme not in [None, 'none']: + # before PEFT migrates to distributed ckpt, eval must use same TP/PP as training + for p in ['tensor_model_parallel_size', 'pipeline_model_parallel_size']: + assert model_cfg.get(p) == cfg.model.get( + p + ), f"PEFT evaluation {p} ({cfg.model.get(p)}) must equal training {p} ({model_cfg.get(p)})" + + with open_dict(model_cfg): + # to be compatible with old checkpoints + if "context_key" not in model_cfg.data.train_ds or "answer_key" not in model_cfg.data.train_ds: + model_cfg.data.train_ds.context_key = "question" + model_cfg.data.train_ds.answer_key = "answer" + + # update the model config of the trained model with params we want to set at inference time. + model_cfg.precision = cfg.trainer.precision + for key, val in cfg.model.items(): + if key != 'data' and key != 'peft': + model_cfg[key] = val + model_cfg.data.test_ds = cfg.model.data.test_ds + + with open_dict(cfg): + if model_cfg.data.test_ds is not None: + cfg.inference.add_BOS = model_cfg.data.test_ds.get("add_BOS", False) + cfg.inference.tokens_to_generate = model_cfg.data.test_ds.get("tokens_to_generate", 1) + + model_cfg.megatron_amp_O2 = False # always evaluate with O1 + return model_cfg + + @classmethod + def load_adapters_for_inference(cls, cfg: DictConfig, model_cfg: DictConfig, model: ModelPT) -> ModelPT: + if cfg.model.peft.restore_from_path: + if '\\' in cfg.model.peft.restore_from_path: + cfg.model.peft.restore_from_path = cfg.model.peft.restore_from_path.replace('\\', '') + if "peft" in model_cfg: + peft_cfg_cls = PEFT_CONFIG_MAP[model_cfg.peft.peft_scheme] + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(model_cfg), map_location="cpu") + else: + model.load_state_dict(torch.load(cfg.model.peft.restore_from_path), strict=False) + elif cfg.model.peft.restore_from_ckpt.checkpoint_dir and cfg.model.peft.restore_from_ckpt.checkpoint_name: + checkpoint_path = os.path.join( + cfg.model.peft.restore_from_ckpt.checkpoint_dir, cfg.model.peft.restore_from_ckpt.checkpoint_name + ) + # checkpoint_path is a dir in case of distributed checkpointing + if not os.path.isdir(checkpoint_path): + # legacy checkpoint needs model parallel rank injection + checkpoint_path = inject_model_parallel_rank( + os.path.join( + cfg.model.peft.restore_from_ckpt.checkpoint_dir, + cfg.model.peft.restore_from_ckpt.checkpoint_name, + ) + ) + if "peft" in model_cfg: + peft_cfg_cls = PEFT_CONFIG_MAP[cfg.model.peft.peft_scheme] + model.load_adapters(checkpoint_path, peft_cfgs=peft_cfg_cls(model_cfg), map_location="cpu") + else: + model.load_state_dict(torch.load(checkpoint_path), strict=False) + else: + raise NotImplementedError("distributed checkpointing of PEFT weights is not supported") + elif model_cfg.peft.get("peft_scheme", None): + # special case for loading a complete speechllm checkpoint in nemo format + peft_cfg_cls = PEFT_CONFIG_MAP[model_cfg.peft.peft_scheme] + model.load_adapters(cfg.model.restore_from_path, peft_cfg_cls(model_cfg), map_location="cpu") + return model + + def _build_vocab(self): + """ + Manipulate vocabulary (e.g., pad vocabulary for increased performance)/ + """ + if self._cfg.get('override_vocab_size', None) is not None: + self.padded_vocab_size = self._cfg.override_vocab_size + else: + self.padded_vocab_size = self._vocab_size_with_padding( + 
orig_vocab_size=self.tokenizer.vocab_size, + make_vocab_size_divisible_by=self._cfg.get('make_vocab_size_divisible_by', 128), + tensor_model_parallel_size=self._cfg.get('tensor_model_parallel_size', 1), + ) + + def state_dict(self, destination=None, prefix=None, keep_vars=False): + """ + Overwrite the state_dict method to include only the trainable parameters. + """ + if self.setup_complete and self.trainer.state.fn == "fit": + # Once setup is complete we only need adapter and perception model. + if self.cfg.freeze_llm and self.cfg.get("peft", None) is not None: + return_state_dict = self.get_peft_state_dict() + elif not self.cfg.freeze_llm: + return_state_dict = self.model.state_dict(prefix="model.") + else: + return_state_dict = {} + + state_dict = self.perception.state_dict(prefix="perception.") + if self.cfg.freeze_audio_encoder: + state_dict = {k: v for k, v in state_dict.items() if not k.startswith("perception.encoder.")} + + return_state_dict.update(state_dict) + state_dict = self.perception.state_dict(prefix="perception.") + return_state_dict.update(state_dict) + return return_state_dict + elif self.setup_complete and self.trainer.state.fn != "fit": + # used to save the whole model as a nemo file + return_state_dict = self.model.state_dict(prefix="model.") + state_dict = self.perception.state_dict(prefix="perception.") + return_state_dict.update(state_dict) + return return_state_dict + else: + # we want all the params with the same keys as calling self.state_dict() + # but we can't call self.state_dict() here as it would be a recursive call. + # so we call self.model.state_dict(prefix="model.") which will return all the keys and params same as calling self.state_dict() + if not self.cfg.freeze_llm: + return_state_dict = self.model.state_dict(prefix="model.") + else: + return_state_dict = {} + state_dict = self.perception.state_dict(prefix="perception.") + if self.cfg.freeze_audio_encoder: + state_dict = {k: v for k, v in state_dict.items() if not k.startswith("perception.encoder.")} + return_state_dict.update(state_dict) + return return_state_dict + + def load_state_dict(self, state_dict, strict: bool = True): + if not self.setup_complete: + if self.cfg.get('override_vocab_size', False): + exclude_list = [ + "model.language_model.embedding.word_embeddings.weight", + "model.language_model.output_layer.weight", + ] + else: + exclude_list = [] + state_dict = {k: v for k, v in state_dict.items() if k not in exclude_list} + else: + strict = False + + if len(state_dict) == 0: + return # checkpoint is loaded in on_load_checkpoint() + if self.use_peft and self.setup_complete: + # at this stage only adapter params will appear in the state_dict arg + # so we only update those while the rest of the model is frozen. + # setting strict=False will ignore the missing keys (which are not being updated anyway) + # explicitly check if state_dict.keys matches all the expected self.adapter_keys since we don't have the + # safety in strict=True anymore. 
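The state_dict/load_state_dict overrides above boil down to prefix-based filtering of parameter names, so that only the trainable pieces (adapters and the perception module) are checkpointed while frozen parts are skipped. A minimal sketch with a toy module; the class and attribute names are illustrative:

from torch import nn

# toy stand-in for the LLM + perception stack
class ToySpeechLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(8, 8)        # frozen LLM
        self.perception = nn.Linear(8, 8)   # trainable audio front-end

toy = ToySpeechLLM()
full_sd = toy.state_dict()
# keep only perception.* keys, mirroring the frozen-LLM branch of state_dict() above
trainable_sd = {k: v for k, v in full_sd.items() if k.startswith("perception.")}
print(sorted(trainable_sd))  # ['perception.bias', 'perception.weight']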
+ if not self.ptuning_only_and_non_first_stage: + if set(state_dict.keys()) != self.adapter_keys.union(self.tunable_base_param_keys): + logging.warning( + f"Unexpected keys found in state_dict: {set(state_dict.keys()) - self.adapter_keys.union(self.tunable_base_param_keys)}, missing keys in state_dict: {self.adapter_keys.union(self.tunable_base_param_keys) - set(state_dict.keys())}" + ) + super(MegatronGPTModel, self).load_state_dict(state_dict, strict=False) + else: + super(MegatronGPTModel, self).load_state_dict(state_dict, strict=strict) + + def on_load_checkpoint(self, checkpoint) -> None: + """LightningModule hook: + https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#on-load-checkpoint + """ + checkpoint_state_dict = checkpoint['state_dict'] + self.load_state_dict(checkpoint_state_dict, strict=False) + + def setup_metric(self, data_cfg): + metric_name = "exact_string_match" + if not hasattr(data_cfg, "metric"): + metric = MetricStringToTorchMetric["exact_string_match"] + else: + if not hasattr(data_cfg.metric, "name"): + raise ValueError("Metric name is not provided in the metric config.") + if data_cfg.metric.name == "loss": + return None, "loss" + if data_cfg.metric.name not in MetricStringToTorchMetric: + raise KeyError( + f"{data_cfg.metric.name} is not supported. List of supported metrics: {MetricStringToTorchMetric.keys()}" + ) + if data_cfg.metric.name in self._metrics_require_string2category_map: + if data_cfg.metric.average is None: + raise ValueError( + f"{data_cfg.metric.name} requires specifying whether you want to compute a micro or macro average. Found None." + ) + if ( + data_cfg.metric.get('labels_are_strings', False) + and data_cfg.metric.name in self._metrics_require_string2category_map + ): + if data_cfg.metric.num_classes is None: + raise ValueError( + "Number of classes is not provided in the metric section within the data config. " + f"Please provide the number of classes in the data config to use the {data_cfg.metric.name} metric." + ) + if data_cfg.metric.get('class_labels', None) is None or not isinstance( + data_cfg.metric.get('class_labels', None), ListConfig + ): + raise ValueError( + "Class labels are not provided properly in the metric section witnin the data config. " + f"Please provide the class labels as a list of strings in the data config to use the {data_cfg.metric.name} metric." + ) + if len(data_cfg.metric.get('class_labels', None)) != data_cfg.metric.num_classes: + raise ValueError( + f"Number of class labels {len(data_cfg.metric.get('class_labels', None))} does not match `num_classes` : {data_cfg.metric.num_classes}" + ) + + metric_name = data_cfg.metric.name + metric_cls = MetricStringToTorchMetric[metric_name] + if metric_name not in TextMetricsSet: + metric = [metric_cls(**data_cfg.metric)] + else: + metric = [metric_cls()] + return metric, metric_name + + def inference_step(self, dataloader_iter, mode): + """ + Used for validation and test steps, added postprocessing after calling self.predict_step(). 
+ """ + batch, batch_idx, dataloader_idx = next(dataloader_iter) + data_cfg = self.cfg.data.validation_ds if mode == 'validation' else self.cfg.data.test_ds + self._reconfigure_and_process_inference_batch(batch, data_cfg) + # Meta data from dataset + metadata = batch.get('metadata', [{}] * len(batch['tokens'])) + loss = super(MegatronGPTSFTModel, self).validation_step(itertools.chain([batch]), dataloader_idx) + + # We need _inference_config to get generation params + # add_BOS and tokens_to_generate are set in dataset + if self.get_inference_config() is None: + logging.warning(f'inference_config is not set. Use default: {default_inference_config}') + self.set_inference_config(inference_config=default_inference_config) + self._inference_config['add_BOS'] = data_cfg.add_bos + self._inference_config['tokens_to_generate'] = data_cfg.get('tokens_to_generate') + + output = self.predict_step(batch, batch_idx, dataloader_idx) + + inputs_text = [self.tokenizer.ids_to_text(c.tolist()) for c in batch['contexts']] + labels_text = [self.tokenizer.ids_to_text(a.tolist()) for a in batch['answers']] + preds_text = [ + self.tokenizer.ids_to_text(t[l.item() :][: data_cfg.get('tokens_to_generate')]) + for t, l in zip(output['token_ids'], batch['context_lengths']) + ] + + if data_cfg.get("end_string", None): + # sometimes data_cfg.end_string != self.tokenizer.ids_to_text(self.tokenizer.text_to_ids(data_cfg.end_string)) + # for example when data_cfg.end_string = "", the end_string_re will start with " ?? " + end_string_re = self.tokenizer.ids_to_text(self.tokenizer.text_to_ids(data_cfg.end_string)) + preds_text_cleaned = [] + labels_text_cleaned = [] + for p, l in zip(preds_text, labels_text): + # remove end_string from the end of the string + for es in [end_string_re, data_cfg.end_string]: + if p.endswith(es): + p = p[: -len(es)].strip() + if l.endswith(es): + l = l[: -len(es)].strip() + preds_text_cleaned.append(p) + labels_text_cleaned.append(l) + preds_text = preds_text_cleaned + labels_text = labels_text_cleaned + + if data_cfg.get("remove_text_pc", False): + preds_text = [remove_punctuations(p.lower(), data_cfg.get("punctuations", None)) for p in preds_text] + labels_text = [remove_punctuations(l.lower(), data_cfg.get("punctuations", None)) for l in labels_text] + + if data_cfg.get("log_every_n_steps", None) is not None: + if batch_idx % data_cfg.log_every_n_steps == 0: + logging.info(f"Input: `{inputs_text[0]}`") + logging.info(f"Label: `{labels_text[0]}`") + logging.info(f"Pred: `{preds_text[0]}`") + + # if loss is nan, print the input, label and pred + if loss.isnan(): + logging.info("++++++++++++++ NaN loss detected ++++++++++++++") + for i in range(len(inputs_text)): + logging.info(f"Input: `{inputs_text[i]}`") + logging.info(f"Label: `{labels_text[i]}`") + logging.info(f"Pred: `{preds_text[i]}`") + logging.info("++++++++++++++++++++++++++++++++++++++++++++++++") + + outputs = { + 'loss': loss, + 'preds': preds_text, # [str] + 'labels': labels_text, # [str] + 'inputs': inputs_text, # [str] + 'metadata': metadata, # [dict] + } + + if mode == 'validation': + if len(self._validation_dl) > 1: + # super().validation_step appends just loss to self.validation_step_outputs, replace the last appended loss with the outputs dict + self.validation_step_outputs[dataloader_idx][-1] = outputs + else: + # super().validation_step appends just loss to self.validation_step_outputs, replace the last appended loss with the outputs dict + self.validation_step_outputs[-1] = outputs + else: + if len(self._test_dl) > 1: + 
self.test_step_outputs[dataloader_idx][-1] = outputs + else: + self.test_step_outputs[-1] = outputs + return outputs + + def predict_step(self, batch: dict, batch_idx: int, dataloader_idx: Optional[int] = None): + """ + Used to get LLM predictions for validation and test steps based on the given inference config. + """ + inference_config = self.get_inference_config() + if inference_config is not None: + # need to overwrite some configuration, make it immutable + inference_config = inference_config.copy() + else: + self.set_inference_config(inference_config=default_inference_config) + logging.warning(f'inference_config is not set. Use default: {default_inference_config}') + inference_config = self.get_inference_config() + + if self.cfg.data.get('end_string', None): + inference_config['end_strings'] = [self.cfg.data.end_string] + + global_batch_size_per_gpu = batch['tokens'].size(0) + num_micro_batches_before_decode = get_num_microbatches() + + compute_logprob = inference_config.get('compute_logprob', False) + if compute_logprob: + inference_config['inputs'] = batch + inference_config['tokens_to_generate'] = 1 + inference_config['all_probs'] = True + inference_config["add_BOS"] = False + inference_config['greedy'] = True + response = generate(self, **inference_config) + response = get_computeprob_response(self.tokenizer, response, batch) + else: + # for megatron_gpt_eval.py + if isinstance(batch, list): + inference_config['inputs'] = batch + elif 'num_audios' in batch: + # peft_eval.py + inference_config['inputs'] = ( + batch['contexts'].cuda(), + batch['context_lengths'].cuda(), + batch['audio_signal'].cuda(), + batch['audio_signal_length'].cuda(), + batch['num_audios'].cuda(), + batch['context_start_idx'], + ) + else: + # peft_eval.py + inference_config['inputs'] = ( + batch['contexts'].cuda(), + batch['context_lengths'].cuda(), + batch['audio_signal'].cuda(), + batch['audio_signal_length'].cuda(), + ) + response = generate(self, **inference_config) + + app_state = AppState() + _reconfigure_microbatch_calculator( + rank=app_state.global_rank, + rampup_batch_size=None, + global_batch_size=global_batch_size_per_gpu * parallel_state.get_data_parallel_world_size(), + micro_batch_size=global_batch_size_per_gpu // num_micro_batches_before_decode, + data_parallel_size=parallel_state.get_data_parallel_world_size(), + ) + + # add audio offsets to context lengths for properly decoding only the response + batch['context_lengths'] = batch['context_lengths'].cuda() + response['audio_feat_lens'] + + return response + + def inference_epoch_end(self, outputs, mode, data_cfg): + # Parent class will handle logging of the loss. + if not outputs or (all([not x for x in outputs])): + return None + + if isinstance(outputs[0], dict): + outputs = [outputs] + + averaged_loss = [] + averaged_metric = [] + # Log metrics for each provided validation/test dataset. 
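The post-processing above decodes only the generated answer by slicing the returned token ids at the context length, after adding the audio feature length as an offset. A tiny illustration with made-up token ids:

# made-up token ids for illustration; a real run gets these from generate()
token_ids = [11, 12, 13, 14, 15, 16, 17, 18]
text_context_len = 3      # prompt tokens in the context
audio_feat_len = 2        # positions occupied by the injected audio embeddings
tokens_to_generate = 3

# context_lengths += audio_feat_lens above, so the slice skips both prompt and audio positions
response_ids = token_ids[text_context_len + audio_feat_len:][:tokens_to_generate]
print(response_ids)  # [16, 17, 18]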
+ for dataloader_idx, output in enumerate(outputs): + if len(output) == 0: + logging.warning(f"Empty output for dataloader_idx: {dataloader_idx}") + continue + # Expand on_validation_epoch_end from parent class MegatronGPTModel as on_validation_epoch_end doesnt take outputs arg + loss_vals = [x['loss'] for x in output] + if parallel_state.is_pipeline_last_stage(): + # only the last pipeline parallel stages return loss with their batch size + if self.cfg.data.get('validation_drop_last', True): + loss = torch.stack(loss_vals).mean() + else: + # Compute the avg loss by total_loss across all samples / total number of samples + total_loss_and_total_samples = torch.vstack(loss_vals).sum(axis=0) + avg_loss = total_loss_and_total_samples[0] / total_loss_and_total_samples[1] + loss = avg_loss.type(torch.float32).cuda() + else: + loss = torch.tensor(0.0, dtype=torch.float32).cuda() + + # we can only log on one rank if it is rank zero so we broadcast from last rank + torch.distributed.broadcast(loss, get_last_rank()) + + self.log('val_loss', loss, prog_bar=True, rank_zero_only=True, batch_size=1, sync_dist=True) + + # Determine the key used to log the loss based on the user provided name of the dataset or the dataloader index. + loss_log_key = self._determine_log_key(data_cfg, dataloader_idx, "loss", mode) + self.log(loss_log_key, loss, batch_size=1) + averaged_loss.append(loss) + + # Gather the outputs object from all data parallel ranks since we are using the DistributedSampler which splits data across DDP ranks. + gathered_outputs = [None for _ in range(parallel_state.get_data_parallel_world_size())] + torch.distributed.all_gather_object( + gathered_outputs, + [ + {'preds': x['preds'], 'labels': x['labels'], 'inputs': x['inputs'], 'metadata': x['metadata']} + for x in output + ], + group=parallel_state.get_data_parallel_group(), + ) + + # Remove duplicate examples due to distributed sampler. + inp_label_set = set() + deduplicated_outputs = { + 'preds': [], + 'labels': [], + 'inputs': [], + 'metadata': [], + } + total_size = 0 + for rank in range(0, parallel_state.get_data_parallel_world_size()): + for batch in gathered_outputs[rank]: + for pred, label, input, metadata in zip( + batch['preds'], batch['labels'], batch['inputs'], batch['metadata'] + ): + key = input + label + str(metadata) + total_size += 1 + if key not in inp_label_set: + inp_label_set.add(key) + deduplicated_outputs['preds'].append(pred) + deduplicated_outputs['labels'].append(label) + deduplicated_outputs['inputs'].append(input) + deduplicated_outputs['metadata'].append(metadata) + + # Compute metric score + metric_name = self.val_metric_name if mode == 'validation' else self.test_metric_name + metric_label_key = self.val_metric_label_key if mode == 'validation' else self.test_metric_label_key + if metric_name != 'loss': + metric_log_key = self._determine_log_key(data_cfg, dataloader_idx, metric_name, mode) + metric_fn = self.val_metric[0] if mode == 'validation' else self.test_metric[0] + if metric_label_key in deduplicated_outputs['metadata'][0]: + labels = [m[metric_label_key] for m in deduplicated_outputs['metadata']] + else: + labels = deduplicated_outputs['labels'] + + # sacrebleu.corpus_bleu is commonly used which does not share + # the same interface as other metrics. We handle it separately. 
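The special-casing of BLEU above comes from the interface mismatch noted in the comment: sacrebleu scores the whole corpus in one call, while torchmetrics-style metrics are updated per example and then reduced with compute(). A minimal sketch, assuming sacrebleu and torchmetrics are installed:

import sacrebleu
from torchmetrics.text import WordErrorRate

preds = ["the cat sat on the mat"]
refs = ["the cat sat on a mat"]

# corpus-level metric: one call over all hypotheses (note the list-of-reference-lists argument)
bleu_score = sacrebleu.corpus_bleu(preds, [refs]).score

# torchmetrics-style metric: update() per example, compute() once, then reset() for the next dataloader
wer_metric = WordErrorRate()
for p, r in zip(preds, refs):
    wer_metric.update(p, r)
wer_score = wer_metric.compute()
wer_metric.reset()
print(bleu_score, wer_score)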
+ if metric_name == 'bleu': + metric_result = torch.Tensor( + [sacrebleu.corpus_bleu(deduplicated_outputs['preds'], [labels]).score] + ).to(self.device) + else: + for pred, label in zip(deduplicated_outputs['preds'], labels): + _ = metric_fn(pred, label) + + metric_result = metric_fn.compute() + + if metric_name == 'rouge': + for k, v in metric_result.items(): + if 'fmeasure' in k: + self.log(metric_log_key + f'_{k}', v.item(), sync_dist=True, batch_size=1) + logging.info(f"{mode} {metric_name} {k}: {v.item()}") + metric_result = metric_result['rouge1_fmeasure'] + else: + self.log(metric_log_key, metric_result.item(), sync_dist=True, batch_size=1) + logging.info(f"{mode} {metric_name}: {metric_result.item()}") + + metric_fn.reset() + averaged_metric.append(metric_result) + + # Write predictions to file + if self.global_rank == 0 and data_cfg.get("write_predictions_to_file", False): + logging.info( + f"Total deduplicated inference data size: {total_size} to {len(deduplicated_outputs['inputs'])}" + ) + + # Check if the user provided a prefix path to the file(s) they want to write. + if not hasattr(data_cfg, "output_file_path_prefix") or data_cfg.output_file_path_prefix is None: + raise ValueError( + f"Cannot write predictions to file when output_file_path_prefix is not set or present in the yaml config file." + ) + filename_log_key = self._determine_log_key(data_cfg, dataloader_idx, None, mode) + output_dir = data_cfg.get("output_dir", "./") + self.write_predictions_to_file( + deduplicated_outputs, f"{data_cfg.output_file_path_prefix}_{filename_log_key}", output_dir + ) + + torch.distributed.barrier(group=parallel_state.get_data_parallel_group()) + outputs[dataloader_idx].clear() # free memory + + # Logging of the averaged metrics: + averaged_loss = sum(averaged_loss) / len(averaged_loss) + averaged_metric = sum(averaged_metric) / len(averaged_metric) if len(averaged_metric) > 0 else None + averaged_loss = averaged_loss.to(self.device) + if averaged_metric is not None: + averaged_metric = averaged_metric.to(self.device) + + # Handle case where metrics can be nan or inf. This can break checkpoint save/load. + if averaged_metric is not None and (torch.isinf(averaged_metric) or torch.isnan(averaged_metric)): + app_state = AppState() + monitor_mode = app_state.checkpoint_callback_params.mode + assert monitor_mode in ['min', 'max'] + averaged_metric = 0.0 if monitor_mode == 'max' else 1e5 + + if mode == 'validation': + self.log("validation_loss", averaged_loss, batch_size=1, sync_dist=True) + if averaged_metric is not None: + self.log(f"validation_{self.val_metric_name}", averaged_metric, sync_dist=True, batch_size=1) + elif mode == 'test': + self.log("test_loss", averaged_loss, batch_size=1, sync_dist=True) + if averaged_metric is not None: + self.log(f"test_{self.test_metric_name}", averaged_metric, sync_dist=True, batch_size=1) + + # Merge the functionality of previous on_inference_epoch_end() within inference_epoch_end() func here + app_state = AppState() + self._restore_activation_checkpointing_args() + if hasattr(self, "_train_ds"): + _reconfigure_microbatch_calculator( + rank=app_state.global_rank, + rampup_batch_size=None, + global_batch_size=self.cfg.data.train_ds.global_batch_size, + micro_batch_size=self.cfg.data.train_ds.micro_batch_size, + data_parallel_size=parallel_state.get_data_parallel_world_size(), + ) + # When running `trainer.validate()`, the training dataset is not available. 
+ else: + logging.warning('No training data found, reconfiguring microbatches based on validation batch sizes.') + _reconfigure_microbatch_calculator( + rank=app_state.global_rank, + rampup_batch_size=None, + global_batch_size=data_cfg.global_batch_size, + micro_batch_size=data_cfg.micro_batch_size, + data_parallel_size=parallel_state.get_data_parallel_world_size(), + ) + + return averaged_loss, averaged_metric + + # consistent with speech models + @rank_zero_only + def write_predictions_to_file(self, outputs, output_file_path_prefix, output_dir): + os.makedirs(output_dir, exist_ok=True) + output_file_path = output_file_path_prefix + "_inputs_preds_labels.jsonl" + output_file_path = os.path.join(output_dir, output_file_path) + with open(output_file_path, "w") as f_json: + assert ( + len(outputs['inputs']) == len(outputs['preds']) == len(outputs['labels']) == len(outputs['metadata']) + ) + for i, p, l, m in zip(outputs['inputs'], outputs['preds'], outputs['labels'], outputs['metadata']): + json_string = {'input': i, 'pred_text': p, 'text': l} + for k, v in m.items(): + if k not in json_string: + json_string[k] = v + f_json.write(json.dumps(json_string) + '\n') + + logging.info(f'Predictions saved to {output_file_path}') + + def setup_eval_dataloader(self, datasets, data_cfg): + dataloaders = [] + if not isinstance(datasets, list): + return self.build_data_loader(dataset=datasets, data_cfg=data_cfg, consumed_samples=0) + for dataset in datasets: + eval_dl = self.build_data_loader(dataset=dataset, data_cfg=data_cfg, consumed_samples=0) + dataloaders.append(eval_dl) + return dataloaders + + def setup_predict_dataloader(self, data_cfg): + datasets = self._build_dataset(data_cfg, False) + dataloaders = [] + if not isinstance(datasets, list): + return self.build_data_loader(dataset=datasets, data_cfg=data_cfg, consumed_samples=0, is_predict=True) + for dataset in datasets: + eval_dl = self.build_data_loader(dataset=dataset, data_cfg=data_cfg, consumed_samples=0, is_predict=True) + dataloaders.append(eval_dl) + return dataloaders + + def sharded_state_dict(self, prefix: str = ''): + """ + Force None for the parent class's sharded_state_dict() method if setup is complete. + """ + if self.setup_complete: + return None + else: + return super().sharded_state_dict(prefix=prefix) + + def maybe_build_test(self): + # overwrite the parent class's maybe_build_test() method in MegatronGPTModel + if hasattr(self.cfg.data, 'test_ds'): + logging.info('Building test datasets...') + # Wrap this in a list since the general finetuning parent class supports multi-validation. + self._test_ds = self._build_dataset(self.cfg.data.test_ds, is_train=False) + lengths = [len(x) for x in self._test_ds] + logging.info(f'Length of test datasets: {lengths}, total: {sum(lengths)}') + return + + def maybe_setup_test(self): + # overwrite the parent class's maybe_build_test() method in MegatronGPTModel + if hasattr(self.cfg.data, 'test_ds'): + self._test_dl = self.setup_eval_dataloader(self._test_ds, self.cfg.data.test_ds) + return + + def build_train_valid_test_datasets(self, stage): + if stage != 'test': + logging.info('Building validation datasets.') + # Wrap this in a list since the general finetuning parent class supports multi-validation. 
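write_predictions_to_file above emits one JSON object per example into a JSONL manifest with input, pred_text, and text fields (plus any metadata keys). A minimal sketch with made-up rows and an illustrative file name:

import json

# made-up rows shaped like the deduplicated outputs written above
rows = [
    {"input": "transcribe the audio", "pred_text": "hello world", "text": "hello world"},
    {"input": "transcribe the audio", "pred_text": "good morning", "text": "good morning everyone"},
]
with open("sample_inputs_preds_labels.jsonl", "w") as f_json:
    for row in rows:
        f_json.write(json.dumps(row) + "\n")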
+ self._validation_ds = self._build_dataset(self.cfg.data.validation_ds, is_train=False) + lengths = [len(x) for x in self._validation_ds] + logging.info(f'Length of validation datasets: {lengths}, total: {sum(lengths)}') + + if stage != 'validate': + self.maybe_build_test() + + if stage == 'validate' or stage == 'test': + return + logging.info('Building training datasets.') + self._train_ds = self._build_dataset(self.cfg.data.train_ds) + logging.info(f'Length training datasets: {len(self._train_ds)}') + + @classmethod + def list_available_models(cls) -> Optional[PretrainedModelInfo]: + """ + This method returns a list of pre-trained model which can be instantiated directly from NVIDIA's NGC cloud. + + Returns: + List of available pre-trained models. + """ + results = [] + + model = PretrainedModelInfo( + pretrained_model_name="speechllm_fc_llama2_7b", + description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia/nemo/speechllm_fc_llama2_7b", + location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/speechllm_fc_llama2_7b/versions/1.23.1/files/speechllm_fc_llama2_7b.nemo", + ) + results.append(model) + return results diff --git a/nemo/collections/multimodal/speech_llm/modules/__init__.py b/nemo/collections/multimodal/speech_llm/modules/__init__.py new file mode 100644 index 000000000000..d9562652ce84 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/modules/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from nemo.collections.multimodal.speech_llm.modules.modality_adapters import PoolingMLPConnectors +from nemo.collections.multimodal.speech_llm.modules.perception_modules import ( + AudioPerceptionModule, + MultiAudioPerceptionModule, + MultiFeatureAggregator, +) diff --git a/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_strategy.py b/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_strategy.py new file mode 100644 index 000000000000..0cd48502bb84 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_strategy.py @@ -0,0 +1,175 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
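The new audio_text_generation_strategy module that follows plugs into NeMo's text generation by dispatching on the model type: audio-aware models get AudioToTextGenerationStrategy, everything else falls through to the stock text-only dispatcher. A minimal sketch of that dispatch pattern, with purely illustrative class names:

class TextStrategy:
    # stand-in for the default GPT text generation strategy
    def __init__(self, model):
        self.model = model

class AudioTextStrategy(TextStrategy):
    # stand-in for AudioToTextGenerationStrategy
    pass

class AudioLM:
    # stand-in for ModularAudioGPTModel
    pass

def dispatch_strategy(model):
    # audio-aware models get the specialized strategy; others keep the default one
    return AudioTextStrategy(model) if isinstance(model, AudioLM) else TextStrategy(model)

print(type(dispatch_strategy(AudioLM())).__name__)   # AudioTextStrategy
print(type(dispatch_strategy(object())).__name__)    # TextStrategy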
+ +from typing import List, Optional, Tuple + +import torch + +import nemo.collections.nlp.modules.common.text_generation_strategy as text_generation_strategy +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import shift_tokens_by_multi_audios + + +# the text representation of eos_id, it applies for all tokenizers +END_OF_SEQ = '<|endoftext|>' + + +def switch(val1, val2, boolean): + boolean = boolean.type_as(val1) + boolean = boolean.unsqueeze(0).unsqueeze(-1) + return (1 - boolean) * val1 + boolean * val2 + + +class AudioToTextGenerationStrategy(text_generation_strategy.GPTModelTextGenerationStrategy): + def init_batch( + self, + context_tokens: torch.Tensor, + context_lengths: torch.Tensor, + audio_signal: torch.Tensor, + audio_length: torch.Tensor, + compute_attention_mask: bool, + num_audios: Optional[torch.Tensor] = None, + context_start_idx: Optional[List[List[int]]] = None, + ): + """initialize the batch data before the inference steps.""" + # Move to GPU. + + audio_feats, audio_feat_lens = self.model.perception( + input_signal=audio_signal, + input_signal_length=audio_length, + processed_signal=None, + processed_signal_length=None, + ) + + if num_audios is not None: + # handle multiple audio files per sample + audio_feats = audio_feats.split(num_audios.tolist()) + audio_feat_lens = audio_feat_lens.split(num_audios.tolist()) + + encoder_input, attention_mask, _, position_ids, encoder_max_length = self.model.inject_perception_input( + audio_feats, audio_feat_lens, context_tokens, context_lengths, context_start_idx + ) + + self.attention_mask = attention_mask + self.position_ids = position_ids + + if num_audios is not None: + # handle multiple audio files per sample + new_context_tokens = shift_tokens_by_multi_audios( + context_tokens, context_lengths, audio_feat_lens, context_start_idx, encoder_max_length + ) + audio_feat_lens = torch.stack([torch.sum(lens) for lens in audio_feat_lens]) # [batch,] + else: + new_context_tokens = self.model._shift_labels_by_emb_len( + context_tokens, context_lengths, audio_feat_lens, encoder_max_length, pad_token=0 + ) + + return new_context_tokens, encoder_input, audio_feat_lens + + def clip_max_len(self, maxlen: int) -> int: + """clip the max len based on the LM model max sequence length""" + # for positional embedding types that allow length extrapolation, don't clip the max length + if self.model.cfg.get("position_embedding_type", "learned_absolute") == "learned_absolute": + if maxlen > self.model.cfg.encoder_seq_length + 1: + maxlen = self.model.cfg.encoder_seq_length + 1 + return maxlen + + def prepare_batch_at_step( + self, + tokens: torch.Tensor, + input_embeddings: torch.Tensor, + maxlen: int, + micro_batch_size: int, + step: int, + context_lengths: torch.Tensor, + curr_context_length: int, + compute_attention_mask: bool, + ) -> Tuple[List[torch.Tensor], List[int]]: + # types2use = None + if step == 0: + # Allocate memory for the entire context. + set_inference_key_value_memory = True + tokens2use = tokens[:, :curr_context_length] + positions2use = self.position_ids[:, :curr_context_length] + embeddings2use = input_embeddings[:curr_context_length] + else: + # Set this to false so the memory is not reallocated. 
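The per-step batch preparation above uses switch() to blend, per batch element, the embeddings precomputed with the prompt against the embedding of the token generated at the previous step, keyed on whether each sequence has consumed its whole context yet. A small usage sketch of the same blend, with illustrative shapes:

import torch

def switch(val1, val2, boolean):
    # same blend as in the strategy above: pick val2 where `boolean` is True, else val1
    boolean = boolean.type_as(val1).unsqueeze(0).unsqueeze(-1)
    return (1 - boolean) * val1 + boolean * val2

# illustrative shapes: [seq=1, batch=2, hidden=3]
precomputed = torch.zeros(1, 2, 3)        # embeddings prepared together with the prompt
fresh = torch.ones(1, 2, 3)               # embedding of the token produced at the previous step
started = torch.tensor([True, False])     # sample 0 is past its context, sample 1 is not
print(switch(precomputed, fresh, started)[0, :, 0])  # tensor([1., 0.])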
+ set_inference_key_value_memory = False + tokens2use = tokens[:, curr_context_length - 1].view(micro_batch_size, -1) + positions2use = self.position_ids[:, curr_context_length - 1].view(micro_batch_size, -1) + embeddings2use = self.model._get_text_embeddings(tokens2use, positions2use) + started = context_lengths <= curr_context_length + embeddings2use = switch(input_embeddings[curr_context_length - 1].unsqueeze(0), embeddings2use, started) + + """Prepare batch for each of the inference steps""" + setkey_value_array = torch.tensor( + [set_inference_key_value_memory] * micro_batch_size, device=torch.cuda.current_device() + ) + len_array = torch.tensor([maxlen] * micro_batch_size, device=torch.cuda.current_device()) + + batch = [tokens2use, embeddings2use, self.attention_mask, positions2use, setkey_value_array, len_array] + tensor_shape = [tokens2use.shape[1], micro_batch_size, self.model.cfg.hidden_size] + return batch, tensor_shape + + def post_process(self, tokens: torch.Tensor, new_tokens: torch.Tensor, context_length: int): + """ + At the end of the inference, post process the inference results + """ + pass + + def end_of_generation_condition( + self, tokens: torch.Tensor, prev: torch.Tensor, eod_id: int, end_strings: List[str] + ) -> torch.Tensor: + """ + return whether the generation should stop based on the previous token + Args: + tokens (torch.Tensor): the generated tokens so far + prev (torch.Tensor): the previous token + eod_id (int): the end of document token id + end_strings (List[str]): the list of end of generation strings + returns: + a boolean tensor indicating whether the generation should stop + """ + if len(end_strings) == 1 and end_strings[0] == END_OF_SEQ: + return prev == eod_id + else: + tokenizer = self.model.tokenizer + conditions = [] + end_tokens = set() + end_tokens.add(eod_id) + for end_string in end_strings: + if len(end_string) > 1: + continue + ids_1 = tokenizer.text_to_ids(f'{end_string}') + ids_2 = tokenizer.text_to_ids('') + if len(ids_1) <= len(ids_2): + continue + token_id = ids_1[len(ids_2) :][0] + + end_tokens.add(token_id) + + for p, token_item in zip(prev, tokens): + text = tokenizer.ids_to_text(token_item.tolist()) + conditions.append( + any([text.endswith(end_string) for end_string in end_strings] + [p.item() in end_tokens]) + ) + return torch.tensor(conditions, dtype=torch.bool, device=tokens.device) + + +def model_inference_strategy_dispatcher(model, **args): + from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel + + if isinstance(model, ModularAudioGPTModel): + return AudioToTextGenerationStrategy(model, **args) + else: + return text_generation_strategy.model_inference_strategy_dispatcher(model, **args) diff --git a/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_utils.py b/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_utils.py new file mode 100644 index 000000000000..136418031586 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/modules/common/audio_text_generation_utils.py @@ -0,0 +1,698 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Utilities for generating text.""" + +import pickle +from collections.abc import Iterable +from typing import List, Optional, Tuple, Union +import numpy as np +import torch +import torch.nn.functional as F + +import nemo.collections.nlp.modules.common.text_generation_utils as text_generation_utils +from nemo.collections.common.tokenizers.tabular_tokenizer import TabularTokenizer +from nemo.collections.multimodal.speech_llm.modules.common.audio_text_generation_strategy import ( + model_inference_strategy_dispatcher, +) +from nemo.collections.nlp.modules.common.transformer.text_generation import OutputType +from nemo.utils import AppState + +try: + from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator + + HAVE_APEX = True + +except (ImportError, ModuleNotFoundError): + + HAVE_APEX = False + +try: + from megatron.core import parallel_state, tensor_parallel + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + + HAVE_MEGATRON_CORE = False + +__all__ = [ + "get_computeprob_response", + "generate", +] + + +def get_computeprob_response(tokenizer, response, inputs): + return text_generation_utils.get_computeprob_response(tokenizer, response, inputs) + + +def send_generate_info( + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + compute_logprob, + temperature, + top_k, + top_p, + greedy, + repetition_penalty, + min_tokens_to_generate, + end_strings, + num_audios: Optional[torch.Tensor] = None, + context_start_idx: Optional[List[List[int]]] = None, +): + """ + Needs to be synced up with receive_generate_info + """ + model_parallel_group = parallel_state.get_model_parallel_group() + src = text_generation_utils.get_model_parallel_src_rank() + + audio_max_len = audio_signal.size(1) if audio_signal is not None else 0 + + # Send the sizes of the tensors + input_info = [ + context_tokens_tensor.size(0), # batch_size + context_tokens_tensor.size(1), # seq_len + audio_max_len, # audio_max_len + tokens_to_generate, + all_probs, + compute_logprob, # whether to compute log probabilities matrix + temperature, + top_k, + top_p, + greedy, + repetition_penalty, + min_tokens_to_generate, + ] + input_info_tensor = torch.cuda.FloatTensor(input_info) + torch.distributed.broadcast(input_info_tensor, src, model_parallel_group) + + # Send variables to all ranks + torch.distributed.broadcast(context_length_tensor, src, model_parallel_group) + torch.distributed.broadcast(context_tokens_tensor, src, model_parallel_group) + + torch.distributed.broadcast(audio_signal, src, model_parallel_group) + torch.distributed.broadcast(audio_signal_length, src, model_parallel_group) + + # send end strings + string_tensor = torch.as_tensor( + np.frombuffer(pickle.dumps(end_strings), dtype=np.int8), device=torch.cuda.current_device() + ) + size = torch.as_tensor([string_tensor.size(0)], device=torch.cuda.current_device(), dtype=torch.int64) + torch.distributed.broadcast(size, src, model_parallel_group) + torch.distributed.broadcast(string_tensor, src, model_parallel_group) + + if num_audios is 
not None: + torch.distributed.broadcast(num_audios, src, model_parallel_group) + + if context_start_idx is not None: + context_idx_tensor = torch.as_tensor( + np.frombuffer(pickle.dumps(context_start_idx), dtype=np.int8), device=torch.cuda.current_device() + ) + ctx_size = torch.as_tensor([context_idx_tensor.size(0)], device=torch.cuda.current_device(), dtype=torch.int64) + torch.distributed.broadcast(ctx_size, src, model_parallel_group) + torch.distributed.broadcast(context_idx_tensor, src, model_parallel_group) + + +def receive_generate_info(has_multi_audios=False): + """ + Needs to be synced up with send_generate_info + """ + model_parallel_group = parallel_state.get_model_parallel_group() + src = text_generation_utils.get_model_parallel_src_rank() + input_info_tensor = torch.empty(12, dtype=torch.float32, device=torch.cuda.current_device()) + torch.distributed.broadcast(input_info_tensor, src, model_parallel_group) + batch_size = int(input_info_tensor[0].item()) + seq_len = int(input_info_tensor[1].item()) + audio_len = int(input_info_tensor[2].item()) + tokens_to_generate = int(input_info_tensor[3].item()) + all_probs = bool(input_info_tensor[4].item()) + compute_logprob = bool(input_info_tensor[5].item()) # whether to compute log probabilities matrix + temperature = float(input_info_tensor[6].item()) + top_k = int(input_info_tensor[7].item()) + top_p = float(input_info_tensor[8].item()) + greedy = bool(input_info_tensor[9].item()) + repetition_penalty = float(input_info_tensor[10].item()) + min_tokens_to_generate = int(input_info_tensor[11].item()) + + context_length_tensor = torch.empty(batch_size, dtype=torch.int64, device=torch.cuda.current_device()) + context_tokens_tensor = torch.empty(batch_size, seq_len, dtype=torch.int64, device=torch.cuda.current_device()) + # Send variables to all ranks + torch.distributed.broadcast(context_length_tensor, src, model_parallel_group) + torch.distributed.broadcast(context_tokens_tensor, src, model_parallel_group) + + audio_signal = torch.empty(batch_size, audio_len, dtype=torch.float32, device=torch.cuda.current_device()) + audio_signal_length = torch.empty(batch_size, dtype=torch.int64, device=torch.cuda.current_device()) + # Send variables to all ranks + torch.distributed.broadcast(audio_signal, src, model_parallel_group) + torch.distributed.broadcast(audio_signal_length, src, model_parallel_group) + + array_size = torch.empty(1, dtype=torch.int64, device=torch.cuda.current_device()) + torch.distributed.broadcast(array_size, src, model_parallel_group) + + string_tensor = torch.empty(array_size[0], dtype=torch.int8, device=torch.cuda.current_device()) + torch.distributed.broadcast(string_tensor, src, model_parallel_group) + bytes = string_tensor.cpu().numpy().tobytes() + end_strings = pickle.loads(bytes) + + num_audios = None + context_start_idx = None + if has_multi_audios: + num_audios = torch.empty(batch_size, dtype=torch.int64, device=torch.cuda.current_device()) + torch.distributed.broadcast(num_audios, src, model_parallel_group) + + array_size = torch.empty(1, dtype=torch.int64, device=torch.cuda.current_device()) + torch.distributed.broadcast(array_size, src, model_parallel_group) + context_idx_tensor = torch.empty(array_size[0], dtype=torch.int8, device=torch.cuda.current_device()) + torch.distributed.broadcast(context_idx_tensor, src, model_parallel_group) + bytes = context_idx_tensor.cpu().numpy().tobytes() + context_start_idx = pickle.loads(bytes) + + return ( + context_length_tensor, + context_tokens_tensor, + audio_signal, + 
audio_signal_length, + tokens_to_generate, + all_probs, + compute_logprob, + temperature, + top_k, + top_p, + greedy, + repetition_penalty, + min_tokens_to_generate, + end_strings, + num_audios, + context_start_idx, + ) + + +def synced_generate( + model, + inference_strategy, + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + temperature, + top_k=0, + top_p=0.0, + greedy=False, + compute_attention_mask=True, + compute_logprob=False, + repetition_penalty=1.2, + end_strings=[], + min_tokens_to_generate=0, + num_audios: Optional[torch.Tensor] = None, + context_start_idx: Optional[List[List[int]]] = None, +): + context_length = context_length_tensor.min().item() + tokenizer = model.tokenizer + if isinstance(tokenizer, TabularTokenizer): + raise NotImplementedError("Tabular generation is not supported yet") + else: + batch_token_iterator = sample_sequence_batch( + model, + inference_strategy, + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + compute_attention_mask=compute_attention_mask, + compute_logprob=compute_logprob, + temperature=temperature, + end_strings=end_strings, + extra={ + "top_p": top_p, + "top_k": top_k, + "greedy": greedy, + "repetition_penalty": repetition_penalty, + "min_tokens_to_generate": min_tokens_to_generate, + }, + num_audios=num_audios, + context_start_idx=context_start_idx, + ) + + for tokens, lengths, output_logits, full_logits, audio_feat_lens in batch_token_iterator: + context_length += 1 + context_length += audio_feat_lens.min().item() + if parallel_state.is_pipeline_last_stage(): + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + if compute_logprob: + torch.distributed.broadcast(output_logits, src, group) + if all_probs: + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + torch.distributed.broadcast(full_logits, src, group) + + else: + if parallel_state.is_pipeline_first_stage(): + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + + if compute_logprob: + precision = model._trainer.precision + if precision in [16, "16"]: + dtype = torch.float16 + elif precision == "bf16": + dtype = torch.bfloat16 + else: + dtype = torch.float32 + output_logits = torch.empty( + tokens.size(0), context_length - 1, dtype=dtype, device=torch.device("cuda") + ) + torch.distributed.broadcast(output_logits, src, group) + + if all_probs: + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + full_logits = torch.empty( + tokens.size(0), + context_length - 1, + model.padded_vocab_size, + dtype=dtype, + device=torch.device("cuda"), + ) + torch.distributed.broadcast(full_logits, src, group) + if tokens is not None: + return tokens[:, :context_length], output_logits, full_logits, audio_feat_lens + return None + + +def generate( + model, + inputs: Union[Tuple, List[str]], + tokens_to_generate=0, + all_probs=False, + temperature=1.0, + add_BOS=False, + top_k=0, + top_p=0.0, + greedy=False, + compute_attention_mask=True, + compute_logprob=False, + repetition_penalty=1.0, + end_strings=['<|endoftext|>'], + min_tokens_to_generate=0, + **strategy_args, +) -> OutputType: + """ + Args: + model (NLPModel): text generative model + inputs (Union[tuple, List[str]]): if it is a tuple, it is assumed to be 
(context_tokens_tensor, context_length_tensor). Otherwise it it a list of prompt text strings + tokens_to_generate (int): The maximum length of the tokens to be generated. + all_probs (bool): Return the log prob for all the tokens + temperature (float): sampling temperature + add_BOS (bool): add the bos token at the begining of the prompt + top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering. + top_p (float): If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. + greedy (bool): Whether or not to use sampling ; use greedy decoding otherwise + repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty + min_tokens_to_generate (int): The minimum length of the tokens to be generated + strategy_args, the extra arguments are treated as inference strategy arguments + end_strings, a list of strings to stop generation when they are encountered in the output. + Returns: + OutputType: It generates the output in a dictionary type. It has the following keys: + sentences: List[str], output sentences + tokens: List[List[str]], output sentences borken into tokens + logprob: List[Tensor], log prob of generated tokens + full_logprob: List[Tensor], log prob of all the tokens in the vocab + token_ids: List[Tensor], output sentence token ids + offsets: List[List[int]] # list of tokens start positions in text + """ + if 'strategy' in strategy_args: + inference_strategy = strategy_args['strategy'] + else: + inference_strategy = model_inference_strategy_dispatcher(model) + tokenizer = model.tokenizer + has_multi_audios = False + num_audios = None + context_start_idx = None + audio_signal, audio_signal_length = None, None + if torch.distributed.get_rank() == text_generation_utils.get_model_parallel_src_rank(): + if isinstance(inputs, tuple) and len(inputs) == 2: + context_tokens_tensor, context_length_tensor = inputs + elif isinstance(inputs, tuple) and len(inputs) == 4: + context_tokens_tensor, context_length_tensor, audio_signal, audio_signal_length = inputs + elif isinstance(inputs, tuple) and len(inputs) == 6: # multi-audio + has_multi_audios = True + ( + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + num_audios, + context_start_idx, + ) = inputs + else: + context_tokens_tensor, context_length_tensor = inference_strategy.tokenize_batch( + inputs, tokens_to_generate, add_BOS + ) + + send_generate_info( + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + compute_logprob, + temperature, + top_k, + top_p, + greedy, + repetition_penalty, + min_tokens_to_generate, + end_strings, + num_audios, + context_start_idx, + ) + else: + ( + context_length_tensor, + context_tokens_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + compute_logprob, + temperature, + top_k, + top_p, + greedy, + repetition_penalty, + min_tokens_to_generate, + end_strings, + num_audios, + context_start_idx, + ) = receive_generate_info(has_multi_audios) + + output = synced_generate( + model, + inference_strategy, + context_tokens_tensor, + context_length_tensor, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs, + temperature, + compute_attention_mask=compute_attention_mask, + compute_logprob=compute_logprob, + top_k=top_k, + top_p=top_p, + greedy=greedy, + repetition_penalty=repetition_penalty, + end_strings=end_strings, + 
min_tokens_to_generate=min_tokens_to_generate, + num_audios=num_audios, + context_start_idx=context_start_idx, + ) + special_tokens = set() + if hasattr(tokenizer, 'pad_token') and tokenizer.pad_token is not None: + special_tokens.add(tokenizer.pad_token) + if hasattr(tokenizer, 'eos_token') and tokenizer.eos_token is not None: + special_tokens.add(tokenizer.eos_token) + if hasattr(tokenizer, 'bos_token') and tokenizer.bos_token is not None: + special_tokens.add(tokenizer.bos_token) + if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token is not None: + special_tokens.add(tokenizer.cls_token) + if hasattr(tokenizer, 'unk_token') and tokenizer.unk_token is not None: + special_tokens.add(tokenizer.unk_token) + if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token is not None: + special_tokens.add(tokenizer.sep_token) + if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token is not None: + special_tokens.add(tokenizer.mask_token) + if output is not None: + decode_tokens, output_logits, full_logits, audio_feat_lens = output + resp_sentences = [] + resp_sentences_seg = [] + + decode_tokens = decode_tokens.cpu().numpy().tolist() + for decode_token in decode_tokens: + sentence = tokenizer.ids_to_text(decode_token) + resp_sentences.append(sentence) + if not isinstance(tokenizer, TabularTokenizer): + words = [] + for token in decode_token: + if not isinstance(token, Iterable): + token = [token] + word = tokenizer.ids_to_tokens(token) + if isinstance(word, Iterable): + word = word[0] + if hasattr(tokenizer.tokenizer, 'byte_decoder'): + word = bytearray([tokenizer.tokenizer.byte_decoder[c] for c in word]).decode( + 'utf-8', errors='replace' + ) + words.append(word) + resp_sentences_seg.append(words) + else: + words = tokenizer.text_to_tokens(sentence) + resp_sentences_seg.append(words) + + # offsets calculation + all_offsets = [] + for item in resp_sentences_seg: + offsets = [0] + for index, token in enumerate(item): + if index != len(item) - 1: + if token in special_tokens: + offsets.append(offsets[-1]) + else: + offsets.append(len(token) + offsets[-1]) + all_offsets.append(offsets) + + output = {} + output['sentences'] = resp_sentences + output['tokens'] = resp_sentences_seg + output['logprob'] = output_logits + output['full_logprob'] = full_logits + output['token_ids'] = decode_tokens + output['offsets'] = all_offsets + output['audio_feat_lens'] = audio_feat_lens + output = inference_strategy.post_generation_process(output) + return output + return None + + +def switch(val1, val2, boolean): + boolean = boolean.type_as(val1) + return (1 - boolean) * val1 + boolean * val2 + + +def sample_sequence_batch( + model, + inference_strategy, + context_tokens, + context_lengths, + audio_signal, + audio_signal_length, + tokens_to_generate, + all_probs=False, + compute_attention_mask=True, + compute_logprob=False, + type_ids=None, + temperature=None, + end_strings=['<|endoftext|>'], + extra={}, + num_audios: Optional[torch.Tensor] = None, + context_start_idx: Optional[List[List[int]]] = None, +): + app_state = AppState() + micro_batch_size = context_tokens.shape[0] + _reconfigure_microbatch_calculator( + rank=app_state.global_rank, + rampup_batch_size=None, + global_batch_size=micro_batch_size, + micro_batch_size=micro_batch_size, + data_parallel_size=1, + ) + assert tokens_to_generate > 0, "tokens_to_generate should be > 0" + assert ( + model.cfg.get('sequence_parallel', False) == False + ), 'sequence_parallel should be False during inference. 
Disable it in the model config if restoring from nemo or in hparams.yaml if restoring from PTL checkpoint' + assert ( + model.cfg.get('activations_checkpoint_granularity', None) is None + ), 'activations_checkpoint_granularity should be None during inference. Disable it in the model config if restoring from nemo or in hparams.yaml if restoring from PTL checkpoint' + assert ( + model.cfg.get('activations_checkpoint_method', None) is None + ), 'activations_checkpoint_method should be None during inference. Disable it in the model config if restoring from nemo or in hparams.yaml if restoring from PTL checkpoint' + + tokenizer = model.tokenizer + # initialize the batch + with torch.no_grad(): + context_tokens, input_embeddings, audio_feat_lens = inference_strategy.init_batch( + context_tokens, + context_lengths, + audio_signal, + audio_signal_length, + compute_attention_mask, + num_audios, + context_start_idx, + ) + audio_text_context_lengths = context_lengths + audio_feat_lens + context_length = audio_text_context_lengths.min().item() + # added eos_id to support the function generate_samples_eval that passes + # eos_id as an argument and needs termination when that id id found. + eod_id = tokenizer.eos_id + counter = 0 + batch_size = context_tokens.size(0) + is_done = torch.zeros([batch_size]).byte().cuda() + tokens = context_tokens + output_logits = None + all_generated_indices = None # used to track all generated indices + # Generate enough tokens for the longest sequence + maxlen = tokens_to_generate + audio_text_context_lengths.max().item() + maxlen = inference_strategy.clip_max_len(maxlen) + lengths = torch.ones([batch_size]).long().cuda() * maxlen + while context_length < maxlen: + batch, tensor_shape = inference_strategy.prepare_batch_at_step( + tokens, + input_embeddings, + maxlen, + micro_batch_size, + counter, + audio_text_context_lengths, + context_length, + compute_attention_mask, + ) + output = inference_strategy.forward_step(batch, tensor_shape) + if parallel_state.is_pipeline_last_stage(): + if compute_logprob: + output = output[0]['logits'] + output = tensor_parallel.gather_from_tensor_model_parallel_region(output) + assert output is not None + logits = output[:, -1].view(batch_size, -1).contiguous() + + else: + logits = output[0]['logits'][:, -1].contiguous() + logits = tensor_parallel.gather_from_tensor_model_parallel_region(logits) + assert logits is not None + logits = logits.view(batch_size, -1) + + # make sure it will generate at least min_length + min_length = extra.get('min_tokens_to_generate', 0) + if min_length > 0: + within_min_length = (context_length - audio_text_context_lengths) < min_length + logits[within_min_length, eod_id] = -float('Inf') + # make sure it won't sample outside the vocab_size range + logits[:, tokenizer.vocab_size :] = -float('Inf') + + # started indicates whether the current token step passes the context_length, so we make sure not to overwrite the context tokens + started = audio_text_context_lengths <= context_length + if extra.get('greedy', False): + prev = torch.argmax(logits, dim=-1).view(-1) + else: + logits = logits.float() + logits /= temperature + # handle repetition penality + logits = text_generation_utils.repetition_penalty( + logits, extra.get('repetition_penalty', 1.2), all_generated_indices + ) + logits = text_generation_utils.top_k_logits( + logits, top_k=extra.get('top_k', 0), top_p=extra.get('top_p', 0.9), started=started + ) + probs = F.softmax(logits, dim=-1) + # TODO(zhehuai) + probs = probs.nan_to_num(1.0) + prev = 
torch.multinomial(probs, num_samples=1).view(-1) + + # Clamp the predicted out of vocabulary tokens + prev = torch.clamp(prev, max=tokenizer.vocab_size - 1) + new_tokens = switch(tokens[:, context_length].view(-1), prev, started) + + # Replace sampled tokens w/ done token if EOD has already been sampled + new_tokens = switch(new_tokens, eod_id, is_done) + + # post process the inference tokens based on the strategy + inference_strategy.post_process(tokens, new_tokens, context_length) + + # Insert either new predicted or next prompt token + tokens[:, context_length] = new_tokens + + if compute_logprob: + if output_logits is None: + output = F.log_softmax(output[:, :context_length, :], 2) + + indices = torch.unsqueeze(tokens[:, 1 : context_length + 1], 2) + output_logits = torch.gather(output, 2, indices).squeeze(2) + all_generated_indices = indices[:, :, 0] + if all_probs: + full_logits = output + else: + output = F.log_softmax(output, 2) + indices = torch.unsqueeze(new_tokens, 1).unsqueeze(2) + new_output_logits = torch.gather(output, 2, indices).squeeze(2) + + # TODO(rprenger) we're copying output_logits every time. Should pre-allocate + output_logits = torch.cat([output_logits, new_output_logits], 1) + all_generated_indices = torch.cat([all_generated_indices, indices[:, :, 0]], 1) + if all_probs: + full_logits = torch.cat([full_logits, output], 1) + + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + torch.distributed.broadcast(new_tokens, src, group) + + # done_token = (prev == eod_id).byte() & started.byte() + done_token = inference_strategy.end_of_generation_condition( + tokens[:, : context_length + 1], prev, eod_id, end_strings + ) + done_token = done_token.byte() & started.byte() + + just_finished = (done_token & ~is_done).bool() + lengths[just_finished.view(-1)] = context_length + is_done = is_done | done_token + + done = torch.all(is_done) + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_pipeline_model_parallel_group() + torch.distributed.broadcast(done, src, group) + if compute_logprob: + if all_probs: + yield tokens, lengths, output_logits, full_logits, audio_feat_lens + else: + yield tokens, lengths, output_logits, None, audio_feat_lens + else: + yield tokens, lengths, None, None, audio_feat_lens + + else: + if parallel_state.is_pipeline_first_stage(): + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_embedding_group() + new_tokens = torch.empty_like(tokens[:, context_length]) + torch.distributed.broadcast(new_tokens, src, group) + tokens[:, context_length] = new_tokens + yield tokens, None, None, None, audio_feat_lens + else: + yield None, None, None, None, audio_feat_lens + + done = torch.cuda.ByteTensor([0]) + src = parallel_state.get_pipeline_model_parallel_last_rank() + group = parallel_state.get_pipeline_model_parallel_group() + torch.distributed.broadcast(done, src, group) + + context_length += 1 + counter += 1 + if done: + break diff --git a/nemo/collections/multimodal/speech_llm/modules/modality_adapters.py b/nemo/collections/multimodal/speech_llm/modules/modality_adapters.py new file mode 100644 index 000000000000..408231adcc6d --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/modules/modality_adapters.py @@ -0,0 +1,134 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict + +import torch +import torch.nn as nn + +from nemo.collections.common.parts.multi_layer_perceptron import MultiLayerPerceptron as MLP +from nemo.core.classes.common import typecheck +from nemo.core.classes.exportable import Exportable +from nemo.core.classes.mixins import AccessMixin +from nemo.core.classes.module import NeuralModule +from nemo.core.neural_types import AcousticEncodedRepresentation, LengthsType, NeuralType + +__all__ = ['PoolingMLPConnectors'] + + +class ConcatPooling(nn.Module): + """ + A module that perform pooling by concatenating the features of every pooling_factor frames. + """ + + def __init__(self, pooling_factor): + super().__init__() + self.pooling_factor = pooling_factor + + def forward(self, x): + # x: [batch_size, seq_len, input_dim] + batch_size, seq_len, input_dim = x.shape + if seq_len % self.pooling_factor != 0: + x = x[:, : -(seq_len % self.pooling_factor), :] + x = x.reshape(batch_size, seq_len // self.pooling_factor, input_dim * self.pooling_factor) + return x + + +class PoolingMLPConnectors(NeuralModule, Exportable, AccessMixin): + """ + A module that performs pooling and MLP on the input features. + Currently only supports mean pooling and concatenation pooling. 
+ """ + + def __init__( + self, + input_dim, + hidden_dim, + output_dim=None, + num_layers: int = 2, + activation: str = "relu", + pooling: str = "mean", + pooling_factor: int = 2, + **kwargs, # keep this to avoid breaking existing code + ): + """ + Args: + input_dim: input dimension of the features + hidden_dim: hidden dimension of the MLP layers + output_dim: output dimension of the features + num_layers: number of layers in the MLP + activation: activation function used in MLP + pooling: type of pooling, currently only supports "mean" and "cat" + pooling_factor: size of the pooling window + """ + super().__init__() + self.input_dim = input_dim + self.hidden_dim = hidden_dim + self.output_dim = output_dim if output_dim else input_dim + self.num_layers = num_layers + self.activation = activation + self.pooling = pooling + self.pooling_factor = pooling_factor + + if num_layers == 1: + self.hidden_dim = output_dim + + if pooling == "cat": + self.preprocess = nn.Sequential( + ConcatPooling(pooling_factor), nn.Linear(input_dim * pooling_factor, self.hidden_dim) + ) + else: + self.preprocess = nn.Sequential( + nn.AvgPool1d(pooling_factor, stride=pooling_factor), nn.Linear(input_dim, self.hidden_dim) + ) + + if num_layers == 1: + self.mlp = nn.Identity() + else: + self.mlp = MLP(self.hidden_dim, output_dim, num_layers, activation, log_softmax=False) + + @property + def input_types(self): + """Returns definitions of module input ports.""" + return OrderedDict( + { + "audio_signal": NeuralType(("B", "D", "T"), AcousticEncodedRepresentation()), + "length": NeuralType(tuple("B"), LengthsType()), + } + ) + + @property + def output_types(self): + """Returns definitions of module output ports.""" + return OrderedDict( + { + "outputs": NeuralType(("B", "D", "T"), AcousticEncodedRepresentation()), + "outputs_len": NeuralType(tuple("B"), LengthsType()), + } + ) + + @typecheck() + def forward(self, audio_signal, length=None): + """ + Args: + audio_signal: [batch_size, input_dim, seq_len] + length: [batch_size] + Returns: + outputs: [batch_size, output_dim, seq_len//pooling_factor] + outputs_len: [batch_size] + """ + outputs = self.preprocess(audio_signal.transpose(1, 2)) + outputs = self.mlp(outputs) + outputs_len = torch.div(length, self.pooling_factor, rounding_mode='floor') + return outputs.transpose(1, 2), outputs_len diff --git a/nemo/collections/multimodal/speech_llm/modules/perception_modules.py b/nemo/collections/multimodal/speech_llm/modules/perception_modules.py new file mode 100644 index 000000000000..2f0565982941 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/modules/perception_modules.py @@ -0,0 +1,431 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
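[Editorial note, not part of the patch] To make the shape contract of the PoolingMLPConnectors adapter above concrete, here is a small hypothetical shape check using the "cat" pooling path. All dimensions, the pooling settings, and the printed expectations are illustrative values, not values taken from the patch.

import torch

from nemo.collections.multimodal.speech_llm.modules.modality_adapters import PoolingMLPConnectors

# Concatenation pooling shrinks the time axis by pooling_factor and projects to output_dim.
connector = PoolingMLPConnectors(
    input_dim=512, hidden_dim=512, output_dim=768, num_layers=2, pooling="cat", pooling_factor=2
)
feats = torch.randn(4, 512, 101)             # [batch, input_dim, time], e.g. an encoder output
feat_lens = torch.tensor([101, 80, 60, 50])  # valid frames per sample

out, out_lens = connector(audio_signal=feats, length=feat_lens)
print(out.shape)   # expected: torch.Size([4, 768, 50]), time pooled by pooling_factor
print(out_lens)    # expected: tensor([50, 40, 30, 25])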
+ +from collections import OrderedDict +from typing import List, Optional, Tuple + +import torch +import torch.distributed +import torch.nn as nn +from omegaconf import DictConfig + +from nemo.collections.asr.models import EncDecSpeakerLabelModel +from nemo.collections.asr.modules.conformer_encoder import ConformerEncoder, ConformerMultiLayerFeatureExtractor +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import align_feat_seq_list +from nemo.core.classes import Exportable, NeuralModule +from nemo.core.classes.common import typecheck +from nemo.core.neural_types import AcousticEncodedRepresentation, AudioSignal, LengthsType, NeuralType, SpectrogramType +from nemo.utils.decorators import experimental + + +__all__ = ["AudioPerceptionModule", "MultiAudioPerceptionModule"] + + +class AudioPerceptionModule(NeuralModule, Exportable): + """Audio perception module that consists of audio encoder(s) and modality adapter.""" + + def input_example(self, max_batch: int = 8, max_dim: int = 32000, min_length: int = 200): + batch_size = torch.randint(low=1, high=max_batch, size=[1]).item() + max_length = torch.randint(low=min_length, high=max_dim, size=[1]).item() + signals = torch.rand(size=[batch_size, max_length]) * 2 - 1 + lengths = torch.randint(low=min_length, high=max_dim, size=[batch_size]) + lengths[0] = max_length + return signals, lengths, None, None + + @property + def input_types(self): + """Returns definitions of module input ports.""" + return OrderedDict( + { + "input_signal": NeuralType(("B", "T"), AudioSignal(freq=self.preprocessor._sample_rate)), + "input_signal_length": NeuralType( + tuple("B"), LengthsType() + ), # Please note that length should be in samples not seconds. + "processed_signal": NeuralType(("B", "D", "T"), SpectrogramType()), + "processed_signal_length": NeuralType(tuple("B"), LengthsType()), + } + ) + + @property + def output_types(self): + """Returns definitions of module output ports.""" + return OrderedDict( + { + "encoded": NeuralType(("B", "T", "D"), AcousticEncodedRepresentation()), + "encoded_len": NeuralType(tuple("B"), LengthsType()), + } + ) + + def __init__(self, cfg: DictConfig): + super().__init__() + # Initialize components + self.preprocessor = self.from_config_dict(cfg.preprocessor) + self.encoder = self.from_config_dict(cfg.encoder) + + if cfg.get("use_multi_layer_feat", False) and cfg.get("multi_layer_feat", None): + if "_target_" in cfg.multi_layer_feat.aggregator: + aggregator = self.from_config_dict(cfg.multi_layer_feat.aggregator) + else: + aggregator = MultiFeatureAggregator(cfg.multi_layer_feat.aggregator, channel_dim=1) + self.encoder = ConformerMultiLayerFeatureExtractor( + encoder=self.encoder, layer_idx_list=cfg.multi_layer_feat.layer_idx_list, aggregator=aggregator + ) + + if 'spec_augment' in cfg and cfg.spec_augment is not None: + self.spec_augmentation = self.from_config_dict(cfg.spec_augment) + else: + self.spec_augmentation = None + self.modality_adapter = self.from_config_dict(cfg.modality_adapter) + if 'output_dim' not in cfg.modality_adapter and "d_model" in cfg.modality_adapter: # e.g., conformer encoder + self.proj = nn.Linear(cfg.modality_adapter.d_model, cfg.output_dim) + else: + self.proj = nn.Identity() + + def maybe_preprocess_audio( + self, + input_signal=None, + input_signal_length=None, + processed_signal=None, + processed_signal_length=None, + ): + has_input_signal = input_signal is not None and input_signal_length is not None + has_processed_signal = processed_signal is not None and 
processed_signal_length is not None + if (has_input_signal ^ has_processed_signal) is False: + raise ValueError( + f"{self.__class__} Arguments ``input_signal`` and ``input_signal_length`` are mutually exclusive " + " with ``processed_signal`` and ``processed_signal_len`` arguments." + ) + + if not has_processed_signal: + processed_signal, processed_signal_length = self.preprocessor( + input_signal=input_signal, + length=input_signal_length, + ) + return processed_signal, processed_signal_length + + # disable type checks to avoid type-check errors when using Conformer as modality adapter + @typecheck.disable_checks() + def forward( + self, + input_signal=None, + input_signal_length=None, + processed_signal=None, + processed_signal_length=None, + ): + processed_signal, processed_signal_length = self.maybe_preprocess_audio( + input_signal, input_signal_length, processed_signal, processed_signal_length + ) + + # Spec augment is not applied during evaluation/testing + if self.spec_augmentation is not None and self.training: + processed_signal = self.spec_augmentation(input_spec=processed_signal, length=processed_signal_length) + + encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_length) + encoded, encoded_len = self.modality_adapter(audio_signal=encoded, length=encoded_len) + + # b, c, t -> b, t, c + encoded = self.proj(encoded.transpose(1, 2)) + + return encoded, encoded_len + + +class MultiFeatureAggregator(nn.Module): + """ + A module used to aggregate multiple encoded features (from different encoders or different layers) into a single feature sequence. + """ + + def __init__(self, cfg: DictConfig, channel_dim: int = 1): + super().__init__() + self.mode = cfg.get("mode", "cat") + self.channel_dim = channel_dim + self.pooling = cfg.get("pooling", "mean") + self.align_mode = cfg.get("align_mode", "min") + + def _have_same_length(self, encoded_len: List[torch.Tensor]) -> bool: + sample_len = encoded_len[0] + for x in encoded_len: + if torch.sum(x - sample_len) != 0: + return False + return True + + def forward( + self, + encoded: List[torch.Tensor], + encoded_len: List[torch.Tensor], + ref_idx: Optional[int] = None, + ) -> Tuple[torch.Tensor, torch.Tensor]: + if not self._have_same_length(encoded_len): + """Align the length of encoded features if they are different.""" + target_len = encoded[0].size(self.channel_dim) + if ref_idx is not None: + target_len = encoded[ref_idx].size(self.channel_dim) + if self.channel_dim != 1: + encoded = [x.transpose(1, self.channel_dim) for x in encoded] + encoded, encoded_len = align_feat_seq_list( + encoded, encoded_len, mode=self.align_mode, pooling=self.pooling, target_len=target_len + ) + if self.channel_dim != 1: + encoded = [x.transpose(1, self.channel_dim) for x in encoded] + + if self.mode == "cat": + return torch.cat(encoded, dim=self.channel_dim), encoded_len[0] + elif self.mode == "sum": + return torch([x.unsqueeze(-1) for x in encoded], dim=-1).sum(dim=-1), encoded_len[0] + elif self.mode == "mean" or self.mode == "avg": + return torch([x.unsqueeze(-1) for x in encoded], dim=-1).mean(dim=-1), encoded_len[0] + elif self.mode == "max": + return torch([x.unsqueeze(-1) for x in encoded], dim=-1).max(dim=-1), encoded_len[0] + elif self.mode == "min": + return torch([x.unsqueeze(-1) for x in encoded], dim=-1).min(dim=-1), encoded_len[0] + elif self.mode == "none": + return encoded, encoded_len + else: + raise ValueError(f"Unknown mode {self.mode}") + + +@experimental +class 
MultiAudioPerceptionModule(NeuralModule, Exportable): + """ + Audio perception module that consists of multiple audio encoders and shared modality adapter. + This module is experimental. An example perception cfg is: + ------------------- + perception: + modality_adapter: + _target_: nemo.collections.multimodal.speechllm.modules.PoolingMLPConnectors + hidden_dim: 512 + pooling: 'cat' + pooling_factor: 2 + num_layers: 4 + input_dim: -1 + output_dim: -1 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + encoders: + asr_model: + _target_: nemo.collections.asr.models.ASRModel + output_key: d_model + freeze: True + pretrained_model: stt_en_fastconformer_transducer_large + ssl_model: + _target_: nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel + output_key: d_model + freeze: True + pretrained_model: ssl_en_conformer_large + use_multi_layer_feat: True + multi_layer_feat: + layer_idx_list: [0,16] + aggregator: + mode: "cat" + pooling: "avg" + rounding: "floor" + + speaker_model: + segment_length_in_secs: 0.4 + freeze: True + pretrained_model: titanet_large + + ref_model: asr_model + aggregator: + mode: "cat" + pooling: "mean" + rounding: "floor" + ------------------- + """ + + def __init__(self, cfg: DictConfig): + super().__init__() + # Initialize components + self.aggregator = MultiFeatureAggregator(cfg.aggregator, channel_dim=1) + if 'spec_augment' in cfg and cfg.spec_augment is not None: + self.spec_augmentation = self.from_config_dict(cfg.spec_augment) + else: + self.spec_augmentation = None + + self.encoder_cfg = cfg.encoders + if not isinstance(self.encoder_cfg, DictConfig): + raise TypeError(f"cfg.encoders must be a DictConfig, got {type(cfg.encoders)}") + + preprocessor = {} + encoders = {} + for key, enc_cfg in self.encoder_cfg.items(): + encoder = self.from_config_dict(enc_cfg.model) + if enc_cfg.get("use_multi_layer_feat", False) and enc_cfg.get("multi_layer_feat", None): + if not isinstance(encoder, ConformerEncoder): + raise TypeError( + f"Encoder {key} must be a ConformerEncoder when use_multi_layer_feat is True, got {type(encoder)}" + ) + if "_target_" in enc_cfg.multi_layer_feat.aggregator: + aggregator = self.from_config_dict(enc_cfg.multi_layer_feat.aggregator) + else: + aggregator = MultiFeatureAggregator(enc_cfg.multi_layer_feat.aggregator, channel_dim=1) + encoder = ConformerMultiLayerFeatureExtractor( + encoder=encoder, layer_idx_list=enc_cfg.multi_layer_feat.layer_idx_list, aggregator=aggregator + ) + encoders[key] = encoder + preprocessor[key] = ( + self.from_config_dict(enc_cfg.get("preprocessor")) + if enc_cfg.get("preprocessor", None) is not None + else None + ) + self.encoders = nn.ModuleDict(encoders) + self.preprocessor = nn.ModuleDict(preprocessor) + + self.speaker_model = None + self.speaker_seg_len = None + if "speaker_model" in cfg and cfg.speaker_model.get("model", None) is not None: + self.speaker_model = EncDecSpeakerLabelModel(cfg=cfg.speaker_model.model) + self.speaker_model.spec_augmentation = self.spec_augmentation + self.speaker_seg_len = 1 + if "preprocessor" in cfg.speaker_model.model: + self.speaker_seg_len = int( + cfg.speaker_model.segment_length_in_secs // cfg.speaker_model.model.preprocessor.window_stride + ) + self.ref_model = cfg.get("ref_model", None) + if self.ref_model is not None: + if self.ref_model not in self.encoders and ( + self.ref_model != "speaker_model" and 
self.speaker_model is not None + ): + if self.ref_model == "speaker_model": + raise ValueError(f"ref_model is `{self.ref_model}` but speaker_model is None") + raise ValueError(f"ref_model `{self.ref_model}` not found in encoders [{encoders.keys()}]") + + self.modality_adapter = self.from_config_dict(cfg.modality_adapter) + if 'output_dim' not in cfg.modality_adapter and "d_model" in cfg.modality_adapter: # e.g., conformer encoder + self.proj = nn.Linear(cfg.modality_adapter.d_model, cfg.output_dim) + else: + self.proj = nn.Identity() + + def maybe_preprocess_audio( + self, + preprocessor, + input_signal=None, + input_signal_length=None, + processed_signal=None, + processed_signal_length=None, + ): + has_input_signal = input_signal is not None and input_signal_length is not None + has_processed_signal = processed_signal is not None and processed_signal_length is not None + if (has_input_signal ^ has_processed_signal) is False: + raise ValueError( + f"{self.__class__} Arguments ``input_signal`` and ``input_signal_length`` are mutually exclusive " + " with ``processed_signal`` and ``processed_signal_len`` arguments." + ) + + if not has_processed_signal and preprocessor is not None: + processed_signal, processed_signal_length = preprocessor( + input_signal=input_signal, + length=input_signal_length, + ) + elif not has_processed_signal and preprocessor is None: + processed_signal, processed_signal_length = input_signal, input_signal_length + return processed_signal, processed_signal_length + + def forward_speaker( + self, input_signal=None, input_signal_length=None, processed_signal=None, processed_signal_length=None + ): + has_input_signal = input_signal is not None and input_signal_length is not None + has_processed_signal = processed_signal is not None and processed_signal_length is not None + if (has_input_signal ^ has_processed_signal) is False: + raise ValueError( + f"{self.__class__} Arguments ``input_signal`` and ``input_signal_length`` are mutually exclusive " + " with ``processed_signal`` and ``processed_signal_len`` arguments." 
+ ) + if not has_processed_signal: + processed_signal, processed_signal_length = self.speaker_model.preprocessor( + input_signal=input_signal, + length=input_signal_length, + ) + # Spec augment is not applied during evaluation/testing + if self.spec_augmentation is not None and self.training: + processed_signal = self.spec_augmentation(input_spec=processed_signal, length=processed_signal_length) + + # encoded has shape [B, D, T], length has shape [B] + encoded, encoded_len = self.speaker_model.encoder( + audio_signal=processed_signal, length=processed_signal_length + ) + + # pad encoded to be divisible by speaker_seg_len + if encoded.shape[2] % self.speaker_seg_len != 0: + encoded = torch.cat( + [ + encoded, + torch.zeros( + encoded.shape[0], + encoded.shape[1], + self.speaker_seg_len - encoded.shape[2] % self.speaker_seg_len, + device=encoded.device, + ), + ], + dim=2, + ) + + B, D, T = encoded.shape + num_seg = int(T // self.speaker_seg_len) + encoded = encoded.view(int(B * num_seg), D, self.speaker_seg_len) # [B*num_seg, D, seg_len] + encoded_len_seg = (encoded_len // self.speaker_seg_len).repeat_interleave(num_seg) # [B*seg_len] + + _, embeds = self.speaker_model.decoder(encoder_output=encoded, length=encoded_len_seg) + + embeds = embeds.view(B, -1, num_seg) # [B, D, num_seg] + + embeds_len = encoded_len // self.speaker_seg_len # [B] + return embeds, embeds_len + + def forward( + self, + input_signal=None, + input_signal_length=None, + processed_signal=None, + processed_signal_length=None, + ): + encoded_list = [] + encoded_len_list = [] + ref_idx = None + for key, encoder in self.encoders.items(): + curr_processed_signal, curr_processed_signal_length = self.maybe_preprocess_audio( + self.preprocessor[key], input_signal, input_signal_length, processed_signal, processed_signal_length + ) + # Spec augment is not applied during evaluation/testing + if self.spec_augmentation is not None and self.training: + processed_signal = self.spec_augmentation( + input_spec=curr_processed_signal, length=curr_processed_signal_length + ) + encoded, encoded_len = encoder(audio_signal=curr_processed_signal, length=curr_processed_signal_length) + if key == self.ref_model: + ref_idx = len(encoded_list) + encoded_list.append(encoded) + encoded_len_list.append(encoded_len) + + if self.speaker_model is not None: + speaker_embeds, speaker_embeds_len = self.forward_speaker( + input_signal=input_signal, + input_signal_length=input_signal_length, + processed_signal=processed_signal, + processed_signal_length=processed_signal_length, + ) + encoded_list.append(speaker_embeds) + encoded_len_list.append(speaker_embeds_len) + encoded_list, encoded_len_list = self.aggregator( + encoded=encoded_list, encoded_len=encoded_len_list, ref_idx=ref_idx + ) + encoded, encoded_len = self.modality_adapter(audio_signal=encoded_list, length=encoded_len_list) + # b, c, t -> b, t, c + encoded = self.proj(encoded.transpose(1, 2)) + return encoded, encoded_len diff --git a/nemo/collections/multimodal/speech_llm/parts/__init__.py b/nemo/collections/multimodal/speech_llm/parts/__init__.py new file mode 100644 index 000000000000..d0c4b8bd282c --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/parts/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import ( + ceil_to_nearest, + get_num_samples_from_files, + maybe_cast_to_list, + shift_tokens_by_multi_audios, +) diff --git a/nemo/collections/multimodal/speech_llm/parts/mixins/__init__.py b/nemo/collections/multimodal/speech_llm/parts/mixins/__init__.py new file mode 100644 index 000000000000..d9155f923f18 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/parts/mixins/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/nemo/collections/multimodal/speech_llm/parts/mixins/adapter_mixin.py b/nemo/collections/multimodal/speech_llm/parts/mixins/adapter_mixin.py new file mode 100644 index 000000000000..6071bda87057 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/parts/mixins/adapter_mixin.py @@ -0,0 +1,75 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import List, Optional, Union + +import torch + +from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel +from nemo.collections.nlp.parts.mixins.nlp_adapter_mixins import NLPAdapterModelMixin, replace_prefix +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP, PEFTConfig +from nemo.utils import logging + + +class SpeechLLMAdapterMixin(NLPAdapterModelMixin): + def load_adapters( + self, + filepath: str, + peft_cfgs: Optional[Union[PEFTConfig, List[PEFTConfig]]] = None, + map_location: str = None, + ): + """ + Utility method that restores only the adapter module(s), and not the entire model itself. + This allows the sharing of adapters which are often just a fraction of the size of the full model, + enabling easier delivery. + + .. note:: + + During restoration, assumes that the model does not currently already have one or more adapter modules. + + Args: + filepath: Filepath of the .ckpt or .nemo file. + peft_cfgs: One or more PEFTConfig objects that specify the PEFT method configuration. 
+ If none, will infer from the .nemo checkpoint + map_location: Pytorch flag, where to place the adapter(s) state dict(s). + """ + + # Determine device + if map_location is None: + if torch.cuda.is_available(): + map_location = 'cuda' + else: + map_location = 'cpu' + + if filepath.endswith('.nemo'): + conf, state_dict = self._get_config_and_state_dict_from_nemo(filepath, map_location) + elif filepath.endswith('.ckpt'): + state_dict = torch.load(filepath, map_location)['state_dict'] + else: + raise RuntimeError(f"{filepath} is not nemo file or ckpt file") + if not peft_cfgs: + assert filepath.endswith( + '.nemo' + ), "Inferring peft scheme is only supported for .nemo checkpoints. Please supply the `peft_cfgs` argument." + peft_cfgs = [PEFT_CONFIG_MAP[conf.peft.peft_scheme](conf)] + if self.cfg.megatron_amp_O2: + state_dict = {replace_prefix(k, 'model.', 'model.module.'): v for k, v in state_dict.items()} + self.add_adapter(peft_cfgs) + if not self.ptuning_only_and_non_first_stage: + target_keys = self.adapter_keys.union(self.tunable_base_param_keys) + if set(state_dict.keys()) != target_keys: + logging.warning( + f"Unexpected keys found in state_dict: {set(state_dict.keys()) - target_keys}, missing keys in state_dict: {target_keys - set(state_dict.keys())}" + ) + super(MegatronGPTModel, self).load_state_dict(state_dict, strict=False) diff --git a/nemo/collections/multimodal/speech_llm/parts/utils/__init__.py b/nemo/collections/multimodal/speech_llm/parts/utils/__init__.py new file mode 100644 index 000000000000..d9155f923f18 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/parts/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/nemo/collections/multimodal/speech_llm/parts/utils/data_utils.py b/nemo/collections/multimodal/speech_llm/parts/utils/data_utils.py new file mode 100644 index 000000000000..92a3548f9337 --- /dev/null +++ b/nemo/collections/multimodal/speech_llm/parts/utils/data_utils.py @@ -0,0 +1,157 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
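[Editorial note, not part of the patch] A hypothetical restore-then-load-adapters sketch for the SpeechLLMAdapterMixin.load_adapters() method above. The file paths are placeholders, restore_from/NLPDDPStrategy usage follows the standard NeMo pattern rather than anything shown in this patch, and the explicit LoraPEFTConfig branch assumes the model config carries the usual peft.lora_tuning section.

import pytorch_lightning as pl

from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
from nemo.collections.nlp.parts.peft_config import LoraPEFTConfig

trainer = pl.Trainer(devices=1, accelerator="gpu", strategy=NLPDDPStrategy())

# Placeholder paths: a full base checkpoint plus a small adapter-only checkpoint.
model = ModularAudioGPTModel.restore_from("speech_llm_base.nemo", trainer=trainer)

# For a .nemo adapter checkpoint the PEFT scheme can be inferred from its stored config.
model.load_adapters("speech_llm_lora_adapters.nemo")

# For a .ckpt file the PEFT config must be supplied explicitly
# (assumes cfg.peft.lora_tuning is present in the model config):
# model.load_adapters("speech_llm_lora_adapters.ckpt", peft_cfgs=LoraPEFTConfig(model.cfg))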
+ +from typing import List, Optional + +import numpy as np +import torch + + +def maybe_cast_to_list(x): + if isinstance(x, np.ndarray): + return [item.tolist() for item in x] + return x + + +def ceil_to_nearest(n, m): + return (n + m - 1) // m * m + + +def get_num_samples_from_files(file_list): + if isinstance(file_list, str): + file_list = file_list.split(',') + num_samples = [] + for file in file_list: + with open(file, 'r') as f: + lines = list(f.readlines()) + num = len(lines) + if lines[-1] == '\n': + num -= 1 + num_samples.append(num) + return num_samples + + +def shift_tokens_by_multi_audios( + context_tokens, context_lengths, audio_feat_lens, context_start_idx, encoder_max_length +): + """ + split and shift the context tokens by the audio segments, then concatenate them back. This function assumes that the whole context + starts and ends with text tokens, and the audio segments are in between the text tokens. The audio segments are not allowed to be adjacent to each other. + Args: + context_tokens: tensor of shape [batch, max_context_len] + context_lengths: tensor of shape [batch,] + audio_feat_lens: List[List[int]] + context_start_idx: List[List[int]] + encoder_max_length: int + """ + new_context_tokens = [] + for i in range(context_tokens.shape[0]): + start_idx_list_i = context_start_idx[i] + [context_lengths[i]] + input_len_list = [start_idx_list_i[j + 1] - start_idx_list_i[j] for j in range(len(start_idx_list_i) - 1)] + context_tokens_list = context_tokens[i][: context_lengths[i]].split(input_len_list) + context_tokens_i = [context_tokens_list[0]] + for j in range(1, len(context_tokens_list)): + context_tokens_i.append( + torch.zeros(audio_feat_lens[i][j - 1], dtype=torch.long, device=context_tokens.device) + ) + context_tokens_i.append(context_tokens_list[j]) + context_tokens_i = torch.cat(context_tokens_i) + context_tokens_i = torch.nn.functional.pad( + context_tokens_i, (0, encoder_max_length - context_tokens_i.shape[0]) + ) + new_context_tokens.append(context_tokens_i) + new_context_tokens = torch.stack(new_context_tokens) + return new_context_tokens + + +def get_nested_dict_value(d, key, sep="."): + """ + Get the value of a nested dict given a key + Args: + d: dict + key: str + """ + for k in key.split(sep): + d = d[k] + return d + + +def align_feat_seq_list( + seq_list: List[torch.Tensor], + seq_len_list: List[torch.Tensor], + mode: str = "min", + pooling: str = 'mean', + target_len: Optional[int] = None, +): + """ + Align a list of feature sequences to the same length by repeating or discarding frames. 
+ Args: + seq_list: List[torch.Tensor], list of tensors of shape [batch, hidden_size, seq_len] + seq_len_list: List[torch.Tensor], list of tensors of shape [batch,] + mode: str, "min" or "max" + pooling: str, "mean", "max", or "min" + Returns: + new_seq_list: List[torch.Tensor], list of tensors of shape [batch, hidden_size, new_seq_len] + new_seq_len_list: List[torch.Tensor], list of tensors of shape [batch,] + """ + MODES = ["min", "max"] + if mode not in MODES: + raise ValueError(f"mode {mode} not supported, available modes: {MODES}") + POOLING = ["mean", "max", "min", "avg"] + if pooling not in POOLING: + raise ValueError(f"pooling {pooling} not supported, available modes: {POOLING}") + + new_seq_len_list = [] + new_seq_list = [] + + if target_len is None: + target_len = [x.size(-1) for x in seq_list] + target_len = min(target_len) if mode == "min" else max(target_len) + + for seq, seq_len in zip(seq_list, seq_len_list): + curr_len = seq.size(-1) + if curr_len > target_len: + ratio = round(curr_len / target_len) + res = abs(ratio * target_len - curr_len) + if ratio * target_len > curr_len: # e.g., ratio = 1.9 + # repeat the last res frames + seq = torch.cat([seq, seq[:, :, -res:]], dim=-1) + seq_len += res * (seq_len > target_len).long() + elif ratio * target_len < curr_len: # e.g., ratio = 2.1 + # discard the last res frames + seq = seq[:, :, :-res] + seq_len -= res * (seq_len > target_len).long() + new_seq = seq.reshape(seq.size(0), seq.size(1), ratio, target_len) + if pooling == "min": + new_seq = new_seq.min(dim=2) + elif pooling == "max": + new_seq = new_seq.max(dim=2) + else: + new_seq = new_seq.mean(dim=2) + new_seq_len = torch.round(seq_len / ratio).long() + else: # curr_len <= target_len + ratio = round(target_len / curr_len) + res = abs(ratio * curr_len - target_len) + new_seq = torch.repeat_interleave(seq, ratio, dim=-1) + new_seq_len = seq_len * ratio + if ratio * curr_len > target_len: # e.g., ratio = 1.9 + new_seq = new_seq[:, :, :target_len] + new_seq_len = ( + seq_len * ratio - (ratio * seq_len - target_len) * (ratio * seq_len > target_len).long() + ) # subtract additional frames + elif ratio * curr_len < target_len: # e.g., ratio = 2.1 + new_seq = torch.cat([new_seq, seq[:, :, -res:]], dim=-1) + new_seq_list.append(new_seq) + new_seq_len_list.append(new_seq_len) + return new_seq_list, new_seq_len_list diff --git a/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py b/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py index ea56429f4de1..536fc5bff7c8 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py @@ -174,7 +174,7 @@ def forward(self, **kwargs): the superclass by the square root of the hidden size specified in the configuration. 
""" embeddings = super().forward(**kwargs) - return embeddings * torch.tensor(self.config.hidden_size ** 0.5, dtype=embeddings.dtype) + return embeddings * torch.tensor(self.config.hidden_size**0.5, dtype=embeddings.dtype) class MegatronGPTExportableModel(torch.nn.Module, Exportable): @@ -196,11 +196,14 @@ def __init__(self, model): def forward(self, tokens, position_ids, attention_mask): if self.fp8_enabled and HAVE_TE: - with transformer_engine.pytorch.onnx_export(self.fp8_enabled), transformer_engine.pytorch.fp8_autocast( - enabled=self.fp8_enabled, fp8_recipe=self.fp8_recipe - ), torch.no_grad(), torch.inference_mode(), torch.autocast( - 'cuda', dtype=self.dtype - ), warnings.catch_warnings(): + with ( + transformer_engine.pytorch.onnx_export(self.fp8_enabled), + transformer_engine.pytorch.fp8_autocast(enabled=self.fp8_enabled, fp8_recipe=self.fp8_recipe), + torch.no_grad(), + torch.inference_mode(), + torch.autocast('cuda', dtype=self.dtype), + warnings.catch_warnings(), + ): warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning, module=r'.*') assert tokens.shape == position_ids.shape assert attention_mask.shape[2] == attention_mask.shape[3] == tokens.shape[1] == position_ids.shape[1] @@ -211,9 +214,12 @@ def forward(self, tokens, position_ids, attention_mask): labels=None, ) else: - with torch.no_grad(), torch.inference_mode(), torch.autocast( - 'cuda', dtype=self.dtype - ), warnings.catch_warnings(): + with ( + torch.no_grad(), + torch.inference_mode(), + torch.autocast('cuda', dtype=self.dtype), + warnings.catch_warnings(), + ): warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning, module=r'.*') assert tokens.shape == position_ids.shape assert attention_mask.shape[2] == attention_mask.shape[3] == tokens.shape[1] == position_ids.shape[1] @@ -509,7 +515,7 @@ def setup_optimizer_param_groups(self): self._optimizer_param_groups = get_params_for_weight_decay_optimization(self.model) def setup_mcore_distributed_parallel(self): - """Set up mcore distributed data parallel """ + """Set up mcore distributed data parallel""" if self.with_distributed_adam and self.use_mcore_dist_optim: config = get_model_config(self.model[0]) ddp_config = DistributedDataParallelConfig( @@ -641,7 +647,10 @@ def fwd_bwd_step(self, dataloader_iter, forward_only, first_val_step=None): if self.validation_param_sync_overlap: param_sync_func = self.sync_overlap_parameters elif not self.use_mcore_dist_optim: - no_sync_func = partial(self._optimizer.no_sync, greedy_grad_copy=self.megatron_amp_O2,) + no_sync_func = partial( + self._optimizer.no_sync, + greedy_grad_copy=self.megatron_amp_O2, + ) grad_sync_func = self.reduce_overlap_gradients param_sync_func = self.sync_overlap_parameters else: @@ -744,9 +753,9 @@ def training_step_fwd_bwd_step_call(self, dataloader_iter, forward_only): def training_step(self, dataloader_iter): """ - We pass the dataloader iterator function to the micro-batch scheduler. - The input batch to each micro-batch is fetched using the dataloader function - in the micro-batch fwd function. + We pass the dataloader iterator function to the micro-batch scheduler. + The input batch to each micro-batch is fetched using the dataloader function + in the micro-batch fwd function. """ # Initialize userbuffer communicators. 
if self.initialize_ub: @@ -877,7 +886,11 @@ def training_step(self, dataloader_iter): if self.log_memory_usage: mem_reserved = torch.cuda.max_memory_reserved() self.log( - 'peak_memory_usage', mem_reserved, prog_bar=True, rank_zero_only=True, batch_size=1, + 'peak_memory_usage', + mem_reserved, + prog_bar=True, + rank_zero_only=True, + batch_size=1, ) ## logging @@ -901,20 +914,29 @@ def training_step(self, dataloader_iter): lr = self._optimizer.param_groups[0]['lr'] self.log('lr', lr, rank_zero_only=True, batch_size=1) self.log( - 'global_step', self.trainer.global_step, prog_bar=True, rank_zero_only=True, batch_size=1, + 'global_step', + self.trainer.global_step, + prog_bar=True, + rank_zero_only=True, + batch_size=1, ) consumed_samples = self._compute_consumed_samples_after_training_step() # TODO: make sure compute_consumed_samples works for pipeline parallelism self.log( - 'consumed_samples', consumed_samples, prog_bar=True, rank_zero_only=True, batch_size=1, + 'consumed_samples', + consumed_samples, + prog_bar=True, + rank_zero_only=True, + batch_size=1, ) if self.rampup_batch_size: self.prev_global_batch_size = current_global_batch_size self.prev_consumed_samples = consumed_samples num_microbatch_calculator.update( - consumed_samples=consumed_samples, consistency_check=False, + consumed_samples=consumed_samples, + consistency_check=False, ) current_global_batch_size = num_microbatch_calculator.current_global_batch_size self.log('global_batch_size', current_global_batch_size, prog_bar=True, rank_zero_only=True, batch_size=1) @@ -923,20 +945,20 @@ def training_step(self, dataloader_iter): return loss_mean def backward(self, *args, **kwargs): - """ LightningModule hook to do backward. - We want this to do nothing since we run backward in the fwd/bwd functions from megatron-core. - No need to call it here. + """LightningModule hook to do backward. + We want this to do nothing since we run backward in the fwd/bwd functions from megatron-core. + No need to call it here. """ return def optimizer_zero_grad(self, *args, **kwargs): - """ LightningModule hook to zero grad. - We want this to do nothing as we are zeroing grads during the training_step. + """LightningModule hook to zero grad. + We want this to do nothing as we are zeroing grads during the training_step. """ return def _append_sequence_parallel_module_grads(self, module, grads): - """ Helper method for allreduce_sequence_parallel_gradients""" + """Helper method for allreduce_sequence_parallel_gradients""" for param in module.parameters(): sequence_parallel_param = getattr(param, 'sequence_parallel', False) or getattr( @@ -954,9 +976,9 @@ def _append_sequence_parallel_module_grads(self, module, grads): grads.append(grad.data) def allreduce_sequence_parallel_gradients(self): - """ All-reduce layernorm parameters across model parallel nodes when sequence parallelism is used. - Modified from megatron-lm: - https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/blob/3f91f09bb2ab32f9904b47f46f19d2fc3f518ed8/megatron/training.py#L425 + """All-reduce layernorm parameters across model parallel nodes when sequence parallelism is used. + Modified from megatron-lm: + https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/blob/3f91f09bb2ab32f9904b47f46f19d2fc3f518ed8/megatron/training.py#L425 """ grads = [] @@ -974,8 +996,7 @@ def allreduce_sequence_parallel_gradients(self): buf.copy_(synced) def allreduce_fsdp_sharding_omitted_gradients(self): - """ All-reduce gradients of FSDP-sharding-omitted parameters in sharding domain (data-parallel domain). 
- """ + """All-reduce gradients of FSDP-sharding-omitted parameters in sharding domain (data-parallel domain).""" assert isinstance(self.model, torch.nn.Module) grads = [] for param in self.model.parameters(): @@ -1022,16 +1043,16 @@ def allreduce_first_last_embeddings(self): torch.distributed.all_reduce(grad, group=parallel_state.get_embedding_group()) def _make_data_iterator_list(self, data_iterator: Iterator) -> List[Iterator]: - """ Convert data iterator into form expected by Megatron - - With interleaved pipeline parallelism, Megatron expects a - list of one data iterator per model chunk. Each model - chunk independently gets data from its data iterator, so - we need to interact with the data iterator multiple times - for each microbatch step. Instead of incorporating this - logic into the data loader, we cache the iterator's output - to the first model chunk and reuse it in the other model - chunks. + """Convert data iterator into form expected by Megatron + + With interleaved pipeline parallelism, Megatron expects a + list of one data iterator per model chunk. Each model + chunk independently gets data from its data iterator, so + we need to interact with the data iterator multiple times + for each microbatch step. Instead of incorporating this + logic into the data loader, we cache the iterator's output + to the first model chunk and reuse it in the other model + chunks. """ if not isinstance(self.model, list) or len(self.model) == 1: @@ -1159,7 +1180,10 @@ def fwd_output_and_loss_func(dataloader_iter, model, checkpoint_activations_all_ required_keys.update(('labels', 'loss_mask')) if self.get_attention_mask_from_fusion and 'attention_mask' in required_keys: required_keys.remove('attention_mask') - batch = {key: val.cuda(non_blocking=True) if key in required_keys else None for key, val in batch.items()} + batch = { + key: val.cuda(non_blocking=True) if key in required_keys and isinstance(val, torch.Tensor) else None + for key, val in batch.items() + } # slice batch along sequence dimension for context parallelism batch = self.get_batch_on_this_context_parallel_rank(batch) @@ -1323,10 +1347,10 @@ def id_func(output_tensor): def validation_step(self, dataloader_iter, dataloader_idx=0): """ - Our dataloaders produce a micro-batch and then we fetch - a number of microbatches depending on the global batch size and model parallel size - from the dataloader to produce a list of microbatches. - The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions. + Our dataloaders produce a micro-batch and then we fetch + a number of microbatches depending on the global batch size and model parallel size + from the dataloader to produce a list of microbatches. + The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions. """ mode = 'test' if self.trainer.testing else 'val' # Initialize userbuffer communicators. 
@@ -1387,7 +1411,9 @@ def on_validation_epoch_end(self): if self.loss_broadcast_src_rank is None: self.loss_broadcast_src_rank = parallel_state.get_pipeline_model_parallel_last_rank() torch.distributed.broadcast( - averaged_loss, self.loss_broadcast_src_rank, group=parallel_state.get_pipeline_model_parallel_group(), + averaged_loss, + self.loss_broadcast_src_rank, + group=parallel_state.get_pipeline_model_parallel_group(), ) self.log('val_loss', averaged_loss, prog_bar=True, rank_zero_only=True, batch_size=1) @@ -1492,7 +1518,10 @@ def build_train_valid_test_datasets(self): dataset_type = MockGPTDataset if mock_dataset else GPTDataset self._train_ds, self._validation_ds, self._test_ds = BlendedMegatronDatasetBuilder( - dataset_type, train_valid_test_num_samples, is_dataset_built_on_rank, dataset_config, + dataset_type, + train_valid_test_num_samples, + is_dataset_built_on_rank, + dataset_config, ).build() if self._train_ds is not None: @@ -1702,16 +1731,16 @@ def list_available_models(self): return None def transfer_batch_to_device(self, batch: Any, device: torch.device, dataloader_idx: int) -> Any: - """ PTL hook: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#transfer-batch-to-device - When using pipeline parallelism, we need the global batch to remain on the CPU, - since the memory overhead will be too high when using a large number of microbatches. - Microbatches are transferred from CPU to GPU inside the pipeline. + """PTL hook: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#transfer-batch-to-device + When using pipeline parallelism, we need the global batch to remain on the CPU, + since the memory overhead will be too high when using a large number of microbatches. + Microbatches are transferred from CPU to GPU inside the pipeline. """ return batch def _validate_trainer(self): - """ Certain trainer configurations can break training. - Here we try to catch them and raise an error. + """Certain trainer configurations can break training. + Here we try to catch them and raise an error. """ if self.trainer.accumulate_grad_batches > 1: raise ValueError( @@ -1788,9 +1817,9 @@ def on_load_checkpoint(self, checkpoint) -> None: def on_validation_model_zero_grad(self) -> None: """ - Skip gradient zeroing at the beginning of validation routine. - This is needed when overlapping the AllGather of the updated parameters with the following valdation step. - """ + Skip gradient zeroing at the beginning of validation routine. + This is needed when overlapping the AllGather of the updated parameters with the following valdation step. + """ if not self.validation_param_sync_overlap: super().on_validation_model_zero_grad() @@ -1859,9 +1888,9 @@ def initialize_last_rank_embeddings(self): parallel_state.set_virtual_pipeline_model_parallel_rank(0) def _reset_activation_checkpointing_args(self): - """ Disables activation checkpointing completely and saves the values so that - _restore_activation_checkpointing_args can restore them later. This function must always be - called before _restore_activation_checkpointing_args. + """Disables activation checkpointing completely and saves the values so that + _restore_activation_checkpointing_args can restore them later. This function must always be + called before _restore_activation_checkpointing_args. """ # Store values to restore them later. 
self.last_activations_checkpoint_granularity = self.cfg.activations_checkpoint_granularity @@ -1888,9 +1917,9 @@ def _reset_activation_checkpointing_args(self): module.language_model.encoder.activations_checkpoint_layers_per_pipeline = None def _restore_activation_checkpointing_args(self): - """ Restores the activation checkpointing parameters using the values saved by - _reset_activation_checkpointing_args. This function must never be called before - _reset_activation_checkpointing_args. + """Restores the activation checkpointing parameters using the values saved by + _reset_activation_checkpointing_args. This function must never be called before + _reset_activation_checkpointing_args. """ # Restore config values. self.cfg.activations_checkpoint_granularity = self.last_activations_checkpoint_granularity @@ -1917,9 +1946,9 @@ def _restore_activation_checkpointing_args(self): ) def _reset_sequence_parallelism_args(self): - """ Disables sequence parallelism completely and saves the values so that - _restore_sequence_parallelism_args can restore them later. This function must always be - called before _restore_sequence_parallelism_args. + """Disables sequence parallelism completely and saves the values so that + _restore_sequence_parallelism_args can restore them later. This function must always be + called before _restore_sequence_parallelism_args. """ # Store values to restore them later. self.last_sequence_parallel = self.cfg.sequence_parallel @@ -1936,9 +1965,9 @@ def _reset_sequence_parallelism_args(self): mod.sequence_parallel = False def _restore_sequence_parallelism_args(self): - """ Restores the sequence parallelism parameters using the values saved by - _reset_sequence_parallelism_args. This function must never be called before - _reset_sequence_parallelism_args. + """Restores the sequence parallelism parameters using the values saved by + _reset_sequence_parallelism_args. This function must never be called before + _reset_sequence_parallelism_args. """ # Restore config values. self.cfg.sequence_parallel = self.last_sequence_parallel @@ -1952,10 +1981,10 @@ def _restore_sequence_parallelism_args(self): mod.sequence_parallel = self.last_sequence_parallel def build_transformer_config(self) -> TransformerConfig: - """ Builds the megatron core gpt transformer config for the model. - For attributes in the nemo model config that are the same - as the megatron core TransformerConfig, we will use the value from the nemo model config. - For attributes in TransformerConfig that are not in the nemo model config, we add custom logic. + """Builds the megatron core gpt transformer config for the model. + For attributes in the nemo model config that are the same + as the megatron core TransformerConfig, we will use the value from the nemo model config. + For attributes in TransformerConfig that are not in the nemo model config, we add custom logic. 
""" normalization = self.cfg.get('normalization', 'layernorm').lower() diff --git a/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py b/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py index d7a5cf3f26bf..1b59b90d2968 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py @@ -354,7 +354,7 @@ def fwd_bwd_step(self, dataloader_iter, forward_only, first_val_step=None): token_count_avg = sum(batch['token_count']) / len(batch['token_count']) # Pass only torch.Tensor to prevent errors when process get_iterator_k_split() - batch = {k: v for k, v in batch.items() if isinstance(v, torch.Tensor)} + batch = {k: v for k, v in batch.items() if isinstance(v, (torch.Tensor, list))} _, seq_length = batch['tokens'].shape data_iter = get_iterator_k_split(batch, get_num_microbatches()) @@ -367,7 +367,10 @@ def fwd_bwd_step(self, dataloader_iter, forward_only, first_val_step=None): grad_sync_func = None param_sync_func = None if not forward_only and self.with_distributed_adam: - no_sync_func = partial(self._optimizer.no_sync, greedy_grad_copy=self.megatron_amp_O2,) + no_sync_func = partial( + self._optimizer.no_sync, + greedy_grad_copy=self.megatron_amp_O2, + ) grad_sync_func = self.reduce_overlap_gradients param_sync_func = self.sync_overlap_parameters @@ -855,13 +858,19 @@ def setup_training_dataloader(self): if hasattr(self, '_train_ds'): consumed_samples = self.compute_consumed_samples(0) self._train_dl = self.build_data_loader( - dataset=self._train_ds, data_cfg=self.cfg.data.train_ds, consumed_samples=consumed_samples, + dataset=self._train_ds, + data_cfg=self.cfg.data.train_ds, + consumed_samples=consumed_samples, ) def setup_eval_dataloader(self, datasets, data_cfg): dataloaders = [] for dataset in datasets: - eval_dl = self.build_data_loader(dataset=dataset, data_cfg=data_cfg, consumed_samples=0,) + eval_dl = self.build_data_loader( + dataset=dataset, + data_cfg=data_cfg, + consumed_samples=0, + ) dataloaders.append(eval_dl) return dataloaders diff --git a/nemo/collections/nlp/modules/common/megatron/utils.py b/nemo/collections/nlp/modules/common/megatron/utils.py index 48234459453e..75c50146bfab 100644 --- a/nemo/collections/nlp/modules/common/megatron/utils.py +++ b/nemo/collections/nlp/modules/common/megatron/utils.py @@ -22,6 +22,8 @@ from torch import Tensor +from nemo.utils import logging, logging_mode + try: from apex.normalization import MixedFusedRMSNorm from apex.normalization.fused_layer_norm import FusedLayerNorm # NOQA @@ -310,9 +312,7 @@ def make_inference_attention_mask_3d(source_block, target_block, pad_id): def make_inference_history_mask_3d(block): batch, length = block.shape arange = torch.arange(length, device=block.device) - history_mask = (arange[None,] <= arange[:, None])[ - None, - ] + history_mask = (arange[None,] <= arange[:, None])[None,] history_mask = history_mask.expand(batch, length, length) return history_mask @@ -413,14 +413,56 @@ def get_all_params_for_weight_decay_optimization( return tuple(filter(lambda g: len(g['params']) > 0, param_groups)) -def get_iterator_k_split(batch: List[torch.Tensor], num_microbatches: int) -> Iterator: +def split_list(inputs, num_chunks): + """ + Split a list into equal sized chunks + """ + chunk_size = len(inputs) // num_chunks + assert len(inputs) % chunk_size == 0, "Issue with batch size configuration!" 
+ return [inputs[i : i + chunk_size] for i in range(0, len(inputs), chunk_size)] + + +def get_iterator_k_split(batch: Union[Dict, List[torch.Tensor]], num_microbatches: int) -> Iterator: + """ + Split a batch into k microbatches, where the batch size is divisible by k. Batch could be + a dictionary of tensors or a list of tensors. A dictionary batch could also have items of List type, + as long as the length of that list is the same as the batch size. + """ if isinstance(batch, dict): - items = list(batch.items()) + discard_items = [k for k, v in batch.items() if not isinstance(v, (torch.Tensor, list))] + if len(discard_items) > 0: + logging.warning( + f"Only support splitting torch.Tensor and List[torch.Tensor]. Discarding the following keys from the batch: {discard_items}", + mode=logging_mode.ONCE, + ) + + batch = {k: v for k, v in batch.items() if isinstance(v, (torch.Tensor, list))} + tensor_items = {k: v for k, v in batch.items() if isinstance(v, torch.Tensor)} + list_items = {k: v for k, v in batch.items() if isinstance(v, list)} + + # Split tensor items + items = list(tensor_items.items()) assert items[0][1].shape[0] % num_microbatches == 0, "Issue with batch size configuration!" split_batch = [torch.tensor_split(item[1], num_microbatches, dim=0) for item in items] - microbatches = [[(items[i][0], split_batch[i][j]) for i in range(len(items))] for j in range(num_microbatches)] + + if len(list_items) == 0: + # Only have tensor items + microbatches = [ + [(items[i][0], split_batch[i][j]) for i in range(len(items))] for j in range(num_microbatches) + ] + else: + # Split list items + list_items = list(list_items.items()) + split_list_batch = [split_list(item[1], num_microbatches) for item in list_items] + # Merge tensor and list items + all_keys = [item[0] for item in items] + [item[0] for item in list_items] + all_split_batch = split_batch + split_list_batch + microbatches = [ + [(all_keys[i], all_split_batch[i][j]) for i in range(len(all_keys))] for j in range(num_microbatches) + ] microbatches = [dict(elem) for elem in microbatches] else: + # Split a list of torch tensors assert batch[0].shape[0] % num_microbatches == 0, "Issue with batch size configuration!" 
split_batch = [ torch.tensor_split(item, num_microbatches, dim=0) if torch.is_tensor(item) else item for item in batch diff --git a/nemo/core/classes/common.py b/nemo/core/classes/common.py index cf39ed134768..97757b2e3826 100644 --- a/nemo/core/classes/common.py +++ b/nemo/core/classes/common.py @@ -219,7 +219,10 @@ def _validate_input_types(self, input_types=None, ignore_collections=False, **kw hasattr(value, 'neural_type') and is_semantic_typecheck_enabled() and not metadata.base_types[key].compare(value.neural_type) - in (NeuralTypeComparisonResult.SAME, NeuralTypeComparisonResult.GREATER,) + in ( + NeuralTypeComparisonResult.SAME, + NeuralTypeComparisonResult.GREATER, + ) ): error_msg = [ f"{input_types[key].compare(value.neural_type)} :", @@ -398,7 +401,10 @@ def __check_neural_type(self, obj, metadata: TypecheckMetadata, depth: int, name hasattr(obj, 'neural_type') and is_semantic_typecheck_enabled() and not type_val.compare(obj.neural_type) - in (NeuralTypeComparisonResult.SAME, NeuralTypeComparisonResult.GREATER,) + in ( + NeuralTypeComparisonResult.SAME, + NeuralTypeComparisonResult.GREATER, + ) ): raise TypeError( f"{type_val.compare(obj.neural_type)} : \n" @@ -711,6 +717,7 @@ def from_pretrained( return_config: bool = False, trainer: Optional['Trainer'] = None, save_restore_connector: SaveRestoreConnector = None, + return_model_file: Optional[bool] = False, ): """ Instantiates an instance of NeMo from NVIDIA NGC cloud @@ -726,6 +733,7 @@ def from_pretrained( strict: Passed to torch.load_state_dict. By default true. return_config: If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model. + return_model_file: If set to true, will return just the downloaded model file in cache Returns: A model instance of a particular model class or its underlying config (if return_config is set). 
@@ -751,6 +759,9 @@ def from_pretrained( model_name=model_name, refresh_cache=refresh_cache ) + if return_model_file: + return nemo_model_file_in_cache + instance = class_.restore_from( restore_path=nemo_model_file_in_cache, override_config_path=override_config_path, diff --git a/scripts/speech_recognition/convert_to_tarred_audio_dataset.py b/scripts/speech_recognition/convert_to_tarred_audio_dataset.py index 690010ad29ca..f0c7847b8c9b 100644 --- a/scripts/speech_recognition/convert_to_tarred_audio_dataset.py +++ b/scripts/speech_recognition/convert_to_tarred_audio_dataset.py @@ -124,7 +124,11 @@ ) parser.add_argument( - "--metadata_path", required=False, default=None, type=str, help="Path to metadata file for the dataset.", + "--metadata_path", + required=False, + default=None, + type=str, + help="Path to metadata file for the dataset.", ) parser.add_argument( @@ -165,7 +169,10 @@ ) parser.add_argument( - "--buckets_num", type=int, default=1, help="Number of buckets to create based on duration.", + "--buckets_num", + type=int, + default=1, + help="Number of buckets to create based on duration.", ) parser.add_argument( @@ -617,6 +624,15 @@ def _read_manifest(self, manifest_path: str, config: ASRTarredDatasetConfig): with open(manifest_path, 'r', encoding='utf-8') as m: for line in m: entry = json.loads(line) + audio_key = "audio_filepath" if "audio_filepath" in entry else "audio_file" + if audio_key not in entry: + raise KeyError(f"Manifest entry does not contain 'audio_filepath' or 'audio_file' key: {entry}") + audio_filepath = entry[audio_key] + if not os.path.isfile(audio_filepath) and not os.path.isabs(audio_filepath): + audio_filepath_abs = os.path.join(os.path.dirname(manifest_path), audio_filepath) + if not os.path.isfile(audio_filepath_abs): + raise FileNotFoundError(f"Could not find {audio_filepath} or {audio_filepath_abs}!") + entry[audio_key] = audio_filepath_abs if (config.max_duration is None or entry['duration'] < config.max_duration) and ( config.min_duration is None or entry['duration'] >= config.min_duration ): @@ -648,8 +664,7 @@ def _write_to_tar(self, tar, audio_filepath: str, squashed_filename: str) -> Non tar.addfile(ti, encoded_audio) def _create_shard(self, entries, target_dir, shard_id, manifest_folder): - """Creates a tarball containing the audio files from `entries`. - """ + """Creates a tarball containing the audio files from `entries`.""" if self.config.sort_in_shards: entries.sort(key=lambda x: x["duration"], reverse=False) diff --git a/tests/collections/multimodal/test_speechllm_models.py b/tests/collections/multimodal/test_speechllm_models.py new file mode 100644 index 000000000000..8698fed205ea --- /dev/null +++ b/tests/collections/multimodal/test_speechllm_models.py @@ -0,0 +1,266 @@ +# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
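The tarred-dataset manifest change above resolves relative audio paths against the manifest's own directory before adding files to a shard, accepting either the 'audio_filepath' or the 'audio_file' key. A standalone sketch of that resolution rule, using a hypothetical resolve_audio_path helper name, could look roughly like this:

import os

def resolve_audio_path(entry: dict, manifest_path: str) -> str:
    # Accept either manifest key, mirroring the script change above.
    audio_key = 'audio_filepath' if 'audio_filepath' in entry else 'audio_file'
    if audio_key not in entry:
        raise KeyError(f"Manifest entry has no 'audio_filepath' or 'audio_file' key: {entry}")
    path = entry[audio_key]
    # A relative path that does not exist as-is is resolved against the manifest directory.
    if not os.path.isfile(path) and not os.path.isabs(path):
        candidate = os.path.join(os.path.dirname(manifest_path), path)
        if not os.path.isfile(candidate):
            raise FileNotFoundError(f'Could not find {path} or {candidate}!')
        path = candidate
    return path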
+ +import os +import tempfile +from pathlib import Path + +import numpy as np +import pytest +import pytorch_lightning as pl +import torch +from megatron.core import parallel_state +from omegaconf import DictConfig, OmegaConf +from pytorch_lightning.plugins.environments import TorchElasticEnvironment + +from nemo.collections.multimodal.speech_llm.models import modular_models +from nemo.collections.multimodal.speech_llm.parts.utils.data_utils import shift_tokens_by_multi_audios +from nemo.collections.nlp.models.language_modeling.megatron.gpt_model import GPTModel +from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy + + +class ModularAudioGPTModel(modular_models.ModularAudioGPTModel): + # disable logging to avoid MisconfigurationException + def log(self, *args, **kwargs): + pass + + +def setup_module(): + pl.seed_everything(1) + # init model parallel needed for LLM loss + init_method = 'tcp://' + master_ip = 'localhost' + master_port = '6000' + init_method += master_ip + ':' + master_port + torch.distributed.init_process_group(backend='gloo', world_size=1, rank=0, init_method=init_method) + parallel_state.initialize_model_parallel(1, 1) + + +@pytest.fixture +def llm_model_config(): + this_test_dir = os.path.dirname(os.path.abspath(__file__)) + # Although most of the stuff in model is loaded from ckpt, we need configs + # for e.g. cfg.model.optim + config = OmegaConf.load( + os.path.join( + this_test_dir, + "../../../examples/multimodal/speech_llm/conf/modular_audio_gpt_config_peft.yaml", + ) + ) + # TODO(zhehuai): move the following to Test /home/TestData + config.model.restore_from_path = "/root/home/works/TestData/pretrained_models/megatron_gpt/gpt_pretrain_220m_len_4096_pos_alibi_step_595508_gbs256.nemo" + config.model.micro_batch_size = 2 + config.model.global_batch_size = 2 + config.model.data.validation_ds.manifest_filepath = ( + '/root/home/works/TestData/datasets/LibriSpeech/dev_clean_cleaned.json' + ) + config.model.data.train_ds.manifest_filepath = ( + '/root/home/works/TestData/datasets/LibriSpeech/dev_clean_cleaned.json' + ) + return config + + +@pytest.fixture +def trainer_config(): + config_trainer = DictConfig({}) + + if torch.cuda.is_available(): + accelerator = "gpu" + torch.set_default_device('cuda') + else: + accelerator = "cpu" + config_trainer.accelerator = accelerator + config_trainer.devices = 1 + config_trainer.num_nodes = 1 + config_trainer.max_epochs = 4 + config_trainer.max_steps = 1 + config_trainer.val_check_interval = 1.0 + + # for PyTorch Native AMP set precision=16 + config_trainer.precision = 32 + + # setup cluster environment parameters" + # use torch elastic cluster environment so `create_process_externally` is True + # the launcher is set to None. It will not try to spawn new processes. 
+ # It won't create the misconfiguration error because of the `interactive session` + os.environ["LOCAL_RANK"] = "0" + os.environ["RANK"] = "0" + os.environ["WORLD_SIZE"] = "1" + + strategy = NLPDDPStrategy() + plugins = [TorchElasticEnvironment()] + trainer = pl.Trainer(logger=False, plugins=plugins, strategy=strategy, **config_trainer) + return trainer, config_trainer + + +@pytest.fixture +def perception_model_config(): + preprocessor = {"_target_": "nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor"} + encoder = { + "_target_": "nemo.collections.asr.modules.ConformerEncoder", + "feat_in": 64, + "n_layers": 8, + "d_model": 64, + "self_attention_model": "rel_pos_local_attn", + "att_context_size": [128, 128], + } + + model_config = DictConfig( + { + "_target_": "nemo.collections.multimodal.speechllm.modules.speechllm_perception.AudioPerceptionModule", + "preprocessor": DictConfig(preprocessor), + "encoder": DictConfig(encoder), + "modality_adapter": DictConfig(encoder), + "output_dim": 1024, + } + ) + return model_config + + +@pytest.fixture +def test_batch(): + signal_len = torch.from_numpy(np.array([64000, 64000])) + transcript = torch.arange(10).reshape(2, 5).int() + tokens = transcript[:, :-1] + labels = transcript[:, 1:] + transcript_length = torch.Tensor([3, 2]).int() + # assuming context_lengths = [1, 1] + loss_mask = torch.Tensor([[0, 1, 1, 0], [0, 1, 0, 0]]) + batch = { + 'audio_signal_length': signal_len, + 'tokens': tokens, + 'tokens_length': transcript_length, + 'contexts': torch.arange(260).reshape(2, 130).int(), + 'context_lengths': torch.Tensor([1, 1]).int(), + 'labels': labels, + 'answers': labels, + 'loss_mask': loss_mask, + } + batch['audio_signal'] = torch.randn([2, 64000]) + return batch + + +@pytest.mark.skip(reason="nedd to move pretrained GPT model to /home/works/TestData first") +class TestModularAudioGPTModel: + @pytest.mark.unit + def test_init_and_train(self, llm_model_config, perception_model_config, trainer_config): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + + assert isinstance(model.model, GPTModel) + with tempfile.TemporaryDirectory() as tmpdir: + save_path = str(Path(tmpdir) / "model.nemo") + model.train() + model.save_to(save_path) + + @pytest.mark.unit + def test_prepare_llm_input(self, llm_model_config, perception_model_config, trainer_config, test_batch): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + model.cuda() + model.train() + batch = {key: val.cuda(non_blocking=True) for key, val in test_batch.items()} + encoder_input, attention_mask, labels, loss_mask, encoder_length = model.prepare_llm_input(batch) + assert encoder_input.shape == (17, 2, 768) + assert np.allclose(encoder_input.sum().cpu().detach().numpy(), 15.783691) + assert attention_mask.shape == (2, 1, 17, 17) + assert labels.shape == (2, 17) + assert np.allclose(loss_mask.sum(axis=1).cpu().numpy(), [2, 1]) + assert np.allclose(encoder_length.cpu().numpy(), (16, 15)) + + @pytest.mark.unit + def test_training_step(self, llm_model_config, perception_model_config, 
trainer_config, test_batch): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + model.cuda() + model.on_train_start() + model.setup() + model.train() + loss_mean = model.training_step(iter([test_batch]), None) + assert np.allclose(loss_mean.cpu().detach().numpy(), 5.7052) + + @pytest.mark.unit + def test_validation_step(self, llm_model_config, perception_model_config, trainer_config, test_batch): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + model.cuda() + model.train() + batch = {key: val.cuda(non_blocking=True) for key, val in test_batch.items()} + loss_mean = model.validation_step(iter([batch]), 0) + assert np.allclose(loss_mean['loss'].cpu().detach().numpy(), 5.7052) + + @pytest.mark.unit + def test_predict_step(self, llm_model_config, perception_model_config, trainer_config, test_batch): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + model.cuda() + model.train() + batch = {key: val.cuda(non_blocking=True) for key, val in test_batch.items()} + response = model.predict_step(batch, 0, 0) + ground_truth = 'to suit you. Please note these are lecture notes from an alternate presentation. 
Copyright ⁇ ' + assert response['sentences'][0] == ground_truth + + @pytest.mark.unit + def test_concat_multi_features(self, llm_model_config, perception_model_config, trainer_config): + llm_model_config.model.pretrained_audio_model = "stt_en_fastconformer_transducer_large" + llm_model_config.model.perception = perception_model_config + trainer, llm_model_config.trainer = trainer_config + model = ModularAudioGPTModel.restore_from_pretrained_models(llm_model_config, trainer=trainer) + model.eval() + + feat_dim = 32 + encoded = [torch.ones([3, 16, feat_dim]), torch.ones([3, 16, feat_dim])] + encoded_len = [torch.LongTensor([12, 8, 4]), torch.LongTensor([12, 8, 4])] + input_embeds = torch.zeros([2, 32, feat_dim]) + input_length = torch.LongTensor([32, 28]) + context_start_idx = [[0, 4, 12, 20], [0, 8, 16, 25]] + encoder_input, encoder_length = model._concat_multi_features( + encoded, encoded_len, input_embeds, input_length, context_start_idx + ) + assert encoder_input.shape == (2, 56, feat_dim) # max audio_len + text_len = (12 + 8 + 4) + 32 = 56 + assert encoder_length.shape == (2,) + assert np.allclose(encoder_length.cpu().numpy(), (56, 52)) + assert encoder_input[0, : context_start_idx[0][1]].sum() == 0 # first 4 features are text features + assert np.allclose( + encoder_input[0, context_start_idx[0][1] : context_start_idx[0][1] + encoded_len[0][0]], + torch.ones([encoded_len[0][0], feat_dim]), + ) + + @pytest.mark.unit + def test_shift_tokens_by_multi_audios(self): + """This test is put here because its functionality is similar to _concat_multi_features()""" + encoder_max_length = 64 + audio_len = [torch.LongTensor([12, 8, 4]), torch.LongTensor([12, 8, 4])] + context_tokens = torch.ones([2, 32]) + context_length = torch.LongTensor([32, 28]) + context_start_idx = [[0, 4, 12, 20], [0, 8, 16, 25]] + new_context_tokens = shift_tokens_by_multi_audios( + context_tokens, context_length, audio_len, context_start_idx, encoder_max_length + ) + assert new_context_tokens.shape == (2, 64) + assert np.allclose(new_context_tokens[0, : context_start_idx[0][1]], torch.ones([context_start_idx[0][1]])) + assert np.allclose( + new_context_tokens[0, context_start_idx[0][1] : context_start_idx[0][1] + audio_len[0][0]], + torch.zeros([audio_len[0][0]]), + ) From 820a285dcbafa72265a79c27d2656d636eed17b1 Mon Sep 17 00:00:00 2001 From: Pablo Garay Date: Mon, 13 May 2024 09:10:09 -0700 Subject: [PATCH 02/36] ASR_dev_run_Speech_To_Text_HF_Finetuning optional as flaky (#9180) --- .github/workflows/cicd-main.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index 252843bcc0ce..ef646ab92e7b 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -622,7 +622,7 @@ jobs: - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" - ASR_dev_run_Speech_To_Text_HF_Finetuning: + OPTIONAL_ASR_dev_run_Speech_To_Text_HF_Finetuning: needs: [cicd-test-container-setup] runs-on: self-hosted-azure-gpus-1 timeout-minutes: 10 @@ -667,8 +667,8 @@ jobs: +trainer.fast_dev_run=True \ exp_manager.exp_dir=examples/asr/speech_finetuning_results rm -rf examples/asr/speech_finetuning_results - - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" - if: "failure()" + #- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" + # if: "failure()" ASR_dev_run_Speech_to_Text_WPE_-_Conformer: needs: [cicd-test-container-setup] @@ -6521,7 +6521,7 @@ jobs: - ASR_dev_run_Speech_to_Text_WPE_-_CitriNet - 
ASR_dev_run_Speech_Pre-training_-_CitriNet - ASR_dev_run_Speech_To_Text_Finetuning - - ASR_dev_run_Speech_To_Text_HF_Finetuning + #- OPTIONAL_ASR_dev_run_Speech_To_Text_HF_Finetuning - ASR_dev_run_Speech_to_Text_WPE_-_Conformer - ASR_dev_run-part_two_Speech_to_Text_WPE_-_Squeezeformer - L2_Speech_to_Text_EMA From a0e9ee3c49d6eddd8d5372c3a7e44b509d2edc0e Mon Sep 17 00:00:00 2001 From: Eric Harper Date: Mon, 13 May 2024 11:06:51 -0600 Subject: [PATCH 03/36] update (#9181) Signed-off-by: eharper --- nemo/package_info.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/package_info.py b/nemo/package_info.py index b253927a6b38..59805e0e04d3 100644 --- a/nemo/package_info.py +++ b/nemo/package_info.py @@ -16,7 +16,7 @@ MAJOR = 2 MINOR = 0 PATCH = 0 -PRE_RELEASE = 'rc0' +PRE_RELEASE = 'rc1' # Use the following formatting: (major, minor, patch, pre-release) VERSION = (MAJOR, MINOR, PATCH, PRE_RELEASE) From 7a23bfa3969da3acb60a3f00a5191652833ca880 Mon Sep 17 00:00:00 2001 From: Ao Tang Date: Mon, 13 May 2024 13:53:19 -0400 Subject: [PATCH 04/36] Change FIM Dataset Random Seed Init (#9165) * change seed to dataset init * Apply isort and black reformatting Signed-off-by: suiyoubi --------- Signed-off-by: suiyoubi Co-authored-by: suiyoubi --- .../megatron/gpt_fim_dataset.py | 50 ++++++++++--------- 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/nemo/collections/nlp/data/language_modeling/megatron/gpt_fim_dataset.py b/nemo/collections/nlp/data/language_modeling/megatron/gpt_fim_dataset.py index 474761c41d67..358dbc22a2cd 100644 --- a/nemo/collections/nlp/data/language_modeling/megatron/gpt_fim_dataset.py +++ b/nemo/collections/nlp/data/language_modeling/megatron/gpt_fim_dataset.py @@ -17,6 +17,7 @@ import numpy as np from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults +from nemo.utils import logging try: from megatron.core.datasets.gpt_dataset import GPTDataset, GPTDatasetConfig @@ -36,8 +37,8 @@ class GPTFIMDatasetConfig(GPTDatasetConfig): """Configuration object for Megatron Core GPT FIM datasets - Attributes: - fim: fill in the middle parameters config + Attributes: + fim: fill in the middle parameters config """ def __init__(self, fim, **kwargs): @@ -79,6 +80,27 @@ def __init__( super().__init__(indexed_dataset, dataset_path, indexed_indices, num_samples, index_split, config) self.indexed_dataset = indexed_dataset + self.np_rng = np.random.RandomState(seed=self.config.random_seed) + logging.info(f"Initialized FIM RNG with seed = {self.config.random_seed}") + # get FIM params + self.fim_rate = self.config.fim.get('rate', 0.5) + self.fim_spm_rate = self.config.fim.get('spm_rate', 0.5) + self.fragment_fim_rate = self.config.fim.get('fragment_rate', 0.5) + split_sample = self.config.fim.get('split_sample', None) + self.fim_split_sample = self.config.tokenizer.tokens_to_ids(split_sample) if split_sample else None + self.no_fim_prefix = self.config.fim.get('no_prefix', None) + + # get extra tokens ids + fim_tokens = self.config.fim.extra_tokens + fim_tokens = [fim_tokens.prefix, fim_tokens.middle, fim_tokens.suffix, fim_tokens.pad, fim_tokens.eod] + fim_tokens_ids = self.config.tokenizer.tokens_to_ids(fim_tokens) + ( + self.prefix_tok_id, + self.middle_tok_id, + self.suffix_tok_id, + self.pad_tok_id, + self.eod_tok_id, + ) = fim_tokens_ids def _query_document_sample_shuffle_indices(self, idx: int) -> Tuple[np.ndarray, np.ndarray]: """Get the text (token ids) and document ids for a given index @@ -126,29 +148,9 @@ def 
_query_document_sample_shuffle_indices(self, idx: int) -> Tuple[np.ndarray, sample = np.concatenate(sample_parts) - # get FIM params - self.fim_rate = self.config.fim.get('rate', 0.5) - self.fim_spm_rate = self.config.fim.get('spm_rate', 0.5) - self.fragment_fim_rate = self.config.fim.get('fragment_rate', 0.5) - split_sample = self.config.fim.get('split_sample', None) - self.fim_split_sample = self.config.tokenizer.tokens_to_ids(split_sample) if split_sample else None - self.no_fim_prefix = self.config.fim.get('no_prefix', None) - - # get extra tokens ids - fim_tokens = self.config.fim.extra_tokens - fim_tokens = [fim_tokens.prefix, fim_tokens.middle, fim_tokens.suffix, fim_tokens.pad, fim_tokens.eod] - fim_tokens_ids = self.config.tokenizer.tokens_to_ids(fim_tokens) - ( - self.prefix_tok_id, - self.middle_tok_id, - self.suffix_tok_id, - self.pad_tok_id, - self.eod_tok_id, - ) = fim_tokens_ids - sample_len = sample.shape[0] segment_breaks = np.argwhere(sample == self.eod_tok_id) - np_rng = np.random.RandomState(seed=self.config.random_seed) + np_rng = self.np_rng if segment_breaks.shape != (0, 1): # then there is an EOD token in this example curr_start_position = 0 @@ -245,7 +247,7 @@ def _permute( no_fim_prefix=None, ): """ - Take in a sample (np array w/ size (0,chunklength)) and perform a FIM transformation on it. + Take in a sample (np array w/ size (0,chunklength)) and perform a FIM transformation on it. Maintain the same sample length (if transform creates a few extra tokens, drop them). """ if np_rng.binomial(1, fim_rate): # sample bernoulli dist From 43686ecef00837bca9a1c63e64759dc57d4fe2f7 Mon Sep 17 00:00:00 2001 From: Pablo Garay Date: Mon, 13 May 2024 15:40:54 -0700 Subject: [PATCH 05/36] increase time limit for Speech_Checkpoints_tests (#9186) --- .github/workflows/cicd-main.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index ef646ab92e7b..4652e4d19f89 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -6484,7 +6484,7 @@ jobs: Speech_Checkpoints_tests: needs: [cicd-test-container-setup] runs-on: self-hosted-azure - timeout-minutes: 10 + timeout-minutes: 20 container: image: nemoci.azurecr.io/nemo_container_${{ github.run_id }} options: From 467d94b7b9ab796b49025487edc05e635e0f8a94 Mon Sep 17 00:00:00 2001 From: gdengk <160076886+gdengk@users.noreply.github.com> Date: Mon, 13 May 2024 15:58:56 -0700 Subject: [PATCH 06/36] fix ep rank (#9161) Signed-off-by: Gao Deng --- nemo/collections/nlp/modules/common/megatron/megatron_init.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo/collections/nlp/modules/common/megatron/megatron_init.py b/nemo/collections/nlp/modules/common/megatron/megatron_init.py index 5d5b65b360ee..341e534bcd89 100644 --- a/nemo/collections/nlp/modules/common/megatron/megatron_init.py +++ b/nemo/collections/nlp/modules/common/megatron/megatron_init.py @@ -315,7 +315,7 @@ def fake_initialize_model_parallel( if expert_model_parallel_size_ is not None and expert_model_parallel_size_ > 1: for ranks in rank_generator.get_ranks('ep', independent_ep=True): if rank in ranks: - expert_model_parallel_rank = list(ranks).index(rank) // tensor_model_parallel_size + expert_model_parallel_rank = list(ranks).index(rank) # Build the pipeline model-parallel groups and embedding groups # (first and last rank in each pipeline model-parallel group). 
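The FIM dataset commit above moves the construction of the NumPy RandomState from the per-sample _query_document_sample_shuffle_indices call into the dataset constructor. A minimal sketch of why that matters, illustrative only and not taken from the patch: re-creating a generator with the same seed on every call replays identical random draws for every sample, while a single generator created once keeps advancing its state, so a Bernoulli FIM rate of 0.5 actually transforms roughly half of the samples.

import numpy as np

SEED = 1234  # assumed value for illustration

def fim_decision_reseed_each_call():
    # Old pattern: a fresh RandomState per sample means every sample sees
    # the exact same Bernoulli draw, so FIM is applied to all samples or none.
    rng = np.random.RandomState(seed=SEED)
    return rng.binomial(1, 0.5)

shared_rng = np.random.RandomState(seed=SEED)  # new pattern: created once at dataset init

def fim_decision_shared_rng():
    # The shared generator advances between calls, giving roughly 50% FIM coverage.
    return shared_rng.binomial(1, 0.5)

print([fim_decision_reseed_each_call() for _ in range(5)])  # all five values identical
print([fim_decision_shared_rng() for _ in range(5)])        # values vary across calls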
From 77090d4e5e218261b1fe6b3a931d16f4083f2d53 Mon Sep 17 00:00:00 2001 From: meatybobby Date: Mon, 13 May 2024 16:14:34 -0700 Subject: [PATCH 07/36] TRTLLM new API support (#9003) * Add trtllm checkpoint * Change model config * fix no query_group * Using build API * Change export to new API * Update generate API * Fix runtime config * Fix for llama * Fix for ptuning * Fix TP issue * Change TP rank for building weight dict * Add lora config * add prompt embedding table config * Fix PP isue * PP layers fix * Fix no prompt task ids * Add bos for Gemma * Add multi block mode * Embedding and layernorm for PP * MPI multiprocess support for multinode * Only output text on first rank * Change to ModelRunnerCpp * Add falcon * Add rotary_pct default value * Falcon fix * Add MOE config * Fix MOE weight dict * Clean code * Add rotary_base * Fix MOE config * Fix falcon new architecture * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Gemma 7B * Add rotary_scaling * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia --------- Signed-off-by: oyilmaz-nvidia Co-authored-by: abharwani Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Co-authored-by: oyilmaz-nvidia Co-authored-by: Eric Harper --- nemo/export/tensorrt_llm.py | 138 ++++++++------ nemo/export/trt_llm/decoder/__init__.py | 8 + nemo/export/trt_llm/nemo/convert.py | 71 ++++--- nemo/export/trt_llm/nemo/nemo_ckpt_convert.py | 39 ++-- nemo/export/trt_llm/nemo_utils.py | 180 +++++++++++++++++- nemo/export/trt_llm/tensorrt_llm_build.py | 90 ++++++++- nemo/export/trt_llm/tensorrt_llm_run.py | 130 ++++++------- scripts/export/export_to_trt_llm.py | 12 +- 8 files changed, 468 insertions(+), 200 deletions(-) diff --git a/nemo/export/tensorrt_llm.py b/nemo/export/tensorrt_llm.py index 033044b3b328..af4f1b6699ee 100644 --- a/nemo/export/tensorrt_llm.py +++ b/nemo/export/tensorrt_llm.py @@ -30,9 +30,10 @@ from nemo.export.tarutils import TarPath, unpack_tarball from nemo.export.trt_llm.model_config_trt import model_config_to_tensorrt_llm from nemo.export.trt_llm.nemo.nemo_ckpt_convert import build_tokenizer -from nemo.export.trt_llm.nemo_utils import get_tokenzier, nemo_llm_model_to_model_config, nemo_llm_to_model_config +from nemo.export.trt_llm.nemo_utils import get_tokenzier, nemo_llm_model_to_model_config, nemo_to_trtllm_config from nemo.export.trt_llm.qnemo import qnemo_to_tensorrt_llm from nemo.export.trt_llm.qnemo.tokenizer_utils import get_nmt_tokenizer +from nemo.export.trt_llm.tensorrt_llm_build import build_and_save_engine from nemo.export.trt_llm.tensorrt_llm_run import generate, generate_streaming, load, load_refit from nemo.export.trt_llm.utils import is_nemo_file @@ -115,6 +116,7 @@ def export( max_output_token: int = 256, max_batch_size: int = 8, max_prompt_embedding_table_size=None, + use_parallel_embedding: bool = False, use_inflight_batching: bool = False, enable_context_fmha: bool = True, paged_kv_cache: bool = False, @@ -188,65 +190,70 @@ def export( self.model = None - tmp_dir = tempfile.TemporaryDirectory() - nemo_export_dir = Path(tmp_dir.name) + if tensorrt_llm.mpi_rank() == 0: + tmp_dir = tempfile.TemporaryDirectory() + nemo_export_dir = Path(tmp_dir.name) - if nemo_checkpoint_path.endswith("qnemo"): - if os.path.isdir(nemo_checkpoint_path): - nemo_export_dir = nemo_checkpoint_path + if nemo_checkpoint_path.endswith("qnemo"): + if 
os.path.isdir(nemo_checkpoint_path): + nemo_export_dir = nemo_checkpoint_path + else: + unpack_tarball(nemo_checkpoint_path, tmp_dir.name) + nemo_checkpoint_path = tmp_dir.name + self.tokenizer = get_nmt_tokenizer(nemo_checkpoint_path) + + qnemo_to_tensorrt_llm( + nemo_checkpoint_path=nemo_checkpoint_path, + engine_dir=self.model_dir, + max_input_len=max_input_token, + max_output_len=max_output_token, + max_batch_size=max_batch_size, + max_prompt_embedding_table_size=max_prompt_embedding_table_size, + lora_target_modules=lora_target_modules, + ) else: - unpack_tarball(nemo_checkpoint_path, tmp_dir.name) - nemo_checkpoint_path = tmp_dir.name - self.tokenizer = get_nmt_tokenizer(nemo_checkpoint_path) - - qnemo_to_tensorrt_llm( - nemo_checkpoint_path=nemo_checkpoint_path, - engine_dir=self.model_dir, - max_input_len=max_input_token, - max_output_len=max_output_token, - max_batch_size=max_batch_size, - max_prompt_embedding_table_size=max_prompt_embedding_table_size, - lora_target_modules=lora_target_modules, - ) - else: - model_configs, self.tokenizer = nemo_llm_to_model_config( - in_file=nemo_checkpoint_path, - decoder_type=model_type, - dtype=dtype, - tensor_parallel_size=tensor_parallel_size, - pipeline_parallel_size=pipeline_parallel_size, - nemo_export_dir=nemo_export_dir, - save_nemo_model_config=save_nemo_model_config, - ) + weights_dicts, model_configs, self.tokenizer = nemo_to_trtllm_config( + in_file=nemo_checkpoint_path, + decoder_type=model_type, + dtype=dtype, + tensor_parallel_size=tensor_parallel_size, + pipeline_parallel_size=pipeline_parallel_size, + use_parallel_embedding=use_parallel_embedding, + nemo_export_dir=nemo_export_dir, + save_nemo_model_config=save_nemo_model_config, + ) - model_config_to_tensorrt_llm( - model_configs, - self.model_dir, - world_size=tensor_parallel_size * pipeline_parallel_size, - max_input_len=max_input_token, - max_output_len=max_output_token, - max_batch_size=max_batch_size, - max_prompt_embedding_table_size=max_prompt_embedding_table_size, - use_inflight_batching=use_inflight_batching, - paged_kv_cache=paged_kv_cache, - enable_context_fmha=enable_context_fmha, - enable_multi_block_mode=enable_multi_block_mode, - use_lora_plugin=use_lora_plugin, - lora_target_modules=lora_target_modules, - max_lora_rank=max_lora_rank, - ) + for weight_dict, model_config in zip(weights_dicts, model_configs): + build_and_save_engine( + max_input_len=max_input_token, + max_output_len=max_output_token, + max_batch_size=max_batch_size, + model_config=model_config, + model_weights=weight_dict, + model_dir=self.model_dir, + model_type=model_type, + lora_ckpt_list=self.lora_ckpt_list, + use_lora_plugin=use_lora_plugin, + max_lora_rank=max_lora_rank, + lora_target_modules=lora_target_modules, + max_prompt_embedding_table_size=max_prompt_embedding_table_size, + enable_multi_block_mode=enable_multi_block_mode, + ) - tokenizer_path = os.path.join(nemo_export_dir, "tokenizer.model") - if os.path.exists(tokenizer_path): - shutil.copy(tokenizer_path, self.model_dir) - else: - self.tokenizer.save_pretrained(os.path.join(self.model_dir, 'huggingface_tokenizer')) + tokenizer_path = os.path.join(nemo_export_dir, "tokenizer.model") + if os.path.exists(tokenizer_path): + shutil.copy(tokenizer_path, self.model_dir) + else: + self.tokenizer.save_pretrained(os.path.join(self.model_dir, 'huggingface_tokenizer')) + + nemo_model_config = os.path.join(nemo_export_dir, "model_config.yaml") + if os.path.exists(nemo_model_config): + shutil.copy(nemo_model_config, self.model_dir) - 
nemo_model_config = os.path.join(nemo_export_dir, "model_config.yaml") - if os.path.exists(nemo_model_config): - shutil.copy(nemo_model_config, self.model_dir) + tmp_dir.cleanup() - tmp_dir.cleanup() + if tensorrt_llm.mpi_world_size() > 1: + tensorrt_llm.mpi_barrier() if load_model: self._load() @@ -279,7 +286,9 @@ def build( # Build or refit TRT-LLM engine from a nemo model. model_configs = nemo_llm_model_to_model_config( - nemo_model=nemo_model, decoder_type=model_type, nemo_model_config=nemo_model_config, + nemo_model=nemo_model, + decoder_type=model_type, + nemo_model_config=nemo_model_config, ) model_config_to_tensorrt_llm( @@ -298,7 +307,9 @@ def build( ) def refit( - self, nemo_model, nemo_model_config, + self, + nemo_model, + nemo_model_config, ): assert self.use_refit, "TRT-LLM model must be built() with refit=True" @@ -329,7 +340,6 @@ def forward( output_log_probs: bool = False, **sampling_kwargs, ): - """ Exports nemo checkpoints to TensorRT-LLM. @@ -394,7 +404,7 @@ def forward( ), "Task: {0} doesn't exist in the task list.".format(task_ids[i]) input_task_ids.append(self.task_ids[task_ids[i]]) if not streaming: - if torch.distributed.is_initialized(): + if torch.distributed.is_initialized() or tensorrt_llm.mpi_world_size() > 1: multiprocessed_env = True else: multiprocessed_env = False @@ -478,7 +488,7 @@ def get_hidden_size(self): if self.config is None: return None else: - return self.config["builder_config"]["hidden_size"] + return self.config["pretrained_config"]["hidden_size"] @property def get_triton_input(self): @@ -665,7 +675,9 @@ def _get_prompt_embedding_table_ckpt(self, prompt_embeddings_checkpoint_path): return weights.cpu().detach() def _get_prompt_embedding_table( - self, prompt_embeddings_table=None, prompt_embeddings_checkpoint_path=None, + self, + prompt_embeddings_table=None, + prompt_embeddings_checkpoint_path=None, ): if prompt_embeddings_table is not None and prompt_embeddings_checkpoint_path is not None: LOGGER.warning( @@ -694,15 +706,15 @@ def _get_prompt_embedding_table( raise TypeError(prompt_embeddings_checkpoint_path + " is not a nemo file.") prompt_embeddings_table = self._get_prompt_embedding_table_ckpt(prompt_embeddings_checkpoint_path) - dtype = self.config['builder_config']['precision'] + dtype = self.config['pretrained_config']['dtype'] prompt_embeddings_table = prompt_embeddings_table.to( dtype=tensorrt_llm._utils.str_dtype_to_torch(dtype) ).cuda() - if prompt_embeddings_table.size(dim=1) != self.config["builder_config"]["hidden_size"]: + if prompt_embeddings_table.size(dim=1) != self.config["pretrained_config"]["hidden_size"]: raise Exception( "Hidden dimension of the model is {0} and does not match with the dimension of the prompt table.".format( - self.config["builder_config"]["hidden_size"] + self.config["pretrained_config"]["hidden_size"] ) ) diff --git a/nemo/export/trt_llm/decoder/__init__.py b/nemo/export/trt_llm/decoder/__init__.py index 5fe749408cb9..b5e22b5e513e 100644 --- a/nemo/export/trt_llm/decoder/__init__.py +++ b/nemo/export/trt_llm/decoder/__init__.py @@ -40,6 +40,14 @@ DECODER_GEMMA: GemmaDecoderLayerConfigBuilder, } +DECODER_MODEL_TYPE = { + DECODER_GPT2: 'GPTForCausalLM', + DECODER_GPTNEXT: 'GPTForCausalLM', + DECODER_LLAMA: 'LLaMAForCausalLM', + DECODER_GEMMA: 'GemmaForCausalLM', + DECODER_FALCON: 'FalconForCausalLM', +} + def build_decoder_layer_config(layer, decoder: str, dtype=trt.float16, rank=0, tensor_parallel=1): """Builds the decoder layer config with the input torch module.""" diff --git 
a/nemo/export/trt_llm/nemo/convert.py b/nemo/export/trt_llm/nemo/convert.py index 09476da6b939..7598b3f6825f 100644 --- a/nemo/export/trt_llm/nemo/convert.py +++ b/nemo/export/trt_llm/nemo/convert.py @@ -39,12 +39,12 @@ def gpu_map_location(storage, loc): def save_val(val, dir, key, tp_num=None): - suffix = "bin" if tp_num is None else f"{tp_num}.bin" + suffix = "" if tp_num is None else f".{tp_num}.bin" # Transpose linear layer weights to the correct shape. if len(val.shape) >= 2: val = np.ascontiguousarray(np.transpose(val.reshape(val.shape[0], -1), [1, 0])) global weights_dict - weights_dict[f"model.{key}.{suffix}"] = val + weights_dict[f"{key}{suffix}"] = val def save_split(split_vals, dir, key, i, split_factor): @@ -55,10 +55,10 @@ def save_split(split_vals, dir, key, i, split_factor): def save_expert_split(split_vals, dir, key, i, split_factor): for j, val in enumerate(split_vals): tp_num = i * split_factor + j - suffix = "bin" if tp_num is None else f"{tp_num}.bin" + suffix = "" if tp_num is None else f".{tp_num}.bin" global weights_dict - weights_dict[f"model.{key}.{suffix}"] = val + weights_dict[f"{key}{suffix}"] = val def generate_int8(weights, act_range, is_qkv=False, multi_query_mode=False): @@ -183,6 +183,9 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t save_int8 = int8_outputs == "all" or int8_outputs == "kv_cache_only" + layer_num = key.split(".")[1] + layer_prefix = f'transformer.layers.{layer_num}' + if not isinstance(vals, list): vals = [vals] @@ -210,12 +213,27 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t or "final_layernorm.bias" in key ): # shared weights, only need to convert the weights of rank 0 - if "post_self_attn_layernorm.weight" in key: - key = key.replace("post_self_attn_layernorm.weight", "post_attention_layernorm.weight") - elif "mlp.linear_fc2.bias" in key: - key = key.replace("mlp.linear_fc2.bias", "mlp.dense_4h_to_h.bias") - elif "attention.linear_proj.bias" in key: - key = key.replace("attention.linear_proj.bias", "attention.dense.bias") + if "post_self_attn_layernorm" in key or "post_attention_layernorm" in key: + if key.endswith('weight'): + key = f'{layer_prefix}.post_layernorm.weight' + else: + key = f'{layer_prefix}.post_layernorm.bias' + elif "mlp.linear_fc2.bias" in key or "mlp.dense_4h_to_h.bias" in key: + key = f'{layer_prefix}.mlp.proj.bias' + elif "attention.linear_proj.bias" in key or "attention.dense.bias" in key: + key = f'{layer_prefix}.attention.dense.bias' + elif "final_layernorm" in key: + key = key.replace("final_layernorm", "transformer.ln_f") + elif "input_layernorm" in key: + if key.endswith('weight'): + key = f'{layer_prefix}.input_layernorm.weight' + else: + key = f'{layer_prefix}.input_layernorm.bias' + elif "pre_mlp_layernorm" in key: + if key.endswith('weight'): + key = f'{layer_prefix}.post_layernorm.weight' + else: + key = f'{layer_prefix}.post_layernorm.bias' if tp_rank == 0: save_val(vals[0], saved_dir, key) @@ -228,10 +246,10 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t cat_dim = 0 val = np.concatenate(vals, axis=cat_dim) split_vals = np.split(val, split_factor, axis=cat_dim) - if "attention.linear_proj.weight" in key: - key = key.replace("attention.linear_proj.weight", "attention.dense.weight") - elif "mlp.linear_fc2.weight" in key: - key = key.replace("mlp.linear_fc2.weight", "mlp.dense_4h_to_h.weight") + if "attention.linear_proj.weight" in key or "attention.dense.weight" in key: + key = 
f'{layer_prefix}.attention.dense.weight' + elif "mlp.linear_fc2.weight" in key or "mlp.dense_4h_to_h.weight" in key: + key = f'{layer_prefix}.mlp.proj.weight' save_split(split_vals, saved_dir, key, tp_rank, split_factor) if act_range is not None and int8_outputs == "all": base_key = key.replace(".weight", "") @@ -251,8 +269,10 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t val = np.concatenate(vals, axis=cat_dim) split_vals = np.split(val, split_factor, axis=cat_dim) - if "mlp.linear_fc1" in key: - key = key.replace("mlp.linear_fc1", "mlp.dense_h_to_4h") + if key.endswith("weight"): + key = f'{layer_prefix}.mlp.fc.weight' + else: + key = f'{layer_prefix}.mlp.fc.bias' save_split(split_vals, saved_dir, key, tp_rank, split_factor) if act_range is not None and int8_outputs == "all": base_key = key.replace(".weight", "") @@ -261,8 +281,10 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t if split_gated_activation: assert not save_int8 - prefix, dot, suffix = key.rpartition(".") - key = prefix + ".gate" + dot + suffix + if key.endswith("weight"): + key = f'{layer_prefix}.mlp.gate.weight' + else: + key = f'{layer_prefix}.mlp.gate.bias' gate = np.concatenate(gates, axis=cat_dim) split_vals = np.split(gate, split_factor, axis=cat_dim) @@ -279,9 +301,6 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t write_int8(vals_i8, saved_dir, base_key, cat_dim, tp_rank, split_factor) elif "attention.query_key_value.bias" in key or "attention.linear_qkv.bias" in key: - if "attention.linear_qkv.bias" in key: - key = key.replace("attention.linear_qkv.bias", "attention.query_key_value.bias") - qkv_hidden_dim = vals[0].shape[0] size_per_head = qkv_hidden_dim // (num_attention_heads + 2 * num_kv_heads) q_num = num_attention_heads // num_kv_heads @@ -304,6 +323,7 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t np.concatenate([q_split[i].reshape(-1), k_split[i].reshape(-1), v_split[i].reshape(-1)], axis=0) for i in range(split_factor) ] + key = f'{layer_prefix}.attention.qkv.bias' save_split(split_vals, saved_dir, key, tp_rank, split_factor) elif "attention.query_key_value.weight" in key or "attention.linear_qkv.weight" in key: @@ -342,8 +362,7 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t for i in range(split_factor) ] - if "attention.linear_qkv.weight" in key: - key = key.replace("attention.linear_qkv.weight", "attention.query_key_value.weight") + key = f'{layer_prefix}.attention.qkv.weight' save_split(split_vals, saved_dir, key, tp_rank, split_factor) if save_int8: base_key = key.replace(".weight", "") @@ -366,8 +385,8 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t pass elif "mlp.router.weight" in key: val = np.concatenate(vals, axis=1) - split_vals = np.split(val, split_factor, axis=0) - save_split(split_vals, saved_dir, key, tp_rank, split_factor) + key = f'{layer_prefix}.mlp.router.weight' + save_val(val, saved_dir, key) elif "experts.linear_fc1.weight" in key: cat_dim = -1 val = np.concatenate(vals, axis=cat_dim) @@ -378,12 +397,14 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals, storage_t split_w3s = np.split(w3, split_factor, axis=1) split_vals = [np.concatenate(item, axis=1) for item in zip(split_w3s, split_w1s)] + key = f'{layer_prefix}.mlp.experts_weight_1' save_expert_split(split_vals, saved_dir, key, tp_rank, split_factor) elif "experts.linear_fc2.weight" in key: 
cat_dim = -1 val = np.concatenate(vals, axis=cat_dim) split_vals = np.split(val, split_factor, axis=cat_dim) + key = f'{layer_prefix}.mlp.experts_weight_2' save_expert_split(split_vals, saved_dir, key, tp_rank, split_factor) else: print(f"[WARNING] {key} not handled by converter") diff --git a/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py b/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py index d9135d5c0c21..44133de381bd 100644 --- a/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py +++ b/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py @@ -27,7 +27,7 @@ import tensorstore # This is important even though not used. Otherwise zarr raises error. import torch import zarr -from tensorrt_llm._utils import np_bfloat16, str_dtype_to_torch, torch_to_numpy +from tensorrt_llm._utils import np_bfloat16, pad_vocab_size, str_dtype_to_torch, torch_to_numpy from tqdm import tqdm from transformers import AutoTokenizer, GPT2Tokenizer, LlamaConfig @@ -174,6 +174,7 @@ def convert_dist_checkpoint(unpacked_checkpoints_dir: UnpackedNemoCheckpointDir, multi_query_mode = nemo_model_config.get("multi_query_mode", False) num_attention_heads = nemo_model_config["num_attention_heads"] kv_channels = nemo_model_config.get("kv_channels", None) + use_parallel_embedding = args.use_parallel_embedding if num_kv_heads == 0: if multi_query_mode: num_kv_heads = 1 @@ -191,6 +192,7 @@ def convert_dist_checkpoint(unpacked_checkpoints_dir: UnpackedNemoCheckpointDir, "kv_channels": kv_channels, "use_attention_nemo_shape": True, "transpose_weights": True, + "use_parallel_embedding": use_parallel_embedding, } # split_factor: in how many parts a TP training node is split @@ -202,22 +204,30 @@ def handle_model_level_weights(model, tp_idx: int, pp_idx: int): if has_position_embedding: val = model[get_layer_name("position_embedding", prefix)] val = torch_to_numpy(val.to(storage_type).cpu()) - model_level_weights["model.wpe.bin"].append(val) + model_level_weights["transformer.position_embedding.weight"].append(val) if pp_idx == 0: val = model.get("state_dict", model)[get_layer_name("word_embedding", prefix)] if embedding_scaling: val = val * float(math.sqrt(hidden_size)) + vocab_size = val.shape[0] + if use_parallel_embedding: + # Pad vocab_size first + if vocab_size % inference_tp_size != 0: + vocab_size_padded = pad_vocab_size(vocab_size, inference_tp_size) + pad_width = vocab_size_padded - vocab_size + val = torch.nn.functional.pad(val, (0, 0, 0, pad_width), value=0) + val = torch_to_numpy(val.to(storage_type).cpu()) - model_level_weights["model.wte.bin"].append(val) + model_level_weights["transformer.vocab_embedding.weight"].append(val) if share_embeddings_and_output: val = model.get("state_dict", model)[get_layer_name("word_embedding", prefix)] val = torch_to_numpy(val.to(storage_type).cpu()) - model_level_weights["model.lm_head.weight.bin"].append(val) + model_level_weights["lm_head.weight"].append(val) if has_lm_head and pp_idx == training_pp_size - 1: val = model.get("state_dict", model)[get_layer_name("output_layer", prefix)] val = torch_to_numpy(val.to(storage_type).cpu()) - model_level_weights["model.lm_head.weight.bin"].append(val) + model_level_weights["lm_head.weight"].append(val) weights_dict = {} @@ -280,7 +290,6 @@ def handle_model_level_weights(model, tp_idx: int, pp_idx: int): model_level_weights[key] = np.concatenate(values, axis=0) weights_dict[key] = model_level_weights[key] - vocab_size = model_level_weights["model.wte.bin"].shape[0] if nemo_model_config["tokenizer"].get("library", None) == "huggingface": tokenizer = 
AutoTokenizer.from_pretrained( @@ -293,23 +302,7 @@ def handle_model_level_weights(model, tp_idx: int, pp_idx: int): tokenizer_config["model"] = os.path.join(out_dir, "tokenizer.model") tokenizer = build_tokenizer(tokenizer_config) - llm_config = nemo_to_llm_config( - nemo_model_config, vocab_size, tokenizer.eos_token_id, tokenizer.bos_token_id, args.decoder_type, - ) - - llm_config.is_mcore = is_mcore - - config = configparser.ConfigParser() - decoder_name_dict = {"llama": "llama", "falcon": "falcon"} - model_name = decoder_name_dict[args.decoder_type] if args.decoder_type in decoder_name_dict else "gpt" - - config[model_name] = {k: str(v) for k, v in vars(llm_config).items()} - config[model_name]["storage_dtype"] = args.storage_type - config_path = out_dir / "config.ini" - with config_path.open("w") as config_file: - config.write(config_file) - - return weights_dict, llm_config, tokenizer + return weights_dict, nemo_model_config, tokenizer @torch.no_grad() diff --git a/nemo/export/trt_llm/nemo_utils.py b/nemo/export/trt_llm/nemo_utils.py index ee2073fa518d..d735cab36b00 100644 --- a/nemo/export/trt_llm/nemo_utils.py +++ b/nemo/export/trt_llm/nemo_utils.py @@ -28,9 +28,14 @@ import numpy as np import tensorrt_llm from tensorrt_llm import str_dtype_to_trt -from transformers import AutoTokenizer, LlamaConfig, PretrainedConfig, PreTrainedTokenizer +from tensorrt_llm._utils import pad_vocab_size +from tensorrt_llm.functional import non_gated_version +from tensorrt_llm.layers import MoeConfig +from tensorrt_llm.models.modeling_utils import PretrainedConfig +from transformers import AutoTokenizer, LlamaConfig, PreTrainedTokenizer from nemo.export.tarutils import TarPath +from nemo.export.trt_llm.decoder import DECODER_MODEL_TYPE from nemo.export.trt_llm.model_config import ( LAYERNORM_DEFAULT, LAYERNORM_RMS, @@ -56,6 +61,7 @@ def _nemo_llm_decode( storage_type: str = "bfloat16", load_checkpoints_on_gpu: bool = False, decoder_type: str = "gptnext", + use_parallel_embedding: bool = False, save_nemo_model_config: bool = False, ) -> Tuple[Dict[str, np.ndarray], PretrainedConfig, PreTrainedTokenizer]: """Decodes the NEMO file and returns the weights dict, llm config and tokenizer.""" @@ -67,6 +73,7 @@ def _nemo_llm_decode( args.load_checkpoints_on_gpu = load_checkpoints_on_gpu args.verbose = False args.decoder_type = decoder_type + args.use_parallel_embedding = use_parallel_embedding if not os.path.exists(in_file): LOGGER.error("%s does not exist", in_file) @@ -194,7 +201,9 @@ def nemo_llm_to_model_config( def to_word_list_format( - word_dict: List[List[str]], tokenizer=None, ref_str="", + word_dict: List[List[str]], + tokenizer=None, + ref_str="", ): ''' format of word_dict @@ -250,7 +259,10 @@ def to_word_list_format( def nemo_llm_model_to_model_config( - nemo_model: str, decoder_type: str, nemo_model_config: str, dtype_str: str = "float32", + nemo_model: str, + decoder_type: str, + nemo_model_config: str, + dtype_str: str = "float32", ) -> Tuple[List[ModelConfig], PreTrainedTokenizer]: """Converts the NEMO model object and construct the `ModelConfig` before tensorrt_llm deployment.""" from megatron.core import parallel_state @@ -297,8 +309,8 @@ def nemo_llm_model_to_model_config( LOGGER.info( f'''Resharing: Rank {tensorrt_llm.mpi_rank()} mapping: - tp_rank {parallel_state.get_tensor_model_parallel_rank()} -> {model_config.mapping.tp_rank}, - pp_rank {parallel_state.get_pipeline_model_parallel_rank()} -> {model_config.mapping.pp_rank}, + tp_rank {parallel_state.get_tensor_model_parallel_rank()} -> 
{model_config.mapping.tp_rank}, + pp_rank {parallel_state.get_pipeline_model_parallel_rank()} -> {model_config.mapping.pp_rank}, tp_group {model_config.mapping.tp_group}''' ) @@ -321,3 +333,161 @@ def nemo_llm_model_to_model_config( model_config.lm_head.weight = lm_head_weight return [model_config] + + +def nemo_to_trtllm_config( + in_file: str, + decoder_type: str, + nemo_export_dir: Union[str, Path], + dtype: str = "bfloat16", + tensor_parallel_size: int = 1, + pipeline_parallel_size: int = 1, + use_parallel_embedding: bool = False, + save_nemo_model_config: bool = False, +) -> Tuple[List[Dict], List[PretrainedConfig], PreTrainedTokenizer]: + """Converts the NEMO file and construct the `PretrainedConfig` before tensorrt_llm deployment.""" + dtype_str = dtype + + weights_dict, nemo_model_config, tokenizer = _nemo_llm_decode( + in_file=in_file, + out_dir=nemo_export_dir, + tensor_parallelism=tensor_parallel_size, + processes=1, + storage_type=dtype_str, + use_parallel_embedding=use_parallel_embedding, + load_checkpoints_on_gpu=False, + decoder_type=decoder_type, + save_nemo_model_config=save_nemo_model_config, + ) + + world_size = tensor_parallel_size * pipeline_parallel_size + + lm_head_weight = weights_dict["lm_head.weight"] + + vocab_size = weights_dict["transformer.vocab_embedding.weight"].shape[0] + vocab_size_padded = pad_vocab_size(vocab_size, tensor_parallel_size) + + if vocab_size_padded != vocab_size: + pad_width = vocab_size_padded - vocab_size + lm_head_weight = np.pad(lm_head_weight, ((0, pad_width), (0, 0)), "constant", constant_values=0) + + hidden_act = nemo_model_config.get('activation') + hidden_act = ( + hidden_act.split("-")[-1] if nemo_model_config.get('num_moe_experts', 0) else non_gated_version(hidden_act) + ) + + config = { + 'architecture': DECODER_MODEL_TYPE[decoder_type], + 'dtype': dtype_str, + 'num_hidden_layers': nemo_model_config.get('num_layers'), + 'num_attention_heads': nemo_model_config.get('num_attention_heads'), + 'num_key_value_heads': nemo_model_config.get('num_query_groups', nemo_model_config['num_attention_heads']), + 'head_size': nemo_model_config.get('kv_channels'), + 'hidden_size': nemo_model_config.get('hidden_size'), + 'intermediate_size': nemo_model_config.get('ffn_hidden_size'), + 'norm_epsilon': nemo_model_config.get('layernorm_epsilon'), + 'vocab_size': vocab_size_padded, + 'position_embedding_type': ( + "rope_gpt_neox" if nemo_model_config.get('position_embedding_type') == "rope" else "learned_absolute" + ), + 'max_position_embeddings': nemo_model_config.get('max_position_embeddings'), + 'hidden_act': hidden_act, + 'use_parallel_embedding': use_parallel_embedding, + 'embedding_sharding_dim': 0, + 'share_embedding_table': False, + 'quantization': { + 'quant_algo': None, + 'kv_cache_quant_algo': None, + }, + 'bias': nemo_model_config.get('bias'), + 'apply_query_key_layer_scaling': False, + 'rotary_pct': nemo_model_config.get('rotary_percentage', 1.0), + 'rotary_base': nemo_model_config.get('rotary_base', 10000), + 'moe_num_experts': nemo_model_config.get('num_moe_experts', 0), + 'moe_top_k': nemo_model_config.get('moe_router_topk'), + 'moe_normalization_mode': nemo_model_config.get( + 'moe_renorm_mode', MoeConfig.ExpertScaleNormalizationMode.RENORMALIZE + ), + 'moe_tp_mode': nemo_model_config.get('moe_tp_mode', MoeConfig.ParallelismMode.TENSOR_PARALLEL), + 'logits_dtype': 'float32', + 'world_size': world_size, + 'tp_size': tensor_parallel_size, + 'pp_size': pipeline_parallel_size, + } + + model_configs = [] + weights_dicts = [] + num_layers 
= nemo_model_config.get('num_layers') + rotary_scaling = nemo_model_config.get("seq_len_interpolation_factor") + + if decoder_type == "falcon": + config["new_decoder_architecture"] = False if num_layers == 32 else True + config["parallel_attention"] = True + if rotary_scaling is not None: + config["rotary_scaling"] = {"type": "linear", "factor": float(rotary_scaling)} + + pp_key = { + "transformer.vocab_embedding.weight", + "transformer.position_embedding.weight", + "lm_head.weight", + "transformer.ln_f.weight", + "transformer.ln_f.bias", + } + + for i in range(world_size): + mapping = tensorrt_llm.Mapping( + world_size=world_size, rank=i, tp_size=tensor_parallel_size, pp_size=pipeline_parallel_size + ) + layers_range = mapping.pp_layers(num_layers) + + weights_dict_local = {} + for k, v in weights_dict.items(): + if k in pp_key: + continue + new_key = k + if new_key.endswith(".bin"): # TP split + if new_key.endswith(f"{mapping.tp_rank}.bin"): + new_key = new_key.replace(f".{mapping.tp_rank}.bin", "") + if "layers" in new_key: # PP + layer_num = int(new_key.split(".")[2]) + if layer_num in layers_range: + new_key = new_key.replace(f"layers.{layer_num}", f"layers.{layer_num-layers_range[0]}") + if config.get("new_decoder_architecture", False) and "post_layernorm" in new_key: + new_key = new_key.replace("post_layernorm", "mlp_layernorm") + weights_dict_local[new_key] = v + + if mapping.is_first_pp_rank(): + embedding_weight = ( + np.ascontiguousarray( + split(weights_dict["transformer.vocab_embedding.weight"], mapping.tp_size, mapping.tp_rank) + ) + if use_parallel_embedding + else weights_dict["transformer.vocab_embedding.weight"] + ) + + weights_dict_local["transformer.vocab_embedding.weight"] = embedding_weight + + pos_embedding_weight = weights_dict.get("transformer.position_embedding.weight") + if pos_embedding_weight is not None: + if use_parallel_embedding: + pos_embedding_weight = np.ascontiguousarray( + split(pos_embedding_weight, mapping.tp_size, mapping.tp_rank) + ) + weights_dict_local["transformer.position_embedding.weight"] = pos_embedding_weight + + if mapping.is_last_pp_rank(): + weights_dict_local["lm_head.weight"] = np.ascontiguousarray( + split(lm_head_weight, mapping.tp_size, mapping.tp_rank) + ) + weights_dict_local["transformer.ln_f.weight"] = weights_dict["transformer.ln_f.weight"] + + ln_f_bias = weights_dict.get("transformer.ln_f.bias") + if ln_f_bias is not None: + weights_dict_local["transformer.ln_f.bias"] = ln_f_bias + + model_config = PretrainedConfig(**config) + model_config.mapping = mapping + model_configs.append(model_config) + weights_dicts.append(weights_dict_local) + + return weights_dicts, model_configs, tokenizer diff --git a/nemo/export/trt_llm/tensorrt_llm_build.py b/nemo/export/trt_llm/tensorrt_llm_build.py index 3ad27a2eb9a6..ac8d9094ea32 100644 --- a/nemo/export/trt_llm/tensorrt_llm_build.py +++ b/nemo/export/trt_llm/tensorrt_llm_build.py @@ -25,10 +25,13 @@ import torch from tensorrt_llm import str_dtype_to_trt from tensorrt_llm._utils import np_dtype_to_trt -from tensorrt_llm.builder import Builder +from tensorrt_llm.builder import BuildConfig, Builder +from tensorrt_llm.commands.build import build as build_trtllm from tensorrt_llm.logger import logger -from tensorrt_llm.models.modeling_utils import add_lora +from tensorrt_llm.lora_manager import LoraBuildConfig +from tensorrt_llm.models.modeling_utils import add_lora, optimize_model, preprocess_weights from tensorrt_llm.network import net_guard +from tensorrt_llm.plugin import PluginConfig from 
tensorrt_llm.plugin.plugin import ContextFMHAType from tensorrt_llm.quantization import QuantMode @@ -57,11 +60,11 @@ def serialize_engine(engine, path): def refit_runtime_engine(params, cuda_engine): ''' - @brief: Inplace refit one TensorRT cuda engine using weights from the network, - user should guarantee that the engine is built with REFIT flag, and the network has the same structure with the engine. - @param engine_buffer: A serialized TensorRT engine. - @param network: Network object. - @return: A serialized TRT engine if refit successfully, None otherwise + @brief: Inplace refit one TensorRT cuda engine using weights from the network, + user should guarantee that the engine is built with REFIT flag, and the network has the same structure with the engine. + @param engine_buffer: A serialized TensorRT engine. + @param network: Network object. + @return: A serialized TRT engine if refit successfully, None otherwise ''' logger.info(f'Refit runtime engine') tik = time.time() @@ -88,7 +91,11 @@ def refit_runtime_engine(params, cuda_engine): def build_rank_engine( - tensorrt_llm_gpt, builder: Builder, builder_config: tensorrt_llm.builder.BuilderConfig, engine_name, args, + tensorrt_llm_gpt, + builder: Builder, + builder_config: tensorrt_llm.builder.BuilderConfig, + engine_name, + args, ): str_dtype_to_trt(args.dtype) @@ -348,3 +355,70 @@ def build( tok = time.time() t = time.strftime("%H:%M:%S", time.gmtime(tok - tik)) logger.info(f"Total time of building all {args.mapping.world_size} engines: {t}") + + +def build_and_save_engine( + max_input_len=1024, + max_output_len=1024, + max_batch_size=4, + model_dir=None, + model_weights=None, + model_config=None, + model_type='gpt', + lora_ckpt_list=None, + use_lora_plugin=None, + max_lora_rank=64, + lora_target_modules=None, + max_prompt_embedding_table_size=0, + enable_multi_block_mode: bool = False, +): + try: + model_cls = getattr(tensorrt_llm.models, model_config.architecture) + except: + raise AttributeError(f"Could not find TRTLLM model type: {model_type}!") + + logger.set_level("info") + str_dtype = model_config.dtype + plugin_config = PluginConfig() + plugin_config.set_gpt_attention_plugin(dtype=str_dtype) + plugin_config.set_gemm_plugin(dtype=str_dtype) + plugin_config.set_plugin("multi_block_mode", enable_multi_block_mode) + max_num_tokens = max_batch_size * max_input_len + + build_dict = { + 'max_input_len': max_input_len, + 'max_output_len': max_output_len, + 'max_batch_size': max_batch_size, + 'max_beam_width': 1, + 'max_num_tokens': max_num_tokens, + 'opt_num_tokens': None, + 'max_prompt_embedding_table_size': max_prompt_embedding_table_size, + 'gather_context_logits': False, + 'gather_generation_logits': False, + 'strongly_typed': False, + 'builder_opt': None, + } + build_config = BuildConfig.from_dict(build_dict, plugin_config=plugin_config) + + if use_lora_plugin is not None: + build_config.plugin_config.set_lora_plugin(use_lora_plugin) + lora_config = LoraBuildConfig( + lora_dir=lora_ckpt_list, + lora_ckpt_source='nemo', + max_lora_rank=max_lora_rank, + lora_target_modules=lora_target_modules, + ) + build_config.lora_config = lora_config + + model = model_cls.from_config(model_config) + model = optimize_model( + model, + use_parallel_embedding=model_config.use_parallel_embedding, + share_embedding_table=model_config.share_embedding_table, + ) + preprocess_weights(model_weights, model_config) + model.load(model_weights) + engine = build_trtllm(model, build_config) + engine.save(model_dir) + + return engine diff --git 
a/nemo/export/trt_llm/tensorrt_llm_run.py b/nemo/export/trt_llm/tensorrt_llm_run.py index c490f37e1fc4..92fc36272f7c 100644 --- a/nemo/export/trt_llm/tensorrt_llm_run.py +++ b/nemo/export/trt_llm/tensorrt_llm_run.py @@ -26,7 +26,7 @@ from tensorrt_llm.logger import logger from tensorrt_llm.lora_manager import LoraManager from tensorrt_llm.quantization import QuantMode -from tensorrt_llm.runtime import ModelConfig, SamplingConfig +from tensorrt_llm.runtime import ModelConfig, ModelRunnerCpp, SamplingConfig from transformers import PreTrainedTokenizer from nemo.export.trt_llm.tensor_utils import get_tensor_parallel_group @@ -55,7 +55,7 @@ class TensorrtLLMHostContext: class TensorrtLLMWorkerContext: """The MPI worker side context for TRT LLM inference.""" - decoder: tensorrt_llm.runtime.GenerationSession = None + decoder: ModelRunnerCpp = None sampling_config: SamplingConfig = None lora_manager: LoraManager = None max_batch_size: int = 0 @@ -135,42 +135,38 @@ def _load(tokenizer: PreTrainedTokenizer, engine_dir, lora_ckpt_list=None, num_b engine_dir = Path(engine_dir) config_path = engine_dir / "config.json" - model_config, world_size, tp_size, pp_size, dtype, max_input_len, max_batch_size = _read_config(config_path) + # model_config, world_size, tp_size, pp_size, dtype, max_input_len, max_batch_size = _read_config(config_path) - runtime_rank = tensorrt_llm.mpi_rank() + with open(config_path, "r") as f: + config = json.load(f) - assert runtime_rank < torch.cuda.device_count(), f"Rank {runtime_rank} out of bound" - runtime_mapping = tensorrt_llm.Mapping(world_size, runtime_rank, tp_size=tp_size, pp_size=pp_size) + max_batch_size = config["build_config"]["max_batch_size"] + max_input_len = config["build_config"]["max_input_len"] + max_output_len = config["build_config"]["max_output_len"] + max_beam_width = config["build_config"]["max_beam_width"] - torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node) - engine_name = get_engine_name(MODEL_NAME, dtype, tp_size, pp_size, runtime_rank) - serialize_path = os.path.join(engine_dir, engine_name) - logger.info(f"Reading from serialize path {serialize_path}") + runtime_rank = tensorrt_llm.mpi_rank() - with open(serialize_path, "rb") as f: - engine_buffer = f.read() - decoder = tensorrt_llm.runtime.GenerationSession( - model_config, engine_buffer, runtime_mapping, debug_mode=False + decoder = ModelRunnerCpp.from_dir( + engine_dir=engine_dir, + lora_dir=lora_ckpt_list, + lora_ckpt_source="nemo", + rank=runtime_rank, + max_batch_size=max_batch_size, + max_input_len=max_input_len, + max_output_len=max_output_len, + max_beam_width=max_beam_width, + debug_mode=False, ) sampling_config = SamplingConfig( end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id, num_beams=num_beams ) - if decoder.use_lora_plugin: - lora_manager = LoraManager() - if lora_ckpt_list is not None: - lora_manager.load_from_nemo( - model_files=lora_ckpt_list, model_config=model_config, runtime_mapping=runtime_mapping, - ) - else: - lora_manager = None - # Initialize the global context so it can be used during `run` API. 
global tensorrt_llm_worker_context tensorrt_llm_worker_context.decoder = decoder tensorrt_llm_worker_context.sampling_config = sampling_config - tensorrt_llm_worker_context.lora_manager = lora_manager tensorrt_llm_worker_context.max_batch_size = max_batch_size tensorrt_llm_worker_context.max_input_len = max_input_len @@ -207,7 +203,6 @@ def _forward( decoder = tensorrt_llm_worker_context.decoder assert decoder is not None, "Invalid worker context, decoder is not loaded." sampling_config = tensorrt_llm_worker_context.sampling_config - lora_manager = tensorrt_llm_worker_context.lora_manager max_batch_size = tensorrt_llm_worker_context.max_batch_size max_input_len = tensorrt_llm_worker_context.max_input_len @@ -217,60 +212,36 @@ def _forward( max_length = max(input_lengths) assert max_length <= max_input_len, f"input length {max_length} exceedng max input length {max_input_len}" pad_id = sampling_config.pad_id - - if decoder.remove_input_padding: - line_encoded = torch.concat(input_tensors).cuda() - else: - line_encoded = torch.nested.to_padded_tensor( - torch.nested.nested_tensor(input_tensors, dtype=torch.int32), pad_id - ).cuda() - - input_lengths = torch.tensor(input_lengths, dtype=torch.int32).cuda() - - if prompt_table is None: - ptuning_args = [] - else: - if task_vocab_size is None: - raise Exception("task_vocab_size cannot be None") - - task_vocab_size = torch.tensor([task_vocab_size], dtype=torch.int32, device="cuda") - task_ids = torch.tensor(task_ids, dtype=torch.int32, device="cuda") - prompt_table = prompt_table.cuda() - ptuning_args = [prompt_table, task_ids, task_vocab_size] + end_id = sampling_config.end_id + num_beams = sampling_config.num_beams with torch.no_grad(): - sampling_config.top_k = top_k - sampling_config.top_p = top_p - sampling_config.temperature = temperature - for key, param in sampling_kwargs.items(): - # set any additional SamplingConfig kwargs - setattr(sampling_config, key, param) - - decoder.setup( - batch_size, - max_context_length=max_length, - max_new_tokens=max_output_len, - lora_manager=lora_manager, - lora_uids=lora_uids, - ) + prompt_tasks = None if task_ids is None else ",".join(str(task) for task in task_ids) - outputs = decoder.decode( - line_encoded, - input_lengths, - sampling_config, - *ptuning_args, + outputs = decoder.generate( + input_tensors, + max_new_tokens=max_output_len, + end_id=end_id, + pad_id=pad_id, + temperature=temperature, + top_k=top_k, + top_p=top_p, + num_beams=num_beams, stop_words_list=stop_words_list, bad_words_list=bad_words_list, - no_repeat_ngram_size=no_repeat_ngram_size, + lora_uids=lora_uids, + prompt_table=prompt_table, + prompt_tasks=prompt_tasks, streaming=streaming, output_sequence_lengths=True, return_dict=True, ) + torch.cuda.synchronize() runtime_rank = tensorrt_llm.mpi_rank() if runtime_rank == 0 or multiprocessed_env: - return outputs, decoder.log_probs + return outputs else: return None @@ -290,10 +261,14 @@ def load( config_path = os.path.join(engine_dir, "config.json") with open(config_path, "r") as f: config = json.load(f) - world_size = config["builder_config"]["world_size"] + world_size = config["pretrained_config"]["mapping"]["world_size"] if world_size == 1: _load(tokenizer, engine_dir, lora_ckpt_list, num_beams) executor = None + elif tensorrt_llm.mpi_world_size() > 1: + _load(tokenizer, engine_dir, lora_ckpt_list, num_beams) + executor = None + tensorrt_llm.mpi_barrier() else: executor = MPIPoolExecutor(max_workers=world_size) futures = [] @@ -303,9 +278,9 @@ def load( for future in futures: 
future.result() - max_batch_size = config["builder_config"]["max_batch_size"] - max_input_len = config["builder_config"]["max_input_len"] - add_bos = config["builder_config"]["add_bos"] + max_batch_size = config["build_config"]["max_batch_size"] + max_input_len = config["build_config"]["max_input_len"] + add_bos = True if config["pretrained_config"]["architecture"] == "GemmaForCausalLM" else False return TensorrtLLMHostContext( executor=executor, @@ -355,7 +330,10 @@ def load_refit( # Manipulate the tensorrt_llm mapping to make it compatible with the multiprocessed env. assert tensorrt_llm.mpi_world_size() == torch.distributed.get_world_size(), "MPI world size mismatch" runtime_mapping = tensorrt_llm.Mapping( - world_size=tensorrt_llm.mpi_world_size(), rank=runtime_rank, tp_size=tensorrt_llm.mpi_world_size(), pp_size=1, + world_size=tensorrt_llm.mpi_world_size(), + rank=runtime_rank, + tp_size=tensorrt_llm.mpi_world_size(), + pp_size=1, ) engine_name = get_engine_name( @@ -386,7 +364,9 @@ def load_refit( lora_manager = LoraManager() if lora_ckpt_list is not None: lora_manager.load_from_nemo( - model_files=lora_ckpt_list, model_config=model_config, runtime_mapping=runtime_mapping, + model_files=lora_ckpt_list, + model_config=model_config, + runtime_mapping=runtime_mapping, ) else: lora_manager = None @@ -576,7 +556,7 @@ def generate( if no_repeat_ngram_size is not None: no_repeat_ngram_size = torch.IntTensor(no_repeat_ngram_size).to(torch.cuda.current_device()) - outputs, log_probs = forward( + outputs = forward( input_tensors=input_tensors, max_output_len=max_output_len, host_context=host_context, @@ -596,6 +576,8 @@ def generate( **sampling_kwargs, ) assert outputs is not None + if tensorrt_llm.mpi_rank() != 0: + return None output_ids = outputs['output_ids'] sequence_lengths = outputs['sequence_lengths'] @@ -656,7 +638,7 @@ def generate_streaming( if no_repeat_ngram_size is not None: no_repeat_ngram_size = torch.IntTensor(no_repeat_ngram_size).to(torch.cuda.current_device()) - outputs, log_probs = forward( + outputs = forward( input_tensors=input_tensors, max_output_len=max_output_len, host_context=host_context, diff --git a/scripts/export/export_to_trt_llm.py b/scripts/export/export_to_trt_llm.py index 9798473dd880..5e5833444f65 100644 --- a/scripts/export/export_to_trt_llm.py +++ b/scripts/export/export_to_trt_llm.py @@ -78,7 +78,6 @@ def get_args(argv): '--use_lora_plugin', nargs='?', const=None, - default=False, choices=['float16', 'float32', 'bfloat16'], help="Activates the lora plugin which enables embedding sharing.", ) @@ -86,7 +85,16 @@ def get_args(argv): '--lora_target_modules', nargs='+', default=None, - choices=["attn_qkv", "attn_q", "attn_k", "attn_v", "attn_dense", "mlp_h_to_4h", "mlp_gate", "mlp_4h_to_h",], + choices=[ + "attn_qkv", + "attn_q", + "attn_k", + "attn_v", + "attn_dense", + "mlp_h_to_4h", + "mlp_gate", + "mlp_4h_to_h", + ], help="Add lora in which modules. 
Only be activated when use_lora_plugin is enabled.", ) parser.add_argument( From b1628cf231eff0ca96a94b5f840b0dcbb7f2d667 Mon Sep 17 00:00:00 2001 From: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Date: Mon, 13 May 2024 22:35:05 -0500 Subject: [PATCH 08/36] Alit/optim 8k (#9166) * fix fuser issue with dynamo * optimized 4k seq len * optim 8k * add checkpointing * add ckpt arg * fix minor bug * minor fix * more optimized chkpting * Apply isort and black reformatting Signed-off-by: JRD971000 * addressing comments * Apply isort and black reformatting Signed-off-by: JRD971000 --------- Signed-off-by: JRD971000 Co-authored-by: Ali Taghibakhshi Co-authored-by: JRD971000 --- .../conf/megatron_griffin_config.yaml | 1 + .../megatron_griffin_finetuning_config.yaml | 1 + .../megatron_griffin_generate_config.yaml | 2 +- .../megatron/griffin/griffin_block.py | 165 +++++++++++++-- .../megatron/griffin/griffin_model.py | 26 ++- .../megatron/griffin/recurrent_layer.py | 20 +- .../megatron/griffin/recurrent_module.py | 194 ++++++++++++------ .../megatron_griffin_model.py | 20 +- 8 files changed, 324 insertions(+), 105 deletions(-) diff --git a/examples/nlp/language_modeling/conf/megatron_griffin_config.yaml b/examples/nlp/language_modeling/conf/megatron_griffin_config.yaml index c080ff846ba1..1d3620493162 100644 --- a/examples/nlp/language_modeling/conf/megatron_griffin_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_griffin_config.yaml @@ -108,6 +108,7 @@ model: # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. # 'full' will checkpoint the entire transformer layer. activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_recurrent: False # If set to True, the checkpointing is only done for rglru and conv1d and not for attention and mlp layers activations_checkpoint_method: null # 'uniform', 'block' # 'uniform' divides the total number of transformer layers and checkpoints the input activation # of each chunk at the specified granularity. When used with 'selective', 'uniform' checkpoints all attention blocks in the model. diff --git a/examples/nlp/language_modeling/conf/megatron_griffin_finetuning_config.yaml b/examples/nlp/language_modeling/conf/megatron_griffin_finetuning_config.yaml index e144c784fb0c..f92f971eb059 100644 --- a/examples/nlp/language_modeling/conf/megatron_griffin_finetuning_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_griffin_finetuning_config.yaml @@ -117,6 +117,7 @@ model: # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. # 'full' will checkpoint the entire transformer layer. activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_recurrent: False # If set to True, the checkpointing is only done for rglru and conv1d and not for attention and mlp layers activations_checkpoint_method: null # 'uniform', 'block' activations_checkpoint_method: null # 'uniform', 'block' # 'uniform' divides the total number of transformer layers and checkpoints the input activation # of each chunk at the specified granularity. When used with 'selective', 'uniform' checkpoints all attention blocks in the model. 
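The activations_checkpoint_recurrent flag added to these configs restricts activation recomputation to the recurrent pieces of each Griffin block (the RG-LRU and the causal conv1d) instead of checkpointing whole layers; an assertion later in this patch makes it mutually exclusive with full/custom recompute granularity. As a rough illustrative sketch only, using plain torch.utils.checkpoint and hypothetical submodule names rather than the actual NeMo/Megatron classes, the behaviour amounts to:

import torch
from torch.utils.checkpoint import checkpoint

class RecurrentBlockSketch(torch.nn.Module):
    # Toy stand-in for a Griffin recurrent block; conv_1d, rg_lru and mlp are
    # placeholder callables, not the real NeMo modules.
    def __init__(self, conv_1d, rg_lru, mlp, checkpoint_recurrent: bool = True):
        super().__init__()
        self.conv_1d = conv_1d
        self.rg_lru = rg_lru
        self.mlp = mlp
        self.checkpoint_recurrent = checkpoint_recurrent

    def forward(self, x, segment_pos):
        if self.checkpoint_recurrent and self.training:
            # Recompute only the memory-hungry recurrent ops in the backward pass.
            x = checkpoint(self.conv_1d, x, segment_pos, use_reentrant=False)
            x = checkpoint(self.rg_lru, x, segment_pos, use_reentrant=False)
        else:
            x = self.conv_1d(x, segment_pos)
            x = self.rg_lru(x, segment_pos)
        # Attention and MLP stay un-checkpointed in this mode.
        return self.mlp(x)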
diff --git a/examples/nlp/language_modeling/conf/megatron_griffin_generate_config.yaml b/examples/nlp/language_modeling/conf/megatron_griffin_generate_config.yaml index b09cce5671c9..e22b615d48aa 100644 --- a/examples/nlp/language_modeling/conf/megatron_griffin_generate_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_griffin_generate_config.yaml @@ -121,7 +121,7 @@ model: # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. # 'full' will checkpoint the entire transformer layer. activations_checkpoint_granularity: null # 'selective' or 'full' - activations_checkpoint_method: null # 'uniform', 'block' + activations_checkpoint_recurrent: False # If set to True, the checkpointing is only done for rglru and conv1d and not for attention and mlp layers activations_checkpoint_method: null # 'uniform', 'block' # 'uniform' divides the total number of transformer layers and checkpoints the input activation # of each chunk at the specified granularity. When used with 'selective', 'uniform' checkpoints all attention blocks in the model. # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity diff --git a/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_block.py b/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_block.py index 3fc26a51f3c1..d8954ad1b3c3 100755 --- a/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_block.py +++ b/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_block.py @@ -11,17 +11,26 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
- -from megatron.core.models.common.language_module.language_module import LanguageModule -from megatron.core.transformer.custom_layers.transformer_engine import TENorm -from megatron.core.transformer.spec_utils import build_module -from megatron.core.transformer.transformer_config import TransformerConfig -from torch import nn - +from torch import Tensor, nn from nemo.collections.nlp.models.language_modeling.megatron.griffin.griffin_layer_spec import ( griffin_mqa_layer_with_transformer_engine_spec, griffin_recurrent_layer_with_transformer_engine_spec, ) +from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults + +try: + from megatron.core import parallel_state, tensor_parallel + from megatron.core.models.common.language_module.language_module import LanguageModule + from megatron.core.packed_seq_params import PackedSeqParams + from megatron.core.transformer.custom_layers.transformer_engine import TENorm, te_checkpoint + from megatron.core.transformer.spec_utils import build_module + from megatron.core.transformer.transformer_config import TransformerConfig + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + TransformerConfig = ApexGuardDefaults + HAVE_MEGATRON_CORE = False def get_griffin_layers(num_layers): @@ -41,16 +50,22 @@ def get_griffin_layers(num_layers): def create_block( - config, layer_spec, layer_idx, + config, + layer_spec, + layer_idx, ): - block = build_module(layer_spec, config,) + block = build_module( + layer_spec, + config, + ) block.layer_number = layer_idx + 1 return block class GriffinStack(LanguageModule): def __init__( - self, config: TransformerConfig, + self, + config: TransformerConfig, ): super().__init__(config) @@ -58,17 +73,139 @@ def __init__( self.griffin_layers = get_griffin_layers(self.config.num_layers) self.layers = nn.ModuleList( - [create_block(self.config, layer_spec, layer_idx=i,) for i, layer_spec in enumerate(self.griffin_layers)] + [ + create_block( + self.config, + layer_spec, + layer_idx=i, + ) + for i, layer_spec in enumerate(self.griffin_layers) + ] ) self.final_layernorm = TENorm( - config=self.config, hidden_size=self.config.hidden_size, eps=self.config.layernorm_epsilon, + config=self.config, + hidden_size=self.config.hidden_size, + eps=self.config.layernorm_epsilon, ) + self.num_layers = len(self.layers) + + def _get_layer(self, layer_number: int): + return self.layers[layer_number] + + def _checkpointed_forward( + self, + hidden_states: Tensor, + attention_mask: Tensor, + context: Tensor = None, + context_mask: Tensor = None, + rotary_pos_emb: Tensor = None, + packed_seq_params: PackedSeqParams = None, + ): + """Forward method with activation checkpointing.""" + + def custom(start: int, end: int): + def custom_forward( + hidden_states, + attention_mask, + context, + context_mask, + rotary_pos_emb, + packed_seq_params, + ): + for index in range(start, end): + layer = self._get_layer(index) + hidden_states, context = layer( + hidden_states=hidden_states, + attention_mask=attention_mask, + context=context, + context_mask=context_mask, + rotary_pos_emb=rotary_pos_emb, + inference_params=None, + packed_seq_params=packed_seq_params, + ) + return hidden_states, context + + return custom_forward + + def checkpoint_handler(forward_func): + if self.config.fp8: + return te_checkpoint( + forward_func, + self.config.distribute_saved_activations, + tensor_parallel.random.get_cuda_rng_tracker, + parallel_state.get_tensor_model_parallel_group(), + hidden_states, + attention_mask, + context, + 
context_mask, + rotary_pos_emb, + packed_seq_params, + ) + else: + return tensor_parallel.checkpoint( + forward_func, + self.config.distribute_saved_activations, + hidden_states, + attention_mask, + context, + context_mask, + rotary_pos_emb, + packed_seq_params, + ) + + if self.config.recompute_method == 'uniform': + # Uniformly divide the total number of Transformer layers and checkpoint + # the input activation of each divided chunk. + # A method to further reduce memory usage reducing checkpoints. + l = 0 + while l < self.num_layers: + hidden_states, context = checkpoint_handler(custom(l, l + self.config.recompute_num_layers)) + + l += self.config.recompute_num_layers + + elif self.config.recompute_method == 'block': + # Checkpoint the input activation of only a set number of individual + # Transformer layers and skip the rest. + # A method fully use the device memory removing redundant re-computation. + recompute_skip_num_layers = 0 + for l in range(self.num_layers): + # Skip recomputation when input grad computation is not needed. + # Need to have at least one input tensor with gradient computation + # for re-enterant autograd engine. + if self.config.fp8 and not hidden_states.requires_grad: + recompute_skip_num_layers += 1 + if l >= recompute_skip_num_layers and l < self.config.recompute_num_layers + recompute_skip_num_layers: + hidden_states, context = checkpoint_handler(custom(l, l + 1)) + else: + hidden_states, context = custom(l, l + 1)( + hidden_states, + attention_mask, + context, + context_mask, + rotary_pos_emb, + packed_seq_params, + ) + else: + raise ValueError("Invalid activation recompute method.") + + return hidden_states def forward(self, hidden_states, attention_mask, rotary_pos_emb): - for layer in self.layers: + if ( + self.config.recompute_granularity == 'full' + and self.training + and not self.config.activations_checkpoint_recurrent + ): + hidden_states = self._checkpointed_forward( + hidden_states=hidden_states, + attention_mask=attention_mask, + rotary_pos_emb=rotary_pos_emb, + ) + else: + for layer in self.layers: - hidden_states, _ = layer(hidden_states, attention_mask=attention_mask, rotary_pos_emb=rotary_pos_emb) + hidden_states, _ = layer(hidden_states, attention_mask=attention_mask, rotary_pos_emb=rotary_pos_emb) hidden_states = self.final_layernorm(hidden_states) diff --git a/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_model.py b/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_model.py index 4531b64d1d96..7a327a3a35cb 100755 --- a/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron/griffin/griffin_model.py @@ -13,15 +13,23 @@ # limitations under the License. 
import math - import torch -from megatron.core import tensor_parallel -from megatron.core.jit import jit_fuser -from megatron.core.models.common.embeddings.language_model_embedding import LanguageModelEmbedding -from megatron.core.models.common.embeddings.rotary_pos_embedding import RotaryEmbedding -from megatron.core.models.common.language_module.language_module import LanguageModule -from megatron.core.transformer.transformer_config import TransformerConfig -from torch import Tensor, nn +from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults + +try: + from megatron.core import tensor_parallel + from megatron.core.jit import jit_fuser + from megatron.core.models.common.embeddings.language_model_embedding import LanguageModelEmbedding + from megatron.core.models.common.embeddings.rotary_pos_embedding import RotaryEmbedding + from megatron.core.models.common.language_module.language_module import LanguageModule + from megatron.core.transformer.transformer_config import TransformerConfig + from torch import Tensor, nn + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + TransformerConfig = ApexGuardDefaults + HAVE_MEGATRON_CORE = False from nemo.collections.nlp.models.language_modeling.megatron.griffin.griffin_block import GriffinStack @@ -142,7 +150,7 @@ def forward( position_ids: Tensor = None, attention_mask: Tensor = None, labels: Tensor = None, - **extra_arg + **extra_arg, ): if input_ids is None: input_ids = self.input_tensor diff --git a/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_layer.py b/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_layer.py index 8263f54889a0..3a33f8966fd2 100755 --- a/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_layer.py +++ b/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_layer.py @@ -14,13 +14,21 @@ from dataclasses import dataclass from typing import Union - -from megatron.core.transformer.identity_op import IdentityFuncOp, IdentityOp -from megatron.core.transformer.module import MegatronModule -from megatron.core.transformer.spec_utils import ModuleSpec, build_module -from megatron.core.transformer.transformer_config import TransformerConfig -from megatron.core.utils import make_viewless_tensor from torch import Tensor +from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults + +try: + from megatron.core.transformer.identity_op import IdentityFuncOp, IdentityOp + from megatron.core.transformer.module import MegatronModule + from megatron.core.transformer.spec_utils import ModuleSpec, build_module + from megatron.core.transformer.transformer_config import TransformerConfig + from megatron.core.utils import make_viewless_tensor + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + TransformerConfig = ApexGuardDefaults + HAVE_MEGATRON_CORE = False @dataclass diff --git a/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_module.py b/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_module.py index d91c07718917..033d3abec732 100755 --- a/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_module.py +++ b/nemo/collections/nlp/models/language_modeling/megatron/griffin/recurrent_module.py @@ -17,33 +17,50 @@ from typing import Union import torch +import torch._dynamo from accelerated_scan.triton import scan from causal_conv1d import causal_conv1d_fn from einops import rearrange -from 
megatron.core.fusions.fused_bias_gelu import bias_gelu_impl -from megatron.core.jit import jit_fuser -from megatron.core.transformer.identity_op import IdentityOp -from megatron.core.transformer.module import MegatronModule -from megatron.core.transformer.spec_utils import ModuleSpec, build_module -from megatron.core.transformer.transformer_config import TransformerConfig from torch import nn +from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults + +try: + from megatron.core import tensor_parallel + from megatron.core.fusions.fused_bias_gelu import bias_gelu_impl + from megatron.core.jit import jit_fuser + from megatron.core.transformer.identity_op import IdentityOp + from megatron.core.transformer.module import MegatronModule + from megatron.core.transformer.spec_utils import ModuleSpec, build_module + from megatron.core.transformer.transformer_config import TransformerConfig + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + TransformerConfig = ApexGuardDefaults + HAVE_MEGATRON_CORE = False + +torch._dynamo.config.suppress_errors = True + # Class copied from https://github.com/google-deepmind/recurrentgemma class BlockDiagonalLinear(nn.Module): """Block-diagonal linear layer.""" def __init__( - self, width: int, num_blocks: int, w_init_variance_scale: float = 1.0, + self, + width: int, + num_blocks: int, + w_init_variance_scale: float = 1.0, ): """Initializes the BlockDiagonalLinear. - Args: - width: The number of dimensions of the input and output. - num_blocks: The number of diagonal blocks in the layer. - w_init_variance_scale: A parameters that scales the variance of the - initialization of the weights. - """ + Args: + width: The number of dimensions of the input and output. + num_blocks: The number of diagonal blocks in the layer. + w_init_variance_scale: A parameters that scales the variance of the + initialization of the weights. + """ super().__init__() self.width = width self.num_blocks = num_blocks @@ -62,25 +79,46 @@ def w_init_(self, w: torch.Tensor) -> None: std = math.sqrt(self.w_init_variance_scale / self.block_width) torch.nn.init.normal_(w, mean=0.0, std=std) - def forward(self, x): - """Calls the BlockDiagonalLinear.""" - # Split x to blocks. - bs, seq_l = x.shape[0], x.shape[1] + @jit_fuser + def _fused_pre_reshape_(self, x, bs, seq_l): x = ( x.reshape(bs, seq_l, self.num_blocks, self.block_width) .permute(2, 0, 1, 3) .reshape(self.num_blocks, bs * seq_l, self.block_width) ) - x = (torch.bmm(x, self.w).permute(1, 0, 2) + self.b).reshape(bs, seq_l, self.num_blocks * self.block_width) - out = torch.sigmoid(x) - return out + return x + + @jit_fuser + def _post_add_reshape_sigmoid_(self, x, bs, seq_l): + x = (x.permute(1, 0, 2) + self.b).reshape(bs, seq_l, self.num_blocks * self.block_width) + x = torch.sigmoid(x) + return x + + def forward(self, x): + """Calls the BlockDiagonalLinear.""" + # Split x to blocks. 
+ bs, seq_l = x.shape[0], x.shape[1] + x = self._fused_pre_reshape_(x, bs, seq_l) + + x = torch.bmm(x, self.w) + x = self._post_add_reshape_sigmoid_(x, bs, seq_l) + + return x # Class copied from https://github.com/google-deepmind/recurrentgemma @jit_fuser -def _scan_preprocess_(a, x, reset): +def _scan_preprocess_(x, gate_a, gate_x, reset, a_params): + + log_a = -8.0 * gate_a * nn.functional.softplus(a_params) + a = torch.exp(log_a) + gated_x = x * gate_x + multiplier = torch.sqrt((1 - torch.exp(2 * log_a)) + 1e-6) + multiplier = reset + (1 - reset) * multiplier + x = gated_x * multiplier.type(x.dtype) + assert x.ndim == 3 assert a.shape == x.shape[-a.ndim :] assert a.dtype == x.dtype @@ -94,38 +132,54 @@ def _scan_preprocess_(a, x, reset): a = a.permute(0, 2, 1) x = x.contiguous() a = a.contiguous() + return a, x def rnn_scan( - x, a, reset, + x, + gate_a, + gate_x, + reset, + a_params, + # x, a, reset, ): """Runs the recurrence of a linear RNN. - Args: - x: The input sequence. - a: The diagonal of the recurrence matrix `A`. - reset: Indicator of document boundaries, e.g. when to reset the hidden - state of the RNN. - h0: The initial hidden state. - - Returns: - The output of the linear recurrence. - """ - a, x = _scan_preprocess_(a, x, reset) + Args: + x: The input sequence. + a: The diagonal of the recurrence matrix `A`. + reset: Indicator of document boundaries, e.g. when to reset the hidden + state of the RNN. + h0: The initial hidden state. + + Returns: + The output of the linear recurrence. + """ + + a, x = _scan_preprocess_(x, gate_a, gate_x, reset, a_params) + y = scan(a.float(), x.float()).type_as(x) + y = y.permute(0, 2, 1) + return y, None # Class copied from https://github.com/google-deepmind/recurrentgemma -def rnn_param_init(*, width: int, min_rad: float, max_rad: float, transform: str = "softplus",) -> torch.Tensor: +def rnn_param_init( + *, + width: int, + min_rad: float, + max_rad: float, + transform: str = "softplus", +) -> torch.Tensor: """Initializes the `A` parameter of the RG-LRU uniformly on a ring.""" unif = torch.rand(width) # Proportional to area in a ring. - a_real = 0.5 * torch.log(unif * (max_rad ** 2 - min_rad ** 2) + min_rad ** 2 + 1e-8) + a_real = 0.5 * torch.log(unif * (max_rad**2 - min_rad**2) + min_rad**2 + 1e-8) if transform == "softplus": # Inverse transform. @@ -141,17 +195,20 @@ class RGLRU(nn.Module): """A Real-Gated Linear Recurrent Unit (RG-LRU) layer.""" def __init__( - self, width: int, num_heads: int, w_init_variance_scale: float = 1.0, + self, + width: int, + num_heads: int, + w_init_variance_scale: float = 1.0, ): """Initializes the RG-LRU. - Args: - width: The number of dimensions of the input and output. - num_heads: The number of diagonal blocks in the input and A gate layers. - w_init_variance_scale: Initialization parameter for the - BlockDiagonalLinear layers of the gates. See the `BlockDiagonalLinear` - layer for details. - """ + Args: + width: The number of dimensions of the input and output. + num_heads: The number of diagonal blocks in the input and A gate layers. + w_init_variance_scale: Initialization parameter for the + BlockDiagonalLinear layers of the gates. See the `BlockDiagonalLinear` + layer for details. + """ super().__init__() self.width = width self.num_heads = num_heads @@ -160,7 +217,9 @@ def __init__( # Parameters and layers. 
self.a_param = nn.Parameter(self.a_param_init) self.input_gate = BlockDiagonalLinear( - width=self.width, num_blocks=self.num_heads, w_init_variance_scale=w_init_variance_scale, + width=self.width, + num_blocks=self.num_heads, + w_init_variance_scale=w_init_variance_scale, ) self.a_gate = BlockDiagonalLinear( width=self.width, num_blocks=self.num_heads, w_init_variance_scale=self.w_init_variance_scale @@ -184,18 +243,22 @@ def _fused_pst_gates_(self, x, gate_a, gate_x, reset): return normalized_x, a def __call__( - self, x, segment_pos, prev_h, + self, + x, + segment_pos, + prev_h, ): """Calls the RG-LRU. - Args: - x: Sequence of input activations. - segment_pos: Position of each token in the sequence. - prev_h: The previous hidden state of the RG-LRU. + Args: + x: Sequence of input activations. + segment_pos: Position of each token in the sequence. + prev_h: The previous hidden state of the RG-LRU. + + Returns: + Output of the block together with the updated hidden state. + """ - Returns: - Output of the block together with the updated hidden state. - """ for param in self.parameters(): param.data_ptr() @@ -207,9 +270,7 @@ def __call__( gate_x = self.input_gate(x) gate_a = self.a_gate(x) - # Compute the parameter `A` of the recurrence. - normalized_x, a = self._fused_pst_gates_(x, gate_a, gate_x, reset) - y, last_h = rnn_scan(x=normalized_x, a=a, reset=reset) + y, last_h = rnn_scan(x, gate_a, gate_x, reset, self.a_param) return y, last_h @@ -230,11 +291,17 @@ def __init__(self, config, width, temporal_width): ) def forward( - self, x, segment_pos=None, prev_x=None, + self, + x, + segment_pos=None, + prev_x=None, ): x = x.permute(0, 2, 1) output = causal_conv1d_fn( - x=x, weight=rearrange(self.conv_1d.weight, "d 1 w -> d w"), bias=self.conv_1d.bias, activation=None, + x=x, + weight=rearrange(self.conv_1d.weight, "d 1 w -> d w"), + bias=self.conv_1d.bias, + activation=None, ).permute(0, 2, 1) return output, None @@ -314,6 +381,11 @@ def __init__( submodules.rg_lru, width=self.config.hidden_size, num_heads=self.config.num_attention_heads ) + def checkpoint_handler(self, forward_func, x, segment_pos, prev_x): + return tensor_parallel.checkpoint( + forward_func, self.config.distribute_saved_activations, x, segment_pos, prev_x + ) + def forward(self, hidden_states, attention_mask=None, rotary_pos_emb=None): segment_pos = torch.arange(hidden_states.shape[0]).unsqueeze(0).repeat(hidden_states.shape[1], 1).cuda() @@ -326,9 +398,13 @@ def forward(self, hidden_states, attention_mask=None, rotary_pos_emb=None): x = _fused_permute_add_(x_intermidiate_parallel, x_bias_parallel) - x, _ = self.conv_1d(x=x, segment_pos=segment_pos, prev_x=None) + if self.config.activations_checkpoint_recurrent and self.training: + x, _ = self.checkpoint_handler(self.conv_1d, x=x, segment_pos=segment_pos, prev_x=None) + x, _ = self.checkpoint_handler(self.rg_lru, x=x, segment_pos=segment_pos, prev_x=None) - x, _ = self.rg_lru(x=x, segment_pos=segment_pos, prev_h=None,) + else: + x, _ = self.conv_1d(x=x, segment_pos=segment_pos, prev_x=None) + x, _ = self.rg_lru(x=x, segment_pos=segment_pos, prev_h=None) x = _fused_permute_mult_(x, y) diff --git a/nemo/collections/nlp/models/language_modeling/megatron_griffin_model.py b/nemo/collections/nlp/models/language_modeling/megatron_griffin_model.py index 20ad376b8f98..1e5a2f0c15c0 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_griffin_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_griffin_model.py @@ -18,15 +18,6 @@ from 
nemo.collections.nlp.models.language_modeling.megatron.griffin.griffin_model import GriffinModel from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel -from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults - -try: - - HAVE_MEGATRON_CORE = True - -except (ImportError, ModuleNotFoundError): - TransformerConfig = ApexGuardDefaults - HAVE_MEGATRON_CORE = False class MegatronGriffinModel(MegatronGPTModel): @@ -35,13 +26,6 @@ class MegatronGriffinModel(MegatronGPTModel): """ def __init__(self, cfg: DictConfig, trainer: Trainer): - if not HAVE_MEGATRON_CORE: - raise ImportError( - "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." - ) - - # build the transformer config - # TODO: add type hint once pip package is out self.vocab_size = cfg.get('vocab_size', 256000) self.cfg = cfg @@ -70,8 +54,12 @@ def forward(self, input_ids, position_ids=None, attention_mask=None, labels=None def build_transformer_config(self): transformer_config = super().build_transformer_config() + transformer_config.activations_checkpoint_recurrent = self.cfg.get('activations_checkpoint_recurrent', False) transformer_config.gated_linear_unit = self.cfg.get('gated_linear_unit', True) transformer_config.layernorm_zero_centered_gamma = self.cfg.get('layernorm_zero_centered_gamma', True) + assert ( + not transformer_config.activations_checkpoint_recurrent or not transformer_config.recompute_granularity + ), "Either the recurrent checkpoiting or the full/custom checkpointing should be set" return transformer_config From 93907f000dbaeed899556c5ae224557172233412 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Piotr=20=C5=BBelasko?= Date: Tue, 14 May 2024 12:01:13 -0400 Subject: [PATCH 09/36] Bucketing duration bins: less optimal but instant init when not provided + fixes in estimation script (#9157) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Bucketing duration bins: less optimal but instant init when not provided + fixes in estimation script Signed-off-by: Piotr Żelasko * Fix CPU mem hungriness Signed-off-by: Piotr Żelasko * Make estimate duration bins work for every kind of manifest Signed-off-by: Piotr Żelasko * Support more type of inputs Signed-off-by: Piotr Żelasko * fixes Signed-off-by: Piotr Żelasko * msg Signed-off-by: Piotr Żelasko * fix Signed-off-by: Piotr Żelasko * fix Signed-off-by: Piotr Żelasko * Apply isort and black reformatting Signed-off-by: pablo-garay --------- Signed-off-by: Piotr Żelasko Signed-off-by: pablo-garay Co-authored-by: Pablo Garay Co-authored-by: pablo-garay --- nemo/collections/common/data/lhotse/cutset.py | 84 ++++++++++++++----- .../common/data/lhotse/dataloader.py | 62 ++++++++++++-- .../common/data/lhotse/nemo_adapters.py | 32 ++++--- .../convert_to_tarred_audio_dataset.py | 2 +- .../estimate_duration_bins.py | 53 ++++++++---- 5 files changed, 177 insertions(+), 56 deletions(-) diff --git a/nemo/collections/common/data/lhotse/cutset.py b/nemo/collections/common/data/lhotse/cutset.py index cb2efe0312d2..775395400d8e 100644 --- a/nemo/collections/common/data/lhotse/cutset.py +++ b/nemo/collections/common/data/lhotse/cutset.py @@ -127,7 +127,7 @@ def read_dataset_config(config) -> tuple[CutSet, bool]: "shard_seed": config.shard_seed, "text_field": config.text_field, "lang_field": config.lang_field, - "missing_sampling_rate_ok": config.missing_sampling_rate_ok, + "metadata_only": config.metadata_only, 
"max_open_streams": config.max_open_streams, } input_cfg = config.input_cfg @@ -164,7 +164,10 @@ def parse_group(grp_cfg: DictConfig, propagate_attrs: dict) -> [CutSet, bool]: is_tarred = True cuts = read_txt_pair_paths(grp_cfg) elif grp_cfg.type == "group": - cuts, is_tarred = parse_and_combine_datasets(grp_cfg.input_cfg, propagate_attrs=propagate_attrs,) + cuts, is_tarred = parse_and_combine_datasets( + grp_cfg.input_cfg, + propagate_attrs=propagate_attrs, + ) else: raise ValueError(f"Unrecognized group: {grp_cfg.type}") # Attach extra tags to every utterance dynamically, if provided. @@ -176,7 +179,10 @@ def parse_group(grp_cfg: DictConfig, propagate_attrs: dict) -> [CutSet, bool]: def read_txt_paths(config: DictConfig) -> CutSet: return CutSet( LhotseTextAdapter( - paths=config.paths, language=config.language, shuffle_shards=config.shuffle, shard_seed=config.shard_seed, + paths=config.paths, + language=config.language, + shuffle_shards=config.shuffle, + shard_seed=config.shard_seed, ) ).repeat() @@ -238,6 +244,7 @@ def parse_and_combine_datasets( weights=weights if weights else None, max_open_streams=propagate_attrs["max_open_streams"], seed=propagate_attrs["shard_seed"], + metadata_only=propagate_attrs["metadata_only"], ) else: (cuts,) = cuts @@ -261,11 +268,16 @@ def read_lhotse_manifest(config, is_tarred: bool) -> CutSet: # - integer means we'll set a specific seed in every worker, and data would be duplicated across them. # This is mostly useful for unit testing or debugging. shard_seed = config.shard_seed + metadata_only = config.metadata_only if config.get("cuts_path") is not None: warnings.warn("Note: lhotse.cuts_path will be ignored because lhotse.shar_path was provided.") if isinstance(config.shar_path, (str, Path)): logging.info(f"Initializing Lhotse Shar CutSet (tarred) from a single data source: '{config.shar_path}'") - cuts = CutSet.from_shar(in_dir=config.shar_path, shuffle_shards=True, seed=shard_seed).repeat() + cuts = CutSet.from_shar( + **_resolve_shar_inputs(config.shar_path, metadata_only), shuffle_shards=True, seed=shard_seed + ) + if not metadata_only: + cuts = cuts.repeat() else: # Multiple datasets in Lhotse Shar format: we will dynamically multiplex them # with probability approximately proportional to their size @@ -278,7 +290,9 @@ def read_lhotse_manifest(config, is_tarred: bool) -> CutSet: for item in config.shar_path: if isinstance(item, (str, Path)): path = item - cs = CutSet.from_shar(in_dir=path, shuffle_shards=True, seed=shard_seed) + cs = CutSet.from_shar( + **_resolve_shar_inputs(path, metadata_only), shuffle_shards=True, seed=shard_seed + ) weight = len(cs) else: assert isinstance(item, Sequence) and len(item) == 2 and isinstance(item[1], (int, float)), ( @@ -288,11 +302,19 @@ def read_lhotse_manifest(config, is_tarred: bool) -> CutSet: f"We got: '{item}'" ) path, weight = item - cs = CutSet.from_shar(in_dir=path, shuffle_shards=True, seed=shard_seed) + cs = CutSet.from_shar( + **_resolve_shar_inputs(path, metadata_only), shuffle_shards=True, seed=shard_seed + ) logging.info(f"- {path=} {weight=}") - cutsets.append(cs.repeat()) + cutsets.append(cs) weights.append(weight) - cuts = mux(*cutsets, weights=weights, max_open_streams=config.max_open_streams, seed=config.shard_seed) + cuts = mux( + *cutsets, + weights=weights, + max_open_streams=config.max_open_streams, + seed=config.shard_seed, + metadata_only=metadata_only, + ) else: # Regular Lhotse manifest points to individual audio files (like native NeMo manifest). 
path = config.cuts_path @@ -300,6 +322,13 @@ def read_lhotse_manifest(config, is_tarred: bool) -> CutSet: return cuts +def _resolve_shar_inputs(path: str | Path, only_metadata: bool) -> dict: + if only_metadata: + return dict(fields={"cuts": sorted(Path(path).glob("cuts.*"))}) + else: + return dict(in_dir=path) + + def resolve_relative_paths(cut: Cut, manifest_path: str) -> Cut: if isinstance(cut, PaddingCut): return cut @@ -352,20 +381,24 @@ def read_nemo_manifest(config, is_tarred: bool) -> CutSet: common_kwargs = { "text_field": config.text_field, "lang_field": config.lang_field, + "shuffle_shards": config.shuffle, + "shard_seed": config.shard_seed, } # The option below is to allow a special case of NeMo manifest iteration as Lhotse CutSet - # without performing any I/O. NeMo manifests typically don't have sampling_rate information required by Lhotse. - # This is useful for utility scripts that iterate metadata and estimate optimal batching settings. - notar_kwargs = {"missing_sampling_rate_ok": config.missing_sampling_rate_ok} + # without performing any I/O. NeMo manifests typically don't have sampling_rate information required by Lhotse, + # so lhotse has to look up the headers of audio files to fill it on-the-fly. + # (this only has an impact on non-tarred data; tarred data is read into memory anyway). + # This is useful for utility scripts that iterate metadata and estimate optimal batching settings + # and other data statistics. + notar_kwargs = {"metadata_only": config.metadata_only} + metadata_only = config.metadata_only if isinstance(config.manifest_filepath, (str, Path)): logging.info(f"Initializing Lhotse CutSet from a single NeMo manifest (tarred): '{config.manifest_filepath}'") - if is_tarred: + if is_tarred and not metadata_only: cuts = CutSet( LazyNeMoTarredIterator( config.manifest_filepath, tar_paths=config.tarred_audio_filepaths, - shuffle_shards=config.shuffle, - shard_seed=config.shard_seed, **common_kwargs, ) ).repeat() @@ -393,12 +426,10 @@ def read_nemo_manifest(config, is_tarred: bool) -> CutSet: for manifest_info, (tar_path,) in zip(config.manifest_filepath, tar_paths): # First, convert manifest_path[+tar_path] to an iterator. manifest_path = manifest_info[0] - if is_tarred: + if is_tarred and not metadata_only: nemo_iter = LazyNeMoTarredIterator( manifest_path=manifest_path, tar_paths=tar_path, - shuffle_shards=config.shuffle, - shard_seed=config.shard_seed, **common_kwargs, ) else: @@ -431,12 +462,22 @@ def read_nemo_manifest(config, is_tarred: bool) -> CutSet: cutsets.append(CutSet(nemo_iter)) weights.append(weight) # Finally, we multiplex the dataset streams to mix the data. - cuts = mux(*cutsets, weights=weights, max_open_streams=config.max_open_streams, seed=config.shard_seed) + cuts = mux( + *cutsets, + weights=weights, + max_open_streams=config.max_open_streams, + seed=config.shard_seed, + metadata_only=metadata_only, + ) return cuts def mux( - *cutsets: CutSet, weights: list[int | float], max_open_streams: int | None = None, seed: str | int = "trng" + *cutsets: CutSet, + weights: list[int | float], + max_open_streams: int | None = None, + seed: str | int = "trng", + metadata_only: bool = False, ) -> CutSet: """ Helper function to call the right multiplexing method flavour in lhotse. @@ -444,9 +485,12 @@ def mux( it will select a more appropriate multiplexing strategy. 
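    [Editor's note, not part of the patch: with ``metadata_only=True`` the input cutsets are multiplexed
    exactly once instead of being wrapped in ``.repeat()``, so a single finite pass over the manifests
    (e.g. for duration-bin estimation) is possible; combining it with ``max_open_streams`` is rejected by
    the assertion below.]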
""" if max_open_streams is not None: + assert not metadata_only, "max_open_streams and metadata_only options are not compatible" cuts = CutSet.infinite_mux(*cutsets, weights=weights, seed=seed, max_open_streams=max_open_streams) else: - cuts = CutSet.mux(*[cs.repeat() for cs in cutsets], weights=weights, seed=seed) + if not metadata_only: + cutsets = [cs.repeat() for cs in cutsets] + cuts = CutSet.mux(*cutsets, weights=weights, seed=seed) return cuts diff --git a/nemo/collections/common/data/lhotse/dataloader.py b/nemo/collections/common/data/lhotse/dataloader.py index 9efd6444aecd..32bbc1f3e8f4 100644 --- a/nemo/collections/common/data/lhotse/dataloader.py +++ b/nemo/collections/common/data/lhotse/dataloader.py @@ -95,7 +95,9 @@ class LhotseDataLoadingConfig: # 4. Optional Lhotse data augmentation. # a. On-the-fly noise/audio mixing. - noise_path: Any | None = None # str | dict where dict can have any of keys: manifest_filepath, tarred_audio_filepaths, cuts_path, shar_path + noise_path: Any | None = ( + None # str | dict where dict can have any of keys: manifest_filepath, tarred_audio_filepaths, cuts_path, shar_path + ) noise_snr: tuple[float, float] = (10.0, 20.0) noise_mix_prob: float = 0.5 # b. On-the-fly 3-way speed perturbation. @@ -114,7 +116,9 @@ class LhotseDataLoadingConfig: cut_into_windows_duration: Optional[float] = None # set this to enable cut_into_windows_hop: Optional[float] = None # III) common options - keep_excessive_supervisions: bool = True # when a cut is truncated in the middle of a supervision, should we keep them. + keep_excessive_supervisions: bool = ( + True # when a cut is truncated in the middle of a supervision, should we keep them. + ) # e. RIR augmentation (synthetic RIR if rir_path is None) # at the moment supports only Lhotse recording manifests, e.g. https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/rir_noise.py rir_enabled: bool = False @@ -126,11 +130,15 @@ class LhotseDataLoadingConfig: lang_field: str = "lang" # key to read the language tag from # Enables iteration of NeMo non-tarred manifests that don't have a "sampling_rate" key without performing any I/O. # Note that this will not allow actual dataloading; it's only for manifest iteration as Lhotse objects. - missing_sampling_rate_ok: bool = False + metadata_only: bool = False def get_lhotse_dataloader_from_config( - config: DictConfig, global_rank: int, world_size: int, dataset: torch.utils.data.Dataset, tokenizer=None, + config: DictConfig, + global_rank: int, + world_size: int, + dataset: torch.utils.data.Dataset, + tokenizer=None, ) -> torch.utils.data.DataLoader: """ Set up a Lhotse training dataloder. @@ -205,7 +213,11 @@ def get_lhotse_dataloader_from_config( # and applying it here (before sampler/dataset) ensures optimal # bucket allocation. 
if config.perturb_speed: - cuts = CutSet.mux(cuts, cuts.perturb_speed(0.9), cuts.perturb_speed(1.1),) + cuts = CutSet.mux( + cuts, + cuts.perturb_speed(0.9), + cuts.perturb_speed(1.1), + ) # 2.d: truncation/slicing if config.truncate_duration is not None: @@ -249,6 +261,7 @@ def get_lhotse_dataloader_from_config( f"Creating a Lhotse DynamicBucketingSampler " f"(max_batch_duration={config.batch_duration} max_batch_size={config.batch_size})" ) + # Determine the bucket duration bins sampler = DynamicBucketingSampler( cuts, constraint=constraint, @@ -257,7 +270,7 @@ def get_lhotse_dataloader_from_config( shuffle_buffer_size=config.shuffle_buffer_size, seed=config.shard_seed, num_buckets=config.num_buckets, - duration_bins=config.bucket_duration_bins, + duration_bins=determine_bucket_duration_bins(config), num_cuts_for_bins_estimate=config.num_cuts_for_bins_estimate, buffer_size=config.bucket_buffer_size, rank=0 if is_tarred else global_rank, @@ -291,7 +304,10 @@ def get_lhotse_dataloader_from_config( # object with texts joined by a whitespace so that "regular" dataset classes don't # have to add a special support for multi-supervision cuts. sampler = sampler.map( - CutConcatenate(gap=config.concatenate_gap_seconds, duration_factor=config.concatenate_duration_factor,) + CutConcatenate( + gap=config.concatenate_gap_seconds, + duration_factor=config.concatenate_duration_factor, + ) ) if config.db_norm is not None: sampler = sampler.map(partial(_normalize_loudness, db_norm=config.db_norm)) @@ -326,12 +342,38 @@ def get_lhotse_dataloader_from_config( # the meta-data to Dataset, which performs the actual I/O inside its __getitem__ method. dloader_kwargs = dict(dataset=dataset, sampler=sampler) dloader = torch.utils.data.DataLoader( - **dloader_kwargs, batch_size=None, num_workers=config.num_workers, pin_memory=config.pin_memory, + **dloader_kwargs, + batch_size=None, + num_workers=config.num_workers, + pin_memory=config.pin_memory, ) return dloader +def determine_bucket_duration_bins(config): + if config.bucket_duration_bins is not None: + # Bucket duration bins are provided: just use them. + return config.bucket_duration_bins + # Bucket duration bins are not set. + if config.use_multimodal_sampling: + # For multimodal sampling it's currently impossible to define a linspace over durations + # because the buckets are counted in the number of tokens. + # The bins will be auto-estimated by lhotse at the cost of a slight lag in the training start. + return None + elif config.max_duration is not None and config.max_duration < float("inf"): + # If max duration is provided, we can use that to compute uniformly distant bucket bins. + # This is not optimal but should be close enough for users who didn't want to estimate these up-front. + begin = config.min_duration if config.min_duration is not None and config.min_duration > 0 else 0.0 + end = config.max_duration + return np.linspace(begin, end, config.num_buckets + 1)[1:-1].tolist() + else: + # If we don't know max_duration, we can't guess a reasonable estimate of the upper bound of + # durations. + # The bins will be auto-estimated by lhotse at the cost of a slight lag in the training start. + return None + + def make_structured_with_schema_warnings(config: DictConfig) -> DictConfig: """ Checks the schema and fills missing default option values. 
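# Editor's note (illustrative, not part of the patch): for `determine_bucket_duration_bins` above, the
# uniform fallback returns the `num_buckets - 1` interior edges of an even split of
# [min_duration, max_duration]. For example, with min_duration=0.0, max_duration=20.0 and num_buckets=4:
#
#     >>> import numpy as np
#     >>> np.linspace(0.0, 20.0, 4 + 1)[1:-1].tolist()
#     [5.0, 10.0, 15.0]
#
# i.e. buckets of roughly [0, 5), [5, 10), [10, 15) and [15, 20] seconds. Bins estimated from the data
# (see scripts/speech_recognition/estimate_duration_bins.py) remain the better-balanced choice; the
# linspace fallback only trades some balance for instant initialization.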
@@ -377,7 +419,9 @@ class MultimodalSamplingConstraint(SamplingConstraint): def __post_init__(self): self._internal = TokenConstraint( - max_tokens=self.batch_tokens, max_examples=self.batch_size, quadratic_length=self.quadratic_factor, + max_tokens=self.batch_tokens, + max_examples=self.batch_size, + quadratic_length=self.quadratic_factor, ) def add(self, example: Any) -> None: diff --git a/nemo/collections/common/data/lhotse/nemo_adapters.py b/nemo/collections/common/data/lhotse/nemo_adapters.py index b8769b041b4f..b2ca1186c8e3 100644 --- a/nemo/collections/common/data/lhotse/nemo_adapters.py +++ b/nemo/collections/common/data/lhotse/nemo_adapters.py @@ -49,7 +49,7 @@ class LazyNeMoIterator: .. caution:: We will perform some I/O (as much as required by soundfile.info) to discover the sampling rate of the audio file. If this is not acceptable, convert the manifest to Lhotse format which contains - sampling rate info. For pure metadata iteration purposes we also provide a ``missing_sampling_rate_ok`` flag that + sampling rate info. For pure metadata iteration purposes we also provide a ``metadata_only`` flag that will create only partially valid Lhotse objects (with metadata related to sampling rate / num samples missing). Example:: @@ -62,16 +62,23 @@ def __init__( path: str | Path, text_field: str = "text", lang_field: str = "lang", - missing_sampling_rate_ok: bool = False, + metadata_only: bool = False, + shuffle_shards: bool = False, + shard_seed: int | Literal["randomized", "trng"] = "trng", ) -> None: - self.source = LazyJsonlIterator(path) + self.path = path + self.shuffle_shards = shuffle_shards + self.shard_seed = shard_seed + paths = expand_sharded_filepaths(path) + if len(paths) == 1: + self.source = LazyJsonlIterator(paths[0]) + else: + self.source = LazyIteratorChain( + *(LazyJsonlIterator(p) for p in paths), shuffle_iters=self.shuffle_shards, seed=self.shard_seed + ) self.text_field = text_field self.lang_field = lang_field - self.missing_sampling_rate_ok = missing_sampling_rate_ok - - @property - def path(self) -> str | Path: - return self.source.path + self.metadata_only = metadata_only def __iter__(self) -> Generator[Cut, None, None]: for data in self.source: @@ -104,7 +111,12 @@ def __len__(self) -> int: def __add__(self, other): return LazyIteratorChain(self, other) - def _create_recording(self, audio_path: str, duration: float, sampling_rate: int | None = None,) -> Recording: + def _create_recording( + self, + audio_path: str, + duration: float, + sampling_rate: int | None = None, + ) -> Recording: if sampling_rate is not None: # TODO(pzelasko): It will only work with single-channel audio in the current shape. 
return Recording( @@ -115,7 +127,7 @@ def _create_recording(self, audio_path: str, duration: float, sampling_rate: int duration=duration, channel_ids=[0], ) - elif self.missing_sampling_rate_ok: + elif self.metadata_only: return Recording( id=audio_path, sources=[AudioSource(type="file", channels=[0], source=audio_path)], diff --git a/scripts/speech_recognition/convert_to_tarred_audio_dataset.py b/scripts/speech_recognition/convert_to_tarred_audio_dataset.py index f0c7847b8c9b..c3b5cef57cbc 100644 --- a/scripts/speech_recognition/convert_to_tarred_audio_dataset.py +++ b/scripts/speech_recognition/convert_to_tarred_audio_dataset.py @@ -412,7 +412,7 @@ def estimate_dynamic_bucketing_duration_bins(self, manifest_path: str, num_bucke from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_buckets from nemo.collections.common.data.lhotse.nemo_adapters import LazyNeMoIterator - cuts = CutSet(LazyNeMoIterator(manifest_path, missing_sampling_rate_ok=True)) + cuts = CutSet(LazyNeMoIterator(manifest_path, metadata_only=True)) bins = estimate_duration_buckets(cuts, num_buckets=num_buckets) print( f"Note: we estimated the optimal bucketing duration bins for {num_buckets} buckets. " diff --git a/scripts/speech_recognition/estimate_duration_bins.py b/scripts/speech_recognition/estimate_duration_bins.py index 687c2af59ad2..cca101731772 100644 --- a/scripts/speech_recognition/estimate_duration_bins.py +++ b/scripts/speech_recognition/estimate_duration_bins.py @@ -13,6 +13,10 @@ # limitations under the License. import argparse +from itertools import islice +from pathlib import Path + +from lhotse.cut import Cut from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_buckets from omegaconf import OmegaConf @@ -23,14 +27,18 @@ def parse_args(): parser = argparse.ArgumentParser( description="Estimate duration bins for Lhotse dynamic bucketing using a sample of the input dataset. " - "The dataset is read either from one or more manifest files and supports data weighting." + "The dataset is read either from one or more manifest files and supports data weighting.", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) parser.add_argument( "input", - help='Same input format as in model configs under model.train_ds.manifest_filepath. Options: ' - '1) "path.json"; ' - '2) "[[path1.json],[path2.json],...]"; ' - '3) "[[path1.json,weight1],[path2.json,weight2],...]"', + help='Data input. Options: ' + '1) "path.json" - any single NeMo manifest; ' + '2) "[[path1.json],[path2.json],...]" - any collection of NeMo manifests; ' + '3) "[[path1.json,weight1],[path2.json,weight2],...]" - any collection of weighted NeMo manifests; ' + '4) "input_cfg.yaml" - a new option supporting input configs, same as in model training \'input_cfg\' arg; ' + '5) "path/to/shar_data" - a path to Lhotse Shar data directory; ' + '6) "key=val" - in case none of the previous variants cover your case: "key" is the key you\'d use in NeMo training config with its corresponding value ', ) parser.add_argument("-b", "--buckets", type=int, default=30, help="The desired number of buckets.") parser.add_argument( @@ -38,7 +46,8 @@ def parse_args(): "--num_examples", type=int, default=-1, - help="The number of examples (utterances) to estimate the bins. -1 means use all data.", + help="The number of examples (utterances) to estimate the bins. 
-1 means use all data " + "(be careful: it could be iterated over infinitely).", ) parser.add_argument( "-l", @@ -62,25 +71,36 @@ def parse_args(): def main(): args = parse_args() + if '=' in args.input: + inp_arg = args.input + elif args.input.endswith(".yaml"): + inp_arg = f"input_cfg={args.input}" + elif Path(args.input).is_dir(): + inp_arg = f"shar_path={args.input}" + else: + inp_arg = f"manifest_filepath={args.input}" config = OmegaConf.merge( OmegaConf.structured(LhotseDataLoadingConfig), - OmegaConf.from_dotlist([f"manifest_filepath={args.input}", "missing_sampling_rate_ok=true"]), + OmegaConf.from_dotlist([inp_arg, "metadata_only=true"]), ) cuts, _ = read_cutset_from_config(config) min_dur, max_dur = args.min_duration, args.max_duration - discarded, tot = 0, 0 + nonaudio, discarded, tot = 0, 0, 0 def duration_ok(cut) -> bool: - nonlocal discarded, tot - ans = min_dur <= cut.duration <= max_dur - if not ans: - discarded += 1 + nonlocal nonaudio, discarded, tot tot += 1 - return ans + if not isinstance(cut, Cut): + nonaudio += 1 + return False + if not (min_dur <= cut.duration <= max_dur): + discarded += 1 + return False + return True cuts = cuts.filter(duration_ok) if (N := args.num_examples) > 0: - cuts = cuts.subset(first=N) + cuts = islice(cuts, N) duration_bins = estimate_duration_buckets(cuts, num_buckets=args.buckets) duration_bins = f"[{','.join(str(round(b, ndigits=5)) for b in duration_bins)}]" if args.quiet: @@ -89,11 +109,12 @@ def duration_ok(cut) -> bool: if discarded: ratio = discarded / tot print(f"Note: we discarded {discarded}/{tot} ({ratio:.2%}) utterances due to min/max duration filtering.") + if nonaudio: + print(f"Note: we discarded {nonaudio} non-audio examples found during iteration.") + print(f"Used {tot - nonaudio - discarded} examples for the estimation.") print("Use the following options in your config:") print(f"\tnum_buckets={args.buckets}") print(f"\tbucket_duration_bins={duration_bins}") - print("Computing utterance duration distribution...") - cuts.describe() # prints a nice table with duration stats + other info if __name__ == "__main__": From acbd4e00ae2618c36ed9dad265d339e77a57832a Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Tue, 14 May 2024 21:55:34 +0400 Subject: [PATCH 10/36] Enable CUDA graphs by default only for transcription (#9196) (#9197) * Enable CUDA graphs only for transcription. Sync streams before capture. 
--------- Signed-off-by: Vladimir Bataev --- examples/asr/transcribe_speech.py | 17 +- examples/asr/transcribe_speech_parallel.py | 6 +- .../asr/parts/submodules/rnnt_decoding.py | 15 +- .../parts/submodules/rnnt_greedy_decoding.py | 20 +- .../submodules/rnnt_loop_labels_computer.py | 121 +++++++--- .../submodules/tdt_loop_labels_computer.py | 225 +++++++++++------- 6 files changed, 266 insertions(+), 138 deletions(-) diff --git a/examples/asr/transcribe_speech.py b/examples/asr/transcribe_speech.py index c8372c422e7b..1763c2035805 100644 --- a/examples/asr/transcribe_speech.py +++ b/examples/asr/transcribe_speech.py @@ -29,6 +29,7 @@ from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig from nemo.collections.asr.parts.submodules.multitask_decoding import MultiTaskDecoding, MultiTaskDecodingConfig from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig +from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedRNNTInferConfig from nemo.collections.asr.parts.utils.eval_utils import cal_write_wer from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis from nemo.collections.asr.parts.utils.transcribe_utils import ( @@ -121,9 +122,9 @@ class TranscriptionConfig: pretrained_name: Optional[str] = None # Name of a pretrained model audio_dir: Optional[str] = None # Path to a directory which contains audio files dataset_manifest: Optional[str] = None # Path to dataset's JSON manifest - channel_selector: Optional[ - Union[int, str] - ] = None # Used to select a single channel from multichannel audio, or use average across channels + channel_selector: Optional[Union[int, str]] = ( + None # Used to select a single channel from multichannel audio, or use average across channels + ) audio_key: str = 'audio_filepath' # Used to override the default audio key in dataset_manifest eval_config_yaml: Optional[str] = None # Path to a yaml file of config of evaluation presort_manifest: bool = True # Significant inference speedup on short-form data due to padding reduction @@ -161,7 +162,10 @@ class TranscriptionConfig: ctc_decoding: CTCDecodingConfig = CTCDecodingConfig() # Decoding strategy for RNNT models - rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(fused_batch_size=-1) + # enable CUDA graphs for transcription + rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig( + fused_batch_size=-1, greedy=GreedyBatchedRNNTInferConfig(use_cuda_graph_decoder=True) + ) # Decoding strategy for AED models multitask_decoding: MultiTaskDecodingConfig = MultiTaskDecodingConfig() @@ -407,7 +411,10 @@ def autocast(dtype=None): override_cfg.augmentor = augmentor override_cfg.text_field = cfg.gt_text_attr_name override_cfg.lang_field = cfg.gt_lang_attr_name - transcriptions = asr_model.transcribe(audio=filepaths, override_config=override_cfg,) + transcriptions = asr_model.transcribe( + audio=filepaths, + override_config=override_cfg, + ) if cfg.dataset_manifest is not None: logging.info(f"Finished transcribing from manifest file: {cfg.dataset_manifest}") diff --git a/examples/asr/transcribe_speech_parallel.py b/examples/asr/transcribe_speech_parallel.py index c0af8f97146a..df2f31072851 100644 --- a/examples/asr/transcribe_speech_parallel.py +++ b/examples/asr/transcribe_speech_parallel.py @@ -84,6 +84,7 @@ from nemo.collections.asr.models import ASRModel, EncDecHybridRNNTCTCModel from nemo.collections.asr.models.configs.asr_models_config import ASRDatasetConfig from nemo.collections.asr.parts.submodules.rnnt_decoding import 
RNNTDecodingConfig +from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedRNNTInferConfig from nemo.core.config import TrainerConfig, hydra_runner from nemo.utils import logging from nemo.utils.get_rank import is_global_rank_zero @@ -100,7 +101,10 @@ class ParallelTranscriptionConfig: use_cer: bool = False # decoding strategy for RNNT models - rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig() + # enable CUDA graphs for transcription + rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig( + fused_batch_size=-1, greedy=GreedyBatchedRNNTInferConfig(use_cuda_graph_decoder=True) + ) # decoder type: ctc or rnnt, can be used to switch between CTC and RNNT decoder for Hybrid RNNT/CTC models decoder_type: Optional[str] = None diff --git a/nemo/collections/asr/parts/submodules/rnnt_decoding.py b/nemo/collections/asr/parts/submodules/rnnt_decoding.py index 5fa225864f8c..2416d916ac13 100644 --- a/nemo/collections/asr/parts/submodules/rnnt_decoding.py +++ b/nemo/collections/asr/parts/submodules/rnnt_decoding.py @@ -331,7 +331,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int): preserve_frame_confidence=self.preserve_frame_confidence, confidence_method_cfg=self.confidence_method_cfg, loop_labels=self.cfg.greedy.get('loop_labels', True), - use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', True), + use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', False), ) else: self.decoding = rnnt_greedy_decoding.GreedyBatchedTDTInfer( @@ -347,7 +347,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int): preserve_frame_confidence=self.preserve_frame_confidence, include_duration_confidence=self.tdt_include_duration_confidence, confidence_method_cfg=self.confidence_method_cfg, - use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', True), + use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', False), ) else: @@ -1175,7 +1175,11 @@ class RNNTDecoding(AbstractRNNTDecoding): """ def __init__( - self, decoding_cfg, decoder, joint, vocabulary, + self, + decoding_cfg, + decoder, + joint, + vocabulary, ): # we need to ensure blank is the last token in the vocab for the case of RNNT and Multi-blank RNNT. blank_id = len(vocabulary) + joint.num_extra_outputs @@ -1186,7 +1190,10 @@ def __init__( self.labels_map = dict([(i, vocabulary[i]) for i in range(len(vocabulary))]) super(RNNTDecoding, self).__init__( - decoding_cfg=decoding_cfg, decoder=decoder, joint=joint, blank_id=blank_id, + decoding_cfg=decoding_cfg, + decoder=decoder, + joint=joint, + blank_id=blank_id, ) if isinstance(self.decoding, rnnt_beam_decoding.BeamRNNTInfer): diff --git a/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py b/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py index b2fa9b85b5fd..fa7a5cc95fec 100644 --- a/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py +++ b/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py @@ -45,7 +45,10 @@ from nemo.utils import logging -def pack_hypotheses(hypotheses: List[rnnt_utils.Hypothesis], logitlen: torch.Tensor,) -> List[rnnt_utils.Hypothesis]: +def pack_hypotheses( + hypotheses: List[rnnt_utils.Hypothesis], + logitlen: torch.Tensor, +) -> List[rnnt_utils.Hypothesis]: if hasattr(logitlen, 'cpu'): logitlen_cpu = logitlen.to('cpu') @@ -139,8 +142,7 @@ class _GreedyRNNTInfer(Typing, ConfidenceMethodMixin): @property def input_types(self): - """Returns definitions of module input ports. 
- """ + """Returns definitions of module input ports.""" return { "encoder_output": NeuralType(('B', 'D', 'T'), AcousticEncodedRepresentation()), "encoded_lengths": NeuralType(tuple('B'), LengthsType()), @@ -149,8 +151,7 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return {"predictions": [NeuralType(elements_type=HypothesisType())]} def __init__( @@ -578,6 +579,7 @@ class GreedyBatchedRNNTInfer(_GreedyRNNTInfer, WithOptionalCudaGraphs): (evaluating Joint multiple times in inner loop); It uses a minimal possible amount of calls to prediction network (with maximum possible batch size), which makes it especially useful for scaling the prediction network. + use_cuda_graph_decoder: if CUDA graphs should be enabled for decoding (currently recommended only for inference) """ def __init__( @@ -590,7 +592,7 @@ def __init__( preserve_frame_confidence: bool = False, confidence_method_cfg: Optional[DictConfig] = None, loop_labels: bool = True, - use_cuda_graph_decoder: bool = True, + use_cuda_graph_decoder: bool = False, ): super().__init__( decoder_model=decoder_model, @@ -2358,7 +2360,7 @@ class GreedyBatchedRNNTInferConfig: tdt_include_duration_confidence: bool = False confidence_method_cfg: Optional[ConfidenceMethodConfig] = field(default_factory=lambda: ConfidenceMethodConfig()) loop_labels: bool = True - use_cuda_graph_decoder: bool = True + use_cuda_graph_decoder: bool = False def __post_init__(self): # OmegaConf.structured ensures that post_init check is always executed @@ -2695,6 +2697,8 @@ class GreedyBatchedTDTInfer(_GreedyRNNTInfer, WithOptionalCudaGraphs): Supported values: - 'lin' for using the linear mapping. - 'exp' for using exponential mapping with linear shift. 
+ + use_cuda_graph_decoder: if CUDA graphs should be enabled for decoding (currently recommended only for inference) """ def __init__( @@ -2708,7 +2712,7 @@ def __init__( preserve_frame_confidence: bool = False, include_duration_confidence: bool = False, confidence_method_cfg: Optional[DictConfig] = None, - use_cuda_graph_decoder: bool = True, + use_cuda_graph_decoder: bool = False, ): super().__init__( decoder_model=decoder_model, diff --git a/nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py b/nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py index b920dba09cfd..718deb7a409c 100644 --- a/nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py +++ b/nemo/collections/asr/parts/submodules/rnnt_loop_labels_computer.py @@ -112,7 +112,9 @@ def __init__( self.max_time = max_time self.encoder_output_projected = torch.zeros( - (self.batch_size, self.max_time, encoder_dim), dtype=float_dtype, device=self.device, + (self.batch_size, self.max_time, encoder_dim), + dtype=float_dtype, + device=self.device, ) self.encoder_output_length = torch.zeros((self.batch_size,), dtype=torch.long, device=self.device) @@ -288,7 +290,9 @@ def reset_cuda_graphs_state(self): self.separate_graphs = None def loop_labels_torch( - self, encoder_output: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output: torch.Tensor, + encoder_output_length: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: """ Pure PyTorch implementation @@ -361,7 +365,8 @@ def loop_labels_torch( # blank label in `labels` tensor means "end of hypothesis" (for this index) logits = ( self.joint.joint_after_projection( - encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), decoder_output, + encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), + decoder_output, ) .squeeze(1) .squeeze(1) @@ -378,9 +383,11 @@ def loop_labels_torch( time_indices=time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=labels if self.preserve_alignments else None, - confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + confidence=( + self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ), ) # advance_mask is a mask for current batch for searching non-blank labels; @@ -397,7 +404,8 @@ def loop_labels_torch( torch.where(advance_mask, time_indices, time_indices_current_labels, out=time_indices_current_labels) logits = ( self.joint.joint_after_projection( - encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), decoder_output, + encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), + decoder_output, ) .squeeze(1) .squeeze(1) @@ -416,9 +424,11 @@ def loop_labels_torch( time_indices=time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=more_labels if self.preserve_alignments else None, - confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + confidence=( + self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ), ) blank_mask = labels == self._blank_index @@ -432,19 +442,27 @@ def loop_labels_torch( # this seems to be redundant, but used in the `loop_frames` output torch.ne(active_mask, active_mask_prev, 
out=became_inactive_mask) self.decoder.batch_replace_states_mask( - src_states=state, dst_states=last_decoder_state, mask=became_inactive_mask, + src_states=state, + dst_states=last_decoder_state, + mask=became_inactive_mask, ) # store hypotheses if self.max_symbols is not None: # pre-allocated memory, no need for checks batched_hyps.add_results_masked_no_checks_( - active_mask, labels, time_indices_current_labels, scores, + active_mask, + labels, + time_indices_current_labels, + scores, ) else: # auto-adjusted storage batched_hyps.add_results_masked_( - active_mask, labels, time_indices_current_labels, scores, + active_mask, + labels, + time_indices_current_labels, + scores, ) # stage 4: to avoid looping, go to next frame after max_symbols emission @@ -455,7 +473,8 @@ def loop_labels_torch( active_mask, torch.logical_and( torch.logical_and( - labels != self._blank_index, batched_hyps.last_timestep_lasts >= self.max_symbols, + labels != self._blank_index, + batched_hyps.last_timestep_lasts >= self.max_symbols, ), batched_hyps.last_timestep == time_indices, ), @@ -470,7 +489,9 @@ def loop_labels_torch( return batched_hyps, None, last_decoder_state def loop_labels_cuda_graphs( - self, encoder_output: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output: torch.Tensor, + encoder_output_length: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: """ Implementation with CUDA graphs. @@ -565,7 +586,9 @@ def _create_inner_while_loop_kernel(cls): return run_nvrtc(kernel_string, b"inner_find_non_blank_conditional", cls.CUDA_PROGRAM_NAME) def _graph_reinitialize( - self, encoder_output_projected: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output_projected: torch.Tensor, + encoder_output_length: torch.Tensor, ): batch_size, max_time, encoder_dim = encoder_output_projected.shape @@ -602,25 +625,34 @@ def _partial_graphs_compile(self): """Compile decoding by parts""" # Always create a new stream, because the per-thread default stream disallows stream capture to a graph. 
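        # Editor's note (illustrative sketch, not part of the patch): the capture code below follows the
        # standard PyTorch CUDA-graph recipe — capture on a dedicated side stream that has first
        # synchronized with the default stream. Roughly (hypothetical `run_step` / `static_input` names):
        #
        #     side_stream = torch.cuda.Stream(device)
        #     side_stream.wait_stream(torch.cuda.default_stream(device))  # sync pending work before capture
        #     graph = torch.cuda.CUDAGraph()
        #     with torch.cuda.stream(side_stream), torch.inference_mode(), \
        #             torch.cuda.graph(graph, stream=side_stream):
        #         run_step(static_input)      # all tensors involved must be pre-allocated ("static")
        #     # later: copy fresh data into static_input, then graph.replay()
        #
        # The `wait_stream(...)` call introduced by this patch is what ensures capture does not start while
        # work queued on the default stream is still pending.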
stream_for_graph = torch.cuda.Stream(self.state.device) + stream_for_graph.wait_stream(torch.cuda.default_stream(self.state.device)) self.separate_graphs = SeparateGraphsLoopLabels() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.before_outer_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.before_outer_loop, stream=stream_for_graph), ): self._before_outer_loop() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.before_inner_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.before_inner_loop, stream=stream_for_graph), ): self._before_inner_loop_get_decoder_output() self._before_inner_loop_get_joint_output() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.inner_loop_code, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.inner_loop_code, stream=stream_for_graph), ): self._inner_loop_code() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.after_inner_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.after_inner_loop, stream=stream_for_graph), ): self._after_inner_loop() @@ -628,9 +660,12 @@ def _full_graph_compile(self): """Compile full graph for decoding""" # Always create a new stream, because the per-thread default stream disallows stream capture to a graph. 
stream_for_graph = torch.cuda.Stream(self.state.device) + stream_for_graph.wait_stream(torch.cuda.default_stream(self.state.device)) self.full_graph = torch.cuda.CUDAGraph() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.full_graph, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.full_graph, stream=stream_for_graph), ): self._before_outer_loop() @@ -644,7 +679,8 @@ def _full_graph_compile(self): outer_loop_kernel = self._create_outer_while_loop_kernel() active_mask_any_ptr = np.array([self.state.active_mask_any.data_ptr()], dtype=np.uint64) outer_loop_args = np.array( - [outer_loop_conditional_handle.getPtr(), active_mask_any_ptr.ctypes.data], dtype=np.uint64, + [outer_loop_conditional_handle.getPtr(), active_mask_any_ptr.ctypes.data], + dtype=np.uint64, ) # loop while there are active utterances with with_conditional_node( @@ -657,7 +693,11 @@ def _full_graph_compile(self): (inner_loop_conditional_handle,) = cu_call(cudart.cudaGraphConditionalHandleCreate(graph, 0, 0)) advance_mask_any_ptr = np.array([self.state.advance_mask_any.data_ptr()], dtype=np.uint64) inner_loop_args = np.array( - [inner_loop_conditional_handle.getPtr(), advance_mask_any_ptr.ctypes.data,], dtype=np.uint64, + [ + inner_loop_conditional_handle.getPtr(), + advance_mask_any_ptr.ctypes.data, + ], + dtype=np.uint64, ) with with_conditional_node( inner_while_loop_kernel, inner_loop_args, inner_loop_conditional_handle, device=self.state.device @@ -734,9 +774,11 @@ def _before_inner_loop_get_joint_output(self): time_indices=self.state.time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=self.state.labels if self.preserve_alignments else None, - confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + confidence=( + self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ), ) # advance_mask is a mask for current batch for searching non-blank labels; @@ -785,9 +827,11 @@ def _inner_loop_code(self): time_indices=self.state.time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=more_labels if self.preserve_alignments else None, - confidence=self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + confidence=( + self._get_confidence_tensor(F.log_softmax(logits, dim=-1)).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ), ) # blank_mask = self.labels == self._blank_index @@ -813,7 +857,10 @@ def _after_inner_loop(self): ) self.state.batched_hyps.add_results_masked_no_checks_( - self.state.active_mask, self.state.labels, self.state.time_indices_current_labels, self.state.scores, + self.state.active_mask, + self.state.labels, + self.state.time_indices_current_labels, + self.state.scores, ) # stage 4: to avoid looping, go to next frame after max_symbols emission @@ -837,7 +884,9 @@ def _after_inner_loop(self): torch.any(self.state.active_mask, out=self.state.active_mask_any) def __call__( - self, x: torch.Tensor, out_len: torch.Tensor, + self, + x: torch.Tensor, + out_len: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: if self.cuda_graphs_mode is not None and x.device.type == "cuda": return self.loop_labels_cuda_graphs(encoder_output=x, 
encoder_output_length=out_len) diff --git a/nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py b/nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py index 4e514966db2b..7ad7065e019c 100644 --- a/nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py +++ b/nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py @@ -117,7 +117,9 @@ def __init__( self.max_time = max_time self.encoder_output_projected = torch.zeros( - (self.batch_size, self.max_time, encoder_dim), dtype=float_dtype, device=self.device, + (self.batch_size, self.max_time, encoder_dim), + dtype=float_dtype, + device=self.device, ) self.encoder_output_length = torch.zeros((self.batch_size,), dtype=torch.long, device=self.device) @@ -301,7 +303,9 @@ def force_cuda_graphs_mode(self, mode: Optional[Union[str, CudaGraphsMode]]): self.state = None def loop_labels_torch( - self, encoder_output: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output: torch.Tensor, + encoder_output_length: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: """ Pure PyTorch implementation @@ -379,7 +383,8 @@ def loop_labels_torch( # blank label in `labels` tensor means "end of hypothesis" (for this index) logits = ( self.joint.joint_after_projection( - encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), decoder_output, + encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), + decoder_output, ) .squeeze(1) .squeeze(1) @@ -400,23 +405,27 @@ def loop_labels_torch( time_indices=time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=labels if self.preserve_alignments else None, - confidence=torch.stack( - ( - self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( - dtype=float_dtype + confidence=( + torch.stack( + ( + self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( + dtype=float_dtype + ), + self._get_confidence_tensor(F.log_softmax(logits[:, -num_durations:], dim=-1)).to( + dtype=float_dtype + ), ), - self._get_confidence_tensor(F.log_softmax(logits[:, -num_durations:], dim=-1)).to( + dim=-1, + ) + if self.include_duration_confidence + else ( + self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( dtype=float_dtype - ), - ), - dim=-1, - ) - if self.include_duration_confidence - else self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( - dtype=float_dtype - ) - if self.preserve_frame_confidence - else None, + ) + if self.preserve_frame_confidence + else None + ) + ), ) # advance_mask is a mask for current batch for searching non-blank labels; @@ -433,7 +442,8 @@ def loop_labels_torch( torch.where(advance_mask, time_indices, time_indices_current_labels, out=time_indices_current_labels) logits = ( self.joint.joint_after_projection( - encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), decoder_output, + encoder_output_projected[batch_indices, safe_time_indices].unsqueeze(1), + decoder_output, ) .squeeze(1) .squeeze(1) @@ -454,23 +464,27 @@ def loop_labels_torch( time_indices=time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=more_labels if self.preserve_alignments else None, - confidence=torch.stack( - ( - self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( - dtype=float_dtype + confidence=( + torch.stack( + ( + self._get_confidence_tensor(F.log_softmax(logits[:, 
:-num_durations], dim=-1)).to( + dtype=float_dtype + ), + self._get_confidence_tensor(F.log_softmax(logits[:, -num_durations:], dim=-1)).to( + dtype=float_dtype + ), ), - self._get_confidence_tensor(F.log_softmax(logits[:, -num_durations:], dim=-1)).to( + dim=-1, + ) + if self.include_duration_confidence + else ( + self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( dtype=float_dtype - ), - ), - dim=-1, - ) - if self.include_duration_confidence - else self._get_confidence_tensor(F.log_softmax(logits[:, :-num_durations], dim=-1)).to( - dtype=float_dtype - ) - if self.preserve_frame_confidence - else None, + ) + if self.preserve_frame_confidence + else None + ) + ), ) blank_mask = labels == self._blank_index @@ -487,19 +501,27 @@ def loop_labels_torch( # this seems to be redundant, but used in the `loop_frames` output torch.ne(active_mask, active_mask_prev, out=became_inactive_mask) self.decoder.batch_replace_states_mask( - src_states=state, dst_states=last_decoder_state, mask=became_inactive_mask, + src_states=state, + dst_states=last_decoder_state, + mask=became_inactive_mask, ) # store hypotheses if self.max_symbols is not None: # pre-allocated memory, no need for checks batched_hyps.add_results_masked_no_checks_( - active_mask, labels, time_indices_current_labels, scores, + active_mask, + labels, + time_indices_current_labels, + scores, ) else: # auto-adjusted storage batched_hyps.add_results_masked_( - active_mask, labels, time_indices_current_labels, scores, + active_mask, + labels, + time_indices_current_labels, + scores, ) # stage 4: to avoid looping, go to next frame after max_symbols emission @@ -510,7 +532,8 @@ def loop_labels_torch( active_mask, torch.logical_and( torch.logical_and( - labels != self._blank_index, batched_hyps.last_timestep_lasts >= self.max_symbols, + labels != self._blank_index, + batched_hyps.last_timestep_lasts >= self.max_symbols, ), batched_hyps.last_timestep == time_indices, ), @@ -525,7 +548,9 @@ def loop_labels_torch( return batched_hyps, None, last_decoder_state def loop_labels_cuda_graphs( - self, encoder_output: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output: torch.Tensor, + encoder_output_length: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: """ Implementation with CUDA graphs. @@ -620,7 +645,9 @@ def _create_inner_while_loop_kernel(cls): return run_nvrtc(kernel_string, b"inner_find_non_blank_conditional", cls.CUDA_PROGRAM_NAME) def _graph_reinitialize( - self, encoder_output_projected: torch.Tensor, encoder_output_length: torch.Tensor, + self, + encoder_output_projected: torch.Tensor, + encoder_output_length: torch.Tensor, ): batch_size, max_time, encoder_dim = encoder_output_projected.shape @@ -659,25 +686,34 @@ def _partial_graphs_compile(self): """Compile decoding by parts""" # Always create a new stream, because the per-thread default stream disallows stream capture to a graph. 
stream_for_graph = torch.cuda.Stream(self.state.device) + stream_for_graph.wait_stream(torch.cuda.default_stream(self.state.device)) self.separate_graphs = SeparateGraphsLoopLabels() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.before_outer_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.before_outer_loop, stream=stream_for_graph), ): self._before_outer_loop() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.before_inner_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.before_inner_loop, stream=stream_for_graph), ): self._before_inner_loop_get_decoder_output() self._before_inner_loop_get_joint_output() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.inner_loop_code, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.inner_loop_code, stream=stream_for_graph), ): self._inner_loop_code() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.separate_graphs.after_inner_loop, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.separate_graphs.after_inner_loop, stream=stream_for_graph), ): self._after_inner_loop() @@ -685,9 +721,12 @@ def _full_graph_compile(self): """Compile full graph for decoding""" # Always create a new stream, because the per-thread default stream disallows stream capture to a graph. 
stream_for_graph = torch.cuda.Stream(self.state.device) + stream_for_graph.wait_stream(torch.cuda.default_stream(self.state.device)) self.full_graph = torch.cuda.CUDAGraph() - with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph( - self.full_graph, stream=stream_for_graph + with ( + torch.cuda.stream(stream_for_graph), + torch.inference_mode(), + torch.cuda.graph(self.full_graph, stream=stream_for_graph), ): self._before_outer_loop() @@ -700,7 +739,8 @@ def _full_graph_compile(self): outer_loop_kernel = self._create_outer_while_loop_kernel() active_mask_any_ptr = np.array([self.state.active_mask_any.data_ptr()], dtype=np.uint64) outer_loop_args = np.array( - [outer_loop_conditional_handle.getPtr(), active_mask_any_ptr.ctypes.data], dtype=np.uint64, + [outer_loop_conditional_handle.getPtr(), active_mask_any_ptr.ctypes.data], + dtype=np.uint64, ) # loop while there are active utterances @@ -714,7 +754,11 @@ def _full_graph_compile(self): (inner_loop_conditional_handle,) = cu_call(cudart.cudaGraphConditionalHandleCreate(graph, 0, 0)) advance_mask_any_ptr = np.array([self.state.advance_mask_any.data_ptr()], dtype=np.uint64) inner_loop_args = np.array( - [inner_loop_conditional_handle.getPtr(), advance_mask_any_ptr.ctypes.data,], dtype=np.uint64, + [ + inner_loop_conditional_handle.getPtr(), + advance_mask_any_ptr.ctypes.data, + ], + dtype=np.uint64, ) # while self.advance_mask_any.item(): @@ -797,23 +841,27 @@ def _before_inner_loop_get_joint_output(self): time_indices=self.state.time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=self.state.labels if self.preserve_alignments else None, - confidence=torch.stack( - ( + confidence=( + torch.stack( + ( + self._get_confidence_tensor( + F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) + ).to(dtype=float_dtype), + self._get_confidence_tensor( + F.log_softmax(logits[:, -self.state.all_durations.shape[0] :], dim=-1) + ).to(dtype=float_dtype), + ), + dim=-1, + ) + if self.include_duration_confidence + else ( self._get_confidence_tensor( F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) - ).to(dtype=float_dtype), - self._get_confidence_tensor( - F.log_softmax(logits[:, -self.state.all_durations.shape[0] :], dim=-1) - ).to(dtype=float_dtype), - ), - dim=-1, - ) - if self.include_duration_confidence - else self._get_confidence_tensor( - F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) - ).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + ).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ) + ), ) # advance_mask is a mask for current batch for searching non-blank labels; @@ -864,23 +912,27 @@ def _inner_loop_code(self): time_indices=self.state.time_indices_current_labels, logits=logits if self.preserve_alignments else None, labels=more_labels if self.preserve_alignments else None, - confidence=torch.stack( - ( + confidence=( + torch.stack( + ( + self._get_confidence_tensor( + F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) + ).to(dtype=float_dtype), + self._get_confidence_tensor( + F.log_softmax(logits[:, -self.state.all_durations.shape[0] :], dim=-1) + ).to(dtype=float_dtype), + ), + dim=-1, + ) + if self.include_duration_confidence + else ( self._get_confidence_tensor( F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) - ).to(dtype=float_dtype), - self._get_confidence_tensor( - F.log_softmax(logits[:, -self.state.all_durations.shape[0] 
:], dim=-1) - ).to(dtype=float_dtype), - ), - dim=-1, - ) - if self.include_duration_confidence - else self._get_confidence_tensor( - F.log_softmax(logits[:, : -self.state.all_durations.shape[0]], dim=-1) - ).to(dtype=float_dtype) - if self.preserve_frame_confidence - else None, + ).to(dtype=float_dtype) + if self.preserve_frame_confidence + else None + ) + ), ) # blank_mask = self.labels == self._blank_index @@ -913,7 +965,10 @@ def _after_inner_loop(self): ) self.state.batched_hyps.add_results_masked_no_checks_( - self.state.active_mask, self.state.labels, self.state.time_indices_current_labels, self.state.scores, + self.state.active_mask, + self.state.labels, + self.state.time_indices_current_labels, + self.state.scores, ) # stage 4: to avoid looping, go to next frame after max_symbols emission @@ -937,7 +992,9 @@ def _after_inner_loop(self): torch.any(self.state.active_mask, out=self.state.active_mask_any) def __call__( - self, x: torch.Tensor, out_len: torch.Tensor, + self, + x: torch.Tensor, + out_len: torch.Tensor, ) -> Tuple[rnnt_utils.BatchedHyps, Optional[rnnt_utils.BatchedAlignments], Any]: if self.cuda_graphs_mode is not None and x.device.type == "cuda": return self.loop_labels_cuda_graphs(encoder_output=x, encoder_output_length=out_len) From 4167641fae262b4f6b6828498b65aa148511c51c Mon Sep 17 00:00:00 2001 From: Jason Date: Tue, 14 May 2024 14:15:17 -0400 Subject: [PATCH 11/36] move tts fixtures (#9183) * move tts fixtures Signed-off-by: Jason * Apply isort and black reformatting Signed-off-by: blisc --------- Signed-off-by: Jason Signed-off-by: blisc Co-authored-by: blisc --- .../tts.py => collections/tts/conftest.py} | 0 tests/conftest.py | 13 +++++++------ 2 files changed, 7 insertions(+), 6 deletions(-) rename tests/{fixtures/tts.py => collections/tts/conftest.py} (100%) diff --git a/tests/fixtures/tts.py b/tests/collections/tts/conftest.py similarity index 100% rename from tests/fixtures/tts.py rename to tests/collections/tts/conftest.py diff --git a/tests/conftest.py b/tests/conftest.py index 5069890e4840..6298ed051c68 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -25,8 +25,6 @@ import pytest -from tests.fixtures.tts import * - # Those variables probably should go to main NeMo configuration file (config.yaml). __TEST_DATA_FILENAME = "test_data.tar.gz" __TEST_DATA_URL = "https://github.com/NVIDIA/NeMo/releases/download/v1.0.0rc1/" @@ -68,7 +66,7 @@ def pytest_addoption(parser): @pytest.fixture def device(request): - """ Simple fixture returning string denoting the device [CPU | GPU] """ + """Simple fixture returning string denoting the device [CPU | GPU]""" if request.config.getoption("--cpu"): return "CPU" else: @@ -193,13 +191,16 @@ def pytest_configure(config): If file absent or sizes not equal, function downloads the archive from github and unpacks it. """ config.addinivalue_line( - "markers", "run_only_on(device): runs the test only on a given device [CPU | GPU]", + "markers", + "run_only_on(device): runs the test only on a given device [CPU | GPU]", ) config.addinivalue_line( - "markers", "with_downloads: runs the test using data present in tests/.data", + "markers", + "with_downloads: runs the test using data present in tests/.data", ) config.addinivalue_line( - "markers", "nightly: runs the nightly test for QA.", + "markers", + "nightly: runs the nightly test for QA.", ) # Test dir and archive filepath. 
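    # Editor's note (illustrative, not part of the patch): the markers registered above are applied to
    # tests with the standard pytest decorator syntax, e.g.:
    #
    #     @pytest.mark.run_only_on('GPU')
    #     @pytest.mark.with_downloads
    #     def test_requires_gpu_and_test_data():
    #         ...
    #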
test_dir = join(dirname(__file__), __TEST_DATA_SUBDIR) From 4d574fe493df9f7e86629d2a0afe880f1a52764d Mon Sep 17 00:00:00 2001 From: Adi Renduchintala Date: Tue, 14 May 2024 13:23:32 -0700 Subject: [PATCH 12/36] enable matryoshka embedding learning (#9130) * enable matryoshka embedding learning Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply isort and black reformatting Signed-off-by: arendu --------- Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: arendu --- .../megatron_gpt_embedding_model.py | 49 +++++++++++++++---- 1 file changed, 39 insertions(+), 10 deletions(-) diff --git a/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py b/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py index d477b337cd29..389c90d7f97c 100644 --- a/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py +++ b/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py @@ -58,6 +58,13 @@ def __init__(self, cfg: DictConfig, trainer: Trainer): self.temperature = self.cfg.get('temperature', 0.02) self.use_all_possible_negatives = self.cfg.get("use_all_possible_negatives", True) self.global_inbatch_negatives = self.cfg.get("global_inbatch_negatives", True) + if self.cfg.get("do_mrl", False): + min_mrl = self.cfg.get("min_mrl_dim", int(np.log2(32))) - 1 + max_mrl = int(np.log2(self.cfg.hidden_size // 2)) + self.mrl_dims = [2**i for i in range(max_mrl, min_mrl, -1)] + else: + self.mrl_dims = [] + assert ( self.cfg.get("post_process", False) is False ), "post_process must be False to get hidden states in the loss_func" @@ -255,7 +262,14 @@ def gather_and_maybe_write_predictions(self, output, data_cfg, mode, averaged_me gathered_output_batches = [None for _ in range(parallel_state.get_data_parallel_world_size())] torch.distributed.all_gather_object( gathered_output_batches, - [{'q_hs': batch['q_hs'], 'd_hs': batch['d_hs'], 'metadata': batch['metadata'],} for batch in output], + [ + { + 'q_hs': batch['q_hs'], + 'd_hs': batch['d_hs'], + 'metadata': batch['metadata'], + } + for batch in output + ], group=parallel_state.get_data_parallel_group(), ) @@ -272,7 +286,11 @@ def gather_and_maybe_write_predictions(self, output, data_cfg, mode, averaged_me l_d_hs = listify(batch['d_hs']) l_m = batch['metadata'] assert len(l_m) == len(l_q_hs) == len(l_d_hs) - for q_hs, d_hs, metadata in zip(l_q_hs, l_d_hs, l_m,): + for q_hs, d_hs, metadata in zip( + l_q_hs, + l_d_hs, + l_m, + ): total_size += 1 if not metadata.get("__AUTOGENERATED__", False): deduplicated_outputs['q_hs'].append(q_hs) @@ -326,10 +344,10 @@ def write_embeddings_to_file(self, outputs, output_file_path, d_idx): def local_validation_step(self, dataloader_iter): """ - Our dataloaders produce a micro-batch and then we fetch - a number of microbatches depending on the global batch size and model parallel size - from the dataloader to produce a list of microbatches. - The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions. + Our dataloaders produce a micro-batch and then we fetch + a number of microbatches depending on the global batch size and model parallel size + from the dataloader to produce a list of microbatches. + The list of microbatches is then piped through the pipeline using megatron-core fwd/bwd functions. 
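For context on this commit: with do_mrl enabled, the constructor above builds a ladder of embedding widths (powers of two from hidden_size // 2 down to min_mrl_dim), and the loss function later in this diff adds one contrastive cross-entropy term per truncated prefix of the normalized embeddings. A rough standalone sketch of that nested-loss idea in plain PyTorch (names are illustrative, and a single negative per query is assumed for brevity):

import numpy as np
import torch
import torch.nn.functional as F


def matryoshka_contrastive_loss(query, pos_doc, neg_doc, temperature=0.02, min_dim=32):
    # Sum an InfoNCE-style loss over nested prefixes of the embedding: full width first,
    # then hidden // 2, hidden // 4, ... down to min_dim.
    hidden = query.shape[-1]
    dims = [hidden] + [2 ** i for i in range(int(np.log2(hidden // 2)), int(np.log2(min_dim)) - 1, -1)]
    total = query.new_zeros(())
    for d in dims:
        q = F.normalize(query[:, :d], dim=1)
        p = F.normalize(pos_doc[:, :d], dim=1)
        n = F.normalize(neg_doc[:, :d], dim=1)
        # Column 0 holds the positive similarity, column 1 the negative one.
        scores = torch.stack([(q * p).sum(-1), (q * n).sum(-1)], dim=1) / temperature
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        total = total + F.cross_entropy(scores, labels)
    return total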
""" # Check if iterator is exhausted # dataloader_iter, done = self._val_iterator_done(dataloader_iter) @@ -377,7 +395,7 @@ def local_validation_step(self, dataloader_iter): return loss, non_loss_tensors - def constrastive_scores(self, pos_doc_hs, neg_doc_hs, query_hs, bs, use_all_possible_negatives=False): + def constrastive_scores(self, pos_doc_hs, neg_doc_hs, query_hs, bs, temperature, use_all_possible_negatives=False): all_doc_hs = torch.cat([pos_doc_hs, neg_doc_hs], dim=0) # (2bs) x hidden_size cs = torch.mm(query_hs, all_doc_hs.transpose(0, 1)) # (bs) x (2bs) pos_cs = cs[:, :bs].diag() @@ -389,6 +407,8 @@ def constrastive_scores(self, pos_doc_hs, neg_doc_hs, query_hs, bs, use_all_poss cs = torch.cat([pos_cs.unsqueeze(1), neg_cs.unsqueeze(1)], dim=1) pos_cs = pos_cs.clone().detach().mean() neg_cs = neg_cs.clone().detach().mean() + cs = cs.clamp(-1.0, 1.0) + cs = cs / temperature return cs, pos_cs, neg_cs, labels def inference_loss_func(self, loss_mask, num_valid_tokens_in_ub, eos_tensors): @@ -426,11 +446,20 @@ def loss_func(self, loss_mask, num_valid_tokens_in_ub, output_tensor): neg_doc_hs = torch.nn.functional.normalize(neg_doc_hs, dim=1) cs, pos_cs, neg_cs, labels = self.constrastive_scores( - pos_doc_hs, neg_doc_hs, query_hs, bs, self.use_all_possible_negatives + pos_doc_hs, neg_doc_hs, query_hs, bs, self.temperature, self.use_all_possible_negatives ) - cs = cs.clamp(-1.0, 1.0) - cs = cs / self.temperature loss = torch.nn.functional.cross_entropy(cs, labels) + if self.mrl_dims: + for dim in self.mrl_dims: + cs_dim, _, _, _ = self.constrastive_scores( + pos_doc_hs[:, :dim], + neg_doc_hs[:, :dim], + query_hs[:, :dim], + bs, + self.temperature, + self.use_all_possible_negatives, + ) + loss += torch.nn.functional.cross_entropy(cs_dim, labels) cp_size = self.cfg.get('context_parallel_size', 1) if cp_size > 1: From 5df8e11255802a2ce2f33db6362e60990e215b64 Mon Sep 17 00:00:00 2001 From: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Date: Tue, 14 May 2024 15:16:21 -0700 Subject: [PATCH 13/36] Add guards to SD imports (#9158) * Add guards to SD imports Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- .../modules/imagen/diffusionmodules/layers.py | 9 +++++++- .../modules/stable_diffusion/attention.py | 17 ++++++++++++--- .../diffusionmodules/model.py | 9 +++++++- .../diffusionmodules/openaimodel.py | 21 +++++++++++++++---- .../stable_diffusion/diffusionmodules/util.py | 9 +++++++- 5 files changed, 55 insertions(+), 10 deletions(-) diff --git a/nemo/collections/multimodal/modules/imagen/diffusionmodules/layers.py b/nemo/collections/multimodal/modules/imagen/diffusionmodules/layers.py index 72e70250f0d7..f5beca436ecf 100644 --- a/nemo/collections/multimodal/modules/imagen/diffusionmodules/layers.py +++ b/nemo/collections/multimodal/modules/imagen/diffusionmodules/layers.py @@ -43,7 +43,14 @@ import torch as th import torch.nn as nn import torch.nn.functional as F -from apex.contrib.group_norm import GroupNorm + +try: + from apex.contrib.group_norm import GroupNorm + + OPT_GROUP_NORM = True +except Exception: + print('Fused optimized group norm has not been installed.') + OPT_GROUP_NORM = False def conv_nd(dims, *args, **kwargs): diff --git a/nemo/collections/multimodal/modules/stable_diffusion/attention.py b/nemo/collections/multimodal/modules/stable_diffusion/attention.py index 
f5689c706e2c..c70b59d39481 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/attention.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/attention.py @@ -17,7 +17,6 @@ import torch import torch.nn.functional as F -from apex.contrib.group_norm import GroupNorm from einops import rearrange, repeat from torch import einsum, nn from torch._dynamo import disable @@ -25,9 +24,13 @@ if os.environ.get("USE_NATIVE_GROUP_NORM", "0") == "1": from nemo.gn_native import GroupNormNormlization as GroupNorm else: - from apex.contrib.group_norm import GroupNorm + try: + from apex.contrib.group_norm import GroupNorm -from transformer_engine.pytorch.module import LayerNormLinear, LayerNormMLP + OPT_GROUP_NORM = True + except Exception: + print('Fused optimized group norm has not been installed.') + OPT_GROUP_NORM = False from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.util import checkpoint from nemo.collections.nlp.modules.common.megatron.adapters.parallel_adapters import ( @@ -37,6 +40,14 @@ from nemo.core import adapter_mixins from nemo.utils import logging +try: + from transformer_engine.pytorch.module import LayerNormLinear, LayerNormMLP + + HAVE_TE = True + +except (ImportError, ModuleNotFoundError): + HAVE_TE = False + def check_cuda(): if not torch.cuda.is_available(): diff --git a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/model.py b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/model.py index 7fc5c208004f..644efafaf06a 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/model.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/model.py @@ -17,12 +17,19 @@ import numpy as np import torch import torch.nn as nn -from apex.contrib.group_norm import GroupNorm from einops import rearrange from nemo.collections.multimodal.modules.stable_diffusion.attention import LinearAttention from nemo.collections.multimodal.parts.stable_diffusion.utils import instantiate_from_config +try: + from apex.contrib.group_norm import GroupNorm + + OPT_GROUP_NORM = True +except Exception: + print('Fused optimized group norm has not been installed.') + OPT_GROUP_NORM = False + def get_timestep_embedding(timesteps, embedding_dim): """ diff --git a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py index b610f921a22a..3e301f0b8fc1 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py @@ -26,10 +26,6 @@ import torch.nn as nn import torch.nn.functional as F -# FP8 related import -import transformer_engine -from apex.contrib.group_norm import GroupNorm - from nemo.collections.multimodal.modules.stable_diffusion.attention import SpatialTransformer from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.util import ( avg_pool_nd, @@ -45,6 +41,23 @@ ) from nemo.utils import logging +try: + # FP8 related import + import transformer_engine + + HAVE_TE = True + +except (ImportError, ModuleNotFoundError): + HAVE_TE = False + +try: + from apex.contrib.group_norm import GroupNorm + + OPT_GROUP_NORM = True +except Exception: + print('Fused optimized group norm has not been installed.') + OPT_GROUP_NORM = False + def convert_module_to_dtype(module, dtype, enable_norm_layers=False): # Convert module parameters to 
dtype diff --git a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/util.py b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/util.py index 3b446f4a42c3..53f9669a0b8f 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/util.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/util.py @@ -29,11 +29,18 @@ import numpy as np import torch import torch.nn as nn -from apex.contrib.group_norm import GroupNorm from einops import repeat from torch._dynamo import disable from torch.cuda.amp import custom_bwd, custom_fwd +try: + from apex.contrib.group_norm import GroupNorm + + OPT_GROUP_NORM = True +except Exception: + print('Fused optimized group norm has not been installed.') + OPT_GROUP_NORM = False + def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): if schedule == "linear": From c2daa916b6454fe568706b4ab5da06500e2c6728 Mon Sep 17 00:00:00 2001 From: mikolajblaz Date: Wed, 15 May 2024 13:57:18 +0200 Subject: [PATCH 14/36] Implement async distributed checkpoint save (#9028) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Prevent duplicated checkpoints Signed-off-by: Mikołaj Błaż * Introduce DistributedCheckpointIO Signed-off-by: Mikołaj Błaż * Fix DistCkptIO usage Signed-off-by: Mikołaj Błaż * Use NeMo logger Signed-off-by: Mikołaj Błaż * [DCIO] Fix save_to dist ckpt path Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add versioning to save_to Signed-off-by: Mikołaj Błaż * Add versioning logic to all .nemo files Signed-off-by: Mikołaj Błaż * Add versioning test Signed-off-by: Mikołaj Błaż * Add dist-ckpt test Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Mikołaj Błaż * Rename existing ckpts instead of using different name Signed-off-by: Mikołaj Błaż * Add comment Signed-off-by: Mikołaj Błaż * Use dist ckpt flag in all methods Signed-off-by: Mikołaj Błaż * Improve error msg Signed-off-by: Mikołaj Błaż * Add dist ckpt unit tests Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix load_checkpoint Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Mikołaj Błaż * Fix auto-issues Signed-off-by: Mikołaj Błaż * Fix ckpt_dir var Signed-off-by: Mikołaj Błaż * Restore skipping behavior The fix from prevent-duplicated-checkpoints is required to skip the checkpoints Signed-off-by: Mikołaj Błaż * Fix steps on single-GPU machine Signed-off-by: Mikołaj Błaż * Run dist-ckpt test on GPU Signed-off-by: Mikołaj Błaż * Add docs Signed-off-by: Mikołaj Błaż * Apply black Signed-off-by: Mikołaj Błaż * Prevent saving last for non-equal val intervals Signed-off-by: Mikołaj Błaż * Move checkpoint on rank 0 Signed-off-by: Mikołaj Błaż * Fix num steps in tests Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Mikołaj Błaż * Add async ckpt implementation Signed-off-by: Mikołaj Błaż * Abstract AsyncFinalizableCheckpointIO away Signed-off-by: Mikołaj Błaż * Change async_save flag location Signed-off-by: Mikołaj Błaż * Add debug info Signed-off-by: Mikołaj Błaż * Apply formatting 
Signed-off-by: Mikołaj Błaż * Handle multiple async saves Signed-off-by: Mikołaj Błaż * Apply formatting Signed-off-by: Mikołaj Błaż * Move finalization calls to a callback Signed-off-by: Mikołaj Błaż * Avoid deadlock in teardown Signed-off-by: Mikołaj Błaż * Adjust to MCore implementation Signed-off-by: Mikołaj Błaż * Add notes and copyrights Signed-off-by: Mikołaj Błaż * Apply formatting Signed-off-by: Mikołaj Błaż * Fix async_request attribute Signed-off-by: Mikołaj Błaż * Add MCore import guards Signed-off-by: Mikołaj Błaż * Add async test Signed-off-by: Mikołaj Błaż * Fix finalize_fn arg Signed-off-by: Mikołaj Błaż * Add docs Signed-off-by: Mikołaj Błaż * Remove checkpoints from accurate steps Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix MCore class usage Signed-off-by: Mikołaj Błaż * Update docs Signed-off-by: Mikołaj Błaż * Fix logger usage Signed-off-by: Mikołaj Błaż * Fix rebase Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix code scan issues Signed-off-by: Mikołaj Błaż * Remove unsused import Signed-off-by: Mikołaj Błaż * Use dist-ckpt for Bert Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix load checkpoint return val Signed-off-by: Mikołaj Błaż * Use dist-ckpt based on sharded_state_dict Signed-off-by: Mikołaj Błaż * Add async logging Signed-off-by: Mikołaj Błaż * Remove deprecated argument Signed-off-by: Mikołaj Błaż * Use correct checkpoint_io Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bad merge Signed-off-by: Mikołaj Błaż * Improve debug msg Signed-off-by: Mikołaj Błaż * Run async test on GPU Signed-off-by: Mikołaj Błaż * Fix async ckpt unit test Signed-off-by: Mikołaj Błaż * Apply isort and black reformatting Signed-off-by: mikolajblaz * Clarify async logs Signed-off-by: Mikołaj Błaż * Add schema print Signed-off-by: Mikołaj Błaż --------- Signed-off-by: Mikołaj Błaż Signed-off-by: mikolajblaz Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- .../conf/megatron_gpt_config.yaml | 1 + .../nlp/parts/megatron_trainer_builder.py | 62 +++- nemo/collections/nlp/parts/nlp_overrides.py | 107 ++++--- nemo/utils/callbacks/checkpointing_context.py | 0 nemo/utils/callbacks/dist_ckpt_io.py | 221 ++++++++++++- nemo/utils/callbacks/nemo_model_checkpoint.py | 100 +++++- nemo/utils/callbacks/torch_dist_async.py | 298 ++++++++++++++++++ nemo/utils/exp_manager.py | 27 +- tests/core/test_dist_ckpt.py | 99 +++++- 9 files changed, 806 insertions(+), 109 deletions(-) create mode 100644 nemo/utils/callbacks/checkpointing_context.py create mode 100644 nemo/utils/callbacks/torch_dist_async.py diff --git a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml index aa43dfe7e53e..20e20744833c 100755 --- a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml @@ -52,6 +52,7 @@ exp_manager: save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}' model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, 
${model.pipeline_model_parallel_size}} + async_save: False # Set to True to enable async checkpoint save. Currently works only with distributed checkpoints model: # use GPTModel from megatron.core diff --git a/nemo/collections/nlp/parts/megatron_trainer_builder.py b/nemo/collections/nlp/parts/megatron_trainer_builder.py index a97b9301fb26..e1a780f09756 100644 --- a/nemo/collections/nlp/parts/megatron_trainer_builder.py +++ b/nemo/collections/nlp/parts/megatron_trainer_builder.py @@ -13,8 +13,9 @@ # limitations under the License. import sys -from typing import Union +from typing import Optional, Union +from lightning_fabric.utilities.exceptions import MisconfigurationException from omegaconf import DictConfig from pytorch_lightning import Trainer from pytorch_lightning.callbacks import ModelSummary @@ -31,7 +32,11 @@ PipelineMixedPrecisionPlugin, ) from nemo.utils import logging -from nemo.utils.callbacks.dist_ckpt_io import DistributedCheckpointIO +from nemo.utils.callbacks.dist_ckpt_io import ( + AsyncFinalizableCheckpointIO, + AsyncFinalizerCallback, + DistributedCheckpointIO, +) class MegatronTrainerBuilder: @@ -51,7 +56,10 @@ def _training_strategy(self) -> Union[NLPDDPStrategy, NLPFSDPStrategy]: _IS_INTERACTIVE = hasattr(sys, "ps1") or bool(sys.flags.interactive) if _IS_INTERACTIVE and self.cfg.trainer.devices == 1: logging.info("Detected interactive environment, using NLPDDPStrategyNotebook") - return NLPDDPStrategyNotebook(no_ddp_communication_hook=True, find_unused_parameters=False,) + return NLPDDPStrategyNotebook( + no_ddp_communication_hook=True, + find_unused_parameters=False, + ) if self.cfg.model.get('fsdp', False): assert ( @@ -89,7 +97,7 @@ def _grad_scaler(self) -> GradScaler: Returns a scaler for precision plugins. """ return GradScaler( - init_scale=self.cfg.model.get('native_amp_init_scale', 2 ** 32), + init_scale=self.cfg.model.get('native_amp_init_scale', 2**32), growth_interval=self.cfg.model.get('native_amp_growth_interval', 1000), hysteresis=self.cfg.model.get('hysteresis', 2), ) @@ -137,19 +145,41 @@ def _plugins(self) -> list: use_dist_ckpt = not self.cfg.model.get('fsdp', False) and ( self.cfg.model.get('mcore_gpt', False) or self.cfg.model.get('mcore_bert', False) ) + async_save = self.cfg.exp_manager.checkpoint_callback_params.get('async_save', False) if use_dist_ckpt: - plugins.append(DistributedCheckpointIO.from_config(self.cfg.model)) + checkpoint_io = DistributedCheckpointIO.from_config(self.cfg.model, async_save) + if async_save: + checkpoint_io = AsyncFinalizableCheckpointIO(checkpoint_io) + plugins.append(checkpoint_io) + elif async_save: + raise MisconfigurationException( + 'exp_manager.checkpoint_callback_params.async_save=True without' + 'distributed checkpoints is currently not supported' + ) return plugins + def _callbacks(self, callbacks: Optional[list]) -> list: + """ + Returns: + callbacks: list of callbacks passed to Trainer.callbacks. + """ + if callbacks is None: + callbacks = [] + # enable_progress_bar is True by default. 
If cfg.trainer.enable_progress_bar=False, CustomProgressBar is not appended to callbacks + if 'enable_progress_bar' not in self.cfg.trainer or self.cfg.trainer.enable_progress_bar: + callbacks.append(CustomProgressBar()) + + if self.cfg.exp_manager.checkpoint_callback_params.get('async_save', False): + callbacks.append(AsyncFinalizerCallback()) + return callbacks + def create_trainer(self, callbacks=None) -> Trainer: # cfg.trainer.precision becomes None in Trainer if precision_plugins exist since both precision plugins and precision precision = self.cfg.trainer.precision strategy = self._training_strategy() plugins = self._plugins() - # enable_progress_bar is True by default. If cfg.trainer.enable_progress_bar=False, CustomProgressBar is not appended to callbacks - if 'enable_progress_bar' not in self.cfg.trainer or self.cfg.trainer.enable_progress_bar: - callbacks = [CustomProgressBar()] + callbacks = self._callbacks(callbacks) trainer = Trainer(plugins=plugins, strategy=strategy, **self.cfg.trainer, callbacks=callbacks) # Restore the precision value after Trainer is built. self.cfg.trainer.precision = precision @@ -161,7 +191,7 @@ class MegatronBertTrainerBuilder(MegatronTrainerBuilder): def _grad_scaler(self) -> GradScaler: return GradScaler( - init_scale=self.cfg.model.get('native_amp_init_scale', 2 ** 32), + init_scale=self.cfg.model.get('native_amp_init_scale', 2**32), growth_interval=self.cfg.model.get('native_amp_growth_interval', 1000), ) @@ -169,13 +199,15 @@ def _grad_scaler(self) -> GradScaler: class MegatronT5TrainerBuilder(MegatronTrainerBuilder): """Builder for T5 model Trainer with overrides.""" - def create_trainer(self) -> Trainer: + def _callbacks(self, callbacks: Optional[list]) -> list: + callbacks = super()._callbacks(callbacks) + callbacks.append(ModelSummary(max_depth=3)) + return callbacks + + def create_trainer(self, callbacks=None) -> Trainer: strategy = self._training_strategy() plugins = self._plugins() - callbacks = [ModelSummary(max_depth=3)] - # enable_progress_bar is True by default. 
If cfg.trainer.enable_progress_bar=False, CustomProgressBar is not appended to callbacks - if 'enable_progress_bar' not in self.cfg.trainer or self.cfg.trainer.enable_progress_bar: - callbacks.append(CustomProgressBar()) + callbacks = self._callbacks(callbacks) return Trainer(plugins=plugins, strategy=strategy, **self.cfg.trainer, callbacks=callbacks) @@ -207,7 +239,7 @@ class MegatronLMPPTrainerBuilder(MegatronTrainerBuilder): def _grad_scaler(self) -> GradScaler: return GradScaler( - init_scale=self.cfg.model.get("native_amp_init_scale", 2 ** 32), + init_scale=self.cfg.model.get("native_amp_init_scale", 2**32), growth_interval=self.cfg.model.get("native_amp_growth_interval", 1000), hysteresis=self.cfg.model.get("hysteresis", 2), enabled=False if self.cfg.model.pipeline_model_parallel_size > 1 else True, diff --git a/nemo/collections/nlp/parts/nlp_overrides.py b/nemo/collections/nlp/parts/nlp_overrides.py index 1c68ebff8121..65ffb7df47f4 100644 --- a/nemo/collections/nlp/parts/nlp_overrides.py +++ b/nemo/collections/nlp/parts/nlp_overrides.py @@ -35,6 +35,7 @@ from pytorch_lightning.loops.fetchers import _DataFetcher from pytorch_lightning.plugins import ClusterEnvironment from pytorch_lightning.plugins.io.checkpoint_plugin import CheckpointIO +from pytorch_lightning.plugins.io.wrapper import _WrappingCheckpointIO from pytorch_lightning.plugins.precision import MixedPrecisionPlugin from pytorch_lightning.plugins.precision.fsdp import FSDPPrecision from pytorch_lightning.strategies import DDPStrategy, FSDPStrategy @@ -120,7 +121,7 @@ def init_model_parallel( sharp: bool, nccl_communicator_config_path: str = None, distributed_timeout_minutes: int = 30 ) -> None: - """ Initializes Megatron-LM model parallel if using model parallelism. + """Initializes Megatron-LM model parallel if using model parallelism. Args: sharp: Apply SHARP to NCCL data-parallel communication. @@ -164,7 +165,7 @@ def init_model_parallel( class NLPDDPStrategy(DDPStrategy): - """ DDP plugin for Pytorch Lightning. Needed to customize DDP for model parallel models. + """DDP plugin for Pytorch Lightning. Needed to customize DDP for model parallel models. Args: no_ddp_communication_hook: Disable DDP communication hook when using AMP-O2 @@ -231,8 +232,8 @@ def setup_distributed(self, global_rank: int = None, world_size: int = None) -> ) def configure_ddp(self): - """ Override LightningModule ddp if using model parallel. - Sets find_unused_parameters to False to use activation-checkpoint-recomputation. + """Override LightningModule ddp if using model parallel. + Sets find_unused_parameters to False to use activation-checkpoint-recomputation. 
""" if (hasattr(self.model, 'megatron_amp_O2') and self.model.megatron_amp_O2) or ( @@ -362,9 +363,6 @@ def save_checkpoint( unsharded_optim_state=checkpoint['optimizer_states'][0] ) checkpoint['optimizer_states'] = [sharded_optim_state] - # dist_checkpointing expects a directory so we will name the directory - # using the path with the file extension removed - checkpoint_dir = ckpt_to_dir(filepath) # remove device state_dict checkpoint['state_dict'] = OrderedDict([]) @@ -406,7 +404,7 @@ def load_model_state_dict(self, checkpoint: Mapping[str, Any], strict: bool = Tr self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict) def _fix_tensors_device(self, ckpt: Dict) -> Dict: - """ Ensure checkpoint tensors are on the correct device.""" + """Ensure checkpoint tensors are on the correct device.""" assert torch.cuda.is_initialized(), (torch.cuda.is_available(), torch.cuda.is_initialized()) cur_dev = torch.device("cuda", index=torch.cuda.current_device()) @@ -418,10 +416,10 @@ def _fix_device(t): return dict_list_map_outplace(_fix_device, ckpt) def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: - """ PTL method which we override to integrate distributed checkpoints for model parallel models. - In order to load distributed checkpoints we need to provide the sharded_state_dict to - the distributed load function. We get the sharded_state_dict from self.lightning_module - which makes it convenient to have the loading logic happen at the strategy level. + """PTL method which we override to integrate distributed checkpoints for model parallel models. + In order to load distributed checkpoints we need to provide the sharded_state_dict to + the distributed load function. We get the sharded_state_dict from self.lightning_module + which makes it convenient to have the loading logic happen at the strategy level. """ fs = get_filesystem(checkpoint_path) @@ -466,7 +464,10 @@ def remove_checkpoint(self, filepath: Union[str, Path]) -> None: @property def use_distributed_checkpointing(self): - has_dist_ckpt_io = HAVE_MEGATRON_CORE and isinstance(self.checkpoint_io, DistributedCheckpointIO) + checkpoint_io = self.checkpoint_io + while isinstance(checkpoint_io, _WrappingCheckpointIO): + checkpoint_io = checkpoint_io.checkpoint_io + has_dist_ckpt_io = HAVE_MEGATRON_CORE and isinstance(checkpoint_io, DistributedCheckpointIO) has_sharded_state_dict = ( hasattr(self.lightning_module, 'sharded_state_dict') and self.lightning_module.sharded_state_dict() is not None @@ -500,15 +501,15 @@ def distributed_sampler_kwargs(self): @property def restore_checkpoint_after_setup(self) -> bool: - """ This needs to be True for distributed checkpointing because - we require the model to have configured the optimizer before - deserializing the checkpoint. + """This needs to be True for distributed checkpointing because + we require the model to have configured the optimizer before + deserializing the checkpoint. """ return True class NLPDDPStrategyNotebook(NLPDDPStrategy): - """ Version of NLPDDPStrategy to be used in a Jupyter Notebook + """Version of NLPDDPStrategy to be used in a Jupyter Notebook A large portion of Megatron code has DDP dependency, so it has been necessary to use NLPDDPStrategy even for single-GPU training (e.g. in a Jupyter notebook) A PTL 2.0 changes has prevented DDPStrategy to be used in a notebook. 
@@ -546,7 +547,7 @@ def _get_full_state_dict_context(module: torch.nn.Module, rank0_only: bool = Fal class NLPFSDPStrategy(FSDPStrategy): - """ FSDP plugin for Pytorch Lightning with the support for tensor-parallelism. + """FSDP plugin for Pytorch Lightning with the support for tensor-parallelism. Args: sharding_strategy: FSDP parameter sharding strategy. @@ -639,7 +640,11 @@ def _set_mixed_precision_recipe( reduce_dtype = utils_funcs.torch_dtype_from_precision(grad_reduce_dtype, None) if set_buffer_dtype is not None: buffer_dtype = utils_funcs.torch_dtype_from_precision(buffer_dtype, None) - return MixedPrecision(param_dtype=param_dtype, reduce_dtype=reduce_dtype, buffer_dtype=buffer_dtype,) + return MixedPrecision( + param_dtype=param_dtype, + reduce_dtype=reduce_dtype, + buffer_dtype=buffer_dtype, + ) def setup_environment(self) -> None: """ @@ -750,7 +755,9 @@ def _get_osd(opt_state): with FSDP.summon_full_params(self.model, writeback=True, rank0_only=False): # rekey the osd stored from non-FSDP model rekeyed_osd = FSDP.rekey_optim_state_dict( - temp_osd, OptimStateKeyType.PARAM_NAME, self.model, + temp_osd, + OptimStateKeyType.PARAM_NAME, + self.model, ) temp_osd = FSDP.shard_full_optim_state_dict(rekeyed_osd, self.model) except Exception as e: @@ -758,7 +765,9 @@ def _get_osd(opt_state): exit(1) # Shard optimizer state dict sharded_osd = FSDP.optim_state_dict_to_load( - optim_state_dict=temp_osd, model=self.model, optim=optimizer, + optim_state_dict=temp_osd, + model=self.model, + optim=optimizer, ) optimizer.load_state_dict(sharded_osd) @@ -767,9 +776,9 @@ def _get_osd(opt_state): def save_checkpoint( self, checkpoint: Dict[str, Any], filepath: Union[str, Path], storage_options: Optional[Any] = None ) -> None: - """ Store checkpoints - 1. In case of sharded checkpoint, all ranks store unique checkpoints. - 2. In case of non-sharded checkpoint, all data-parallel rank 0 store checkpoints. + """Store checkpoints + 1. In case of sharded checkpoint, all ranks store unique checkpoints. + 2. In case of non-sharded checkpoint, all data-parallel rank 0 store checkpoints. """ app_state = AppState() filepath = inject_model_parallel_rank(filepath, fsdp_sharded_ckpt=self.sharded_checkpoint) @@ -780,8 +789,7 @@ def save_checkpoint( self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options) def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: - """ Load checkpoints - """ + """Load checkpoints""" # 1. Load normal or FSDP-sharded checkpoints. fs = get_filesystem(checkpoint_path) @@ -798,8 +806,7 @@ def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: return checkpoint def remove_checkpoint(self, filepath: Union[str, Path]) -> None: - """ Remove checkpoints - """ + """Remove checkpoints""" # legacy checkpoint logic, does not use megatron core app_state = AppState() # PTL override to accomodate model parallel checkpoints @@ -814,9 +821,9 @@ def remove_checkpoint(self, filepath: Union[str, Path]) -> None: @property def restore_checkpoint_after_setup(self) -> bool: - """ When loading FSDP-sharded checkpoint, need to restore checkpoint after configuring - FSDP sharding to match FSDP-sharded format between the checkpoint and the current - model and optimizer. + """When loading FSDP-sharded checkpoint, need to restore checkpoint after configuring + FSDP sharding to match FSDP-sharded format between the checkpoint and the current + model and optimizer. 
""" return True @@ -915,7 +922,8 @@ def dummy(): else: # move weights to the tmpdir for tp_rank, pp_rank in itertools.product( - range(app_state.tensor_model_parallel_size), range(app_state.pipeline_model_parallel_size), + range(app_state.tensor_model_parallel_size), + range(app_state.pipeline_model_parallel_size), ): os.makedirs(os.path.join(tmpdir, f'tp_rank_{tp_rank:02d}_pp_rank_{pp_rank:03d}')) mp_model_weights = os.path.join( @@ -1000,6 +1008,7 @@ def modify_state_dict(self, conf, state_dict): loaded_keys = state_dict.keys() if 'model.model.diffusion_model.input_blocks.1.0.in_layers.2.weight' in loaded_keys: new_state_dict = {} + # GroupNormOpt fuses activation function to one layer, thus the indexing of weights are shifted for following def should_process(key): base_str = "model.model.diffusion_model." @@ -1110,7 +1119,13 @@ def restore_from( # Get path where the command is executed - the artifacts will be "retrieved" there # (original .nemo behavior) loaded_params = super().load_config_and_state_dict( - calling_cls, restore_path, override_config_path, map_location, strict, return_config, trainer, + calling_cls, + restore_path, + override_config_path, + map_location, + strict, + return_config, + trainer, ) if not isinstance(loaded_params, tuple) or return_config is True: return loaded_params @@ -1165,12 +1180,12 @@ def dummy(): class PipelineMixedPrecisionPlugin(MixedPrecisionPlugin): - """ Overrides PTL autocasting to not wrap training/val/test_step. - We do this because we have the megatron-core fwd/bwd functions in training_step. - This means .backward is being called in training_step so we do not want the whole - step wrapped in autocast. + """Overrides PTL autocasting to not wrap training/val/test_step. + We do this because we have the megatron-core fwd/bwd functions in training_step. + This means .backward is being called in training_step so we do not want the whole + step wrapped in autocast. - We instead wrap the fwd_output_and_loss_func that is passed to the megatron-core fwd/bwd functions. + We instead wrap the fwd_output_and_loss_func that is passed to the megatron-core fwd/bwd functions. """ def __init__( @@ -1206,12 +1221,12 @@ def forward_context(self) -> Generator[None, None, None]: class FSDPMixedPrecisionPlugin(FSDPPrecision): - """ Overrides PTL autocasting to not wrap training/val/test_step. - We do this because we have the megatron-core fwd/bwd functions in training_step. - This means .backward is being called in training_step so we do not want the whole - step wrapped in autocast. + """Overrides PTL autocasting to not wrap training/val/test_step. + We do this because we have the megatron-core fwd/bwd functions in training_step. + This means .backward is being called in training_step so we do not want the whole + step wrapped in autocast. - We instead wrap the fwd_output_and_loss_func that is passed to the megatron-core fwd/bwd functions. + We instead wrap the fwd_output_and_loss_func that is passed to the megatron-core fwd/bwd functions. """ def __init__( @@ -1246,7 +1261,7 @@ class GradScaler(torch.cuda.amp.GradScaler): def __init__( self, - init_scale=2.0 ** 16, + init_scale=2.0**16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, @@ -1500,7 +1515,7 @@ def optimizer_step( @contextmanager def forward_context(self) -> Generator[None, None, None]: - """ No explicit precision casting. Inputs are supposed to be manually casted """ + """No explicit precision casting. 
Inputs are supposed to be manually casted""" try: yield finally: @@ -1508,7 +1523,7 @@ def forward_context(self) -> Generator[None, None, None]: class GlobalBatchDataFetcher(_DataFetcher): - """ Overrides PTL DataFetcher. Used to fetch global batches.""" + """Overrides PTL DataFetcher. Used to fetch global batches.""" def __init__(self, prefetch_batches: int = 0, store_on_device: bool = False) -> None: diff --git a/nemo/utils/callbacks/checkpointing_context.py b/nemo/utils/callbacks/checkpointing_context.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/nemo/utils/callbacks/dist_ckpt_io.py b/nemo/utils/callbacks/dist_ckpt_io.py index 2e695dd7bbaa..905de4eb3567 100644 --- a/nemo/utils/callbacks/dist_ckpt_io.py +++ b/nemo/utils/callbacks/dist_ckpt_io.py @@ -1,41 +1,217 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import shutil +from abc import ABC, abstractmethod +from contextlib import contextmanager +from time import time from typing import Any, Dict, Optional +import pytorch_lightning as pl from lightning_fabric.plugins import CheckpointIO from lightning_fabric.utilities.cloud_io import get_filesystem from lightning_fabric.utilities.types import _PATH -from megatron.core import dist_checkpointing -from megatron.core.dist_checkpointing.strategies import tensorstore +from pytorch_lightning import Callback +from pytorch_lightning.plugins.io.wrapper import _WrappingCheckpointIO from nemo.utils import logging +try: + from megatron.core import dist_checkpointing + from megatron.core.dist_checkpointing.strategies import tensorstore + + from nemo.utils.callbacks.torch_dist_async import AsyncCallsQueue, AsyncRequest, TorchDistAsyncSaveShardedStrategy + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError) as IMPORT_ERROR_EXC: + + HAVE_MEGATRON_CORE = False + IMPORT_ERROR = "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + + +@contextmanager +def _debug_time(name: str): + """Simple context manager for timing functions/code blocks.""" + start = time() + try: + yield + finally: + logging.debug(f'{name} took {time() - start:.3f}s') + + +class AsyncCompatibleCheckpointIO(CheckpointIO, ABC): + """CheckpointIO that can be used together with async saving. + + Differs from the regular CheckpointIO only by the `save_checkpoint` + return type. The `save_checkpoint` method itself is synchronous, but returns + callbacks that can be performed asynchronously. + """ + + @abstractmethod + def save_checkpoint( + self, checkpoint: Dict[str, Any], path: _PATH, storage_options: Optional[Any] = None + ) -> 'AsyncRequest': + raise NotImplementedError + -class DistributedCheckpointIO(CheckpointIO): - """ CheckpointIO for a distributed checkpoint format. +class AsyncFinalizableCheckpointIO(_WrappingCheckpointIO): + """CheckpointIO wrapper for async checkpoint saving and synchronous finalization. 
+ + Runs main part of the checkpoint save in a separate process (not thread as the PTL + AsyncCheckpointIO does). Allows to perform a (synchronous) finalization + function after all ranks finish checkpoint saving. + + NOTE: for correctness, this plugin must be used together with the + AsyncFinalizerCallback callback which performs the finalization checks. + + Args: + checkpoint_io (CheckpointIO): wrapped checkpoint_io object. Must be + of type AsyncCompatibleCheckpointIO. + Requires the underlying checkpoint_io.save_checkpoint to return save_fn, save_args, finalize_fn. + """ + + def __init__(self, checkpoint_io: AsyncCompatibleCheckpointIO) -> None: + if not HAVE_MEGATRON_CORE: + raise ImportError(IMPORT_ERROR) from IMPORT_ERROR_EXC + if not isinstance(checkpoint_io, AsyncCompatibleCheckpointIO): + raise ValueError(f'Incompatible wrapped checkpoint_io type: {type(checkpoint_io)}') + + super().__init__(checkpoint_io) + self.async_calls_queue = AsyncCallsQueue() + + def save_checkpoint(self, checkpoint: Dict[str, Any], path: _PATH, storage_options: Optional[Any] = None) -> None: + """Executes async request returned from the underlying checkpoint_io asynchronously. + + Requires the underlying checkpoint_io.save_checkpoint to return an AsyncRequest. + It is then applied with `self.async_calls_queue` asynchronously. + + Args: + checkpoint (Dict[str, Any]): checkpoint to save. Passed to underlying + checkpoint_io without modifications. + path (_PATH): path to save the checkpoint. Passed to underlying + checkpoint_io without modifications. + storage_options (Any, optional): storage control modifiers. This class + consumed the `finalize_fn` parameter (if any), which is expected to be + a callback and is appended to async finalization functions. + + Applies underlying checkpoint_io finalize callback first, then the external one (postfix order). + """ + external_finalize_fn = (storage_options or {}).pop('finalize_fn', None) + assert isinstance(self.checkpoint_io, AsyncCompatibleCheckpointIO), type(self.checkpoint_io) + async_request = self.checkpoint_io.save_checkpoint(checkpoint, path, storage_options) + if external_finalize_fn is not None: + async_request.add_finalize_fn(external_finalize_fn) + call_idx = self.async_calls_queue.schedule_async_request(async_request) + logging.debug(f'Scheduled an async call #{call_idx}') + + @_debug_time('AsyncFinalizableCheckpointIO.maybe_finalize_save_checkpoint') + def maybe_finalize_save_checkpoint(self, blocking: bool = False): + """Performs checkpoint finalization (if possible). + + Args: + blocking (bool, optional): if True, waits until all async saves are + completed. Otherwise, finalizes only those async calls which are + already done on all ranks. Defaults to False. + """ + call_idx_finalized = self.async_calls_queue.maybe_finalize_async_calls(blocking) + if call_idx_finalized: + logging.debug(f'Finalized async calls: {[f"#{idx}" for idx in call_idx_finalized]}') + return len(call_idx_finalized) > 0 + + def teardown(self) -> None: + """Warns if there are any pending checkpoint saves.""" + super().teardown() + if self.async_calls_queue.get_num_unfinalized_calls() > 0: + # Can't do finalization now because some ranks might be lost + logging.warning('Some async checkpoint saves might be not finalized properly.') + + +class AsyncFinalizerCallback(Callback): + """Callback which finalizes async saves initiated by the AsyncFinalizableCheckpointIO. + + Tries to perform non-blocking finalization on train_batch_end and train_epoch_end. 
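Putting this wrapper together with the trainer-builder hunk earlier in the patch, the intended wiring is roughly the following condensed sketch; it mirrors what MegatronTrainerBuilder already assembles rather than introducing a separate API:

from pytorch_lightning import Trainer

from nemo.utils.callbacks.dist_ckpt_io import (
    AsyncFinalizableCheckpointIO,
    AsyncFinalizerCallback,
    DistributedCheckpointIO,
)

# Async save is only supported for distributed (torch_dist) checkpoints.
checkpoint_io = DistributedCheckpointIO(save_ckpt_format='torch_dist', async_save=True)
checkpoint_io = AsyncFinalizableCheckpointIO(checkpoint_io)  # heavy write happens in a forked process

trainer = Trainer(
    plugins=[checkpoint_io],
    callbacks=[AsyncFinalizerCallback()],  # finalizes finished saves on batch/epoch end, blocks on train end
)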
+ On train_end performs a blocking finalization of all pending checkpoints. + """ + + def on_train_batch_end(self, trainer: "pl.Trainer", *args, **kwargs) -> None: + self._get_checkpoint_io(trainer).maybe_finalize_save_checkpoint(blocking=False) + + def on_train_epoch_end(self, trainer: "pl.Trainer", *args, **kwargs) -> None: + self._get_checkpoint_io(trainer).maybe_finalize_save_checkpoint(blocking=False) + + def on_train_end(self, trainer: "pl.Trainer", *args, **kwargs) -> None: + checkpoint_io = self._get_checkpoint_io(trainer) + if checkpoint_io.async_calls_queue.get_num_unfinalized_calls() > 0: + logging.info('Pending async checkpoint saves. Finalizing them synchronously now') + self._get_checkpoint_io(trainer).maybe_finalize_save_checkpoint(blocking=True) + + def _get_checkpoint_io(self, trainer) -> AsyncFinalizableCheckpointIO: + checkpoint_io = trainer.strategy.checkpoint_io + if not isinstance(checkpoint_io, AsyncFinalizableCheckpointIO): + raise ValueError(f'Async finalizer requires an async compatible CheckpointIO, got: {checkpoint_io}') + return checkpoint_io + + +class DistributedCheckpointIO(AsyncCompatibleCheckpointIO): + """CheckpointIO for a distributed checkpoint format. Args: save_ckpt_format (str): Distributed checkpoint format to use for checkpoint saving. load_directly_on_device (bool, optional): if True, loads the weights directly on GPU. Has effect only for `zarr` based checkpoints (PyT Distributed always loads on device). Defaults to True. + async_save (bool): whether to save asynchronously. Should be set to True if + this class will be wrapped with AsyncFinalizableCheckpointIO. """ - def __init__(self, save_ckpt_format: str, load_directly_on_device: bool = True): + def __init__( + self, + save_ckpt_format: str, + load_directly_on_device: bool = True, + async_save: bool = False, + ): super().__init__() + if not HAVE_MEGATRON_CORE: + raise ImportError(IMPORT_ERROR) from IMPORT_ERROR_EXC + self.save_ckpt_format = save_ckpt_format self.load_directly_on_device = load_directly_on_device - - self.save_sharded_strategy = self.determine_dist_ckpt_save_strategy() + self.async_save = async_save + self.save_sharded_strategy = self._determine_dist_ckpt_save_strategy() @classmethod - def from_config(cls, model_cfg): + def from_config(cls, model_cfg: dict, async_save: bool = False): + """Instantiates a DistributedCheckpointIO from a config dict. + + Args: + model_cfg (dict): model config dict. Most of the configuration + is extracted from this config. + async_save (bool, optional): async_save flag is not part of the model config, + it should be provided separately. Defaults to False. + """ return cls( save_ckpt_format=model_cfg.get('dist_ckpt_format', 'zarr'), load_directly_on_device=model_cfg.get('dist_ckpt_load_on_device', True), + async_save=async_save, ) - def save_checkpoint(self, checkpoint: Dict[str, Any], path: _PATH, storage_options: Optional[Any] = None) -> None: - """ Saves a distributed checkpoint. Creates the checkpoint root directory if doesn't exist. + @_debug_time('DistributedCheckpointIO.save_checkpoint') + def save_checkpoint( + self, checkpoint: Dict[str, Any], path: _PATH, storage_options: Optional[Any] = None + ) -> Optional['AsyncRequest']: + """Saves a distributed checkpoint. Creates the checkpoint root directory if doesn't exist. 
Args: checkpoint (Dict[str, Any]): sharded state dict to save @@ -48,11 +224,19 @@ def save_checkpoint(self, checkpoint: Dict[str, Any], path: _PATH, storage_optio dist_checkpointing.save( sharded_state_dict=checkpoint, checkpoint_dir=path, sharded_strategy=self.save_sharded_strategy ) + if not self.async_save: + return None + # NOTE: this logic will be simplified in MCore v0.7 + assert self.save_sharded_strategy.async_request is not None + async_request = self.save_sharded_strategy.async_request + self.save_sharded_strategy.async_request = None + return async_request + @_debug_time('DistributedCheckpointIO.load_checkpoint') def load_checkpoint( self, path: _PATH, map_location: Optional[Any] = None, sharded_state_dict: Dict[str, Any] = None ) -> Dict[str, Any]: - """ Loads a distributed checkpoint. + """Loads a distributed checkpoint. Args: path (_PATH): checkpoint directory @@ -79,18 +263,25 @@ def load_checkpoint( sharded_state_dict=sharded_state_dict, checkpoint_dir=path, sharded_strategy=sharded_strategy ) + @_debug_time('DistributedCheckpointIO.remove_checkpoint') def remove_checkpoint(self, path: _PATH) -> None: - """ Remove a distributed checkpoint. + """Remove a distributed checkpoint. Due to potentially large number of files, the implementation remove the whole directory at once. """ shutil.rmtree(path, ignore_errors=True) - def determine_dist_ckpt_save_strategy(self): - """ Determine the saving strategy based on storage config. + def _determine_dist_ckpt_save_strategy(self): + """Determine the saving strategy based on constructor args. - For now only decides the checkpoint format. + If self.async_save is True instantiates an async PyT Dist strategy, + otherwise relies on MCore to create a proper strategy based on ckpt format. """ save_strategy = (self.save_ckpt_format, 1) + if self.async_save: + if save_strategy[0] != 'torch_dist': + raise ValueError('Async dist-ckpt save supported only for torch_dist format') + save_strategy = TorchDistAsyncSaveShardedStrategy('torch_dist', 1) + logging.info(f'Using {save_strategy} dist-ckpt save strategy.') return save_strategy diff --git a/nemo/utils/callbacks/nemo_model_checkpoint.py b/nemo/utils/callbacks/nemo_model_checkpoint.py index f8bdb9d9b294..15e8a4e21f55 100644 --- a/nemo/utils/callbacks/nemo_model_checkpoint.py +++ b/nemo/utils/callbacks/nemo_model_checkpoint.py @@ -21,19 +21,21 @@ import pytorch_lightning import torch +from _weakref import proxy from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint, _is_local_file_protocol from pytorch_lightning.utilities import rank_zero_info from nemo.collections.common.callbacks import EMA from nemo.utils import logging from nemo.utils.app_state import AppState +from nemo.utils.callbacks.dist_ckpt_io import AsyncFinalizableCheckpointIO from nemo.utils.get_rank import is_global_rank_zero from nemo.utils.model_utils import ckpt_to_dir, inject_model_parallel_rank, uninject_model_parallel_rank class NeMoModelCheckpoint(ModelCheckpoint): - """ Light wrapper around Lightning's ModelCheckpoint to force a saved checkpoint on train_end. - Extends Lightning's on_save_checkpoint func to save the .nemo file. Saves the .nemo file based + """Light wrapper around Lightning's ModelCheckpoint to force a saved checkpoint on train_end. + Extends Lightning's on_save_checkpoint func to save the .nemo file. Saves the .nemo file based on the best checkpoint saved (according to the monitor value). Also contains func to save the EMA copy of the model. 
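On the configuration side, the same behavior is switched on from the experiment manager section of the YAML config, paired with the torch_dist checkpoint format that the save strategy above requires; an illustrative excerpt (the values shown are examples, not the defaults added by this patch):

exp_manager:
  checkpoint_callback_params:
    async_save: True            # new flag; only valid together with distributed checkpoints
model:
  dist_ckpt_format: torch_dist  # async save is implemented only for the torch_dist format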
""" @@ -48,6 +50,7 @@ def __init__( postfix: str = ".nemo", n_resume: bool = False, model_parallel_size: int = None, + async_save: bool = False, # controls only finalize callbacks **kwargs, ): # Parse and store "extended" parameters: save_best model and postfix. @@ -64,6 +67,13 @@ def __init__( self.postfix = postfix self.previous_best_path = "" self.model_parallel_size = model_parallel_size + self.async_save = async_save + self.async_finalize_cb = None + # Checkpoints which removal is deferred until async save is done. + # Each element of `deferred_ckpts_to_remove` is a growing list + # that `self._remove_checkpoint` adds to. Once `self._save_checkpoint` + # is called, the last element is frozen and a new element is added. + self.deferred_ckpts_to_remove: List[List[str]] = [] # `prefix` is deprecated if 'prefix' in kwargs: @@ -262,7 +272,7 @@ def on_train_end(self, trainer, pl_module): pl_module.save_to(save_path=self._format_nemo_checkpoint_name()) def _backup_existing_nemo_ckpt(self, trainer) -> str: - """ Search for an available name with version infix and rename existing checkpoint. + """Search for an available name with version infix and rename existing checkpoint. NOTE: this behavior is slightly different from regular checkpoints. PTL creates new regular checkpoint with the first available name. @@ -330,15 +340,15 @@ def _ema_callback(self, trainer: 'pytorch_lightning.Trainer') -> Optional[EMA]: @staticmethod def format_checkpoint_unfinished_marker_path(checkpoint_path: Union[Path, str]) -> Path: - """ Format the path to the unfinished checkpoint marker file. - + """Format the path to the unfinished checkpoint marker file. + If the marker file exists, corresponding checkpoint is considered unfinished/incomplete. NOTE: Marker path for the EMA checkpoint part is the same as for the original checkpoint. - + Args: checkpoint_path: Path to the checkpoint file or dir. Does not need to exist. - + Returns: Path to the unfinished checkpoint marker file. """ @@ -350,7 +360,7 @@ def format_checkpoint_unfinished_marker_path(checkpoint_path: Union[Path, str]) @staticmethod def is_checkpoint_unfinished(checkpoint_path: Union[Path, str]) -> bool: - """ Check if the checkpoint is unfinished. + """Check if the checkpoint is unfinished. Args: checkpoint_path: Path to the checkpoint file or dir. @@ -363,7 +373,7 @@ def is_checkpoint_unfinished(checkpoint_path: Union[Path, str]) -> bool: @staticmethod def set_checkpoint_unfinished_marker(checkpoint_path: Union[Path, str], barrier_after=False) -> None: - """ Marks given checkpoint as unfinished. + """Marks given checkpoint as unfinished. Args: checkpoint_filepath: Path to the checkpoint file or dir. 
@@ -409,6 +419,8 @@ def _save_checkpoint(self, trainer: 'pytorch_lightning.Trainer', filepath: str) self.set_checkpoint_unfinished_marker(filepath, barrier_after=True) ema_callback = self._ema_callback(trainer) if ema_callback is not None: + if self.async_save: + raise ValueError('async_save with EMA not supported') with ema_callback.save_original_optimizer_state(trainer): super()._save_checkpoint(trainer, filepath) @@ -418,13 +430,71 @@ def _save_checkpoint(self, trainer: 'pytorch_lightning.Trainer', filepath: str) if self.verbose: rank_zero_info(f"Saving EMA weights to separate checkpoint {filepath}") super()._save_checkpoint(trainer, filepath) + self.remove_checkpoint_unfinished_marker(filepath, barrier_before=True) else: - super()._save_checkpoint(trainer, filepath) - # barrier_before=True, so all ranks synchronize before removing the unfinished checkpoint marker - # we don't want to remove the marker until all checkpointing is done. - self.remove_checkpoint_unfinished_marker(filepath, barrier_before=True) + # Async save passed the finalization function to checkpoint_io, + # sync save calls the finalization function immediately after save. + finalize_fn = self._get_finalize_save_checkpoint_callback(trainer, filepath, trainer.global_step) + if self.async_save: + checkpoint_io = trainer.strategy.checkpoint_io + if not isinstance(checkpoint_io, AsyncFinalizableCheckpointIO): + raise ValueError('Async save requires async compatible CheckpointIO') + storage_options = dict(finalize_fn=finalize_fn) + # Each upcoming ckpt removal request will be executed as part of this save finalization + self.deferred_ckpts_to_remove.append([]) + else: + storage_options = None + trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options) + if self.async_save: + logging.info(f'Scheduled async checkpoint save for {filepath}') + else: + finalize_fn() + + def _get_finalize_save_checkpoint_callback( + self, trainer: 'pytorch_lightning.Trainer', filepath: str, global_step: int + ): + """Creates a callback that can be used to finalize async (and sync) ckpt saves.""" - def _remove_checkpoint(self, trainer: "pytorch_lightning.Trainer", filepath: str) -> None: + def _cb(): + logging.debug(f'Finalize callback called for step {global_step}, filepath {filepath}') + self._last_global_step_saved = global_step + self._last_checkpoint_saved = filepath + + # notify loggers + if trainer.is_global_zero: + for logger in trainer.loggers: + logger.after_save_checkpoint(proxy(self)) + + # barrier_before=True, so all ranks synchronize before removing the unfinished checkpoint marker + # we don't want to remove the marker until all checkpointing is done. + self.remove_checkpoint_unfinished_marker(filepath, barrier_before=True) + + if not self.async_save: + return + + logging.info(f'Async checkpoint save for step {global_step} ({filepath}) finalized successfully.') + + # Remove checkpoints marked for removal by `self._remove_checkpoint` + # For each finalization there is exactly one entry in self.deferred_ckpts_to_remove + assert self.deferred_ckpts_to_remove + ckpts_to_remove = self.deferred_ckpts_to_remove.pop(0) + logging.debug(f'Checkpoints to remove: {ckpts_to_remove}') + for ckpt_to_remove in ckpts_to_remove: + self._remove_checkpoint(trainer, ckpt_to_remove, override_async=True) + + return _cb + + def _remove_checkpoint(self, trainer: "pytorch_lightning.Trainer", filepath: str, override_async=False) -> None: + """Performs checkpoint removal or deferred removal. 
+ + With async save, `self._remove_checkpoint` is called before the checkpoint + is actually finished so we can't remove it. Instead we add it to + `self.deferred_ckpts_to_remove` for future removal. + """ + if self.async_save and not override_async: + # Register checkpoint removal in the last (active) checkpoint removal list + self.deferred_ckpts_to_remove[-1].append(filepath) + return # barrier_after=True, so all ranks continue after the unfinished checkpoint marker is placed. # if anything goes wrong during removal, we should be able to detect that data is incomplete. self.set_checkpoint_unfinished_marker(filepath, barrier_after=True) @@ -499,7 +569,7 @@ def _should_remove_checkpoint(self, trainer: "pl.Trainer", previous: str, curren A checkpoint won't be deleted if any of the cases apply: - The previous checkpoint is the same as the current checkpoint (means the old was already overwritten by new) - The previous checkpoint is not in the current checkpoint directory and the filesystem is local - - The previous checkpoint is the checkpoint the Trainer resumed from and the filesystem is local + - The previous checkpoint is the checkpoint the Trainer resumed from and the filesystem is local and the resumed from checkpoint is not the last checkpoint """ if previous == current: diff --git a/nemo/utils/callbacks/torch_dist_async.py b/nemo/utils/callbacks/torch_dist_async.py new file mode 100644 index 000000000000..1cd226af9cdb --- /dev/null +++ b/nemo/utils/callbacks/torch_dist_async.py @@ -0,0 +1,298 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import deque +from pathlib import Path +from time import time +from typing import Callable, List, NamedTuple, Optional, Tuple + +import torch +from megatron.core.dist_checkpointing.mapping import ShardedStateDict +from megatron.core.dist_checkpointing.strategies.filesystem_async import FileSystemWriterAsync +from megatron.core.dist_checkpointing.strategies.state_dict_saver import ( + save_state_dict_async_finalize, + save_state_dict_async_plan, +) +from megatron.core.dist_checkpointing.strategies.torch import ( + MCoreSavePlanner, + TorchDistSaveShardedStrategy, + _replace_state_dict_keys_with_sharded_keys, + mcore_to_pyt_state_dict, +) +from torch import multiprocessing as mp + +from nemo.utils import logging + + +class TorchDistAsyncSaveShardedStrategy(TorchDistSaveShardedStrategy): + """Async save strategy for the PyT Distributed format. + + NOTE: this class will be removed and replaced with an MCore version + """ + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.async_request = None + + def save(self, sharded_state_dict: ShardedStateDict, checkpoint_dir: Path): + """Translates MCore ShardedTensors to PyT ShardedTensors and saves in PyT Distributed format. 
+ + Args: + sharded_state_dict (ShardedStateDict): sharded state dict to save + checkpoint_dir (Path): checkpoint directory + + Returns: None + """ + # Translate the state dict + ( + sharded_state_dict, + flat_mapping, + rename_mapping, + ) = _replace_state_dict_keys_with_sharded_keys(sharded_state_dict, self.keep_only_main_replica) + pyt_state_dict = mcore_to_pyt_state_dict(sharded_state_dict, False) + # Use PyT saving mechanism + writer = FileSystemWriterAsync(checkpoint_dir, thread_count=self.thread_count) + + save_state_dict_ret = save_state_dict_async_plan( + pyt_state_dict, + writer, + None, + planner=MCoreSavePlanner(), + ) + self.async_request = self._get_save_and_finalize_callbacks(writer, save_state_dict_ret) + return self.async_request + + def _get_save_and_finalize_callbacks(self, writer, save_state_dict_ret): + save_fn_args = writer.get_save_function_and_args() + if save_fn_args is None: # this check can be removed with MCore v0.7 + save_fn_args = None, () + save_fn, save_args = save_fn_args + + def finalize_fn(): + save_state_dict_async_finalize(*save_state_dict_ret) + torch.distributed.barrier() + + return AsyncRequest(save_fn, save_args, [finalize_fn]) + + +class AsyncRequest(NamedTuple): + """Represents an async request that needs to be scheduled for execution. + + NOTE: this class will be removed and replaced with an MCore version + + Args: + async_fn (Callable, optional): async function to call. None represents noop. + async_fn_args (Tuple): args to pass to `async_fn`. + finalize_fns (List[Callable]): list of functions to call to finalize the request. + These functions will be called synchronously after `async_fn` is done + *on all ranks*. + """ + + async_fn: Optional[Callable] + async_fn_args: Tuple + finalize_fns: List[Callable] + is_frozen: bool = False + + def add_finalize_fn(self, fn: Callable) -> None: + """Adds a new finalize function to the request. + + Args: + fn (Callable): function to add to the async request. This function + will be called *after* existing finalization functions. + + Returns: + None + """ + if self.is_frozen: + raise RuntimeError('Cannot add finalization functions to a frozen AsyncRequest') + self.finalize_fns.append(fn) + + def execute_sync(self) -> None: + """Helper to synchronously execute the request. + + This logic is equivalent to what should happen in case of the async call. + """ + if self.async_fn is not None: + self.async_fn(*self.async_fn_args) + torch.distributed.barrier() + for finalize_fn in self.finalize_fns: + finalize_fn() + + def freeze(self) -> 'AsyncRequest': + """Freezes the async request, disallowing adding new finalization functions. + + Returns: + AsyncRequest: new async request with all same fields except for the + `is_frozen` flag. + """ + return self._replace(is_frozen=True) + + +class DistributedAsyncCaller: + """Wrapper around mp.Process that ensures correct semantic of distributed finalization. + + NOTE: this class will be removed and replaced with an MCore version + + Starts process asynchronously and allows checking if all processes on all ranks are done. + """ + + def __init__(self): + self.process: Optional[mp.Process] = None + self.start_time: Optional[float] = None + + def schedule_async_call( + self, + async_fn: Optional[Callable], + save_args: Tuple, + ) -> None: + """Spawn a process with `async_fn` as the target. + + This method must be called on all ranks. + + Args: + async_fn (Callable, optional): async function to call. If None, + no process will be started. + save_args (Tuple): async function args. 
+ """ + if async_fn is None: + return # nothing to do + torch.cuda.synchronize() + ctx = mp.get_context('fork') + self.start_time = time() + self.process = ctx.Process( + target=async_fn, + args=save_args, + ) + self.process.start() + + def is_current_async_call_done(self, blocking=False) -> bool: + """Check if async save is finished on all ranks. + + For semantic correctness, requires rank synchronization in each check. + This method must be called on all ranks. + + Args: + blocking (bool, optional): if True, will wait until the call is done + on all ranks. Otherwise, returns immediately if at least one rank + is still active. Defaults to False. + + Returns: + bool: True if all ranks are done (immediately of after active wait + if `blocking` is True), False if at least one rank is still active. + """ + # The following takes the same overhead as torch.distributed.barrier (single integer all-reduce) + is_alive = int(self.process.is_alive()) if self.process is not None else 0 + ten = torch.tensor([is_alive], dtype=torch.int, device=torch.cuda.current_device()) + logging.debug(f"[rank {torch.distributed.get_rank()}] DistributedAsyncCaller is_alive:{is_alive}") + torch.distributed.all_reduce(ten) + if ten[0] > 0 and not blocking: + return False + else: + if self.process is not None: + logging.debug(f"rank: {torch.distributed.get_rank()}, joining self.process") + self.process.join() + self.process = None + + logging.debug( + f"DistributedAsyncCaller: Async process join finished after {time() - self.start_time:.2f}s from forking" + ) + self.start_time = None + return True + + +class _ActiveAsyncRequest(NamedTuple): + """Helper to represent an active async call. + + NOTE: this class will be removed and replaced with an MCore version + + Args: + idx (int): index of the call (starting from 0) + async_caller (DistributedAsyncCaller): async caller instance that represents + the async process handling the async request + async_request (AsyncRequest): async request that is being called + """ + + idx: int + async_caller: DistributedAsyncCaller + async_request: AsyncRequest + + +class AsyncCallsQueue: + """Manages a queue of async calls. + + NOTE: this class will be removed and replaced with an MCore version + + Allows adding a new async call with `schedule_async_request` and finalizing + active calls with `maybe_finalize_async_calls`. + """ + + def __init__(self): + self.async_calls: deque[_ActiveAsyncRequest] = deque([]) + self.call_idx: int = -1 + + def schedule_async_request(self, async_request: AsyncRequest) -> int: + """Start a new async call and add it to a queue of active async calls. + + This method must be called on all ranks. + + Args: + async_request (AsyncRequest): async request to start. + + Returns: + int: index of the async call that was started. + This can help the user keep track of the async calls. + """ + self.call_idx += 1 + async_caller = DistributedAsyncCaller() + async_request = async_request.freeze() + async_caller.schedule_async_call(async_request.async_fn, async_request.async_fn_args) + self.async_calls.append(_ActiveAsyncRequest(self.call_idx, async_caller, async_request)) + return self.call_idx + + def maybe_finalize_async_calls(self, blocking=False) -> List[int]: + """Finalizes all available calls. + + This method must be called on all ranks. + + Args: + blocking (bool, optional): if True, will wait until all active requests + are done. Otherwise, finalizes only the async request that already + finished. Defaults to False. 
+ Returns: + List[int]: list of indices (as returned by `schedule_async_request`) + of async calls that have been successfully finalized. + """ + call_idx_finalized = [] + while self.async_calls: + next_async_done = self.async_calls[0].async_caller.is_current_async_call_done(blocking) + if not next_async_done: + break + call_idx, _, async_request = self.async_calls.popleft() + for finalize_fn in async_request.finalize_fns: + finalize_fn() + ten = torch.tensor([call_idx], dtype=torch.int, device=torch.cuda.current_device()) + torch.distributed.all_reduce(ten, op=torch.distributed.ReduceOp.MAX) + assert ( + ten.item() == call_idx + ), 'Unmatched async calls. That probably means not all ranks are participating in async finalization' + call_idx_finalized.append(call_idx) + return call_idx_finalized + + def get_num_unfinalized_calls(self): + """Get the number of active async calls.""" + return len(self.async_calls) + + def close(self): + """Finalize all calls upon closing.""" + self.maybe_finalize_async_calls(blocking=True) diff --git a/nemo/utils/exp_manager.py b/nemo/utils/exp_manager.py index 5c7cac5a9a55..9e8b55eade1f 100644 --- a/nemo/utils/exp_manager.py +++ b/nemo/utils/exp_manager.py @@ -51,11 +51,11 @@ class NotFoundError(NeMoBaseException): - """ Raised when a file or folder is not found""" + """Raised when a file or folder is not found""" class LoggerMisconfigurationError(NeMoBaseException): - """ Raised when a mismatch between trainer.logger and exp_manager occurs""" + """Raised when a mismatch between trainer.logger and exp_manager occurs""" def __init__(self, message): message = ( @@ -66,7 +66,7 @@ def __init__(self, message): class CheckpointMisconfigurationError(NeMoBaseException): - """ Raised when a mismatch between trainer.callbacks and exp_manager occurs""" + """Raised when a mismatch between trainer.callbacks and exp_manager occurs""" @dataclass @@ -106,6 +106,7 @@ class CallbackParams: save_nemo_on_train_end: Optional[bool] = True # Whether to automatically save .nemo file durin on_train_end hook model_parallel_size: Optional[int] = None # tensor parallel size * pipeline parallel size save_on_train_epoch_end: Optional[bool] = False # Save after training, not after validation + async_save: Optional[bool] = False # save the checkpoint asynchronously @dataclass @@ -128,8 +129,7 @@ class EMAParams: @dataclass class ExpManagerConfig: - """Experiment Manager config for validation of passed arguments. - """ + """Experiment Manager config for validation of passed arguments.""" # Log dir creation parameters explicit_log_dir: Optional[str] = None @@ -313,7 +313,7 @@ def exp_manager(trainer: 'pytorch_lightning.Trainer', cfg: Optional[Union[DictCo Set this to True if you are using DDP with many GPUs and do not want many log files in your exp dir. - log_global_rank_0_only (bool): Whether to only create log files for global rank 0. Defaults to False. Set this to True if you are using DDP with many GPUs and do not want many log files in your exp dir. - - max_time (str): The maximum wall clock time *per run*. This is intended to be used on clusters where you want + - max_time (str): The maximum wall clock time *per run*. This is intended to be used on clusters where you want a checkpoint to be saved after this specified time and be able to resume from that checkpoint. Defaults to None. - seconds_to_sleep (float): seconds to sleep non rank 0 processes for. 
Used to give enough time for rank 0 to initialize @@ -336,6 +336,10 @@ def exp_manager(trainer: 'pytorch_lightning.Trainer', cfg: Optional[Union[DictCo # Ensure passed cfg is compliant with ExpManagerConfig schema = OmegaConf.structured(ExpManagerConfig) + # TODO: remove this check + if is_global_rank_zero(): + logging.info('ExpManager schema') + logging.info(schema) if isinstance(cfg, dict): cfg = OmegaConf.create(cfg) elif not isinstance(cfg, DictConfig): @@ -681,7 +685,7 @@ def check_resume( def check_explicit_log_dir( trainer: 'pytorch_lightning.Trainer', explicit_log_dir: Union[Path, str], exp_dir: str, name: str, version: str ) -> Tuple[Path, str, str, str]: - """ Checks that the passed arguments are compatible with explicit_log_dir. + """Checks that the passed arguments are compatible with explicit_log_dir. Returns: log_dir (Path): the log_dir @@ -918,7 +922,7 @@ def configure_checkpointing( params: 'DictConfig', create_preemption_callback: bool, ): - """ Adds ModelCheckpoint to trainer. Raises CheckpointMisconfigurationError if trainer already has a ModelCheckpoint + """Adds ModelCheckpoint to trainer. Raises CheckpointMisconfigurationError if trainer already has a ModelCheckpoint callback """ for callback in trainer.callbacks: @@ -995,7 +999,12 @@ def check_slurm(trainer): class StatelessTimer(Timer): """Extension of PTL timers to be per run.""" - def __init__(self, duration: timedelta = None, interval: str = Interval.step, verbose: bool = True,) -> None: + def __init__( + self, + duration: timedelta = None, + interval: str = Interval.step, + verbose: bool = True, + ) -> None: super().__init__(duration, interval, verbose) # Override PTL Timer's state dict to not store elapsed time information so that we can restore and continue training. diff --git a/tests/core/test_dist_ckpt.py b/tests/core/test_dist_ckpt.py index b6dc5ca89d3e..8fe21a316854 100644 --- a/tests/core/test_dist_ckpt.py +++ b/tests/core/test_dist_ckpt.py @@ -1,6 +1,7 @@ import os import types from pathlib import Path +from typing import Any, Dict import pytest import pytorch_lightning as pl @@ -9,7 +10,19 @@ from pytorch_lightning.demos.boring_classes import BoringModel from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy -from nemo.utils.callbacks.dist_ckpt_io import DistributedCheckpointIO +from nemo.utils.callbacks.dist_ckpt_io import ( + AsyncFinalizableCheckpointIO, + AsyncFinalizerCallback, + DistributedCheckpointIO, +) + +try: + from megatron.core.dist_checkpointing import ShardedTensor + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + HAVE_MEGATRON_CORE = False class ExampleModel(BoringModel): @@ -19,7 +32,13 @@ def on_validation_epoch_end(self) -> None: class ExampleMCoreModel(ExampleModel): def sharded_state_dict(self): - return {'a': 3} + return { + 'a': ShardedTensor.from_rank_offsets('a', self.layer.weight, replica_id=torch.distributed.get_rank()), + 'const': 3, + } + + def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None: + checkpoint['sharded_state_dict'] = self.sharded_state_dict() class MockDistributedCheckpointIO(DistributedCheckpointIO): @@ -42,17 +61,22 @@ def save_checkpoint(self, *args, **kwargs) -> None: def _get_last_checkpoint_dir(root_dir: Path, model: pl.LightningModule, suffix: str = '') -> Path: steps = len(model.train_dataloader().dataset) * model.trainer.max_epochs // torch.distributed.get_world_size() - return root_dir / 'checkpoints' / f'epoch=1-step={steps}{suffix}' + return root_dir / 'checkpoints' / 
f'epoch={model.trainer.max_epochs - 1}-step={steps}{suffix}' + + +def _get_nlp_strategy_without_optimizer_state(): + strategy = NLPDDPStrategy() + # this ensures optimizer sharded state creation is skipped + strategy.optimizer_sharded_state_dict = types.MethodType( + lambda self, unsharded_optim_state: unsharded_optim_state, strategy + ) + return strategy class TestDistCkptIO: @pytest.mark.run_only_on('GPU') def test_dist_ckpt_io_called_for_mcore_models(self, tmp_path): - strategy = NLPDDPStrategy() - # skip optimizer sharded state creation: - strategy.optimizer_sharded_state_dict = types.MethodType( - lambda self, unsharded_optim_state: unsharded_optim_state, strategy - ) + strategy = _get_nlp_strategy_without_optimizer_state() checkpoint_io = MockDistributedCheckpointIO('xxx') test_trainer = pl.Trainer( @@ -70,7 +94,7 @@ def test_dist_ckpt_io_called_for_mcore_models(self, tmp_path): assert checkpoint_io.save_checkpoint_called_args is not None (state_dict, path), _ = checkpoint_io.save_checkpoint_called_args # Ckpt path doesn't contain the .ckpt suffix - assert path.name == _get_last_checkpoint_dir(tmp_path, model).name, len(test_trainer.strategy.parallel_devices) + assert path.name == _get_last_checkpoint_dir(tmp_path, model).name @pytest.mark.run_only_on('GPU') def test_dist_ckpt_path_not_executed_for_non_core_models(self, tmp_path): @@ -96,3 +120,60 @@ def test_dist_ckpt_path_not_executed_for_non_core_models(self, tmp_path): assert os.path.basename(path) == _get_last_checkpoint_dir(tmp_path, model, suffix='.ckpt').name else: assert checkpoint_io.save_checkpoint_called_args is None + + +class TestAsyncSave: + @pytest.mark.run_only_on('GPU') + def test_async_save_produces_same_checkpoints_as_sync(self, tmp_path): + strategy = _get_nlp_strategy_without_optimizer_state() + sync_checkpoint_io = DistributedCheckpointIO('torch_dist') + async_checkpoint_io = AsyncFinalizableCheckpointIO(DistributedCheckpointIO('torch_dist', async_save=True)) + + model = ExampleMCoreModel() + + # dummy_trainer just to initialize NCCL + dummy_trainer = pl.Trainer( + enable_checkpointing=False, + logger=False, + max_epochs=1, + strategy=_get_nlp_strategy_without_optimizer_state(), + plugins=[sync_checkpoint_io], + ) + dummy_trainer.fit(model) + tmp_path = strategy.broadcast(tmp_path) + + sync_ckpt_dir = tmp_path / 'sync_checkpoints' + async_ckpt_dir = tmp_path / 'async_checkpoints' + + sync_test_trainer = pl.Trainer( + enable_checkpointing=True, + logger=False, + max_epochs=1, + strategy=_get_nlp_strategy_without_optimizer_state(), + plugins=[sync_checkpoint_io], + default_root_dir=sync_ckpt_dir, + ) + sync_test_trainer.fit(model) + + async_test_trainer = pl.Trainer( + enable_checkpointing=True, + logger=False, + max_epochs=1, + strategy=_get_nlp_strategy_without_optimizer_state(), + plugins=[async_checkpoint_io], + callbacks=AsyncFinalizerCallback(), + default_root_dir=async_ckpt_dir, + ) + async_test_trainer.fit(model) + + # Load and compare checkpoints + checkpoint = {'sharded_state_dict': model.sharded_state_dict()} + sync_state_dict = sync_checkpoint_io.load_checkpoint( + _get_last_checkpoint_dir(sync_ckpt_dir, model), sharded_state_dict=checkpoint + ) + async_state_dict = async_checkpoint_io.load_checkpoint( + _get_last_checkpoint_dir(async_ckpt_dir, model), sharded_state_dict=checkpoint + ) + + assert sync_state_dict['sharded_state_dict']['const'] == async_state_dict['sharded_state_dict']['const'] + assert torch.all(sync_state_dict['sharded_state_dict']['a'] == async_state_dict['sharded_state_dict']['a']) 
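The async checkpointing classes added above are normally driven through the checkpoint IO layer (AsyncFinalizableCheckpointIO plus the finalize callback in the ModelCheckpoint changes), but the intended call pattern of the new `AsyncCallsQueue` / `AsyncRequest` API in `nemo/utils/callbacks/torch_dist_async.py` can be sketched directly. The snippet below is a minimal illustration only, not part of the patch; it assumes torch.distributed (NCCL) is already initialized on GPU ranks, and `dummy_save_fn` together with the '/tmp/ckpt' path are placeholders for the real writer function and checkpoint directory::

    from nemo.utils.callbacks.torch_dist_async import AsyncCallsQueue, AsyncRequest

    def dummy_save_fn(path):
        # Placeholder payload executed asynchronously in a forked process;
        # in the patch this role is played by FileSystemWriterAsync's save function.
        print(f'writing checkpoint data to {path}')

    queue = AsyncCallsQueue()
    request = AsyncRequest(async_fn=dummy_save_fn, async_fn_args=('/tmp/ckpt',), finalize_fns=[])
    # Finalize functions run synchronously on all ranks once the async part is done;
    # they must be registered before scheduling, which freezes the request.
    request.add_finalize_fn(lambda: print('checkpoint finalized'))

    call_idx = queue.schedule_async_request(request)  # must be called on all ranks

    # Poll without blocking, e.g. once per training step ...
    finalized_ids = queue.maybe_finalize_async_calls(blocking=False)
    # ... and block on teardown so no save is left unfinalized.
    queue.close()

This scheduling/finalization split is what `_get_finalize_save_checkpoint_callback` relies on: the forked process handles the heavy write, while the finalize functions (unfinished-marker removal, deferred checkpoint deletion) run later on all ranks.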
From 1de4b49d46da12e86716f4c30dac9d01590cb1ae Mon Sep 17 00:00:00 2001 From: mikolajblaz Date: Wed, 15 May 2024 13:57:43 +0200 Subject: [PATCH 15/36] Fix incorrect checkpoint removal logic (#9192) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Fix incorrect if logic Signed-off-by: Mikołaj Błaż * Apply isort and black reformatting Signed-off-by: mikolajblaz --------- Signed-off-by: Mikołaj Błaż Signed-off-by: mikolajblaz --- nemo/collections/nlp/parts/nlp_overrides.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/nemo/collections/nlp/parts/nlp_overrides.py b/nemo/collections/nlp/parts/nlp_overrides.py index 65ffb7df47f4..079732f6b9c5 100644 --- a/nemo/collections/nlp/parts/nlp_overrides.py +++ b/nemo/collections/nlp/parts/nlp_overrides.py @@ -450,8 +450,9 @@ def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: def remove_checkpoint(self, filepath: Union[str, Path]) -> None: # check if filepath is a distributed checkpoint - if self.use_distributed_checkpointing and self.is_global_zero: - self.checkpoint_io.remove_checkpoint(ckpt_to_dir(filepath)) + if self.use_distributed_checkpointing: + if self.is_global_zero: + self.checkpoint_io.remove_checkpoint(ckpt_to_dir(filepath)) # legacy checkpoint logic, does not use megatron core else: From 6cb618a81d9239611da22e9ef23d075498d18336 Mon Sep 17 00:00:00 2001 From: Jan Lasek Date: Wed, 15 May 2024 15:01:42 +0200 Subject: [PATCH 16/36] Update to using Model Optimizer (formerly AMMO) in PTQ workflow (#9178) * Update PTQ to use nvidia-modelopt Signed-off-by: Jan Lasek * Restore PTQ tests Signed-off-by: Jan Lasek * Update docs Signed-off-by: Jan Lasek * Comment on apply_rope_fusion Signed-off-by: Jan Lasek * Support for calibration PP > 1 Signed-off-by: Jan Lasek * Apply isort and black reformatting Signed-off-by: janekl * Fix cicd-main.yml indent Signed-off-by: Jan Lasek * Set data/tensor parallel groups Signed-off-by: Jan Lasek * Install only torch dependecies Signed-off-by: Jan Lasek * Follow up on recent modelopt changes Signed-off-by: Jan Lasek * Model support matrix Signed-off-by: Jan Lasek * Apply isort and black reformatting Signed-off-by: janekl * Rename PTQ script as it should be model-agnostic Signed-off-by: Jan Lasek * Remove unused import Signed-off-by: Jan Lasek * Update setup instructions Signed-off-by: Jan Lasek --------- Signed-off-by: Jan Lasek Signed-off-by: janekl Co-authored-by: janekl --- .github/workflows/cicd-main.yml | 135 +++++++++--------- Dockerfile | 2 - docs/source/nlp/quantization.rst | 48 ++++++- docs/source/starthere/intro.rst | 6 +- ...zation.yaml => megatron_quantization.yaml} | 0 ...antization.py => megatron_quantization.py} | 6 +- ...mmo_spec.py => gpt_layer_modelopt_spec.py} | 11 +- .../language_modeling/megatron_gpt_model.py | 4 +- nemo/export/quantize/quantizer.py | 111 +++++++++----- 9 files changed, 204 insertions(+), 119 deletions(-) rename examples/nlp/language_modeling/conf/{megatron_llama_quantization.yaml => megatron_quantization.yaml} (100%) rename examples/nlp/language_modeling/{megatron_llama_quantization.py => megatron_quantization.py} (92%) rename nemo/collections/nlp/models/language_modeling/megatron/{gpt_layer_ammo_spec.py => gpt_layer_modelopt_spec.py} (91%) diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index 4652e4d19f89..291eeaed7f89 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -132,8 +132,8 @@ jobs: apt-get update && apt-get 
install libsox-fmt-all -y && \ popd - # AMMO installation - pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir + # ModelOpt installation + pip install nvidia-modelopt[torch]~=0.11.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir # PyTorch Lightning version python -c "import pytorch_lightning; print(pytorch_lightning.__version__)" @@ -394,7 +394,7 @@ jobs: - name: Checkout repository uses: actions/checkout@v4 - run: | - python examples/nlp/language_modeling/megatron_llama_quantization.py \ + python examples/nlp/language_modeling/megatron_quantization.py \ model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ quantization.algorithm=null \ model_save=/home/TestData/nlp/megatron_llama/ci_baseline @@ -403,69 +403,70 @@ jobs: - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" - # L2_PTQ_Llama2_FP8: - # needs: [cicd-test-container-setup] - # runs-on: self-hosted-azure - # timeout-minutes: 10 - # container: - # image: nemoci.azurecr.io/nemo_container_${{ github.run_id }} - # options: - # # --user 0:128 - # --device=/dev/nvidia0 - # --gpus all - # --shm-size=8g - # --env TRANSFORMERS_OFFLINE=0 - # --env HYDRA_FULL_ERROR=1 - # --volume /mnt/datadrive/TestData:/home/TestData - # steps: - # - name: Checkout repository - # uses: actions/checkout@v4 - # - run: | - # python examples/nlp/language_modeling/megatron_llama_quantization.py \ - # model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ - # tensor_model_parallel_size=2 \ - # trainer.devices=2 \ - # quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \ - # quantization.algorithm=fp8 \ - # quantization.num_calib_size=8 \ - # inference.batch_size=2 \ - # export.inference_tensor_parallel=2 \ - # model_save=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo - - # rm -rf /home/TestData/nlp/megatron_llama/ci_fp8.qnemo - # - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" - # if: "failure()" - - # L2_PTQ_Llama2_INT8_SQ: - # needs: [cicd-test-container-setup] - # runs-on: self-hosted-azure - # timeout-minutes: 10 - # container: - # image: nemoci.azurecr.io/nemo_container_${{ github.run_id }} - # options: - # # --user 0:128 - # --device=/dev/nvidia0 - # --gpus all - # --shm-size=8g - # --env TRANSFORMERS_OFFLINE=0 - # --env HYDRA_FULL_ERROR=1 - # --volume /mnt/datadrive/TestData:/home/TestData - # steps: - # - name: Checkout repository - # uses: actions/checkout@v4 - # - run: | - # python examples/nlp/language_modeling/megatron_llama_quantization.py \ - # model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ - # quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \ - # quantization.algorithm=int8_sq \ - # quantization.num_calib_size=8 \ - # inference.batch_size=2 \ - # model_save=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo - - # rm -rf /home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo - # - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" - # if: "failure()" - + L2_PTQ_Llama2_FP8: + needs: [cicd-test-container-setup] + runs-on: self-hosted-azure + timeout-minutes: 10 + container: + image: nemoci.azurecr.io/nemo_container_${{ github.run_id }} + options: + # --user 0:128 + --device=/dev/nvidia0 + --gpus all + --shm-size=8g + --env TRANSFORMERS_OFFLINE=0 + --env HYDRA_FULL_ERROR=1 + --volume /mnt/datadrive/TestData:/home/TestData + steps: + - name: Checkout repository + uses: actions/checkout@v4 + - run: | + python examples/nlp/language_modeling/megatron_quantization.py \ + 
model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ + tensor_model_parallel_size=2 \ + trainer.devices=2 \ + quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \ + quantization.algorithm=fp8 \ + quantization.num_calib_size=8 \ + inference.batch_size=2 \ + export.inference_tensor_parallel=2 \ + model_save=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo + + rm -rf /home/TestData/nlp/megatron_llama/ci_fp8.qnemo + - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" + if: "failure()" + + L2_PTQ_Llama2_INT8_SQ: + needs: [cicd-test-container-setup] + runs-on: self-hosted-azure + timeout-minutes: 10 + container: + image: nemoci.azurecr.io/nemo_container_${{ github.run_id }} + options: + # --user 0:128 + --device=/dev/nvidia0 + --gpus all + --shm-size=8g + --env TRANSFORMERS_OFFLINE=0 + --env HYDRA_FULL_ERROR=1 + --volume /mnt/datadrive/TestData:/home/TestData + steps: + - name: Checkout repository + uses: actions/checkout@v4 + - run: | + python examples/nlp/language_modeling/megatron_quantization.py \ + model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ + quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \ + quantization.algorithm=int8_sq \ + quantization.num_calib_size=8 \ + inference.batch_size=2 \ + model_save=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo + + rm -rf /home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo + - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" + if: "failure()" + + # TODO: investigate int4_awq stuck issues and restore the test #L2_PTQ_Llama2_INT4_AWQ: # needs: [cicd-test-container-setup] # runs-on: self-hosted-azure @@ -484,7 +485,7 @@ jobs: # - name: Checkout repository # uses: actions/checkout@v4 # - run: | - # python examples/nlp/language_modeling/megatron_llama_quantization.py \ + # python examples/nlp/language_modeling/megatron_quantization.py \ # model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ # tensor_model_parallel_size=1 \ # trainer.devices=1 \ diff --git a/Dockerfile b/Dockerfile index 396645d37019..c27048784244 100644 --- a/Dockerfile +++ b/Dockerfile @@ -133,8 +133,6 @@ RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-chec RUN pip install flash-attn # install numba for latest containers RUN pip install numba>=0.57.1 -# install ammo -RUN pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir # copy nemo source into a scratch image FROM scratch as nemo-src diff --git a/docs/source/nlp/quantization.rst b/docs/source/nlp/quantization.rst index afe2e9eccbca..cc40b6a972a2 100644 --- a/docs/source/nlp/quantization.rst +++ b/docs/source/nlp/quantization.rst @@ -10,7 +10,7 @@ PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 -- Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput. -In NeMo, quantization is enabled by the Nvidia AMMO library -- a unified algorithmic model optimization & deployment toolkit. +In NeMo, quantization is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) `_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs. The quantization process consists of the following steps: @@ -18,10 +18,52 @@ The quantization process consists of the following steps: 2. Calibrating the model to obtain appropriate algorithm-specific scaling factors 3. 
Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml). -Loading models requires using an AMMO spec defined in `megatron.core.inference.gpt.model_specs.py `_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in NeMo project in ``nemo.deploy`` and ``nemo.export`` modules. +Loading models requires using an ModelOpt spec defined in `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec `_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in NeMo project in ``nemo.deploy`` and ``nemo.export`` modules. Quantization algorithm can also be conveniently set to ``"null"`` to perform only the weights export step using default precision for TensorRT-LLM deployment. This is useful to obtain baseline performance and accuracy results for comparison. +Support Matrix +^^^^^^^^^^^^^^ + +Table below presents verified model support matrix for popular LLM architectures. Each model entry also optionally provides a download link to a corresponding Nemo checkpoint for testing purposes. Support for other model families is experimental. + +.. list-table:: Model Support Matrix + :widths: 15 15 15 15 + :header-rows: 1 + + * - **Model Family** + - **FP8** + - **INT8_SQ** + - **INT4_AWQ** + * - Llama (1, 2, 3) + - ✅ + - ✅ + - ✅ + * - Mistral + - ✅ + - ✅ + - ✅ + * - `GPT-3 `_ + - ✅ + - ✅ + - ✅ + * - `Nemotron-3 8b `_ + - ✅ + - ✅ + - ✅ + * - Nemotron-4 15b + - ✅ + - ✅ + - ✅ + * - StarCoder 2 + - ✅ + - ✅ + - ✅ + * - Gemma + - ✅ + - ✅ + - ✅ + Example ^^^^^^^ @@ -31,7 +73,7 @@ The script must be launched correctly with the number of processes equal to tens .. code-block:: bash - torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_llama_quantization.py \ + torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_quantization.py \ model_file=llama2-70b-base-bf16.nemo \ tensor_model_parallel_size=8 \ pipeline_model_parallel_size=1 \ diff --git a/docs/source/starthere/intro.rst b/docs/source/starthere/intro.rst index 63fdcfb0406e..ebbe1551c39e 100644 --- a/docs/source/starthere/intro.rst +++ b/docs/source/starthere/intro.rst @@ -96,13 +96,13 @@ This section details the steps to clone and install the Megatron Core. git checkout a5415fcfacef2a37416259bd38b7c4b673583675 && \ pip install . -AMMO Installation +Model Optimizer Installation -This final step involves installing the AMMO package. +This final step involves installing the Model Optimizer package. .. code-block:: bash - pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir + pip install nvidia-modelopt[torch]~=0.11.0 --extra-index-url https://pypi.nvidia.com .. 
code-block:: bash diff --git a/examples/nlp/language_modeling/conf/megatron_llama_quantization.yaml b/examples/nlp/language_modeling/conf/megatron_quantization.yaml similarity index 100% rename from examples/nlp/language_modeling/conf/megatron_llama_quantization.yaml rename to examples/nlp/language_modeling/conf/megatron_quantization.yaml diff --git a/examples/nlp/language_modeling/megatron_llama_quantization.py b/examples/nlp/language_modeling/megatron_quantization.py similarity index 92% rename from examples/nlp/language_modeling/megatron_llama_quantization.py rename to examples/nlp/language_modeling/megatron_quantization.py index 92ead6b4ed69..d4d6a8b6b917 100644 --- a/examples/nlp/language_modeling/megatron_llama_quantization.py +++ b/examples/nlp/language_modeling/megatron_quantization.py @@ -25,12 +25,12 @@ Nemo quantization example script. Please consult nemo.export.quantize.Quantizer class -and examples/nlp/language_modeling/conf/megatron_llama_quantization.yaml config on available quantization methods, +and examples/nlp/language_modeling/conf/megatron_quantization.yaml config on available quantization methods, models supported as well as how to set up data and inference for calibration (with defaults recommended). Example usage: ``` -python examples/nlp/language_modeling/megatron_llama_quantization.py \ +python examples/nlp/language_modeling/megatron_quantization.py \ model_file=llama2-7b-fp16.nemo \ model_save=llama2-7b-fp8.qnemo \ quantization.algorithm=fp8 \ @@ -59,7 +59,7 @@ def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, ma yield batch -@hydra_runner(config_path="conf", config_name="megatron_llama_quantization") +@hydra_runner(config_path="conf", config_name="megatron_quantization") def main(cfg) -> None: if not torch.cuda.is_available(): raise EnvironmentError("GPU is required for the inference.") diff --git a/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_ammo_spec.py b/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py similarity index 91% rename from nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_ammo_spec.py rename to nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py index e51ecaba463a..f9ba58736cbd 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_ammo_spec.py +++ b/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py @@ -36,8 +36,9 @@ HAVE_MEGATRON_CORE = False IMPORT_ERROR = e -# Use this spec for AMMO PTQ and TensorRT-LLM export -def get_gpt_layer_ammo_spec() -> ModuleSpec: + +# Use this spec for Model Optimizer PTQ and TensorRT-LLM export +def get_gpt_layer_modelopt_spec() -> ModuleSpec: """Mix the native spec with TENorm. 
This is essentially the native local spec except for the layernorm implementation @@ -65,7 +66,11 @@ def get_gpt_layer_ammo_spec() -> ModuleSpec: self_attn_bda=get_bias_dropout_add, pre_mlp_layernorm=TENorm, mlp=ModuleSpec( - module=MLP, submodules=MLPSubmodules(linear_fc1=ColumnParallelLinear, linear_fc2=RowParallelLinear,), + module=MLP, + submodules=MLPSubmodules( + linear_fc1=ColumnParallelLinear, + linear_fc2=RowParallelLinear, + ), ), mlp_bda=get_bias_dropout_add, # Map TE-layernorm-fusion keys back diff --git a/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py b/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py index 536fc5bff7c8..3660a5145b10 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py @@ -41,7 +41,7 @@ from nemo.collections.nlp.models.language_modeling.megatron.gpt_full_te_layer_autocast_spec import ( get_gpt_full_te_layer_autocast_spec, ) -from nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_ammo_spec import get_gpt_layer_ammo_spec +from nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec import get_gpt_layer_modelopt_spec from nemo.collections.nlp.models.language_modeling.megatron.gpt_model import GPTModel from nemo.collections.nlp.models.language_modeling.megatron_base_model import MegatronBaseModel from nemo.collections.nlp.modules.common.megatron.build_model import build_model @@ -154,7 +154,7 @@ def get_specs(spec_name, num_experts=None, moe_grouped_gemm=False, use_te=True): "te_gpt": get_gpt_layer_with_transformer_engine_spec(num_experts, moe_grouped_gemm), "megatron_falcon_gpt": get_falcon_layer_spec(), "megatron_gpt_full_te_layer_autocast": get_gpt_full_te_layer_autocast_spec(), - "ammo": get_gpt_layer_ammo_spec(), + "modelopt": get_gpt_layer_modelopt_spec(), } if spec_name not in name_spec_dict: raise ValueError(f"Spec name '{spec_name}' is not recognized.") diff --git a/nemo/export/quantize/quantizer.py b/nemo/export/quantize/quantizer.py index 783f47a08e79..4748f4957a52 100644 --- a/nemo/export/quantize/quantizer.py +++ b/nemo/export/quantize/quantizer.py @@ -18,11 +18,12 @@ import torch import torch.distributed as dist -from megatron.core import parallel_state +from megatron.core import mpu, parallel_state from megatron.core.transformer.module import Float16Module from omegaconf import OmegaConf from omegaconf.omegaconf import DictConfig, open_dict from pytorch_lightning.trainer.trainer import Trainer +from tqdm import tqdm from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector @@ -32,18 +33,18 @@ from nemo.utils.model_utils import load_config, save_artifacts, unwrap_model try: - import ammo.torch.quantization as atq - from ammo.torch.export import export_tensorrt_llm_checkpoint + import modelopt.torch.quantization as mtq + from modelopt.torch.export import export_tensorrt_llm_checkpoint + from modelopt.torch.utils.distributed import set_data_parallel_group, set_tensor_parallel_group - HAVE_AMMO = True + HAVE_MODELOPT = True except (ImportError, ModuleNotFoundError) as e: - HAVE_AMMO = False - HAVE_AMMO_ERROR = e + HAVE_MODELOPT = False + HAVE_MODELOPT_ERROR = e class Quantizer: - """ Post-training quantization of Nemo checkpoints. @@ -63,9 +64,9 @@ class Quantizer: model families is experimental and might not be fully supported. 
Available quantization methods are listed in QUANT_CFG_CHOICES dictionary below. - Please consult AMMO documentation for details. You can also inspect different choices in - examples/nlp/language_modeling/conf/megatron_llama_quantization.yaml for quantization algorithms and - calibration data as well as recommended settings. + Please consult Model Optimizer documentation https://nvidia.github.io/TensorRT-Model-Optimizer/ for details. + You can also inspect different choices in examples/nlp/language_modeling/conf/megatron_quantization.yaml + for quantization algorithms and calibration data as well as recommended settings. Quantization algorithm can also be conveniently set to 'null' to perform only weights export step for TensorRT-LLM deployment. This is useful to getting baseline results for a full-precision model. @@ -78,14 +79,14 @@ def __init__( export_config: DictConfig, trainer_config: DictConfig, ): - if not HAVE_AMMO: - raise RuntimeError("nvidia-ammo is needed to use Quantizer") from HAVE_AMMO_ERROR + if not HAVE_MODELOPT: + raise RuntimeError("nvidia-modelopt is needed to use Quantizer") from HAVE_MODELOPT_ERROR QUANT_CFG_CHOICES = { - "int8": atq.INT8_DEFAULT_CFG, - "int8_sq": atq.INT8_SMOOTHQUANT_CFG, - "fp8": atq.FP8_DEFAULT_CFG, - "int4_awq": atq.INT4_AWQ_CFG, - "w4a8_awq": atq.W4A8_AWQ_BETA_CFG, + "int8": mtq.INT8_DEFAULT_CFG, + "int8_sq": mtq.INT8_SMOOTHQUANT_CFG, + "fp8": mtq.FP8_DEFAULT_CFG, + "int4_awq": mtq.INT4_AWQ_CFG, + "w4a8_awq": mtq.W4A8_AWQ_BETA_CFG, } SUPPORTED_DTYPE = [16, "16", "bf16"] # Default precision for non-quantized layers assert export_config.dtype in SUPPORTED_DTYPE @@ -95,25 +96,30 @@ def __init__( self.export_config = export_config self.trainer_config = trainer_config if quantization_config.algorithm is not None: - atq_config = QUANT_CFG_CHOICES[quantization_config.algorithm] + quant_cfg = QUANT_CFG_CHOICES[quantization_config.algorithm] if "awq" in quantization_config.algorithm: - weight_quantizer = atq_config["quant_cfg"]["*weight_quantizer"] + weight_quantizer = quant_cfg["quant_cfg"]["*weight_quantizer"] if isinstance(weight_quantizer, list): weight_quantizer = weight_quantizer[0] weight_quantizer["block_sizes"][-1] = quantization_config.awq_block_size # Always turn on FP8 kv cache to save memory footprint. # For int8_sq, we use int8 kv cache. - atq_config["quant_cfg"]["*output_quantizer"] = { + # TODO: Investigate why enabling FP8 kv cache will cause accuracy regressions for Nemotron. 
+ enable_quant_kv_cache = ( + "int8" not in quantization_config.algorithm and export_config.decoder_type != "gptnext" + ) + print(f'{"Enable" if enable_quant_kv_cache else "Disable"} KV cache quantization') + quant_cfg["quant_cfg"]["*output_quantizer"] = { "num_bits": 8 if quantization_config.algorithm == "int8_sq" else (4, 3), "axis": None, - "enable": export_config.decoder_type != "gptnext", + "enable": enable_quant_kv_cache, } - self.atq_config = atq_config + self.quant_cfg = quant_cfg else: - self.atq_config = None + self.quant_cfg = None def _load_model( self, @@ -121,14 +127,17 @@ def _load_model( tensor_model_parallel_size: Optional[int] = None, pipeline_model_parallel_size: Optional[int] = None, ): - """Load model using AMMO layer spec for quantization.""" + """Load model using ModelOpt layer spec for quantization.""" model_cfg = self._load_and_modify_config(model_file, tensor_model_parallel_size, pipeline_model_parallel_size) trainer = Trainer(strategy=NLPDDPStrategy(), **self.trainer_config) connector = NLPSaveRestoreConnector() model = MegatronGPTModel.restore_from( - restore_path=model_file, trainer=trainer, override_config_path=model_cfg, save_restore_connector=connector, + restore_path=model_file, + trainer=trainer, + override_config_path=model_cfg, + save_restore_connector=connector, ) model.freeze() @@ -144,7 +153,8 @@ def _load_model( return model - def _check_ddp_initialized(self, model): + @staticmethod + def _check_ddp_initialized(model): if not parallel_state.is_initialized(): def dummy(): @@ -154,8 +164,11 @@ def dummy(): model.trainer.strategy.launcher.launch(dummy, trainer=model.trainer) model.trainer.strategy.setup_environment() + set_data_parallel_group(mpu.get_data_parallel_group()) + set_tensor_parallel_group(mpu.get_tensor_model_parallel_group()) + + @staticmethod def _load_and_modify_config( - self, model_file: str, tensor_model_parallel_size: Optional[int] = None, pipeline_model_parallel_size: Optional[int] = None, @@ -170,12 +183,35 @@ def _load_and_modify_config( model_cfg.tensor_model_parallel_size = tensor_model_parallel_size if pipeline_model_parallel_size is not None: model_cfg.pipeline_model_parallel_size = pipeline_model_parallel_size - # Only custom AMMO spec is supported for PTQ: this custom spec is largely based on local Megatron-LM + # Only custom ModelOpt spec is supported for PTQ: this custom spec is largely based on local Megatron-LM # layer definitions to avoid Transformer Engine implementations that are currently not supported. - model_cfg.name = "ammo" + # This layer spec also requires RoPE fusion to be disabled for tensor view operations in attention + # layer implementation from megatron/core/transformer/dot_product_attention.py to be functional. 
+ model_cfg.name = "modelopt" + model_cfg.apply_rope_fusion = False return model_cfg + @staticmethod + def _sample_output(model): + """Generate sample output for a model instance.""" + if torch.distributed.get_rank() == 0: + print("Generating sample output for a model...") + + response = model.generate( + inputs=[ + "Born in north-east France, Soyer trained as a", + "Born in California, Soyer trained as a", + ], + length_params={ + "max_length": 100, + "min_length": 100, + }, + ) + + if torch.distributed.get_rank() == 0: + print(f'Example NeMo output after PTQ: {response["sentences"]}"') + def quantize( self, model_file: str, @@ -191,13 +227,12 @@ def quantize( model.set_inference_config(OmegaConf.to_container(self.inference_config)) - def forward_loop(): - for i, batch in enumerate(dataloader): - if dist.get_rank() == 0: - print(f"Calibrating batch {i}") + def forward_loop(model): + print("Calibrating the model...") + for i, batch in enumerate(tqdm(dataloader)): model.predict_step(batch, i) - model = atq.quantize(model, self.atq_config, forward_loop) + model = mtq.quantize(model, self.quant_cfg, forward_loop) if self.export_config == "gptnext": # We found squared_relu may have an under-calibration problem. @@ -207,12 +242,12 @@ def forward_loop(): maxbound = 448 elif self.quantization_config.quantization.algorithm == "int8_sq": maxbound = 127 - model = atq.postprocess_amax( + model = mtq.postprocess_amax( model, "*input_quantizer", lambda amax: torch.clamp(amax, min=0.01 * maxbound) ) if dist.get_rank() == 0: - atq.print_quant_summary(model) + mtq.print_quant_summary(model) return model @@ -220,6 +255,8 @@ def export(self, model, model_save: str): """Export model to '.qnemo' format for TensorRT-LLM engine build.""" torch_dtype = torch_dtype_from_precision(self.export_config.dtype) + self._sample_output(model) + if model.cfg.megatron_amp_O2: model.model = unwrap_model(model.model, Float16Module) @@ -239,6 +276,8 @@ def export(self, model, model_save: str): export_dir=export_dir, inference_tensor_parallel=self.export_config.inference_tensor_parallel, inference_pipeline_parallel=self.export_config.inference_pipeline_parallel, + use_nfs_workspace=self.export_config.inference_pipeline_parallel == 1 + and model.cfg.pipeline_model_parallel_size > 1, ) dist.barrier() # Wait until all ranks complete export_model_config step if dist.get_rank() == 0: From 061cc452cf6c6b8687093799b9d048e55aad5fd8 Mon Sep 17 00:00:00 2001 From: Alessandro Morari Date: Wed, 15 May 2024 16:43:25 -0400 Subject: [PATCH 17/36] GPU-based vectorized Specaug Version 2 (#9155) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * GPU-based vectorized SpecAug Signed-off-by: Piotr Żelasko * Wider dtypes for specaug mask bounds computation Signed-off-by: Piotr Żelasko * fast spec augmentation v2 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed randint code, added comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed padding coverage bug, fixed long casting bug, fixed comments Signed-off-by: Alessandro Morari * fixed bug due to using freq_axis with length Signed-off-by: Alessandro Morari * Added tests for vectorized spectrogram augmentation Signed-off-by: Alessandro Morari * Apply isort and black reformatting Signed-off-by: pzelasko --------- Signed-off-by: Piotr Żelasko Signed-off-by: Alessandro Morari Signed-off-by: pzelasko Co-authored-by: Piotr Żelasko 
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: pzelasko --- .../asr/modules/audio_preprocessing.py | 239 +++++++++--------- .../asr/parts/submodules/spectr_augment.py | 118 ++++++++- tests/collections/asr/test_asr_modules.py | 35 ++- 3 files changed, 261 insertions(+), 131 deletions(-) diff --git a/nemo/collections/asr/modules/audio_preprocessing.py b/nemo/collections/asr/modules/audio_preprocessing.py index 643bc4a69d69..d45c0acf314f 100644 --- a/nemo/collections/asr/modules/audio_preprocessing.py +++ b/nemo/collections/asr/modules/audio_preprocessing.py @@ -66,8 +66,8 @@ class AudioPreprocessor(NeuralModule, ABC): """ - An interface for Neural Modules that performs audio pre-processing, - transforming the wav files to features. + An interface for Neural Modules that performs audio pre-processing, + transforming the wav files to features. """ def __init__(self, win_length, hop_length): @@ -101,72 +101,72 @@ def get_features(self, input_signal, length): class AudioToMelSpectrogramPreprocessor(AudioPreprocessor, Exportable): """Featurizer module that converts wavs to mel spectrograms. - Args: - sample_rate (int): Sample rate of the input audio data. - Defaults to 16000 - window_size (float): Size of window for fft in seconds - Defaults to 0.02 - window_stride (float): Stride of window for fft in seconds - Defaults to 0.01 - n_window_size (int): Size of window for fft in samples - Defaults to None. Use one of window_size or n_window_size. - n_window_stride (int): Stride of window for fft in samples - Defaults to None. Use one of window_stride or n_window_stride. - window (str): Windowing function for fft. can be one of ['hann', - 'hamming', 'blackman', 'bartlett'] - Defaults to "hann" - normalize (str): Can be one of ['per_feature', 'all_features']; all - other options disable feature normalization. 'all_features' - normalizes the entire spectrogram to be mean 0 with std 1. - 'pre_features' normalizes per channel / freq instead. - Defaults to "per_feature" - n_fft (int): Length of FT window. If None, it uses the smallest power - of 2 that is larger than n_window_size. - Defaults to None - preemph (float): Amount of pre emphasis to add to audio. Can be - disabled by passing None. - Defaults to 0.97 - features (int): Number of mel spectrogram freq bins to output. - Defaults to 64 - lowfreq (int): Lower bound on mel basis in Hz. - Defaults to 0 - highfreq (int): Lower bound on mel basis in Hz. - Defaults to None - log (bool): Log features. - Defaults to True - log_zero_guard_type(str): Need to avoid taking the log of zero. There - are two options: "add" or "clamp". - Defaults to "add". - log_zero_guard_value(float, or str): Add or clamp requires the number - to add with or clamp to. log_zero_guard_value can either be a float - or "tiny" or "eps". torch.finfo is used if "tiny" or "eps" is - passed. - Defaults to 2**-24. - dither (float): Amount of white-noise dithering. - Defaults to 1e-5 - pad_to (int): Ensures that the output size of the time dimension is - a multiple of pad_to. - Defaults to 16 - frame_splicing (int): Defaults to 1 - exact_pad (bool): If True, sets stft center to False and adds padding, such that num_frames = audio_length - // hop_length. Defaults to False. - pad_value (float): The value that shorter mels are padded with. - Defaults to 0 - mag_power (float): The power that the linear spectrogram is raised to - prior to multiplication with mel basis. 
- Defaults to 2 for a power spec - rng : Random number generator - nb_augmentation_prob (float) : Probability with which narrowband augmentation would be applied to - samples in the batch. - Defaults to 0.0 - nb_max_freq (int) : Frequency above which all frequencies will be masked for narrowband augmentation. - Defaults to 4000 - use_torchaudio: Whether to use the `torchaudio` implementation. - mel_norm: Normalization used for mel filterbank weights. - Defaults to 'slaney' (area normalization) - stft_exact_pad: Deprecated argument, kept for compatibility with older checkpoints. - stft_conv: Deprecated argument, kept for compatibility with older checkpoints. - """ + Args: + sample_rate (int): Sample rate of the input audio data. + Defaults to 16000 + window_size (float): Size of window for fft in seconds + Defaults to 0.02 + window_stride (float): Stride of window for fft in seconds + Defaults to 0.01 + n_window_size (int): Size of window for fft in samples + Defaults to None. Use one of window_size or n_window_size. + n_window_stride (int): Stride of window for fft in samples + Defaults to None. Use one of window_stride or n_window_stride. + window (str): Windowing function for fft. can be one of ['hann', + 'hamming', 'blackman', 'bartlett'] + Defaults to "hann" + normalize (str): Can be one of ['per_feature', 'all_features']; all + other options disable feature normalization. 'all_features' + normalizes the entire spectrogram to be mean 0 with std 1. + 'pre_features' normalizes per channel / freq instead. + Defaults to "per_feature" + n_fft (int): Length of FT window. If None, it uses the smallest power + of 2 that is larger than n_window_size. + Defaults to None + preemph (float): Amount of pre emphasis to add to audio. Can be + disabled by passing None. + Defaults to 0.97 + features (int): Number of mel spectrogram freq bins to output. + Defaults to 64 + lowfreq (int): Lower bound on mel basis in Hz. + Defaults to 0 + highfreq (int): Lower bound on mel basis in Hz. + Defaults to None + log (bool): Log features. + Defaults to True + log_zero_guard_type(str): Need to avoid taking the log of zero. There + are two options: "add" or "clamp". + Defaults to "add". + log_zero_guard_value(float, or str): Add or clamp requires the number + to add with or clamp to. log_zero_guard_value can either be a float + or "tiny" or "eps". torch.finfo is used if "tiny" or "eps" is + passed. + Defaults to 2**-24. + dither (float): Amount of white-noise dithering. + Defaults to 1e-5 + pad_to (int): Ensures that the output size of the time dimension is + a multiple of pad_to. + Defaults to 16 + frame_splicing (int): Defaults to 1 + exact_pad (bool): If True, sets stft center to False and adds padding, such that num_frames = audio_length + // hop_length. Defaults to False. + pad_value (float): The value that shorter mels are padded with. + Defaults to 0 + mag_power (float): The power that the linear spectrogram is raised to + prior to multiplication with mel basis. + Defaults to 2 for a power spec + rng : Random number generator + nb_augmentation_prob (float) : Probability with which narrowband augmentation would be applied to + samples in the batch. + Defaults to 0.0 + nb_max_freq (int) : Frequency above which all frequencies will be masked for narrowband augmentation. + Defaults to 4000 + use_torchaudio: Whether to use the `torchaudio` implementation. + mel_norm: Normalization used for mel filterbank weights. 
+ Defaults to 'slaney' (area normalization) + stft_exact_pad: Deprecated argument, kept for compatibility with older checkpoints. + stft_conv: Deprecated argument, kept for compatibility with older checkpoints. + """ def save_to(self, save_path: str): pass @@ -177,8 +177,7 @@ def restore_from(cls, restore_path: str): @property def input_types(self): - """Returns definitions of module input ports. - """ + """Returns definitions of module input ports.""" return { "input_signal": NeuralType(('B', 'T'), AudioSignal(freq=self._sample_rate)), "length": NeuralType( @@ -218,7 +217,7 @@ def __init__( highfreq=None, log=True, log_zero_guard_type="add", - log_zero_guard_value=2 ** -24, + log_zero_guard_value=2**-24, dither=1e-5, pad_to=16, frame_splicing=1, @@ -335,8 +334,7 @@ class AudioToMFCCPreprocessor(AudioPreprocessor): @property def input_types(self): - """Returns definitions of module input ports. - """ + """Returns definitions of module input ports.""" return { "input_signal": NeuralType(('B', 'T'), AudioSignal(freq=self._sample_rate)), "length": NeuralType(tuple('B'), LengthsType()), @@ -344,8 +342,7 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "processed_signal": NeuralType(('B', 'D', 'T'), MFCCSpectrogramType()), "processed_length": NeuralType(tuple('B'), LengthsType()), @@ -463,12 +460,14 @@ class SpectrogramAugmentation(NeuralModule): rect_time (int): maximum size of cut rectangles along the time dimension Defaults to 25. + use_numba_spec_augment: use numba code for Spectrogram augmentation + use_vectorized_spec_augment: use vectorized code for Spectrogram augmentation + """ @property def input_types(self): - """Returns definitions of module input types - """ + """Returns definitions of module input types""" return { "input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType()), "length": NeuralType(tuple('B'), LengthsType()), @@ -476,8 +475,7 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output types - """ + """Returns definitions of module output types""" return {"augmented_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())} def __init__( @@ -491,12 +489,18 @@ def __init__( rect_freq=20, rng=None, mask_value=0.0, - use_numba_spec_augment: bool = True, + use_vectorized_spec_augment: bool = True, + use_numba_spec_augment: bool = False, ): super().__init__() if rect_masks > 0: - self.spec_cutout = SpecCutout(rect_masks=rect_masks, rect_time=rect_time, rect_freq=rect_freq, rng=rng,) + self.spec_cutout = SpecCutout( + rect_masks=rect_masks, + rect_time=rect_time, + rect_freq=rect_freq, + rng=rng, + ) # self.spec_cutout.to(self._device) else: self.spec_cutout = lambda input_spec: input_spec @@ -508,6 +512,7 @@ def __init__( time_width=time_width, rng=rng, mask_value=mask_value, + use_vectorized_code=use_vectorized_spec_augment, ) else: self.spec_augment = lambda input_spec, length: input_spec @@ -541,26 +546,25 @@ def forward(self, input_spec, length): class MaskedPatchAugmentation(NeuralModule): """ - Zeroes out fixed size time patches of the spectrogram. - All samples in batch are guaranteed to have the same amount of masked time steps. - Optionally also performs frequency masking in the same way as SpecAugment. - Args: - patch_size (int): up to how many time steps does one patch consist of. - Defaults to 48. - mask_patches (float): how many patches should be masked in each sample. 
- if >= 1., interpreted as number of patches (after converting to int) - if <1., interpreted as fraction of total tokens to be masked (number of patches is rounded up) - Defaults to 10. - freq_masks (int): how many frequency segments should be cut. - Defaults to 0. - freq_width (int): maximum number of frequencies to be cut in a segment. - Defaults to 0. + Zeroes out fixed size time patches of the spectrogram. + All samples in batch are guaranteed to have the same amount of masked time steps. + Optionally also performs frequency masking in the same way as SpecAugment. + Args: + patch_size (int): up to how many time steps does one patch consist of. + Defaults to 48. + mask_patches (float): how many patches should be masked in each sample. + if >= 1., interpreted as number of patches (after converting to int) + if <1., interpreted as fraction of total tokens to be masked (number of patches is rounded up) + Defaults to 10. + freq_masks (int): how many frequency segments should be cut. + Defaults to 0. + freq_width (int): maximum number of frequencies to be cut in a segment. + Defaults to 0. """ @property def input_types(self): - """Returns definitions of module input types - """ + """Returns definitions of module input types""" return { "input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType()), "length": NeuralType(tuple('B'), LengthsType()), @@ -568,12 +572,15 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output types - """ + """Returns definitions of module output types""" return {"augmented_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())} def __init__( - self, patch_size: int = 48, mask_patches: float = 10.0, freq_masks: int = 0, freq_width: int = 0, + self, + patch_size: int = 48, + mask_patches: float = 10.0, + freq_masks: int = 0, + freq_width: int = 0, ): super().__init__() self.patch_size = patch_size @@ -586,7 +593,12 @@ def __init__( raise ValueError('mask_patches cannot be negative') if freq_masks > 0: - self.spec_augment = SpecAugment(freq_masks=freq_masks, time_masks=0, freq_width=freq_width, time_width=0,) + self.spec_augment = SpecAugment( + freq_masks=freq_masks, + time_masks=0, + freq_width=freq_width, + time_width=0, + ) else: self.spec_augment = None @@ -676,8 +688,7 @@ def forward(self, input_signal, length): @property def input_types(self): - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "input_signal": NeuralType(('B', 'D', 'T'), SpectrogramType()), "length": NeuralType(tuple('B'), LengthsType()), @@ -685,8 +696,7 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "processed_signal": NeuralType(('B', 'D', 'T'), SpectrogramType()), "processed_length": NeuralType(tuple('B'), LengthsType()), @@ -754,8 +764,7 @@ def num_subbands(self) -> int: @property def input_types(self) -> Dict[str, NeuralType]: - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "input": NeuralType(('B', 'C', 'T'), AudioSignal()), "input_length": NeuralType(('B',), LengthsType(), optional=True), @@ -763,8 +772,7 @@ def input_types(self) -> Dict[str, NeuralType]: @property def output_types(self) -> Dict[str, NeuralType]: - """Returns definitions of module output ports. 
- """ + """Returns definitions of module output ports.""" return { "output": NeuralType(('B', 'C', 'D', 'T'), SpectrogramType()), "output_length": NeuralType(('B',), LengthsType()), @@ -835,7 +843,7 @@ class SpectrogramToAudio(NeuralModule): fft_length: length of FFT hop_length: length of hops/shifts of the sliding window magnitude_power: Transform magnitude of the spectrogram as x^(1/magnitude_power). - scale: Spectrogram will be scaled with 1/scale before the inverse transform. + scale: Spectrogram will be scaled with 1/scale before the inverse transform. """ def __init__(self, fft_length: int, hop_length: int, magnitude_power: float = 1.0, scale: float = 1.0): @@ -878,8 +886,7 @@ def num_subbands(self) -> int: @property def input_types(self) -> Dict[str, NeuralType]: - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "input": NeuralType(('B', 'C', 'D', 'T'), SpectrogramType()), "input_length": NeuralType(('B',), LengthsType(), optional=True), @@ -887,8 +894,7 @@ def input_types(self) -> Dict[str, NeuralType]: @property def output_types(self) -> Dict[str, NeuralType]: - """Returns definitions of module output ports. - """ + """Returns definitions of module output ports.""" return { "output": NeuralType(('B', 'C', 'T'), AudioSignal()), "output_length": NeuralType(('B',), LengthsType()), @@ -968,7 +974,7 @@ class AudioToMelSpectrogramPreprocessorConfig: highfreq: Optional[int] = None log: bool = True log_zero_guard_type: str = "add" - log_zero_guard_value: float = 2 ** -24 + log_zero_guard_value: float = 2**-24 dither: float = 1e-5 pad_to: int = 16 frame_splicing: int = 1 @@ -1015,7 +1021,8 @@ class SpectrogramAugmentationConfig: rect_freq: int = 0 mask_value: float = 0 rng: Optional[Any] = None # random.Random() type - use_numba_spec_augment: bool = True + use_numba_spec_augment: bool = False + use_vectorized_spec_augment: bool = True @dataclass diff --git a/nemo/collections/asr/parts/submodules/spectr_augment.py b/nemo/collections/asr/parts/submodules/spectr_augment.py index 9b379ce10f37..5bc7104816af 100644 --- a/nemo/collections/asr/parts/submodules/spectr_augment.py +++ b/nemo/collections/asr/parts/submodules/spectr_augment.py @@ -38,12 +38,18 @@ class SpecAugment(nn.Module, Typing): to be cut in one segment. If a float value, defines maximum percentage of timesteps that are cut adaptively. + use_vectorized_code - GPU-based implementation with batched masking and GPU rng, + setting it to False reverts to the legacy implementation. 
+ Fast implementation is inspired by torchaudio: + https://github.com/pytorch/audio/blob/ea437b31ce316ea3d66fe73768c0dcb94edb79ad/src/torchaudio/functional/functional.py#L816 """ + FREQ_AXIS = 1 # Frequency axis in the spectrogram tensor + TIME_AXIS = 2 # Time axis in the spectrogram tensor + @property def input_types(self): - """Returns definitions of module input types - """ + """Returns definitions of module input types""" return { "input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType()), "length": NeuralType(tuple('B'), LengthsType()), @@ -51,12 +57,18 @@ def input_types(self): @property def output_types(self): - """Returns definitions of module output types - """ + """Returns definitions of module output types""" return {"augmented_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())} def __init__( - self, freq_masks=0, time_masks=0, freq_width=10, time_width=10, rng=None, mask_value=0.0, + self, + freq_masks: int = 0, + time_masks: int = 0, + freq_width: int = 10, + time_width: int | float = 10, + rng: random.Random | None = None, + mask_value: float = 0.0, + use_vectorized_code: bool = True, ): super().__init__() @@ -69,6 +81,7 @@ def __init__( self.time_width = time_width self.mask_value = mask_value + self.use_vectorized_code = use_vectorized_code if isinstance(time_width, int): self.adaptive_temporal_width = False @@ -81,6 +94,12 @@ def __init__( @typecheck() @torch.no_grad() def forward(self, input_spec, length): + if self.use_vectorized_code: + return self._forward_vectorized(input_spec, length) + else: + return self._forward_legacy(input_spec, length) + + def _forward_legacy(self, input_spec, length): batch_size, num_freq_bins, _ = input_spec.shape # Move lengths to CPU before repeated indexing lengths_cpu = length.cpu().numpy() @@ -112,6 +131,89 @@ def forward(self, input_spec, length): masked_spec = input_spec.masked_fill(mask=fill_mask, value=self.mask_value) return masked_spec + def _forward_vectorized(self, input_spec: torch.Tensor, length: torch.Tensor) -> torch.Tensor: + # time masks + input_spec = self._apply_masks( + input_spec=input_spec, + num_masks=self.time_masks, + length=length, + width=self.time_width, + axis=self.TIME_AXIS, + mask_value=self.mask_value, + ) + # freq masks + input_spec = self._apply_masks( + input_spec=input_spec, + num_masks=self.freq_masks, + length=length, + width=self.freq_width, + axis=self.FREQ_AXIS, + mask_value=self.mask_value, + ) + return input_spec + + def _apply_masks( + self, + input_spec: torch.Tensor, + num_masks: int, + length: torch.Tensor, + width: int | float, + mask_value: float, + axis: int, + ) -> torch.Tensor: + + assert axis in ( + self.FREQ_AXIS, + self.TIME_AXIS, + ), f"Axis can be only be equal to frequency \ + ({self.FREQ_AXIS}) or time ({self.TIME_AXIS}). Received: {axis=}" + assert not ( + isinstance(width, float) and axis == self.FREQ_AXIS + ), "Float width supported \ + only with time axis." + + batch_size = input_spec.shape[0] + axis_length = input_spec.shape[axis] + + # If width is float then it is transformed into a tensor + if axis == self.TIME_AXIS and isinstance(width, float): + width = torch.clamp(width * length, max=axis_length).unsqueeze(1) + + # Generate [0-1) random numbers and then scale the tensors. + # Use float32 dtype for begin/end mask markers before they are quantized to long. 
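+ # Worked example of the masking below (illustrative numbers, not taken from the code):
+ # with width=3, axis_length=6 and a single mask per sample, a draw of mask_start=[[2]]
+ # and mask_width=[[3]] gives mask_end=[[5]]; comparing against indices=[0,1,2,3,4,5]
+ # produces the boolean row [F,F,T,T,T,F], i.e. positions 2..4 are filled with mask_value.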
+ mask_width = torch.rand((batch_size, num_masks), device=input_spec.device, dtype=torch.float32) * width + mask_width = mask_width.long() + mask_start = torch.rand((batch_size, num_masks), device=input_spec.device, dtype=torch.float32) + + if axis == self.TIME_AXIS: + # length can only be used for the time axis + mask_start = mask_start * (length.unsqueeze(1) - mask_width) + else: + mask_start = mask_start * (axis_length - mask_width) + + mask_start = mask_start.long() + mask_end = mask_start + mask_width + + # Create mask values using vectorized indexing + indices = torch.arange(axis_length, device=input_spec.device) + # Create a mask_tensor with all the indices. + # The mask_tensor shape is (batch_size, num_masks, axis_length). + mask_tensor = (indices >= mask_start.unsqueeze(-1)) & (indices < mask_end.unsqueeze(-1)) + + # Reduce masks to one mask + mask_tensor = mask_tensor.any(dim=1) + + # Create a final mask that aligns with the full tensor + mask = torch.zeros_like(input_spec, dtype=torch.bool) + if axis == self.TIME_AXIS: + mask_ranges = mask_tensor[:, None, :] + else: # axis == self.FREQ_AXIS + mask_ranges = mask_tensor[:, :, None] + mask[:, :, :] = mask_ranges + + # Apply the mask value + return input_spec.masked_fill(mask=mask, value=mask_value) + class SpecCutout(nn.Module, Typing): """ @@ -126,14 +228,12 @@ class SpecCutout(nn.Module, Typing): @property def input_types(self): - """Returns definitions of module input types - """ + """Returns definitions of module input types""" return {"input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())} @property def output_types(self): - """Returns definitions of module output types - """ + """Returns definitions of module output types""" return {"augmented_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())} def __init__(self, rect_masks=0, rect_time=5, rect_freq=20, rng=None): diff --git a/tests/collections/asr/test_asr_modules.py b/tests/collections/asr/test_asr_modules.py index b47a72fe0476..1a845232b2a7 100644 --- a/tests/collections/asr/test_asr_modules.py +++ b/tests/collections/asr/test_asr_modules.py @@ -69,10 +69,28 @@ def test_AudioToMelSpectrogramPreprocessor_batch(self): assert diff <= 1e-3 @pytest.mark.unit - def test_SpectrogramAugmentationr(self): + def test_SpectrogramAugmentationr_legacy(self): # Make sure constructor works instance1 = modules.SpectrogramAugmentation( - freq_masks=10, time_masks=3, rect_masks=3, use_numba_spec_augment=False + freq_masks=10, time_masks=3, rect_masks=3, use_numba_spec_augment=False, use_vectorized_spec_augment=False + ) + assert isinstance(instance1, modules.SpectrogramAugmentation) + + # Make sure forward doesn't throw with expected input + instance0 = modules.AudioToMelSpectrogramPreprocessor(dither=0) + input_signal = torch.randn(size=(4, 512)) + length = torch.randint(low=161, high=500, size=[4]) + res0 = instance0(input_signal=input_signal, length=length) + res = instance1(input_spec=res0[0], length=length) + + assert res.shape == res0[0].shape + + @pytest.mark.unit + @pytest.mark.run_only_on('GPU') + def test_SpectrogramAugmentationr_vectorized(self): + # Make sure constructor works + instance1 = modules.SpectrogramAugmentation( + freq_masks=10, time_masks=3, rect_masks=3, use_numba_spec_augment=False, use_vectorized_spec_augment=True ) assert isinstance(instance1, modules.SpectrogramAugmentation) @@ -97,7 +115,7 @@ def test_SpectrogramAugmentationr_numba_kernel(self, caplog): # Make sure constructor works instance1 = modules.SpectrogramAugmentation( - freq_masks=10, 
time_masks=3, rect_masks=3, use_numba_spec_augment=True + freq_masks=10, time_masks=3, rect_masks=3, use_numba_spec_augment=True, use_vectorized_spec_augment=False ) assert isinstance(instance1, modules.SpectrogramAugmentation) @@ -120,7 +138,8 @@ def test_SpectrogramAugmentationr_numba_kernel(self, caplog): def test_SpectrogramAugmentationr_config(self): # Test that dataclass matches signature of module result = config_utils.assert_dataclass_signature_match( - modules.SpectrogramAugmentation, modules.audio_preprocessing.SpectrogramAugmentationConfig, + modules.SpectrogramAugmentation, + modules.audio_preprocessing.SpectrogramAugmentationConfig, ) signatures_match, cls_subset, dataclass_subset = result @@ -178,7 +197,8 @@ def test_MaskedPatchAugmentation(self): def test_MaskedPatchAugmentation_config(self): # Test that dataclass matches signature of module result = config_utils.assert_dataclass_signature_match( - modules.MaskedPatchAugmentation, modules.audio_preprocessing.MaskedPatchAugmentationConfig, + modules.MaskedPatchAugmentation, + modules.audio_preprocessing.MaskedPatchAugmentationConfig, ) signatures_match, cls_subset, dataclass_subset = result @@ -195,7 +215,10 @@ def test_RNNTDecoder(self): pred_config = OmegaConf.create( { '_target_': 'nemo.collections.asr.modules.RNNTDecoder', - 'prednet': {'pred_hidden': 32, 'pred_rnn_layers': 1,}, + 'prednet': { + 'pred_hidden': 32, + 'pred_rnn_layers': 1, + }, 'vocab_size': vocab_size, 'blank_as_pad': True, } From 964ea3cb5faab50791d08226ec49741418774aa8 Mon Sep 17 00:00:00 2001 From: Pablo Garay Date: Wed, 15 May 2024 21:57:22 -0700 Subject: [PATCH 18/36] run_cicd_for_release_branches_also (#9213) --- .github/workflows/cicd-main.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index 291eeaed7f89..8430dae56418 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -15,7 +15,9 @@ name: "CICD NeMo" on: pull_request: - branches: [ "main" ] + branches: + - 'main' + - 'r**' types: [ labeled ] concurrency: From d0a453531e686cc7d126600b42fb3d385b20a6ae Mon Sep 17 00:00:00 2001 From: Jan Lasek Date: Thu, 16 May 2024 18:11:45 +0200 Subject: [PATCH 19/36] Update nemo.export module for quantized models (#9218) * Remove config aligner - no longer needed after TRT-LLM 0.9 update Signed-off-by: Jan Lasek * Change default export precision to bf16 (more frequent) Signed-off-by: Jan Lasek * Specify gpt_attention_plugin Signed-off-by: Jan Lasek --------- Signed-off-by: Jan Lasek --- .../conf/megatron_quantization.yaml | 2 +- nemo/export/trt_llm/qnemo/__init__.py | 1 - nemo/export/trt_llm/qnemo/align_config.py | 46 ------------------- .../trt_llm/qnemo/qnemo_to_tensorrt_llm.py | 40 ++-------------- 4 files changed, 5 insertions(+), 84 deletions(-) delete mode 100644 nemo/export/trt_llm/qnemo/align_config.py diff --git a/examples/nlp/language_modeling/conf/megatron_quantization.yaml b/examples/nlp/language_modeling/conf/megatron_quantization.yaml index 79a5bfbd8fe6..88d10ae0a66c 100644 --- a/examples/nlp/language_modeling/conf/megatron_quantization.yaml +++ b/examples/nlp/language_modeling/conf/megatron_quantization.yaml @@ -31,7 +31,7 @@ export: decoder_type: llama # gptnext, gpt2, llama inference_tensor_parallel: 1 # Default using 1 TP for inference inference_pipeline_parallel: 1 # Default using 1 PP for inference - dtype: 16 # Default precision data type + dtype: bf16 # Default precision data type model_file: llama2-7b-fp16.nemo # Nemo 
file path model_save: llama2-7b-fp8.qnemo # Path where the quantized model will be saved diff --git a/nemo/export/trt_llm/qnemo/__init__.py b/nemo/export/trt_llm/qnemo/__init__.py index 77832d749b66..59b9eb8ae6a6 100644 --- a/nemo/export/trt_llm/qnemo/__init__.py +++ b/nemo/export/trt_llm/qnemo/__init__.py @@ -12,5 +12,4 @@ # See the License for the specific language governing permissions and # limitations under the License. -from .align_config import align_config from .qnemo_to_tensorrt_llm import qnemo_to_tensorrt_llm diff --git a/nemo/export/trt_llm/qnemo/align_config.py b/nemo/export/trt_llm/qnemo/align_config.py deleted file mode 100644 index abc53224e4b3..000000000000 --- a/nemo/export/trt_llm/qnemo/align_config.py +++ /dev/null @@ -1,46 +0,0 @@ -# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import copy -from typing import Any, Dict - - -def align_config(config_trtllm_build: Dict[str, Any]) -> Dict[str, Any]: - """Function to align config produced by trtllm-build API for consistency - with how ModelConfig from tensorrt_llm.runtime is used in the project. - """ - config = {} - - config_trtllm_build = copy.deepcopy(config_trtllm_build) - - # Builder config - config["builder_config"] = {} - config["builder_config"]["name"] = "NeMo" - config["builder_config"].update(config_trtllm_build["build_config"]) - config["builder_config"].update(config_trtllm_build["pretrained_config"]) - - # Plugin config - config["plugin_config"] = config["builder_config"].pop("plugin_config") - - # Parallelism config - config["builder_config"]["world_size"] = config["builder_config"]["mapping"]["world_size"] - config["builder_config"]["tensor_parallel"] = config["builder_config"]["mapping"]["tp_size"] - config["builder_config"]["pipeline_parallel"] = config["builder_config"]["mapping"]["pp_size"] - - # Other parameters - config["builder_config"]["num_heads"] = config_trtllm_build["pretrained_config"]["num_attention_heads"] - config["builder_config"]["num_layers"] = config_trtllm_build["pretrained_config"]["num_hidden_layers"] - config["builder_config"]["add_bos"] = False - config["builder_config"]["precision"] = config["builder_config"]["dtype"] - return config diff --git a/nemo/export/trt_llm/qnemo/qnemo_to_tensorrt_llm.py b/nemo/export/trt_llm/qnemo/qnemo_to_tensorrt_llm.py index 4e74d8e5fb58..b7e2f7bc2973 100644 --- a/nemo/export/trt_llm/qnemo/qnemo_to_tensorrt_llm.py +++ b/nemo/export/trt_llm/qnemo/qnemo_to_tensorrt_llm.py @@ -15,13 +15,10 @@ import json import os import subprocess -from typing import List, Optional -from nemo.export.trt_llm.qnemo import align_config -from nemo.export.trt_llm.tensorrt_llm_build import MODEL_NAME, get_engine_name +from typing import List, Optional CONFIG_NAME = "config.json" -CONFIG_TRTLLM_BUILD_NAME = "config_trtllm_build.json" def qnemo_to_tensorrt_llm( @@ -34,6 +31,7 @@ def qnemo_to_tensorrt_llm( lora_target_modules: Optional[List[str]] = None, ): """Build TRT-LLM engine via trtllm-build CLI API in a 
subprocess.""" + assert not lora_target_modules, f"LoRA is not supported for quantized checkpoints, got {lora_target_modules}" print( "Note that setting n_gpus, tensor_parallel_size and pipeline_parallel_size parameters" " for quantized models is possible only on export step via nemo.export.quantize module." @@ -58,6 +56,8 @@ def qnemo_to_tensorrt_llm( str(max_prompt_embedding_table_size), "--gemm_plugin", model_config["dtype"], + "--gpt_attention_plugin", + model_config["dtype"], "--strongly_typed", "--use_custom_all_reduce", "disable", @@ -75,35 +75,3 @@ def qnemo_to_tensorrt_llm( print("Building engine done. Full logs are:") print(result.stdout.decode()) - - # Alignment to make nemo-fw tensorrt_llm.runtime ModelConfig definition compatible with config - # produced by trtllm-build API. The new config is saved as "config.json" while the source build - # config is saved as "config_trtllm_build.json" in the engine directory for reference. - os.rename(os.path.join(engine_dir, CONFIG_NAME), os.path.join(engine_dir, CONFIG_TRTLLM_BUILD_NAME)) - with open(os.path.join(engine_dir, CONFIG_TRTLLM_BUILD_NAME), "r") as f: - config_trtllm_build = json.load(f) - - config = align_config(config_trtllm_build) - - # Other parameters - assert lora_target_modules is None - config["builder_config"]["lora_target_modules"] = lora_target_modules - - with open(os.path.join(engine_dir, CONFIG_NAME), "w") as f: - json.dump(config, f, indent=2) - - # Rename for consistency with how engine is run later - for i in range(config["builder_config"]["world_size"]): - os.rename( - os.path.join(engine_dir, f"rank{i}.engine"), - os.path.join( - engine_dir, - get_engine_name( - MODEL_NAME, - config["builder_config"]["precision"], - config["builder_config"]["tensor_parallel"], - config["builder_config"]["pipeline_parallel"], - i, - ), - ), - ) From b489fba96227657b3d04ab71e390cb017bbcf685 Mon Sep 17 00:00:00 2001 From: jgerh <163925524+jgerh@users.noreply.github.com> Date: Thu, 16 May 2024 10:26:29 -0700 Subject: [PATCH 20/36] Update index.rst (#9080) Removed best-practices.rst file Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com> Co-authored-by: Eric Harper --- docs/source/index.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index 82d3359480ca..eb586f749842 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -41,7 +41,6 @@ For quick guides and tutorials, see the "Getting started" section below. :titlesonly: starthere/intro - starthere/best-practices starthere/tutorials For more information, browse the developer docs for your area of interest in the contents section below or on the left sidebar. 
@@ -86,4 +85,4 @@ For more information, browse the developer docs for your area of interest in the :name: Speech AI Tools :titlesonly: - tools/intro \ No newline at end of file + tools/intro From 526b6ade4bb078635d88feff76b5941c24db9e66 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?oliver=20k=C3=B6nig?= Date: Thu, 16 May 2024 20:45:05 +0200 Subject: [PATCH 21/36] ci: Speeding NeMo-CI up by using caching (#9174) * build: Add `Dockerfile.ci` Signed-off-by: Oliver Koenig * ci: Build, push, and test ci image Signed-off-by: Oliver Koenig * chore: Disable cache dir for NeMo reinstall Signed-off-by: Oliver Koenig * revert: Modify `reinstall.sh` Signed-off-by: Oliver Koenig * fix: install modelopt[torch] instead of ammo Signed-off-by: Oliver Koenig * deduplicate requirements Signed-off-by: Oliver Koenig * make mcore/datasets Signed-off-by: Oliver Koenig --------- Signed-off-by: Oliver Koenig --- .github/workflows/cicd-main.yml | 123 ++++++++++---------------------- Dockerfile.ci | 74 +++++++++++++++++++ 2 files changed, 112 insertions(+), 85 deletions(-) create mode 100644 Dockerfile.ci diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index 8430dae56418..ed2fc9f71f49 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -75,92 +75,45 @@ jobs: uses: actions/checkout@v4 with: path: ${{ github.run_id }} - - - name: Container setup - run: | - # Pull base PyTorch container - docker pull nvcr.io/nvidia/pytorch:24.02-py3 - docker run --device=/dev/nvidia0 --gpus all --shm-size=8g --env TRANSFORMERS_OFFLINE=0 --env HYDRA_FULL_ERROR=1 --env PYTHONUNBUFFERED=1 --volume ${{ github.workspace }}/${{ github.run_id }}:/workspace --volume /mnt/datadrive/TestData:/home/TestData nvcr.io/nvidia/pytorch:24.02-py3 /bin/bash -c ' - set -x - - # PyTorch version - python -c "import torch; print(torch.__version__)" - python -c "import torchvision; print(torchvision.__version__)" - - # Install test requirements - apt-get update && apt-get install -y bc && pip install -r requirements/requirements_test.txt && pip install -r requirements/requirements_lightning.txt - - # Code formatting checks - python setup.py style - - # Copyright Headers check - python tests/check_copyright_header.py --dir . - - # NeMo Installation - ./reinstall.sh release - - # Transformer Engine installation - git clone https://github.com/NVIDIA/TransformerEngine.git && \ - pushd TransformerEngine && \ - git fetch origin bfe21c3d68b0a9951e5716fb520045db53419c5e && \ - git checkout FETCH_HEAD && \ - git submodule init && git submodule update && \ - NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install . && \ - popd - - # Apex installation - git clone https://github.com/NVIDIA/apex.git && \ - pushd apex && \ - git checkout 810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c && \ - cp -R apex /usr/local/lib/python3.10/dist-packages && \ - popd - - # pip package should be working with main, if not we can update the commit here - # until the pip package is updated - # Megatron Core installation - git clone https://github.com/NVIDIA/Megatron-LM.git && \ - pushd Megatron-LM && \ - git checkout c90aa1671fc0b97f80fa6c3bb892ce6f8e88e7c9 && \ - pip install . 
&& \ - pushd megatron/core/datasets && \ - make && \ - popd && \ - popd - export PYTHONPATH="${PYTHONPATH}:/workspace/Megatron-LM" - - # Install only for test: L2: Segmentation Tool - pushd tools/ctc_segmentation && \ - pip install -r requirements.txt && \ - apt-get update && apt-get install libsox-fmt-all -y && \ - popd - - # ModelOpt installation - pip install nvidia-modelopt[torch]~=0.11.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir - - # PyTorch Lightning version - python -c "import pytorch_lightning; print(pytorch_lightning.__version__)" - - # PyTorch Lightning DDP Checks - CUDA_VISIBLE_DEVICES="0,1" python "tests/core_ptl/check_for_ranks.py" - - # Basic Import Checks - python -c "import nemo.collections.asr as nemo_asr" - python -c "import nemo.collections.nlp as nemo_nlp" - python -c "import nemo.collections.tts as nemo_tts" - - # set permission - chmod 777 -R /workspace - ' - ### \'\' - - - name: Push container to registry for future use + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + with: + # We use `docker` driver as this speeds things up for + # trivial (non-multi-stage) builds. + driver: docker + + - name: Build and push + uses: docker/build-push-action@v5 + with: + file: Dockerfile.ci + push: true + cache-from: nemoci.azurecr.io/nemo_container:latest + cache-to: type=inline + tags: | + nemoci.azurecr.io/nemo_container_${{ github.run_id }} + nemoci.azurecr.io/nemo_container:latest + + - name: Run some checks run: | - # Push container - echo "Docker: List containers" && docker ps -a - DOCKER_COMMIT=$(docker ps --latest --quiet) # latest container - docker commit $DOCKER_COMMIT nemoci.azurecr.io/nemo_container_${{ github.run_id }} - docker tag nemoci.azurecr.io/nemo_container_${{ github.run_id }} nemoci.azurecr.io/nemo_container_${{ github.run_id }} - docker push nemoci.azurecr.io/nemo_container_${{ github.run_id }} + docker run --rm --device=/dev/nvidia0 --gpus all --shm-size=8g --env TRANSFORMERS_OFFLINE=0 --env HYDRA_FULL_ERROR=1 --env PYTHONUNBUFFERED=1 nemoci.azurecr.io/nemo_container_${{ github.run_id }} bash -c '\ + # PyTorch Lightning version + python -c "import pytorch_lightning; print(pytorch_lightning.__version__)" + + # PyTorch Lightning DDP Checks + CUDA_VISIBLE_DEVICES="0,1" python "tests/core_ptl/check_for_ranks.py" + + # Basic Import Checks + python -c "import nemo.collections.asr as nemo_asr" + python -c "import nemo.collections.nlp as nemo_nlp" + python -c "import nemo.collections.tts as nemo_tts" + + python setup.py style + python tests/check_copyright_header.py --dir . + + # These checks are not crucial + exit 0 + ' # - name: Build and push to local registry # uses: docker/build-push-action@v5 diff --git a/Dockerfile.ci b/Dockerfile.ci new file mode 100644 index 000000000000..5b2cd8d6eb61 --- /dev/null +++ b/Dockerfile.ci @@ -0,0 +1,74 @@ +# syntax=docker/dockerfile:1-labs + +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
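+# NOTE: the CI workflow above builds this file with docker/build-push-action using
+# cache-from: nemoci.azurecr.io/nemo_container:latest and cache-to: type=inline, so the
+# rarely-changing steps below (APT packages, pinned third-party installs) sit before the
+# final COPY of the source tree to maximize layer reuse across runs.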
+ +ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.02-py3 + +FROM ${BASE_IMAGE} + +ENV TRANSFORMERS_OFFLINE=0 +ENV HYDRA_FULL_ERROR=1 +ENV PYTHONUNBUFFERED=1 + +# APT packages +RUN <<"EOF" bash -ex +apt-get update +apt-get install -y bc libsox-fmt-all -y +apt-get clean +EOF + +WORKDIR /workspace + +# Install NeMo requirements +ARG TE_TAG=bfe21c3d68b0a9951e5716fb520045db53419c5e +ARG MODELOPT_VERSION=0.11.0 +ARG MCORE_TAG=c90aa1671fc0b97f80fa6c3bb892ce6f8e88e7c9 +ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c +RUN \ +--mount=type=bind,source=requirements,target=requirements \ +--mount=type=bind,source=tools,target=tools \ +--mount=type=bind,source=setup.py,target=setup.py \ +--mount=type=bind,source=nemo/package_info.py,target=nemo/package_info.py \ +--mount=type=bind,source=nemo/__init__.py,target=nemo/__init__.py <<"EOF" bash -ex +pip install --no-cache-dir --no-build-isolation --extra-index-url https://pypi.nvidia.com \ +"transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@${TE_TAG}" \ +"megatron_core @ git+https://github.com/NVIDIA/Megatron-LM.git@${MCORE_TAG}" \ +"nvidia-modelopt[torch]~=${MODELOPT_VERSION}" \ +"apex @ git+https://github.com/NVIDIA/apex.git@${APEX_TAG}" \ +-r tools/ctc_segmentation/requirements.txt \ +".[all]" + +# Megatron Core installation +git clone https://github.com/NVIDIA/Megatron-LM.git && \ +pushd Megatron-LM && \ +git checkout ${MCORE_TAG} && \ + pushd megatron/core/datasets && \ + make && \ + popd && \ +popd +export PYTHONPATH="${PYTHONPATH}:/workspace/Megatron-LM" +EOF + +# Copy over NeMo code +COPY ./ ./ +RUN <<"EOF" bash -ex +pip install --no-cache-dir --no-build-isolation ".[all]" + +# set permission +chmod 777 -R /workspace +EOF + +ENV PYTHONPATH="${PYTHONPATH}:/workspace/Megatron-LM" + From e465b9c880b2cb54f8863c2eb5360c24d674ef6f Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Thu, 16 May 2024 15:09:33 -0400 Subject: [PATCH 22/36] Add save option to the TRT-LLM export test script (#9221) * Add save option to the test script Signed-off-by: Onur Yilmaz * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia --------- Signed-off-by: Onur Yilmaz Signed-off-by: oyilmaz-nvidia Co-authored-by: oyilmaz-nvidia --- tests/export/test_nemo_export.py | 141 ++++++++++++++++++++++++------- 1 file changed, 110 insertions(+), 31 deletions(-) diff --git a/tests/export/test_nemo_export.py b/tests/export/test_nemo_export.py index 0c5a9d9e2309..0e9981403a1a 100644 --- a/tests/export/test_nemo_export.py +++ b/tests/export/test_nemo_export.py @@ -81,7 +81,12 @@ def get_accuracy_with_lambada(model, nq, task_ids, lora_uids, test_data_path=Non if nq is not None: trtllm_deployed_output = nq.query_llm( - prompts=[prompt], max_output_token=1, top_k=1, top_p=0, temperature=0.1, task_id=task_ids, + prompts=[prompt], + max_output_token=1, + top_k=1, + top_p=0, + temperature=0.1, + task_id=task_ids, ) trtllm_deployed_output = trtllm_deployed_output[0][0].strip().lower() @@ -140,6 +145,7 @@ def run_trt_llm_inference( stop_words_list=None, test_deployment=False, test_data_path=None, + save_trt_engine=False, ): if Path(checkpoint_path).exists(): if n_gpu > torch.cuda.device_count(): @@ -213,7 +219,8 @@ def run_trt_llm_inference( if ptuning: trt_llm_exporter.add_prompt_table( - task_name="0", prompt_embeddings_checkpoint_path=prompt_embeddings_checkpoint_path, + task_name="0", + prompt_embeddings_checkpoint_path=prompt_embeddings_checkpoint_path, ) output = trt_llm_exporter.forward( @@ -232,7 +239,11 @@ 
def run_trt_llm_inference( nm = None output_deployed = "" if test_deployment: - nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name=model_name, port=8000,) + nm = DeployPyTriton( + model=trt_llm_exporter, + triton_model_name=model_name, + port=8000, + ) nm.deploy() nm.run() nq = NemoQueryLLM(url="localhost:8000", model_name=model_name) @@ -261,12 +272,17 @@ def run_trt_llm_inference( result = get_accuracy_with_lambada(trt_llm_exporter, nq, task_ids, lora_uids, test_data_path) if test_deployment: nm.stop() - shutil.rmtree(trt_llm_model_dir) + + if not save_trt_engine: + shutil.rmtree(trt_llm_model_dir) return result if test_deployment: nm.stop() - shutil.rmtree(trt_llm_model_dir) + + if not save_trt_engine: + shutil.rmtree(trt_llm_model_dir) + return None, None, None, None, None else: raise Exception("Checkpoint {0} could not be found.".format(checkpoint_path)) @@ -284,6 +300,7 @@ def run_existing_checkpoints( test_deployment=False, stop_words_list=None, test_data_path=None, + save_trt_engine=False, ): if n_gpus > torch.cuda.device_count(): print("Skipping the test due to not enough number of GPUs") @@ -338,6 +355,7 @@ def run_existing_checkpoints( stop_words_list=stop_words_list, test_deployment=test_deployment, test_data_path=test_data_path, + save_trt_engine=save_trt_engine, ) @@ -348,87 +366,146 @@ def get_args(): ) parser.add_argument( - "--model_name", type=str, required=True, + "--model_name", + type=str, + required=True, ) parser.add_argument( - "--existing_test_models", default=False, action='store_true', + "--existing_test_models", + default=False, + action='store_true', ) parser.add_argument( - "--model_type", type=str, required=False, + "--model_type", + type=str, + required=False, ) parser.add_argument( - "--min_gpus", type=int, default=1, required=True, + "--min_gpus", + type=int, + default=1, + required=True, ) parser.add_argument( - "--max_gpus", type=int, + "--max_gpus", + type=int, ) parser.add_argument( - "--checkpoint_dir", type=str, default="/tmp/nemo_checkpoint/", required=False, + "--checkpoint_dir", + type=str, + default="/tmp/nemo_checkpoint/", + required=False, ) parser.add_argument( - "--trt_llm_model_dir", type=str, + "--trt_llm_model_dir", + type=str, ) parser.add_argument( - "--max_batch_size", type=int, default=8, + "--max_batch_size", + type=int, + default=8, ) parser.add_argument( - "--max_input_token", type=int, default=256, + "--max_input_token", + type=int, + default=256, ) parser.add_argument( - "--max_output_token", type=int, default=128, + "--max_output_token", + type=int, + default=128, ) parser.add_argument( - "--p_tuning_checkpoint", type=str, + "--p_tuning_checkpoint", + type=str, ) parser.add_argument( - "--ptuning", default=False, action='store_true', + "--ptuning", + default=False, + action='store_true', ) parser.add_argument( - "--lora_checkpoint", type=str, + "--lora_checkpoint", + type=str, ) parser.add_argument( - "--lora", default=False, action='store_true', + "--lora", + default=False, + action='store_true', ) parser.add_argument( - "--tp_size", type=int, + "--tp_size", + type=int, ) parser.add_argument( - "--pp_size", type=int, + "--pp_size", + type=int, ) parser.add_argument( - "--top_k", type=int, default=1, + "--top_k", + type=int, + default=1, ) parser.add_argument( - "--top_p", type=float, default=0.0, + "--top_p", + type=float, + default=0.0, ) parser.add_argument( - "--temperature", type=float, default=1.0, + "--temperature", + type=float, + default=1.0, ) parser.add_argument( - "--run_accuracy", default=False, 
action='store_true', + "--run_accuracy", + type=str, + default="False", ) parser.add_argument("--streaming", default=False, action="store_true") parser.add_argument( - "--test_deployment", type=str, default="False", + "--test_deployment", + type=str, + default="False", + ) + parser.add_argument( + "--debug", + default=False, + action='store_true', ) parser.add_argument( - "--debug", default=False, action='store_true', + "--ci_upload_test_results_to_cloud", + default=False, + action='store_true', ) parser.add_argument( - "--ci_upload_test_results_to_cloud", default=False, action='store_true', + "--test_data_path", + type=str, + default=None, ) parser.add_argument( - "--test_data_path", type=str, default=None, + "--save_trt_engine", + type=str, + default="False", ) return parser.parse_args() def run_inference_tests(args): - if args.test_deployment == "False": + if args.test_deployment == "True": + args.test_deployment = True + else: args.test_deployment = False + + if args.save_trt_engine == "True": + args.save_trt_engine = True else: - args.test_deployment = True + args.save_trt_engine = False + + if args.run_accuracy == "True": + args.run_accuracy = True + else: + args.run_accuracy = False if args.run_accuracy: if args.test_data_path is None: @@ -453,6 +530,7 @@ def run_inference_tests(args): test_deployment=args.test_deployment, run_accuracy=args.run_accuracy, test_data_path=args.test_data_path, + save_trt_engine=args.save_trt_engine, ) n_gpus = n_gpus * 2 @@ -487,6 +565,7 @@ def run_inference_tests(args): streaming=args.streaming, test_deployment=args.test_deployment, test_data_path=args.test_data_path, + save_trt_engine=args.save_trt_engine, ) n_gpus = n_gpus * 2 From 73edac445e7881500d17e288ef04fcc406a69f4e Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Thu, 16 May 2024 16:55:10 -0400 Subject: [PATCH 23/36] rename paths2audiofiles to audio (#9209) (#9220) * rename paths2audiofiles to audio * update transcribe to audio --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao --- tutorials/VoiceSwapSample.ipynb | 2 +- tutorials/asr/Multilang_ASR.ipynb | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/tutorials/VoiceSwapSample.ipynb b/tutorials/VoiceSwapSample.ipynb index addf19f3b236..c56544d9a043 100644 --- a/tutorials/VoiceSwapSample.ipynb +++ b/tutorials/VoiceSwapSample.ipynb @@ -146,7 +146,7 @@ "files = [Audio_sample]\n", "raw_text = ''\n", "text = ''\n", - "for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):\n", + "for fname, transcription in zip(files, quartznet.transcribe(audio=files)):\n", " raw_text = transcription\n", "\n", "# Add capitalization and punctuation\n", diff --git a/tutorials/asr/Multilang_ASR.ipynb b/tutorials/asr/Multilang_ASR.ipynb index 6354ce10ec6d..9877b983f2a1 100644 --- a/tutorials/asr/Multilang_ASR.ipynb +++ b/tutorials/asr/Multilang_ASR.ipynb @@ -701,7 +701,7 @@ }, "outputs": [], "source": [ - "asr_model.transcribe(transcribe = es_files) [0]" + "asr_model.transcribe(audio = es_files) [0]" ] }, { @@ -1173,7 +1173,7 @@ }, "outputs": [], "source": [ - "asr_model.transcribe(transcribe = en_files)[0]" + "asr_model.transcribe(audio = en_files)[0]" ] }, { @@ -1221,7 +1221,7 @@ }, "outputs": [], "source": [ - "asr_model.transcribe(transcribe = es_files)[0]" + "asr_model.transcribe(audio = es_files)[0]" ] }, { From d2e047a203ed9b2b3e332a5737a9c8019dfb55b6 Mon Sep 17 00:00:00 2001 From: Ao Tang Date: Thu, 16 May 2024 23:47:31 
-0400 Subject: [PATCH 24/36] Checkpoint resuming compatible for 2403 container (#9199) * make ckpt loading backward compatible * Apply isort and black reformatting Signed-off-by: suiyoubi * if not using dist optimizer, the states are stored in 'optimizer' * Apply isort and black reformatting Signed-off-by: suiyoubi * code refactor * Apply isort and black reformatting Signed-off-by: suiyoubi * typo --------- Signed-off-by: suiyoubi Co-authored-by: suiyoubi Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> --- nemo/collections/nlp/parts/nlp_overrides.py | 68 +++++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/nemo/collections/nlp/parts/nlp_overrides.py b/nemo/collections/nlp/parts/nlp_overrides.py index 079732f6b9c5..f50a467cf71a 100644 --- a/nemo/collections/nlp/parts/nlp_overrides.py +++ b/nemo/collections/nlp/parts/nlp_overrides.py @@ -98,6 +98,7 @@ try: from megatron.core import dist_checkpointing, parallel_state from megatron.core.dist_checkpointing.dict_utils import dict_list_map_outplace + from megatron.core.dist_checkpointing.mapping import LocalNonpersitentObject from megatron.core.dist_checkpointing.optimizer import ( get_param_id_to_sharded_param_map, make_sharded_optimizer_tensor, @@ -415,6 +416,70 @@ def _fix_device(t): return dict_list_map_outplace(_fix_device, ckpt) + def _get_param_group(self, state_dict: Dict[str, Any]): + """Return the param groups in the state dict""" + return ( + state_dict['optimizer_states'][0]['param_groups'] + if 'optimizer' not in state_dict['optimizer_states'][0] + else state_dict['optimizer_states'][0]['optimizer']['param_groups'] + ) + + def _check_param_groups_mismatch(self, checkpoint_path: Union[str, Path], sharded_state_dict: Dict[str, Any]): + """ + Check if the number of param groups in the checkpoint not match with the sharded_state_dict + Returns: + bool: True if the number of param groups does not match + """ + common_state_dict = dist_checkpointing.load_common_state_dict(checkpoint_path) + model_param_groups = self._get_param_group(common_state_dict) + checkpoint_param_groups = self._get_param_group(sharded_state_dict) + return len(model_param_groups) != len(checkpoint_param_groups) + + def _fix_param_groups( + self, checkpoint_path: Union[str, Path], sharded_state_dict: Dict[str, Any] + ) -> Dict[str, Any]: + """ + Try to fix the param groups in the checkpoint. + This is to fix the bug that in 24.03, all checkpoints store EP param group regardless of using EP or not. + This function makes sure all checkpoints are compatible for loading. + Returns: + sharded_state_dict: Loaded dictionary for the distributed load function + """ + common_state_dict = dist_checkpointing.load_common_state_dict(checkpoint_path) + model_param_groups = self._get_param_group(sharded_state_dict) + checkpoint_param_groups = self._get_param_group(common_state_dict) + + model_has_expert_param = any(param.get('is_expert', False) for param in model_param_groups) + checkpoint_has_expert_param = any(param.get('is_expert', False) for param in checkpoint_param_groups) + + expert_index = None + if checkpoint_has_expert_param and not model_has_expert_param: + logging.warning( + 'Currently training the model without expert parallelism while restored checkpoint has EP params. Ignoring the EP params for restoring.' 
+ ) + expert_index = next( + (index for index, entry in enumerate(checkpoint_param_groups) if entry.get('is_expert', False)), + None, + ) + if expert_index: + # Temporary empty params so that loading doesn't fail + model_param_groups.insert(expert_index, {'params': LocalNonpersitentObject([]), 'is_expert': True}) + if 'optimizer' in sharded_state_dict['optimizer_states'][0]: + sharded_state_dict['optimizer_states'][0]['optimizer']['param_groups'] = model_param_groups + else: + sharded_state_dict['optimizer_states'][0]['param_groups'] = model_param_groups + else: + raise ValueError('Cannot find expert param in the checkpoint.') + + loaded_state_dict = self.checkpoint_io.load_checkpoint(checkpoint_path, sharded_state_dict=sharded_state_dict) + if expert_index is not None: + # Remove the temporary empty params added above + if 'optimizer' in loaded_state_dict['optimizer_states'][0]: + loaded_state_dict['optimizer_states'][0]['optimizer']['param_groups'].pop(expert_index) + else: + loaded_state_dict['optimizer_states'][0]['param_groups'].pop(expert_index) + return loaded_state_dict + def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: """PTL method which we override to integrate distributed checkpoints for model parallel models. In order to load distributed checkpoints we need to provide the sharded_state_dict to @@ -437,6 +502,9 @@ def load_checkpoint(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]: # after dist_checkpointing.load, sharded tensors will be replaced with tensors checkpoint['state_dict'] = sharded_state_dict checkpoint['optimizer_states'] = [self.optimizer_sharded_state_dict()] + + if self._check_param_groups_mismatch(checkpoint_path, checkpoint): + return self._fix_param_groups(checkpoint_path, checkpoint) return self.checkpoint_io.load_checkpoint(checkpoint_path, sharded_state_dict=checkpoint) # Legacy model parallel checkpointing logic, does not use megatron core From b715f5a460d52d07eb935e7d58f3ff3a8c938521 Mon Sep 17 00:00:00 2001 From: Cathy <815244047@qq.com> Date: Fri, 17 May 2024 12:54:40 +0800 Subject: [PATCH 25/36] support QWen1.5/QWen2 (#9055) * support qwen1.5(qwen2) Signed-off-by: Agoniii <815244047@qq.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unused import Signed-off-by: Cathy <815244047@qq.com> * Apply isort and black reformatting Signed-off-by: pablo-garay --------- Signed-off-by: Agoniii <815244047@qq.com> Signed-off-by: Cathy <815244047@qq.com> Signed-off-by: pablo-garay Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay Co-authored-by: pablo-garay --- .../conf/megatron_qwen2_config.yaml | 227 +++++++++++++ .../conf/megatron_qwen2_inference.yaml | 39 +++ .../convert_qwen2_hf_to_nemo.py | 307 ++++++++++++++++++ .../convert_qwen2_nemo_to_hf.py | 307 ++++++++++++++++++ 4 files changed, 880 insertions(+) create mode 100644 examples/nlp/language_modeling/conf/megatron_qwen2_config.yaml create mode 100644 examples/nlp/language_modeling/conf/megatron_qwen2_inference.yaml create mode 100644 scripts/checkpoint_converters/convert_qwen2_hf_to_nemo.py create mode 100644 scripts/checkpoint_converters/convert_qwen2_nemo_to_hf.py diff --git a/examples/nlp/language_modeling/conf/megatron_qwen2_config.yaml b/examples/nlp/language_modeling/conf/megatron_qwen2_config.yaml new file mode 100644 index 000000000000..e96ba0599bb3 --- /dev/null +++ 
b/examples/nlp/language_modeling/conf/megatron_qwen2_config.yaml @@ -0,0 +1,227 @@ +name: megatron_qwen2 +restore_from_path: null # used when starting from a .nemo file + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: bf16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: -1 # PTL default. In practice, max_steps will be reached first. + max_steps: 100000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches + log_every_n_steps: 10 + val_check_interval: 100 + limit_val_batches: 50 + limit_test_batches: 500 + accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models + gradient_clip_val: 1.0 + benchmark: False + enable_model_summary: False # default PTL callback for this does not support model parallelism, instead we log manually + +exp_manager: + explicit_log_dir: null + exp_dir: null + name: megatron_qwen2 + create_wandb_logger: False + wandb_logger_kwargs: + project: null + name: null + resume_if_exists: True + resume_ignore_no_checkpoint: True + create_checkpoint_callback: True + checkpoint_callback_params: + monitor: val_loss + save_top_k: 10 + mode: min + always_save_nemo: False # saves nemo file during validation, not implemented for model parallel + save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits + filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}' + model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}} + +model: + mcore_gpt: True + # specify micro_batch_size, global_batch_size, and model parallelism + # gradient accumulation will be done automatically based on data_parallel_size + micro_batch_size: 4 # limited by GPU memory + global_batch_size: 8 # will use more micro batches to reach global batch size + tensor_model_parallel_size: 1 # intra-layer model parallelism + pipeline_model_parallel_size: 1 # inter-layer model parallelism + virtual_pipeline_model_parallel_size: null # interleaved pipeline + + # model architecture + encoder_seq_length: 32768 + max_position_embeddings: ${.encoder_seq_length} + num_layers: 40 # 4b: 40 | 7b: 32 | 14b: 40 | 72b: 80 + hidden_size: 2560 # 4b: 2560 | 7b: 4096 | 14b: 5120 | 72b: 8192 + ffn_hidden_size: 6912 # Transformer FFN hidden size. Usually 4 * hidden_size. | 4b: 6912 | 7b: 11008 | 14b: 13696 | 72b: 24576 + num_attention_heads: 20 # 4b: 20 | 7b: 32 | 14b: 40 | 72b: 64 + init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.') + use_scaled_init_method: True # use scaled residuals initialization + hidden_dropout: 0.0 # Dropout probability for hidden state transformer. + attention_dropout: 0.0 # Dropout probability for attention + ffn_dropout: 0.0 # Dropout probability in the feed-forward layer. + kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null + apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number. + normalization: 'rmsnorm' # Normalization layer to use. Options are 'layernorm', 'rmsnorm' + layernorm_epsilon: 1e-5 + do_layer_norm_weight_decay: False # True means weight decay on all params + make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency. 
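+  # NOTE: the architecture defaults in this file correspond to Qwen1.5-4B; the inline
+  # "4b | 7b | 14b | 72b" hints give the values to substitute for the other released sizes.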
+ pre_process: True # add embedding + post_process: True # add pooler + persist_layer_norm: True # Use of persistent fused layer norm kernel. + bias: False # Whether to use bias terms in all weight matrices. + qkv_bias: True + activation: 'fast-swiglu' # Options ['gelu', 'geglu', 'swiglu', 'reglu', 'squared-relu', 'fast-geglu', 'fast-swiglu', 'fast-reglu'] + headscale: False # Whether to learn extra parameters that scale the output of the each self-attention head. + transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer'] + openai_gelu: False # Use OpenAI's GELU instead of the default GeLU + normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to se this to True. + position_embedding_type: 'rope' # Position embedding type. Options ['learned_absolute', 'rope'] + rotary_percentage: 1.0 # If using position_embedding_type=rope, then the per head dim is multiplied by this. + attention_type: 'multihead' # Attention type. Options ['multihead'] + share_embeddings_and_output_weights: False # Share embedding and output layer weights. + overlap_p2p_comm: False # Overlap p2p communication with computes. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1 + batch_p2p_comm: True # Batch consecutive inter-peer send/recv operations. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1 + num_query_groups: 20 # Number of query groups for group query attention. If None, normal attention is used. | 4b: 20 | 7b: 32 | 14b: 40 | 72b: 64 + override_vocab_size: 151936 # 4b: 151936 | 7b: 151936 | 14b: 152064 | 72b: 152064 + rotary_base: 5000000.0 # 4b: 5000000.0 | 7b: 1000000.0 | 14b: 1000000.0 | 72b: 1000000.0 + + tokenizer: + library: 'huggingface' + type: Qwen/Qwen1.5-4B + model: null # /path/to/tokenizer.model + vocab_file: null + merge_file: null + delimiter: null # only used for tabular tokenizer + sentencepiece_legacy: False # Legacy=True allows you to add special tokens to sentencepiece tokenizers. + + # Mixed precision + native_amp_init_scale: 4294967296 # 2 ** 32 + native_amp_growth_interval: 1000 + hysteresis: 2 # Gradient scale hysteresis + fp32_residual_connection: False # Move residual connections to fp32 + fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16 + + # Megatron O2-style half-precision + megatron_amp_O2: True # Enable O2-level automatic mixed precision using main parameters + grad_allreduce_chunk_size_mb: 125 + + # Fusion + grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce. Only used with O2 and no pipeline parallelism.. + gradient_accumulation_fusion: False # Fuse weight gradient accumulation to GEMMs. Only used with pipeline parallelism and O2. + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. + bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition. + masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. + get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages. + apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. 
Only used if position_embedding_type=rope + + + # Miscellaneous + seed: 1234 + resume_from_checkpoint: null # manually set the checkpoint file to load from + use_cpu_initialization: False # Init weights on the CPU (slow for large models) + onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter. + apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this + gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) + sync_batch_comm: False # Enable stream synchronization after each p2p communication between pipeline stages + + ## Activation Checkpointing + # NeMo Megatron supports 'selective' activation checkpointing where only the memory intensive part of attention is checkpointed. + # These memory intensive activations are also less compute intensive which makes activation checkpointing more efficient for LLMs (20B+). + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. + # 'full' will checkpoint the entire transformer layer. + activations_checkpoint_granularity: null # 'selective' or 'full' + activations_checkpoint_method: null # 'uniform', 'block' + # 'uniform' divides the total number of transformer layers and checkpoints the input activation + # of each chunk at the specified granularity. When used with 'selective', 'uniform' checkpoints all attention blocks in the model. + # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity + activations_checkpoint_num_layers: null + # when using 'uniform' this creates groups of transformer layers to checkpoint. Usually set to 1. Increase to save more memory. + # when using 'block' this this will checkpoint the first activations_checkpoint_num_layers per pipeline stage. + num_micro_batches_with_partial_activation_checkpoints: null + # This feature is valid only when used with pipeline-model-parallelism. + # When an integer value is provided, it sets the number of micro-batches where only a partial number of Transformer layers get checkpointed + # and recomputed within a window of micro-batches. The rest of micro-batches in the window checkpoint all Transformer layers. The size of window is + # set by the maximum outstanding micro-batch backpropagations, which varies at different pipeline stages. The number of partial layers to checkpoint + # per micro-batch is set by 'activations_checkpoint_num_layers' with 'activations_checkpoint_method' of 'block'. + # This feature enables using activation checkpoint at a fraction of micro-batches up to the point of full GPU memory usage. + activations_checkpoint_layers_per_pipeline: null + # This feature is valid only when used with pipeline-model-parallelism. + # When an integer value (rounded down when float is given) is provided, it sets the number of Transformer layers to skip checkpointing at later + # pipeline stages. For example, 'activations_checkpoint_layers_per_pipeline' of 3 makes pipeline stage 1 to checkpoint 3 layers less than + # stage 0 and stage 2 to checkpoint 6 layers less stage 0, and so on. This is possible because later pipeline stage + # uses less GPU memory with fewer outstanding micro-batch backpropagations. Used with 'num_micro_batches_with_partial_activation_checkpoints', + # this feature removes most of activation checkpoints at the last pipeline stage, which is the critical execution path. 
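+  # Example (hypothetical values, not the defaults used here): to fully checkpoint the first
+  # 4 transformer layers of each pipeline stage, set activations_checkpoint_granularity: full,
+  # activations_checkpoint_method: block and activations_checkpoint_num_layers: 4.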
+ + ## Sequence Parallelism + # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially + # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details. + sequence_parallel: False + + ## Transformer Engine + transformer_engine: True + fp8: False # enables fp8 in TransformerLayer forward + fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3 + fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID + fp8_margin: 0 # scaling margin + fp8_interval: 1 # scaling update interval + fp8_amax_history_len: 1024 # Number of steps for which amax history is recorded per tensor + fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history + reduce_amax: True # Perform reduction to sync amax tensors across GPUs after every iteration + use_emha: False # Use fused multi-head attention for large sequence-length. Note this is not yet supported. Please set to False. + + data: + # Path to data must be specified by the user. + # Supports List, String and Dictionary + # List : can override from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-gpt3_00_text_document,.5,/raid/data/pile/my-gpt3_01_text_document]", + # Or see example below: + # data_prefix: + # - .5 + # - /raid/data/pile/my-gpt3_00_text_document + # - .5 + # - /raid/data/pile/my-gpt3_01_text_document + # Dictionary: can override from CLI "model.data.data_prefix"={"train":[1.0, /path/to/data], "validation":/path/to/data, "test":/path/to/test} + # Or see example below: + # "model.data.data_prefix: {train:[1.0,/path/to/data], validation:[/path/to/data], test:[/path/to/test]}" + # data_prefix: ??? + index_mapping_dir: null # path to save index mapping .npy files, by default will save in the same location as data_prefix + data_impl: mmap + splits_string: 900,50,50 + seq_length: ${model.encoder_seq_length} + skip_warmup: True + num_workers: 2 + dataloader_type: single # cyclic + reset_position_ids: False # Reset position ids after end-of-document token + reset_attention_mask: False # Reset attention mask after end-of-document token + eod_mask_loss: False # Mask loss for the end of document tokens + validation_drop_last: True # Set to false if the last partial validation samples is to be consumed + no_seqlen_plus_one_input_tokens: False # Set to True to disable fetching (sequence length + 1) input tokens, instead get (sequence length) input tokens and mask the last token + pad_samples_to_global_batch_size: False # Set to True if you want to pad the last partial batch with -1's to equal global batch size + shuffle_documents: True # Set to False to disable documents shuffling. 
Sample index will still be shuffled + + # Nsys profiling options + nsys_profile: + enabled: False + start_step: 10 # Global batch to start profiling + end_step: 10 # Global batch to end profiling + ranks: [0] # Global rank IDs to profile + gen_shape: False # Generate model and kernel details including input shapes + + optim: + name: distributed_fused_adam + lr: 0.00015 + weight_decay: 0.1 + betas: + - 0.9 + - 0.95 + bucket_cap_mb: 125 + overlap_grad_sync: True + overlap_param_sync: True + contiguous_grad_buffer: True + sched: + name: CosineAnnealing + warmup_steps: 2000 + constant_steps: 11873 + min_lr: 1.0e-05 diff --git a/examples/nlp/language_modeling/conf/megatron_qwen2_inference.yaml b/examples/nlp/language_modeling/conf/megatron_qwen2_inference.yaml new file mode 100644 index 000000000000..e508b01858f5 --- /dev/null +++ b/examples/nlp/language_modeling/conf/megatron_qwen2_inference.yaml @@ -0,0 +1,39 @@ +inference: + greedy: False # Whether or not to use sampling ; use greedy decoding otherwise + top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering. + top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. + temperature: 1.0 # sampling temperature + add_BOS: True # add the bos token at the begining of the prompt + tokens_to_generate: 30 # The minimum length of the sequence to be generated. + all_probs: False # whether return the log prob for all the tokens in vocab + repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty. + min_tokens_to_generate: 0 # The minimum length of the sequence to be generated. + compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False + end_strings: [""] # generation will stop when one of these tokens is generated + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + logger: False # logger provided by exp_manager + precision: 32 # 16, 32, or bf16 + use_distributed_sampler: False + +tensor_model_parallel_size: -1 +pipeline_model_parallel_size: -1 +pipeline_model_parallel_split_rank: -1 # used for encoder and decoder model (0 for others) +megatron_amp_O2: False # Enable O2-level automatic mixed precision to save memory +gpt_model_file: null # GPT nemo file path +checkpoint_dir: null # checkpoint file dir. This is used to load the PTL checkpoint generated during the GPT training +checkpoint_name: null # PTL checkpoint file name, only used for PTL checkpoint loading +hparams_file: null # model configuration file, only used for PTL checkpoint loading +prompts: # prompts for GPT inference + - "Q: How are you?" + - "Q: How big is the universe?" +server: False # whether launch the API server +port: 5555 # the port number for the inference server +web_server: False # whether launch the web inference server +share: False # whether create a public URL +username: test # user name for web client +password: test2 # password for web client +web_port: 9889 # the port number of the web server diff --git a/scripts/checkpoint_converters/convert_qwen2_hf_to_nemo.py b/scripts/checkpoint_converters/convert_qwen2_hf_to_nemo.py new file mode 100644 index 000000000000..223c7af50843 --- /dev/null +++ b/scripts/checkpoint_converters/convert_qwen2_hf_to_nemo.py @@ -0,0 +1,307 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +r""" +Conversion script to convert Huggingface QWen2(QWen1.5) checkpoints into nemo checkpoint. + Example to run this conversion script: + python convert_qwen2_hf_to_nemo.py \ + --input_name_or_path \ + --output_path +""" + +import os +from argparse import ArgumentParser +from collections import OrderedDict + +import torch +from omegaconf import OmegaConf +from pytorch_lightning.trainer.trainer import Trainer +from transformers import Qwen2ForCausalLM, Qwen2Tokenizer + +from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel +from nemo.collections.nlp.parts.nlp_overrides import ( + GradScaler, + MegatronHalfPrecisionPlugin, + NLPDDPStrategy, + NLPSaveRestoreConnector, + PipelineMixedPrecisionPlugin, +) +from nemo.collections.nlp.parts.utils_funcs import load_state_dict_helper, torch_dtype_from_precision +from nemo.utils import logging + + +def get_args(): + parser = ArgumentParser() + parser.add_argument( + "--input_name_or_path", + type=str, + default=None, + required=True, + help="Path to Huggingface QWen2 checkpoints", + ) + parser.add_argument("--output_path", type=str, default=None, required=True, help="Path to output .nemo file.") + parser.add_argument( + "--hparams_file", + type=str, + default=os.path.join( + os.path.dirname(__file__), '../../examples/nlp/language_modeling/conf/megatron_qwen2_config.yaml' + ), + required=False, + help="Path config for restoring. It's created during training and may need to be modified during restore if restore environment is different than training. 
Ex: /raid/nemo_experiments/megatron_gpt/hparams.yaml", + ) + parser.add_argument("--precision", type=str, default="16", help="Model precision") + args = parser.parse_args() + return args + + +def load_config(args, qwen_config): + nemo_config = OmegaConf.load(args.hparams_file).model + if qwen_config.get('rope_theta', None): + nemo_config['rotary_base'] = qwen_config['rope_theta'] + nemo_config.encoder_seq_length = qwen_config['max_position_embeddings'] + nemo_config.num_layers = int(qwen_config['num_hidden_layers']) + nemo_config.hidden_size = qwen_config['hidden_size'] + nemo_config.ffn_hidden_size = qwen_config['intermediate_size'] + nemo_config.num_attention_heads = qwen_config['num_attention_heads'] + nemo_config.max_position_embeddings = qwen_config['max_position_embeddings'] + nemo_config.init_method_std = qwen_config['initializer_range'] + nemo_config.layernorm_epsilon = qwen_config['rms_norm_eps'] + if 'num_key_value_heads' in qwen_config: + nemo_config.num_query_groups = qwen_config['num_key_value_heads'] + nemo_config.use_cpu_initialization = True + nemo_config.activation = 'fast-swiglu' + nemo_config.tokenizer.type = str(args.input_name_or_path) + nemo_config.tokenizer.model = str(args.input_name_or_path) + '/vocab.json' + nemo_config.override_vocab_size = qwen_config['vocab_size'] + + base = 128 + while qwen_config['vocab_size'] % base != 0: + base //= 2 + nemo_config.make_vocab_size_divisible_by = base + + return nemo_config + + +def convert(args): + logging.info(f"loading checkpoint {args.input_name_or_path}") + model = Qwen2ForCausalLM.from_pretrained(args.input_name_or_path) + tokenizer = Qwen2Tokenizer.from_pretrained(args.input_name_or_path) + hf_config = vars(model.config) + print(f"hf_config: {hf_config}") + print("named parameters:") + for name, param in model.named_parameters(): + print(f"- {name}") + + nemo_config = load_config(args, hf_config) + + if args.precision in ["32", "16"]: + precision = int(float(args.precision)) + elif args.precision in ["bf16", "bf16-mixed"]: + if torch.cuda.is_available() and torch.cuda.is_bf16_supported(): + precision = args.precision + else: + logging.warning("BF16 is not supported on this device. Using FP16 instead.") + precision = args.precision[2:] # prune bf in string + else: + precision = args.precision + + plugins = [] + if precision in [16, '16', 'bf16', '16-mixed', 'bf16-mixed']: + scaler = None + if precision in [16, '16', '16-mixed']: + scaler = GradScaler( + init_scale=nemo_config.get('native_amp_init_scale', 2**32), + growth_interval=nemo_config.get('native_amp_growth_interval', 1000), + hysteresis=nemo_config.get('hysteresis', 2), + ) + # MixedPrecisionPlugin in PTL >= 2.0 requires precision to be 16-mixed or bf16-mixed + plugin_precision = '16-mixed' + else: + plugin_precision = 'bf16-mixed' + + if nemo_config.get('megatron_amp_O2', False): + plugins.append(MegatronHalfPrecisionPlugin(precision=plugin_precision, device='cuda', scaler=scaler)) + else: + plugins.append(PipelineMixedPrecisionPlugin(precision=plugin_precision, device='cuda', scaler=scaler)) + + nemo_config.precision = precision + print(f"nemo_config: {nemo_config}") + + # Remove precision arg, since with PTL >= 2.1 both precision and precision plugin cannot exist together. 
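+    # Summary of the setup above: a GradScaler is only created for fp16-style precisions,
+    # megatron_amp_O2 selects MegatronHalfPrecisionPlugin, and PipelineMixedPrecisionPlugin is
+    # used otherwise. The Trainer below merely hosts the model during conversion, so it runs on CPU.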
+ trainer = Trainer(plugins=plugins, accelerator='cpu', strategy=NLPDDPStrategy()) + + hidden_size = hf_config["hidden_size"] + head_num = hf_config["num_attention_heads"] + head_size = hidden_size // head_num + num_layers = hf_config["num_hidden_layers"] + + mcore_gpt = nemo_config.mcore_gpt + + assert mcore_gpt == nemo_config.get( + 'transformer_engine', False + ), "mcore_gpt transformer_engine must be enabled (or disabled) together." + + param_to_weights = lambda param: param.float() + + checkpoint = OrderedDict() + checkpoint['state_dict'] = OrderedDict() + + embed_weight = model.state_dict()[f'model.embed_tokens.weight'] + if mcore_gpt: + embed_weights_base_name = f'model.embedding.word_embeddings.weight' + else: + embed_weights_base_name = f'model.language_model.embedding.word_embeddings.weight' + checkpoint['state_dict'][embed_weights_base_name] = param_to_weights(embed_weight) + + # in hf, this is defined as register_buffer(..., persistent=False) so it won't be in the state dict + if f'model.layers.0.self_attn.rotary_emb.inv_freq' in model.state_dict(): + rotary_embed_weight = model.state_dict()[f'model.layers.0.self_attn.rotary_emb.inv_freq'] + if mcore_gpt: + rotary_embed_weight_base_name = f'model.rotary_pos_emb.inv_freq' + else: + rotary_embed_weight_base_name = f'model.language_model.rotary_pos_emb.inv_freq' + checkpoint['state_dict'][rotary_embed_weight_base_name] = param_to_weights(rotary_embed_weight) + + if nemo_config.num_query_groups is None or nemo_config.num_query_groups == head_num: + num_query_groups = head_num + else: + num_query_groups = nemo_config.num_query_groups + assert head_num % num_query_groups == 0, 'head_num must be divisible by num_query_groups' + if mcore_gpt: + assert nemo_config.activation.startswith('fast-'), 'mcore only supports fast version of gated linear unit.' 
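+    # The loop below repacks HF's separate q/k/v projection weights (and biases) into the single
+    # interleaved QKV tensor expected by NeMo/mcore: for each of the num_query_groups groups it
+    # concatenates heads_per_group query heads, then one key head, then one value head, so the
+    # packed weight has head_size * (head_num + 2 * num_query_groups) rows. As an illustration,
+    # head_num=16 with num_query_groups=4 yields four [Q, Q, Q, Q, K, V] blocks.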
+ + for l in range(int(num_layers)): + print(f"converting layer {l}") + old_tensor_shape = model.state_dict()[f'model.layers.{l}.self_attn.q_proj.weight'].size() + new_q_tensor_shape = (head_num, head_size) + old_tensor_shape[1:] + new_kv_tensor_shape = (num_query_groups, head_size) + old_tensor_shape[1:] + q = model.state_dict()[f'model.layers.{l}.self_attn.q_proj.weight'].view(*new_q_tensor_shape) + k = model.state_dict()[f'model.layers.{l}.self_attn.k_proj.weight'].view(*new_kv_tensor_shape) + v = model.state_dict()[f'model.layers.{l}.self_attn.v_proj.weight'].view(*new_kv_tensor_shape) + qkv_weights = torch.empty((0, head_size) + old_tensor_shape[1:]) + heads_per_group = head_num // num_query_groups + for i in range(num_query_groups): + qkv_weights = torch.cat((qkv_weights, q[i * heads_per_group : (i + 1) * heads_per_group, :, :])) + qkv_weights = torch.cat((qkv_weights, k[i : i + 1, :, :])) + qkv_weights = torch.cat((qkv_weights, v[i : i + 1, :, :])) + qkv_weights = qkv_weights.reshape([head_size * (head_num + 2 * num_query_groups), hidden_size]) + if mcore_gpt: + qkv_weights_base_name = f'model.decoder.layers.{l}.self_attention.linear_qkv.weight' + else: + qkv_weights_base_name = f'model.language_model.encoder.layers.{l}.self_attention.query_key_value.weight' + checkpoint['state_dict'][qkv_weights_base_name] = param_to_weights(qkv_weights) + + new_q_tensor_shape = (head_num, head_size) + new_kv_tensor_shape = (num_query_groups, head_size) + q = model.state_dict()[f'model.layers.{l}.self_attn.q_proj.bias'].view(*new_q_tensor_shape) + k = model.state_dict()[f'model.layers.{l}.self_attn.k_proj.bias'].view(*new_kv_tensor_shape) + v = model.state_dict()[f'model.layers.{l}.self_attn.v_proj.bias'].view(*new_kv_tensor_shape) + qkv_bias = torch.empty((0, head_size)) + heads_per_group = head_num // num_query_groups + for i in range(num_query_groups): + qkv_bias = torch.cat((qkv_bias, q[i * heads_per_group : (i + 1) * heads_per_group, :])) + qkv_bias = torch.cat((qkv_bias, k[i : i + 1, :])) + qkv_bias = torch.cat((qkv_bias, v[i : i + 1, :])) + qkv_bias = qkv_bias.reshape( + [ + head_size * (head_num + 2 * num_query_groups), + ] + ) + if mcore_gpt: + qkv_bias_base_name = f'model.decoder.layers.{l}.self_attention.linear_qkv.bias' + else: + qkv_bias_base_name = f'model.language_model.encoder.layers.{l}.self_attention.query_key_value.bias' + checkpoint['state_dict'][qkv_bias_base_name] = param_to_weights(qkv_bias) + + # attention dense + o_weight = model.state_dict()[f'model.layers.{l}.self_attn.o_proj.weight'] + if mcore_gpt: + o_weight_base_name = f'model.decoder.layers.{l}.self_attention.linear_proj.weight' + else: + o_weight_base_name = f'model.language_model.encoder.layers.{l}.self_attention.dense.weight' + checkpoint['state_dict'][o_weight_base_name] = param_to_weights(o_weight) + + # MLP + mlp_down_weight = model.state_dict()[f'model.layers.{l}.mlp.gate_proj.weight'] + mlp_gate_weight = model.state_dict()[f'model.layers.{l}.mlp.up_proj.weight'] + if mcore_gpt: + mlp_down_base_name = f'model.decoder.layers.{l}.mlp.linear_fc1.weight' + else: + mlp_down_base_name = f'model.language_model.encoder.layers.{l}.mlp.dense_h_to_4h.weight' + mlp_down_weight = torch.cat((mlp_down_weight, mlp_gate_weight), axis=0) + checkpoint['state_dict'][mlp_down_base_name] = param_to_weights(mlp_down_weight) + + mlp_up_weight = model.state_dict()[f'model.layers.{l}.mlp.down_proj.weight'] + if mcore_gpt: + mlp_up_base_name = f'model.decoder.layers.{l}.mlp.linear_fc2.weight' + else: + mlp_up_base_name = 
f'model.language_model.encoder.layers.{l}.mlp.dense_4h_to_h.weight' + checkpoint['state_dict'][mlp_up_base_name] = param_to_weights(mlp_up_weight) + + # LayerNorm + input_ln_weight = model.state_dict()[f'model.layers.{l}.input_layernorm.weight'] + if mcore_gpt: + input_ln_base_name = f'model.decoder.layers.{l}.self_attention.linear_qkv.layer_norm_weight' + else: + input_ln_base_name = f'model.language_model.encoder.layers.{l}.input_layernorm.weight' + checkpoint['state_dict'][input_ln_base_name] = param_to_weights(input_ln_weight) + + post_attn_ln_weight = model.state_dict()[f'model.layers.{l}.post_attention_layernorm.weight'] + if mcore_gpt: + post_attn_ln_base_name = f'model.decoder.layers.{l}.mlp.linear_fc1.layer_norm_weight' + else: + post_attn_ln_base_name = f'model.language_model.encoder.layers.{l}.post_attention_layernorm.weight' + checkpoint['state_dict'][post_attn_ln_base_name] = param_to_weights(post_attn_ln_weight) + + print(f"done layer {l}") + + final_ln_weight = model.state_dict()[f'model.norm.weight'] + if mcore_gpt: + final_ln_base_name = f'model.decoder.final_layernorm.weight' + else: + final_ln_base_name = f'model.language_model.encoder.final_layernorm.weight' + checkpoint['state_dict'][final_ln_base_name] = param_to_weights(final_ln_weight) + + output_layer_weight = model.state_dict()[f'lm_head.weight'] + if mcore_gpt: + output_layer_base_name = f'model.output_layer.weight' + else: + output_layer_base_name = f'model.language_model.output_layer.weight' + checkpoint['state_dict'][output_layer_base_name] = param_to_weights(output_layer_weight) + + checkpoint[MegatronGPTModel.CHECKPOINT_HYPER_PARAMS_KEY] = nemo_config + + del model + + if nemo_config.get('megatron_amp_O2', False): + keys = list(checkpoint['state_dict'].keys()) + for key in keys: + checkpoint['state_dict'][key.replace('model.', 'model.module.', 1)] = checkpoint['state_dict'].pop(key) + + model = load_state_dict_helper(MegatronGPTModel, nemo_config, trainer, checkpoint['state_dict']) + + model._save_restore_connector = NLPSaveRestoreConnector() + + # cast to target precision and disable cpu init + dtype = torch_dtype_from_precision(precision) + model = model.to(dtype=dtype) + model.cfg.use_cpu_initialization = False + + model.save_to(args.output_path) + logging.info(f'NeMo model saved to: {args.output_path}') + + +if __name__ == '__main__': + args = get_args() + convert(args) diff --git a/scripts/checkpoint_converters/convert_qwen2_nemo_to_hf.py b/scripts/checkpoint_converters/convert_qwen2_nemo_to_hf.py new file mode 100644 index 000000000000..c6a218020c21 --- /dev/null +++ b/scripts/checkpoint_converters/convert_qwen2_nemo_to_hf.py @@ -0,0 +1,307 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
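+# Note: this script is the inverse of convert_qwen2_hf_to_nemo.py above. It slices the packed
+# NeMo/mcore QKV tensor back into separate q/k/v projections and splits the fused linear_fc1
+# weight into HF's gate_proj and up_proj.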
+ +import os +from argparse import ArgumentParser +from collections import OrderedDict + +import torch +from pytorch_lightning import Trainer +from transformers import Qwen2ForCausalLM, Qwen2Tokenizer, Qwen2TokenizerFast, convert_slow_tokenizer + +from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel +from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy +from nemo.utils import logging + +""" +Script to convert a QWen2 checkpoint in nemo (mcore path) into a HuggingFace checkpoint. +This script can be used to 1) generate only the HF weights, or 2) generate an entire HF model folder. + +1) Generate only HF weights from a nemo file: + + python convert_qwen2_nemo_to_hf.py \ + --input_name_or_path /path/to/file.nemo or /path/to/extracted_folder \ + --output_path /path/to/pytorch_model.bin + +2) Generate the full HF model folder + + python convert_qwen2_nemo_to_hf.py \ + --input_name_or_path /path/to/file.nemo or /path/to/extracted_folder \ + --output_path /path/to/pytorch_model.bin \ + --hf_input_path /path/to/input_hf_folder \ + --hf_output_path /path/to/output_hf_folder \ + --input_tokenizer /path/to/tokenizer \ + --hf_output_tokenizer /path/to/output_tokenizer \ + + Use the --cpu-only flag if the model cannot fit in the GPU (e.g. qwen1.5 72b). + However this option makes the conversion script significantly slower. +""" + + +def get_args(): + parser = ArgumentParser() + parser.add_argument( + "--input_name_or_path", + type=str, + default=None, + required=True, + help="Path to .nemo file or extracted folder", + ) + parser.add_argument("--output_path", type=str, default=None, required=True, help="Path to HF .bin file") + parser.add_argument( + "--hf_input_path", + type=str, + default=None, + help="A HF model path, " "e.g. a folder containing https://huggingface.co/Qwen/Qwen1.5-72B/tree/main", + ) + parser.add_argument( + "--hf_output_path", + type=str, + default=None, + help="Output HF model path, " "with the same format as above but user's own weights", + ) + parser.add_argument( + "--input_tokenizer", + type=str, + default=None, + help="Path to tokenizer used for the input nemo model. (need to extract the .nemo file first)", + ) + parser.add_argument( + "--hf_output_tokenizer", + type=str, + default=None, + help="Path to save the tokenizer used for the output HF model.", + ) + parser.add_argument( + "--precision", + type=str, + default=None, + help="Precision of output weights." + "Defaults to precision of the input nemo weights (model.cfg.trainer.precision)", + ) + parser.add_argument( + "--cpu-only", + action="store_true", + help="Load model in cpu only. Useful if the model cannot fit in GPU memory, " + "but this option makes the conversion script significantly slower.", + ) + args = parser.parse_args() + return args + + +def convert(input_nemo_file, output_hf_file, precision=None, cpu_only=False) -> None: + """ + Convert NeMo weights to HF weights + """ + dummy_trainer = Trainer(devices=1, accelerator='cpu', strategy=NLPDDPStrategy()) + model_config = MegatronGPTModel.restore_from(input_nemo_file, trainer=dummy_trainer, return_config=True) + model_config.tensor_model_parallel_size = 1 + model_config.pipeline_model_parallel_size = 1 + if cpu_only: + map_location = torch.device('cpu') + model_config.use_cpu_initialization = True + else: + map_location = None + + if cpu_only: + logging.info("******** Loading model on CPU. 
This will take a significant amount of time.") + model = MegatronGPTModel.restore_from( + input_nemo_file, trainer=dummy_trainer, override_config_path=model_config, map_location=map_location + ) + if precision is None: + precision = model.cfg.precision + if precision in [32, "32"]: + dtype = torch.float32 + elif precision in [16, "16", "16-mixed"]: + dtype = torch.float16 + elif precision in ["bf16", "bf16-mixed"]: + dtype = torch.bfloat16 + else: + logging.warning(f"Precision string {precision} is not recognized, falling back to fp32") + dtype = torch.float32 # fallback + logging.info(f"Using precision {dtype}") + + param_to_weights = lambda param: param.to(dtype) + checkpoint = OrderedDict() + + hidden_size = model.cfg.hidden_size + head_num = model.cfg.num_attention_heads + num_layers = model.cfg.num_layers + ffn_hidden_size = model.cfg.ffn_hidden_size + num_query_groups = model.cfg.get("num_query_groups", head_num) + + head_size = hidden_size // head_num + heads_per_group = head_num // num_query_groups + qkv_total_dim = head_num + 2 * num_query_groups + + # Embedding + embed_weight = model.state_dict()[f'model.embedding.word_embeddings.weight'] + embed_weights_base_name = f'model.embed_tokens.weight' + checkpoint[embed_weights_base_name] = param_to_weights(embed_weight) + + for l in range(int(num_layers)): + print(f"converting layer {l}") + # qkv weight + qkv_weights = model.state_dict()[f'model.decoder.layers.{l}.self_attention.linear_qkv.weight'] + qkv_weights = qkv_weights.reshape([qkv_total_dim, head_size, hidden_size]) + + q_slice = torch.cat( + [ + torch.arange((heads_per_group + 2) * i, (heads_per_group + 2) * i + heads_per_group) + for i in range(num_query_groups) + ] + ) + k_slice = torch.arange(heads_per_group, qkv_total_dim, (heads_per_group + 2)) + v_slice = torch.arange(heads_per_group + 1, qkv_total_dim, (heads_per_group + 2)) + + q_weights_base_name = f'model.layers.{l}.self_attn.q_proj.weight' + k_weights_base_name = f'model.layers.{l}.self_attn.k_proj.weight' + v_weights_base_name = f'model.layers.{l}.self_attn.v_proj.weight' + + checkpoint[q_weights_base_name] = param_to_weights(qkv_weights[q_slice].reshape(-1, hidden_size)) + checkpoint[k_weights_base_name] = param_to_weights(qkv_weights[k_slice].reshape(-1, hidden_size)) + checkpoint[v_weights_base_name] = param_to_weights(qkv_weights[v_slice].reshape(-1, hidden_size)) + + # qkv bias + qkv_bias = model.state_dict()[f'model.decoder.layers.{l}.self_attention.linear_qkv.bias'] + qkv_bias = qkv_bias.reshape([qkv_total_dim, head_size]) + + q_slice = torch.cat( + [ + torch.arange((heads_per_group + 2) * i, (heads_per_group + 2) * i + heads_per_group) + for i in range(num_query_groups) + ] + ) + k_slice = torch.arange(heads_per_group, qkv_total_dim, (heads_per_group + 2)) + v_slice = torch.arange(heads_per_group + 1, qkv_total_dim, (heads_per_group + 2)) + + q_bias_base_name = f'model.layers.{l}.self_attn.q_proj.bias' + k_bias_base_name = f'model.layers.{l}.self_attn.k_proj.bias' + v_bias_base_name = f'model.layers.{l}.self_attn.v_proj.bias' + + checkpoint[q_bias_base_name] = param_to_weights( + qkv_bias[q_slice].reshape( + -1, + ) + ) + checkpoint[k_bias_base_name] = param_to_weights( + qkv_bias[k_slice].reshape( + -1, + ) + ) + checkpoint[v_bias_base_name] = param_to_weights( + qkv_bias[v_slice].reshape( + -1, + ) + ) + + # attention dense + o_weight = model.state_dict()[f'model.decoder.layers.{l}.self_attention.linear_proj.weight'] + o_weight_base_name = f'model.layers.{l}.self_attn.o_proj.weight' + 
checkpoint[o_weight_base_name] = param_to_weights(o_weight) + + # mlp + mlp_weights = model.state_dict()[f'model.decoder.layers.{l}.mlp.linear_fc1.weight'] + mlp_down_proj_weight = mlp_weights[:ffn_hidden_size, :] + mlp_gate_proj_weight = mlp_weights[ffn_hidden_size:, :] + + mlp_down_proj_base_name = f'model.layers.{l}.mlp.gate_proj.weight' + mlp_gate_proj_base_name = f'model.layers.{l}.mlp.up_proj.weight' + + checkpoint[mlp_down_proj_base_name] = param_to_weights(mlp_down_proj_weight) + checkpoint[mlp_gate_proj_base_name] = param_to_weights(mlp_gate_proj_weight) + + mlp_up_proj_weight = model.state_dict()[f'model.decoder.layers.{l}.mlp.linear_fc2.weight'] + mlp_up_proj_base_name = f'model.layers.{l}.mlp.down_proj.weight' + checkpoint[mlp_up_proj_base_name] = param_to_weights(mlp_up_proj_weight) + + # layernorm + input_ln_weight = model.state_dict()[f'model.decoder.layers.{l}.self_attention.linear_qkv.layer_norm_weight'] + input_ln_base_name = f'model.layers.{l}.input_layernorm.weight' + checkpoint[input_ln_base_name] = param_to_weights(input_ln_weight) + + post_attn_ln_weight = model.state_dict()[f'model.decoder.layers.{l}.mlp.linear_fc1.layer_norm_weight'] + post_attn_ln_base_name = f'model.layers.{l}.post_attention_layernorm.weight' + checkpoint[post_attn_ln_base_name] = param_to_weights(post_attn_ln_weight) + + print(f"done layer {l}") + + final_ln_weight = model.state_dict()[f'model.decoder.final_layernorm.weight'] + final_ln_base_name = f'model.norm.weight' + checkpoint[final_ln_base_name] = param_to_weights(final_ln_weight) + + output_layer_weight = model.state_dict()[f'model.output_layer.weight'] + output_layer_base_name = f'lm_head.weight' + checkpoint[output_layer_base_name] = param_to_weights(output_layer_weight) + + os.makedirs(os.path.dirname(output_hf_file), exist_ok=True) + torch.save(checkpoint, output_hf_file) + logging.info(f"Weights saved to {output_hf_file}") + + return dtype + + +def replace_hf_weights_and_tokenizer( + weights_file, + dtype, + input_hf_path, + output_hf_path, + tokenizer_path, + output_hf_tokenizer, +): + model = Qwen2ForCausalLM.from_pretrained( + input_hf_path, + local_files_only=True, + torch_dtype=dtype, + ) + nemo_exported = torch.load(weights_file) + + if tokenizer_path: + tokenizer = Qwen2Tokenizer.from_pretrained( + tokenizer_path, + local_files_only=True, + legacy=False, + ) + tmp_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer) + fast_tokenizer = Qwen2TokenizerFast(tokenizer_object=tmp_tokenizer) + tokenizer_length = len(fast_tokenizer) + model.resize_token_embeddings(tokenizer_length) + + model.load_state_dict(nemo_exported) + model.save_pretrained(output_hf_path) + logging.info(f"Full HF model saved to {output_hf_path}") + + if tokenizer_path: + fast_tokenizer.save_pretrained(output_hf_tokenizer) + tokenizer.save_pretrained(output_hf_tokenizer) + logging.info(f"Tokenizer saved to {output_hf_tokenizer}") + + +if __name__ == '__main__': + args = get_args() + if not args.hf_output_tokenizer and args.hf_output_path: + args.hf_output_tokenizer = args.hf_output_path + dtype = convert(args.input_name_or_path, args.output_path, precision=args.precision, cpu_only=args.cpu_only) + if args.hf_input_path and args.hf_output_path: + replace_hf_weights_and_tokenizer( + args.output_path, + dtype, + args.hf_input_path, + args.hf_output_path, + args.input_tokenizer, + args.hf_output_tokenizer, + ) + else: + logging.info("`hf_input_path` and/or `hf_output_path` not provided, not generating full HF model.") + logging.info(f".bin file is 
saved to {args.output_path}") From 18eed4d782fda1197d33358a2820fb9ba87731e8 Mon Sep 17 00:00:00 2001 From: mikolajblaz Date: Fri, 17 May 2024 10:26:56 +0200 Subject: [PATCH 26/36] Implement export with PyT Distributed checkpoints (#9058) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Implement PyT Dist load with MCore Signed-off-by: Mikołaj Błaż * Use plain PyT Dist utils Signed-off-by: Mikołaj Błaż * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Implement TarPath compatible version Signed-off-by: Mikołaj Błaż * Apply black Signed-off-by: Mikołaj Błaż --------- Signed-off-by: Mikołaj Błaż Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- nemo/export/trt_llm/nemo/nemo_ckpt_convert.py | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py b/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py index 44133de381bd..8112bb8755e3 100644 --- a/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py +++ b/nemo/export/trt_llm/nemo/nemo_ckpt_convert.py @@ -14,6 +14,7 @@ import configparser +import json import logging import math import multiprocessing @@ -28,6 +29,8 @@ import torch import zarr from tensorrt_llm._utils import np_bfloat16, pad_vocab_size, str_dtype_to_torch, torch_to_numpy +from torch.distributed.checkpoint import FileSystemReader, TensorStorageMetadata +from torch.distributed.checkpoint.state_dict_loader import load_state_dict from tqdm import tqdm from transformers import AutoTokenizer, GPT2Tokenizer, LlamaConfig @@ -122,6 +125,54 @@ def rename_key_dist_ckpt(old_key: str, layer: int): def load_sharded_metadata(checkpoint_dir: Union[Path, TarPath], torch_tensor=True): + with (checkpoint_dir / 'metadata.json').open(mode='r') as f: + config_dict = json.load(f) + if config_dict['sharded_backend'] == 'zarr': + return load_sharded_metadata_zarr(checkpoint_dir, torch_tensor) + elif config_dict['sharded_backend'] == 'torch_dist': + return load_sharded_metadata_torch_dist(checkpoint_dir, torch_tensor) + else: + raise NotImplementedError(f'Distributed checkpoint backend {config_dict["sharded_backend"]} not supported') + + +class TarFileSystemReader(FileSystemReader): + """Reader that accepts both Path and TarPath checkpoint directory. + + The FileSystemReader works with TarPath, but expects a pure Path. + It's enough to skip the Path check in __init__. 
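+    Only path and an empty storage_data dict are set here; all reading logic is inherited
+    from FileSystemReader unchanged.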
+ """ + + def __init__(self, path: Union[Path, TarPath]) -> None: + """No call to super().__init__ because it expects pure Path.""" + self.path = path + self.storage_data = dict() + + +def load_sharded_metadata_torch_dist(checkpoint_dir: Union[Path, TarPath], torch_tensor=True): + fs_reader = TarFileSystemReader(checkpoint_dir) + metadata = fs_reader.read_metadata() + + state_dict = { + k: torch.empty(tp.size, dtype=tp.properties.dtype) + for k, tp in metadata.state_dict_metadata.items() + if isinstance(tp, TensorStorageMetadata) + } + load_state_dict( + state_dict, + storage_reader=fs_reader, + no_dist=True, + ) + + if not torch_tensor: + for k, v in state_dict.items(): + if v.dtype == torch.bfloat16: + state_dict[k] = v.view(torch.int16).numpy().view(np_bfloat16) + else: + state_dict[k] = v.numpy() + return state_dict + + +def load_sharded_metadata_zarr(checkpoint_dir: Union[Path, TarPath], torch_tensor=True): sharded_state_dict = {} for subdir in checkpoint_dir.iterdir(): if not subdir.is_dir() or not (subdir / '.zarray').exists(): From ce1612d8a75d15cb3bf2e5a97d65f3f82e1bad2f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?oliver=20k=C3=B6nig?= Date: Fri, 17 May 2024 17:34:37 +0200 Subject: [PATCH 27/36] ci: Multi-tenancy for tests and garbage collection (#9179) * ci: Multi-tenancy for tests and garbage collection Signed-off-by: Oliver Koenig * add remaining testcases Signed-off-by: Oliver Koenig --------- Signed-off-by: Oliver Koenig --- .github/workflows/cicd-main.yml | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml index ed2fc9f71f49..4efb525100d9 100644 --- a/.github/workflows/cicd-main.yml +++ b/.github/workflows/cicd-main.yml @@ -225,6 +225,10 @@ jobs: --input_name_or_path=/home/TestData/nlp/megatron_llama/llama-ci-hf-tiny \ --output_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \ --precision=16 + - name: Cleanup + if: "always()" + run: | + rm -rf /home/TestData/nlp/megatron_llama/model_weights - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" @@ -251,6 +255,10 @@ jobs: --output_path=/home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo \ --precision=16 rm -f /home/TestData/nlp/megatron_llama/llama3-ci-hf/llama3_ci.nemo + - name: Cleanup + if: "always()" + run: | + rm -rf /home/TestData/nlp/megatron_llama/llama3-ci-hf/model_weights - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" @@ -272,10 +280,19 @@ jobs: - name: Checkout repository uses: actions/checkout@v4 - run: | + mkdir -p /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}; python scripts/checkpoint_converters/convert_starcoder_hf_to_nemo.py \ --input_name_or_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf \ - --output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf - rm -f /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo + --output_path /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }} + - name: Cleanup + if: "always()" + run: | + rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/megatron_starcoder_tp1_pp1.nemo; + rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/${{ github.run_id }}/ + - name: Cleanup + if: "always()" + run: | + rm -rf /home/TestData/nlp/megatron_gpt/starcoder-ci-hf/model_weights - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" @@ -301,6 +318,10 @@ jobs: --input_name_or_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf \ 
--output_path /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo rm -f /home/TestData/nlp/megatron_gpt/falcon-ci-hf/falcon_ci.nemo + - name: Cleanup + if: "always()" + run: | + rm -rf /home/TestData/nlp/megatron_gpt/falcon-ci-hf/model_weights - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main" if: "failure()" From eb31309851aa6e5db3a3292a411cee1018ed536a Mon Sep 17 00:00:00 2001 From: tbartley94 <90423858+tbartley94@users.noreply.github.com> Date: Fri, 17 May 2024 08:38:19 -0700 Subject: [PATCH 28/36] Lhotse Sharding Fix (#9187) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * quick fix Signed-off-by: Travis Bartley * Update nemo/collections/common/data/lhotse/nemo_adapters.py Signed-off-by: Piotr Żelasko * adding warning flag for non-sharded data. Signed-off-by: Travis Bartley * Apply isort and black reformatting Signed-off-by: tbartley94 --------- Signed-off-by: Travis Bartley Signed-off-by: Piotr Żelasko Signed-off-by: tbartley94 Co-authored-by: Piotr Żelasko Co-authored-by: tbartley94 --- .../common/data/lhotse/nemo_adapters.py | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/nemo/collections/common/data/lhotse/nemo_adapters.py b/nemo/collections/common/data/lhotse/nemo_adapters.py index b2ca1186c8e3..d24ce794da5a 100644 --- a/nemo/collections/common/data/lhotse/nemo_adapters.py +++ b/nemo/collections/common/data/lhotse/nemo_adapters.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. +import logging import random import re import tarfile @@ -197,6 +198,12 @@ def __init__( self.shard_id_to_manifest: dict[int, Iterable[dict]] self.paths = expand_sharded_filepaths(manifest_path) if len(self.paths) == 1: + logging.warning( + f"""You are using Lhotse dataloading for tarred audio with a non-sharded manifest. + This will incur significant memory overhead and slow-down training. To prevent this error message + please shard file '{self.paths[0]}' using 'scripts/speech_recognition/convert_to_tarred_audio_dataset.py' + WITHOUT '--no_shard_manifest'""" + ) self.source = LazyJsonlIterator(self.paths[0]) self.shard_id_to_manifest = groupby("shard_id", self.source) else: @@ -272,15 +279,16 @@ def __iter__(self) -> Generator[Cut, None, None]: random.Random(seed).shuffle(shard_ids) for sid in shard_ids: - shard_manifest = self.shard_id_to_manifest[sid] + manifest_path = self.paths[sid] if len(self.paths) > 1 else self.paths[0] + shard_manifest = {data["audio_filepath"]: data for data in self.shard_id_to_manifest[sid]} tar_path = self.shard_id_to_tar_path[sid] with tarfile.open(fileobj=open_best(tar_path, mode="rb"), mode="r|*") as tar: - for data, tar_info in zip(shard_manifest, tar): - manifest_path = self.paths[sid] if len(self.paths) > 1 else self.paths[0] - assert data["audio_filepath"] == tar_info.name, ( + for tar_info in tar: + assert tar_info.name in shard_manifest, ( f"Mismatched entry between JSON manifest ('{manifest_path}') and tar file ('{tar_path}'). 
" - f"Conflicting audio file names are JSON='{data['audio_filepath']}' and TAR='{tar_info.name}'" + f"Cannot locate JSON entry for tar file '{tar_info.name}'" ) + data = shard_manifest[tar_info.name] raw_audio = tar.extractfile(tar_info).read() # Note: Lhotse has a Recording.from_bytes() utility that we won't use here because # the profiling indicated significant overhead in torchaudio ffmpeg integration From 7f3e535fa9f75467152b642c9a90587d7ac30bb5 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Fri, 17 May 2024 09:10:56 -0700 Subject: [PATCH 29/36] fix graphviz installation for local run (#9233) (#9234) Signed-off-by: andrusenkoau Co-authored-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com> --- tutorials/asr/ASR_Context_Biasing.ipynb | 40 ++++++++++++++----------- 1 file changed, 23 insertions(+), 17 deletions(-) diff --git a/tutorials/asr/ASR_Context_Biasing.ipynb b/tutorials/asr/ASR_Context_Biasing.ipynb index bca4585e45cb..dd2e8176ad33 100644 --- a/tutorials/asr/ASR_Context_Biasing.ipynb +++ b/tutorials/asr/ASR_Context_Biasing.ipynb @@ -259,6 +259,7 @@ "execution_count": null, "id": "d34ee0ba", "metadata": { + "collapsed": true, "jupyter": { "outputs_hidden": true }, @@ -717,6 +718,28 @@ "The context graph consists of a composition of a prefix tree (Trie) with the CTC transition topology for words and phrases from the context-biasing list. We use a BPE tokenizer from the target ASR model for word segmentation." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "55a36a27-919c-4d64-9163-b0b2c9dca15e", + "metadata": {}, + "outputs": [], + "source": [ + "# install graphviz from source in case of local run (not Google Colab)\n", + "# this may take about 5-10 minutes\n", + "# make sure that env variables have been set\n", + "\n", + "if not IN_COLAB:\n", + "\n", + " os.environ['DEBIAN_FRONTEND'] = 'noninteractive'\n", + " os.environ['TZ'] = 'Etc/UTC'\n", + "\n", + " !echo $DEBIAN_FRONTEND\n", + " !echo $TZ\n", + "\n", + " !{NEMO_DIR_PATH}/scripts/installers/install_graphviz.sh" + ] + }, { "cell_type": "code", "execution_count": null, @@ -750,23 +773,6 @@ "context_graph.draw()" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "e1c57878", - "metadata": {}, - "outputs": [], - "source": [ - "# install graphviz from source if you have problems with graph picture\n", - "# set instal_graphviz = True\n", - "# this may take about 5-10 minutes\n", - "\n", - "instal_graphviz = False\n", - "\n", - "if instal_graphviz:\n", - " !{NEMO_DIR_PATH}/scripts/installers/install_graphviz.sh" - ] - }, { "cell_type": "markdown", "id": "04a6f4be", From 67401eda531bf12b184e0482a847d1298f48480c Mon Sep 17 00:00:00 2001 From: Somshubra Majumdar Date: Fri, 17 May 2024 09:18:36 -0700 Subject: [PATCH 30/36] Support dataloader as input to `audio` for transcription (#9201) * Support dataloader as input to `audio` for transcription Signed-off-by: smajumdar * Apply isort and black reformatting Signed-off-by: titu1994 * Support dataloader as input to `audio` for transcription Signed-off-by: smajumdar * Update transcribe signatures Signed-off-by: smajumdar * Apply isort and black reformatting Signed-off-by: titu1994 --------- Signed-off-by: smajumdar Signed-off-by: titu1994 --- .../asr/models/aed_multitask_models.py | 12 +++-- .../asr/models/classification_models.py | 20 +++++--- nemo/collections/asr/models/ctc_models.py | 29 +++++++++--- .../asr/models/hybrid_rnnt_ctc_models.py | 12 +++-- 
nemo/collections/asr/models/rnnt_models.py | 28 +++++++---- nemo/collections/asr/models/slu_models.py | 15 +++--- .../asr/models/transformer_bpe_models.py | 12 +++-- .../asr/parts/mixins/transcription.py | 9 +++- .../asr/mixins/test_transcription.py | 46 +++++++++++++++++++ 9 files changed, 139 insertions(+), 44 deletions(-) diff --git a/nemo/collections/asr/models/aed_multitask_models.py b/nemo/collections/asr/models/aed_multitask_models.py index f9413a4dd738..b11d744a7e6a 100644 --- a/nemo/collections/asr/models/aed_multitask_models.py +++ b/nemo/collections/asr/models/aed_multitask_models.py @@ -21,6 +21,7 @@ import torch from omegaconf import DictConfig, OmegaConf, open_dict from pytorch_lightning import Trainer +from torch.utils.data import DataLoader from nemo.collections.asr.data.audio_to_text_lhotse_prompted import ( PromptedAudioToTextLhotseDataset, @@ -156,7 +157,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): self.transf_encoder = EncDecMultiTaskModel.from_config_dict(transf_encoder_cfg_dict) # Initialize weights - std_init_range = 1 / self.cfg.model_defaults.lm_enc_hidden ** 0.5 + std_init_range = 1 / self.cfg.model_defaults.lm_enc_hidden**0.5 self.transf_encoder.apply(lambda module: transformer_weights_init(module, std_init_range)) transf_decoder_cfg_dict = cfg.transf_decoder @@ -182,7 +183,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): self.log_softmax.mlp.layer0.weight = self.transf_decoder.embedding.token_embedding.weight # Initialize weights - std_init_range = 1 / self.cfg.model_defaults.lm_dec_hidden ** 0.5 + std_init_range = 1 / self.cfg.model_defaults.lm_dec_hidden**0.5 self.transf_decoder.apply(lambda module: transformer_weights_init(module, std_init_range)) self.log_softmax.apply(lambda module: transformer_weights_init(module, std_init_range)) @@ -347,7 +348,7 @@ def change_vocabulary( self.log_softmax.mlp.layer0.weight = self.transf_decoder.embedding.token_embedding.weight # Initialize weights of token classifier - std_init_range = 1 / self.cfg.model_defaults.lm_dec_hidden ** 0.5 + std_init_range = 1 / self.cfg.model_defaults.lm_dec_hidden**0.5 self.log_softmax.apply(lambda module: transformer_weights_init(module, std_init_range)) # Setup Decoding class @@ -387,7 +388,7 @@ def change_vocabulary( @torch.no_grad() def transcribe( self, - audio: Union[List[str], str], + audio: Union[str, List[str], np.ndarray, DataLoader], batch_size: int = 4, return_hypotheses: bool = False, task: Optional[str] = None, @@ -403,7 +404,8 @@ def transcribe( """ Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping. Args: - audio: (a list) of paths to audio files. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. 
diff --git a/nemo/collections/asr/models/classification_models.py b/nemo/collections/asr/models/classification_models.py index c1294de5bdc0..7b226f59e364 100644 --- a/nemo/collections/asr/models/classification_models.py +++ b/nemo/collections/asr/models/classification_models.py @@ -15,7 +15,6 @@ import copy import json import os -import tempfile from abc import abstractmethod from dataclasses import dataclass, field from math import ceil, floor @@ -24,6 +23,7 @@ import torch from omegaconf import DictConfig, ListConfig, OmegaConf from pytorch_lightning import Trainer +from torch.utils.data import DataLoader from torchmetrics import Accuracy from torchmetrics.regression import MeanAbsoluteError, MeanSquaredError @@ -169,7 +169,8 @@ def forward( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) # Crop or pad is always applied if self.crop_or_pad is not None: @@ -355,7 +356,7 @@ def _setup_feature_label_dataloader(self, config: DictConfig) -> torch.utils.dat @torch.no_grad() def transcribe( self, - audio: List[str], + audio: Union[List[str], DataLoader], batch_size: int = 4, logprobs=None, override_config: Optional[ClassificationInferConfig] | Optional[RegressionInferConfig] = None, @@ -364,7 +365,8 @@ def transcribe( Generate class labels for provided audio files. Use this method for debugging and prototyping. Args: - audio: (a single or list) of paths to audio files or a np.ndarray audio sample. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is approximately 1 second. batch_size: (int) batch size to use during inference. \ Bigger will result in better throughput performance but would use more memory. @@ -952,7 +954,10 @@ def _setup_dataloader_from_config(self, config: DictConfig): shuffle_n = config.get('shuffle_n', 4 * config['batch_size']) if shuffle else 0 dataset = audio_to_label_dataset.get_tarred_audio_multi_label_dataset( - cfg=config, shuffle_n=shuffle_n, global_rank=self.global_rank, world_size=self.world_size, + cfg=config, + shuffle_n=shuffle_n, + global_rank=self.global_rank, + world_size=self.world_size, ) shuffle = False if hasattr(dataset, 'collate_fn'): @@ -1022,7 +1027,8 @@ def forward( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) # Crop or pad is always applied @@ -1124,7 +1130,7 @@ def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0): def reshape_labels(self, logits, labels, logits_len, labels_len): """ Reshape labels to match logits shape. For example, each label is expected to cover a 40ms frame, while each frme prediction from the - model covers 20ms. If labels are shorter than logits, labels are repeated, otherwise labels are folded and argmax is applied to obtain + model covers 20ms. If labels are shorter than logits, labels are repeated, otherwise labels are folded and argmax is applied to obtain the label of each frame. When lengths of labels and logits are not factors of each other, labels are truncated or padded with zeros. 
The ratio_threshold=0.2 is used to determine whether to pad or truncate labels, where the value 0.2 is not important as in real cases the ratio is very close to either ceil(ratio) or floor(ratio). We use 0.2 here for easier unit-testing. This implementation does not allow frame length diff --git a/nemo/collections/asr/models/ctc_models.py b/nemo/collections/asr/models/ctc_models.py index 4df02b1177cd..177da81f85f2 100644 --- a/nemo/collections/asr/models/ctc_models.py +++ b/nemo/collections/asr/models/ctc_models.py @@ -22,6 +22,7 @@ import torch from omegaconf import DictConfig, OmegaConf, open_dict from pytorch_lightning import Trainer +from torch.utils.data import DataLoader from tqdm.auto import tqdm from nemo.collections.asr.data import audio_to_text_dataset @@ -119,7 +120,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): def transcribe( self, - audio: Union[str, List[str], torch.Tensor, np.ndarray], + audio: Union[str, List[str], torch.Tensor, np.ndarray, DataLoader], batch_size: int = 4, return_hypotheses: bool = False, num_workers: int = 0, @@ -135,7 +136,8 @@ def transcribe( Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping. Args: - audio: (a single or list) of paths to audio files or a np.ndarray audio array. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. @@ -493,7 +495,8 @@ def forward( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) if self.spec_augmentation is not None and self.training: @@ -579,7 +582,9 @@ def predict_step(self, batch, batch_idx, dataloader_idx=0): log_probs, encoded_len, predictions = self.forward(input_signal=signal, input_signal_length=signal_len) transcribed_texts, _ = self.wer.decoding.ctc_decoder_predictions_tensor( - decoder_outputs=log_probs, decoder_lengths=encoded_len, return_hypotheses=False, + decoder_outputs=log_probs, + decoder_lengths=encoded_len, + return_hypotheses=False, ) sample_id = sample_id.cpu().detach().numpy() @@ -601,11 +606,19 @@ def validation_pass(self, batch, batch_idx, dataloader_idx=0): log_probs=log_probs, targets=transcript, input_lengths=encoded_len, target_lengths=transcript_len ) loss_value, metrics = self.add_interctc_losses( - loss_value, transcript, transcript_len, compute_wer=True, log_wer_num_denom=True, log_prefix="val_", + loss_value, + transcript, + transcript_len, + compute_wer=True, + log_wer_num_denom=True, + log_prefix="val_", ) self.wer.update( - predictions=log_probs, targets=transcript, targets_lengths=transcript_len, predictions_lengths=encoded_len, + predictions=log_probs, + targets=transcript, + targets_lengths=transcript_len, + predictions_lengths=encoded_len, ) wer, wer_num, wer_denom = self.wer.compute() self.wer.reset() @@ -677,7 +690,9 @@ def _transcribe_output_processing(self, outputs, trcfg: TranscribeConfig) -> Gen logits_len = outputs.pop('logits_len') current_hypotheses, all_hyp = self.decoding.ctc_decoder_predictions_tensor( - logits, decoder_lengths=logits_len, return_hypotheses=trcfg.return_hypotheses, + logits, + decoder_lengths=logits_len, + 
return_hypotheses=trcfg.return_hypotheses, ) if trcfg.return_hypotheses: if logits.is_cuda: diff --git a/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py b/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py index 3eaab9961ef8..9a5c4188aebd 100644 --- a/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py +++ b/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py @@ -111,7 +111,8 @@ def transcribe( Args: - audio: (a list) of paths to audio files. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. \ @@ -182,7 +183,9 @@ def _transcribe_output_processing( encoded_len = outputs.pop('encoded_len') best_hyp, all_hyp = self.ctc_decoding.ctc_decoder_predictions_tensor( - logits, encoded_len, return_hypotheses=trcfg.return_hypotheses, + logits, + encoded_len, + return_hypotheses=trcfg.return_hypotheses, ) logits = logits.cpu() @@ -554,7 +557,10 @@ def validation_pass(self, batch, batch_idx, dataloader_idx): loss_value = (1 - self.ctc_loss_weight) * loss_value + self.ctc_loss_weight * ctc_loss tensorboard_logs['val_loss'] = loss_value self.ctc_wer.update( - predictions=log_probs, targets=transcript, targets_lengths=transcript_len, predictions_lengths=encoded_len, + predictions=log_probs, + targets=transcript, + targets_lengths=transcript_len, + predictions_lengths=encoded_len, ) ctc_wer, ctc_wer_num, ctc_wer_denom = self.ctc_wer.compute() self.ctc_wer.reset() diff --git a/nemo/collections/asr/models/rnnt_models.py b/nemo/collections/asr/models/rnnt_models.py index 386f2a915142..cb2505fbadbf 100644 --- a/nemo/collections/asr/models/rnnt_models.py +++ b/nemo/collections/asr/models/rnnt_models.py @@ -13,16 +13,15 @@ # limitations under the License. import copy -import json import os -import tempfile from math import ceil from typing import Any, Dict, List, Optional, Tuple, Union +import numpy as np import torch from omegaconf import DictConfig, OmegaConf, open_dict from pytorch_lightning import Trainer -from tqdm.auto import tqdm +from torch.utils.data import DataLoader from nemo.collections.asr.data import audio_to_text_dataset from nemo.collections.asr.data.audio_to_text import _AudioTextDataset @@ -101,7 +100,10 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): self.cfg.decoding = self.set_decoding_type_according_to_loss(self.cfg.decoding) # Setup decoding objects self.decoding = RNNTDecoding( - decoding_cfg=self.cfg.decoding, decoder=self.decoder, joint=self.joint, vocabulary=self.joint.vocabulary, + decoding_cfg=self.cfg.decoding, + decoder=self.decoder, + joint=self.joint, + vocabulary=self.joint.vocabulary, ) # Setup WER calculation self.wer = WER( @@ -236,7 +238,7 @@ def set_decoding_type_according_to_loss(self, decoding_cfg): @torch.no_grad() def transcribe( self, - audio: List[str], + audio: Union[str, List[str], np.ndarray, DataLoader], batch_size: int = 4, return_hypotheses: bool = False, partial_hypothesis: Optional[List['Hypothesis']] = None, @@ -250,7 +252,8 @@ def transcribe( Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping. Args: - audio: (a list) of paths to audio files. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. 
+ Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. \ @@ -338,7 +341,10 @@ def change_vocabulary(self, new_vocabulary: List[str], decoding_cfg: Optional[Di decoding_cfg = self.set_decoding_type_according_to_loss(decoding_cfg) self.decoding = RNNTDecoding( - decoding_cfg=decoding_cfg, decoder=self.decoder, joint=self.joint, vocabulary=self.joint.vocabulary, + decoding_cfg=decoding_cfg, + decoder=self.decoder, + joint=self.joint, + vocabulary=self.joint.vocabulary, ) self.wer = WER( @@ -394,7 +400,10 @@ def change_decoding_strategy(self, decoding_cfg: DictConfig): decoding_cfg = self.set_decoding_type_according_to_loss(decoding_cfg) self.decoding = RNNTDecoding( - decoding_cfg=decoding_cfg, decoder=self.decoder, joint=self.joint, vocabulary=self.joint.vocabulary, + decoding_cfg=decoding_cfg, + decoder=self.decoder, + joint=self.joint, + vocabulary=self.joint.vocabulary, ) self.wer = WER( @@ -649,7 +658,8 @@ def forward( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) # Spec augment is not applied during evaluation/testing diff --git a/nemo/collections/asr/models/slu_models.py b/nemo/collections/asr/models/slu_models.py index 1303bbfde7ea..c599b7f4272a 100644 --- a/nemo/collections/asr/models/slu_models.py +++ b/nemo/collections/asr/models/slu_models.py @@ -13,15 +13,13 @@ # See the License for the specific language governing permissions and # limitations under the License. -import json import os -import tempfile from math import ceil from typing import Any, Dict, List, Optional, Union import torch from omegaconf import DictConfig, OmegaConf, open_dict -from tqdm.auto import tqdm +from torch.utils.data import DataLoader from nemo.collections.asr.data import audio_to_text_dataset from nemo.collections.asr.data.audio_to_text_dali import DALIOutputs @@ -190,7 +188,8 @@ def forward( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) if self.spec_augmentation is not None and self.training: @@ -278,7 +277,8 @@ def predict( if not has_processed_signal: processed_signal, processed_signal_length = self.preprocessor( - input_signal=input_signal, length=input_signal_length, + input_signal=input_signal, + length=input_signal_length, ) if self.spec_augmentation is not None and self.training: @@ -560,7 +560,7 @@ def _setup_transcribe_dataloader(self, config: Dict) -> 'torch.utils.data.DataLo @torch.no_grad() def transcribe( self, - audio: List[str], + audio: Union[List[str], DataLoader], batch_size: int = 4, return_hypotheses: bool = False, num_workers: int = 0, @@ -571,7 +571,8 @@ def transcribe( Use this method for debugging and prototyping. Args: - audio: (a list) of paths to audio files. \ + audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. 
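For reference, a minimal sketch of the new DataLoader-based transcription path, modeled on the unit test added at the end of this patch; the pretrained model name comes from that test, while the dataset class and audio file paths here are placeholders. When a DataLoader is passed, the internal input processing is skipped, so the loader must already yield batches the model can consume directly:

import soundfile as sf
import torch
from torch.utils.data import DataLoader, Dataset

from nemo.collections.asr.data.audio_to_text import _speech_collate_fn
from nemo.collections.asr.models import ASRModel


class RawAudioDataset(Dataset):
    """Wraps float32 audio arrays into the (samples, length, tokens, tokens_len) tuples the collate_fn expects."""

    def __init__(self, audio_arrays):
        self.audio_arrays = audio_arrays

    def __getitem__(self, index):
        samples = torch.tensor(self.audio_arrays[index])
        seq_len = torch.tensor(samples.shape[0], dtype=torch.long)
        # Dummy token targets: transcription does not use them, but the collate_fn expects them.
        return samples, seq_len, torch.tensor([0], dtype=torch.long), torch.tensor(1, dtype=torch.long)

    def __len__(self):
        return len(self.audio_arrays)


model = ASRModel.from_pretrained("stt_en_conformer_ctc_small")

audio1, _ = sf.read("sample1.wav", dtype="float32")  # placeholder paths
audio2, _ = sf.read("sample2.wav", dtype="float32")

dataloader = DataLoader(
    RawAudioDataset([audio1, audio2]),
    batch_size=2,
    shuffle=False,
    collate_fn=lambda batch: _speech_collate_fn(batch, pad_id=0),
)

# Mirrors the new unit test; the loader's own batching governs what the model sees per step.
transcriptions = model.transcribe(dataloader, batch_size=1)
print(transcriptions)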
diff --git a/nemo/collections/asr/models/transformer_bpe_models.py b/nemo/collections/asr/models/transformer_bpe_models.py index 21a5f34b3038..e7e67f8fbb2f 100644 --- a/nemo/collections/asr/models/transformer_bpe_models.py +++ b/nemo/collections/asr/models/transformer_bpe_models.py @@ -24,6 +24,7 @@ import torch.distributed as dist from omegaconf import DictConfig, OmegaConf, open_dict from pytorch_lightning import Trainer +from torch.utils.data import DataLoader from torchmetrics.text import SacreBLEUScore from tqdm.auto import tqdm @@ -141,7 +142,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): num_layers=self.cfg.head.num_layers, ) self.log_softmax.mlp.layer0.weight = self.transf_decoder.embedding.token_embedding.weight - std_init_range = 1 / self.transf_decoder.hidden_size ** 0.5 + std_init_range = 1 / self.transf_decoder.hidden_size**0.5 self.transf_decoder.apply(lambda module: transformer_weights_init(module, std_init_range)) self.log_softmax.apply(lambda module: transformer_weights_init(module, std_init_range)) @@ -174,7 +175,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None): @torch.no_grad() def transcribe( self, - audio: List[str], + audio: Union[List[str], DataLoader], batch_size: int = 4, return_hypotheses: bool = False, num_workers: int = 0, @@ -185,7 +186,8 @@ def transcribe( """ Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping. Args: - audio: (a list) of paths to audio files. \ + audio: (a list) of paths to audio files. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. \ But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. @@ -225,7 +227,9 @@ def _setup_dataloader_from_config(self, config: Optional[Dict]): config, global_rank=self.global_rank, world_size=self.world_size, - dataset=LhotseSpeechToTextBpeDataset(tokenizer=self.tokenizer,), + dataset=LhotseSpeechToTextBpeDataset( + tokenizer=self.tokenizer, + ), ) dataset = audio_to_text_dataset.get_audio_to_text_bpe_dataset_from_config( diff --git a/nemo/collections/asr/parts/mixins/transcription.py b/nemo/collections/asr/parts/mixins/transcription.py index c252d498dc08..df8d6bac50a9 100644 --- a/nemo/collections/asr/parts/mixins/transcription.py +++ b/nemo/collections/asr/parts/mixins/transcription.py @@ -186,7 +186,7 @@ class TranscriptionMixin(ABC): @torch.no_grad() def transcribe( self, - audio: Union[str, List[str], np.ndarray], + audio: Union[str, List[str], np.ndarray, DataLoader], batch_size: int = 4, return_hypotheses: bool = False, num_workers: int = 0, @@ -201,6 +201,7 @@ def transcribe( Args: audio: (a single or list) of paths to audio files or a np.ndarray audio array. + Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds. But it is possible to pass a few hours long file if enough GPU memory is available. batch_size: (int) batch size to use during inference. 
@@ -368,7 +369,11 @@ def transcribe_generator(self, audio, override_config: Optional[TranscribeConfig with tempfile.TemporaryDirectory() as tmpdir: transcribe_cfg._internal.temp_dir = tmpdir - dataloader = self._transcribe_input_processing(audio, transcribe_cfg) + # Create a DataLoader if not already present + if not isinstance(audio, DataLoader): + dataloader = self._transcribe_input_processing(audio, transcribe_cfg) + else: + dataloader = audio if hasattr(transcribe_cfg, 'verbose'): verbose = transcribe_cfg.verbose diff --git a/tests/collections/asr/mixins/test_transcription.py b/tests/collections/asr/mixins/test_transcription.py index 794213c72397..1a6f38681d0c 100644 --- a/tests/collections/asr/mixins/test_transcription.py +++ b/tests/collections/asr/mixins/test_transcription.py @@ -22,6 +22,7 @@ import torch from torch.utils.data import DataLoader, Dataset +from nemo.collections.asr.data.audio_to_text import _speech_collate_fn from nemo.collections.asr.models import ASRModel from nemo.collections.asr.parts.mixins import TranscribeConfig, TranscriptionMixin from nemo.collections.asr.parts.mixins.transcription import GenericTranscriptionType @@ -121,6 +122,27 @@ def _transcribe_on_end(self, trcfg: TranscribeConfig): self.flag_end = True +class DummyDataset(Dataset): + def __init__(self, audio_tensors: List[str], config: Dict = None): + self.audio_tensors = audio_tensors + self.config = config + + def __getitem__(self, index): + data = self.audio_tensors[index] + samples = torch.tensor(data) + # Calculate seq length + seq_len = torch.tensor(samples.shape[0], dtype=torch.long) + + # Dummy text tokens + text_tokens = torch.tensor([0], dtype=torch.long) + text_tokens_len = torch.tensor(1, dtype=torch.long) + + return (samples, seq_len, text_tokens, text_tokens_len) + + def __len__(self): + return len(self.audio_tensors) + + @pytest.fixture() def dummy_model(): return TranscribableDummy() @@ -326,3 +348,27 @@ def test_transcribe_multiple_tensor(self, test_data_dir): assert len(outputs) == 2 assert isinstance(outputs[0], str) assert isinstance(outputs[1], str) + + @pytest.mark.with_downloads() + @pytest.mark.unit + def test_transcribe_dataloader(self, test_data_dir): + model = ASRModel.from_pretrained("stt_en_conformer_ctc_small") + + # Load audio file + import soundfile as sf + + audio_file = os.path.join(test_data_dir, "asr", "train", "an4", "wav", "an46-mmap-b.wav") + audio, sr = sf.read(audio_file, dtype='float32') + + audio_file2 = os.path.join(test_data_dir, "asr", "train", "an4", "wav", "an152-mwhw-b.wav") + audio2, sr = sf.read(audio_file2, dtype='float32') + + dataset = DummyDataset([audio, audio2]) + collate_fn = lambda x: _speech_collate_fn(x, pad_id=0) + dataloader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0, collate_fn=collate_fn) + + # DataLoader test + outputs = model.transcribe(dataloader, batch_size=1) + assert len(outputs) == 2 + assert isinstance(outputs[0], str) + assert isinstance(outputs[1], str) From 51c2c3fc389af23ef28d5a00ac43fb20d8cdcb17 Mon Sep 17 00:00:00 2001 From: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Date: Fri, 17 May 2024 09:47:54 -0700 Subject: [PATCH 31/36] NeMo Dev Doc Feature Updates 1: Some parallelisms (#9184) * add various docs fixes Signed-off-by: Elena Rastorgueva * make conf.py changes clearer Signed-off-by: Elena Rastorgueva * fix Duplicate explicit target name error for links Signed-off-by: Elena Rastorgueva * more fixes, mainly citations Signed-off-by: Elena Rastorgueva * fix some code formatting Signed-off-by: 
Elena Rastorgueva * update hf space iframe link Signed-off-by: Elena Rastorgueva * fix new ERRORs Signed-off-by: Elena Rastorgueva * Update docs Signed-off-by: yaoyu-33 * Add MQA and GQA Signed-off-by: yaoyu-33 * Fix small issues Signed-off-by: yaoyu-33 * Add parallelisms Signed-off-by: yaoyu-33 * Add seq packing in NeMo dev doc Signed-off-by: yaoyu-33 * fix few issues Signed-off-by: yaoyu-33 * fix table Signed-off-by: yaoyu-33 * fix table Signed-off-by: yaoyu-33 * fix table Signed-off-by: yaoyu-33 * fix table Signed-off-by: yaoyu-33 * add EP Signed-off-by: yaoyu-33 * squeeze in neva updates Signed-off-by: yaoyu-33 * rename Megatron-Core to Megatron Core Signed-off-by: yaoyu-33 * address comments Signed-off-by: yaoyu-33 * Fix typo Signed-off-by: yaoyu-33 * Update index Signed-off-by: yaoyu-33 * fix Signed-off-by: yaoyu-33 --------- Signed-off-by: Elena Rastorgueva Signed-off-by: yaoyu-33 Co-authored-by: Elena Rastorgueva Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> --- docs/source/features/memory_optimizations.rst | 58 +++++++- docs/source/features/parallelisms.rst | 133 +++++++++++++++++- .../features/throughput_optimizations.rst | 8 ++ docs/source/multimodal/mllm/datasets.rst | 8 ++ docs/source/multimodal/mllm/intro.rst | 1 + .../multimodal/mllm/sequence_packing.rst | 127 +++++++++++++++++ 6 files changed, 328 insertions(+), 7 deletions(-) create mode 100644 docs/source/multimodal/mllm/sequence_packing.rst diff --git a/docs/source/features/memory_optimizations.rst b/docs/source/features/memory_optimizations.rst index d72d54ab7c2c..d87cb1e191ca 100644 --- a/docs/source/features/memory_optimizations.rst +++ b/docs/source/features/memory_optimizations.rst @@ -11,7 +11,7 @@ Flash Attention Overview ^^^^^^^^ -Flash Attention is a method designed to enhance the efficiency of Transformer models, which are widely utilized in applications such as Natural Language Processing (NLP). Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention, an IO-aware exact attention algorithm that leverages tiling to minimize the number of memory reads/writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. This approach is designed to be more efficient in terms of IO complexity compared to standard attention mechanisms. +Flash Attention is a method designed to enhance the efficiency of Transformer models, which are widely utilized in applications such as Natural Language Processing (NLP). Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. Flash Attention is an IO-aware exact attention algorithm that leverages tiling to minimize the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM. This approach is designed to be more efficient in terms of IO complexity compared to standard attention mechanisms. Turn Flash Attention On and Off ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -38,4 +38,58 @@ Selective Activation Recomputation """""""""""""""""""""""""""""""""" This method reduces memory footprint of activations significantly via smart activation checkpointing. This approach involves selectively storing only crucial activations and recomputing the others as needed. It is particularly useful in large models to minimize memory usage while controlling the computational cost. 
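As a rough illustration, activation recomputation is controlled through the ``activations_checkpoint_*`` options of the NeMo Megatron model configs; the values below are an example sketch, not a recommended setting:

.. code-block:: yaml

   activations_checkpoint_granularity: selective  # 'selective' recomputes only parts of attention; 'full' recomputes whole layers
   activations_checkpoint_method: null            # with 'full', choose 'uniform' or 'block' and set activations_checkpoint_num_layers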
-Refer to "Reducing Activation Recomputation in Large Transformer Models" for more details: https://arxiv.org/abs/2205.05198 +Refer to "Reducing Activation Recomputation in Large Transformer Models" for more details: https://arxiv.org/abs/2205.05198. + +Multi-query Attention (MQA) and Grouped-query Attention (GQA) +------------------------------------------------------------- + +**Multi-query Attention (MQA)** and **Grouped-query Attention (GQA)** are modifications of the traditional multihead attention mechanism in Transformer models. These methods improve the efficiency and effectiveness of attention mechanisms. + +Overview +^^^^^^^^ + +**Multi-query Attention (MQA)** + MQA treats all attention heads as a single group, reducing computational complexity and accelerating training times. It is beneficial when model scalability or limited computational resources are concerns. + +**Grouped-query Attention (GQA)** + GQA groups the heads into clusters, each processing a subset of queries independently. This method balances the detailed focus of traditional multihead attention with the broad approach of MQA, enhancing nuanced input data processing. + +These attention variants offer: + +- **Reduced computational load**: Both methods decrease computation, beneficial for large models. +- **Increased processing speed**: Simplifying attention leads to faster training and inference. +- **Flexibility and adaptability**: Adjustments can be made based on task needs or hardware constraints. + +Enable MQA and GQA +^^^^^^^^^^^^^^^^^^ + +To use MQA or GQA in the NeMo Framework, adjust the ``num_query_groups`` parameter in the model configuration: + +1. **For Multi-query Attention (MQA)**: + - Set ``num_query_groups`` to `1` to treat all attention heads as a single group. + + .. code-block:: yaml + + num_query_groups: 1 # Enables Multi-query Attention + +2. **For Grouped-query Attention (GQA)**: + - Set ``num_query_groups`` to a number that is a divisor of the total number of attention heads (more than one but less than the total heads). + + .. code-block:: yaml + + num_query_groups: # Enables Grouped-query Attention + + - For regular attention, set this parameter to `None` or match it with the number of heads. + + .. code-block:: yaml + + num_query_groups: null # Default setting for regular multihead attention + +Adjust the ``num_query_groups`` to explore different attention mechanisms and optimize your model's performance based on specific needs. + +Implement MQA or GQA +^^^^^^^^^^^^^^^^^^^^ + +NeMo's support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. 
To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class: + +Check implementation details from Attention Class in Megatron Core Repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49 diff --git a/docs/source/features/parallelisms.rst b/docs/source/features/parallelisms.rst index b10477e4232c..9d5f33196c4e 100644 --- a/docs/source/features/parallelisms.rst +++ b/docs/source/features/parallelisms.rst @@ -17,41 +17,164 @@ Distributed Data Parallelism (DDP) creates idential copies of the model across m Tensor Parallelism ^^^^^^^^^^^^^^^^^^ -With Tensor Paralellism (TP) a tensor is split into non-overlapping pieces and -different parts are distributed and processed on separate GPUs. + +**Tensor Parallelism (TP)** is a method for distributing a model's computation across multiple GPUs by splitting tensors into non-overlapping pieces. This allows different parts of the tensor to be processed simultaneously on separate GPUs, enhancing performance and enabling the training of larger models. .. image:: ../nlp/nemo_megatron/images/tp.gif :align: center :width: 800px :alt: Tensor Parallel +Enable Tensor Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To enable TP in the NeMo framework, configure the ``tensor_model_parallel_size`` parameter in the model configuration. This parameter determines the number of GPUs among which the model's tensors are partitioned. + +**For Tensor Parallelism**: + - Set ``tensor_model_parallel_size`` to greater than ``1`` to enable intra-layer model parallelism. + + .. code-block:: yaml + + tensor_model_parallel_size: 1 # Example to enable Tensor Parallelism + +The configuration file can be adjusted here: `NeMo Megatron GPT Config `_. + +Implement Tensor Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NeMo integrates Tensor Parallelism through the implementation from Megatron Core. To understand how TP is activated within transformer blocks, refer to the code in the following repository: `Megatron-LM Transformer Block `_. + +For detailed API usage and additional configurations, consult the `Megatron Core Developer Guide `_. + Pipeline Parallelism ^^^^^^^^^^^^^^^^^^^^ -With Pipeline Paralellism (PP) consecutive layer chunks are assigned to different GPUs. + +**Pipeline Parallelism (PP)** is a technique that assigns consecutive layers or segments of a neural network to different GPUs. This division allows each GPU to process different stages of the network sequentially. .. image:: ../nlp/nemo_megatron/images/pp.gif :align: center :width: 800px :alt: Pipeline Parallel + +Enable Pipeline Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To utilize PP in the NeMo framework, you need to set the ``pipeline_model_parallel_size`` parameter in the model's configuration. This parameter specifies the number of GPUs among which the model's layers are distributed. + +**For Pipeline Parallelism**: + - Set ``pipeline_model_parallel_size`` to a value greater than ``1`` to enable inter-layer model parallelism. + + .. code-block:: yaml + + pipeline_model_parallel_size: 1 # Example to enable Pipeline Parallelism + +Adjust the configuration accordingly here: `NeMo Megatron GPT Config `_. + +Interleaved Pipeline Parallel Schedule +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. 
For instance, instead of each GPU processing a continuous set of four layers, it might handle two model chunks with two layers each. This method ensures that each GPU in the pipeline manages multiple stages rather than on a single contiguous block. + + .. code-block:: yaml + + virtual_pipeline_model_parallel_size: 2 # Set for interleaved pipeline + +For more insights into this approach, see our detailed blog: `Scaling Language Model Training `_. + +Implement Pipeline Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NeMo's implementation of PP leverages functionalities from Megatron Core. For a practical example of how PP is implemented within transformer blocks in NeMo, you can inspect the following codebase: `Megatron-LM Transformer Block `_. + +For more detailed API usage and configurations related to PP, visit the `Megatron Core Developer Guide `_. + Sequence Parallelism ^^^^^^^^^^^^^^^^^^^^ +**Sequence Parallelism** extends tensor-level model parallelism by distributing computing load and activation memory across multiple GPUs along the sequence dimension of transformer layers. This method is particularly useful for portions of the layer that have previously not been parallelized, enhancing overall model performance and efficiency. + .. image:: ../nlp/nemo_megatron/images/sp.gif :align: center :width: 800px :alt: Sequence Parallel +Enable Sequence Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To utilize Sequence Parallelism in NeMo, set the ``sequence_parallel`` parameter to ``True`` in the model's configuration. Note that this feature is effective only when the tensor parallel size (``tensor_model_parallel_size``) is greater than ``1``. + + .. code-block:: yaml + + sequence_parallel: True # Enable Sequence Parallelism + +For further information on configuration, refer to the following documentation: `NeMo Megatron GPT Config `_. + +Implement Sequence Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NeMo's implementation of Sequence Parallelism utilizes functionality from Megatron Core. For an in-depth look at how Sequence Parallelism is integrated into the Megatron Core architecture, you can examine the source code here: `Megatron-LM Sequence Parallel Source Code `_. + +Context Parallelism +^^^^^^^^^^^^^^^^^^^ + +**Context Parallelism (CP)** is a method for parallelizing the processing of neural network activations across multiple GPUs, focusing on the sequence dimension of the input data. Unlike Sequence Parallelism (SP) that only partitions specific types of activations, CP divides all network activations along the sequence dimension. + +Enable Context Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To activate CP in the NeMo framework, set the ``context_parallel_size`` parameter in the model configuration. This parameter specifies the number of GPUs among which the model's sequence activations are distributed. + +**For Context Parallelism**: + - Set ``context_parallel_size`` to a value greater than ``1`` to enable sequence-wide model parallelism. + + .. code-block:: yaml + + context_parallel_size: 1 # Example to enable Context Parallelism + +The configuration can be found and modified here: `NeMo Megatron Core Context Config `_. + +Implement Context Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NeMo leverages functionalities from both Megatron Core and transformer-engine to implement CP efficiently. During forward propagation, each GPU handles a segment of the sequence, storing only the necessary Key and Value (KV) pairs. 
In the backward pass, these KV pairs are reassembled across GPUs using advanced communication schemes like all-gather and reduce-scatter transformed into point-to-point communications in a ring topology. This method reduces the memory footprint significantly while maintaining computational efficiency. + +Additionally, NeMo's CP supports integration with various forms of model parallelism such as TP (Tensor Parallelism), PP (Pipeline Parallelism), and DP (Data Parallelism), ensuring broad usability and flexibility in large-scale model training environments. + +Visit our source code for more insights into the implementation: +- Megatron Core transformer engine: `Megatron Core `_ +- Transformer Engine repository: `Transformer Engine Code `_ + + Expert Parallelism ^^^^^^^^^^^^^^^^^^ -Expert Paralellim (EP) distributes experts across GPUs. - +**Expert Parallelism (EP)** is a type of model parallelism that distributes experts of an MoE across GPUs. .. image:: ../nlp/nemo_megatron/images/ep.png :align: center :width: 800px :alt: Expert Parallelism +Enable Expert Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To enable it users can pass ``model.expert_model_parallel_size=k``, where k is an integer with the desired +expert parallelism level, for example if the model has three experts (i.e. ``model.num_moe_experts=3``), we can specify +k=3 (i.e. via CLI using ``model.expert_model_parallel_size=3``). The number of experts should be exactly divisible by the ``expert_model_parallel_size``. + + .. code-block:: yaml + + expert_model_parallel_size: 3 # Set EP to 3 + +For further information on configuration, refer to the following documentation: `NeMo Megatron GPT Config `_. + + +Implement Expert Parallelism +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NeMo's expert parallelism functionality is provided by Megatron-LM repository, please consult the corresponding `Moe-layer `_ for more moe implementation details. + + Parallelism nomenclature ^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/source/features/throughput_optimizations.rst b/docs/source/features/throughput_optimizations.rst index 825c3add5dfb..3f3ded01b1a2 100644 --- a/docs/source/features/throughput_optimizations.rst +++ b/docs/source/features/throughput_optimizations.rst @@ -135,6 +135,14 @@ To train with packed sequences, you need to change four items in the SFT/PEFT co Now you are all set to finetune your model with a much improved throughput! +Sequence Packing for NeVA +------------------------- + +Sequence packing in NeVA (Multimodal LLMs) differs slightly from the LLM SFT/PEFT approach. For details, +please refer to the documentation below + +:doc:`../multimodal/mllm/sequence_packing` + Communication Overlap --------------------- NeMo leverages Megatron-Core's optimizations to enhance bandwidth utilization and effectively overlap computation with communication. Additional details will be provided soon. diff --git a/docs/source/multimodal/mllm/datasets.rst b/docs/source/multimodal/mllm/datasets.rst index 1c64c4d317d2..2f2000124e4d 100644 --- a/docs/source/multimodal/mllm/datasets.rst +++ b/docs/source/multimodal/mllm/datasets.rst @@ -90,6 +90,14 @@ For NeVA training, integrating special tokens into the tokenizer is vital. After .. code-block:: bash + cd /opt; git clone https://github.com/google/sentencepiece.git && \ + cd sentencepiece && \ + mkdir build && \ + cd build && \ + cmake .. 
&& \ + make && \ + make install && \ + ldconfig cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \ --input_file /path/to/neva/tokenizers/tokenizer.model \ diff --git a/docs/source/multimodal/mllm/intro.rst b/docs/source/multimodal/mllm/intro.rst index be564a81a826..0e76a9737a0f 100644 --- a/docs/source/multimodal/mllm/intro.rst +++ b/docs/source/multimodal/mllm/intro.rst @@ -11,3 +11,4 @@ The endeavor to extend Language Models (LLMs) into multimodal domains by integra checkpoint neva video_neva + sequence_packing diff --git a/docs/source/multimodal/mllm/sequence_packing.rst b/docs/source/multimodal/mllm/sequence_packing.rst new file mode 100644 index 000000000000..b061ee1d89c6 --- /dev/null +++ b/docs/source/multimodal/mllm/sequence_packing.rst @@ -0,0 +1,127 @@ +Sequence Packing for NeVA +========================= + +Overview +-------- +As outlined in the throughput optimizations section, most multimodal LLM datasets, such as the LLaVA datasets, exhibit a skewed distribution of sequence lengths. Many sequences are short, and a few are very long, conforming to Zipf’s Law. Transformer models require fixed-length inputs, necessitating padding with many unused pad tokens, which is inefficient for two reasons: + +1. Computation on pad values is disregarded in the final model output, resulting in wasted FLOPs. +2. The micro batch size is often constrained by the batch containing the longest sequences, leading to underutilized GPU memory in most other batches. + +Sequence packing is a training technique wherein multiple training sequences (examples) are concatenated into one long sequence (pack). This approach eliminates the need for padding and allows for more tokens to be processed per micro batch, optimizing both GPU compute and memory utilization. + +For Sequence Packing in SFT / PEFT for LLMs, NeVA considers the following design: + +1. Original Datasets to Sequence Lengths Files + + 1.1. **PyTorch Loaders for Dataset Processing Efficiency** + To efficiently manage large datasets (~700K sequences), the system utilizes PyTorch's DataLoader with multi-worker capabilities, significantly speeding up the data processing phase by parallelizing the loading and pre-processing steps. + 1.2. **Handling Large Datasets** + The system writes sequence lengths to disk on the fly, ensuring scalability and efficient memory usage, as loading all data into memory is impractical. + 1.3. **Efficient I/O Operations** + To facilitate efficient I/O operations necessary for parallelized data loading, the system employs IndexedDataset from Megatron-Core, chosen for its ability to dynamically build binary tensor files. + +2. Packing Sequences into Bins + + 2.1. **Algorithm Choices and Performance** + The first_fit_decreasing and first_fit_shuffle algorithms initially used for packing sequences into bins showed performance issues due to their O(n^2) complexity, making the processing of NeVA samples time-consuming. + 2.2. **Introduction of shuffle_and_pack** + To address these inefficiencies, the shuffle_and_pack algorithm was introduced, an O(n) complexity algorithm that shuffles the sequence lengths before packing them into bins sequentially, significantly improving processing time. + 2.3. 
**Parallelization of Packing Process** + The system implements a parallelized approach to the first_fit_shuffle algorithm by dividing the samples into chunks (~20K samples each) and processing them separately, effectively mitigating the quadratic complexity problem. The bins from each chunk are then combined in the final step, enhancing overall efficiency. + 2.4. **Efficiency Improvements with completed_bins** + A minor optimization involves using completed_bins to prevent the algorithm from iterating over bins that cannot accommodate the minimum sequence length, leading to a more efficient packing process. + +3. Reading Sequence Lengths and Packing into New Files + After determining the optimal bins for packing, the system reads the sequence lengths from the generated files and packs these lengths into new files based on the bins' assignments. This final step consolidates the sequences into efficiently packed bins, ready for further processing or analysis. + +Performance Improvement +----------------------- +A 40% speed increase was achieved with optimized sequence packing for sequence length w/ Vicuna-1.5 13B (LLaVA 1.5 recipe). Detailed performance metrics across different configurations and stages are provided in the tables below. + +Fine-tuning Performance Table: + ++--------------+---------------------------+----------------+----+----+-----------+------------------+-----------------+-------------------+---------------+-------------------+ +| Stage | Vision Encoder | LLM Model | TP | PP | Precision | Sequence Packing | Step Timing (s) | Global Batch Size | Samples / Sec | Perf Improvement | ++==============+===========================+================+====+====+===========+==================+=================+===================+===============+===================+ +| Fine-tuning | openai/clip-vit-large- | Vicuna-1.5 13B | 8 | 1 | BF16 | No | 2.008 | 128 | 63.745 | 0% | +| | patch14-336 | | | | | | | | | | ++--------------+---------------------------+----------------+----+----+-----------+------------------+-----------------+-------------------+---------------+-------------------+ +| Fine-tuning | openai/clip-vit-large- | Vicuna-1.5 13B | 4 | 2 | BF16 | No | 1.889 | 128 | 67.761 | 6% | +| | patch14-336 | | | | | | | | | | ++--------------+---------------------------+----------------+----+----+-----------+------------------+-----------------+-------------------+---------------+-------------------+ +| Fine-tuning | openai/clip-vit-large- | Vicuna-1.5 13B | 8 | 1 | BF16 | Yes | 1.302 | 116.08 | 89.155 | 40% | +| | patch14-336 | | | | | | | | | | ++--------------+---------------------------+----------------+----+----+-----------+------------------+-----------------+-------------------+---------------+-------------------+ +| Fine-tuning | openai/clip-vit-large- | Vicuna-1.5 13B | 4 | 2 | BF16 | Yes | 1.237 | 116.08 | 93.840 | 47% | +| | patch14-336 | | | | | | | | | | ++--------------+---------------------------+----------------+----+----+-----------+------------------+-----------------+-------------------+---------------+-------------------+ + +How to Run NeVA with Packed Sequence +------------------------------------ +Prepare Dataset +^^^^^^^^^^^^^^^ +We provide an easy-to-use script for preprocessing a dataset for the NeMo Multimodal Learning framework. It requires specifying paths for data, images, and the tokenizer model, among other parameters. + +.. 
code-block:: bash + + python examples/multimodal/multimodal_llm/neva/sequence_packing/preprocess_dataset.py \ + --data_path=/path/to/LLaVA-Instruct-150K/llava_v1_5_mix665k_filtered.json \ + --image_folder=/path/to/LLaVA-Instruct-150K/images \ + --tokenizer_path=/path/to/checkpoints/tokenizer_add_special.model \ + --output_dir=/path/to/LLaVA-Instruct-150K/packed_seq_12288_336_v1 \ + --max_seq_length=12288 \ + --packing_algorithm=first_fit_shuffle \ + --hf_vision_encoder=openai/clip-vit-large-patch14-336 \ + --conv_template=v1 \ + --image_aspect_ratio=pad \ + --seed=42 + +Parameters: +* ``--data_path``: Path to the dataset file in JSON format. +* ``--image_folder``: Directory containing the images referenced in the dataset. +* ``--tokenizer_path``: Path to the tokenizer model. +* ``--output_dir``: Directory where the processed dataset will be stored. +* ``--max_seq_length``: The maximum sequence length of the model. +* ``--packing_algorithm``: Algorithm used for packing sequences. Defaults to 'first_fit_shuffle'. +* ``--hf_vision_encoder``: The Hugging Face vision encoder to use. Default is 'openai/clip-vit-large-patch14-336'. +* ``--conv_template``: Template for data conversion. Default is 'plain', with 'v1' as an alternative. +* ``--image_aspect_ratio``: The aspect ratio for processing images. Defaults to 'square', 'pad' for padding to maintain aspect ratio. +* ``--seed``: Seed for random operations in 'first_fit_shuffle'. +* ``--hparams_file``: Optional path to a YAML file containing additional hyperparameters. + +Remarks: +1. The current version of data processing saves processed image tensors in the sequence packing, which may require significant storage. This issue will be addressed in future iterations. +2. The ``max_seq_length`` is crucial for achieving optimal performance. Excessive length can lead to out-of-memory errors, while insufficient length may degrade performance. +3. The conversation prompt template is inserted during this step to ensure accurate sequence length calculation. + +Adjust Training Config +"""""""""""""""""""""" +To train with packed sequences, modify four items in the SFT/PEFT config file. + +1. Enable the ``packed_sequence`` flag: + +.. code-block:: bash + + ++model.data.data_prefix=/lustre/fsw/coreai_dlalgo_genai/datasets/LLaVA-Instruct-150K/packed_seq_12288_336_v1/packed_seq_dataset + ++model.data.crop_size=[224,224] + ++model.data.packed_sequence=True + +2. Use the new dataset file instead of the original JSONL file and ensure the crop sizes are correctly specified since images are now cached: + +.. code-block:: bash + + ++model.data.data_prefix=/path/to/datasets/LLaVA-Instruct-150K/packed_seq_12288_336_v1/packed_seq_dataset + ++model.data.crop_size=[336,336] + +4. Adjust batch sizes: + +* Micro batch size should be set to 1 due to concatenation in the preprocessing step. Increase ``pack_size`` to achieve a higher micro batch size. +* Global batch size should be adjusted based on the average number of sequences per pack (``n``), calculated as the total number of sequences divided by the number of packs. This maintains the training recipe by ensuring each gradient iteration sees, on average, the same number of tokens. + +.. code-block:: bash + + model.micro_batch_size=1 + model.global_batch_size= + +Now, you are ready to fine-tune your model with significantly improved throughput! 
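For a concrete sense of the batch-size adjustment, here is a small arithmetic sketch; the sequence and pack counts are illustrative placeholders rather than measured values (only their ratio matters):

.. code-block:: python

   # n = average number of sequences per pack = total sequences / number of packs.
   total_sequences = 665_000            # sequences in the original (unpacked) dataset -- illustrative
   num_packs = 603_000                  # packs written by the preprocessing script -- illustrative
   n = total_sequences / num_packs      # ~1.10 in this example

   old_global_batch_size = 128          # unpacked recipe
   micro_batch_size = 1                 # fixed to 1 because sequences are pre-concatenated

   # Divide by n so each gradient step still sees roughly the same number of
   # sequences (and therefore tokens) as the unpacked recipe; ~116 here, which
   # is in line with the effective global batch size reported in the table above.
   new_global_batch_size = old_global_batch_size / n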
From 659e025fd841c1b173ba7c1807fc17c6e6fc66be Mon Sep 17 00:00:00 2001 From: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Date: Fri, 17 May 2024 11:27:09 -0700 Subject: [PATCH 32/36] Clean up dev docs collection section (#9205) * Update examples Signed-off-by: yaoyu-33 * Update index Signed-off-by: yaoyu-33 * Update index Signed-off-by: yaoyu-33 * update Signed-off-by: yaoyu-33 * update Signed-off-by: yaoyu-33 * fix Signed-off-by: yaoyu-33 * update Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 --- docs/source/{core/core_index.rst => apis.rst} | 26 ++++++++---- docs/source/asr/api.rst | 4 +- docs/source/asr/ssl/intro.rst | 4 +- docs/source/collections.rst | 42 +++++++------------ docs/source/common/intro.rst | 4 +- docs/source/core/api.rst | 4 +- docs/source/index.rst | 2 +- docs/source/multimodal/api.rst | 4 +- docs/source/nlp/api.rst | 4 +- docs/source/tts/api.rst | 4 +- docs/source/tts/intro.rst | 2 - 11 files changed, 48 insertions(+), 52 deletions(-) rename docs/source/{core/core_index.rst => apis.rst} (74%) diff --git a/docs/source/core/core_index.rst b/docs/source/apis.rst similarity index 74% rename from docs/source/core/core_index.rst rename to docs/source/apis.rst index 01977c1b5101..e3c199bb47d5 100644 --- a/docs/source/core/core_index.rst +++ b/docs/source/apis.rst @@ -14,14 +14,26 @@ You can learn more about aspects of the NeMo "core" by following the links below :name: core :titlesonly: - core - neural_modules - exp_manager - neural_types - export - adapters/intro - api + core/core + core/neural_modules + core/exp_manager + core/neural_types + core/export + core/adapters/intro +You can learn more about aspects of the NeMo APIs by following the links below: + +.. toctree:: + :maxdepth: 1 + :name: API + :titlesonly: + + core/api + common/intro + nlp/api + multimodal/api + asr/api + tts/api Alternatively, you can jump straight to the documentation for the individual collections: diff --git a/docs/source/asr/api.rst b/docs/source/asr/api.rst index 2eb687d97d8e..c99d92c0371a 100644 --- a/docs/source/asr/api.rst +++ b/docs/source/asr/api.rst @@ -1,5 +1,5 @@ -NeMo ASR Collection API -======================= +NeMo ASR API +============ Model Classes diff --git a/docs/source/asr/ssl/intro.rst b/docs/source/asr/ssl/intro.rst index d1a7366164d8..76a3a75dcf37 100644 --- a/docs/source/asr/ssl/intro.rst +++ b/docs/source/asr/ssl/intro.rst @@ -1,5 +1,5 @@ -Self-Supervised Learning -================================= +Speech Self-Supervised Learning +=============================== Self-Supervised Learning (SSL) refers to the problem of learning without explicit labels. As any learning process require feedback, without explit labels, SSL derives supervisory signals from diff --git a/docs/source/collections.rst b/docs/source/collections.rst index 1cc7a654b9c1..d4bea503513b 100644 --- a/docs/source/collections.rst +++ b/docs/source/collections.rst @@ -11,26 +11,9 @@ Documentation for the individual collections :titlesonly: nlp/nemo_megatron/intro - nlp/models nlp/machine_translation/machine_translation nlp/megatron_onnx_export nlp/quantization - nlp/api - - -.. toctree:: - :maxdepth: 1 - :caption: Speech AI - :name: Speech AI - :titlesonly: - - asr/intro - asr/speech_classification/intro - asr/speaker_recognition/intro - asr/speaker_diarization/intro - asr/ssl/intro - asr/speech_intent_slot/intro - .. 
toctree:: :maxdepth: 1 @@ -42,29 +25,32 @@ Documentation for the individual collections multimodal/vlm/intro multimodal/text2img/intro multimodal/nerf/intro - multimodal/api - .. toctree:: :maxdepth: 1 - :caption: Text To Speech (TTS) - :name: Text To Speech + :caption: Vision (CV) + :name: vision :titlesonly: - tts/intro + vision/intro .. toctree:: :maxdepth: 1 - :caption: Vision (CV) - :name: vision + :caption: Speech AI + :name: Speech AI :titlesonly: - vision/intro + asr/intro + asr/speech_classification/intro + asr/speaker_recognition/intro + asr/speaker_diarization/intro + asr/ssl/intro + asr/speech_intent_slot/intro .. toctree:: :maxdepth: 1 - :caption: Common - :name: Common + :caption: Text To Speech (TTS) + :name: Text To Speech :titlesonly: - common/intro \ No newline at end of file + tts/intro diff --git a/docs/source/common/intro.rst b/docs/source/common/intro.rst index fadbd9528485..a89f1a480e5d 100644 --- a/docs/source/common/intro.rst +++ b/docs/source/common/intro.rst @@ -1,5 +1,5 @@ -Common Collection -================= +NeMo Common Collection API +========================== The common collection contains things that could be used across all collections. diff --git a/docs/source/core/api.rst b/docs/source/core/api.rst index 6b389ca3be85..1aceb73de0d9 100644 --- a/docs/source/core/api.rst +++ b/docs/source/core/api.rst @@ -1,6 +1,6 @@ -Core APIs -========= +NeMo Core APIs +============== Base class for all NeMo models ------------------------------ diff --git a/docs/source/index.rst b/docs/source/index.rst index eb586f749842..511d3ef700c9 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -69,7 +69,7 @@ For more information, browse the developer docs for your area of interest in the :name: APIs :titlesonly: - core/core_index + apis .. toctree:: :maxdepth: 1 diff --git a/docs/source/multimodal/api.rst b/docs/source/multimodal/api.rst index 3228cd76d4ad..7a9fe2822d07 100644 --- a/docs/source/multimodal/api.rst +++ b/docs/source/multimodal/api.rst @@ -1,5 +1,5 @@ -Multimodal API -======================= +NeMo Multimodal API +=================== Model Classes ------------- diff --git a/docs/source/nlp/api.rst b/docs/source/nlp/api.rst index 52c1b537b0bf..cb7db1ba943a 100755 --- a/docs/source/nlp/api.rst +++ b/docs/source/nlp/api.rst @@ -1,5 +1,5 @@ -Large language Model API -======================== +NeMo Large language Model API +============================= Pretraining Model Classes ------------------------- diff --git a/docs/source/tts/api.rst b/docs/source/tts/api.rst index 3e9b06b4e9a9..8664adafa6d7 100644 --- a/docs/source/tts/api.rst +++ b/docs/source/tts/api.rst @@ -1,5 +1,5 @@ -NeMo TTS Collection API -======================= +NeMo TTS API +============ Model Classes ------------- diff --git a/docs/source/tts/intro.rst b/docs/source/tts/intro.rst index 3964319234b3..b7d717e7ac68 100644 --- a/docs/source/tts/intro.rst +++ b/docs/source/tts/intro.rst @@ -15,8 +15,6 @@ We will illustrate details in the following sections. datasets checkpoints configs - api - resources g2p .. 
include:: resources.rst From 074401642bc954f5b78755adffda8369560afccc Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Fri, 17 May 2024 14:05:24 -0700 Subject: [PATCH 33/36] use get with fallback when reading checkpoint_callback_params (#9223) Signed-off-by: Alexandros Koumparoulis --- nemo/collections/nlp/parts/megatron_trainer_builder.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/nemo/collections/nlp/parts/megatron_trainer_builder.py b/nemo/collections/nlp/parts/megatron_trainer_builder.py index e1a780f09756..03cf5fb755bd 100644 --- a/nemo/collections/nlp/parts/megatron_trainer_builder.py +++ b/nemo/collections/nlp/parts/megatron_trainer_builder.py @@ -145,7 +145,7 @@ def _plugins(self) -> list: use_dist_ckpt = not self.cfg.model.get('fsdp', False) and ( self.cfg.model.get('mcore_gpt', False) or self.cfg.model.get('mcore_bert', False) ) - async_save = self.cfg.exp_manager.checkpoint_callback_params.get('async_save', False) + async_save = self.cfg.exp_manager.get('checkpoint_callback_params', {}).get('async_save', False) if use_dist_ckpt: checkpoint_io = DistributedCheckpointIO.from_config(self.cfg.model, async_save) if async_save: @@ -170,7 +170,7 @@ def _callbacks(self, callbacks: Optional[list]) -> list: if 'enable_progress_bar' not in self.cfg.trainer or self.cfg.trainer.enable_progress_bar: callbacks.append(CustomProgressBar()) - if self.cfg.exp_manager.checkpoint_callback_params.get('async_save', False): + if self.cfg.exp_manager.get('checkpoint_callback_params', {}).get('async_save', False): callbacks.append(AsyncFinalizerCallback()) return callbacks From cd6d67b209015912fbe43d81aaa6271b7c553956 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Fri, 17 May 2024 18:40:12 -0400 Subject: [PATCH 34/36] Revert rope fusion defaults (#9237) * revert rope fusion defaults Signed-off-by: Chen Cui * Apply isort and black reformatting Signed-off-by: cuichenx --------- Signed-off-by: Chen Cui Signed-off-by: cuichenx Co-authored-by: cuichenx --- .../conf/megatron_chatglm_config.yaml | 2 +- .../conf/megatron_falcon_config.yaml | 2 +- .../conf/megatron_gpt_config.yaml | 2 +- .../conf/megatron_llama_config.yaml | 2 +- .../conf/megatron_starcoder_config.yaml | 2 +- .../language_modeling/megatron_gpt_eval.py | 9 ++- nemo/collections/multimodal/parts/utils.py | 34 +++++++--- .../megatron_gpt_embedding_model.py | 3 - .../language_modeling/megatron_base_model.py | 64 ++++++++++--------- 9 files changed, 73 insertions(+), 47 deletions(-) diff --git a/examples/nlp/language_modeling/conf/megatron_chatglm_config.yaml b/examples/nlp/language_modeling/conf/megatron_chatglm_config.yaml index 5c1191dbe64e..84fbd1b801d4 100644 --- a/examples/nlp/language_modeling/conf/megatron_chatglm_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_chatglm_config.yaml @@ -81,7 +81,7 @@ model: position_embedding_type: 'rope' # Position embedding type. Options ['learned_absolute', 'rope'] rotary_percentage: 0.5 # If using position_embedding_type=rope, then the per head dim is multiplied by this. For chatglm2, it is 0.5 (https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L754) rotary_interleaved: True # chatglm2 use interleaved rotary embedding - apply_rope_fusion: True + apply_rope_fusion: False attention_type: 'multihead' # Attention type. Options ['multihead'] share_embeddings_and_output_weights: False # Share embedding and output layer weights. overlap_p2p_comm: False # Overlap p2p communication with computes. 
This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1 diff --git a/examples/nlp/language_modeling/conf/megatron_falcon_config.yaml b/examples/nlp/language_modeling/conf/megatron_falcon_config.yaml index f5746433cc78..8905abaf3ac2 100644 --- a/examples/nlp/language_modeling/conf/megatron_falcon_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_falcon_config.yaml @@ -113,7 +113,7 @@ model: bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages. - apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope + apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope # Miscellaneous diff --git a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml index 20e20744833c..269aa8f55153 100755 --- a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml @@ -131,7 +131,7 @@ model: bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages. - apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope + apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope # Miscellaneous diff --git a/examples/nlp/language_modeling/conf/megatron_llama_config.yaml b/examples/nlp/language_modeling/conf/megatron_llama_config.yaml index 38ed239ec6e1..965b511fc7e7 100644 --- a/examples/nlp/language_modeling/conf/megatron_llama_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_llama_config.yaml @@ -112,7 +112,7 @@ model: bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages. - apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope + apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope # Miscellaneous diff --git a/examples/nlp/language_modeling/conf/megatron_starcoder_config.yaml b/examples/nlp/language_modeling/conf/megatron_starcoder_config.yaml index b170e82ca983..355e575a6d59 100644 --- a/examples/nlp/language_modeling/conf/megatron_starcoder_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_starcoder_config.yaml @@ -117,7 +117,7 @@ model: bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. 
get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages. - apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope + apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope # Miscellaneous seed: 1234 diff --git a/examples/nlp/language_modeling/megatron_gpt_eval.py b/examples/nlp/language_modeling/megatron_gpt_eval.py index 01c56f1e3269..f3413a5fa92e 100644 --- a/examples/nlp/language_modeling/megatron_gpt_eval.py +++ b/examples/nlp/language_modeling/megatron_gpt_eval.py @@ -148,7 +148,9 @@ def __init__(self, sentences): super().__init__() self.sentences = sentences - def __len__(self,): + def __len__( + self, + ): return len(self.sentences) def __getitem__(self, idx): @@ -173,7 +175,9 @@ def main(cfg) -> None: callbacks.append(CustomProgressBar()) # trainer required for restoring model parallel models trainer = Trainer( - strategy=NLPDDPStrategy(timeout=datetime.timedelta(seconds=18000)), **cfg.trainer, callbacks=callbacks, + strategy=NLPDDPStrategy(timeout=datetime.timedelta(seconds=18000)), + **cfg.trainer, + callbacks=callbacks, ) if cfg.gpt_model_file is not None: @@ -224,6 +228,7 @@ def main(cfg) -> None: pretrained_cfg.activations_checkpoint_method = None pretrained_cfg.precision = trainer.precision pretrained_cfg["use_flash_attention"] = cfg.inference.get("use_flash_attention", False) + pretrained_cfg["apply_rope_fusion"] = False if pretrained_cfg.get('mcore_gpt', False): # with dist checkpointing we can use the model parallel config specified by the user pretrained_cfg.tensor_model_parallel_size = cfg.tensor_model_parallel_size diff --git a/nemo/collections/multimodal/parts/utils.py b/nemo/collections/multimodal/parts/utils.py index 223fe22bd00a..f9d6ed5250f6 100644 --- a/nemo/collections/multimodal/parts/utils.py +++ b/nemo/collections/multimodal/parts/utils.py @@ -138,7 +138,8 @@ def load_nemo_model_weights(nemo_path, sharded_state_dict=None): tmp_model_weights_dir = os.path.splitext(tmp_model_weights_ckpt)[0] assert os.path.isdir(tmp_model_weights_dir), f'Expected {tmp_model_weights_dir} to be a directory.' checkpoint = dist_checkpointing.load( - sharded_state_dict=checkpoint, checkpoint_dir=tmp_model_weights_dir, + sharded_state_dict=checkpoint, + checkpoint_dir=tmp_model_weights_dir, ) state_dict = checkpoint["state_dict"] @@ -149,7 +150,9 @@ def load_nemo_model_weights(nemo_path, sharded_state_dict=None): def setup_trainer_and_models_for_inference( - model_provider: Any, cfg: DictConfig, model_cfg_modifier: Callable, + model_provider: Any, + cfg: DictConfig, + model_cfg_modifier: Callable, ): """ Set up a trainer and NeMo model for inference. @@ -172,7 +175,10 @@ def setup_trainer_and_models_for_inference( # Use the NLPDDPStrategy for the distributed data parallel strategy. # We don't use DDP for async grad allreduce and don't find unused parameters. - strategy = NLPDDPStrategy(no_ddp_communication_hook=True, find_unused_parameters=False,) + strategy = NLPDDPStrategy( + no_ddp_communication_hook=True, + find_unused_parameters=False, + ) # Set up the trainer with the specified plugins and strategy. 
trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer) @@ -215,7 +221,9 @@ def setup_trainer_and_models_for_inference( ) model = model_provider.load_from_checkpoint( - single_model_cfg.restore_from_path, hparams_file=cfg.model.get("hparams_file"), trainer=trainer, + single_model_cfg.restore_from_path, + hparams_file=cfg.model.get("hparams_file"), + trainer=trainer, ) models.append(model) @@ -239,7 +247,9 @@ def dummy(): def setup_trainer_and_model_for_inference( - model_provider: Any, cfg: DictConfig, model_cfg_modifier: Callable, + model_provider: Any, + cfg: DictConfig, + model_cfg_modifier: Callable, ) -> Tuple[Trainer, Any]: """ Set up a trainer and NeMo model for inference. @@ -261,7 +271,10 @@ def setup_trainer_and_model_for_inference( # Use the NLPDDPStrategy for the distributed data parallel strategy. # We don't use DDP for async grad allreduce and don't find unused parameters. - strategy = NLPDDPStrategy(no_ddp_communication_hook=True, find_unused_parameters=False,) + strategy = NLPDDPStrategy( + no_ddp_communication_hook=True, + find_unused_parameters=False, + ) # Set up the trainer with the specified plugins and strategy. trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer) @@ -299,7 +312,9 @@ def setup_trainer_and_model_for_inference( ) model = model_provider.load_from_checkpoint( - cfg.model.restore_from_path, hparams_file=cfg.model.get("hparams_file"), trainer=trainer, + cfg.model.restore_from_path, + hparams_file=cfg.model.get("hparams_file"), + trainer=trainer, ) else: @@ -335,7 +350,9 @@ def create_neva_model_and_processor(cfg): or cfg.get('pipeline_model_parallel_split_rank', -1) < 0 ): model_config = MegatronNevaModel.restore_from( - restore_path=cfg.neva_model_file, trainer=trainer, return_config=True, + restore_path=cfg.neva_model_file, + trainer=trainer, + return_config=True, ) with open_dict(cfg): @@ -366,6 +383,7 @@ def create_neva_model_and_processor(cfg): neva_cfg.activations_checkpoint_method = None neva_cfg.precision = trainer.precision neva_cfg.mm_cfg.llm.from_pretrained = cfg.get('base_model_file', None) + neva_cfg.apply_rope_fusion = False neva_cfg.fp8 = False # neva_cfg.mm_cfg.vision_encoder.from_pretrained = None diff --git a/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py b/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py index 389c90d7f97c..67fd2b1b6c62 100644 --- a/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py +++ b/nemo/collections/nlp/models/information_retrieval/megatron_gpt_embedding_model.py @@ -68,9 +68,6 @@ def __init__(self, cfg: DictConfig, trainer: Trainer): assert ( self.cfg.get("post_process", False) is False ), "post_process must be False to get hidden states in the loss_func" - assert ( - self.cfg.get('apply_rope_fusion', True) is False - ), "RoPE fusion should be set to False for MegatronGPTEmbeddingModel" def model_provider_func(self, pre_process, post_process): # (@adithyare) We need post_process to be False to get hidden states in the loss_func diff --git a/nemo/collections/nlp/models/language_modeling/megatron_base_model.py b/nemo/collections/nlp/models/language_modeling/megatron_base_model.py index 8b7c7a38045c..a27f9fd5e5e4 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_base_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_base_model.py @@ -246,12 +246,12 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True): self.use_fsdp = cfg.get('fsdp', 
False) def setup_transformer_engine_tp_groups(self): - """ This should be called after model parallel groups have been initialized - and only needs to be called when using Transformer Engine. + """This should be called after model parallel groups have been initialized + and only needs to be called when using Transformer Engine. """ for module in self.get_model_module_list(): """Set TP group - Copied from: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py#L398 + Copied from: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py#L398 """ # Deep iterate but skip self to avoid infinite recursion. for index, child in enumerate(module.modules()): @@ -262,14 +262,14 @@ def setup_transformer_engine_tp_groups(self): child.set_tensor_parallel_group(tp_group) def setup_transformer_engine_cp_groups(self): - """ This should be called after context parallel groups have been initialized - and only needs to be called when using Transformer Engine. + """This should be called after context parallel groups have been initialized + and only needs to be called when using Transformer Engine. """ cp_stream = torch.cuda.Stream() for module in self.get_model_module_list(): """Set context parallel running - Copied from: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py + Copied from: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py """ # Deep iterate but skip self to avoid infinite recursion. for index, child in enumerate(module.modules()): @@ -283,11 +283,11 @@ def setup_transformer_engine_cp_groups(self): ) def _wrap_model_for_O2(self): - """ Wraps self.model in a float16 wrapper if the model is using megatron amp O2. - Args: - model: The model to wrap. Can be a list of modules or a single module. - Returns: - The wrapped model. Returns a list of wrapped modules or a single wrapped module. + """Wraps self.model in a float16 wrapper if the model is using megatron amp O2. + Args: + model: The model to wrap. Can be a list of modules or a single module. + Returns: + The wrapped model. Returns a list of wrapped modules or a single wrapped module. """ is_mcore_model = self.__dict__.get('mcore_gpt', False) or self.__dict__.get('mcore_bert', False) @@ -450,10 +450,10 @@ def on_validation_end(self) -> None: gc.collect() def build_transformer_config(self) -> TransformerConfig: - """ Builds the megatron core transformer config for the model. - For attributes in the nemo model config that are the same - as the megatron core TransformerConfig, we will use the value from the nemo model config. - For attributes in TransformerConfig that are not in the nemo model config, we add custom logic. + """Builds the megatron core transformer config for the model. + For attributes in the nemo model config that are the same + as the megatron core TransformerConfig, we will use the value from the nemo model config. + For attributes in TransformerConfig that are not in the nemo model config, we add custom logic. 
""" # create a dictionary copy of the model config @@ -509,7 +509,7 @@ def build_transformer_config(self) -> TransformerConfig: bias_dropout_fusion = self.cfg.get('bias_dropout_add_fusion', True) - apply_rope_fusion = self.cfg.get('apply_rope_fusion', True) + apply_rope_fusion = self.cfg.get('apply_rope_fusion', False) # TODO: need to check if recompute APIs are matching up properly recompute_granularity = self.cfg.get('activations_checkpoint_granularity', None) @@ -601,7 +601,7 @@ def get_parameters_with_grad(self): def configure_gradient_clipping(self, *args, **kwargs): """PTL hook to configure gradients. - We use gradient clipping implementation from megatron-lm. + We use gradient clipping implementation from megatron-lm. """ clip_val = self.trainer.gradient_clip_val if clip_val is None: @@ -627,13 +627,17 @@ def configure_gradient_clipping(self, *args, **kwargs): parameters = self._optimizer.get_parameters_with_grad() else: parameters = self.get_parameters_with_grad() - grad_norm = clip_grad_norm_fp32(parameters=parameters, max_norm=clip_val, use_fsdp=self.use_fsdp,) + grad_norm = clip_grad_norm_fp32( + parameters=parameters, + max_norm=clip_val, + use_fsdp=self.use_fsdp, + ) self.log('grad_norm', grad_norm, rank_zero_only=True, batch_size=1) def allreduce_gradients(self): """Reduce gradients across data parallel ranks. - Modified from megatron-lm: https://github.com/NVIDIA/Megatron-LM/blob/d41696840ed0a7edb7e0499eb82a48ae112d9bb3/megatron/model/distributed.py#L188 + Modified from megatron-lm: https://github.com/NVIDIA/Megatron-LM/blob/d41696840ed0a7edb7e0499eb82a48ae112d9bb3/megatron/model/distributed.py#L188 """ # Bucketize and all-reduce buckets = {} @@ -732,7 +736,9 @@ def on_validation_batch_end(self, outputs, batch: Any, batch_idx: int, dataloade self.validation_global_step += 1 def setup_optimization( - self, optim_config: Optional[Union[DictConfig, Dict]] = None, optim_kwargs: Optional[Dict[str, Any]] = None, + self, + optim_config: Optional[Union[DictConfig, Dict]] = None, + optim_kwargs: Optional[Dict[str, Any]] = None, ): # Ensure `max_steps` is set correctly optim_config = self._optim_config_copy(optim_config) @@ -913,8 +919,8 @@ def _extract_consumed_samples_from_ckpt(self, ckpt_path): return init_consumed_samples def _validate_and_override_config(self): - """ Certain configurations might be incompatible or discouraged. - We can check for them here and override if necessary. + """Certain configurations might be incompatible or discouraged. + We can check for them here and override if necessary. """ app_state = AppState() @@ -1093,9 +1099,9 @@ def _get_total_params_across_model_parallel_groups_enc_dec(self, model): return num_parameters_on_device, total_num_parameters def build_model_parallel_config(self) -> ModelParallelConfig: - """ For attributes in the nemo model config that are the same as the - megatron core ModelParallelConfig we will use the value from the nemo config. - For attributes in ModelParallelConfig that are not in the nemo model config, we add custom logic. + """For attributes in the nemo model config that are the same as the + megatron core ModelParallelConfig we will use the value from the nemo config. + For attributes in ModelParallelConfig that are not in the nemo model config, we add custom logic. 
""" cfg = OmegaConf.to_container(self.cfg, resolve=True) @@ -1116,9 +1122,9 @@ def build_model_parallel_config(self) -> ModelParallelConfig: "async_tensor_model_parallel_allreduce": self.cfg.get('tensor_model_parallel_world_size', 1) > 1 and not self.cfg.get('sequence_parallel', False), "pipeline_dtype": pipeline_dtype, - "grad_scale_func": self.trainer.precision_plugin.scaler.scale - if self.trainer.precision in ["16", "16-mixed"] - else None, + "grad_scale_func": ( + self.trainer.precision_plugin.scaler.scale if self.trainer.precision in ["16", "16-mixed"] else None + ), "enable_autocast": not megatron_amp_O2 and self.torch_dtype in [torch.bfloat16, torch.float16], "autocast_dtype": self.autocast_dtype, "variable_seq_lengths": False, # set dynamically during training @@ -1230,7 +1236,7 @@ def find_frozen_submodules(model): return frozen_submodule_names, frozen_submodules if self.use_fsdp: - """ Top-evel FSDP model sharding """ + """Top-evel FSDP model sharding""" # Shard the top-level model hierarchically. We shard the strategy-unwrapped model not # to lose the structure of non-FSDP wrapped parameters (e.g, embedding) # TODO: Currently the main parameter data type is kept in fp32 (when O2=False). This needs to be From 1d576e42644a0e3ea4ffa254f59522acfbc9414a Mon Sep 17 00:00:00 2001 From: "He Huang (Steve)" <105218074+stevehuang52@users.noreply.github.com> Date: Mon, 20 May 2024 10:29:00 -0400 Subject: [PATCH 35/36] Update Online_Offline_Microphone_VAD_Demo.ipynb (#9251) Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> --- tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb b/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb index 490a4b6c8de7..9522ac0a80e5 100644 --- a/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb +++ b/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb @@ -638,7 +638,7 @@ " ax2.set_ylabel('Preds and Probas')\n", " \n", " \n", - "ax = plt.subplot(num+1,1,i+2)\n", + "ax = plt.subplot(num+1,1,num+1)\n", "S = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=64, fmax=8000)\n", "S_dB = librosa.power_to_db(S, ref=np.max)\n", "librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sample_rate, fmax=8000)\n", From d11324e68af4e40e5d4dc5fbdcbeb3d13a9b30a9 Mon Sep 17 00:00:00 2001 From: paul-gibbons Date: Mon, 20 May 2024 09:59:39 -0700 Subject: [PATCH 36/36] neva media_type fix Signed-off-by: paul-gibbons text gen defaults Signed-off-by: paul-gibbons --- nemo/collections/multimodal/data/neva/neva_dataset.py | 2 +- .../collections/nlp/modules/common/text_generation_strategy.py | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/nemo/collections/multimodal/data/neva/neva_dataset.py b/nemo/collections/multimodal/data/neva/neva_dataset.py index caaab2c5d67e..8e44769efaf7 100644 --- a/nemo/collections/multimodal/data/neva/neva_dataset.py +++ b/nemo/collections/multimodal/data/neva/neva_dataset.py @@ -981,7 +981,7 @@ def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: tokens = batch['tokens'] labels = batch['labels'] - media_type = model_cfg.data.get('media_type') + media_type = model_cfg.data.get('media_type','image') if media_type == 'image': media = batch.get('image') elif media_type == 'video': diff --git a/nemo/collections/nlp/modules/common/text_generation_strategy.py b/nemo/collections/nlp/modules/common/text_generation_strategy.py index 
From d11324e68af4e40e5d4dc5fbdcbeb3d13a9b30a9 Mon Sep 17 00:00:00 2001
From: paul-gibbons
Date: Mon, 20 May 2024 09:59:39 -0700
Subject: [PATCH 36/36] neva media_type fix

Signed-off-by: paul-gibbons

text gen defaults

Signed-off-by: paul-gibbons
---
 nemo/collections/multimodal/data/neva/neva_dataset.py          | 2 +-
 .../collections/nlp/modules/common/text_generation_strategy.py | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/nemo/collections/multimodal/data/neva/neva_dataset.py b/nemo/collections/multimodal/data/neva/neva_dataset.py
index caaab2c5d67e..8e44769efaf7 100644
--- a/nemo/collections/multimodal/data/neva/neva_dataset.py
+++ b/nemo/collections/multimodal/data/neva/neva_dataset.py
@@ -981,7 +981,7 @@ def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
         tokens = batch['tokens']
         labels = batch['labels']
 
-        media_type = model_cfg.data.get('media_type')
+        media_type = model_cfg.data.get('media_type','image')
         if media_type == 'image':
             media = batch.get('image')
         elif media_type == 'video':
diff --git a/nemo/collections/nlp/modules/common/text_generation_strategy.py b/nemo/collections/nlp/modules/common/text_generation_strategy.py
index fd32ac844274..35393b413232 100644
--- a/nemo/collections/nlp/modules/common/text_generation_strategy.py
+++ b/nemo/collections/nlp/modules/common/text_generation_strategy.py
@@ -488,7 +488,8 @@ def __init__(self, model):
             sep_image_conv_front=self.data_cfg.sep_image_conv_front,
             conv_template=self.data_cfg.get("conv_template", "nvgpt"),
             image_token_len=self.data_cfg.image_token_len,
-            image_folder=self.data_cfg.image_folder,
+            image_folder=self.data_cfg.get('image_folder', None),
+            video_folder=self.data_cfg.get('video_folder', None),
             image_aspect_ratio=self.data_cfg.image_aspect_ratio,
             use_im_start_end=getattr(self.cfg.mm_cfg, 'use_im_start_end', False),
             image_processor=None,
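Both hunks in the patch above use the same defensive pattern: read optional config keys with `.get(key, default)` instead of direct attribute access or a bare `.get()`, so configs that predate `media_type` and `video_folder` keep working. A minimal sketch with a hypothetical OmegaConf config is below; only the access pattern mirrors the patch, the keys and paths are illustrative.

```python
# Minimal sketch of the defensive-defaults pattern from the two hunks above.
# The config contents here are hypothetical; only the access pattern mirrors the patch.
from omegaconf import OmegaConf

# An older-style config that predates the newer keys: no 'media_type', no 'video_folder'.
data_cfg = OmegaConf.create({'image_folder': '/data/images'})

# A bare .get() returns None for a missing key, so a check like `media_type == 'image'`
# is False and neither the image nor the video branch runs -- the behavior the patch fixes.
assert data_cfg.get('media_type') is None

# Passing an explicit default keeps older configs working without edits:
media_type = data_cfg.get('media_type', 'image')    # 'image'
image_folder = data_cfg.get('image_folder', None)   # '/data/images'
video_folder = data_cfg.get('video_folder', None)   # None instead of a missing-key error
print(media_type, image_folder, video_folder)
```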