Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change dist ckpt defaults #10913

Merged
merged 9 commits into from
Oct 24, 2024
Merged

Conversation

ShriyaPalsamudram
Copy link
Collaborator

@ShriyaPalsamudram ShriyaPalsamudram commented Oct 16, 2024

What does this PR do ?

Enable async ckpt, ckpt every 15mins and reduce preemption time from 5mins to 1min. Disable async ckpt save for the PEFT test as it is not supported at this time.

Collection: [Note which collection this PR will affect]
llm

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

ShriyaPalsamudram and others added 4 commits October 24, 2024 07:34
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
@pablo-garay pablo-garay requested review from hemildesai, cuichenx, jstjohn, maanug-nv and ashors1 and removed request for cuichenx October 24, 2024 19:10
@ShriyaPalsamudram ShriyaPalsamudram merged commit 2a34bb9 into main Oct 24, 2024
154 of 156 checks passed
@ShriyaPalsamudram ShriyaPalsamudram deleted the shriya/change_dist_ckpt_defaults_2 branch October 24, 2024 20:33
ShriyaPalsamudram added a commit that referenced this pull request Oct 24, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
yashaswikarnati pushed a commit that referenced this pull request Oct 24, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
pablo-garay added a commit that referenced this pull request Oct 25, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min



* fix ssm tests



* Make note that ckpt_async_save is disabled for SSMs



* Enable async ckpt for SSMs with fix



* Disable async ckpt in the peft test as it is a known bug, add note.



* Fix failing unit tests



* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft




* Enable async ckpt for the peft test



* Fix peft setup test



---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
yaoyu-33 pushed a commit that referenced this pull request Oct 25, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
titu1994 pushed a commit that referenced this pull request Oct 28, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (NVIDIA#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
yaoyu-33 added a commit that referenced this pull request Nov 8, 2024
* add initial code for llama vlm

Signed-off-by: yaoyu-33 <[email protected]>

* some restructure

Signed-off-by: yaoyu-33 <[email protected]>

* add mock data placeholder

Signed-off-by: yaoyu-33 <[email protected]>

* Fix some importing

Signed-off-by: yaoyu-33 <[email protected]>

* add language component for vlm llama

* update code

Signed-off-by: yaoyu-33 <[email protected]>

* now match num of params

* update language part and fix vision part

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix

Signed-off-by: yaoyu-33 <[email protected]>

* model can now init

Signed-off-by: yaoyu-33 <[email protected]>

* minor update for llama32 text config

Signed-off-by: yaoyu-33 <[email protected]>

* make checkpoint loading work

* missing import

* match vision part tensor shapes with configs

Signed-off-by: yaoyu-33 <[email protected]>

* solve some fwd issues and mismatch issues

Signed-off-by: yaoyu-33 <[email protected]>

* add vision import

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update importer to convert both text and image weights

* importer typos and reduce clutter

* fix import qkv

* some fixes for LLM

Signed-off-by: yaoyu-33 <[email protected]>

* Add embedding

* some updates

Signed-off-by: yaoyu-33 <[email protected]>

* enable loading only text or only vision

* add example script

* TP fix

Signed-off-by: yaoyu-33 <[email protected]>

* update

* upload examples

Signed-off-by: yaoyu-33 <[email protected]>

* update generate

Signed-off-by: yaoyu-33 <[email protected]>

* update to newer version

Signed-off-by: yaoyu-33 <[email protected]>

* upload for sharing

* update to new pyt ckpt

* xattn_caches matches (except small differences due to TE RMSNorm)

* cleanup

* embeddings match

* match precision of weights

* update sharded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* change xattn layer num to 3 7 11 etc

* upload llama generation

* minor fix

* fix dummy layer input format

* fix vision qkv order

* fix shareded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* fix vision precision

* fix rope

* match cross attn layer

* remove nrep

* Remove cross attention in ImageTransformerLayer and fix _gate_ffn

* PP draft

Signed-off-by: yaoyu-33 <[email protected]>

* Fix intermediate tensor

* temp save for pp2 is working

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp issues

Signed-off-by: yaoyu-33 <[email protected]>

* merge

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* small update to pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added energon dataloader for neva training (#10451)

* added energon dataloader for neva training

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* specify global batch size to support grad accumulation

* adding neva pretrain example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* change pretraine example to handle new ckpt reloading

* fixed code quality warnings and unused imports

Signed-off-by: ykarnati <[email protected]>

* minor changes for PR comments

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* refactor conversation template config

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove optional import

---------

Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
(cherry picked from commit 7354740)

* llama energon dataloader

* have tokenizer for base task encoder class

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* Add simple inference

* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need #10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and  refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for tesst tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Rolllback megatron_parallel.py

Signed-off-by: Yu Yao <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
lilyw97 pushed a commit to lilyw97/NeMo that referenced this pull request Nov 13, 2024
* add initial code for llama vlm

Signed-off-by: yaoyu-33 <[email protected]>

* some restructure

Signed-off-by: yaoyu-33 <[email protected]>

* add mock data placeholder

Signed-off-by: yaoyu-33 <[email protected]>

* Fix some importing

Signed-off-by: yaoyu-33 <[email protected]>

* add language component for vlm llama

* update code

Signed-off-by: yaoyu-33 <[email protected]>

* now match num of params

* update language part and fix vision part

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix

Signed-off-by: yaoyu-33 <[email protected]>

* model can now init

Signed-off-by: yaoyu-33 <[email protected]>

* minor update for llama32 text config

Signed-off-by: yaoyu-33 <[email protected]>

* make checkpoint loading work

* missing import

* match vision part tensor shapes with configs

Signed-off-by: yaoyu-33 <[email protected]>

* solve some fwd issues and mismatch issues

Signed-off-by: yaoyu-33 <[email protected]>

* add vision import

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update importer to convert both text and image weights

* importer typos and reduce clutter

* fix import qkv

* some fixes for LLM

Signed-off-by: yaoyu-33 <[email protected]>

* Add embedding

* some updates

Signed-off-by: yaoyu-33 <[email protected]>

* enable loading only text or only vision

* add example script

* TP fix

Signed-off-by: yaoyu-33 <[email protected]>

* update

* upload examples

Signed-off-by: yaoyu-33 <[email protected]>

* update generate

Signed-off-by: yaoyu-33 <[email protected]>

* update to newer version

Signed-off-by: yaoyu-33 <[email protected]>

* upload for sharing

* update to new pyt ckpt

* xattn_caches matches (except small differences due to TE RMSNorm)

* cleanup

* embeddings match

* match precision of weights

* update sharded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* change xattn layer num to 3 7 11 etc

* upload llama generation

* minor fix

* fix dummy layer input format

* fix vision qkv order

* fix shareded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* fix vision precision

* fix rope

* match cross attn layer

* remove nrep

* Remove cross attention in ImageTransformerLayer and fix _gate_ffn

* PP draft

Signed-off-by: yaoyu-33 <[email protected]>

* Fix intermediate tensor

* temp save for pp2 is working

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp issues

Signed-off-by: yaoyu-33 <[email protected]>

* merge

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* small update to pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added energon dataloader for neva training (NVIDIA#10451)

* added energon dataloader for neva training

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* specify global batch size to support grad accumulation

* adding neva pretrain example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* change pretraine example to handle new ckpt reloading

* fixed code quality warnings and unused imports

Signed-off-by: ykarnati <[email protected]>

* minor changes for PR comments

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* refactor conversation template config

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove optional import

---------

Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
(cherry picked from commit 7354740)

* llama energon dataloader

* have tokenizer for base task encoder class

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* Add simple inference

* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need NVIDIA#10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (NVIDIA#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (NVIDIA#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (NVIDIA#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (NVIDIA#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (NVIDIA#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (NVIDIA#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (NVIDIA#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (NVIDIA#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (NVIDIA#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (NVIDIA#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (NVIDIA#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (NVIDIA#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (NVIDIA#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (NVIDIA#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (NVIDIA#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (NVIDIA#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and  refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for tesst tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (NVIDIA#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (NVIDIA#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Rolllback megatron_parallel.py

Signed-off-by: Yu Yao <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
HuiyingLi pushed a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024
* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (NVIDIA#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
HuiyingLi added a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024
* add initial code for llama vlm

Signed-off-by: yaoyu-33 <[email protected]>

* some restructure

Signed-off-by: yaoyu-33 <[email protected]>

* add mock data placeholder

Signed-off-by: yaoyu-33 <[email protected]>

* Fix some importing

Signed-off-by: yaoyu-33 <[email protected]>

* add language component for vlm llama

* update code

Signed-off-by: yaoyu-33 <[email protected]>

* now match num of params

* update language part and fix vision part

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix

Signed-off-by: yaoyu-33 <[email protected]>

* model can now init

Signed-off-by: yaoyu-33 <[email protected]>

* minor update for llama32 text config

Signed-off-by: yaoyu-33 <[email protected]>

* make checkpoint loading work

* missing import

* match vision part tensor shapes with configs

Signed-off-by: yaoyu-33 <[email protected]>

* solve some fwd issues and mismatch issues

Signed-off-by: yaoyu-33 <[email protected]>

* add vision import

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update importer to convert both text and image weights

* importer typos and reduce clutter

* fix import qkv

* some fixes for LLM

Signed-off-by: yaoyu-33 <[email protected]>

* Add embedding

* some updates

Signed-off-by: yaoyu-33 <[email protected]>

* enable loading only text or only vision

* add example script

* TP fix

Signed-off-by: yaoyu-33 <[email protected]>

* update

* upload examples

Signed-off-by: yaoyu-33 <[email protected]>

* update generate

Signed-off-by: yaoyu-33 <[email protected]>

* update to newer version

Signed-off-by: yaoyu-33 <[email protected]>

* upload for sharing

* update to new pyt ckpt

* xattn_caches matches (except small differences due to TE RMSNorm)

* cleanup

* embeddings match

* match precision of weights

* update sharded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* change xattn layer num to 3 7 11 etc

* upload llama generation

* minor fix

* fix dummy layer input format

* fix vision qkv order

* fix shareded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* fix vision precision

* fix rope

* match cross attn layer

* remove nrep

* Remove cross attention in ImageTransformerLayer and fix _gate_ffn

* PP draft

Signed-off-by: yaoyu-33 <[email protected]>

* Fix intermediate tensor

* temp save for pp2 is working

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp issues

Signed-off-by: yaoyu-33 <[email protected]>

* merge

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* small update to pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added energon dataloader for neva training (NVIDIA#10451)

* added energon dataloader for neva training

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* specify global batch size to support grad accumulation

* adding neva pretrain example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* change pretraine example to handle new ckpt reloading

* fixed code quality warnings and unused imports

Signed-off-by: ykarnati <[email protected]>

* minor changes for PR comments

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* refactor conversation template config

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove optional import

---------

Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
(cherry picked from commit 7354740)

* llama energon dataloader

* have tokenizer for base task encoder class

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* Add simple inference

* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need NVIDIA#10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (NVIDIA#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (NVIDIA#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (NVIDIA#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (NVIDIA#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (NVIDIA#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (NVIDIA#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (NVIDIA#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (NVIDIA#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (NVIDIA#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (NVIDIA#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (NVIDIA#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (NVIDIA#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (NVIDIA#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (NVIDIA#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (NVIDIA#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (NVIDIA#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and  refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for tesst tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (NVIDIA#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (NVIDIA#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Rolllback megatron_parallel.py

Signed-off-by: Yu Yao <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
yaoyu-33 added a commit that referenced this pull request Nov 21, 2024
* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need #10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and  refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for tesst tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* neva update

Signed-off-by: yaoyu-33 <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* Fix PP

Signed-off-by: yaoyu-33 <[email protected]>

* add examples

Signed-off-by: yaoyu-33 <[email protected]>

* fix test

Signed-off-by: yaoyu-33 <[email protected]>

* try fix test

Signed-off-by: yaoyu-33 <[email protected]>

* try fix test

Signed-off-by: yaoyu-33 <[email protected]>

* Fix megatron megatron_init.py dp

Signed-off-by: Yu Yao <[email protected]>

* Update lightning megatron_init.py dp

Signed-off-by: Yu Yao <[email protected]>

* make it possible to update pre_preprocess and post_process for llm, required in vlm

Signed-off-by: yaoyu-33 <[email protected]>

* Fixes for neva to run with PP

Signed-off-by: yaoyu-33 <[email protected]>

* Add mcore vit support, and checkpoint conversion

Signed-off-by: yaoyu-33 <[email protected]>

* fix checkpoint loading for epp

Signed-off-by: yaoyu-33 <[email protected]>

* update script

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added datamodule for llava-next

* modified state dict transform

* neva model changes to support  llava-next

* remove accidentally checked in files

Signed-off-by: Yashaswi Karnati <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove unused imports

* added io_init to not save task_encoder and image_processor

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* added scripts for pretrain and finetune

Signed-off-by: Yashaswi Karnati <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* generation example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* small change in llava next example

* llava next end-end train

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* finetune changes

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* finetune debug changes

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* added example generation script

* added doc strings, formating, remove debug statemens and unsued imports

* remove example scripts

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* multiple fixes

Signed-off-by: yaoyu-33 <[email protected]>

* bug fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* Fix for SP

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Update layer spec and add siglip support

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add neva training recipes

Signed-off-by: yaoyu-33 <[email protected]>

* fix mllama mock ds

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix recipe

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp

Signed-off-by: yaoyu-33 <[email protected]>

* scripts update

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* scripts update

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update config api

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* few updates

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update 70b

Signed-off-by: yaoyu-33 <[email protected]>

* hide examples for pr

Signed-off-by: yaoyu-33 <[email protected]>

* fix few issues

Signed-off-by: yaoyu-33 <[email protected]>

* add docstring layer spec

Signed-off-by: yaoyu-33 <[email protected]>

* add docstring to vit config

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>
Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
yashaswikarnati added a commit that referenced this pull request Nov 21, 2024
* add initial code for llama vlm

Signed-off-by: yaoyu-33 <[email protected]>

* some restructure

Signed-off-by: yaoyu-33 <[email protected]>

* add mock data placeholder

Signed-off-by: yaoyu-33 <[email protected]>

* Fix some importing

Signed-off-by: yaoyu-33 <[email protected]>

* add language component for vlm llama

* update code

Signed-off-by: yaoyu-33 <[email protected]>

* now match num of params

* update language part and fix vision part

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix

Signed-off-by: yaoyu-33 <[email protected]>

* model can now init

Signed-off-by: yaoyu-33 <[email protected]>

* minor update for llama32 text config

Signed-off-by: yaoyu-33 <[email protected]>

* make checkpoint loading work

* missing import

* match vision part tensor shapes with configs

Signed-off-by: yaoyu-33 <[email protected]>

* solve some fwd issues and mismatch issues

Signed-off-by: yaoyu-33 <[email protected]>

* add vision import

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* update importer to convert both text and image weights

* importer typos and reduce clutter

* fix import qkv

* some fixes for LLM

Signed-off-by: yaoyu-33 <[email protected]>

* Add embedding

* some updates

Signed-off-by: yaoyu-33 <[email protected]>

* enable loading only text or only vision

* add example script

* TP fix

Signed-off-by: yaoyu-33 <[email protected]>

* update

* upload examples

Signed-off-by: yaoyu-33 <[email protected]>

* update generate

Signed-off-by: yaoyu-33 <[email protected]>

* update to newer version

Signed-off-by: yaoyu-33 <[email protected]>

* upload for sharing

* update to new pyt ckpt

* xattn_caches matches (except small differences due to TE RMSNorm)

* cleanup

* embeddings match

* match precision of weights

* update sharded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* change xattn layer num to 3 7 11 etc

* upload llama generation

* minor fix

* fix dummy layer input format

* fix vision qkv order

* fix shareded state dict

Signed-off-by: yaoyu-33 <[email protected]>

* fix vision precision

* fix rope

* match cross attn layer

* remove nrep

* Remove cross attention in ImageTransformerLayer and fix _gate_ffn

* PP draft

Signed-off-by: yaoyu-33 <[email protected]>

* Fix intermediate tensor

* temp save for pp2 is working

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp issues

Signed-off-by: yaoyu-33 <[email protected]>

* merge

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* small update to pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added energon dataloader for neva training (#10451)

* added energon dataloader for neva training

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* specify global batch size to support grad accumulation

* adding neva pretrain example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* change pretraine example to handle new ckpt reloading

* fixed code quality warnings and unused imports

Signed-off-by: ykarnati <[email protected]>

* minor changes for PR comments

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* refactor conversation template config

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove optional import

---------

Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
(cherry picked from commit 7354740)

* llama energon dataloader

* have tokenizer for base task encoder class

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* Add simple inference

* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need #10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and  refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for tesst tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Rolllback megatron_parallel.py

Signed-off-by: Yu Yao <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants