Add PP support in NeVA along with a few bug fixes (#11170)
* evian3 update

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

Signed-off-by: yaoyu-33 <[email protected]>

* add aspect ratio in model

* support energon dataloader

* some pp update

Signed-off-by: yaoyu-33 <[email protected]>

* fixes

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv merging

Signed-off-by: yaoyu-33 <[email protected]>

* fix get_key_value_tensors

Signed-off-by: yaoyu-33 <[email protected]>

* rename files

Signed-off-by: yaoyu-33 <[email protected]>

* update to HF style position embedding

Signed-off-by: yaoyu-33 <[email protected]>

* fix energon dataloader and support batching

* update forward args

Signed-off-by: yaoyu-33 <[email protected]>

* clean up and move to aspect_ratio_ids

Signed-off-by: yaoyu-33 <[email protected]>

* rename back to language.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix loss function

Signed-off-by: yaoyu-33 <[email protected]>

* update and fix energon

Signed-off-by: yaoyu-33 <[email protected]>

* Add hf import

* Fix type

* Change config

* update energon pretrain

Signed-off-by: yaoyu-33 <[email protected]>

* clean up

* clean up

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* update inference files for new code

* update to instruct

* update to instruct

* update few names

Signed-off-by: yaoyu-33 <[email protected]>

* update generation

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer embedding.weight

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add hf script

Signed-off-by: yaoyu-33 <[email protected]>

* fix kv import

* remove interleaved

* fixes and updates

Signed-off-by: yaoyu-33 <[email protected]>

* lora fixes

Signed-off-by: yaoyu-33 <[email protected]>

* some code clean ups

Signed-off-by: yaoyu-33 <[email protected]>

* update training scripts

Signed-off-by: yaoyu-33 <[email protected]>

* refactors

Signed-off-by: yaoyu-33 <[email protected]>

* add LoRA finetuning

* fixes and nemo update

Signed-off-by: yaoyu-33 <[email protected]>

* fix importer registering issue by adding 11B and 90B configs

* update `decoder_seq_len`

Signed-off-by: yaoyu-33 <[email protected]>

* science vqa script

Signed-off-by: yaoyu-33 <[email protected]>

* clean up script name

Signed-off-by: yaoyu-33 <[email protected]>

* fix ckpt save serialization issue

* fix predefined config classes

* add num_chunks in input

Signed-off-by: yaoyu-33 <[email protected]>

* fix format

Signed-off-by: yaoyu-33 <[email protected]>

* update finetuning scripts for PEFT

* add 11b recipe (need #10645 to test)

* fix mask generation

Signed-off-by: yaoyu-33 <[email protected]>

* minor fix code style

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Support no image inference

* add llama svqa eval

* fix masking

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix generation

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add 90b recipe and revise 11b recipe

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* clean up typing

* add option to disable vision padding

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* base model finetuning (does not work yet)

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* fixed default conversation template config for MLLama

* Update svqa

* add multinode

* bot happy

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Perf improvements. Mainly from XAttn mask calculation (#10901)

* Perf improvements. Mainly from XAttn mask calculation

* Apply isort and black reformatting

Signed-off-by: parthmannan <[email protected]>

---------

Signed-off-by: parthmannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>

* fix existing issues

Signed-off-by: yaoyu-33 <[email protected]>

* fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix lora

* few fixes for non image support

Signed-off-by: yaoyu-33 <[email protected]>

* update masking gen

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* fix data sampler and loading issue

Signed-off-by: yaoyu-33 <[email protected]>

* Add vlm generation

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* generation update

Signed-off-by: yaoyu-33 <[email protected]>

* update lazy dataset

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* hide vlm examples

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "Add vlm generation"

This reverts commit 4711c75

Signed-off-by: yaoyu-33 <[email protected]>

* Fix VisionEncoder multi-batch bug

* update mcore parallelism initialization

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update megatron_init.py

Signed-off-by: Yu Yao <[email protected]>

* add encoder parallel default config

Signed-off-by: yaoyu-33 <[email protected]>

* Fix _strategy_lib.py

Signed-off-by: Yu Yao <[email protected]>

* llm.generate fixes (#10983)

* fix context path, disable optimizer init, add tp

Signed-off-by: HuiyingLi <[email protected]>

* format

Signed-off-by: HuiyingLi <[email protected]>

* address comments, require user to provide trainer

Signed-off-by: HuiyingLi <[email protected]>

* minor fix

Signed-off-by: HuiyingLi <[email protected]>

* minor fixes

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>

* use __dict__ in check (#11012)

* check is_hf_model in leaf module

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* disable getattr alternative path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* LoRA support for HF::AutoModelForCausalLM (#10982)

* add LinearAdapter

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf lora example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* subclass mixin

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* undo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix scale

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* regex selector for peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fmt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* hf_auto_model_for_causal_lm finetune recipe

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Change default for always_save_context to True (#11014)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

* Add a build option to load_context (#10713)

* Add a build option to load_context

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Adding test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Trying to fix failing CPU test

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cherry-pick fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>

* Fix pip install (#11026)

* Move AutoTokenizer inline

Signed-off-by: Marc Romeyn <[email protected]>

* Move einops to common requirements

Signed-off-by: Marc Romeyn <[email protected]>

* Move AutoTokenizer import to top-level again in fine_tuning

Signed-off-by: Marc Romeyn <[email protected]>

* Move megatron init inside nemo.lightning

Signed-off-by: Marc Romeyn <[email protected]>

* Make megatron_lazy_init_context work when transformer-engine is not installed

Signed-off-by: Marc Romeyn <[email protected]>

* Only import get_nmt_tokenizer when needed

Signed-off-by: Marc Romeyn <[email protected]>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <[email protected]>

---------

Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Co-authored-by: marcromeyn <[email protected]>

* [WIP] Add docs for NEST SSL (#10804)

* add docs

Signed-off-by: stevehuang52 <[email protected]>

* update doc and fix missing param

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Change dist ckpt defaults (#10913)

* Enable ckpt features by default (async ckpt), ckpt every 15mins and reduce preemption time to 1min

Signed-off-by: Shriya Palsamudram <[email protected]>

* fix ssm tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Make note that ckpt_async_save is disabled for SSMs

Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for SSMs with fix

Signed-off-by: Shriya Palsamudram <[email protected]>

* Disable async ckpt in the peft test as it is a known bug, add note.

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix failing unit tests

Signed-off-by: Shriya Palsamudram <[email protected]>

* Ashors/peft async ckpt (#11010)

* [WIP] prototype for supporting async checkpointing with peft

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>

* Enable async ckpt for the peft test

Signed-off-by: Shriya Palsamudram <[email protected]>

* Fix peft setup test

Signed-off-by: Shriya Palsamudram <[email protected]>

---------

Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>

* Akoumparouli/mixtral recipe fix r2.0.0 (#10994)

* Mixtral TP8 EP1

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix _strategy_lib tests (#11033)

* fix world size and don't mock

Signed-off-by: Maanu Grover <[email protected]>

* cleanup global state

Signed-off-by: Maanu Grover <[email protected]>

* check app state instead

Signed-off-by: Maanu Grover <[email protected]>

* fix syntax nemo logger test

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` (#11016)

* Revert "[NeMo-UX] Use custom `BatchProgress` class which does not restore states (#10383)"

This reverts commit b5798de.

* make megatron sampler return the total number of batches in the dataset

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* PTQ example for NeMo 2.0 (#10642)

* initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* create Quantizer for NeMo 2.0

Signed-off-by: Piotr Kaminski <[email protected]>

* refactor

Signed-off-by: Piotr Kaminski <[email protected]>

* Call quantize on an unwrapped mcore model

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Add tests, adjust unwrapping

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* fix export

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: artbataev <[email protected]>

* Fix output_path argument for HF import

Signed-off-by: Piotr Kamiński <[email protected]>

* fix fabric ckpt loading

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* code review suggestions

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* remove unused import

Signed-off-by: Piotr Kaminski <[email protected]>

* use cnn dataset in github ci

Signed-off-by: Piotr Kaminski <[email protected]>

* applied code review

Signed-off-by: Piotr Kaminski <[email protected]>

* code review changes

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* simplify interface for data iterator

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

* (partial) PP fix

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: artbataev <[email protected]>

* TDT compute timestamps option and Extra Whitespace handling for SPE (#10875)

* add token duration

Signed-off-by: monica-sekoyan <[email protected]>

* revert rnnt change

Signed-off-by: monica-sekoyan <[email protected]>

* add remove_extra_whitespaces arg to spe tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* add token duration retrieval

Signed-off-by: monica-sekoyan <[email protected]>

* add ignore_extra_whitespace to spe

Signed-off-by: monica-sekoyan <[email protected]>

* add compute_timestamp support for tdt

Signed-off-by: monica-sekoyan <[email protected]>

* fix config field name

Signed-off-by: monica-sekoyan <[email protected]>

* add refinement for tdt timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add segments timestamp support and refinement for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* modify tests for ctc decoding timestamps

Signed-off-by: monica-sekoyan <[email protected]>

* add rnnt timestamp tests

Signed-off-by: monica-sekoyan <[email protected]>

* updated doc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in test

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* fix of unicode char

Signed-off-by: monica-sekoyan <[email protected]>

* fix rnnt_decoding test

Signed-off-by: monica-sekoyan <[email protected]>

* workaround for test tokenizer

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments formation

Signed-off-by: monica-sekoyan <[email protected]>

* modify segments for ctc

Signed-off-by: monica-sekoyan <[email protected]>

* fix in ctc refinement

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* reverse offset change

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* warning mode=once

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

* make ignore_extrawhitespaces false

Signed-off-by: monica-sekoyan <[email protected]>

* minor changes

Signed-off-by: monica-sekoyan <[email protected]>

* adjust changes to the tests

Signed-off-by: monica-sekoyan <[email protected]>

* modify prompt_formatter tests

Signed-off-by: monica-sekoyan <[email protected]>

* Apply isort and black reformatting

Signed-off-by: monica-sekoyan <[email protected]>

---------

Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>

* Basic online dynamic FP8 quantization with vLLM (#10904)

* Basic online dynamic quantization with vLLM

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>

* vllm 0.6.3 updates

Signed-off-by: Jan Lasek <[email protected]>

* Pass quantization param in deploy_vllm_triton.py script

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: janekl <[email protected]>

* ci: Improve VM maintenance (#10758)

* ci: Improve VM maintenance

Signed-off-by: Oliver Koenig <[email protected]>

* rename stuff

Signed-off-by: Oliver Koenig <[email protected]>

* title

Signed-off-by: Oliver Koenig <[email protected]>

* use team

Signed-off-by: Oliver Koenig <[email protected]>

* run on failure too

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* yrdy

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

* f

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* neva update

Signed-off-by: yaoyu-33 <[email protected]>

* Add comment for vision transpose

* update megatron_init.py inside lightning

Signed-off-by: yaoyu-33 <[email protected]>

* Fix PP

Signed-off-by: yaoyu-33 <[email protected]>

* add examples

Signed-off-by: yaoyu-33 <[email protected]>

* fix test

Signed-off-by: yaoyu-33 <[email protected]>

* try fix test

Signed-off-by: yaoyu-33 <[email protected]>

* try fix test

Signed-off-by: yaoyu-33 <[email protected]>

* Fix megatron megatron_init.py dp

Signed-off-by: Yu Yao <[email protected]>

* Update lightning megatron_init.py dp

Signed-off-by: Yu Yao <[email protected]>

* make it possible to update pre_process and post_process for llm, required in vlm

Signed-off-by: yaoyu-33 <[email protected]>

* Fixes for neva to run with PP

Signed-off-by: yaoyu-33 <[email protected]>

* Add mcore vit support, and checkpoint conversion

Signed-off-by: yaoyu-33 <[email protected]>

* fix checkpoint loading for epp

Signed-off-by: yaoyu-33 <[email protected]>

* update script

Signed-off-by: yaoyu-33 <[email protected]>

* rename llama to mllama folder name

Signed-off-by: yaoyu-33 <[email protected]>

* update to attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* added datamodule for llava-next

* modified state dict transform

* neva model changes to support  llava-next

* remove accidentally checked in files

Signed-off-by: Yashaswi Karnati <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* remove unused imports

* added io_init to not save task_encoder and image_processor

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* added scripts for pretrain and finetune

Signed-off-by: Yashaswi Karnati <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* generation example

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* small change in llava next example

* llava next end-end train

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* finetune changes

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* finetune debug changes

* update dropout to 0

Signed-off-by: yaoyu-33 <[email protected]>

* added example generation script

* added docstrings and formatting, removed debug statements and unused imports

* remove example scripts

* fix attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* remove disable_vision_padding since we now have a fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Update init for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Address comments

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright title

Signed-off-by: yaoyu-33 <[email protected]>

* multiple fixes

Signed-off-by: yaoyu-33 <[email protected]>

* bug fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix code scan

Signed-off-by: yaoyu-33 <[email protected]>

* Fix for SP

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update vision code

Signed-off-by: yaoyu-33 <[email protected]>

* revert attention bias changes until latest MLM code got merged

Signed-off-by: yaoyu-33 <[email protected]>

* fix warning

Signed-off-by: yaoyu-33 <[email protected]>

* Turn off system message check, as it's "" now

Signed-off-by: yaoyu-33 <[email protected]>

* Update layer spec and add siglip support

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update pretrain script

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Fix scripts

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* add neva training recipes

Signed-off-by: yaoyu-33 <[email protected]>

* fix mllama mock ds

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix recipe

Signed-off-by: yaoyu-33 <[email protected]>

* fix pp

Signed-off-by: yaoyu-33 <[email protected]>

* scripts update

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* scripts update

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update config api

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* few updates

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update 70b

Signed-off-by: yaoyu-33 <[email protected]>

* hide examples for pr

Signed-off-by: yaoyu-33 <[email protected]>

* fix few issues

Signed-off-by: yaoyu-33 <[email protected]>

* add docstring layer spec

Signed-off-by: yaoyu-33 <[email protected]>

* add docstring to vit config

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix copyright

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: artbataev <[email protected]>
Signed-off-by: parthmannan <[email protected]>
Signed-off-by: meatybobby <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Marc Romeijn <[email protected]>
Signed-off-by: Marc Romeyn <[email protected]>
Signed-off-by: marcromeyn <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Shriya Palsamudram <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Signed-off-by: Piotr Kamiński <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: monica-sekoyan <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>
Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: cuichenx <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: artbataev <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: parthmannan <[email protected]>
Co-authored-by: meatybobby <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Marc Romeyn <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: marcromeyn <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Shriya Rishab <[email protected]>
Co-authored-by: ataghibakhsh <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: Piotr Kamiński <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: monica-sekoyan <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: janekl <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: ykarnati <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
1 parent 34f7408 commit 773590c
Showing 23 changed files with 1,128 additions and 457 deletions.
6 changes: 6 additions & 0 deletions nemo/collections/llm/fn/activation.py
@@ -13,6 +13,7 @@
# limitations under the License.

import torch
from megatron.core.jit import jit_fuser


@torch.jit.script
@@ -25,6 +26,11 @@ def openai_gelu(x):
return gelu_impl(x)


@jit_fuser
def quick_gelu(x: torch.Tensor) -> torch.Tensor:
return x * torch.sigmoid(1.702 * x)


# @torch.jit.script # remove until we have serialization
def squared_relu(x):
"""Squared ReLU activation function."""
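For context, quick_gelu is the sigmoid approximation of GELU commonly used by CLIP-style vision encoders. A small sanity check, purely illustrative and not part of the commit:

import torch
from nemo.collections.llm.fn.activation import quick_gelu

x = torch.randn(4, 8)
# x * sigmoid(1.702 * x) tracks the exact GELU closely; the max gap is on the order of 1e-2
print((quick_gelu(x) - torch.nn.functional.gelu(x)).abs().max())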
6 changes: 3 additions & 3 deletions nemo/collections/llm/gpt/model/base.py
@@ -179,7 +179,7 @@ class GPTConfig(TransformerConfig, io.IOMixin):
forward_step_fn: Callable = gpt_forward_step
data_step_fn: Callable = gpt_data_step

- def configure_model(self, tokenizer) -> "MCoreGPTModel":
+ def configure_model(self, tokenizer, pre_process=None, post_process=None) -> "MCoreGPTModel":
vp_size = self.virtual_pipeline_model_parallel_size
if vp_size:
p_size = self.pipeline_model_parallel_size
@@ -214,8 +214,8 @@ def configure_model(self, tokenizer) -> "MCoreGPTModel":
rotary_percent=self.rotary_percent,
rotary_base=self.rotary_base,
seq_len_interpolation_factor=self.seq_len_interpolation_factor,
- pre_process=parallel_state.is_pipeline_first_stage(),
- post_process=parallel_state.is_pipeline_last_stage(),
+ pre_process=pre_process or parallel_state.is_pipeline_first_stage(),
+ post_process=post_process or parallel_state.is_pipeline_last_stage(),
)

# If using full TE layer, need to set TP, CP group since the module call
4 changes: 2 additions & 2 deletions nemo/collections/llm/gpt/model/llama.py
@@ -115,8 +115,8 @@ class Llama31Config(Llama3Config):
old_context_len: int = 8192
init_method_std: float = 0.02

- def configure_model(self, tokenizer) -> "MCoreGPTModel":
- model = super().configure_model(tokenizer)
+ def configure_model(self, tokenizer, pre_process=None, post_process=None) -> "MCoreGPTModel":
+ model = super().configure_model(tokenizer, pre_process, post_process)
# Apply rope scaling for Llama3.1 model
model.rotary_pos_emb.inv_freq = apply_rope_scaling(
model.rotary_pos_emb.inv_freq,
6 changes: 3 additions & 3 deletions nemo/collections/llm/gpt/model/ssm.py
@@ -86,7 +86,7 @@ class SSMConfig(TransformerConfig, io.IOMixin):
data_step_fn: Callable = gpt_data_step
tokenizer_model_path: str = None

- def configure_model(self, tokenizer) -> "MCoreMambaModel":
+ def configure_model(self, tokenizer, pre_process=None, post_process=None) -> "MCoreMambaModel":

return MCoreMambaModel(
self,
@@ -101,8 +101,8 @@ def configure_model(self, tokenizer) -> "MCoreMambaModel":
rotary_percent=self.rotary_percent,
rotary_base=self.rotary_base,
seq_len_interpolation_factor=self.seq_len_interpolation_factor,
- pre_process=parallel_state.is_pipeline_first_stage(),
- post_process=parallel_state.is_pipeline_last_stage(),
+ pre_process=pre_process or parallel_state.is_pipeline_first_stage(),
+ post_process=post_process or parallel_state.is_pipeline_last_stage(),
)


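Taken together, these configure_model changes let a caller decide where the embedding stage (pre_process) and the output head (post_process) live instead of always deriving them from the global pipeline stage, which is what NeVA needs once a vision encoder occupies earlier pipeline ranks. Note that the defaults are applied with "or", so only truthy overrides take effect; passing None (or False) falls back to the previous behaviour. A minimal usage sketch, with the wrapper rule hypothetical:

from megatron.core import parallel_state

def configure_language_module(config, tokenizer, first_decoder_stage: bool):
    # Hypothetical wrapper logic: the first *decoder* stage may not be the global first
    # pipeline stage when an encoder occupies earlier ranks, so force pre_process there.
    return config.configure_model(
        tokenizer,
        pre_process=True if first_decoder_stage else None,
        post_process=parallel_state.is_pipeline_last_stage(),
    )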
11 changes: 8 additions & 3 deletions nemo/collections/vlm/__init__.py
@@ -29,6 +29,7 @@
DataConfig,
ImageDataConfig,
ImageToken,
LlavaNextTaskEncoder,
MultiModalToken,
NevaLazyDataModule,
NevaMockDataModule,
@@ -42,7 +43,8 @@
NevaConfig,
NevaModel,
)
- from nemo.collections.vlm.neva.model.llava import Llava1_5Config7B, Llava1_5Config13B, LlavaConfig, LlavaModel
+ from nemo.collections.vlm.neva.model.llava import Llava15Config7B, Llava15Config13B, LlavaConfig, LlavaModel
+ from nemo.collections.vlm.neva.model.vit_config import CLIPViTL_14_336_Config, SigLIPViT400M_14_384_Config
from nemo.collections.vlm.peft import LoRA
from nemo.collections.vlm.recipes import *

@@ -59,13 +61,16 @@
"VideoToken",
"CLIPViTConfig",
"HFCLIPVisionConfig",
"CLIPViTL_14_336_Config",
"SigLIPViT400M_14_384_Config",
"MultimodalProjectorConfig",
"NevaConfig",
"NevaModel",
"LlavaConfig",
"Llava1_5Config7B",
"Llava1_5Config13B",
"Llava15Config7B",
"Llava15Config13B",
"LlavaModel",
"LlavaNextTaskEncoder",
"MLlamaModel",
"MLlamaModelConfig",
"CrossAttentionTextConfig",
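Net effect on the public API: the LLaVA configs are renamed from Llava1_5Config* to Llava15Config*, and the ViT configs plus the LLaVA-NeXT task encoder are now importable from the package root. Illustrative usage (model construction shown only as a sketch, following the config-first NeMo 2.0 pattern):

from nemo.collections.vlm import (
    CLIPViTL_14_336_Config,
    Llava15Config7B,
    LlavaModel,
    LlavaNextTaskEncoder,
    SigLIPViT400M_14_384_Config,
)

model = LlavaModel(Llava15Config7B())  # config passed as the first argument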
131 changes: 131 additions & 0 deletions nemo/collections/vlm/layer_specs.py
@@ -0,0 +1,131 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.dot_product_attention import DotProductAttention
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import TransformerLayer, TransformerLayerSubmodules

try:
from megatron.core.extensions.transformer_engine import (
TEColumnParallelLinear,
TEDotProductAttention,
TELayerNormColumnParallelLinear,
TENorm,
TERowParallelLinear,
)

HAVE_TE = True
except ImportError:
HAVE_TE = False

try:
from megatron.core.fusions.fused_layer_norm import FusedLayerNorm

HAVE_APEX = True
LNImpl = FusedLayerNorm
except ImportError:
import warnings

from megatron.core.transformer.torch_layer_norm import WrappedTorchLayerNorm

warnings.warn(f'Apex is not installed. Falling back to Torch LayerNorm')
LNImpl = WrappedTorchLayerNorm


def get_layer_spec(is_vit, normalization) -> ModuleSpec:
"""Transformer Layer Spec"""
attn_mask_type = AttnMaskType.no_mask if is_vit else AttnMaskType.causal
if normalization == "LayerNorm":
norm = LNImpl
elif normalization == "RMSNorm":
norm = TENorm
else:
raise RuntimeError("unknown normalization", normalization)

mlp = get_mlp_module_spec(use_te=False) # doesn't include norm.

return ModuleSpec(
module=TransformerLayer,
submodules=TransformerLayerSubmodules(
input_layernorm=norm,
self_attention=ModuleSpec(
module=SelfAttention,
params={"attn_mask_type": attn_mask_type},
submodules=SelfAttentionSubmodules(
linear_qkv=ColumnParallelLinear,
core_attention=DotProductAttention,
linear_proj=RowParallelLinear,
q_layernorm=IdentityOp,
k_layernorm=IdentityOp,
),
),
self_attn_bda=get_bias_dropout_add,
pre_mlp_layernorm=norm,
mlp=mlp,
mlp_bda=get_bias_dropout_add,
),
)


def get_layer_spec_te(is_vit=False) -> ModuleSpec:
"""Transformer Layer Spec w/ TE Modules"""
attn_mask_type = AttnMaskType.no_mask if is_vit else AttnMaskType.causal

mlp = get_norm_mlp_module_spec_te()
return ModuleSpec(
module=TransformerLayer,
submodules=TransformerLayerSubmodules(
self_attention=ModuleSpec(
module=SelfAttention,
params={"attn_mask_type": attn_mask_type},
submodules=SelfAttentionSubmodules(
linear_qkv=TELayerNormColumnParallelLinear,
core_attention=TEDotProductAttention,
linear_proj=TERowParallelLinear,
q_layernorm=IdentityOp,
k_layernorm=IdentityOp,
),
),
self_attn_bda=get_bias_dropout_add,
pre_mlp_layernorm=IdentityOp,
mlp=mlp,
mlp_bda=get_bias_dropout_add,
),
)


def get_mlp_module_spec(use_te: bool = True) -> ModuleSpec:
"""MLP Submodule Spec"""
# Dense MLP w/ or w/o TE modules.
return ModuleSpec(
module=MLP,
submodules=MLPSubmodules(
linear_fc1=TEColumnParallelLinear if use_te else ColumnParallelLinear,
linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
),
)


def get_norm_mlp_module_spec_te() -> ModuleSpec:
"""Norm + MLP Submodule Spec"""
return ModuleSpec(
module=MLP,
submodules=MLPSubmodules(linear_fc1=TELayerNormColumnParallelLinear, linear_fc2=TERowParallelLinear),
)
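The new layer_specs module centralizes the transformer layer specs shared by the vision and language towers. A small selector, sketched here as intended usage rather than code from the commit:

from nemo.collections.vlm import layer_specs

def pick_layer_spec(is_vit: bool, normalization: str = "LayerNorm"):
    # Prefer Transformer Engine modules when available; otherwise fall back to the
    # ColumnParallel/RowParallel + DotProductAttention spec defined above.
    if layer_specs.HAVE_TE:
        return layer_specs.get_layer_spec_te(is_vit=is_vit)
    return layer_specs.get_layer_spec(is_vit=is_vit, normalization=normalization)

vit_spec = pick_layer_spec(is_vit=True)   # AttnMaskType.no_mask for ViT blocks
llm_spec = pick_layer_spec(is_vit=False)  # AttnMaskType.causal for the language model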
8 changes: 6 additions & 2 deletions nemo/collections/vlm/mllama/data/mock.py
@@ -34,6 +34,8 @@ def __init__(
micro_batch_size: int = 4,
global_batch_size: int = 8,
rampup_batch_size: Optional[List[int]] = None,
tokenizer: Optional = None,
image_processor: Optional = None,
num_train_samples: int = 10_000,
num_val_samples: int = 10_000,
num_test_samples: int = 10_000,
@@ -52,6 +54,8 @@ def __init__(
self.persistent_workers = persistent_workers
self.vocab_size = vocab_size
self.crop_size = crop_size
self.tokenizer = tokenizer
self.image_processor = image_processor

self.data_sampler = MegatronDataSampler(
seq_len=self.seq_length,
@@ -142,8 +146,8 @@ def __getitem__(self, idx) -> Dict[str, torch.Tensor]:

return {
"images": images,
"masks": [[5, 512]],
"num_chunks": [4],
"masks": torch.tensor([[5, 512]]),
"num_chunks": torch.tensor([4]),
"tokens": tokens,
"aspect_ratio_ids": aspect_ratio_ids,
"loss_mask": self.loss_mask,
30 changes: 2 additions & 28 deletions nemo/collections/vlm/mllama/model/base.py
@@ -40,6 +40,7 @@
from nemo.collections.vlm.mllama.model.language import CrossAttentionTextModel
from nemo.collections.vlm.mllama.model.utils import _generate_cross_attention_mask, _pad_attention_masks
from nemo.collections.vlm.mllama.model.vision import VisionEncoder
from nemo.collections.vlm.neva.model.base import MODEL_CONFIG_ATTR
from nemo.lightning import get_vocab_size, io
from nemo.lightning.megatron_parallel import MaskedTokenLossReduction
from nemo.lightning.pytorch.optim import MegatronOptimizerModule, OptimizerModule
@@ -240,35 +241,8 @@ class MLlamaModelConfig(TransformerConfig, io.IOMixin):
data_step_fn: Callable = llama_data_step

def __post_init__(self):
- model_config_attr = [
- 'num_layers',
- 'hidden_size',
- 'num_attention_heads',
- 'num_query_groups',
- 'ffn_hidden_size',
- 'kv_channels',
- 'hidden_dropout',
- 'attention_dropout',
- 'fp32_residual_connection',
- 'apply_residual_connection_post_layernorm',
- 'layernorm_epsilon',
- 'layernorm_zero_centered_gamma',
- 'add_bias_linear',
- 'add_qkv_bias',
- 'gated_linear_unit',
- 'activation_func',
- 'activation_func_fp8_input_store',
- 'num_moe_experts',
- 'rotary_interleaved',
- 'window_size',
- 'normalization',
- 'qk_layernorm',
- 'test_mode',
- 'calculate_per_token_loss',
- ]

if self.language_model_config is not None:
- for attr in model_config_attr:
+ for attr in MODEL_CONFIG_ATTR:
setattr(self, attr, getattr(self.language_model_config, attr))

def configure_model(self, tokenizer) -> "MLlamaBaseModel":
2 changes: 2 additions & 0 deletions nemo/collections/vlm/neva/data/__init__.py
@@ -14,6 +14,7 @@

from nemo.collections.vlm.neva.data.config import DataConfig, ImageDataConfig, VideoDataConfig
from nemo.collections.vlm.neva.data.lazy import NevaLazyDataModule
from nemo.collections.vlm.neva.data.llava_next_energon import LlavaNextTaskEncoder
from nemo.collections.vlm.neva.data.mock import MockDataModule as NevaMockDataModule
from nemo.collections.vlm.neva.data.multimodal_tokens import ImageToken, MultiModalToken, VideoToken

@@ -26,4 +27,5 @@
"MultiModalToken",
"ImageToken",
"VideoToken",
"LlavaNextTaskEncoder",
]
4 changes: 2 additions & 2 deletions nemo/collections/vlm/neva/data/conversation.py
@@ -77,7 +77,6 @@ def process_chat_template(self, tokenizer_name_or_path, messages):

def get_prompt(self):
messages = self.messages
- messages = self.process_prompt_with_images(messages)

if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
@@ -100,6 +99,8 @@ def get_prompt(self):
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
# Add space to make sure the labels can be correctly generated.
self.messages[i][1] = " " + self.messages[i][1]
else:
ret += role + ":"

@@ -155,7 +156,6 @@ def get_prompt(self):
ret = self.process_chat_template(tokenizer_name_or_path, messages)

elif self.sep_style == SeparatorStyle.MLLAMA:
""" """
tokenizer_name_or_path = self.tokenizer_name_or_path or "meta-llama/Llama-3.2-11B-Vision-Instruct"
ret = self.process_chat_template(tokenizer_name_or_path, messages)

8 changes: 6 additions & 2 deletions nemo/collections/vlm/neva/data/lazy.py
@@ -251,7 +251,7 @@ def __init__(
data_config,
tokenizer,
image_processor,
- sequence_length,
+ sequence_length=None,
):
super().__init__()
if data_path is not None:
@@ -497,6 +497,7 @@ def __init__(
weights: Optional[List[float]] = None,
data_config: Optional[DataConfig] = ImageDataConfig,
seq_length: int = 2048,
decoder_seq_length: Optional[int] = None,
tokenizer: Optional = None,
image_processor: Optional = None,
micro_batch_size: int = 4,
@@ -523,6 +524,7 @@ def __init__(
self.weights = weights
self.data_config = data_config
self.seq_length = seq_length
self.decoder_seq_length = decoder_seq_length
self.tokenizer = tokenizer
self.image_processor = image_processor
self.num_train_samples = num_train_samples
@@ -538,13 +540,15 @@ def __init__(
if tokenizer is None or image_processor is None:
logging.warning(f"Processor and tokenizer are not provided! Fall back to `llava-hf/llava-1.5-7b-hf`.")
from transformers import AutoProcessor
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
- self.tokenizer = tokenizer or processor.tokenizer
+ self.tokenizer = tokenizer or AutoTokenizer("llava-hf/llava-1.5-7b-hf")
self.image_processor = image_processor or processor.image_processor

self.data_sampler = MegatronDataSampler(
seq_len=self.seq_length,
decoder_seq_len=self.decoder_seq_length,
micro_batch_size=micro_batch_size,
global_batch_size=global_batch_size,
dataloader_type="cyclic",
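The lazy data module now threads an optional decoder_seq_length through to MegatronDataSampler and falls back to NeMo's AutoTokenizer wrapper for llava-hf/llava-1.5-7b-hf when no tokenizer is supplied. A construction sketch (the data path, batch sizes, and any keyword names not visible in this diff are assumptions):

from nemo.collections.vlm import NevaLazyDataModule

data = NevaLazyDataModule(
    paths="/data/llava_instruct_150k.json",  # placeholder dataset path
    seq_length=4096,
    decoder_seq_length=None,  # only needed when the sampler tracks a separate decoder length
    micro_batch_size=4,
    global_batch_size=8,
)
# tokenizer/image_processor are omitted on purpose: the module logs a warning and
# falls back to the llava-hf/llava-1.5-7b-hf defaults, as shown in the diff above.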