diff --git a/docs/source/multimodal/api.rst b/docs/source/multimodal/api.rst index ef517d6bdd5a..63ce477273b3 100644 --- a/docs/source/multimodal/api.rst +++ b/docs/source/multimodal/api.rst @@ -10,7 +10,7 @@ Model Classes :members: __init__, configure_optimizers -.. autoclass:: nemo.collections.multimodal.models.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion +.. autoclass:: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion :show-inheritance: :no-members: :members: __init__, training_step, validation_step, setup, build_train_valid_test_datasets @@ -49,7 +49,7 @@ Modules :show-inheritance: :no-members: -.. autoclass:: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKL +.. autoclass:: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL :show-inheritance: :no-members: :members: __init__, encode, decode diff --git a/docs/source/multimodal/mllm/checkpoint.rst b/docs/source/multimodal/mllm/checkpoint.rst index 8ccb520bda4b..9ee1042588b1 100644 --- a/docs/source/multimodal/mllm/checkpoint.rst +++ b/docs/source/multimodal/mllm/checkpoint.rst @@ -108,7 +108,7 @@ Adjust model parallelism with: --target_tensor_model_parallel_size=??? \ --pipeline_model_parallel_size=??? \ --target_pipeline_model_parallel_size=??? 
\ - --model_class="nemo.collections.multimodal.models.neva.neva_model.MegatronNevaModel" \ + --model_class="nemo.collections.multimodal.models.multimodal_llm.neva.neva_model.MegatronNevaModel" \ --precision=32 \ --tokenizer_model_path=/path/to/tokenizer.model \ --tp_conversion_only diff --git a/docs/source/multimodal/text2img/insp2p.rst b/docs/source/multimodal/text2img/insp2p.rst index 20e68f5742e3..b5a04d69fd2d 100644 --- a/docs/source/multimodal/text2img/insp2p.rst +++ b/docs/source/multimodal/text2img/insp2p.rst @@ -6,7 +6,7 @@ Model Introduction InstructPix2Pix [InstructPix2Pix]_ :cite:`mm-models-insp2p` offers a unique approach to image editing using human-written instructions. Given an input image and a textual directive, the model adjusts the image according to the provided instructions. NeMo Multimodal presents a training pipeline for this conditional diffusion model, utilizing a dataset generated by harnessing the strengths of two prominent pretrained models: a language model (GPT-3) and a text-to-image model (Stable Diffusion). The InstructPix2Pix model operates swiftly, editing images within seconds, eliminating the need for per-example fine-tuning or inversion. It has demonstrated remarkable results across a wide variety of input images and written instructions. -Built upon the Stable Diffusion framework, NeMo's InstructPix2Pix shares a similar architecture with Stable Diffusion (refer to :doc:`Stable Diffusion <./sd>`). What sets it apart is its unique training dataset and the combined guidance from both image and text prompts. Specifically, InstructPix2pix ::class::``nemo.collections.multimodal.models.instruct_pix2pix.ldm.ddpm_edit.MegatronLatentDiffusionEdit`` is derived directly from Stable Diffusion's ::class::``nemo.collections.multimodal.models.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion``, with alterations to accommodate the dataset and provide support for dual guidance. 
+Built upon the Stable Diffusion framework, NeMo's InstructPix2Pix shares a similar architecture with Stable Diffusion (refer to :doc:`Stable Diffusion <./sd>`). What sets it apart is its unique training dataset and the combined guidance from both image and text prompts. Specifically, InstructPix2Pix :class:`nemo.collections.multimodal.models.instruct_pix2pix.ldm.ddpm_edit.MegatronLatentDiffusionEdit` is derived directly from Stable Diffusion's :class:`nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion`, with alterations to accommodate the dataset and provide support for dual guidance. Training Dataset -------------------- diff --git a/docs/source/multimodal/text2img/sd.rst b/docs/source/multimodal/text2img/sd.rst index 23865058ab9b..ffadeda61637 100644 --- a/docs/source/multimodal/text2img/sd.rst +++ b/docs/source/multimodal/text2img/sd.rst @@ -33,7 +33,7 @@ The VAE configuration is defined under **first_stage_config**. .. code-block:: yaml first_stage_config: - _target_: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKL + _target_: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL from_pretrained: /path/to/vae.bin embed_dim: 4 monitor: val/rec_loss diff --git a/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml b/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml index 6ff8e889aba1..c822237f8fc9 100644 --- a/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml +++ b/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml @@ -11,6 +11,7 @@ inference: compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False end_strings: ["","",] # generation will stop when one of these tokens is generated images_base_path: /pwd/images + insert_image_token: null # `left` or `right` or `null` trainer: devices: 8 @@ -24,7 +25,7 
@@ tensor_model_parallel_size: 8 pipeline_model_parallel_size: 1 pipeline_model_parallel_split_rank: 0 # used for encoder and decoder model (0 for others) neva_model_file: /pwd/nemo_experiments/nemo_llava.nemo #neva_22b_tp8_finetuned_v1.nemo neva_8b_tp4_finetuned_v1.nemo -llm_model_file: null +base_model_file: null checkpoint_dir: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/checkpoints # checkpoint file dir. This is used to load the PTL checkpoint generated during the Kosmos training checkpoint_name: null #megatron_clip--val_loss=0.41-step=13499-consumed_samples=431904.0.ckpt # PTL checkpoint file name, only used for PTL checkpoint loading hparams_file: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/version_0/hparams.yaml # model configuration file, only used for PTL checkpoint loading diff --git a/examples/multimodal/multimodal_llm/neva/conf/neva_peft.yaml b/examples/multimodal/multimodal_llm/neva/conf/neva_peft.yaml index 36e706635b97..add113cdc539 100644 --- a/examples/multimodal/multimodal_llm/neva/conf/neva_peft.yaml +++ b/examples/multimodal/multimodal_llm/neva/conf/neva_peft.yaml @@ -209,7 +209,7 @@ model: optim: name: fused_adam - lr: 2e-5 + lr: 2e-4 weight_decay: 0. 
betas: - 0.9 diff --git a/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py b/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py index 82536e32c370..c9263ea85bbf 100644 --- a/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py +++ b/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py @@ -18,7 +18,8 @@ python convert_hf_llava_to_neva.py \ --in-file \ --out-file \ - --tokenizer-model + --tokenizer-model \ + --conv-template llama_2 # nvgpt, llama_2, v1 (vicuna) """ import os @@ -49,6 +50,13 @@ def get_args(): "--in-file", type=str, default=None, required=True, help="Path to Huggingface LLaMA checkpoints", ) parser.add_argument("--out-file", type=str, default=None, required=True, help="Path to output .nemo file.") + parser.add_argument( + "--conv-template", + type=str, + default="llama_2", + required=False, + help="Conversation template: nvgpt, llama_2, v1 (vicuna)", + ) parser.add_argument( "--tokenizer-model", type=str, default=None, required=False, help="Path to sentencepiece tokenizer model." ) @@ -121,6 +129,8 @@ def load_config(args, llava_config): nemo_config.num_query_groups = llava_config['num_key_value_heads'] nemo_config.use_cpu_initialization = True nemo_config.activation = 'fast-swiglu' + nemo_config.data.conv_template = args.conv_template + nemo_config.mm_cfg.model_type = args.conv_template if args.tokenizer_model is None: nemo_config.tokenizer.model = llava_config['tokenizer_model'] else: diff --git a/examples/multimodal/multimodal_llm/neva/eval/gradio_cli.py b/examples/multimodal/multimodal_llm/neva/eval/gradio_cli.py new file mode 100644 index 000000000000..4f2136eac83a --- /dev/null +++ b/examples/multimodal/multimodal_llm/neva/eval/gradio_cli.py @@ -0,0 +1,41 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import base64 + +import requests + +# URL of the Gradio server +url = 'http://localhost:8890/api/predict/' + +# Prepare the text data +text_data = 'Describe this image please.' + +# Prepare the image data +with open("/path/to/images/001.jpg", "rb") as image_file: + encoded_string = base64.b64encode(image_file.read()).decode() + +# Data to send +data = {'data': [text_data, encoded_string]} + +# Sending a POST request to the Gradio server +response = requests.post(url, json=data) + +# Checking if the request was successful +if response.status_code == 200: + # Parsing the response + response_data = response.json() + print("Response from server:", response_data) +else: + print("Failed to get a response from the server, status code:", response.status_code) diff --git a/examples/multimodal/multimodal_llm/neva/eval/gradio_server.py b/examples/multimodal/multimodal_llm/neva/eval/gradio_server.py new file mode 100644 index 000000000000..b1308a7b0d3c --- /dev/null +++ b/examples/multimodal/multimodal_llm/neva/eval/gradio_server.py @@ -0,0 +1,108 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import base64 +import io + +import gradio as gr +import PIL.Image +from omegaconf import OmegaConf + +from nemo.collections.multimodal.parts.utils import create_neva_model_and_processor + +CFG_STRING = """ +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + logger: False # logger provided by exp_manager + precision: bf16 # 16, 32, or bf16 + +inference: + greedy: False # Use greedy decoding if True; use sampling otherwise + top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering. + top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. + temperature: 0.2 # sampling temperature + add_BOS: False # add the bos token at the beginning of the prompt + tokens_to_generate: 256 # The maximum length of the sequence to be generated. + all_probs: False # whether to return the log prob for all the tokens in vocab + repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty. + min_tokens_to_generate: 0 # The minimum length of the sequence to be generated. + compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False + end_strings: ["","",] # generation will stop when one of these tokens is generated + images_base_path: /pwd/images + insert_image_token: null # `left` or `right` or `null` + +cluster_type: BCP +tensor_model_parallel_size: 1 +pipeline_model_parallel_size: 1 +pipeline_model_parallel_split_rank: 0 # used for encoder and decoder model (0 for others) + +neva_model_file: /pwd/nemo_experiments/nemo_llava.nemo #neva_22b_tp8_finetuned_v1.nemo neva_8b_tp4_finetuned_v1.nemo +base_model_file: null +checkpoint_dir: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/checkpoints # checkpoint file dir. 
This is used to load the PTL checkpoint generated during the Kosmos training +checkpoint_name: null #megatron_clip--val_loss=0.41-step=13499-consumed_samples=431904.0.ckpt # PTL checkpoint file name, only used for PTL checkpoint loading +hparams_file: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/version_0/hparams.yaml # model configuration file, only used for PTL checkpoint loading +""" + +cfg = OmegaConf.create(CFG_STRING) +cfg.neva_model_file = "/path/to/llava-v1.5-7b.nemo" +model, image_processor = create_neva_model_and_processor(cfg) + + +def predict(prompt, image_base64=None): + input_data = {"prompt": prompt} + if image_base64 is not None: + image_data = base64.b64decode(image_base64) + # image = PIL.Image.fromarray(image) + image = PIL.Image.open(io.BytesIO(image_data)) + input_data["image"] = image_processor(image) + + length_params: LengthParam = { + "max_length": cfg.inference.tokens_to_generate, + "min_length": cfg.inference.min_tokens_to_generate, + } + sampling_params: SamplingParam = { + "use_greedy": cfg.inference.greedy, + "temperature": cfg.inference.temperature, + "top_k": cfg.inference.top_k, + "top_p": cfg.inference.top_p, + "repetition_penalty": cfg.inference.repetition_penalty, + "add_BOS": cfg.inference.add_BOS, + "all_probs": cfg.inference.all_probs, + "compute_logprob": cfg.inference.compute_logprob, + "end_strings": cfg.inference.end_strings, + } + + # Generate model responses + responses = model.generate( + input_prompts=[input_data], # Adjust based on your model's requirements + length_params=length_params, # Define these parameters as in your original code + sampling_params=sampling_params, # Define these parameters as in your original code + inference_config=cfg, + ) + + return responses[0]["clean_response"] + + +iface = gr.Interface( + fn=predict, + inputs=[gr.Textbox(), gr.Textbox()], + outputs="text", + title="Multimodal Model Inference", + description="Enter a prompt and optionally upload an image for model 
inference.", +) + +if __name__ == "__main__": + iface.launch(server_port=8890, share=False) diff --git a/examples/multimodal/multimodal_llm/neva/eval/vqa_science.py b/examples/multimodal/multimodal_llm/neva/eval/vqa_science.py new file mode 100644 index 000000000000..8ea267ac8116 --- /dev/null +++ b/examples/multimodal/multimodal_llm/neva/eval/vqa_science.py @@ -0,0 +1,176 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os + +import shortuuid +from omegaconf import OmegaConf +from tqdm import tqdm + +from nemo.collections.multimodal.parts.utils import create_neva_model_and_processor +from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam +from nemo.utils.get_rank import is_global_rank_zero + +CFG_STRING = """ +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + logger: False # logger provided by exp_manager + precision: bf16 # 16, 32, or bf16 + +inference: + greedy: True # Whether or not to use sampling ; use greedy decoding otherwise + top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering. + top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. 
+ temperature: 0.2 # sampling temperature + add_BOS: True # add the bos token at the beginning of the prompt + tokens_to_generate: 64 # The maximum length of the sequence to be generated. + all_probs: False # whether to return the log prob for all the tokens in vocab + repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty. + min_tokens_to_generate: 0 # The minimum length of the sequence to be generated. + compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False + end_strings: ["","",] # generation will stop when one of these tokens is generated + images_base_path: /pwd/images + insert_image_token: null # `left` or `right` or `null` + +cluster_type: BCP +tensor_model_parallel_size: 1 +pipeline_model_parallel_size: 1 +pipeline_model_parallel_split_rank: 0 # used for encoder and decoder model (0 for others) + +neva_model_file: /pwd/nemo_experiments/nemo_llava.nemo #neva_22b_tp8_finetuned_v1.nemo neva_8b_tp4_finetuned_v1.nemo +base_model_file: null +checkpoint_dir: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/checkpoints # checkpoint file dir. 
This is used to load the PTL checkpoint generated during the Kosmos training +checkpoint_name: null #megatron_clip--val_loss=0.41-step=13499-consumed_samples=431904.0.ckpt # PTL checkpoint file name, only used for PTL checkpoint loading +hparams_file: null #/pwd/nemo_multimodal/nemo_experiments/nemo_llava_finetune/version_0/hparams.yaml # model configuration file, only used for PTL checkpoint loading +""" + + +def split_list(lst, n): + """Split a list into n (roughly) equal-sized chunks""" + chunk_size = math.ceil(len(lst) / n) # ceiling division, so the last chunk may be shorter + return [lst[i : i + chunk_size] for i in range(0, len(lst), chunk_size)] + + +def get_chunk(lst, n, k): + chunks = split_list(lst, n) + return chunks[k] + + +def eval_model(args): + # Model + cfg = OmegaConf.create(CFG_STRING) + cfg.neva_model_file = args.model_path + cfg.base_model_file = args.model_base + cfg.inference.images_base_path = args.image_folder + cfg.tensor_model_parallel_size = args.tp + cfg.trainer.devices = args.tp + + model, image_processor = create_neva_model_and_processor(cfg) + length_params: LengthParam = { + "max_length": cfg.inference.tokens_to_generate, + "min_length": cfg.inference.min_tokens_to_generate, + } + sampling_params: SamplingParam = { + "use_greedy": cfg.inference.greedy, + "temperature": cfg.inference.temperature, + "top_k": cfg.inference.top_k, + "top_p": cfg.inference.top_p, + "repetition_penalty": cfg.inference.repetition_penalty, + "add_BOS": cfg.inference.add_BOS, + "all_probs": cfg.inference.all_probs, + "compute_logprob": cfg.inference.compute_logprob, + "end_strings": cfg.inference.end_strings, + } + + questions = json.load(open(os.path.expanduser(args.question_file), "r")) + questions = get_chunk(questions, args.num_chunks, args.chunk_idx) + answers_file = os.path.expanduser(args.answers_file) + os.makedirs(os.path.dirname(answers_file), exist_ok=True) + ans_file = open(answers_file, "w") + for i, line in enumerate(tqdm(questions, disable=(not is_global_rank_zero()))): + idx 
= line["id"] + question = line['conversations'][0] + qs = question['value'].replace('<image>', '').strip() + cur_prompt = qs + + if 'image' in line: + cur_prompt = qs = '<image>' + cur_prompt + line['image'] = image_processor(os.path.join(cfg.inference.images_base_path, line['image'])) + + if args.single_pred_prompt: + qs = qs + '\n' + "Answer with the option's letter from the given choices directly." + cur_prompt = cur_prompt + '\n' + "Answer with the option's letter from the given choices directly." + + responses = model.generate( + input_prompts=[dict(prompt=qs, image=line.get('image', None))], + length_params=length_params, + sampling_params=sampling_params, + inference_config=cfg, + ) + # import pdb; pdb.set_trace() + outputs = responses[0]["clean_response"] + + # prompt for answer + if args.answer_prompter: + outputs_reasoning = outputs + + responses = model.generate( + input_prompts=[dict(prompt=qs + outputs_reasoning + ' ###\nANSWER:', image=line.get('image', None))], + length_params=length_params, + sampling_params=sampling_params, + inference_config=cfg, + ) + outputs = responses[0]["clean_response"] + outputs = outputs_reasoning + '\n The answer is ' + outputs + + ans_id = shortuuid.uuid() + ans_file.write( + json.dumps( + { + "question_id": idx, + "prompt": cur_prompt, + "text": outputs, + "answer_id": ans_id, + "model_id": args.model_path, + "metadata": {}, + } + ) + + "\n" + ) + ans_file.flush() + ans_file.close() + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--model-path", type=str, default="facebook/opt-350m") + parser.add_argument("--model-base", type=str, default=None) + parser.add_argument("--image-folder", type=str, default="") + parser.add_argument("--question-file", type=str, default="tables/question.json") + parser.add_argument("--answers-file", type=str, default="answer.jsonl") + parser.add_argument("--conv-mode", type=str, default="llava_v0") + parser.add_argument("--tp", type=int, default=1) + parser.add_argument("--num-chunks", type=int, 
default=1) + parser.add_argument("--chunk-idx", type=int, default=0) + parser.add_argument("--temperature", type=float, default=0.2) + parser.add_argument("--answer-prompter", action="store_true") + parser.add_argument("--single-pred-prompt", action="store_true") + args = parser.parse_args() + + eval_model(args) diff --git a/examples/multimodal/multimodal_llm/neva/neva_evaluation.py b/examples/multimodal/multimodal_llm/neva/neva_evaluation.py index a3ee54937161..545a634ac7fb 100644 --- a/examples/multimodal/multimodal_llm/neva/neva_evaluation.py +++ b/examples/multimodal/multimodal_llm/neva/neva_evaluation.py @@ -14,21 +14,12 @@ import json import os - import torch -from omegaconf import OmegaConf, open_dict -from pytorch_lightning.plugins.environments import TorchElasticEnvironment -from pytorch_lightning.trainer.trainer import Trainer from torch.utils.data import Dataset -from nemo.collections.multimodal.models.multimodal_llm.neva.neva_model import MegatronNevaModel -from nemo.collections.nlp.modules.common.megatron.megatron_init import fake_initialize_model_parallel +from nemo.collections.multimodal.parts.utils import create_neva_model_and_processor from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam -from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector -from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP from nemo.core.config import hydra_runner -from nemo.utils.app_state import AppState -from nemo.utils.model_utils import inject_model_parallel_rank try: @@ -40,105 +31,6 @@ HAVE_AMMO = False - -""" -This is the script to run GPT text generation. - -Usage: - Assume the model has TP=1, PP=1 in the following use cases. - a. 
run greedy inference from a nemo file: - python neva_evaluation.py \ - neva_model_file=PATH_TO_MODEL \ - inference.greedy=True \ - inference.add_BOS=True \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=-1 \ - pipeline_model_parallel_size=-1 \ - prompts=[prompt1,prompt2] - - b. run greedy inference from a PTL checkpoint file: - python neva_evaluation.py \ - checkpoint_dir=PATH_TO_CHECKPOINT_FILE \ - checkpoint_name=CHECKPOINT_FILE_NAME \ - hparams_file=HPARAMS_FILE \ - inference.greedy=True \ - inference.add_BOS=True \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=-1 \ - pipeline_model_parallel_size=-1 \ - prompts=[prompt1,prompt2] - - c. run top_p inference from a nemo file: - python neva_evaluation.py \ - neva_model_file=PATH_TO_MODEL \ - inference.greedy=False \ - inference.top_k=0 \ - inference.top_p=0.9 \ - inference.repetition_penalty=1.2 \ - inference.add_BOS=True \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=-1 \ - pipeline_model_parallel_size=-1 \ - prompts=[prompt1,prompt2] - - d. If you don't need to generate tokens and need model to compute logprobs: - python neva_evaluation.py \ - neva_model_file=PATH_TO_MODEL \ - inference.compute_logprob=True \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=-1 \ - pipeline_model_parallel_size=-1 \ - prompts=[text to get logprob] - - e. 
Launch the inference server - python neva_evaluation.py \ - neva_model_file=PATH_TO_MODEL \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=-1 \ - pipeline_model_parallel_size=-1 \ - server=True - - To send a request to the server, here is one example code: - ```python - import json - import requests - - batch_size = 8 - port_num = 5555 - headers = {"Content-Type": "application/json"} - - - def request_data(data): - resp = requests.put('http://localhost:{}/generate'.format(port_num), - data=json.dumps(data), - headers=headers) - sentences = resp.json()['sentences'] - return sentences - - - data = { - "sentences": [""] * batch_size, - "images" : [] * batch_size, - "tokens_to_generate": 300, - "temperature": 1.0, - "add_BOS": True, - "top_k": 0, - "top_p": 0.9, - "greedy": False, - "all_probs": False, - "repetition_penalty": 1.2, - "min_tokens_to_generate": 2, - } - - sentences = request_data(data) - ``` -""" - if not torch.cuda.is_available(): raise EnvironmentError("GPU is needed for the inference") @@ -157,100 +49,7 @@ def __getitem__(self, idx): @hydra_runner(config_path="conf", config_name="neva_inference") def main(cfg) -> None: - - plugins = [] - if cfg.get('cluster_type', None) == 'BCP': - plugins.append(TorchElasticEnvironment()) - # trainer required for restoring model parallel models - trainer = Trainer(plugins=plugins, strategy=NLPDDPStrategy(), **cfg.trainer) - - if ( - cfg.tensor_model_parallel_size < 0 - or cfg.pipeline_model_parallel_size < 0 - or cfg.get('pipeline_model_parallel_split_rank', -1) < 0 - ): - model_config = MegatronNevaModel.restore_from( - restore_path=cfg.neva_model_file, trainer=trainer, return_config=True, - ) - - with open_dict(cfg): - cfg.tensor_model_parallel_size = model_config.get('tensor_model_parallel_size', 1) - cfg.pipeline_model_parallel_size = model_config.get('pipeline_model_parallel_size', 1) - cfg.pipeline_model_parallel_split_rank = model_config.get('pipeline_model_parallel_split_rank', 0) - 
- assert ( - cfg.trainer.devices * cfg.trainer.num_nodes - == cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size - ), "devices * num_nodes should equal tensor_model_parallel_size * pipeline_model_parallel_size" - - if cfg.neva_model_file: - save_restore_connector = NLPSaveRestoreConnector() - if os.path.isdir(cfg.neva_model_file): - save_restore_connector.model_extracted_dir = cfg.neva_model_file - - neva_cfg = MegatronNevaModel.restore_from( - restore_path=cfg.neva_model_file, - trainer=trainer, - return_config=True, - save_restore_connector=save_restore_connector, - ) - OmegaConf.set_struct(neva_cfg, True) - with open_dict(neva_cfg): - neva_cfg.sequence_parallel = False - neva_cfg.activations_checkpoint_granularity = None - neva_cfg.activations_checkpoint_method = None - neva_cfg.precision = trainer.precision - neva_cfg.mm_cfg.llm.from_pretrained = cfg.get('llm_model_file', None) - # neva_cfg.mm_cfg.vision_encoder.from_pretrained = None - - model = MegatronNevaModel.restore_from( - restore_path=cfg.neva_model_file, - trainer=trainer, - override_config_path=neva_cfg, - save_restore_connector=save_restore_connector, - ) - if neva_cfg.get('peft') is not None: - peft_cfg_cls = PEFT_CONFIG_MAP[neva_cfg.peft.peft_scheme] - if peft_cfg_cls is not None: - model.load_adapters(cfg.neva_model_file, peft_cfg_cls(neva_cfg)) - - elif cfg.checkpoint_dir: - app_state = AppState() - if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1: - app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size - app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size - app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size - ( - app_state.tensor_model_parallel_rank, - app_state.pipeline_model_parallel_rank, - app_state.model_parallel_size, - app_state.data_parallel_size, - app_state.pipeline_model_parallel_split_rank, - app_state.virtual_pipeline_model_parallel_rank, - ) = 
= fake_initialize_model_parallel( - world_size=app_state.model_parallel_size, - rank=trainer.global_rank, - tensor_model_parallel_size_=cfg.tensor_model_parallel_size, - pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size, - pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank, - ) - checkpoint_path = inject_model_parallel_rank(os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)) - # TODO: This wont work properly (We need to set model.llm.from_pretrained model.vision.from_pretrained to nul) - model = MegatronNevaModel.load_from_checkpoint(checkpoint_path, hparams_file=cfg.hparams_file, trainer=trainer) - else: - raise ValueError("need at least a nemo file or checkpoint dir") - - model.freeze() - - # Have to turn off activations_checkpoint_method for inference - try: - model.model.language_model.encoder.activations_checkpoint_method = None - except AttributeError: - pass - try: - model.model.module.language_model.encoder.activations_checkpoint_method = None - except AttributeError: - pass + model, image_processor = create_neva_model_and_processor(cfg) length_params: LengthParam = { "max_length": cfg.inference.tokens_to_generate, @@ -275,6 +74,16 @@ def main(cfg) -> None: final_prompts = [] for line in lines: prompt_dict = json.loads(line) + assert 'prompt' in prompt_dict or 'text' in prompt_dict + if 'prompt' not in prompt_dict: + prompt_dict['prompt'] = prompt_dict['text'] + if cfg.inference.insert_image_token == 'left': + prompt_dict['prompt'] = '<image>' + prompt_dict['prompt'] + elif cfg.inference.insert_image_token == 'right': + prompt_dict['prompt'] = prompt_dict['prompt'] + '<image>' + if 'image' in prompt_dict: + prompt_dict['image_path'] = prompt_dict['image'] + prompt_dict['image'] = image_processor(os.path.join(cfg.inference.images_base_path, prompt_dict['image'])) final_prompts.append(prompt_dict) responses = model.generate( @@ -316,27 +125,18 @@ def forward_loop(): prompt['full_text'] = response["clean_text"] prompt['text'] = 
response["clean_response"] prompt['model_id'] = cfg.neva_model_file - prompt['answer_id'] = 0 - prompt['metadata'] = {} + if 'image_path' in prompt: + prompt['image'] = prompt.pop('image_path') + if 'answer_id' not in prompt: + prompt['answer_id'] = 0 + if 'metadata' not in prompt: + prompt['metadata'] = {} results.append(prompt) with open(cfg.output_file, 'w') as f: for result in results: f.write(json.dumps(result) + '\n') - """ - # Second method of running text generation, call trainer.predict - ds = RequestDataSet(final_prompts) - request_dl = DataLoader(dataset=ds, batch_size=1) - config = OmegaConf.to_container(cfg.inference) - model.set_inference_config(config) - response = trainer.predict(model, request_dl) - - print("***************************") - print(response) - print("***************************") - """ - if __name__ == '__main__': main() # noqa pylint: disable=no-value-for-parameter diff --git a/examples/multimodal/text_to_image/controlnet/controlnet_infer.py b/examples/multimodal/text_to_image/controlnet/controlnet_infer.py index 4cdf922f8211..a33f3fc185ad 100644 --- a/examples/multimodal/text_to_image/controlnet/controlnet_infer.py +++ b/examples/multimodal/text_to_image/controlnet/controlnet_infer.py @@ -20,7 +20,6 @@ import torch from PIL import Image -from nemo.collections.multimodal.models.text_to_image.controlnet import get_preprocessing_function from nemo.collections.multimodal.models.text_to_image.controlnet.controlnet import MegatronControlNet from nemo.collections.multimodal.models.text_to_image.stable_diffusion.samplers.ddim import DDIMSampler from nemo.collections.multimodal.models.text_to_image.stable_diffusion.samplers.plms import PLMSSampler @@ -30,10 +29,6 @@ def get_control_input(image_path, batch_size, hint_image_size, control_image_preprocess=None): image = cv2.imread(image_path) - if control_image_preprocess: - # More applications will be supported here - process = get_preprocessing_function(control_image_preprocess) - image = 
process(image) image = cv2.resize(image, (hint_image_size, hint_image_size)) control = torch.from_numpy(image).float() / 255.0 control = torch.stack([control for _ in range(batch_size)], dim=0) diff --git a/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_infer.yaml b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_infer.yaml index fc8d35443767..02faba0c65f6 100644 --- a/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_infer.yaml +++ b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_infer.yaml @@ -15,16 +15,13 @@ infer: seed: 123 prompts: - 'a photo of a sks dog' - - 'a photo of a sks dog in the Acropolis' - - 'a photo of a sks dog in front of eiffel tower' - - 'a photo of sks dog sleeping' - - 'a photo of a sks dog riding a bike' + - 'a photo of a sks dog in a bucket' trainer: devices: 1 num_nodes: 1 accelerator: gpu - precision: 16 + precision: bf16 logger: False # logger provided by exp_manager model: diff --git a/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_infer.yaml b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_infer.yaml new file mode 100644 index 000000000000..b2af365e4d05 --- /dev/null +++ b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_infer.yaml @@ -0,0 +1,33 @@ +name: stable-diffusion-train + +infer: + unconditional_guidance_scale: 7.5 + num_images_per_prompt: 4 + height: 512 + width: 512 + down_factor: 8 + inference_steps: 50 + sampler_type: 'DDIM' + eta: 0 + output_type: 'pil' + save_to_file: True + out_path: 'dreambooth' + seed: 123 + prompts: + - 'a photo of a sks dog' + - 'a photo of sks dog in a bucket' + + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: 16 + logger: False # logger provided by exp_manager + +model: + precision: ${trainer.precision} + peft: + restore_from_path: null + unet_config: + from_pretrained: null # In case user want to load lora weights to a different unet ckpt than that is used in training 
\ No newline at end of file diff --git a/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_train.yaml b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_train.yaml new file mode 100644 index 000000000000..283fbda56e33 --- /dev/null +++ b/examples/multimodal/text_to_image/dreambooth/conf/dreambooth_lora_train.yaml @@ -0,0 +1,241 @@ +name: Dreambooth-lora + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: bf16-mixed + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: -1 # PTL default. In practice, max_steps will be reached first. + max_steps: 500 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches + log_every_n_steps: 10 + accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models + gradient_clip_val: 1.0 + benchmark: False + enable_model_summary: True + limit_val_batches: 0 + +exp_manager: + exp_dir: null + name: ${name} + create_checkpoint_callback: True + create_tensorboard_logger: True + checkpoint_callback_params: + every_n_train_steps: 200 + every_n_epochs: 0 + monitor: reduced_train_loss + save_on_train_epoch_end: False + filename: '${name}-{step}' + save_top_k: -1 + resume_if_exists: True + resume_ignore_no_checkpoint: True + resume_from_checkpoint: ${model.resume_from_checkpoint} + ema: + enable: False + decay: 0.9999 + validate_original_weights: False + every_n_steps: 1 + cpu_offload: False + + + +model: + precision: ${trainer.precision} + # specify micro_batch_size, global_batch_size, and model parallelism + # gradient accumulation will be done automatically based on data_parallel_size + micro_batch_size: 1 # limited by GPU memory + global_batch_size: 1 # will use more micro batches to reach global batch size + + with_prior_preservation: False + use_cached_latents: False + prior_loss_weight: 0.5 + train_text_encoder: False + restore_from_path: 
/ckpts/nemo-v1-5-188000-ema.nemo #This ckpt is only used to generate regularization images, thus .nemo ckpt is needed + + + + + linear_start: 0.00085 + linear_end: 0.012 + num_timesteps_cond: 1 + log_every_t: 200 + timesteps: 1000 + first_stage_key: images + cond_stage_key: captions + image_size: 64 + channels: 4 + cond_stage_trainable: false + conditioning_key: crossattn # check + monitor: val/loss_simple_ema + scale_factor: 0.18215 + use_ema: False + scale_by_std: False + ckpt_path: + ignore_keys: [ ] + parameterization: eps + clip_denoised: True + load_only_unet: False + cosine_s: 8e-3 + given_betas: + original_elbo_weight: 0 + v_posterior: 0 + l_simple_weight: 1 + use_positional_encodings: False + learn_logvar: False + logvar_init: 0 + beta_schedule: linear + loss_type: l2 + + concat_mode: True + cond_stage_forward: + text_embedding_dropout_rate: 0.1 + fused_opt: True + inductor: False + inductor_cudagraphs: False + channels_last: False + + unet_config: + _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel + from_pretrained: /ckpts/unet.bin #load unet weights for finetuning, can use .ckpt ckpts from various sources + from_NeMo: False #Must be specified when from pretrained is not None, False means loading unet from HF ckpt + image_size: 32 # unused + in_channels: 4 + out_channels: 4 + model_channels: 320 + attention_resolutions: + - 4 + - 2 + - 1 + num_res_blocks: 2 + channel_mult: + - 1 + - 2 + - 4 + - 4 + num_heads: 8 + use_spatial_transformer: true + transformer_depth: 1 + context_dim: 768 + use_checkpoint: False + legacy: False + use_flash_attention: False + lora_network_alpha: null + + first_stage_config: + _target_: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL + from_pretrained: /ckpts/vae.bin + #ckpt_path: /ckpts/vae.ckpt #to support original opensource weights files, please use ckpt_path to load it. 
+ embed_dim: 4 + monitor: val/rec_loss + ddconfig: + double_z: true + z_channels: 4 + resolution: 256 #Never used + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: + - 1 + - 2 + - 4 + - 4 + num_res_blocks: 2 + attn_resolutions: [ ] + dropout: 0.0 + lossconfig: + target: torch.nn.Identity + + cond_stage_config: + _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenMegatronCLIPEmbedder + restore_from_path: /ckpts/openai.nemo + device: cuda + freeze: True + layer: "last" + enable_lora_finetune: False #to enable text encoder lora finetune, please enable both this one and "train_text_encoder" + # For compatibility of history version that uses HF clip model + # _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder + # version: openai/clip-vit-large-patch14 + # device: cuda + # max_length: 77 + # enable_lora_finetune: False #to enable text encoder lora finetune, please enable both this one and "train_text_encoder" + + noise_scheduler: + _target_: nemo.collections.multimodal.models.text_to_image.dreambooth.util.sd_noise_scheduler + parameterization: eps + v_posterior: 0 + given_betas: + beta_schedule: linear + timesteps: 1000 + linear_start: 0.00085 + linear_end: 0.012 + cosine_s: 8e-3 + + # miscellaneous + seed: 1234 + resume_from_checkpoint: null # manually set the checkpoint file to load from + apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this + gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0. 
+ betas: + - 0.9 + - 0.999 + sched: + name: WarmupHoldPolicy + warmup_steps: 1 + hold_steps: 10000000000000 # Incredibly large value to hold the lr as constant + + # Nsys profiling options + nsys_profile: + enabled: False + start_step: 10 # Global batch to start profiling + end_step: 10 # Global batch to end profiling + ranks: [ 0 ] # Global rank IDs to profile + gen_shape: False # Generate model and kernel details including input shapes + + data: + name: pbss + num_workers: 4 + instance_dir: /home/scratch.zhuoyaow_gpu/workspace/SD_NeMo_EA/launcher_scripts/data/inst_dir + instance_prompt: a photo of a sks dog + regularization_dir: /home/scratch.zhuoyaow_gpu/workspace/SD_NeMo_EA/launcher_scripts/data/nemo_dogs + regularization_prompt: a photo of a dog + num_reg_images: 10 + num_images_per_prompt: 4 + resolution: 512 + center_crop: True + cached_instance_dir: #/datasets/instance_dir_cached + cached_reg_dir: #/datasets/nemo_dogs_cached + + peft: + peft_scheme: "sdlora" + restore_from_path: null + lora_tuning: + adapter_dim: 32 + network_alpha: 16 + adapter_dropout: 0.0 + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + layer_selection: null # selects in which layers to add lora adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. 
null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True + +##The below infer config is to use inference script generating regularization images +infer: + unconditional_guidance_scale: 7.5 + num_images_per_prompt: ${model.data.num_images_per_prompt} + height: 512 + width: 512 + down_factor: 8 + inference_steps: 50 + sampler_type: 'PLMS' + eta: 0 + output_type: 'pil' + save_to_file: False + out_path: ${model.data.regularization_dir} + prompts: ${model.data.regularization_prompt} \ No newline at end of file diff --git a/examples/multimodal/text_to_image/dreambooth/dreambooth.py b/examples/multimodal/text_to_image/dreambooth/dreambooth.py index e8e7d776f1ff..70231e571331 100644 --- a/examples/multimodal/text_to_image/dreambooth/dreambooth.py +++ b/examples/multimodal/text_to_image/dreambooth/dreambooth.py @@ -21,6 +21,8 @@ from nemo.collections.multimodal.parts.stable_diffusion.pipeline import pipeline from nemo.collections.multimodal.parts.utils import setup_trainer_and_model_for_inference from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerBuilder + +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP from nemo.core.config import hydra_runner from nemo.utils import logging from nemo.utils.exp_manager import exp_manager @@ -53,7 +55,9 @@ def model_cfg_modifier(model_cfg): model_cfg.global_batch_size = cfg.model.global_batch_size model_cfg.unet_config.from_pretrained = None model_cfg.first_stage_config.from_pretrained = None - model_cfg.target = 'nemo.collections.multimodal.models.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + model_cfg.target = ( + 'nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + ) trainer, megatron_diffusion_model = setup_trainer_and_model_for_inference( model_provider=MegatronLatentDiffusion, cfg=cfg, model_cfg_modifier=model_cfg_modifier @@ -101,6 +105,21 @@ def 
main(cfg): model = MegatronDreamBooth(cfg.model, trainer) + if cfg.model.get('peft', None): + + peft_cfg_cls = PEFT_CONFIG_MAP[cfg.model.peft.peft_scheme] + + if cfg.model.peft.restore_from_path is not None: + # initialize peft weights from a checkpoint instead of randomly + # This is not the same as resume training because optimizer states are not restored. + logging.info(f"PEFT Weights will be loaded from {cfg.model.peft.restore_from_path}") + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(model_cfg)) + elif peft_cfg_cls is not None: + logging.info("Adding adapter weights to the model for PEFT") + model.add_adapter(peft_cfg_cls(cfg.model)) + else: + logging.info(f"Running full finetuning since no peft scheme is given.\n{model.summarize()}") + trainer.fit(model) diff --git a/examples/multimodal/text_to_image/dreambooth/dreambooth_infer.py b/examples/multimodal/text_to_image/dreambooth/dreambooth_infer.py index 672431d7b3fa..17952cea3b6b 100644 --- a/examples/multimodal/text_to_image/dreambooth/dreambooth_infer.py +++ b/examples/multimodal/text_to_image/dreambooth/dreambooth_infer.py @@ -28,7 +28,9 @@ def model_cfg_modifier(model_cfg): model_cfg.unet_config.use_flash_attention = False model_cfg.unet_config.from_pretrained = None model_cfg.first_stage_config.from_pretrained = None - model_cfg.target = 'nemo.collections.multimodal.models.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + model_cfg.target = ( + 'nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + ) trainer, megatron_diffusion_model = setup_trainer_and_model_for_inference( model_provider=MegatronLatentDiffusion, cfg=cfg, model_cfg_modifier=model_cfg_modifier diff --git a/examples/multimodal/text_to_image/dreambooth/dreambooth_lora_infer.py b/examples/multimodal/text_to_image/dreambooth/dreambooth_lora_infer.py new file mode 100644 index 000000000000..52f0aa2940d2 --- /dev/null +++ 
b/examples/multimodal/text_to_image/dreambooth/dreambooth_lora_infer.py @@ -0,0 +1,67 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import torch +from omegaconf import open_dict +from pytorch_lightning import Trainer +from pytorch_lightning.plugins.environments import TorchElasticEnvironment + +from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import MegatronLatentDiffusion +from nemo.collections.multimodal.parts.stable_diffusion.pipeline import pipeline +from nemo.collections.multimodal.parts.utils import setup_trainer_and_model_for_inference +from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP +from nemo.core.config import hydra_runner + + +@hydra_runner(config_path='conf', config_name='dreambooth_lora_infer') +def main(cfg): + def model_cfg_modifier(model_cfg): + model_cfg.precision = cfg.trainer.precision + model_cfg.ckpt_path = None + model_cfg.inductor = False + model_cfg.target = ( + 'nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + ) + if cfg.model.unet_config.from_pretrained: + model_cfg.unet_config.from_pretrained = cfg.model.unet_config.from_pretrained + + model_cfg = MegatronLatentDiffusion.restore_from( + restore_path=cfg.model.peft.restore_from_path, + trainer=None, + 
save_restore_connector=NLPSaveRestoreConnector(), + return_config=True, + ) + + with open_dict(model_cfg): + model_cfg_modifier(model_cfg) + + plugins = [] + plugins.append(TorchElasticEnvironment()) + strategy = NLPDDPStrategy(no_ddp_communication_hook=True, find_unused_parameters=False,) + trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer) + + model = MegatronLatentDiffusion(model_cfg, trainer=trainer) + model.setup_complete = True + + peft_cfg_cls = PEFT_CONFIG_MAP[model_cfg.peft.peft_scheme] + + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(model_cfg)) + rng = torch.Generator().manual_seed(cfg.infer.seed) + + model = model.model.cuda().eval() + pipeline(model, cfg, rng=rng) + + +if __name__ == "__main__": + main() diff --git a/examples/multimodal/text_to_image/instruct_pix2pix/conf/sd_finetune.yaml b/examples/multimodal/text_to_image/instruct_pix2pix/conf/sd_finetune.yaml index 34ef1f436cd6..1c15b6e1a5fc 100644 --- a/examples/multimodal/text_to_image/instruct_pix2pix/conf/sd_finetune.yaml +++ b/examples/multimodal/text_to_image/instruct_pix2pix/conf/sd_finetune.yaml @@ -113,7 +113,7 @@ model: use_flash_attention: False first_stage_config: - _target_: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKL + _target_: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL from_pretrained: embed_dim: 4 monitor: val/rec_loss diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd2_train.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd2_train.yaml index 3cfc822f8462..b725b15f1ab2 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/conf/sd2_train.yaml +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd2_train.yaml @@ -119,7 +119,7 @@ model: use_flash_attention: False first_stage_config: - _target_: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKL + _target_: 
nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL from_pretrained: embed_dim: 4 monitor: val/rec_loss diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_fid_images.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_fid_images.yaml index e526bc52d673..23a64a9607e3 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_fid_images.yaml +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_fid_images.yaml @@ -32,6 +32,7 @@ infer: out_path: ${fid.save_path} seed: 123 prompts: + batch_size: 8 trainer: devices: ${fid.ntasks_per_node} @@ -42,4 +43,4 @@ trainer: model: restore_from_path: null - precision: ${trainer.precision} \ No newline at end of file + precision: ${trainer.precision} diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_infer.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_infer.yaml index dbe384dd2566..5a349387a0b4 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_infer.yaml +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_infer.yaml @@ -3,6 +3,7 @@ name: stable-diffusion-train infer: unconditional_guidance_scale: 7.5 num_images_per_prompt: 4 + batch_size: 8 height: 512 width: 512 down_factor: 8 @@ -28,4 +29,4 @@ trainer: model: restore_from_path: null - precision: ${trainer.precision} \ No newline at end of file + precision: ${trainer.precision} diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_infer.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_infer.yaml new file mode 100644 index 000000000000..d77c24de704a --- /dev/null +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_infer.yaml @@ -0,0 +1,34 @@ +name: stable-diffusion-train + +infer: + unconditional_guidance_scale: 7.5 + num_images_per_prompt: 4 + height: 512 + width: 512 + down_factor: 8 + inference_steps: 25 + sampler_type: 'DPM' + eta: 0 + 
output_type: 'pil' + save_to_file: True + out_path: 'stable-diffusion' + seed: 123 + prompts: + - 'A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat.' + - 'A cute corgi lives in a house made out of sushi.' + - 'A high contrast portrait of a very happy fuzzy panda dressed as a chef in a high end kitchen making dough. There is a painting of flowers on the wall behind him.' + - 'A brain riding a rocketship heading towards the moon.' + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: 16 + logger: False # logger provided by exp_manager + +model: + precision: ${trainer.precision} + peft: + restore_from_path: null + unet_config: + from_pretrained: null # In case user want to load lora weights to a different unet ckpt than that is used in training \ No newline at end of file diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_train.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_train.yaml new file mode 100644 index 000000000000..3fbe03aaeaa1 --- /dev/null +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_lora_train.yaml @@ -0,0 +1,217 @@ +name: stable-diffusion-lora-train + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: 16 + logger: False # logger provided by exp_manager + enable_checkpointing: False + use_distributed_sampler: False + max_epochs: 2 # PTL default. In practice, max_steps will be reached first. 
+ max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches + log_every_n_steps: 10 + accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models + gradient_clip_val: 1.0 + benchmark: False + enable_model_summary: True + limit_val_batches: 0 + + +exp_manager: + exp_dir: null + name: ${name} + create_wandb_logger: False + wandb_logger_kwargs: + project: stable-diffusion + group: nemo-sd + name: ${name} + resume: True + create_checkpoint_callback: True + create_tensorboard_logger: True + checkpoint_callback_params: + every_n_train_steps: 1000 + every_n_epochs: 0 + monitor: reduced_train_loss + filename: '${name}--{reduced_train_loss:.2f}-{step}-{consumed_samples}' + resume_if_exists: True + resume_ignore_no_checkpoint: True + resume_from_checkpoint: ${model.resume_from_checkpoint} + ema: + enable: True + decay: 0.9999 + validate_original_weights: False + every_n_steps: 1 + cpu_offload: False + + +model: + precision: ${trainer.precision} + # specify micro_batch_size, global_batch_size, and model parallelism + # gradient accumulation will be done automatically based on data_parallel_size + micro_batch_size: 1 # limited by GPU memory + global_batch_size: 1 # will use more micro batches to reach global batch size + native_amp_init_scale: 65536.0 # Init scale for grad scaler used at fp16 + + + linear_start: 0.00085 + linear_end: 0.012 + num_timesteps_cond: 1 + log_every_t: 200 + timesteps: 1000 + first_stage_key: images + cond_stage_key: captions # txt for cifar, caption for pbss + image_size: 64 + channels: 4 + cond_stage_trainable: false + conditioning_key: crossattn # check + monitor: val/loss_simple_ema + scale_factor: 0.18215 + use_ema: False + scale_by_std: False + ckpt_path: + ignore_keys: [] + parameterization: eps + clip_denoised: True + load_only_unet: False + cosine_s: 8e-3 + given_betas: + original_elbo_weight: 0 + v_posterior: 0 + l_simple_weight: 1 + 
use_positional_encodings: False + learn_logvar: False + logvar_init: 0 + beta_schedule: linear + loss_type: l2 + + concat_mode: True + cond_stage_forward: + text_embedding_dropout_rate: 0.1 + fused_opt: True + inductor: False + inductor_cudagraphs: False + capture_cudagraph_iters: -1 # -1 to disable + channels_last: True + + unet_config: + _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel + from_pretrained: /ckpts/nemo-v1-2.ckpt + from_NeMo: True #Must be specified when from pretrained is not None, False means loading unet from HF ckpt + image_size: 32 # unused + in_channels: 4 + out_channels: 4 + model_channels: 320 + attention_resolutions: + - 4 + - 2 + - 1 + num_res_blocks: 2 + channel_mult: + - 1 + - 2 + - 4 + - 4 + num_heads: 8 + use_spatial_transformer: true + transformer_depth: 1 + context_dim: 768 + use_checkpoint: False + legacy: False + use_flash_attention: True + enable_amp_o2_fp16: False + resblock_gn_groups: 32 + lora_network_alpha: null + + first_stage_config: + _target_: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL + from_pretrained: /ckpts/vae.bin + embed_dim: 4 + monitor: val/rec_loss + ddconfig: + double_z: true + z_channels: 4 + resolution: 256 #Never used + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: + - 1 + - 2 + - 4 + - 4 + num_res_blocks: 2 + attn_resolutions: [] + dropout: 0.0 + lossconfig: + target: torch.nn.Identity + + cond_stage_config: + _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenMegatronCLIPEmbedder + restore_from_path: /ckpts/openai.nemo + device: cuda + freeze: True + layer: "last" + # For compatibility of history version that uses HF clip model + # _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder + # version: openai/clip-vit-large-patch14 + # device: cuda + # max_length: 77 + # capture_cudagraph_iters: ${model.capture_cudagraph_iters} + 
+ + # miscellaneous + seed: 1234 + resume_from_checkpoint: null # manually set the checkpoint file to load from + apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this + gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) + ddp_overlap: True # True for using PyTorch DDP overlap. + + optim: + name: fused_adam + lr: 1e-4 + weight_decay: 0. + betas: + - 0.9 + - 0.999 + sched: + name: WarmupHoldPolicy + warmup_steps: 1 + hold_steps: 10000000000000 # Incredibly large value to hold the lr as constant + + # Nsys profiling options + nsys_profile: + enabled: False + start_step: 10 # Global batch to start profiling + end_step: 10 # Global batch to end profiling + ranks: [ 0 ] # Global rank IDs to profile + gen_shape: False # Generate model and kernel details including input shapes + + data: + num_workers: 16 + synthetic_data: False # dataset_path and local_root_path can be empty when using synthetic data + synthetic_data_length: 10000 + train: + dataset_path: + - /datasets/coyo/test.pkl + augmentations: + resize_smallest_side: 512 + center_crop_h_w: 512, 512 + horizontal_flip: False + filterings: + + webdataset: + infinite_sampler: False + local_root_path: /datasets/coyo + + peft: + peft_scheme: "sdlora" + restore_from_path: null + lora_tuning: + adapter_dim: 32 + adapter_dropout: 0.0 + column_init_method: 'xavier' # IGNORED if linear_adapter is used, options: xavier, zero or normal + row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal + layer_selection: null # selects in which layers to add lora adapters. e.g. [1,12] will add lora to layer 1 (lowest) and 12. 
null will apply adapters to all layers + weight_tying: False + position_embedding_strategy: null # used only when weight_tying is True \ No newline at end of file diff --git a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_train.yaml b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_train.yaml index e87a99344d70..db1c138f9d3e 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/conf/sd_train.yaml +++ b/examples/multimodal/text_to_image/stable_diffusion/conf/sd_train.yaml @@ -144,7 +144,6 @@ model: dropout: 0.0 lossconfig: target: torch.nn.Identity - capture_cudagraph_iters: ${model.capture_cudagraph_iters} cond_stage_config: _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenMegatronCLIPEmbedder @@ -157,7 +156,6 @@ model: # version: openai/clip-vit-large-patch14 # device: cuda # max_length: 77 - # capture_cudagraph_iters: ${model.capture_cudagraph_iters} # miscellaneous @@ -193,7 +191,7 @@ model: synthetic_data_length: 10000 train: dataset_path: - - /datasets/coyo/wdinfo.pkl + - /datasets/coyo/test.pkl augmentations: resize_smallest_side: 512 center_crop_h_w: 512, 512 diff --git a/examples/multimodal/text_to_image/stable_diffusion/generate_fid_images.py b/examples/multimodal/text_to_image/stable_diffusion/generate_fid_images.py index d04a1d2b18af..27ea5913cbff 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/generate_fid_images.py +++ b/examples/multimodal/text_to_image/stable_diffusion/generate_fid_images.py @@ -74,7 +74,7 @@ def model_cfg_modifier(model_cfg): model_cfg.unet_config.use_flash_attention = False model_cfg.unet_config.from_pretrained = None model_cfg.first_stage_config.from_pretrained = None - model_cfg.global_batch_size = model_cfg.micro_batch_size * ntasks_per_node + model_cfg.global_batch_size = cfg.infer.batch_size * ntasks_per_node # Set up the trainer and model for inference trainer, megatron_diffusion_model = setup_trainer_and_model_for_inference( @@ -84,13 
+84,12 @@ def model_cfg_modifier(model_cfg): model.cuda().eval() # Generate images using the model and save them - for i, prompt in enumerate(input): - cfg.infer.prompts = [prompt] - rng = torch.Generator().manual_seed(cfg.infer.seed + local_task_id * 10 + node_id_per_cfg * 100 + i * 1000) - output = pipeline(model, cfg, rng=rng) - for image in output[0]: - image_num = i + partition_size_per_node * node_id_per_cfg + partition_size_per_task * local_task_id - image.save(os.path.join(save_path, f'image{image_num:06d}.png')) + cfg.infer.prompts = input + rng = torch.Generator().manual_seed(cfg.infer.seed + local_task_id * 10 + node_id_per_cfg * 100) + output = pipeline(model, cfg, rng=rng) + for i, image in enumerate(img for batch in output for img in batch): + image_num = i + partition_size_per_node * node_id_per_cfg + partition_size_per_task * local_task_id + image.save(os.path.join(save_path, f'image{image_num:06d}.png')) if __name__ == "__main__": diff --git a/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py b/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py new file mode 100644 index 000000000000..0877d4eb4b2f --- /dev/null +++ b/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py @@ -0,0 +1,64 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import torch +from omegaconf import open_dict +from pytorch_lightning import Trainer +from pytorch_lightning.plugins.environments import TorchElasticEnvironment + +from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import MegatronLatentDiffusion +from nemo.collections.multimodal.parts.stable_diffusion.pipeline import pipeline +from nemo.collections.multimodal.parts.utils import setup_trainer_and_model_for_inference +from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP +from nemo.core.config import hydra_runner + + +@hydra_runner(config_path='conf', config_name='sd_lora_infer') +def main(cfg): + def model_cfg_modifier(model_cfg): + model_cfg.precision = cfg.trainer.precision + model_cfg.ckpt_path = None + model_cfg.inductor = False + if cfg.model.unet_config.from_pretrained: + model_cfg.unet_config.from_pretrained = cfg.model.unet_config.from_pretrained + + model_cfg = MegatronLatentDiffusion.restore_from( + restore_path=cfg.model.peft.restore_from_path, + trainer=None, + save_restore_connector=NLPSaveRestoreConnector(), + return_config=True, + ) + + with open_dict(model_cfg): + model_cfg_modifier(model_cfg) + + plugins = [] + plugins.append(TorchElasticEnvironment()) + strategy = NLPDDPStrategy(no_ddp_communication_hook=True, find_unused_parameters=False,) + trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer) + + model = MegatronLatentDiffusion(model_cfg, trainer=trainer) + model.setup_complete = True + + peft_cfg_cls = PEFT_CONFIG_MAP[model_cfg.peft.peft_scheme] + + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(model_cfg)) + rng = torch.Generator().manual_seed(cfg.infer.seed) + + model = model.model.cuda().eval() + pipeline(model, cfg, rng=rng) + + +if __name__ == "__main__": + main() diff --git a/examples/multimodal/text_to_image/stable_diffusion/sd_train.py 
b/examples/multimodal/text_to_image/stable_diffusion/sd_train.py index 9259b4960734..434150516d0c 100644 --- a/examples/multimodal/text_to_image/stable_diffusion/sd_train.py +++ b/examples/multimodal/text_to_image/stable_diffusion/sd_train.py @@ -20,6 +20,7 @@ from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import MegatronLatentDiffusion from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerBuilder from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP from nemo.core.config import hydra_runner from nemo.utils import logging from nemo.utils.exp_manager import exp_manager @@ -55,29 +56,27 @@ def main(cfg) -> None: torch.backends.cuda.matmul.allow_tf32 = True - if cfg.model.capture_cudagraph_iters >= 0: - # Required by CUDA graph with DDP - os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0" - - # Hack to avoid CUDA graph issue with AMP, PyTorch Lightning doesn't support - # changing autocast arguments for now. - # https://github.com/pytorch/pytorch/blob/v1.13.1/torch/cuda/graphs.py#L234 - def amp_autocast_init(self, *args, **kwargs): - if "cache_enabled" not in kwargs: - kwargs["cache_enabled"] = False - return self.__orig_init__(*args, **kwargs) - - torch.cuda.amp.autocast.__orig_init__ = torch.cuda.amp.autocast.__init__ - torch.cuda.amp.autocast.__init__ = amp_autocast_init - torch.autocast.__orig_init__ = torch.autocast.__init__ - torch.autocast.__init__ = amp_autocast_init - trainer = MegatronStableDiffusionTrainerBuilder(cfg).create_trainer() exp_manager(trainer, cfg.exp_manager) model = MegatronLatentDiffusion(cfg.model, trainer) + if cfg.model.get('peft', None): + + peft_cfg_cls = PEFT_CONFIG_MAP[cfg.model.peft.peft_scheme] + + if cfg.model.peft.restore_from_path is not None: + # initialize peft weights from a checkpoint instead of randomly + # This is not the same as resume training because optimizer states are not restored. 
+ logging.info(f"PEFT Weights will be loaded from {cfg.model.peft.restore_from_path}") + model.load_adapters(cfg.model.peft.restore_from_path, peft_cfg_cls(cfg.model)) + elif peft_cfg_cls is not None: + logging.info("Adding adapter weights to the model for PEFT") + model.add_adapter(peft_cfg_cls(cfg.model)) + else: + logging.info(f"Running full finetuning since no peft scheme is given.\n{model.summarize()}") + trainer.fit(model) diff --git a/examples/multimodal/vision_language_foundation/clip/megatron_clip_infer.py b/examples/multimodal/vision_language_foundation/clip/megatron_clip_infer.py index d77802e5a010..c99e7cbfc3bc 100644 --- a/examples/multimodal/vision_language_foundation/clip/megatron_clip_infer.py +++ b/examples/multimodal/vision_language_foundation/clip/megatron_clip_infer.py @@ -46,11 +46,11 @@ def model_cfg_modifier(model_cfg): ) if model.cfg.get("megatron_amp_O2", False): - vision_encoder = model.model.module.vision_encoder - text_encoder = model.model.module.text_encoder + vision_encoder = model.model.module.vision_encoder.eval() + text_encoder = model.model.module.text_encoder.eval() else: - vision_encoder = model.model.vision_encoder - text_encoder = model.model.text_encoder + vision_encoder = model.model.vision_encoder.eval() + text_encoder = model.model.text_encoder.eval() val_image_transform, text_transform = get_preprocess_fns(model.cfg, model.tokenizer, is_train=False,) diff --git a/nemo/README.md b/nemo/README.md index d7c95a070979..91b734b64361 100644 --- a/nemo/README.md +++ b/nemo/README.md @@ -7,4 +7,5 @@ NeMo (**Ne**ural **Mo**dules) is a toolkit for creating AI applications built ar * ASR - collection of modules and models for building speech recognition networks * TTS - collection of modules and models for building speech synthesis networks * NLP - collection of modules and models for building NLP networks +* Vision - collection of modules and models for building computer vision networks * Multimodal - collection of modules and models
for building multimodal networks diff --git a/nemo/collections/multimodal/data/clip/augmentations/augmentations.py b/nemo/collections/multimodal/data/clip/augmentations/augmentations.py index a0a96d39de04..d1de22f687e5 100644 --- a/nemo/collections/multimodal/data/clip/augmentations/augmentations.py +++ b/nemo/collections/multimodal/data/clip/augmentations/augmentations.py @@ -16,7 +16,8 @@ https://github.com/mlfoundations/open_clip/blob/28c994406e39a5babc749c76871d92f33e9c558d/src/open_clip/transform.py by @yaoyu-33 """ -from typing import Optional, Tuple +from dataclasses import asdict, dataclass +from typing import Any, Dict, Optional, Tuple, Union import torch import torch.nn as nn @@ -37,10 +38,23 @@ except (ImportError, ModuleNotFoundError): TORCHVISION_AVAILABLE = False +from nemo.utils import logging + OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073) OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711) +@dataclass +class AugmentationCfg: + scale: Tuple[float, float] = (0.9, 1.0) + ratio: Optional[Tuple[float, float]] = None + color_jitter: Optional[Union[float, Tuple[float, float, float]]] = None + interpolation: Optional[str] = None + re_prob: Optional[float] = None + re_count: Optional[int] = None + use_timm: bool = False + + class ResizeMaxSize(nn.Module): def __init__(self, max_size, interpolation=InterpolationMode.BICUBIC, fn='max', fill=0): super().__init__() @@ -78,6 +92,7 @@ def image_transform( std: Optional[Tuple[float, ...]] = None, resize_longest_max: bool = False, fill_color: int = 0, + aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None, ): assert TORCHVISION_AVAILABLE, "Torchvision imports failed but they are required." 
mean = mean or OPENAI_DATASET_MEAN @@ -92,16 +107,50 @@ def image_transform( # for square size, pass size as int so that Resize() uses aspect preserving shortest edge image_size = image_size[0] + if isinstance(aug_cfg, dict): + aug_cfg = AugmentationCfg(**aug_cfg) + else: + aug_cfg = aug_cfg or AugmentationCfg() normalize = Normalize(mean=mean, std=std) if is_train: - return Compose( - [ - RandomResizedCrop(image_size, scale=(0.9, 1.0), interpolation=InterpolationMode.BICUBIC), - _convert_to_rgb, - ToTensor(), - normalize, - ] - ) + aug_cfg_dict = {k: v for k, v in asdict(aug_cfg).items() if v is not None} + use_timm = aug_cfg_dict.pop('use_timm', False) + if use_timm: + from timm.data import create_transform # timm can still be optional + + if isinstance(image_size, (tuple, list)): + assert len(image_size) >= 2 + input_size = (3,) + image_size[-2:] + else: + input_size = (3, image_size, image_size) + # by default, timm aug randomly alternates bicubic & bilinear for better robustness at inference time + aug_cfg_dict.setdefault('interpolation', 'random') + aug_cfg_dict.setdefault('color_jitter', None) # disable by default + train_transform = create_transform( + input_size=input_size, + is_training=True, + hflip=0.0, + mean=mean, + std=std, + re_mode='pixel', + **aug_cfg_dict, + ) + else: + train_transform = Compose( + [ + RandomResizedCrop( + image_size, scale=aug_cfg_dict.pop('scale'), interpolation=InterpolationMode.BICUBIC, + ), + _convert_to_rgb, + ToTensor(), + normalize, + ] + ) + if aug_cfg_dict: + logging.warning( + f'Unused augmentation cfg items, specify `use_timm` to use ({list(aug_cfg_dict.keys())}).' 
+ ) + return train_transform else: if resize_longest_max: transforms = [ResizeMaxSize(image_size, fill=fill_color)] diff --git a/nemo/collections/multimodal/data/common/utils.py b/nemo/collections/multimodal/data/common/utils.py new file mode 100644 index 000000000000..31e83feb1375 --- /dev/null +++ b/nemo/collections/multimodal/data/common/utils.py @@ -0,0 +1,33 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import open_clip +import torch + + +def get_collate_fn(first_stage_key="images_moments", cond_stage_key="captions"): + def collate_fn_with_tokenize(batch): + images_moments = [s[first_stage_key] for s in batch] + cond_inputs = [s[cond_stage_key] for s in batch] + if cond_stage_key == "captions": + tokens = open_clip.tokenize(cond_inputs) + else: + tokens = torch.stack(cond_inputs) + batch = { + first_stage_key: torch.cat(images_moments), + cond_stage_key: tokens, + } + return batch + + return collate_fn_with_tokenize diff --git a/nemo/collections/multimodal/data/common/webdataset.py b/nemo/collections/multimodal/data/common/webdataset.py index d0e1b19d444e..8d70a03fa911 100644 --- a/nemo/collections/multimodal/data/common/webdataset.py +++ b/nemo/collections/multimodal/data/common/webdataset.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
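The `aug_cfg` handling in the `image_transform` hunk of `augmentations.py` can be sketched in isolation, without torchvision or timm. This is a minimal sketch, not the NeMo implementation: `resolve_aug_cfg` is a hypothetical helper name introduced here for illustration, and it only reproduces the normalization step (dict to dataclass, dropping `None` values, popping `use_timm`).

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple, Union


@dataclass
class AugmentationCfg:
    # Same fields and defaults as the dataclass added in the diff.
    scale: Tuple[float, float] = (0.9, 1.0)
    ratio: Optional[Tuple[float, float]] = None
    color_jitter: Optional[Union[float, Tuple[float, float, float]]] = None
    interpolation: Optional[str] = None
    re_prob: Optional[float] = None
    re_count: Optional[int] = None
    use_timm: bool = False


def resolve_aug_cfg(aug_cfg=None):
    """Hypothetical helper: return (use_timm, options that were actually set)."""
    # Accept a plain dict, an AugmentationCfg instance, or None.
    if isinstance(aug_cfg, dict):
        aug_cfg = AugmentationCfg(**aug_cfg)
    else:
        aug_cfg = aug_cfg or AugmentationCfg()
    # Keep only options with a value, so unset fields never reach the transform.
    options = {k: v for k, v in asdict(aug_cfg).items() if v is not None}
    use_timm = options.pop('use_timm', False)
    return use_timm, options


use_timm, options = resolve_aug_cfg({'re_prob': 0.25})
print(use_timm, options)
```

In the non-timm branch of the diff only `scale` is consumed, which is why any leftover keys (such as `re_prob` here) trigger the "Unused augmentation cfg items" warning.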
+import glob import io import itertools import json @@ -123,6 +124,13 @@ def __init__( self.augmentations = dataset_cfg.validation.get("augmentations", None) self.filterings = dataset_cfg.validation.get("filterings", None) + # Optionally expand the dataset path as a glob pattern + # This can be used to specify multiple .zip files: dataset_path="data/*.zip" + if isinstance(dataset_path, str): + glob_path = dataset_path + dataset_path = glob.glob(dataset_path) + assert len(dataset_path) > 0, f"No files found for {glob_path}" + if "boto3" in dataset_cfg: logging.info(f'Init boto3 using credentials file at {dataset_cfg.boto3.credentials_file}') self.use_boto3 = True diff --git a/nemo/collections/multimodal/data/neva/conversation.py b/nemo/collections/multimodal/data/neva/conversation.py index 4e53eb5190f6..886049dd5170 100644 --- a/nemo/collections/multimodal/data/neva/conversation.py +++ b/nemo/collections/multimodal/data/neva/conversation.py @@ -15,6 +15,18 @@ from enum import Enum, auto from typing import List +DEFAULT_PAD_TOKEN = "" +DEFAULT_BOS_TOKEN = "" +DEFAULT_EOS_TOKEN = "" +DEFAULT_UNK_TOKEN = "" +DEFAULT_IMAGE_TOKEN = "" +DEFAULT_SYSTEM_TOKEN = "" +DEFAULT_SEPARATOR_TOKEN = "" +DEFAULT_LABELS_TOKEN = "" +DEFAULT_IMAGE_PATCH_TOKEN = "" +DEFAULT_IM_START_TOKEN = "" +DEFAULT_IM_END_TOKEN = "" + class SeparatorStyle(Enum): """Different separator style.""" @@ -70,6 +82,8 @@ def get_prompt(self): if type(message) is tuple: message, _, _ = message ret += role + ": " + message + seps[i % 2] + if i % 2 == 1 and i != len(messages) - 1: # Assistant end + ret += " " else: ret += role + ":" elif self.sep_style == SeparatorStyle.LLAMA_2: @@ -88,7 +102,7 @@ def get_prompt(self): message = wrap_sys(self.system) + message if i % 2 == 0: message = wrap_inst(message) - ret += self.sep + message + ret += self.sep + " " + message else: ret += " " + message + " " + self.sep2 else: @@ -245,8 +259,8 @@ def dict(self): messages=(), offset=0, sep_style=SeparatorStyle.NVGPT, - sep="", -
sep2="System\n", + sep=DEFAULT_SEPARATOR_TOKEN, + sep2=f"{DEFAULT_SYSTEM_TOKEN}System\n", ) conv_vicuna_v0 = Conversation( @@ -291,7 +305,7 @@ def dict(self): offset=0, sep_style=SeparatorStyle.TWO, sep=" ", - sep2="", + sep2=DEFAULT_EOS_TOKEN, ) conv_llama_2 = Conversation( @@ -303,8 +317,8 @@ def dict(self): messages=(), offset=0, sep_style=SeparatorStyle.LLAMA_2, - sep="", - sep2="", + sep=DEFAULT_BOS_TOKEN, + sep2=DEFAULT_EOS_TOKEN, ) conv_llava_llama_2 = Conversation( @@ -316,8 +330,8 @@ def dict(self): messages=(), offset=0, sep_style=SeparatorStyle.LLAMA_2, - sep="", - sep2="", + sep=DEFAULT_BOS_TOKEN, + sep2=DEFAULT_EOS_TOKEN, ) conv_llava_plain = Conversation( @@ -355,7 +369,7 @@ def dict(self): offset=0, sep_style=SeparatorStyle.TWO, sep=" ", - sep2="", + sep2=DEFAULT_EOS_TOKEN, ) conv_llava_v1_mmtag = Conversation( @@ -367,7 +381,7 @@ def dict(self): offset=0, sep_style=SeparatorStyle.TWO, sep=" ", - sep2="", + sep2=DEFAULT_EOS_TOKEN, version="v1_mmtag", ) diff --git a/nemo/collections/multimodal/data/neva/neva_dataset.py b/nemo/collections/multimodal/data/neva/neva_dataset.py index 7ed1597814c1..4dd6b120c8c8 100644 --- a/nemo/collections/multimodal/data/neva/neva_dataset.py +++ b/nemo/collections/multimodal/data/neva/neva_dataset.py @@ -30,21 +30,24 @@ from transformers import CLIPImageProcessor import nemo.collections.multimodal.data.neva.conversation as conversation_lib +from nemo.collections.multimodal.data.clip.augmentations.augmentations import image_transform +from nemo.collections.multimodal.data.neva.conversation import ( + DEFAULT_BOS_TOKEN, + DEFAULT_EOS_TOKEN, + DEFAULT_IM_END_TOKEN, + DEFAULT_IM_START_TOKEN, + DEFAULT_IMAGE_PATCH_TOKEN, + DEFAULT_IMAGE_TOKEN, + DEFAULT_LABELS_TOKEN, + DEFAULT_PAD_TOKEN, + DEFAULT_SEPARATOR_TOKEN, + DEFAULT_SYSTEM_TOKEN, + DEFAULT_UNK_TOKEN, +) from nemo.collections.nlp.modules.common.megatron.utils import get_ltor_masks_and_position_ids -MAX_NUM_IMAGES = 4 +MAX_NUM_IMAGES = 1 IGNORE_INDEX = -1 
-DEFAULT_PAD_TOKEN = "" -DEFAULT_BOS_TOKEN = "" -DEFAULT_EOS_TOKEN = "" -DEFAULT_UNK_TOKEN = "" -DEFAULT_IMAGE_TOKEN = "" -DEFAULT_SYSTEM_TOKEN = "" -DEFAULT_SEPARATOR_TOKEN = "" -DEFAULT_LABELS_TOKEN = "" -DEFAULT_IMAGE_PATCH_TOKEN = "" -DEFAULT_IM_START_TOKEN = "" -DEFAULT_IM_END_TOKEN = "" class TarOrFolderImageLoader: @@ -135,34 +138,36 @@ def tokenize( return result -def preprocess_multimodal(sources: dict, multimodal_cfg: dict, cur_token_len: int,) -> Dict: +def preprocess_multimodal(sources: dict, multimodal_cfg: dict, cur_token_len: int, use_plain: bool = False) -> Dict: """ - Preprocesses a given multimodal input based on specified configurations. + Preprocesses multimodal sources based on the provided configuration. - This function modifies the 'sources' dictionary, primarily focusing on conversations. It checks if the input - is multimodal based on 'multimodal_cfg'. If not, it returns the 'sources' unmodified. For multimodal inputs, - it processes each conversation in 'sources'. If 'sep_image_conv_front' is set in 'multimodal_cfg', the function - asserts the presence of 'DEFAULT_IMAGE_TOKEN' at the beginning of each conversation, removes it, and restructures - the conversation's first turn with this token and other formatting details. Furthermore, the function replaces - 'DEFAULT_IMAGE_TOKEN' with a series of 'DEFAULT_IMAGE_PATCH_TOKEN' tokens, the count of which depends on - 'image_token_len' and 'use_im_start_end' configuration. + This function modifies the sources for multimodal data processing. It checks if the data is multimodal and + adjusts the token lengths accordingly. It also handles the start and end tokens for images and replaces + image tokens in conversations. Parameters: - - sources (dict): A dictionary containing the source data to be processed. Each source is expected to have - 'conversations' as one of its keys. - - multimodal_cfg (dict): A configuration dictionary specifying how the multimodal data should be processed. 
- Key configurations include 'is_multimodal', 'sep_image_conv_front', and 'use_im_start_end'. - - cur_token_len (int): The current length of image tokens, used to determine the number of patch tokens - to replace the 'DEFAULT_IMAGE_TOKEN'. + - sources (dict): A dictionary containing the multimodal sources to be processed. + - multimodal_cfg (dict): A configuration dictionary specifying various options for multimodal processing. + It includes keys like 'is_multimodal', 'use_im_start_end', and 'sep_image_conv_front'. + - cur_token_len (int): The current length of tokens to be considered for image processing. + - use_plain (bool, optional): A boolean flag to use plain image token replacement without additional processing. + Defaults to False. Returns: - - dict: The modified 'sources' dictionary after applying the multimodal preprocessing. + - dict: The processed sources dictionary after applying multimodal preprocessing steps. """ is_multimodal = multimodal_cfg['is_multimodal'] image_token_len = cur_token_len if not is_multimodal: return sources + if multimodal_cfg['use_im_start_end']: + replace_token = DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + else: + replace_token = DEFAULT_IMAGE_PATCH_TOKEN * (image_token_len - 2) + replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN + for source in sources: conversation = source['conversations'] if multimodal_cfg['sep_image_conv_front']: @@ -175,12 +180,10 @@ def preprocess_multimodal(sources: dict, multimodal_cfg: dict, cur_token_len: in + ": " + conversation[0]['value'] ) + if use_plain: + assert DEFAULT_IMAGE_TOKEN in conversation[0]['value'] + conversation[0]['value'] = DEFAULT_IMAGE_TOKEN for turn in conversation: - if multimodal_cfg['use_im_start_end']: - replace_token = DEFAULT_IMAGE_PATCH_TOKEN * image_token_len - else: - replace_token = DEFAULT_IMAGE_PATCH_TOKEN * (image_token_len - 2) - replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN turn["value"] = 
turn["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token) return sources @@ -188,24 +191,19 @@ def preprocess_multimodal(sources: dict, multimodal_cfg: dict, cur_token_len: in def preprocess_llama_2(sources: dict, tokenizer, cfg,) -> Dict: """ - Preprocess a given set of conversational sources using llama_2 chat conversation template + Preprocesses sources for the LLaMA 2 model configuration. - This function processes conversations by first ensuring the conversation starts with a 'human' role, then tokenizes the conversations, applies specific token replacements, and finally masks labels for training purposes. + The function applies prompt templates and tokenizes the conversations according to the LLaMA 2 model specifications. + It involves special handling of tokens, masking of labels, and adjustments based on configuration settings. Parameters: - - sources: A dictionary containing conversational data. Expected format is a dict of conversations, where each conversation is a list of messages, and each message is a dict with 'from' (role) and 'value' (message text). - - tokenizer: A tokenizer from the Hugging Face Transformers library used for tokenizing the conversations. - - cfg: Configuration settings which include 'add_extra_token' (bool) to determine if an extra token should be added to the tokenized output, and 'context_length' for specifying the tokenization context length. + - sources (dict): A dictionary of sources containing conversations to be processed. + - tokenizer: The tokenizer to be used for processing the text. + - cfg: Configuration settings for preprocessing, including context length and additional tokens. Returns: - - Dict: A dictionary containing two keys: - - 'tokens': A tensor of tokenized conversation data. - - 'labels': A tensor of labels for the conversation data, used for training models. Labels are masked based on the conversation structure. 
- - Note: - - The function includes specific token replacements (e.g., DEFAULT_IMAGE_PATCH_TOKEN, , ) and masking techniques for labels. - - It is designed to work with conversational data where messages alternate between a 'human' and a 'gpt' role. - - The function asserts that each message in a conversation alternates between the defined roles and skips messages not starting with the 'human' role. + - Dict: A dictionary containing tokenized and labeled data suitable for the LLaMA 2 model. + This includes tokens, labels, and any special processing as defined in the configuration. """ conv = conversation_lib.conv_llava_llama_2.copy() roles = {"human": conv.roles[0], "gpt": conv.roles[1]} @@ -259,7 +257,7 @@ def preprocess_llama_2(sources: dict, tokenizer, cfg,) -> Dict: round_len = len(tokenizer.text_to_ids(rou + conv.sep2)) if i > 0: round_len -= 1 # Remove extra token added by sp tokenizer - instruction_len = len(tokenizer.text_to_ids(parts[0])) - 1 + instruction_len = len(tokenizer.text_to_ids(parts[0])) - 2 target[cur_len : cur_len + instruction_len] = IGNORE_INDEX cur_len += round_len @@ -280,24 +278,18 @@ def preprocess_llama_2(sources: dict, tokenizer, cfg,) -> Dict: def preprocess_v1(sources: dict, tokenizer, cfg,) -> Dict: """ - Preprocess a given set of conversational sources using vicuna v1 conversation template + Preprocesses sources for the Vicuna V1 model configuration. - This function processes conversations by first ensuring the conversation starts with a 'human' role, then tokenizes the conversations, applies specific token replacements, and finally masks labels for training purposes. + Similar to `preprocess_llama_2`, this function applies prompt templates and performs tokenization, but it is tailored + for the Vicuna V1 model. It includes specific handling for token translations, label masking, and tokenizer configuration. Parameters: - - sources: A dictionary containing conversational data. 
Expected format is a dict of conversations, where each conversation is a list of messages, and each message is a dict with 'from' (role) and 'value' (message text). - - tokenizer: A tokenizer from the Hugging Face Transformers library used for tokenizing the conversations. - - cfg: Configuration settings which include 'add_extra_token' (bool) to determine if an extra token should be added to the tokenized output, and 'context_length' for specifying the tokenization context length. + - sources (dict): A dictionary of sources containing conversations to be processed. + - tokenizer: The tokenizer to be used for processing the text. + - cfg: Configuration settings for preprocessing, which may include context length and additional tokens. Returns: - - Dict: A dictionary containing two keys: - - 'tokens': A tensor of tokenized conversation data. - - 'labels': A tensor of labels for the conversation data, used for training models. Labels are masked based on the conversation structure. - - Note: - - The function includes specific token replacements (e.g., DEFAULT_IMAGE_PATCH_TOKEN, , ) and masking techniques for labels. - - It is designed to work with conversational data where messages alternate between a 'human' and a 'gpt' role. - - The function asserts that each message in a conversation alternates between the defined roles and skips messages not starting with the 'human' role. + - Dict: A dictionary containing the processed data, including tokens and labels, formatted for the Vicuna V1 model. 
""" conv = conversation_lib.conv_vicuna_v1.copy() roles = {"human": conv.roles[0], "gpt": conv.roles[1]} @@ -400,8 +392,8 @@ def preprocess_nvgpt(sources: dict, tokenizer, cfg,) -> Dict: strip_end_for_inference = False for i, turn in enumerate(source['conversations']): - if i % 2 == 0: - turn['from'] = conv.roles[0] + if i % 2 == 1: + turn['from'] = conv.roles[1] if 'label' not in turn: turn[ 'label' @@ -413,7 +405,7 @@ def preprocess_nvgpt(sources: dict, tokenizer, cfg,) -> Dict: True # in inference, current turn is empty, thus end tokens need to striped. ) else: - turn['from'] = conv.roles[1] + turn['from'] = conv.roles[0] conv.append_message(turn['from'], turn['value']) context = conv.get_prompt() if strip_end_for_inference: @@ -470,6 +462,57 @@ def preprocess_nvgpt(sources: dict, tokenizer, cfg,) -> Dict: return dict(tokens=tokens, labels=labels,) +def preprocess_plain(sources, tokenizer, cfg,) -> Dict: + """ + Preprocesses plain text sources (no template) for tokenization and label generation. + + This function concatenates conversations with an end signal, tokenizes them, and prepares labels for training. + It handles sources with a specific structure (expecting two elements in 'conversations') and includes the + option to add an extra token as specified in the configuration. The function also applies masking to the labels. + + Parameters: + - sources: A list of source dictionaries. Each source dictionary should have a key 'conversations' + containing a list of conversation parts. + - tokenizer: The tokenizer to be used for converting text to tokens. + - cfg: Configuration dictionary which may include 'context_length' and 'add_extra_token' settings. + + Returns: + - Dict: A dictionary containing tokenized data and corresponding labels. This includes 'tokens' which are the + tokenized representations of the conversations, and 'labels' which are used for training the model. 
The labels + have specific indices masked with IGNORE_INDEX as per the preprocessing logic. + """ + # add end signal and concatenate together + conversations = [] + for source in sources: + source = source['conversations'] + assert len(source) == 2 + # This line is different from LLaVA repo, we inserted '\n' after . + conversation = source[0]['value'] + source[1]['value'] + '\n' + conversations.append(conversation) + # tokenize conversations + add_extra_token = cfg.get("add_extra_token") + tokens = tokenize( + texts=conversations, + tokenizer=tokenizer, + context_length=cfg.get("context_length"), + add_extra_token=add_extra_token, + ) + labels = tokens.clone().detach() + for target, source in zip(labels, sources): + source = source['conversations'] + tokenized_len = len(tokenizer.text_to_ids(source[0]['value'])) + target[:tokenized_len] = IGNORE_INDEX + + if add_extra_token: + tokens = tokens[:, :-1].contiguous() + labels = labels[:, 1:].contiguous() + else: + labels = torch.roll(labels, shifts=-1, dims=-1) + labels[:, -1] = IGNORE_INDEX + + return dict(tokens=tokens, labels=labels,) + + class LazySupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" @@ -498,7 +541,6 @@ def __len__(self): def __getitem__(self, i) -> Dict[str, torch.Tensor]: sources = self.list_data_dict[i] - processor = self.processor if isinstance(i, int): sources = [sources] assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME @@ -511,33 +553,40 @@ def __getitem__(self, i) -> Dict[str, torch.Tensor]: image = self.image_loader.open_image(image_file) if image is None: logging.warning(f"Image {image_file} could not be found!") - if self.multimodal_cfg['image_aspect_ratio'] == 'keep': - max_hw, min_hw = max(image.size), min(image.size) - aspect_ratio = max_hw / min_hw - max_len, min_len = 448, 224 - shortest_edge = int(min(max_len / aspect_ratio, min_len)) - image = processor.preprocess( - image, return_tensors='pt', do_center_crop=False, size={"shortest_edge": 
shortest_edge} - )['pixel_values'][0] - elif self.multimodal_cfg['image_aspect_ratio'] == 'pad': - - def expand2square(pil_img, background_color): - width, height = pil_img.size - if width == height: - return pil_img - elif width > height: - result = Image.new(pil_img.mode, (width, width), background_color) - result.paste(pil_img, (0, (width - height) // 2)) - return result - else: - result = Image.new(pil_img.mode, (height, height), background_color) - result.paste(pil_img, ((height - width) // 2, 0)) - return result - - image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean)) - image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + if isinstance(self.processor, CLIPImageProcessor): + # image processor from HF + if self.multimodal_cfg['image_aspect_ratio'] == 'keep': + max_hw, min_hw = max(image.size), min(image.size) + aspect_ratio = max_hw / min_hw + max_len, min_len = 448, 224 + shortest_edge = int(min(max_len / aspect_ratio, min_len)) + image = self.processor.preprocess( + image, return_tensors='pt', do_center_crop=False, size={"shortest_edge": shortest_edge} + )['pixel_values'][0] + elif self.multimodal_cfg['image_aspect_ratio'] == 'pad': + + def expand2square(pil_img, background_color): + width, height = pil_img.size + if width == height: + return pil_img + elif width > height: + result = Image.new(pil_img.mode, (width, width), background_color) + result.paste(pil_img, (0, (width - height) // 2)) + return result + else: + result = Image.new(pil_img.mode, (height, height), background_color) + result.paste(pil_img, ((height - width) // 2, 0)) + return result + + image = expand2square(image, tuple(int(x * 255) for x in self.processor.image_mean)) + image = self.processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + else: + image = self.processor.preprocess(image, return_tensors='pt')['pixel_values'][0] else: - image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + assert ( + 
self.multimodal_cfg['image_aspect_ratio'] == 'square' + ), 'The NeMo image transform requires `image_aspect_ratio` to be set to `square`.' + image = self.processor(image) images.append(image) images_tensors = torch.tensor([]) if images: @@ -545,7 +594,12 @@ def expand2square(pil_img, background_color): cur_token_len = (images_tensors[0].shape[1] // 14) * ( images_tensors[0].shape[2] // 14 ) # FIXME: 14 is hardcoded patch size - sources = preprocess_multimodal(copy.deepcopy(sources), self.multimodal_cfg, cur_token_len) + sources = preprocess_multimodal( + copy.deepcopy(sources), + self.multimodal_cfg, + cur_token_len, + use_plain=(self.conv_template == "plain"), + ) else: images_tensors = torch.tensor([]) sources = copy.deepcopy(sources) @@ -556,6 +610,8 @@ def expand2square(pil_img, background_color): data_dict = preprocess_v1(sources, self.tokenizer, self.multimodal_cfg,) elif self.conv_template == "llama_2": data_dict = preprocess_llama_2(sources, self.tokenizer, self.multimodal_cfg,) + elif self.conv_template == "plain": + data_dict = preprocess_plain(sources, self.tokenizer, self.multimodal_cfg,) else: raise ValueError(f"Conversation template `{self.conv_template}` is not supported in Neva now.") @@ -564,10 +620,13 @@ def expand2square(pil_img, background_color): # image exist in the data if self.multimodal_cfg['is_multimodal']: - crop_size = self.processor.crop_size + if isinstance(self.processor, CLIPImageProcessor): + crop_size = [self.processor.crop_size['height'], self.processor.crop_size['width']] + else: + crop_size = self.multimodal_cfg['crop_size'] # image does not exist in the data, but the model is multimodal zero_padding = torch.zeros( - (MAX_NUM_IMAGES - len(images_tensors), 3, crop_size['height'], crop_size['width']), dtype=torch.float + (MAX_NUM_IMAGES - len(images_tensors), 3, crop_size[0], crop_size[1]), dtype=torch.float ) images_tensors = torch.cat((images_tensors, zero_padding), dim=0) data_dict['image'] = images_tensors @@ -669,15 +728,15 @@ def
make_supervised_data_module(tokenizer, model_cfg) -> Dict: add_extra_token = 1 if getattr(model_cfg, 'no_seqlen_plus_one_input_tokens', False): add_extra_token = 0 + crop_size = data_cfg.get("crop_size", (224, 224)) if mm_cfg.vision_encoder.from_hf: image_processor = CLIPImageProcessor.from_pretrained( mm_cfg.vision_encoder.from_pretrained, torch_dtype=torch.bfloat16 ) else: # TODO(yuya): Fix this hard-code for our own CLIP - image_processor = CLIPImageProcessor.from_pretrained( - "openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16 - ) + image_processor = image_transform(crop_size, is_train=False, mean=None, std=None,) + train_dataset = NevaDataset( tokenizer=tokenizer, data_path=data_cfg.data_path, @@ -685,6 +744,7 @@ def make_supervised_data_module(tokenizer, model_cfg) -> Dict: is_multimodal=data_cfg.is_multimodal, sep_image_conv_front=data_cfg.sep_image_conv_front, conv_template=data_cfg.get("conv_template", "nvgpt"), + crop_size=crop_size, image_token_len=data_cfg.image_token_len, image_folder=data_cfg.image_folder, image_aspect_ratio=data_cfg.image_aspect_ratio, @@ -694,5 +754,5 @@ def make_supervised_data_module(tokenizer, model_cfg) -> Dict: context_length=model_cfg.encoder_seq_length, ), ) - # data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer) + return dict(train_dataset=train_dataset, eval_dataset=train_dataset) diff --git a/nemo/collections/multimodal/data/stable_diffusion/stable_diffusion_dataset.py b/nemo/collections/multimodal/data/stable_diffusion/stable_diffusion_dataset.py index 445932124718..e4f3dea59169 100644 --- a/nemo/collections/multimodal/data/stable_diffusion/stable_diffusion_dataset.py +++ b/nemo/collections/multimodal/data/stable_diffusion/stable_diffusion_dataset.py @@ -183,3 +183,43 @@ def transform_fn(sample): ) return train_data, val_data + + +def build_train_valid_precached_clip_datasets(model_cfg, consumed_samples): + data_cfg = model_cfg.data + + # This generator maps each tuple sample to a dictionary.
+ def tuple_to_dict(inp): + for input in inp: + out_dict = dict() + out_dict[model_cfg.first_stage_key] = input[0] + out_dict[model_cfg.cond_stage_key] = input[1] + yield out_dict + + def transform_fn(sample): + latents, text_embed = sample["pyd"]["image_embed"], sample["pyd"]['captions_embed'] + latents = torch.from_numpy(latents) + text_embed = torch.from_numpy(text_embed) + + # latents are of shape ([4, 64, 64]) + return latents, text_embed + + train_data = WebDatasetCommon( + dataset_cfg=data_cfg, + consumed_samples=consumed_samples, + map_fn=transform_fn, + compose_fn=tuple_to_dict, + is_train=True, + ) + + val_data = None + if data_cfg.get("validation") is not None and data_cfg.validation.get("data_path"): + val_data = WebDatasetCommon( + dataset_cfg=data_cfg, + consumed_samples=consumed_samples, + map_fn=transform_fn, + compose_fn=tuple_to_dict, + is_train=False, + ) + + return train_data, val_data diff --git a/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py b/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py index 1a38d86742b4..5fd0fa830dd0 100644 --- a/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py +++ b/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py @@ -13,7 +13,6 @@ # limitations under the License. 
import os -import tempfile from functools import partial from itertools import chain from typing import Any, Optional @@ -25,9 +24,8 @@ from pytorch_lightning.trainer.trainer import Trainer from transformers import CLIPVisionModel +from nemo.collections.multimodal.data.neva.conversation import DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN from nemo.collections.multimodal.data.neva.neva_dataset import ( - DEFAULT_IM_END_TOKEN, - DEFAULT_IM_START_TOKEN, DataCollatorForSupervisedDataset, make_supervised_data_module, ) @@ -35,13 +33,10 @@ CLIPVisionTransformer, MegatronCLIPModel, ) -from nemo.collections.multimodal.parts.utils import extend_instance -from nemo.collections.nlp.data.language_modeling.megatron.data_samplers import ( - MegatronPretrainingRandomSampler, - MegatronPretrainingSampler, -) +from nemo.collections.multimodal.parts.utils import extend_instance, load_nemo_model_weights +from nemo.collections.nlp.data.language_modeling.megatron.data_samplers import MegatronPretrainingSampler from nemo.collections.nlp.models.language_modeling.megatron.gpt_model import GPTModel -from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel +from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel, get_specs from nemo.collections.nlp.models.nlp_model import NLPModel from nemo.collections.nlp.modules.common.megatron.adapters.parallel_adapters import ( AdapterName, @@ -57,11 +52,20 @@ ) from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, OutputType, SamplingParam from nemo.collections.nlp.parts.mixins.multimodal_adapter_mixins import MultimodalAdapterModelMixin -from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector from nemo.collections.nlp.parts.utils_funcs import get_last_rank +from nemo.collections.vision.data.megatron.data_samplers import MegatronVisionPretrainingRandomSampler from nemo.core import adapter_mixins from 
nemo.core.classes.common import PretrainedModelInfo -from nemo.utils import AppState, logging +from nemo.utils import logging + +try: + import apex.transformer.pipeline_parallel.utils + + HAVE_APEX = True + +except (ImportError, ModuleNotFoundError): + + HAVE_APEX = False try: from megatron.core import InferenceParams, dist_checkpointing, parallel_state @@ -82,6 +86,7 @@ def __init__(self, model_cfg, model_parallel_config, pre_process=True, post_proc model_cfg, model_parallel_config, pre_process=pre_process, post_process=post_process, skip_head=True, ) self.frozen = False + self.dtype = self.config.params_dtype def train(self, mode): if self.frozen: @@ -239,7 +244,7 @@ def __init__( self.media_start_id = media_start_id self.media_end_id = media_end_id self.mcore_gpt = mcore_gpt - self.dist_ckpt = False + self.is_dist_ckpt = False if getattr(self, 'language_model', None) is not None: self.embedding = self.language_model.embedding @@ -287,48 +292,10 @@ def _load_model_weights(self, nemo_path): """ Shared method to load model weights from a given nemo_path. 
""" - if torch.cuda.is_available(): - map_location = torch.device('cuda') - else: - map_location = torch.device('cpu') - - save_restore_connector = NLPSaveRestoreConnector() - cwd = os.getcwd() - app_state = AppState() - - with tempfile.TemporaryDirectory() as tmpdir: - try: - if os.path.isfile(nemo_path): - save_restore_connector._unpack_nemo_file(path2file=nemo_path, out_folder=tmpdir) - else: - tmpdir = nemo_path - os.chdir(tmpdir) - if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1: - model_weights = save_restore_connector._inject_model_parallel_rank_for_ckpt( - tmpdir, save_restore_connector.model_weights_ckpt - ) - else: - model_weights = os.path.join(tmpdir, save_restore_connector.model_weights_ckpt) - - state_dict = save_restore_connector._load_state_dict_from_disk( - model_weights, map_location=map_location - ) - - # distributed checkpointing - if state_dict is None: - self.dist_ckpt = True - sharded_state_dict = self.sharded_state_dict(prefix="model.") - checkpoint = dict(state_dict=sharded_state_dict) - tmp_model_weights_ckpt = os.path.join(tmpdir, save_restore_connector.model_weights_ckpt) - tmp_model_weights_dir = os.path.splitext(tmp_model_weights_ckpt)[0] - assert os.path.isdir(tmp_model_weights_dir), f'Expected {tmp_model_weights_dir} to be a directory.' 
- checkpoint = dist_checkpointing.load( - sharded_state_dict=checkpoint, checkpoint_dir=tmp_model_weights_dir, - ) - state_dict = checkpoint["state_dict"] - - finally: - os.chdir(cwd) + sharded_state_dict = None + if getattr(self, "sharded_state_dict", None) is not None: + sharded_state_dict = self.sharded_state_dict(prefix="model.") + state_dict, self.is_dist_ckpt = load_nemo_model_weights(nemo_path, sharded_state_dict) return state_dict @@ -352,7 +319,7 @@ def load_llm_weights(self, nemo_path): state_dict = self._load_model_weights(nemo_path) new_state_dict = {} - if self.dist_ckpt or self.mcore_gpt: + if self.is_dist_ckpt or self.mcore_gpt: for k, v in state_dict.items(): new_k = k if k.startswith("model."): @@ -508,6 +475,7 @@ def dummy(): media_end_id=media_end_id, mcore_gpt=self.mcore_gpt, config=self.transformer_config, + transformer_layer_spec=get_specs(self.spec_name), vocab_size=self.cfg.get('override_vocab_size', self.padded_vocab_size), max_sequence_length=self.cfg.get('encoder_seq_length', 512), pre_process=pre_process, @@ -517,6 +485,7 @@ def dummy(): position_embedding_type=self.cfg.get('position_embedding_type', 'learned_absolute'), rotary_percent=self.cfg.get('rotary_percentage', 1.0), seq_len_interpolation_factor=self.cfg.get('seq_len_interpolation_factor', None), + rotary_base=self.cfg.get('rotary_base', 10000), ) else: model = NevaModel( @@ -602,6 +571,33 @@ def setup_optimizer_param_groups(self): params_with_grad = [param for param in param_group['params'] if param.requires_grad] param_group['params'] = params_with_grad + # set projection matrix and lora to two param groups with different LR + if self.use_peft: + assert len(self._optimizer_param_groups) == 1 + assert len(self.adapter_keys) == len(self._optimizer_param_groups[0]['params']) + # Mapping from parameter objects to their names + param_to_name = { + param: name + for name, param in self.model.named_parameters() + if name or name.replace("model.module.", "model.", "1") in 
self.adapter_keys + } + # Match the parameters and separate them into two groups + group1_params, group2_params = [], [] + for param in self._optimizer_param_groups[0]['params']: + param_name = param_to_name.get(param) + if 'mm_projector' in param_name: + group2_params.append(param) + else: + group1_params.append(param) + + base_lr = self._cfg.optim.get('lr') + mm_projector_lr_ratio = 0.1 # hard-coded ratio + # Create two new optimizer param groups + self._optimizer_param_groups = [ + {'params': group1_params, 'lr': base_lr}, + {'params': group2_params, 'lr': base_lr * mm_projector_lr_ratio}, + ] + def forward(self, tokens, text_position_ids, attention_mask, labels, media=None): forward_args = { 'input_ids': tokens, @@ -893,13 +889,15 @@ def build_pretraining_data_loader( pad_samples_to_global_batch_size=pad_samples_to_global_batch_size, ) elif self.cfg.data.dataloader_type == 'cyclic': - batch_sampler = MegatronPretrainingRandomSampler( + batch_sampler = MegatronVisionPretrainingRandomSampler( + dataset=dataset, total_samples=len(dataset), consumed_samples=consumed_samples, micro_batch_size=self.cfg.micro_batch_size, data_parallel_rank=parallel_state.get_data_parallel_rank(), data_parallel_size=parallel_state.get_data_parallel_world_size(), drop_last=self.cfg.get('drop_last', True), + data_sharding=False, ) else: raise ValueError('cfg.data.dataloader_type must be "single" or "cyclic"') @@ -1021,12 +1019,7 @@ def dummy(): if length_params is None: length_params = get_default_length_params() - import time - - start = time.time() # Supports only one prompt at a time result = megatron_neva_generate(self.cuda(), input_prompts, length_params, sampling_params, inference_config) - end = time.time() - print(f'Time taken {end - start}') return result diff --git a/nemo/collections/multimodal/models/text_to_image/dreambooth/dreambooth.py b/nemo/collections/multimodal/models/text_to_image/dreambooth/dreambooth.py index 492347f08524..ce82da9bd171 100644 --- 
a/nemo/collections/multimodal/models/text_to_image/dreambooth/dreambooth.py +++ b/nemo/collections/multimodal/models/text_to_image/dreambooth/dreambooth.py @@ -20,15 +20,19 @@ from torch._inductor import config as inductor_config from nemo.collections.multimodal.data.dreambooth.dreambooth_dataset import DreamBoothDataset +from nemo.collections.multimodal.modules.stable_diffusion.attention import LinearWrapper from nemo.collections.multimodal.modules.stable_diffusion.distributions.distributions import ( DiagonalGaussianDistribution, ) +from nemo.collections.multimodal.modules.stable_diffusion.encoders.modules import LoraWrapper from nemo.collections.multimodal.parts.utils import randn_like from nemo.collections.nlp.data.language_modeling.megatron.data_samplers import MegatronPretrainingRandomSampler from nemo.collections.nlp.models.language_modeling.megatron_base_model import MegatronBaseModel from nemo.collections.nlp.modules.common.megatron.module import Float16Module +from nemo.collections.nlp.parts.mixins.nlp_adapter_mixins import NLPAdapterModelMixin from nemo.collections.nlp.parts.utils_funcs import get_last_rank from nemo.core.classes.common import Serialization +from nemo.core.classes.mixins.adapter_mixins import AdapterModuleMixin from nemo.utils import logging try: @@ -124,8 +128,11 @@ def instantiate_text_encoder(self, cfg): model = DreamBooth.from_config_dict(cfg) if self.train_text_encoder: self.text_encoder = model.train() - for param in self.text_encoder.parameters(): - param.requires_grad = True + if (not hasattr(model, 'lora_layers')) or len( + model.lora_layers + ) == 0: # if no lora, train all the parameters + for param in self.text_encoder.parameters(): + param.requires_grad = True else: self.text_encoder = model.eval() self.text_encoder.train = disabled_train @@ -187,7 +194,7 @@ def set_input_tensor(self, input_tensor): pass -class MegatronDreamBooth(MegatronBaseModel): +class MegatronDreamBooth(NLPAdapterModelMixin, MegatronBaseModel): def 
__init__(self, cfg: DictConfig, trainer: Trainer): if not HAVE_APEX: raise ImportError( @@ -447,6 +454,7 @@ def setup(self, stage=None): self._micro_batch_size = self.cfg.micro_batch_size self.setup_training_data(self.cfg.data) + self.setup_complete = True def setup_training_data(self, cfg): if self.cfg.with_prior_preservation: @@ -637,3 +645,19 @@ def load_from_checkpoint( finally: cls._set_model_restore_state(is_being_restored=False) return checkpoint + + def _check_and_add_adapter(self, name, module, peft_name, peft_cfg, name_key_to_mcore_mixins=None): + if isinstance(module, AdapterModuleMixin): + if isinstance(module, LinearWrapper): + peft_cfg.in_features, peft_cfg.out_features = module.in_features, module.out_features + elif isinstance(module, LoraWrapper): + peft_cfg.in_features, peft_cfg.out_features = module.in_features, module.out_features + else: + return + if model_utils.import_class_by_path(peft_cfg._target_) in module.get_accepted_adapter_types(): + module.add_adapter( + name=peft_name, + cfg=peft_cfg, + base_model_cfg=self.cfg, + model_parallel_config=self.model_parallel_config, + ) diff --git a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/autoencoder.py b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/autoencoder.py index 1dd695af86f5..3a929ddd6829 100644 --- a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/autoencoder.py +++ b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/autoencoder.py @@ -318,7 +318,6 @@ def __init__( colorize_nlabels=None, monitor=None, from_pretrained: str = None, - capture_cudagraph_iters=-1, ): super().__init__() self.image_key = image_key @@ -339,17 +338,14 @@ def __init__( if from_pretrained is not None: state_dict = torch.load(from_pretrained) - self._load_pretrained_model(state_dict) - - # CUDA graph captured sub-modules - self.capture_cudagraph_iters = capture_cudagraph_iters - self.stream = torch.cuda.Stream() - 
self.encoder_iterations = self.decoder_iterations = 0 - self.encoder_graph = torch.cuda.CUDAGraph() # eval - self.decoder_graph = torch.cuda.CUDAGraph() # eval - self.graphed_encoder = self.graphed_decoder = None # train - self.static_x = self.static_moments = None - self.static_z = self.static_dec = None + missed_key, unexpected_key, missmatched_key, err_msg = self._load_pretrained_model(state_dict) + + if len(missed_key) > 0: + print( + f'{self.__class__.__name__}: Following keys are missing during loading unet weights, which may lead to compromised image quality for a resumed training. Please check the checkpoint you provided.' + ) + print("missed key: ", missed_key) + print("unexpected key: ", unexpected_key) def _state_key_mapping(self, state_dict: dict): import re diff --git a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py index 89063f2490cc..31b56443846f 100644 --- a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py +++ b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py @@ -12,8 +12,10 @@ # See the License for the specific language governing permissions and # limitations under the License. 
import itertools +import os +import time from functools import partial -from typing import Any, Optional +from typing import Any, Dict, List, Optional, Union import numpy as np import pytorch_lightning as pl @@ -31,8 +33,10 @@ from torchvision.utils import make_grid from tqdm import tqdm +from nemo.collections.multimodal.data.common.utils import get_collate_fn from nemo.collections.multimodal.data.stable_diffusion.stable_diffusion_dataset import ( build_train_valid_datasets, + build_train_valid_precached_clip_datasets, build_train_valid_precached_datasets, ) from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder import ( @@ -41,6 +45,7 @@ VQModelInterface, ) from nemo.collections.multimodal.models.text_to_image.stable_diffusion.samplers.ddim import DDIMSampler +from nemo.collections.multimodal.modules.stable_diffusion.attention import LinearWrapper from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.util import ( extract_into_tensor, make_beta_schedule, @@ -50,6 +55,7 @@ DiagonalGaussianDistribution, normal_kl, ) +from nemo.collections.multimodal.modules.stable_diffusion.encoders.modules import LoraWrapper from nemo.collections.multimodal.parts.stable_diffusion.utils import ( count_params, default, @@ -62,9 +68,12 @@ from nemo.collections.multimodal.parts.utils import randn_like from nemo.collections.nlp.models.language_modeling.megatron_base_model import MegatronBaseModel from nemo.collections.nlp.modules.common.megatron.module import Float16Module +from nemo.collections.nlp.parts.mixins.nlp_adapter_mixins import NLPAdapterModelMixin +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP, PEFTConfig from nemo.collections.nlp.parts.utils_funcs import get_last_rank from nemo.core.classes.common import Serialization -from nemo.utils import logging +from nemo.core.classes.mixins.adapter_mixins import AdapterModuleMixin +from nemo.utils import logging, model_utils try: from apex import amp @@ -127,13 
+136,7 @@ def __init__(self, cfg): self.channels = cfg.channels self.channels_last = cfg.get("channels_last", False) self.use_positional_encodings = cfg.use_positional_encodings - self.model = DiffusionWrapper( - cfg.unet_config, - cfg.conditioning_key, - cfg.inductor, - cfg.inductor_cudagraphs, - cfg.get("capture_cudagraph_iters", -1), - ) + self.model = DiffusionWrapper(cfg.unet_config, cfg.conditioning_key, cfg.inductor, cfg.inductor_cudagraphs) self.model_type = None count_params(self.model, verbose=True) @@ -157,7 +160,13 @@ def __init__(self, cfg): if self.learn_logvar: self.logvar = nn.Parameter(self.logvar, requires_grad=True) - self.rng = torch.Generator(device=torch.cuda.current_device(),) + cuda_graph_enabled = cfg.get("capture_cudagraph_iters", -1) >= 0 + if not cuda_graph_enabled: + logging.info("Use custom random generator") + self.rng = torch.Generator(device=torch.cuda.current_device(),) + else: + logging.info("Use system random generator since CUDA graph enabled") + self.rng = None def register_schedule( self, @@ -229,7 +238,9 @@ def register_schedule( self.register_buffer('lvlb_weights', lvlb_weights, persistent=False) assert not torch.isnan(self.lvlb_weights).all() - def init_from_ckpt(self, path, ignore_keys=list(), only_model=False): + def init_from_ckpt( + self, path, ignore_keys=list(), only_model=False, load_vae=True, load_unet=True, load_encoder=True, + ): pl_sd = torch.load(path, map_location="cpu") if "state_dict" in list(pl_sd.keys()): pl_sd = pl_sd["state_dict"] @@ -246,12 +257,45 @@ def init_from_ckpt(self, path, ignore_keys=list(), only_model=False): new_k = new_k.lstrip("model.") sd[new_k] = v + logging.info(f"Loading {path}") + logging.info(f"It has {len(sd)} entries") + logging.info(f"Existing model has {len(self.state_dict())} entries") + keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): - logging.info("Deleting key {} from state_dict.".format(k)) + logging.info("Deleting ignored key {} from 
state_dict.".format(k)) + del sd[k] + + if not load_vae: + deleted = 0 + keys = list(sd.keys()) + for k in keys: + if k.startswith("first_stage_model"): + deleted += 1 + del sd[k] + logging.info(f"Deleted {deleted} keys from `first_stage_model` state_dict.") + + if not load_encoder: + deleted = 0 + keys = list(sd.keys()) + for k in keys: + if k.startswith("cond_stage_model"): + deleted += 1 + logging.info("Deleting ignored key {} from state_dict.".format(k)) + del sd[k] + logging.info(f"Deleted {deleted} keys from `cond_stage_model` state_dict.") + + if not load_unet: + deleted = 0 + keys = list(sd.keys()) + for k in keys: + if k.startswith("model.diffusion_model"): + deleted += 1 del sd[k] + logging.info(f"Deleted {deleted} keys from `model.diffusion_model` state_dict.") + missing, unexpected = ( self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(sd, strict=False) ) @@ -513,7 +557,13 @@ def __init__(self, cfg, model_parallel_config): self.restarted_from_ckpt = False if ckpt_path is not None: - self.init_from_ckpt(ckpt_path, ignore_keys) + load_vae = True if cfg.load_vae is None else cfg.load_vae + load_unet = True if cfg.load_unet is None else cfg.load_unet + load_encoder = True if cfg.load_encoder is None else cfg.load_encoder + + self.init_from_ckpt( + ckpt_path, ignore_keys, load_vae=load_vae, load_unet=load_unet, load_encoder=load_encoder, + ) self.restarted_from_ckpt = True if self.channels_last: @@ -725,7 +775,13 @@ def get_input( if self.first_stage_key.endswith('encoded'): gaussian_parameters = batch[self.first_stage_key] encoder_posterior = DiagonalGaussianDistribution(gaussian_parameters) + elif self.first_stage_key.endswith('moments'): + # Loading distribution from disk and sampling encoded + distribution = batch[self.first_stage_key] # torch.Size([3, 1, 8, 64, 64]) + distribution = torch.squeeze(distribution, dim=1) + encoder_posterior = DiagonalGaussianDistribution(distribution) else: + # Loading images from disk and 
encoding them x = super().get_input(batch, k) if bs is not None: x = x[:bs] @@ -1597,7 +1653,7 @@ def set_input_tensor(self, input_tensor): pass -class MegatronLatentDiffusion(MegatronBaseModel): +class MegatronLatentDiffusion(NLPAdapterModelMixin, MegatronBaseModel): """Megatron LatentDiffusion Model.""" def __init__(self, cfg: DictConfig, trainer: Trainer): @@ -1632,6 +1688,9 @@ def __init__(self, cfg: DictConfig, trainer: Trainer): else: raise ValueError('precision must be in ["32-true", "16-mixed", "bf16-mixed"]') + self.log_train_loss = bool(int(os.getenv("NEMO_LOG_TRAIN_LOSS", 1))) + self.loss_broadcast_src_rank = None + def get_module_list(self): if isinstance(self.model, list): return [model.module if isinstance(model, Float16Module) else model for model in self.model] @@ -1703,6 +1762,22 @@ def fwd_bwd_step(self, dataloader_iter, batch_idx, forward_only): else: loss_mean = torch.tensor(0.0, device=torch.cuda.current_device()) + if self.log_train_loss: + # When using pipeline parallelism, loss is calculated only in the last pipeline stage and + # it should be broadcast to other pipeline stages for logging. 
+ # we can avoid this broadcast by updating the PTL log function to accept specific ranks + if parallel_state.get_pipeline_model_parallel_world_size() > 1: + if self.loss_broadcast_src_rank is None: + dp_size = parallel_state.get_data_parallel_world_size() + tp_size = parallel_state.get_tensor_model_parallel_world_size() + pp_size = parallel_state.get_pipeline_model_parallel_world_size() + rank_in_dp_tp_group = torch.distributed.get_rank() % (dp_size * tp_size) + last_pipeline_stage_offset = (tp_size * dp_size) * (pp_size - 1) + self.loss_broadcast_src_rank = last_pipeline_stage_offset + rank_in_dp_tp_group + torch.distributed.broadcast( + loss_mean, self.loss_broadcast_src_rank, group=parallel_state.get_pipeline_model_parallel_group(), + ) + return loss_mean, loss_dict def training_step(self, dataloader_iter, batch_idx): @@ -1720,8 +1795,6 @@ def training_step(self, dataloader_iter, batch_idx): loss_mean, loss_dict = self.fwd_bwd_step(dataloader_iter, batch_idx, False) - torch.distributed.broadcast(loss_mean, get_last_rank()) - # when using sequence parallelism, the sequence parallel layernorm grads must be all-reduced if self.cfg.get('tensor_model_parallel_size', 1) > 1 and self.cfg.get('sequence_parallel', False): self.allreduce_sequence_parallel_gradients() @@ -1740,13 +1813,30 @@ def training_step(self, dataloader_iter, batch_idx): # so we all-reduce gradients after the pipeline self.allreduce_gradients() # @sangkug we think this is causing memory to blow up (hurts perf) + # for cuda graph with pytorch lightning + # these values will be used outside the capturing range + if not hasattr(self, "loss_mean"): + self.loss_mean = torch.empty_like(loss_mean) + with torch.no_grad(): + self.loss_mean.copy_(loss_mean) + self.loss_dict = loss_dict + # this function is invoked by callback if with cuda graph, otherwise + # invoke it by ourselves + if self.cfg.get("capture_cudagraph_iters", -1) < 0: + self.non_cuda_graph_capturable() + + return loss_mean + + def 
non_cuda_graph_capturable(self): + if self.log_train_loss: + self.log('reduced_train_loss', self.loss_mean, prog_bar=True, rank_zero_only=True, batch_size=1) + if self.cfg.precision in [16, '16', '16-mixed']: loss_scale = self.trainer.precision_plugin.scaler._scale if loss_scale is not None: self.log('loss_scale', loss_scale, batch_size=1) - self.log_dict(loss_dict, prog_bar=False, logger=True, on_step=True, rank_zero_only=True, batch_size=1) - self.log('reduced_train_loss', loss_mean, prog_bar=True, rank_zero_only=True, batch_size=1) + self.log_dict(self.loss_dict, prog_bar=False, logger=True, on_step=True, rank_zero_only=True, batch_size=1) lr = self._optimizer.param_groups[0]['lr'] self.log('lr', lr, prog_bar=True, rank_zero_only=True, batch_size=1) self.log('global_step', self.trainer.global_step + 1, prog_bar=True, rank_zero_only=True, batch_size=1) @@ -1757,7 +1847,9 @@ def training_step(self, dataloader_iter, batch_idx): rank_zero_only=True, batch_size=1, ) - return loss_mean + + ts = torch.tensor(int(time.time() * 1e3), dtype=torch.float64) + self.log("timestamp", ts, batch_size=1, rank_zero_only=True) def backward(self, *args, **kwargs): """ LightningModule hook to do backward. @@ -1850,7 +1942,8 @@ def setup(self, stage=None): Args: stage (str, optional): Can be 'fit', 'validate', 'test' or 'predict'. Defaults to None. 
""" - self.model.rng.manual_seed(self.cfg.seed + 100 * parallel_state.get_data_parallel_rank()) + if self.model.rng: + self.model.rng.manual_seed(self.cfg.seed + 100 * parallel_state.get_data_parallel_rank()) # log number of parameters if isinstance(self.model, list): @@ -1890,16 +1983,22 @@ def setup(self, stage=None): self.setup_training_data(self.cfg.data) self.setup_validation_data(self.cfg.data) self.setup_test_data(self.cfg.data) + self.setup_complete = True def build_train_valid_test_datasets(self): - logging.info('Building datasets for Stable Diffusion...') + logging.info("Building datasets for Stable Diffusion...") if self.trainer.limit_val_batches > 1.0 and isinstance(self.trainer.limit_val_batches, float): raise ValueError("limit_val_batches must be an integer or float less than or equal to 1.0.") - if self.cfg.first_stage_key.endswith("encoded"): - self._train_ds, self._validation_ds = build_train_valid_precached_datasets( - model_cfg=self.cfg, consumed_samples=self.compute_consumed_samples(0), - ) + if self.cfg.first_stage_key.endswith("encoded") or self.cfg.first_stage_key.endswith("moments"): + if self.cfg.cond_stage_key.endswith("precached_clip"): + self._train_ds, self._validation_ds = build_train_valid_precached_clip_datasets( + model_cfg=self.cfg, consumed_samples=self.compute_consumed_samples(0), + ) + else: + self._train_ds, self._validation_ds = build_train_valid_precached_datasets( + model_cfg=self.cfg, consumed_samples=self.compute_consumed_samples(0), + ) else: self._train_ds, self._validation_ds = build_train_valid_datasets( model_cfg=self.cfg, consumed_samples=self.compute_consumed_samples(0) @@ -1921,6 +2020,13 @@ def setup_training_data(self, cfg): logging.info( f'Setting up train dataloader with len(len(self._train_ds)): {len(self._train_ds)} and consumed samples: {consumed_samples}' ) + if self.cfg.cond_stage_key.endswith("precached_clip"): + collate_fn = get_collate_fn( + first_stage_key=self.cfg.first_stage_key, 
cond_stage_key=self.cfg.cond_stage_key, + ) + else: + collate_fn = None + self._train_dl = torch.utils.data.DataLoader( self._train_ds, batch_size=self._micro_batch_size, @@ -1928,6 +2034,7 @@ def setup_training_data(self, cfg): pin_memory=True, drop_last=True, persistent_workers=True, + collate_fn=collate_fn, ) def setup_validation_data(self, cfg): @@ -2106,15 +2213,81 @@ def load_from_checkpoint( cls._set_model_restore_state(is_being_restored=False) return checkpoint + def _check_and_add_adapter(self, name, module, peft_name, peft_cfg, name_key_to_mcore_mixins=None): + if isinstance(module, AdapterModuleMixin): + if isinstance(module, LinearWrapper): + peft_cfg.in_features, peft_cfg.out_features = module.in_features, module.out_features + elif isinstance(module, LoraWrapper): + peft_cfg.in_features, peft_cfg.out_features = module.in_features, module.out_features + else: + return + if model_utils.import_class_by_path(peft_cfg._target_) in module.get_accepted_adapter_types(): + module.add_adapter( + name=peft_name, + cfg=peft_cfg, + base_model_cfg=self.cfg, + model_parallel_config=self.model_parallel_config, + ) + + def load_adapters( + self, filepath: str, peft_cfgs: Optional[Union[PEFTConfig, List[PEFTConfig]]] = None, map_location: str = None, + ): + """ + Utility method that restores only the adapter module(s), and not the entire model itself. + This allows the sharing of adapters which are often just a fraction of the size of the full model, + enabling easier delivery. + + .. note:: + + During restoration, assumes that the model does not currently already have one or more adapter modules. + + Args: + filepath: Filepath of the .ckpt or .nemo file. + peft_cfgs: One or more PEFTConfig objects that specify the PEFT method configuration. + If None, the configuration is inferred from the .nemo checkpoint. + map_location: PyTorch flag, where to place the adapter(s) state dict(s). 
+ """ + + def _modify_state_dict(state_dict): + # Modify state key for Dreambooth inference + new_state_dict = {} + for key in state_dict.keys(): + new_key = key.replace('unet', 'model.diffusion_model') + new_key = new_key.replace('vae', 'first_stage_model') + new_key = new_key.replace('text_encoder', 'cond_stage_model') + new_key = new_key.replace('.noise_scheduler', '') + new_key = new_key.replace('._orig_mod', '') + new_state_dict[new_key] = state_dict[key] + state_dict = new_state_dict + return state_dict + + # Determine device + if map_location is None: + if torch.cuda.is_available(): + map_location = 'cuda' + else: + map_location = 'cpu' + + if filepath.endswith('.nemo'): + conf, state_dict = self._get_config_and_state_dict_from_nemo(filepath, map_location) + elif filepath.endswith('.ckpt'): + state_dict = torch.load(filepath, map_location)['state_dict'] + else: + raise RuntimeError(f"{filepath} is not nemo file or ckpt file") + if not peft_cfgs: + assert filepath.endswith( + '.nemo' + ), "Inferring peft scheme is only supported for .nemo checkpoints. Please supply the `peft_cfgs` argument." 
+ peft_cfgs = [PEFT_CONFIG_MAP[conf.peft.peft_scheme](conf)] + self.add_adapter(peft_cfgs) + state_dict = _modify_state_dict(state_dict) + assert set(state_dict.keys()) == self.adapter_keys + super().load_state_dict(state_dict, strict=False) + class DiffusionWrapper(pl.LightningModule, Serialization): def __init__( - self, - diff_model_config, - conditioning_key, - inductor: bool = False, - inductor_cudagraphs: bool = False, - capture_cudagraph_iters: int = -1, + self, diff_model_config, conditioning_key, inductor: bool = False, inductor_cudagraphs: bool = False, ): super().__init__() self.diffusion_model = DiffusionWrapper.from_config_dict(diff_model_config) @@ -2128,10 +2301,6 @@ def __init__( torch._dynamo.config.automatic_dynamic_shapes = False inductor_config.triton.cudagraphs = inductor_cudagraphs self.diffusion_model = torch.compile(self.diffusion_model) - # CUDA graph - self.capture_cudagraph_iters = capture_cudagraph_iters - self.iterations = 0 - self.graphed_diffusion_model = None def forward(self, x, t, c_concat: list = None, c_crossattn: list = None): if self.conditioning_key is None: @@ -2141,15 +2310,7 @@ def forward(self, x, t, c_concat: list = None, c_crossattn: list = None): out = self.diffusion_model(xc, t) elif self.conditioning_key == 'crossattn': cc = torch.cat(c_crossattn, 1) - if self.iterations == self.capture_cudagraph_iters: - logging.info("Capturing CUDA graph for module: %s", self.diffusion_model.__class__.__name__) - self.graphed_diffusion_model = torch.cuda.make_graphed_callables(self.diffusion_model, (x, t, cc)) - - if 0 <= self.capture_cudagraph_iters <= self.iterations: - out = self.graphed_diffusion_model(x, t, cc) - else: - out = self.diffusion_model(x, t, context=cc) - self.iterations += 1 + out = self.diffusion_model(x, t, context=cc) elif self.conditioning_key == 'hybrid': xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) diff --git 
a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/samplers/base_sampler.py b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/samplers/base_sampler.py index b890d863428b..e1f2457f34de 100644 --- a/nemo/collections/multimodal/models/text_to_image/stable_diffusion/samplers/base_sampler.py +++ b/nemo/collections/multimodal/models/text_to_image/stable_diffusion/samplers/base_sampler.py @@ -117,6 +117,8 @@ def sample( # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... **kwargs, ): + self.verbose = verbose + if conditioning is not None: if isinstance(conditioning, dict): ctmp = conditioning[list(conditioning.keys())[0]] @@ -132,7 +134,8 @@ def sample( # sampling C, H, W = shape size = (batch_size, C, H, W) - print(f"Data shape for sampling is {size}, eta {eta}") + if self.verbose: + print(f"Data shape for sampling is {size}, eta {eta}") if self.sampler is Sampler.DPM: return self.dpm_sampling_fn( @@ -223,8 +226,13 @@ def sampling_fn( else: time_range = reversed(range(0, timesteps)) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] - print(f"Running {self.sampler.name} Sampling with {total_steps} timesteps") - iterator = tqdm(time_range, desc=f"{self.sampler.name} Sampler", total=total_steps) + + if self.verbose: + print(f"Running {self.sampler.name} Sampling with {total_steps} timesteps") + iterator = tqdm(time_range, desc=f"{self.sampler.name} Sampler", total=total_steps) + else: + iterator = time_range + old_eps = [] for i, step in enumerate(iterator): index = total_steps - i - 1 diff --git a/nemo/collections/multimodal/modules/stable_diffusion/attention.py b/nemo/collections/multimodal/modules/stable_diffusion/attention.py index b2f211141065..e70a473d658b 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/attention.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/attention.py @@ -22,6 +22,11 @@ 
from torch._dynamo import disable from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.util import checkpoint +from nemo.collections.nlp.modules.common.megatron.adapters.parallel_adapters import ( + AdapterName, + ParallelLinearAdapterConfig, +) +from nemo.core import adapter_mixins def check_cuda(): @@ -82,7 +87,7 @@ def init_(tensor): class GEGLU(nn.Module): def __init__(self, dim_in, dim_out): super().__init__() - self.proj = nn.Linear(dim_in, dim_out * 2) + self.proj = LinearWrapper(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) @@ -94,9 +99,9 @@ def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) - project_in = nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU()) if not glu else GEGLU(dim, inner_dim) + project_in = nn.Sequential(LinearWrapper(dim, inner_dim), nn.GELU()) if not glu else GEGLU(dim, inner_dim) - self.net = nn.Sequential(project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)) + self.net = nn.Sequential(project_in, nn.Dropout(dropout), LinearWrapper(inner_dim, dim_out)) def forward(self, x): return self.net(x) @@ -184,10 +189,45 @@ def rearrange_heads_inner(t: torch.Tensor, h: int) -> torch.Tensor: return t.view(b, h, n, -1).transpose(1, 2).reshape(b, n, -1) +class LinearWrapper(nn.Linear, adapter_mixins.AdapterModuleMixin): + def __init__(self, in_features, out_features, bias=True, lora_network_alpha=None): + super().__init__(in_features, out_features, bias) + self.set_accepted_adapter_types([ParallelLinearAdapterConfig._target_]) + self.lora_network_alpha = lora_network_alpha + + def forward(self, x): + mixed_x = super().forward(x) + if self.is_adapter_available(): + lora_linear_adapter = self.get_adapter_module(AdapterName.PARALLEL_LINEAR_ADAPTER) + lora_mixed_x = lora_linear_adapter(x) + # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script. 
+ # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning + if self.lora_network_alpha: + mixed_x = mixed_x + lora_mixed_x * (self.lora_network_alpha / lora_linear_adapter.dim) + else: + mixed_x = mixed_x + lora_mixed_x + return mixed_x + + def add_adapter(self, name, cfg, **kwargs): + self.lora_network_alpha = cfg.network_alpha + kwargs = {} + adapter_mixins.AdapterModuleMixin.add_adapter(self, name, cfg, **kwargs) + + class CrossAttention(nn.Module): - def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0, use_flash_attention=False): + def __init__( + self, + query_dim, + context_dim=None, + heads=8, + dim_head=64, + dropout=0.0, + use_flash_attention=False, + lora_network_alpha=None, + ): super().__init__() - inner_dim = dim_head * heads + + self.inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) # make attention part be aware of self-attention/cross-attention self.context_dim = context_dim @@ -197,11 +237,13 @@ def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0. 
self.scale = dim_head ** -0.5 self.heads = heads - self.to_q = nn.Linear(query_dim, inner_dim, bias=False) - self.to_k = nn.Linear(context_dim, inner_dim, bias=False) - self.to_v = nn.Linear(context_dim, inner_dim, bias=False) + self.to_q = LinearWrapper(query_dim, self.inner_dim, bias=False, lora_network_alpha=lora_network_alpha) + self.to_k = LinearWrapper(context_dim, self.inner_dim, bias=False, lora_network_alpha=lora_network_alpha) + self.to_v = LinearWrapper(context_dim, self.inner_dim, bias=False, lora_network_alpha=lora_network_alpha) - self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)) + self.to_out = nn.Sequential( + LinearWrapper(self.inner_dim, query_dim, lora_network_alpha=lora_network_alpha), nn.Dropout(dropout) + ) self.use_flash_attention = use_flash_attention if dim_head <= 160 and (dim_head % 8) == 0 and flash_attn_installed: @@ -292,6 +334,7 @@ def __init__( use_checkpoint=False, use_flash_attention=False, disable_self_attn=False, + lora_network_alpha=None, ): super().__init__() self.disable_self_attn = disable_self_attn @@ -302,6 +345,7 @@ def __init__( dropout=dropout, use_flash_attention=use_flash_attention, context_dim=context_dim if self.disable_self_attn else None, + lora_network_alpha=lora_network_alpha, ) # is a self-attention self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff) self.attn2 = CrossAttention( @@ -311,6 +355,7 @@ def __init__( dim_head=d_head, dropout=dropout, use_flash_attention=use_flash_attention, + lora_network_alpha=lora_network_alpha, ) # is self-attn if context is none self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) @@ -351,6 +396,7 @@ def __init__( use_linear=False, use_checkpoint=False, use_flash_attention=False, + lora_network_alpha=None, ): super().__init__() if exists(context_dim) and not isinstance(context_dim, list): @@ -375,6 +421,7 @@ def __init__( use_checkpoint=use_checkpoint, use_flash_attention=use_flash_attention, disable_self_attn=disable_self_attn, + 
lora_network_alpha=lora_network_alpha, ) for d in range(depth) ] diff --git a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py index 1cf1798015eb..62842da602dc 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py @@ -470,6 +470,7 @@ def __init__( # It must be specified when from pretrained is not None. It indicates loading unet from NeMo trained ckpt or HF use_flash_attention: bool = False, enable_amp_o2_fp16: bool = False, + lora_network_alpha=None, ): super().__init__() if use_spatial_transformer: @@ -567,6 +568,7 @@ def __init__( use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint, use_flash_attention=use_flash_attention, + lora_network_alpha=lora_network_alpha, ) ) self.input_blocks.append(TimestepEmbedSequential(*layers)) @@ -631,6 +633,7 @@ def __init__( use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint, use_flash_attention=use_flash_attention, + lora_network_alpha=lora_network_alpha, ), ResBlock( ch, @@ -687,6 +690,7 @@ def __init__( use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint, use_flash_attention=use_flash_attention, + lora_network_alpha=lora_network_alpha, ) ) if level and i == num_res_blocks: @@ -722,7 +726,13 @@ def __init__( ) if from_pretrained is not None: - state_dict = torch.load(from_pretrained, map_location='cpu') + if from_pretrained.endswith('safetensors'): + from safetensors.torch import load_file as load_safetensors + + state_dict = load_safetensors(from_pretrained) + else: + state_dict = torch.load(from_pretrained, map_location='cpu') + if 'state_dict' in state_dict.keys(): state_dict = state_dict['state_dict'] missing_key, unexpected_keys, _, _ = self._load_pretrained_model(state_dict, from_NeMo=from_NeMo) @@ -854,10 +864,10 @@ 
def _state_key_mapping(self, state_dict: dict): return res_dict def _load_pretrained_model(self, state_dict, ignore_mismatched_sizes=False, from_NeMo=False): - if from_NeMo: - state_dict = self._strip_unet_key_prefix(state_dict) - else: + state_dict = self._strip_unet_key_prefix(state_dict) + if not from_NeMo: state_dict = self._state_key_mapping(state_dict) + model_state_dict = self.state_dict() loaded_keys = [k for k in state_dict.keys()] expected_keys = list(model_state_dict.keys()) @@ -912,14 +922,16 @@ def _strip_unet_key_prefix(self, state_dict): for key_, value_ in state_dict.items(): if key_.startswith('model.diffusion_model'): re_state_dict[key_.replace('model.diffusion_model.', '')] = value_ - if key_.startswith('model.model.diffusion_model'): + elif key_.startswith('model.model.diffusion_model'): re_state_dict[key_.replace('model.model.diffusion_model.', '')] = value_ - if key_.startswith('model._orig_mod.diffusion_model.'): + elif key_.startswith('model._orig_mod.diffusion_model.'): re_state_dict[key_.replace('model._orig_mod.diffusion_model.', '')] = value_ - if key_.startswith('model.model._orig_mod.diffusion_model.'): + elif key_.startswith('model.model._orig_mod.diffusion_model.'): re_state_dict[key_.replace('model.model._orig_mod.diffusion_model.', '')] = value_ - if key_.startswith('model.model.diffusion_model._orig_mod.'): + elif key_.startswith('model.model.diffusion_model._orig_mod.'): re_state_dict[key_.replace('model.model.diffusion_model._orig_mod.', '')] = value_ + else: + re_state_dict[key_] = value_ return re_state_dict def _load_state_dict_into_model(self, state_dict): diff --git a/nemo/collections/multimodal/modules/stable_diffusion/encoders/modules.py b/nemo/collections/multimodal/modules/stable_diffusion/encoders/modules.py index ec3bd82ba137..ca61f88fd901 100644 --- a/nemo/collections/multimodal/modules/stable_diffusion/encoders/modules.py +++ b/nemo/collections/multimodal/modules/stable_diffusion/encoders/modules.py @@ -28,8 +28,13 
@@ TransformerWrapper, # TODO: can we directly rely on lucidrains code and simply add this as a reuirement? --> test ) from nemo.collections.multimodal.modules.stable_diffusion.encoders.x_transformer import Encoder +from nemo.collections.nlp.modules.common.megatron.adapters.parallel_adapters import ( + AdapterName, + ParallelLinearAdapterConfig, +) from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector +from nemo.core import adapter_mixins from nemo.utils import logging try: @@ -45,12 +50,37 @@ class AbstractEncoder(nn.Module): - def __init__(self): + def __init__(self, enable_lora_finetune=False, target_block=[], target_module=[]): super().__init__() + self.TARGET_BLOCK = target_block + self.TARGET_MODULE = target_module + if enable_lora_finetune: + self.lora_layers = [] def encode(self, *args, **kwargs): raise NotImplementedError + def _enable_lora(self, lora_model): + for module_name, module in lora_model.named_modules(): + if module.__class__.__name__ in self.TARGET_BLOCK: + tmp = {} + for sub_name, sub_module in module.named_modules(): + if sub_module.__class__.__name__ in self.TARGET_MODULE: + if hasattr(sub_module, "input_size") and hasattr( + sub_module, "output_size" + ): # for megatron ParallelLinear + lora = LoraWrapper(sub_module, sub_module.input_size, sub_module.output_size) + else: # for nn.Linear + lora = LoraWrapper(sub_module, sub_module.in_features, sub_module.out_features) + self.lora_layers.append(lora) + if sub_name not in tmp.keys(): + tmp.update({sub_name: lora}) + else: + print(f"Duplicate subnames are found in module {module_name}") + for sub_name, lora_layer in tmp.items(): + lora_name = f'{sub_name}_lora' + module.add_module(lora_name, lora_layer) + class ClassEmbedder(nn.Module): def __init__(self, embed_dim, n_classes=1000, key='class'): @@ -185,26 +215,60 @@ def encode(self, x): return self(x) +class LoraWrapper(nn.Module, 
adapter_mixins.AdapterModuleMixin): + def __init__(self, target_module, in_features, out_features, lora_network_alpha=None): + super().__init__() + self.target_module = target_module + self.set_accepted_adapter_types([ParallelLinearAdapterConfig._target_]) + self.lora_network_alpha = lora_network_alpha + self.in_features = in_features + self.out_features = out_features + + def forward(self, x): + org_results = self.target_forward(x) + if self.is_adapter_available(): + lora_linear_adapter = self.get_adapter_module(AdapterName.PARALLEL_LINEAR_ADAPTER) + lora_mixed_x = lora_linear_adapter(x) + # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script. + # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning + mixed_x = org_results[0] if isinstance(org_results, tuple) else org_results + + if self.lora_network_alpha: + mixed_x = mixed_x + lora_mixed_x * (self.lora_network_alpha / lora_linear_adapter.dim) + else: + mixed_x = mixed_x + lora_mixed_x + + if isinstance(org_results, tuple): + org_results = (mixed_x, *org_results[1:]) + else: + org_results = mixed_x + + return org_results + + def add_adapter(self, name, cfg, **kwargs): + self.lora_network_alpha = cfg.network_alpha + kwargs = {} + adapter_mixins.AdapterModuleMixin.add_adapter(self, name, cfg, **kwargs) + self.target_forward = self.target_module.forward + self.target_module.forward = self.forward + del self.target_module + + class FrozenCLIPEmbedder(AbstractEncoder): """Uses the CLIP transformer encoder for text (from Hugging Face)""" def __init__( - self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77, capture_cudagraph_iters: int = -1 + self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77, enable_lora_finetune=False ): - super().__init__() + super().__init__(enable_lora_finetune, target_block=["CLIPAttention", "CLIPMLP"], target_module=["Linear"]) self.tokenizer = 
CLIPTokenizer.from_pretrained(version) self.transformer = CLIPTextModel.from_pretrained(version) self.device = device self.max_length = max_length self.freeze() - - # CUDA graph captured sub-modules - self.capture_cudagraph_iters = capture_cudagraph_iters - self.iterations = 0 - self.stream = torch.cuda.Stream() - self.transformer_graph = torch.cuda.CUDAGraph() - self.static_tokens = None - self.static_outputs = None + if enable_lora_finetune: + self._enable_lora(self.transformer) + print(f"CLIP transformer encoder add {len(self.lora_layers)} lora layers.") def freeze(self): self.transformer = self.transformer.eval() @@ -221,35 +285,12 @@ def forward(self, text): padding="max_length", return_tensors="pt", ) - if self.capture_cudagraph_iters < 0: - tokens = batch_encoding["input_ids"].to(self.device, non_blocking=True) - outputs = self.transformer(input_ids=tokens) - z = outputs.last_hidden_state + tokens = batch_encoding["input_ids"].to(self.device, non_blocking=True) + outputs = self.transformer(input_ids=tokens) - else: - if self.static_tokens is None: - self.static_tokens = batch_encoding["input_ids"].to(device=self.device, non_blocking=True) - self.static_tokens.copy_(batch_encoding["input_ids"], non_blocking=True) - - if self.iterations == self.capture_cudagraph_iters: - # cuda graph capture - logging.info("Capturing CUDA graph for module: %s", self.transformer.__class__.__name__) - with torch.cuda.graph(self.transformer_graph): - self.static_outputs = self.transformer(input_ids=self.static_tokens) - - if 0 <= self.capture_cudagraph_iters <= self.iterations: - # cuda graph replay - self.transformer_graph.replay() - else: - # warmup - self.stream.wait_stream(torch.cuda.current_stream()) - with torch.cuda.stream(self.stream): - self.static_outputs = self.transformer(input_ids=self.static_tokens) - torch.cuda.current_stream().wait_stream(self.stream) - self.iterations += 1 - z = self.static_outputs.last_hidden_state + z = outputs.last_hidden_state - # # Pad the 
seq length to multiple of 8 + # Pad the seq length to multiple of 8 seq_len = (z.shape[1] + 8 - 1) // 8 * 8 z = torch.nn.functional.pad(z, (0, 0, 0, seq_len - z.shape[1]), value=0.0) return z @@ -278,10 +319,14 @@ def __init__( freeze=True, layer="last", use_fp16=False, + cache_dir=None, ): super().__init__() assert layer in self.LAYERS - model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version) + print(f"Downloading clip with", arch, version, cache_dir) + model, _, _ = open_clip.create_model_and_transforms( + arch, device=torch.device("cpu"), pretrained=version, cache_dir=cache_dir, + ) del model.visual self.model = model @@ -303,8 +348,12 @@ def freeze(self): param.requires_grad = False def forward(self, text): - tokens = open_clip.tokenize(text) - z = self.encode_with_transformer(tokens.to(self.device)) + if isinstance(text, list) and isinstance(text[0], str): + tokens = open_clip.tokenize(text) + else: + # tokenizer has been invoked before + tokens = text + z = self.encode_with_transformer(tokens.to(self.device, non_blocking=True)) return z def encode_with_transformer(self, text): @@ -331,8 +380,21 @@ def encode(self, text): class FrozenMegatronCLIPEmbedder(AbstractEncoder): - def __init__(self, restore_from_path, device="cuda", layer="last", freeze=True, cfg=None, use_fp16=False): - super().__init__() + def __init__( + self, + restore_from_path, + device="cuda", + layer="last", + freeze=True, + cfg=None, + use_fp16=False, + enable_lora_finetune=False, + ): + super().__init__( + enable_lora_finetune=enable_lora_finetune, + target_block=["ParallelAttention", "ParallelMLP"], + target_module=["ColumnParallelLinear", "RowParallelLinear"], + ) if restore_from_path is not None: cfg, state_dict = self.load_config_and_state_from_nemo(restore_from_path) elif cfg is not None: @@ -355,6 +417,10 @@ def __init__(self, restore_from_path, device="cuda", layer="last", freeze=True, else: raise NotImplementedError() + if 
enable_lora_finetune: + self._enable_lora(self.model.language_model) + print(f"Megatron CLIP encoder add {len(self.lora_layers)} lora layers.") + def freeze(self): self.model = self.model.eval() for param in self.parameters(): diff --git a/nemo/collections/multimodal/parts/stable_diffusion/pipeline.py b/nemo/collections/multimodal/parts/stable_diffusion/pipeline.py index e9de61d6025a..b28bfc6bcc5c 100644 --- a/nemo/collections/multimodal/parts/stable_diffusion/pipeline.py +++ b/nemo/collections/multimodal/parts/stable_diffusion/pipeline.py @@ -14,6 +14,8 @@ import os import pickle import time +from collections import defaultdict +from itertools import chain import torch from PIL import Image @@ -25,10 +27,10 @@ from nemo.collections.multimodal.parts.stable_diffusion.utils import DataParallelWrapper -def encode_prompt(cond_stage_model, prompt, unconditional_guidance_scale, batch_size): - c = cond_stage_model.encode(batch_size * [prompt]) +def encode_prompt(cond_stage_model, prompts, unconditional_guidance_scale): + c = cond_stage_model.encode(prompts) if unconditional_guidance_scale != 1.0: - uc = cond_stage_model.encode(batch_size * [""]) + uc = cond_stage_model.encode(len(prompts) * [""]) else: uc = None return c, uc @@ -73,10 +75,17 @@ def torch_to_numpy(images): return numpy_images +def pad_with_zeros(cond, u_cond, batch_size): + b, *shape = cond.shape + filler = torch.zeros(batch_size - b, *shape, device=cond.device) + return torch.cat([cond, filler]), torch.cat([u_cond, filler]) + + def pipeline(model, cfg, verbose=True, rng=None): # setup default values for inference configs unconditional_guidance_scale = cfg.infer.get("unconditional_guidance_scale", 7.5) - batch_size = cfg.infer.get('num_images_per_prompt', 1) + num_images_per_prompt = cfg.infer.get('num_images_per_prompt', 1) + batch_size = cfg.infer.get('batch_size', 1) prompts = cfg.infer.get('prompts', []) height = cfg.infer.get('height', 512) width = cfg.infer.get('width', 512) @@ -127,10 +136,16 @@ def 
pipeline(model, cfg, verbose=True, rng=None): if isinstance(prompts, str): prompts = [prompts] - for prompt in prompts: + multi_prompts = [p for p in prompts for _ in range(num_images_per_prompt)] + batched_prompts = [multi_prompts[i : i + batch_size] for i in range(0, len(multi_prompts), batch_size)] + # decrease batch_size if the number of inputs is lower than the batch size in the config + batch_size = min(len(batched_prompts[0]), batch_size) + + for batch in batched_prompts: tic = time.perf_counter() tic_total = tic - cond, u_cond = encode_prompt(model.cond_stage_model, prompt, unconditional_guidance_scale, batch_size) + cond, u_cond = encode_prompt(model.cond_stage_model, batch, unconditional_guidance_scale,) + cond, u_cond = pad_with_zeros(cond, u_cond, batch_size) toc = time.perf_counter() conditioning_time = toc - tic @@ -138,6 +153,7 @@ def pipeline(model, cfg, verbose=True, rng=None): latents = torch.randn( [batch_size, in_channels, height // downsampling_factor, width // downsampling_factor], generator=rng ).to(torch.cuda.current_device()) + assert len(cond) == len(latents), (len(cond), len(latents)) tic = time.perf_counter() samples, intermediates = sampler.sample( @@ -158,6 +174,8 @@ def pipeline(model, cfg, verbose=True, rng=None): tic = time.perf_counter() images = decode_images(model, samples) + # remove padding + images = images[: len(batch)] toc = time.perf_counter() decode_time = toc - tic @@ -186,9 +204,13 @@ def pipeline(model, cfg, verbose=True, rng=None): if save_to_file: os.makedirs(out_path, exist_ok=True) if output_type == 'pil': - for text_prompt, pils in zip(prompts, output): - for idx, image in enumerate(pils): - image.save(os.path.join(out_path, f'{text_prompt[:50]}_{idx}.png')) + prompts = chain.from_iterable(batched_prompts) + pils = chain.from_iterable(output) + counts = defaultdict(int) + for text_prompt, image in zip(prompts, pils): + idx = counts[text_prompt] + counts[text_prompt] += 1 + image.save(os.path.join(out_path,
f'{text_prompt[:50]}_{idx}.png')) else: with open(os.path.join(out_path, 'output.pkl'), 'wb') as f: pickle.dump(output, f) diff --git a/nemo/collections/multimodal/parts/utils.py b/nemo/collections/multimodal/parts/utils.py index bb612db2ce46..c82e0cd37140 100644 --- a/nemo/collections/multimodal/parts/utils.py +++ b/nemo/collections/multimodal/parts/utils.py @@ -12,16 +12,28 @@ # See the License for the specific language governing permissions and # limitations under the License. import os +import tempfile from typing import Any, Callable, Tuple import torch -from omegaconf import DictConfig, open_dict +from omegaconf import DictConfig, OmegaConf, open_dict from PIL import Image from pytorch_lightning import Trainer from pytorch_lightning.plugins.environments import TorchElasticEnvironment +from transformers import CLIPImageProcessor from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector -from nemo.utils import logging +from nemo.collections.nlp.parts.peft_config import PEFT_CONFIG_MAP +from nemo.utils import AppState, logging + +try: + from megatron.core import dist_checkpointing + + HAVE_MEGATRON_CORE = True + +except (ImportError, ModuleNotFoundError): + + HAVE_MEGATRON_CORE = False def numpy_to_pil(images): @@ -84,6 +96,54 @@ def apply_with_stopping_condition(module, apply_fn, apply_condition=None, stoppi ) +def load_nemo_model_weights(nemo_path, sharded_state_dict=None): + """ + Shared method to load model weights from a given nemo_path. 
+ """ + if torch.cuda.is_available(): + map_location = torch.device('cuda') + else: + map_location = torch.device('cpu') + + save_restore_connector = NLPSaveRestoreConnector() + cwd = os.getcwd() + app_state = AppState() + is_dist_ckpt = False + + with tempfile.TemporaryDirectory() as tmpdir: + try: + if os.path.isfile(nemo_path): + save_restore_connector._unpack_nemo_file(path2file=nemo_path, out_folder=tmpdir) + else: + tmpdir = nemo_path + os.chdir(tmpdir) + if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1: + model_weights = save_restore_connector._inject_model_parallel_rank_for_ckpt( + tmpdir, save_restore_connector.model_weights_ckpt + ) + else: + model_weights = os.path.join(tmpdir, save_restore_connector.model_weights_ckpt) + + state_dict = save_restore_connector._load_state_dict_from_disk(model_weights, map_location=map_location) + + # distributed checkpointing + if state_dict is None and sharded_state_dict is not None: + is_dist_ckpt = True + checkpoint = dict(state_dict=sharded_state_dict) + tmp_model_weights_ckpt = os.path.join(tmpdir, save_restore_connector.model_weights_ckpt) + tmp_model_weights_dir = os.path.splitext(tmp_model_weights_ckpt)[0] + assert os.path.isdir(tmp_model_weights_dir), f'Expected {tmp_model_weights_dir} to be a directory.' + checkpoint = dist_checkpointing.load( + sharded_state_dict=checkpoint, checkpoint_dir=tmp_model_weights_dir, + ) + state_dict = checkpoint["state_dict"] + + finally: + os.chdir(cwd) + + return state_dict, is_dist_ckpt + + def setup_trainer_and_models_for_inference( model_provider: Any, cfg: DictConfig, model_cfg_modifier: Callable, ): @@ -253,3 +313,153 @@ def dummy(): # Return the trainer and model objects. 
return trainer, model + + +def create_neva_model_and_processor(cfg): + from nemo.collections.multimodal.models.multimodal_llm.neva.neva_model import MegatronNevaModel + + plugins = [] + if cfg.get('cluster_type', None) == 'BCP': + plugins.append(TorchElasticEnvironment()) + # trainer required for restoring model parallel models + trainer = Trainer(plugins=plugins, strategy=NLPDDPStrategy(), **cfg.trainer) + + if ( + cfg.tensor_model_parallel_size < 0 + or cfg.pipeline_model_parallel_size < 0 + or cfg.get('pipeline_model_parallel_split_rank', -1) < 0 + ): + model_config = MegatronNevaModel.restore_from( + restore_path=cfg.neva_model_file, trainer=trainer, return_config=True, + ) + + with open_dict(cfg): + cfg.tensor_model_parallel_size = model_config.get('tensor_model_parallel_size', 1) + cfg.pipeline_model_parallel_size = model_config.get('pipeline_model_parallel_size', 1) + cfg.pipeline_model_parallel_split_rank = model_config.get('pipeline_model_parallel_split_rank', 0) + + assert ( + cfg.trainer.devices * cfg.trainer.num_nodes + == cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size + ), "devices * num_nodes should equal tensor_model_parallel_size * pipeline_model_parallel_size" + + if cfg.neva_model_file: + save_restore_connector = NLPSaveRestoreConnector() + if os.path.isdir(cfg.neva_model_file): + save_restore_connector.model_extracted_dir = cfg.neva_model_file + + neva_cfg = MegatronNevaModel.restore_from( + restore_path=cfg.neva_model_file, + trainer=trainer, + return_config=True, + save_restore_connector=save_restore_connector, + ) + OmegaConf.set_struct(neva_cfg, True) + with open_dict(neva_cfg): + neva_cfg.sequence_parallel = False + neva_cfg.activations_checkpoint_granularity = None + neva_cfg.activations_checkpoint_method = None + neva_cfg.precision = trainer.precision + neva_cfg.mm_cfg.llm.from_pretrained = cfg.get('base_model_file', None) + # neva_cfg.mm_cfg.vision_encoder.from_pretrained = None + + model = 
MegatronNevaModel.restore_from( + restore_path=cfg.neva_model_file, + trainer=trainer, + override_config_path=neva_cfg, + save_restore_connector=save_restore_connector, + ) + if neva_cfg.get('peft') is not None: + peft_cfg_cls = PEFT_CONFIG_MAP[neva_cfg.peft.peft_scheme] + if peft_cfg_cls is not None: + model.load_adapters(cfg.neva_model_file, peft_cfg_cls(neva_cfg)) + + elif cfg.checkpoint_dir: + app_state = AppState() + if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1: + app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size + app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size + app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size + ( + app_state.tensor_model_parallel_rank, + app_state.pipeline_model_parallel_rank, + app_state.model_parallel_size, + app_state.data_parallel_size, + app_state.pipeline_model_parallel_split_rank, + app_state.virtual_pipeline_model_parallel_rank, + ) = fake_initialize_model_parallel( + world_size=app_state.model_parallel_size, + rank=trainer.global_rank, + tensor_model_parallel_size_=cfg.tensor_model_parallel_size, + pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size, + pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank, + ) + checkpoint_path = inject_model_parallel_rank(os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)) + # TODO: This won't work properly (we need to set model.llm.from_pretrained and model.vision.from_pretrained to null) + model = MegatronNevaModel.load_from_checkpoint(checkpoint_path, hparams_file=cfg.hparams_file, trainer=trainer) + else: + raise ValueError("need at least a nemo file or checkpoint dir") + + model.freeze() + + # Have to turn off activations_checkpoint_method for inference + try: + model.model.language_model.encoder.activations_checkpoint_method = None + except AttributeError: + pass + try: +
model.model.module.language_model.encoder.activations_checkpoint_method = None + except AttributeError: + pass + + def image_processor(maybe_image_path): + if isinstance(maybe_image_path, str): + image = Image.open(maybe_image_path).convert('RGB') + else: + image = maybe_image_path + + if neva_cfg.mm_cfg.vision_encoder.from_hf: + processor = CLIPImageProcessor.from_pretrained( + neva_cfg.mm_cfg.vision_encoder.from_pretrained, torch_dtype=torch.bfloat16 + ) + else: + processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16) + + if neva_cfg.data.image_aspect_ratio == 'keep': + max_hw, min_hw = max(image.size), min(image.size) + aspect_ratio = max_hw / min_hw + max_len, min_len = 448, 224 + shortest_edge = int(min(max_len / aspect_ratio, min_len)) + image = processor.preprocess( + image, return_tensors='pt', do_center_crop=False, size={"shortest_edge": shortest_edge} + )['pixel_values'][0] + elif neva_cfg.data.image_aspect_ratio == 'pad': + + def expand2square(pil_img, background_color): + width, height = pil_img.size + if width == height: + return pil_img + elif width > height: + result = Image.new(pil_img.mode, (width, width), background_color) + result.paste(pil_img, (0, (width - height) // 2)) + return result + else: + result = Image.new(pil_img.mode, (height, height), background_color) + result.paste(pil_img, ((height - width) // 2, 0)) + return result + + image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean)) + image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + else: + image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0] + + if neva_cfg.precision in [16, '16', '16-mixed']: + media = image.type(torch.float16) + elif neva_cfg.precision in [32, '32', '32-true']: + media = image.type(torch.float32) + else: + media = image.type(torch.bfloat16) + + return media.unsqueeze(dim=0).unsqueeze(dim=0).unsqueeze(dim=0) + + return model, image_processor 
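Note on the `image_processor` helper added to `parts/utils.py` above: the `'keep'` and `'pad'` branches both reduce to a little integer arithmetic on the image size. The sketch below isolates that arithmetic in plain Python so it can be checked without PIL or `transformers`. The function names `keep_ratio_shortest_edge` and `square_paste_offset` are hypothetical and are not part of this patch; the constants 448 and 224 are the `max_len` and `min_len` values from the hunk.

```python
def keep_ratio_shortest_edge(width, height, max_len=448, min_len=224):
    """Shortest-edge target computed by the 'keep' branch.

    Hypothetical helper: mirrors
    `int(min(max_len / aspect_ratio, min_len))` from the patch.
    """
    max_hw, min_hw = max(width, height), min(width, height)
    aspect_ratio = max_hw / min_hw
    # Elongated images get a smaller shortest edge so the long side
    # stays within max_len after resizing.
    return int(min(max_len / aspect_ratio, min_len))


def square_paste_offset(width, height):
    """Canvas side and paste offset used by `expand2square` in the 'pad' branch.

    Hypothetical helper: returns (side, (x, y)), where (x, y) is where the
    original image is pasted onto a side x side background.
    """
    side = max(width, height)
    return side, ((side - width) // 2, (side - height) // 2)


if __name__ == "__main__":
    print(keep_ratio_shortest_edge(640, 480))
    print(keep_ratio_shortest_edge(1000, 250))
    print(square_paste_offset(640, 480))
    print(square_paste_offset(480, 640))
```

For mildly non-square inputs the `'keep'` branch always resolves to `min_len`; the `max_len` cap only takes effect once the aspect ratio exceeds `max_len / min_len`, which is 2 with the values in the patch.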
diff --git a/nemo/collections/nlp/modules/common/text_generation_strategy.py b/nemo/collections/nlp/modules/common/text_generation_strategy.py index b17dd12c0e3f..fd68eef592fd 100644 --- a/nemo/collections/nlp/modules/common/text_generation_strategy.py +++ b/nemo/collections/nlp/modules/common/text_generation_strategy.py @@ -20,7 +20,6 @@ from typing import List, Set, Tuple import torch -from transformers import CLIPImageProcessor from nemo.collections.nlp.modules.common.lm_utils import pad_batch from nemo.collections.nlp.modules.common.megatron.utils import get_ltor_masks_and_position_ids @@ -325,6 +324,65 @@ def prepare_batch_at_step( return batch, tensor_shape +def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, conv_template): + from nemo.collections.multimodal.data.neva.neva_dataset import ( + DEFAULT_IMAGE_TOKEN, + preprocess_llama_2, + preprocess_multimodal, + preprocess_nvgpt, + preprocess_v1, + ) + + list_data_dict = [] + if multimodal_cfg["conv_template"] == "nvgpt": + record = { + 'system': 'A chat between a curious user and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the user\'s questions.\n\n', + 'conversations': [{'from': 'User', 'value': prompt}, {'from': 'Assistant', 'value': '',},], + } + + for turn in record['conversations']: # + if turn.get('value') is not None: + turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) + list_data_dict.append(record) + + sources = preprocess_multimodal( + copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents + ) # HARDCODED FOR NOW + data_dict = preprocess_nvgpt(sources, tokenizer, multimodal_cfg) + + elif multimodal_cfg["conv_template"] == "llama_2": + record = { + 'conversations': [{'from': 'human', 'value': prompt,}, {'from': 'gpt', 'value': '',},], + } + + for turn in record['conversations']: + if turn.get('value') is not None: + turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) + list_data_dict.append(record) + + sources = preprocess_multimodal( + copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents + ) # HARDCODED FOR NOW + data_dict = preprocess_llama_2(sources, tokenizer, multimodal_cfg) + elif multimodal_cfg["conv_template"] == "v1": + record = { + 'conversations': [{'from': 'human', 'value': prompt,}, {'from': 'gpt', 'value': '',},], + } + + for turn in record['conversations']: + if turn.get('value') is not None: + turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) + list_data_dict.append(record) + + sources = preprocess_multimodal( + copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents + ) # HARDCODED FOR NOW + data_dict = preprocess_v1(sources, tokenizer, multimodal_cfg) + else: + raise ValueError(f"Conversation template `{conv_template}` is not supported in Neva now.") + return data_dict['tokens'].tolist() + + class NevaModelTextGenerationStrategy(TextGenerationStrategy): def __init__(self, model): super().__init__(model) @@ -335,15 +393,6 @@ def __init__(self, model): self.cfg = self.model.cfg self.data_cfg = 
self.model.cfg.data - if self.cfg.mm_cfg.vision_encoder.from_hf: - self.processor = CLIPImageProcessor.from_pretrained( - self.cfg.mm_cfg.vision_encoder.from_pretrained, torch_dtype=torch.bfloat16 - ) - else: - self.processor = CLIPImageProcessor.from_pretrained( - "openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16 - ) - add_extra_token = 0 self.multimodal_cfg = dict( is_multimodal=self.data_cfg.is_multimodal, @@ -353,7 +402,7 @@ def __init__(self, model): image_folder=self.data_cfg.image_folder, image_aspect_ratio=self.data_cfg.image_aspect_ratio, use_im_start_end=getattr(self.cfg.mm_cfg, 'use_im_start_end', False), - image_processor=self.processor, + image_processor=None, add_extra_token=add_extra_token, context_length=self.cfg.encoder_seq_length, ) @@ -379,122 +428,36 @@ def init_batch(self, context_tokens: torch.Tensor, context_length: int, compute_ compute_attention_mask=compute_attention_mask, ) - def process_prompts(self, prompt): - from nemo.collections.multimodal.data.neva.neva_dataset import ( - DEFAULT_IMAGE_TOKEN, - preprocess_llama_2, - preprocess_multimodal, - preprocess_nvgpt, - preprocess_v1, - ) - - list_data_dict = [] - if self.multimodal_cfg["conv_template"] == "nvgpt": - record = { - 'system': 'A chat between a curious user and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the user\'s questions.\n\n', - 'conversations': [ - {'from': 'User', 'value': prompt,}, - { - 'from': 'Assistant', - 'value': '', - 'label': 'quality:6,toxicity:0,humor:0,creativity:0,violence:0,helpfulness:6,not_appropriate:0', - }, - ], - } - - for turn in record['conversations']: # - if turn.get('value') is not None: - turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) - list_data_dict.append(record) - - sources = preprocess_multimodal( - copy.deepcopy(list_data_dict), self.multimodal_cfg, self.num_media_latents - ) # HARDCODED FOR NOW - data_dict = preprocess_nvgpt(sources, self.tokenizer, self.multimodal_cfg) - - elif self.multimodal_cfg["conv_template"] == "llama_2": - record = { - 'conversations': [{'from': 'human', 'value': prompt,}, {'from': 'gpt', 'value': '',},], - } - - for turn in record['conversations']: - if turn.get('value') is not None: - turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) - list_data_dict.append(record) - - sources = preprocess_multimodal( - copy.deepcopy(list_data_dict), self.multimodal_cfg, self.num_media_latents - ) # HARDCODED FOR NOW - data_dict = preprocess_llama_2(sources, self.tokenizer, self.multimodal_cfg) - elif self.multimodal_cfg["conv_template"] == "v1": - record = { - 'conversations': [{'from': 'human', 'value': prompt,}, {'from': 'gpt', 'value': '',},], - } - - for turn in record['conversations']: - if turn.get('value') is not None: - turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value']) - list_data_dict.append(record) + def tokenize_batch(self, prompt, max_len, add_BOS): - sources = preprocess_multimodal( - copy.deepcopy(list_data_dict), self.multimodal_cfg, self.num_media_latents - ) # HARDCODED FOR NOW - data_dict = preprocess_v1(sources, self.tokenizer, self.multimodal_cfg) + if type(prompt) == str: + context_tokens = neva_process_prompts( + prompt, + self.tokenizer, + self.multimodal_cfg, + 
self.num_media_latents, + self.multimodal_cfg['conv_template'], + ) + elif type(prompt) == list: + context_tokens = [] + for p in prompt: + context_tokens.append( + neva_process_prompts( + p, + self.tokenizer, + self.multimodal_cfg, + self.num_media_latents, + self.multimodal_cfg['conv_template'], + )[0] + ) else: - raise ValueError(f"Conversation template `{self.conv_template}` is not supported in Neva now.") - return data_dict['tokens'].tolist() + raise ValueError(f'{type(prompt)} is not supported for tokenization') - def tokenize_batch(self, prompt, max_len, add_BOS): - context_tokens = self.process_prompts(prompt) context_tokens, context_lengths = pad_batch(context_tokens, self.tokenizer.eos_id, max_len) context_tokens_tensor = torch.cuda.LongTensor(context_tokens) context_length_tensor = torch.cuda.LongTensor(context_lengths) return context_tokens_tensor, context_length_tensor - def get_media_tensor(self, image_path): - from PIL import Image - - image = Image.open(image_path).convert('RGB') - - if self.data_cfg.image_aspect_ratio == 'keep': - max_hw, min_hw = max(image.size), min(image.size) - aspect_ratio = max_hw / min_hw - max_len, min_len = 448, 224 - shortest_edge = int(min(max_len / aspect_ratio, min_len)) - image = self.processor.preprocess( - image, return_tensors='pt', do_center_crop=False, size={"shortest_edge": shortest_edge} - )['pixel_values'][0] - elif self.data_cfg.image_aspect_ratio == 'pad': - - def expand2square(pil_img, background_color): - width, height = pil_img.size - if width == height: - return pil_img - elif width > height: - result = Image.new(pil_img.mode, (width, width), background_color) - result.paste(pil_img, (0, (width - height) // 2)) - return result - else: - result = Image.new(pil_img.mode, (height, height), background_color) - result.paste(pil_img, ((height - width) // 2, 0)) - return result - - image = expand2square(image, tuple(int(x * 255) for x in self.processor.image_mean)) - image = self.processor.preprocess(image, 
return_tensors='pt')['pixel_values'][0] - else: - image = self.processor.preprocess(image, return_tensors='pt')['pixel_values'][0] - - model_cfg = self.model.cfg - - if model_cfg.precision in [16, '16', '16-mixed']: - media = image.type(torch.float16) - elif model_cfg.precision in [32, '32', '32-true']: - media = image.type(torch.float32) - else: - media = image.type(torch.bfloat16) - - return media.unsqueeze(dim=0).unsqueeze(dim=0).unsqueeze(dim=0) - def prepare_batch_at_step( self, tokens: torch.Tensor, diff --git a/nemo/collections/nlp/modules/common/text_generation_utils.py b/nemo/collections/nlp/modules/common/text_generation_utils.py index 1a13d2278520..38daa44a036a 100644 --- a/nemo/collections/nlp/modules/common/text_generation_utils.py +++ b/nemo/collections/nlp/modules/common/text_generation_utils.py @@ -26,6 +26,11 @@ import torch.nn.functional as F from nemo.collections.common.tokenizers.tabular_tokenizer import TabularTokenizer +from nemo.collections.multimodal.data.neva.conversation import ( + DEFAULT_IM_END_TOKEN, + DEFAULT_IM_START_TOKEN, + DEFAULT_IMAGE_PATCH_TOKEN, +) from nemo.collections.nlp.modules.common.megatron.utils import get_ltor_masks_and_position_ids from nemo.collections.nlp.modules.common.text_generation_strategy import model_inference_strategy_dispatcher from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, OutputType, SamplingParam @@ -148,10 +153,9 @@ def megatron_neva_generate(model, prompt_dict_list, length_params, sampling_para conv_template = model.cfg.data.get("conv_template", "nvgpt") final_response = [] for idx, prompt_dict in enumerate(prompt_dict_list): - img = os.path.join(inference_config.inference.images_base_path, prompt_dict['image']) response = generate( model, - inputs=prompt_dict.get("prompt") or prompt_dict.get("text"), + inputs=prompt_dict.get('prompt'), tokens_to_generate=length_params['max_length'], all_probs=sampling_params['all_probs'], 
compute_logprob=sampling_params['compute_logprob'], @@ -164,17 +168,18 @@ def megatron_neva_generate(model, prompt_dict_list, length_params, sampling_para end_strings=sampling_params['end_strings'], min_tokens_to_generate=length_params['min_length'], compute_attention_mask=sampling_params.get("compute_attention_mask", True), - image_list=img, + image_list=prompt_dict.get('image'), **strategy_args, ) # Regular expression pattern to match the sequence - pattern = re.compile(r'( ⁇ )+') - clean_text = re.sub(pattern, '', response['sentences'][0]) + pattern = re.compile(rf'{DEFAULT_IM_START_TOKEN}( ⁇ )+{DEFAULT_IM_END_TOKEN}') + pattern_nvgpt = re.compile(rf'{DEFAULT_IM_START_TOKEN}({DEFAULT_IMAGE_PATCH_TOKEN})+{DEFAULT_IM_END_TOKEN}') + combined_pattern = re.compile(f'{pattern.pattern}|{pattern_nvgpt.pattern}') + clean_text = re.sub(combined_pattern, '', response['sentences'][0]) clean_response = clean_text - for string in sampling_params['end_strings']: - clean_response = clean_response.rstrip(string) + if conv_template == "nvgpt": labels_str_regexp = re.compile(f"quality:.*\n") last_match_end_position = None @@ -182,9 +187,13 @@ def megatron_neva_generate(model, prompt_dict_list, length_params, sampling_para last_match_end_position = match.end() if last_match_end_position is not None: clean_response = clean_response[last_match_end_position:] + clean_response = clean_response.strip("") elif conv_template == "llama_2": clean_response = clean_response.rsplit("[/INST] ", 1)[-1] - clean_response.strip() + elif conv_template == "v1": + clean_response = clean_response.rsplit("ASSISTANT: ", 1)[-1] + + clean_response = clean_response.strip() response["clean_text"] = clean_text response["clean_response"] = clean_response final_response.append(response) @@ -759,14 +768,10 @@ def sample_sequence_batch( lengths = torch.ones([batch_size]).long().cuda() * maxlen - media_tensor = None - if image_list is not None: - media_tensor = inference_strategy.get_media_tensor(image_list) - 
while context_length < maxlen: - if media_tensor is not None: + if image_list is not None: batch, tensor_shape = inference_strategy.prepare_batch_at_step( - tokens, maxlen, micro_batch_size, counter, context_length, compute_attention_mask, media_tensor + tokens, maxlen, micro_batch_size, counter, context_length, compute_attention_mask, image_list ) else: batch, tensor_shape = inference_strategy.prepare_batch_at_step( diff --git a/nemo/collections/nlp/parts/nlp_overrides.py b/nemo/collections/nlp/parts/nlp_overrides.py index 6ee36d6983cb..741cb0309813 100644 --- a/nemo/collections/nlp/parts/nlp_overrides.py +++ b/nemo/collections/nlp/parts/nlp_overrides.py @@ -73,6 +73,16 @@ HAVE_APEX = False + +try: + import amp_C + + HAVE_AMP_C = True + +except (ImportError, ModuleNotFoundError): + + HAVE_AMP_C = False + try: from megatron.core import dist_checkpointing, parallel_state from megatron.core.dist_checkpointing.dict_utils import dict_list_map_outplace @@ -930,7 +940,7 @@ def modify_state_dict(self, conf, state_dict): # Modify state key for Dreambooth inference if ( conf.get('target') - == 'nemo.collections.multimodal.models.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' + == 'nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion' ): new_state_dict = {} for key in state_dict.keys(): @@ -1140,7 +1150,13 @@ def __init__( ) self.optimizer_update_skipped: Optional[bool] = None self.hysteresis = hysteresis - self._hysteresis_tracker = self.hysteresis + + def _lazy_init_scale_growth_tracker(self, dev): + super()._lazy_init_scale_growth_tracker(dev) + if HAVE_AMP_C: + self._hysteresis_tracker = torch.tensor([self.hysteresis], dtype=torch.int32, device=dev) + else: + self._hysteresis_tracker = self.hysteresis def _unscale_grads_(self, optimizer, *args): if getattr(optimizer, "_custom_amp_unscale_grads", False): @@ -1150,14 +1166,17 @@ def _unscale_grads_(self, optimizer, *args): def _maybe_opt_step(self, optimizer, 
optimizer_state, *args, **kwargs): retval = None - found_inf = torch.cuda.FloatTensor([sum(v.item() for v in optimizer_state["found_inf_per_device"].values())]) + found_infs = tuple(optimizer_state["found_inf_per_device"].values()) + found_inf = torch.stack(found_infs).sum(dim=0, keepdim=True) # Update across all model parallel instances. torch.distributed.all_reduce( found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group() ) - if found_inf.item() == 0: + self._found_infs_cpu = found_inf.item() + self._found_infs_cuda = found_inf + if self._found_infs_cpu == 0: retval = optimizer.step(*args, **kwargs) self.optimizer_update_skipped = False else: @@ -1213,11 +1232,38 @@ def update(self, new_scale=None): ) found_inf_combined += found_inf - if found_inf_combined > 0: - self._hysteresis_tracker -= 1 - if self._hysteresis_tracker <= 0: - # When hysteresis becomes zero, follow the native grad scale update rule. - # Increase scale and reset growth tracker + if HAVE_AMP_C: + amp_C.update_scale_hysteresis( + _scale, + _growth_tracker, + self._hysteresis_tracker, + found_inf_combined, + self._growth_factor, + self._backoff_factor, + self._growth_interval, + self.hysteresis, + ) + else: + if found_inf_combined > 0: + self._hysteresis_tracker -= 1 + if self._hysteresis_tracker <= 0: + # When hysteresis becomes zero, follow the native grad scale update rule. + # Increase scale and reset growth tracker + torch._amp_update_scale_( + _scale, + _growth_tracker, + found_inf_combined, + self._growth_factor, + self._backoff_factor, + self._growth_interval, + ) + else: + # Only reset the growth tracker when hysteresis is larger than zero + _growth_tracker.fill_(0.0) + else: + # When no inf found, follow the native grad scale update rule. + # Increment growth_tracker, update scale when growth tracker reaches the interval, and + # reset the hysteresis tracker. 
torch._amp_update_scale_( _scale, _growth_tracker, @@ -1226,22 +1272,7 @@ def update(self, new_scale=None): self._backoff_factor, self._growth_interval, ) - else: - # Only reset the growth tracker when hysteresis is larger than zero - _growth_tracker.fill_(0.0) - else: - # When no inf found, follow the native grad scale update rule. - # Increment growth_tracker, update scale when growth tracker reaches the interval, and - # reset the hysteresis tracker. - torch._amp_update_scale_( - _scale, - _growth_tracker, - found_inf_combined, - self._growth_factor, - self._backoff_factor, - self._growth_interval, - ) - self._hysteresis_tracker = self.hysteresis + self._hysteresis_tracker = self.hysteresis # To prepare for next iteration, clear the data collected from optimizers this iteration. self._per_optimizer_states = defaultdict(torch.cuda.amp.grad_scaler._refresh_per_optimizer_state) @@ -1288,7 +1319,10 @@ def load_state_dict(self, state_dict): if "_hysterisis_tracker" in state_dict: self._hysteresis_tracker = state_dict["_hysterisis_tracker"] else: - self._hysteresis_tracker = 1 + if HAVE_AMP_C: + self._hysteresis_tracker = torch.tensor([1], dtype=torch.int32, device="cuda") + else: + self._hysteresis_tracker = 1 class MegatronHalfPrecisionPlugin(MixedPrecisionPlugin): diff --git a/nemo/collections/nlp/parts/peft_config.py b/nemo/collections/nlp/parts/peft_config.py index c01ba337e8c2..72bcdf55e8ae 100644 --- a/nemo/collections/nlp/parts/peft_config.py +++ b/nemo/collections/nlp/parts/peft_config.py @@ -199,6 +199,30 @@ def __init__(self, cfg): super().__init__(adapter_tuning_cfg, name_key_to_cfg) +class SDLoraPEFTConfig(PEFTConfig): + def __init__(self, cfg): + lora_cfg = cfg.peft.lora_tuning + + # Stable diffusion has different attn dimensions, we pass a dummy config and infer from each module when adding adapter + config_args = { + "in_features": None, + "out_features": None, + "dim": lora_cfg.adapter_dim, + "norm_position": None, + "norm_type": None, + "activation": 
"identity", + "column_init_method": lora_cfg.get("column_init_method", "normal"), + "row_init_method": lora_cfg.get("row_init_method", "zero"), + "gather_output": False, + "dropout": lora_cfg.adapter_dropout, + "network_alpha": lora_cfg.network_alpha, + } + + name_key_to_cfg = {AdapterName.PARALLEL_LINEAR_ADAPTER: ParallelLinearAdapterConfig(**config_args)} + self.name_key_to_mcore_mixins = None + super().__init__(lora_cfg, name_key_to_cfg) + + PEFT_CONFIG_MAP = { "adapter": CanonicalAdaptersPEFTConfig, "ia3": IA3PEFTConfig, @@ -207,4 +231,5 @@ def __init__(self, cfg): "selective": SelectivePEFTConfig, 'none': None, None: None, + "sdlora": SDLoraPEFTConfig, } diff --git a/nemo/collections/vision/modules/common/megatron/vision_transformer.py b/nemo/collections/vision/modules/common/megatron/vision_transformer.py index 744c821f984a..80793067128c 100644 --- a/nemo/collections/vision/modules/common/megatron/vision_transformer.py +++ b/nemo/collections/vision/modules/common/megatron/vision_transformer.py @@ -36,7 +36,7 @@ ModelType = AttnMaskType = AttnType = LayerType = ApexGuardDefaults() try: - from megatron.core import parallel_state + from megatron.core import ModelParallelConfig, parallel_state HAVE_MEGATRON_CORE = True @@ -82,6 +82,16 @@ def forward(self, hidden_state): return output +class LayerScale(torch.nn.Module): + def __init__(self, dim, init_values=1e-5, inplace=False): + super().__init__() + self.inplace = inplace + self.gamma = torch.nn.Parameter(init_values * torch.ones(dim)) + + def forward(self, x): + return x.mul_(self.gamma) if self.inplace else x * self.gamma + + class ParallelVisionTransformerLayer_(ParallelTransformerLayer_): """A single transformer layer. 
@@ -91,7 +101,7 @@ class ParallelVisionTransformerLayer_(ParallelTransformerLayer_): def __init__( self, - config, + config: ModelParallelConfig, init_method, output_layer_init_method, layer_number, @@ -102,37 +112,48 @@ def __init__( self_attn_mask_type=AttnMaskType.padding, fp32_residual_connection=False, precision=16, - apply_query_key_layer_scaling=True, + apply_query_key_layer_scaling=False, kv_channels=None, layernorm_epsilon=1e-5, hidden_dropout=0.1, - bias_dropout_add_fusion=True, persist_layer_norm=False, bias_activation_fusion=True, + bias_dropout_add_fusion=True, + masked_softmax_fusion=True, openai_gelu=False, onnx_safe=False, - masked_softmax_fusion=True, attention_dropout=0.1, ffn_dropout=0.0, drop_path_rate=0.0, + layerscale=False, activation='gelu', megatron_legacy=False, bias=True, chunk_size=64, normalization='layernorm', transformer_block_type='pre_ln', + position_embedding_type='learned_absolute', + multi_query_attention=False, headscale=False, activations_checkpoint_granularity=None, normalize_attention_scores=True, + num_moe_experts=1, + moe_frequency=1, + moe_dropout=0.0, use_flash_attention=False, ): kwargs = locals() for key in ["self", "__class__"]: kwargs.pop(key) drop_path_rate = kwargs.pop("drop_path_rate") + layerscale = kwargs.pop("layerscale") super(ParallelVisionTransformerLayer_, self).__init__(**kwargs) self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0.0 else None + self.layerscale = layerscale + if self.layerscale: + self.post_attention_layerscale = LayerScale(hidden_size, init_values=1e-5) + self.post_mlp_layerscale = LayerScale(hidden_size, init_values=1e-5) def forward( self, @@ -144,8 +165,7 @@ def forward( get_key_value=False, set_inference_key_value_memory=False, inference_max_sequence_len=None, - rotary_pos_emb=None, - # list of positional embedding tensors, first one self attention, second one and third one are for cross attention (q, k) + rotary_pos_emb=None, # list of positional embedding tensors, first 
one self attention, second one and third one are for cross attention (q, k) self_attention_relative_position_bias=None, cross_attention_relative_position_bias=None, checkpoint_core_attention=False, @@ -200,7 +220,14 @@ def forward( # different nn.functional routines to account for varying # dropout semantics during training and inference phases. - if self.drop_path is None: + if self.is_adapter_available(): + adapter_1 = self.get_adapter_module(AdapterName.PRE_ATTN_ADAPTER) + if adapter_1: + attention_output = ( + adapter_1(attention_output) + attention_output + ) # simple adapter call with residual connection + + if self.drop_path is None and not self.layerscale: bias_dropout_add_func = self._get_bias_droput_add_func( transformer_block_type=self.transformer_block_type, position_after='attention' ) @@ -215,8 +242,11 @@ def forward( out = torch.nn.functional.dropout( attention_output + attention_bias, p=self.hidden_dropout, training=self.training ) - layernorm_input = residual + self.drop_path(out) - # print(f"Layer: {self.layer_number} Attention checksum {layernorm_input.sum()}") + if self.drop_path is not None: + out = self.drop_path(out) + if self.layerscale: + out = self.post_attention_layerscale(out) + layernorm_input = residual + out # Post-LN normalization after residual if self.transformer_block_type == 'post_ln': @@ -283,10 +313,15 @@ def forward( layernorm_input = normalization_output # MLP. mlp_output, mlp_bias = self.mlp(normalization_output) + if self.is_adapter_available(): + # TODO: (@adithyre) was able to move adapter_2 back to the end of the transformer after ptl 1.7 update. 
+ adapter_2 = self.get_adapter_module(AdapterName.POST_ATTN_ADAPTER) + if adapter_2: + mlp_output = adapter_2(mlp_output) + mlp_output # simple adapter call with residual connection residual = layernorm_input - if self.drop_path is None: + if self.drop_path is None and not self.layerscale: bias_dropout_add_func = self._get_bias_droput_add_func( transformer_block_type=self.transformer_block_type, position_after='mlp' ) @@ -295,7 +330,11 @@ def forward( else: out = torch.nn.functional.dropout(mlp_output + mlp_bias, p=self.hidden_dropout, training=self.training) - output = residual + self.drop_path(out) + if self.drop_path is not None: + out = self.drop_path(out) + if self.layerscale: + out = self.post_mlp_layerscale(out) + output = residual + out # print(f"Layer: {self.layer_number} MLP + Dropout + Residual checksum {output.sum()}") if self.transformer_block_type == 'post_ln': @@ -349,14 +388,14 @@ class ParallelVisionTransformer(ParallelTransformer): def __init__( self, - config, + config: ModelParallelConfig, init_method, output_layer_init_method, num_layers, hidden_size, ffn_hidden_size, num_attention_heads, - apply_query_key_layer_scaling=True, + apply_query_key_layer_scaling=False, kv_channels=None, layer_type=LayerType.encoder, # it can be a list of types or single type self_attn_mask_type=AttnMaskType.padding, @@ -371,6 +410,7 @@ def __init__( attention_dropout=0.1, ffn_dropout=0.0, drop_path_rate=0.0, + layerscale=False, bias_activation_fusion=True, bias_dropout_add_fusion=True, masked_softmax_fusion=True, @@ -384,17 +424,34 @@ def __init__( chunk_size=64, normalization='layernorm', transformer_block_type='pre_ln', + position_embedding_type='learned_absolute', headscale=False, layer_number_offset=0, # this is use only for attention norm_factor scaling activations_checkpoint_granularity=None, - normalize_attention_scores=True, + activations_checkpoint_layers_per_pipeline=None, + transformer_engine=False, + fp8=False, + fp8_e4m3=False, + fp8_hybrid=False, + 
fp8_margin=0, + fp8_interval=1, + fp8_amax_history_len=1, + fp8_amax_compute_algo='most_recent', + reduce_amax=True, + use_emha=False, ub_tp_comm_overlap=False, + normalize_attention_scores=True, + multi_query_attention=False, + num_moe_experts=1, + moe_frequency=1, + moe_dropout=0.0, use_flash_attention=False, ): kwargs = locals() for key in ["self", "__class__"]: kwargs.pop(key) self.drop_path_rate = kwargs.pop("drop_path_rate") + layerscale = kwargs.pop("layerscale") super(ParallelVisionTransformer, self).__init__(**kwargs) self.num_layers = self.get_num_layers(num_layers) @@ -431,10 +488,12 @@ def build_layer(layer_number): attention_dropout=attention_dropout, ffn_dropout=ffn_dropout, drop_path_rate=self.drop_path_rates[layer_number - 1], + layerscale=layerscale, bias_activation_fusion=bias_activation_fusion, bias_dropout_add_fusion=bias_dropout_add_fusion, masked_softmax_fusion=masked_softmax_fusion, persist_layer_norm=persist_layer_norm, + position_embedding_type=position_embedding_type, openai_gelu=openai_gelu, onnx_safe=onnx_safe, activation=activation, @@ -446,6 +505,9 @@ def build_layer(layer_number): headscale=headscale, activations_checkpoint_granularity=activations_checkpoint_granularity, normalize_attention_scores=normalize_attention_scores, + num_moe_experts=num_moe_experts, + moe_frequency=moe_frequency, + moe_dropout=moe_dropout, use_flash_attention=use_flash_attention, ) diff --git a/nemo/collections/vision/modules/vit/vit_backbone.py b/nemo/collections/vision/modules/vit/vit_backbone.py index 422ccf65475e..ebd7e0da3e5c 100644 --- a/nemo/collections/vision/modules/vit/vit_backbone.py +++ b/nemo/collections/vision/modules/vit/vit_backbone.py @@ -213,8 +213,8 @@ def __init__( self.num_patches_per_dim_h = self.img_h // self.patch_dim self.num_patches_per_dim_w = self.img_w // self.patch_dim self.num_patches = self.num_patches_per_dim_h * self.num_patches_per_dim_w - class_token_length = model_cfg.get("class_token_length", 8) - self.seq_length = 
self.num_patches + (class_token_length if self.class_token else 0) + self.class_token_length = model_cfg.get("class_token_length", 8) if self.class_token else 0 + self.seq_length = self.num_patches + self.class_token_length self.flatten_dim = self.patch_dim * self.patch_dim * model_cfg.num_channels self.input_tensor = None self.position_ids = None @@ -223,7 +223,7 @@ def __init__( if self.pre_process: # cls_token if self.class_token: - self.cls_token = torch.nn.Parameter(torch.randn(1, class_token_length, self.hidden_size)) + self.cls_token = torch.nn.Parameter(torch.randn(1, self.class_token_length, self.hidden_size)) torch.nn.init.zeros_(self.cls_token) self.position_ids = torch.arange(self.seq_length).expand(1, -1).cuda() @@ -249,7 +249,7 @@ def __init__( self.embedding_dropout = torch.nn.Dropout(model_cfg.hidden_dropout) self.drop_patch = DropPatch( - self.drop_patch_rate, class_token_length=class_token_length, exclude_cls_tokens=self.class_token + self.drop_patch_rate, class_token_length=self.class_token_length, exclude_cls_tokens=self.class_token ) if preprocess_layernorm: @@ -282,6 +282,7 @@ def __init__( hidden_dropout=model_cfg.hidden_dropout, attention_dropout=model_cfg.attention_dropout, drop_path_rate=model_cfg.drop_path_rate, + layerscale=model_cfg.get("layerscale", False), bias_activation_fusion=model_cfg.get("bias_activation_fusion", False), persist_layer_norm=model_cfg.persist_layer_norm, openai_gelu=model_cfg.openai_gelu, @@ -298,6 +299,36 @@ def set_input_tensor(self, input_tensor): """See megatron.model.transformer.set_input_tensor()""" self.transformer.set_input_tensor(input_tensor) + def interpolate_pos_encoding( + self, x, + ): + output_seq_len = x.shape[1] + assert isPerfectSquare(output_seq_len - self.class_token_length) + + num_tok_output = output_seq_len - self.class_token_length + num_tok_input = self.num_patches + + if num_tok_input == num_tok_output: + return self.position_embeddings + + embed_tok = self.position_embeddings[: 
self.class_token_length] + embed_grid = self.position_embeddings[self.class_token_length :] + + gs_new = int(math.sqrt(num_tok_output)) + gs_input = (self.num_patches_per_dim_h, self.num_patches_per_dim_w) + + embed_grid = embed_grid.transpose(0, 1).contiguous() + embed_grid = embed_grid.reshape((1, -1, gs_input[0], gs_input[1])) + embed_grid = embed_grid.float() + scale_factor = (gs_new / gs_input[0], gs_new / gs_input[1]) + + embed_grid = F.interpolate(embed_grid, scale_factor=scale_factor, mode="bicubic") + + embed_grid = embed_grid.reshape((-1, num_tok_output)) + embed_grid = embed_grid.transpose(0, 1).contiguous() + + return torch.cat((embed_tok, embed_grid), dim=0) + def forward(self, input): if self.pre_process: @@ -318,7 +349,7 @@ def forward(self, input): self.position_ids[:, : concatenated_tokens.shape[1]] ) elif self.position_embedding_type == "learned_parameters": - token_embeddings = concatenated_tokens + self.position_embeddings + token_embeddings = concatenated_tokens + self.interpolate_pos_encoding(concatenated_tokens) else: raise ValueError(f"Unrecognized position embedding type: {self.position_embedding_type}.") diff --git a/nemo/utils/callbacks/__init__.py b/nemo/utils/callbacks/__init__.py index 6623657a2dc2..6992ce751ed5 100644 --- a/nemo/utils/callbacks/__init__.py +++ b/nemo/utils/callbacks/__init__.py @@ -12,5 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. +from nemo.utils.callbacks.cuda_graph import CUDAGraphCallback from nemo.utils.callbacks.nemo_model_checkpoint import NeMoModelCheckpoint from nemo.utils.callbacks.preemption import PreemptionCallback diff --git a/nemo/utils/callbacks/cuda_graph.py b/nemo/utils/callbacks/cuda_graph.py new file mode 100644 index 000000000000..ba6046b79850 --- /dev/null +++ b/nemo/utils/callbacks/cuda_graph.py @@ -0,0 +1,385 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass +from types import MethodType +from typing import Any, Dict + +import pytorch_lightning as pl +import torch +from pytorch_lightning.callbacks import Callback +from pytorch_lightning.loops.optimization.automatic import ClosureResult +from pytorch_lightning.utilities.rank_zero import rank_zero_info +from pytorch_lightning.utilities.signature_utils import is_param_in_hook_signature +from pytorch_lightning.utilities.types import STEP_OUTPUT +from torch.nn.parallel import DistributedDataParallel + +__all__ = ["CUDAGraphCallback"] + + +def struct_copy_one(src): + if isinstance(src, tuple): + return tuple(struct_copy_one(i) for i in src) + elif isinstance(src, list): + return list(struct_copy_one(i) for i in src) + elif isinstance(src, dict): + return {k: struct_copy_one(src[k]) for k in src} + elif isinstance(src, torch.Tensor): + return src.clone().detach().cuda() + else: + return src + + +def struct_copy_two(tgt, src): + if isinstance(src, tuple): + raise Exception(f"Unsupported copy for tuple yet: {type(src)}") + elif isinstance(src, list): + for i in range(len(src)): + if isinstance(src[i], (tuple, list, dict, torch.Tensor)): + struct_copy_two(tgt[i], src[i]) + else: + tgt[i] = src[i] + elif isinstance(src, dict): + for k in src: + if isinstance(src[k], (tuple, list, dict, torch.Tensor)): + struct_copy_two(tgt[k], src[k]) + else: + tgt[k] = src[k] + elif 
isinstance(src, torch.Tensor): + tgt.copy_(src, non_blocking=True) + else: + raise Exception(f"Expect top-level as container type but got: {type(src)}") + + +class StaticBufferLoader: + """Load data to static buffers.""" + + def __init__(self, loader): + self.loader = loader + self.stream = torch.cuda.Stream() + self.static = None + + def __iter__(self): + for inputs in self.loader: + if self.static is None: + with torch.cuda.stream(self.stream): + self.static = struct_copy_one(inputs) + + with torch.cuda.stream(self.stream): + struct_copy_two(self.static, inputs) + torch.cuda.current_stream().wait_stream(self.stream) + yield self.static + + def __len__(self): + return len(self.loader) + + +def get_lr(lr_scheduler): + lrs = lr_scheduler.__orig_get_lr__() + if not hasattr(lr_scheduler, "static_lrs"): + lr_scheduler.static_lrs = lrs + for i in range(len(lrs)): + lr_scheduler.static_lrs[i].copy_(lrs[i]) + return lr_scheduler.static_lrs + + +def zero_grad(optimizer, *args, **kwargs): + # We invoke zero_grad before graph capturing. + if torch.cuda.is_current_stream_capturing(): + rank_zero_info("CUDAGraphCallback: set optimizer.zero_grad as nop during graph capturing.") + else: + optimizer.__orig_zero_grad__(*args, **kwargs) + + +def get_optimizer_step(state): + def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None,) -> None: + # Not all optimizer supports set_to_none. 
+ if not hasattr(optimizer, "support_set_to_none"): + optimizer.support_set_to_none = is_param_in_hook_signature( + optimizer.zero_grad, "set_to_none", explicit=True + ) + if optimizer.support_set_to_none: + zero_grad_kwargs = {"set_to_none": True} + else: + zero_grad_kwargs = {} + + if 0 <= state.current_iteration < state.capture_iteration or state.capture_iteration < 0: + state.stream.wait_stream(torch.cuda.current_stream()) + with torch.cuda.stream(state.stream): + optimizer.zero_grad(**zero_grad_kwargs) + self.__orig_optimizer_step__( + epoch, batch_idx, optimizer, optimizer_closure=optimizer_closure, + ) + torch.cuda.current_stream().wait_stream(state.stream) + + if state.current_iteration == state.capture_iteration: + optimizer.zero_grad(**zero_grad_kwargs) + torch.cuda.synchronize() + # Sleep for one second to let environment stable + time.sleep(1) + rank_zero_info("CUDAGraphCallback: capturing CUDA graph for module %s.", self.__class__.__name__) + with torch.cuda.graph(state.graph, stream=state.stream): + self.__orig_optimizer_step__( + epoch, batch_idx, optimizer, optimizer_closure=optimizer_closure, + ) + torch.cuda.synchronize() + + # Graph replay and reconstruct missing result + if state.current_iteration >= state.capture_iteration >= 0: + state.graph.replay() + optimizer_closure._result = ClosureResult.from_training_step_output(state.output) + + # If something is not capturable, try to put it there, e.g. `self.log()`. + if hasattr(self, "non_cuda_graph_capturable"): + self.non_cuda_graph_capturable() + + state.current_iteration += 1 + + return optimizer_step + + +def get_training_step(state): + def training_step(self, batch, batch_idx): + results = self.__orig_training_step__(batch, batch_idx) + if state.output is None: + state.output = struct_copy_one(results) + + # Copy results to static buffer to rebuild states required by PL. 
+ with torch.no_grad(): + struct_copy_two(state.output, results) + return results + + return training_step + + +def get_amp_autocast_init(state): + def amp_autocast_init(self, *args, **kwargs): + if "cache_enabled" not in kwargs: + kwargs["cache_enabled"] = False + if state.current_iteration == 0: + rank_zero_info("CUDAGraphCallback: disable autocast cache.") + return self.__orig_init__(*args, **kwargs) + + return amp_autocast_init + + +def get_ddp_init(state): + def init(self, *args, **kwargs): + rank_zero_info("CUDAGraphCallback: init DDP on side stream.") + with torch.cuda.stream(state.stream): + self.__orig_init__(*args, **kwargs) + + return init + + +@dataclass +class CUDAGraphState: + current_iteration: int = 0 + capture_iteration: int = -1 # -1 to disable + stream: torch.cuda.Stream = None + graph: torch.cuda.CUDAGraph = None + output: Any = None # static forward output + + +class CUDAGraphCallback(Callback): + """Full iteration CUDA graph callback. + + Dataloader and LR scheduler are not included in the CUDA graph with this callback. + """ + + def __init__(self, capture_iteration=-1): + super().__init__() + + # Required by CUDA graph with DDP + # Ref: https://pytorch.org/docs/stable/notes/cuda.html#usage-with-distributeddataparallel + if 0 <= capture_iteration <= 11: + raise Exception("Warmup must run at least 11 DDP-enabled eager iterations before capture.") + if torch.distributed.is_initialized(): + raise Exception("CUDAGraphCallback should be initialized before process group.") + os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0" + + self.state = CUDAGraphState(capture_iteration=capture_iteration) + + def setup(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", stage: str) -> None: + """Called when fit, validate, test, predict, or tune begins.""" + if self.state.capture_iteration < 0: + return + + # Hack to avoid CUDA graph issue with AMP, PyTorch Lightning doesn't support + # changing autocast arguments for now. 
+ # https://github.com/pytorch/pytorch/blob/v1.13.1/torch/cuda/graphs.py#L234 + torch.autocast.__orig_init__ = torch.autocast.__init__ + torch.autocast.__init__ = get_amp_autocast_init(self.state) + + # Before full-backward capture, DDP must be constructed in a side-stream context. + # We've merged the change that init DDP on side stream to PyTorch Lightning V2, + # but not all user defined strategy init DDP on side stream. + DistributedDataParallel.__orig_init__ = DistributedDataParallel.__init__ + DistributedDataParallel.__init__ = get_ddp_init(self.state) + + def teardown(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", stage: str) -> None: + """Called when fit, validate, test, predict, or tune ends.""" + if self.state.capture_iteration < 0: + return + + torch.autocast.__init__ = torch.autocast.__orig_init__ + del torch.autocast.__orig_init__ + + DistributedDataParallel.__init__ = DistributedDataParallel.__orig_init__ + del DistributedDataParallel.__orig_init__ + + def on_fit_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when fit begins.""" + if self.state.capture_iteration < 0: + return + + if is_param_in_hook_signature(pl_module, "dataloader_iter", explicit=True): + raise Exception( + "Found `dataloader_iter` argument in the `training_step`. This is " + "not supported by full iteration CUDA graph capturing yet since " + "dataloader will be within the CUDA graph capturing range.\n" + "Try to change `dataloader_iter` to `batch` and remove " + "`next(dataloader_iter)` from `training_step`." 
+ ) + + # Now that CUDA device has been set, we can init stream and graph now + self.state.stream = torch.cuda.Stream() + self.state.graph = torch.cuda.CUDAGraph() + + def on_fit_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when fit ends.""" + if self.state.capture_iteration < 0: + return + + def on_train_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when the train begins.""" + if self.state.capture_iteration < 0: + return + + # Ensure training dataloader loads data to static buffer + dataloader = trainer.train_dataloader + assert isinstance( + dataloader, torch.utils.data.dataloader.DataLoader + ), f"Expect Dataloader type but got {type(dataloader)}" + trainer.train_dataloader.__orig_dataloader__ = dataloader + static_loader = StaticBufferLoader(dataloader) + trainer.train_dataloader.loaders = static_loader + + # Warn if `optimizer.zero_grad()` invoked during graph capturing + for optimizer in trainer.optimizers: + assert isinstance(optimizer, torch.optim.Optimizer), f"Expect Optimizer type but got {type(optimizer)}" + optimizer.__orig_zero_grad__ = optimizer.zero_grad + optimizer.zero_grad = MethodType(zero_grad, optimizer) + + # Ensure LR scheduler writes to static buffer + # We don't include LR scheduler in the full CUDA graph for now since + # its overhead is very small. 
+ for config in trainer.lr_scheduler_configs: + assert isinstance( + config.scheduler, torch.optim.lr_scheduler._LRScheduler + ), f"Expect _LRScheduler type but got {type(config.scheduler)}" + config.scheduler.__orig_get_lr__ = config.scheduler.get_lr + config.scheduler.get_lr = MethodType(get_lr, config.scheduler) + + # Save model outputs to static buffer so PL states can be reconstructed + pl_module.__orig_training_step__ = pl_module.training_step + training_step = get_training_step(self.state) + pl_module.training_step = MethodType(training_step, pl_module) + + # Capture CUDA graph from model forward propagation to optimizer step + pl_module.__orig_optimizer_step__ = pl_module.optimizer_step + optimizer_step = get_optimizer_step(self.state) + pl_module.optimizer_step = MethodType(optimizer_step, pl_module) + + def on_train_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when the train ends.""" + if self.state.capture_iteration < 0: + return + + dataloader = trainer.train_dataloader.__orig_dataloader__ + trainer.train_dataloader.loaders = dataloader + del trainer.train_dataloader.__orig_dataloader__ + + for optimizer in trainer.optimizers: + optimizer.zero_grad = optimizer.__orig_zero_grad__ + del optimizer.__orig_zero_grad__ + + for config in trainer.lr_scheduler_configs: + config.scheduler.get_lr = config.scheduler.__orig_get_lr__ + del config.scheduler.__orig_get_lr__ + + pl_module.training_step = pl_module.__orig_training_step__ + del pl_module.__orig_training_step__ + + pl_module.optimizer_step = pl_module.__orig_optimizer_step__ + del pl_module.__orig_optimizer_step__ + + def on_train_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when the train epoch begins.""" + pass + + def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None: + """Called when the train epoch ends. + + To access all batch outputs at the end of the epoch, either: + + 1.
Implement `training_epoch_end` in the `LightningModule` and access outputs via the module OR + 2. Cache data across train batch hooks inside the callback implementation to post-process in this hook. + """ + pass + + def on_train_batch_start( + self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", batch: Any, batch_idx: int + ) -> None: + """Called when the train batch begins.""" + pass + + def on_train_batch_end( + self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", outputs: STEP_OUTPUT, batch: Any, batch_idx: int + ) -> None: + """Called when the train batch ends. + + Note: + The value ``outputs["loss"]`` here will be the normalized value w.r.t ``accumulate_grad_batches`` of the + loss returned from ``training_step``. + """ + pass + + def on_save_checkpoint( + self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", checkpoint: Dict[str, Any] + ) -> None: + r""" + Called when saving a checkpoint to give you a chance to store anything else you might want to save. + + Args: + trainer: the current :class:`~pytorch_lightning.trainer.Trainer` instance. + pl_module: the current :class:`~pytorch_lightning.core.module.LightningModule` instance. + checkpoint: the checkpoint dictionary that will be saved. + """ + # Since we've add bound method to optimizer and lr_scheduler, it can lead to more + # CUDA tensors passed to consumer process unexpectedly. 
+ if "optimizer_states" in checkpoint: + for optimizer_state in checkpoint["optimizer_states"]: + for k in list(optimizer_state.keys()): + v = optimizer_state[k] + if isinstance(v, MethodType) and hasattr(v, "__self__"): + del optimizer_state[k] + if "lr_schedulers" in checkpoint: + for lr_scheduler in checkpoint["lr_schedulers"]: + for k in list(lr_scheduler.keys()): + v = lr_scheduler[k] + if isinstance(v, MethodType) and hasattr(v, "__self__"): + del lr_scheduler[k] diff --git a/requirements/requirements_docs.txt b/requirements/requirements_docs.txt index 34406bd2a366..8412c67d4ab2 100644 --- a/requirements/requirements_docs.txt +++ b/requirements/requirements_docs.txt @@ -1,3 +1,4 @@ +boto3 Jinja2<3.1 latexcodec numpy diff --git a/scripts/diffusion_model_lora_merge/conf/merge_lora_weights.yaml b/scripts/diffusion_model_lora_merge/conf/merge_lora_weights.yaml new file mode 100644 index 000000000000..b13bfc68cd6d --- /dev/null +++ b/scripts/diffusion_model_lora_merge/conf/merge_lora_weights.yaml @@ -0,0 +1,16 @@ +name: stable-diffusion-train + +trainer: + devices: 1 + num_nodes: 1 + accelerator: gpu + precision: 16 + logger: False # logger provided by exp_manager + +model: + restore_from_path: null + precision: ${trainer.precision} + +lora_model_path: null +lora_scale: 1.0 +merged_model_path: null \ No newline at end of file diff --git a/scripts/diffusion_model_lora_merge/merge_lora_weights_into_base_model.py b/scripts/diffusion_model_lora_merge/merge_lora_weights_into_base_model.py new file mode 100644 index 000000000000..57d9964cad3d --- /dev/null +++ b/scripts/diffusion_model_lora_merge/merge_lora_weights_into_base_model.py @@ -0,0 +1,85 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import tempfile +from typing import Any, Dict + +import torch +from pytorch_lightning import Trainer + +from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import MegatronLatentDiffusion +from nemo.collections.multimodal.parts.utils import setup_trainer_and_model_for_inference +from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector +from nemo.core.config import hydra_runner + + +def load_lora(lora_nemo): + + with tempfile.TemporaryDirectory() as tmpdir: + NLPSaveRestoreConnector._unpack_nemo_file(lora_nemo, tmpdir) + # assert os.path.isdir(lora_extracted_dir), "requires the untar'ed the lora .nemo file" + + ckpt_file = f"{tmpdir}/model_weights.ckpt" + + lora_state_dict = torch.load(ckpt_file, map_location=torch.device('cpu')) + return lora_state_dict + + +def merge(base_model_state_dict: Dict[str, Any], lora_state_dict: Dict[int, Any], lora_scale=1.0): + + for key in lora_state_dict.keys(): + if 'linear_out' in key: + continue + key_lora_in = key + key_lora_out = key.replace('linear_in', 'linear_out') + key_base_model = key.replace('.adapter_layer.parallel_linear_adapter.linear_in', '').replace('._orig_mod', '') + + wt_lora_in = lora_state_dict[key_lora_in] + wt_lora_out = lora_state_dict[key_lora_out] + wt_base_model = base_model_state_dict[key_base_model] + + wt_lora = wt_lora_out @ wt_lora_in + base_model_state_dict[key_base_model] = wt_base_model + wt_lora * lora_scale + print(f"merging weights for following key : {key_base_model}") + return base_model_state_dict + + 
+@hydra_runner(config_path="conf", config_name="merge_lora_weights") +def main(cfg) -> None: + trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer) + + def model_cfg_modifier(model_cfg): + model_cfg.precision = cfg.trainer.precision + model_cfg.ckpt_path = None + model_cfg.inductor = False + model_cfg.unet_config.use_flash_attention = False + model_cfg.unet_config.from_pretrained = None + model_cfg.first_stage_config.from_pretrained = None + + trainer, megatron_diffusion_model = setup_trainer_and_model_for_inference( + model_provider=MegatronLatentDiffusion, cfg=cfg, model_cfg_modifier=model_cfg_modifier + ) + model = megatron_diffusion_model.cpu() + lora_weights = load_lora(cfg.lora_model_path) + + merged_weights = merge(model.state_dict(), lora_weights, lora_scale=cfg.lora_scale) + + model.load_state_dict(merged_weights) + + model.save_to(cfg.merged_model_path) + print(f"saved merged model to {cfg.merged_model_path}") + + +if __name__ == '__main__': + main() diff --git a/scripts/fid-eval-text2img/compute_clip_score.py b/scripts/fid-eval-text2img/compute_clip_score.py new file mode 100644 index 000000000000..ab573f76c67f --- /dev/null +++ b/scripts/fid-eval-text2img/compute_clip_score.py @@ -0,0 +1,149 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +python compute_clip_score.py --captions_path /path/to/coco2014_val/captions \ + --fid_images_path /path/to/synthetic_images \ + --output_path /path/to/output/clip_scores.csv + +1. `--captions_path`: The path to the directory of captions for the real images. In this example, + it is set to `/path/to/coco2014_val/captions`. This path should point to the + directory containing the COCO 2014 validation dataset captions. + +2. `--fid_images_path`: The path to the directory containing subfolders with synthetic + images. In this example, it is set to `/path/to/synthetic_images`. Each subfolder + should contain a set of synthetic images for which you want to compute CLIP scores + against the captions from `--captions_path`. + +3. `--output_path`: The path to the output CSV file where the CLIP scores will be saved. + In this example, it is set to `/path/to/output/clip_scores.csv`. This file will + contain a table with two columns: `cfg` and `clip_score`. The `cfg` + column lists the names of the subfolders in `--fid_images_path`, and the + `clip_score` column lists the corresponding average CLIP scores between the synthetic + images in each subfolder and the captions from `--captions_path`.
+""" + +import argparse +import csv +import os +from glob import glob + +import open_clip +import torch +import torch.nn as nn +from PIL import Image +from tqdm import tqdm + + +class CLIPEncoder(nn.Module): + def __init__(self, clip_version='ViT-B/32', pretrained='', cache_dir=None, device='cuda'): + super().__init__() + + self.clip_version = clip_version + if not pretrained: + if self.clip_version == 'ViT-H-14': + self.pretrained = 'laion2b_s32b_b79k' + elif self.clip_version == 'ViT-g-14': + self.pretrained = 'laion2b_s12b_b42k' + else: + self.pretrained = 'openai' + + self.model, _, self.preprocess = open_clip.create_model_and_transforms( + self.clip_version, pretrained=self.pretrained, cache_dir=cache_dir + ) + + self.model.eval() + self.model.to(device) + + self.device = device + + @torch.no_grad() + def get_clip_score(self, text, image): + if isinstance(image, str): # filenmae + image = Image.open(image) + if isinstance(image, Image.Image): # PIL Image + image = self.preprocess(image).unsqueeze(0).to(self.device) + image_features = self.model.encode_image(image).float() + image_features /= image_features.norm(dim=-1, keepdim=True) + + if not isinstance(text, (list, tuple)): + text = [text] + text = open_clip.tokenize(text).to(self.device) + text_features = self.model.encode_text(text).float() + text_features /= text_features.norm(dim=-1, keepdim=True) + similarity = image_features @ text_features.T + + return similarity + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--captions_path', default='/coco2014/coco2014_val_sampled_30k/captions/', type=str) + parser.add_argument('--fid_images_path', default=None, type=str) + parser.add_argument('--output_path', default='./clip_scores.csv', type=str) + parser.add_argument('--clip_version', default='ViT-L-14', type=str) + args = parser.parse_args() + + # Initialize distributed training + torch.distributed.init_process_group(backend='nccl') + 
torch.cuda.set_device(int(os.environ['LOCAL_RANK'])) + + captions_path = args.captions_path + print('Init CLIP Encoder..') + encoder = CLIPEncoder(clip_version=args.clip_version) + + # Create output CSV file + with open(args.output_path, 'w', newline='') as csvfile: + fieldnames = ['cfg', 'clip_score'] + writer = csv.DictWriter(csvfile, fieldnames=fieldnames) + writer.writeheader() + + # Iterate through subfolders in fid_images_path + for subfolder in os.listdir(args.fid_images_path): + subfolder_path = os.path.join(args.fid_images_path, subfolder) + if os.path.isdir(subfolder_path): + images = sorted( + glob(f'{subfolder_path}/*.png'), key=lambda x: (int(x.split('/')[-1].strip('.png').strip('image'))) + ) + texts = sorted(glob(f'{captions_path}/*.txt')) + print(images[:5], texts[:5]) + # this enables computing clip on the smaller images set + texts = texts[: len(images)] + assert len(images) == len(texts) + print(f'Number of images text pairs: {len(images)}') + + imgs = torch.utils.data.DataLoader( + images, sampler=torch.utils.data.distributed.DistributedSampler(images) + ) + txts = torch.utils.data.DataLoader( + texts, sampler=torch.utils.data.distributed.DistributedSampler(texts) + ) + + ave_sim = torch.tensor(0.0).cuda() + count = 0 + for text, img in zip(tqdm(txts), imgs): + with open(text[0], 'r') as f: + text = f.read().strip() + sim = encoder.get_clip_score(text, img[0]) + ave_sim += sim[0, 0] + count += 1 + if count % 2000 == 0: + print(ave_sim / count) + + torch.distributed.all_reduce(ave_sim) + ave_sim /= len(images) + print(f'The CLIP similarity for CFG {subfolder}: {ave_sim}') + + # Write CLIP score to output CSV file + writer.writerow({'cfg': subfolder, 'clip_score': float(ave_sim)}) diff --git a/scripts/fid-eval-text2img/plot.py b/scripts/fid-eval-text2img/plot.py new file mode 100644 index 000000000000..fbdd9b0c0e42 --- /dev/null +++ b/scripts/fid-eval-text2img/plot.py @@ -0,0 +1,80 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +python plot.py \ + --fid_scores_csv path/to/fid_scores.csv \ + --clip_scores_csv path/to/clip_scores.csv +Replace path/to/fid_scores.csv and path/to/clip_scores.csv with the paths +to the respective CSV files. The script will display the plot with FID +scores against CLIP scores, with cfg values annotated on each point. +""" + +import argparse + +import matplotlib.pyplot as plt +import pandas as pd + + +def plot_fid_vs_clip(fid_scores_csv, clip_scores_csv, ax, label): + fid_scores = pd.read_csv(fid_scores_csv) + clip_scores = pd.read_csv(clip_scores_csv) + merged_data = pd.merge(fid_scores, clip_scores, on='cfg').sort_values('cfg') + merged_data.index = range(len(merged_data)) + + ax.plot( + merged_data['clip_score'], merged_data['fid'], marker='o', linestyle='-', label=label + ) # Connect points with a line + + for i, txt in enumerate(merged_data['cfg']): + ax.annotate(txt, (merged_data['clip_score'][i], merged_data['fid'][i])) + + ax.set_xlabel('CLIP Score') + ax.set_ylabel('FID') + ax.set_title('FID vs CLIP Score') + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument( + '--fid_scores_csv', nargs='+', required=True, type=str, help='Paths to the FID scores CSV files' + ) + parser.add_argument( + '--clip_scores_csv', nargs='+', required=True, type=str, help='Paths to the CLIP scores CSV files' + ) + parser.add_argument( + '--labels', nargs='+', required=False,
type=str, help='If provided, curves will be named with these names' + ) + parser.add_argument( + '--save_plot_path', required=False, type=str, help='If provided, the plot will be stored at this path' + ) + args = parser.parse_args() + + if not args.labels: + args.labels = [None] * len(args.fid_scores_csv) + + assert len(args.fid_scores_csv) == len(args.clip_scores_csv) == len(args.labels), ( + len(args.fid_scores_csv), + len(args.clip_scores_csv), + len(args.labels), + ) + + fig, ax = plt.subplots() + + for fid, clip, label in zip(args.fid_scores_csv, args.clip_scores_csv, args.labels): + plot_fid_vs_clip(fid, clip, ax, label) + + if args.save_plot_path: + plt.savefig(args.save_plot_path) + plt.show() diff --git a/tutorials/multimodal/DreamBooth Tutorial.ipynb b/tutorials/multimodal/DreamBooth Tutorial.ipynb new file mode 100644 index 000000000000..8651b55d6308 --- /dev/null +++ b/tutorials/multimodal/DreamBooth Tutorial.ipynb @@ -0,0 +1,273 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a428c1bf", + "metadata": {}, + "source": [ + "# DreamBooth Training / Inference Tutorial\n", + "\n", + "### Note:\n", + "\n", + "Currently, this notebook must be run in a NeMo container. An example command to launch the container:\n", + "\n", + "```bash\n", + "docker run --runtime=nvidia --gpus all -it --rm -v :/opt/NeMo --shm-size=8g -p 8888:8888 \\\n", + " --ulimit memlock=-1 --ulimit stack=67108864 \n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "5af7015b", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "This guide walks you through the process of training DreamBooth in NeMo. [DreamBooth](https://arxiv.org/abs/2208.12242) was originally developed by Google using Imagen as the base model. NeMo's implementation, however, is based on Stable Diffusion. We'll cover the following topics in this tutorial:\n", + "\n", + "1. Downloading and setting up the dataset along with the pretrained Stable Diffusion checkpoints.\n", + "2.
Conducting training using either the online encoding or pre-cached latents.\n", + "3. Running inference using the fine-tuned model weights." + ] + }, + { + "cell_type": "markdown", + "id": "f99ec204", + "metadata": {}, + "source": [ + "## Prepare Dataset And Checkpoint\n", + "DreamBooth finetunes a pretrained diffusion model using images of a particular object. Sample datasets can be accessed at https://github.com/google/dreambooth. To demonstrate that, we'll use the dataset/dog6 dataset. Here is an example to download the images into a specified directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac481796", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a dataset directory and download instance images\n", + "\n", + "import os\n", + "import wget\n", + "DATA_DIR = '/datasets/instance_dir'\n", + "os.makedirs(DATA_DIR, exist_ok=True)\n", + "\n", + "urls = [f'https://github.com/google/dreambooth/blob/main/dataset/dog6/0{i}.jpg?raw=true' for i in range(5)]\n", + "\n", + "for i, url in enumerate(urls):\n", + " wget.download(url, out=f'{DATA_DIR}/image0{i}.jpg')" + ] + }, + { + "cell_type": "markdown", + "id": "f76fb588", + "metadata": {}, + "source": [ + "Following that, we'll retrieve the pretrained Stable Diffusion checkpoints from Huggingface as our starting point. Two checkpoints are essential: one for the pretrained U-Net weights and the other for the Variational Auto Encoder (VAE)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f9a134b", + "metadata": {}, + "outputs": [], + "source": [ + "CKPT_DIR = '/ckpts'\n", + "os.makedirs(CKPT_DIR, exist_ok=True)\n", + "\n", + "# Multiple versions of open-source Stable Diffusion checkpoints are available on Hugging Face. Below are the links to the U-Net and VAE checkpoints of Stable Diffusion v1.5 at https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main\n", + "unet_url = 'https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/unet/diffusion_pytorch_model.bin' \n", + "vae_url = 'https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/vae/diffusion_pytorch_model.bin'\n", + "\n", + "wget.download(unet_url, out=f'{CKPT_DIR}/unet.bin')\n", + "wget.download(vae_url, out=f'{CKPT_DIR}/vae.bin')" + ] + }, + { + "cell_type": "markdown", + "id": "6af02b73", + "metadata": {}, + "source": [ + "## Model Config Setup\n", + "\n", + "An example configuration file is readily available at `/opt/NeMo/examples/multimodal/generative/dreambooth/conf/dreambooth.yaml`. We'll kick off with the simplest scenario, finetuning the model using a single node and a single GPU. First, let's review the default configuration file." + ] + }, + { + "cell_type": "markdown", + "id": "c0b99def", + "metadata": {}, + "source": [ + "The trainer section defines the number of nodes, the number of GPUs, and the precision used in mixed-precision training.\n", + "```yaml\n", + "config.trainer.devices = 1 # number of GPUs\n", + "config.trainer.num_nodes = 1 # number of nodes\n", + "config.trainer.precision = 'bf16-mixed' # mixed precision training with bf16, other options are '16-mixed' (fp16) and '32-true' (tf32)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "6a8f8de5", + "metadata": {}, + "source": [ + "For the CLIP text encoder, we use the NeMo CLIP architecture, so the pretrained checkpoint needs to be converted from `open_clip` to NeMo format.
Here is an example of running the conversion script." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a54f4d1e", + "metadata": {}, + "outputs": [], + "source": [ + "! python /opt/NeMo/examples/multimodal/foundation/clip/convert_external_clip_to_nemo.py \\\n", + " --arch ViT-L-14 \\\n", + " --version openai \\\n", + " --hparams_file /opt/NeMo/examples/multimodal/foundation/clip/conf/megatron_clip_VIT-L-14.yaml \\\n", + " --nemo_file /ckpts/openai.nemo" + ] + }, + { + "cell_type": "markdown", + "id": "17d5411a", + "metadata": {}, + "source": [ + "After all checkpoints are ready, ensure that the dataset and checkpoints you have prepared are correctly referenced in the configuration file.\n", + "1. `/ckpts/unet.bin` goes to ``model.unet_config.from_pretrained``.\n", + "2. `/ckpts/vae.bin` goes to ``model.first_stage_config.from_pretrained``.\n", + "3. `/ckpts/openai.nemo` goes to ``model.cond_stage_config.restore_from_path``.\n", + "4. `/datasets/instance_dir` goes to ``model.data.instance_dir``." + ] + }, + { + "cell_type": "markdown", + "id": "1ac9fd48", + "metadata": {}, + "source": [ + "## Model Training" + ] + }, + { + "cell_type": "markdown", + "id": "42d7ddd5", + "metadata": {}, + "source": [ + "By default, the images from `instance_dir` are pre-processed with the VAE checkpoint from our earlier steps so that latents can be loaded directly during training, which offers a 75% increase in performance. You can disable this feature by adding `model.use_cached_latents=False` to the following command." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed75fa5b", + "metadata": {}, + "outputs": [], + "source": [ + "## Example command for running DreamBooth training\n", + "! 
python /opt/NeMo/examples/multimodal/generative/dreambooth/dreambooth.py \\\n", + " model.unet_config.from_pretrained=/ckpts/unet.bin \\\n", + " model.unet_config.from_NeMo=False \\\n", + " model.first_stage_config.from_pretrained=/ckpts/vae.bin \\\n", + " model.data.instance_dir=/datasets/instance_dir \\\n", + " model.data.instance_prompt='a photo of a sks dog' " + ] + }, + { + "cell_type": "markdown", + "id": "51296ba9", + "metadata": {}, + "source": [ + "`model.data.instance_prompt` is where we set the special token associated with the particular object we want to inject into the model. Here we use a random word 'sks'. This unique token is a rare combination of letters that is unlikely to be found in typical captions, making it suitable to act as an identifier for the injected object.\n", + "\n", + "\n", + "The experiment results are stored in ``./nemo_experiments``. However, you can designate a specific log directory using ``exp_manager.explicit_log_dir``. Checkpoints are saved in ``./nemo_experiments/Dreambooth/checkpoints``. After training, the final weights are automatically converted to the ``.nemo`` format, which is the recommended format for inference.\n", + "\n", + "## Model Inference" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ddbc863", + "metadata": {}, + "outputs": [], + "source": [ + "## This is the example command for running DreamBooth inference\n", + "! 
torchrun /opt/NeMo/examples/multimodal/generative/dreambooth/dreambooth_infer.py \\\n", + " model.restore_from_path='/opt/NeMo/tutorials/multimodal/nemo_experiments/Dreambooth/checkpoints/Dreambooth.nemo' \\\n", + " infer.num_images_per_prompt=4 \\\n", + " infer.inference_steps=50 \\\n", + " infer.out_path='./DreamBooth_output' \\\n", + " infer.prompts='a photo of a sks dog sleeping'\n" + ] + }, + { + "cell_type": "markdown", + "id": "76c61577", + "metadata": {}, + "source": [ + "### Example of Inference Output" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "77eede7a", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from PIL import Image\n", + "import matplotlib.pyplot as plt\n", + "\n", + "x0 = Image.open('/datasets/instance_dir/image01.jpg')\n", + "x1 = Image.open('./DreamBooth_output/a photo of a sks dog sleeping_2.png')\n", + "\n", + "fig, (ax1, ax2) = plt.subplots(1, 2)\n", + "ax1.imshow(x0)\n", + "ax1.axis('off')\n", + "ax1.set_title('Source image of a \"sks\" dog')\n", + "ax2.imshow(x1)\n", + "ax2.axis('off')\n", + "ax2.set_title('\"sks\" dog generated from DreamBooth')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/multimodal/Multimodal Data Preparation.ipynb b/tutorials/multimodal/Multimodal Data Preparation.ipynb new file mode 100644 index 000000000000..6fdc7da8c2fb --- /dev/null +++ b/tutorials/multimodal/Multimodal Data Preparation.ipynb @@ -0,0 +1,670 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Multimodal Dataset Preparation\n", + "\n", + "First step of pre-training any deep learning model is data preparation. This notebook will walk you through 5 stages of data preparation for training a multimodal model: \n", + "1. Download your Data\n", + "2. Extract Images and Text\n", + "3. Re-organize to ensure uniform text-image pairs\n", + "4. Precache Encodings\n", + "5. Generate Metadata required for training\n", + "\n", + "This notebook will show you how to prepare an image-text dataset into the [WebDataset](https://github.com/webdataset/webdataset) format. 
The WebDataset format is required to train all multimodal models in NeMo, such as Stable Diffusion and Imagen. \n", + "\n", + "This notebook is designed to demonstrate the different stages of multimodal dataset preparation. It is not meant to be used to process large-scale datasets since many stages are too time-consuming to run without parallelism. For large workloads, we recommend running the multimodal dataset preparation pipeline with the NeMo-Megatron-Launcher on multiple processors/GPUs. NeMo-Megatron-Launcher packs the same 5 scripts in this notebook into one runnable command and one config file to enable a smooth, streamlined workflow.\n", + "\n", + "Depending on your use case, not all 5 stages need to be run. Please go to (TODO doc link) for an overview of the 5 stages.\n", + " \n", + "We will use a [dummy dataset](https://huggingface.co/datasets/cuichenx/dummy-image-text-dataset) as the dataset example throughout this notebook. This dataset is formatted as a table with one column storing the text captions, and one column storing the URL link to download the corresponding image. This is the same format as most common text-image datasets. The use of this dummy dataset is for demonstration purposes only. **Each user is responsible for checking the content of the dataset and the applicable licenses to determine if it is suitable for the intended use.**\n", + "\n", + "Let's first set up some paths." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "693d0fcd", + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\"\n", + "You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. 
Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "! pip install img2dataset\n", + "! pip uninstall -y opencv-python-headless\n", + "! pip install opencv-python==4.8.0.74 # https://github.com/opencv/opencv-python/issues/884\n", + "\n", + "### Install NeMo\n", + "BRANCH = 'main'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "c06f3527", + "metadata": {}, + "source": [ + "# Multimodal Dataset Preparation\n", + "\n", + "This notebook will show you how to prepare an image-text dataset into the [WebDataset](https://github.com/webdataset/webdataset) format. The WebDataset format is required to train all multimodal models in NeMo, such as Stable Diffusion and Imagen. \n", + "\n", + "This notebook is designed to demonstrate the different stages of multimodal dataset preparation. It is not meant to be used to process large-scale datasets since many stages are too time-consuming to run without parallelism. For large workloads, we recommend running the multimodal dataset preparation pipeline with the NeMo-Megatron-Launcher on multiple processors/GPUs. NeMo-Megatron-Launcher packs the same 5 scripts in this notebook into one runnable command and one config file to enable a smooth, streamlined workflow.\n", + "\n", + "Depending on your use case, not all 5 stages need to be run. Please go to (TODO doc link) for an overview of the 5 stages.\n", + " \n", + "We will use a [dummy dataset](https://huggingface.co/datasets/cuichenx/dummy-image-text-dataset) as the dataset example throughout this notebook. 
This dataset is formatted as a table with one column storing the text captions, and one column storing the URL link to download the corresponding image. This is the same format as most common text-image datasets. The use of this dummy dataset is for demonstration purposes only. **Each user is responsible for checking the content of the dataset and the applicable licenses to determine if it is suitable for the intended use.**\n", + "\n", + "Let's first set up some paths." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bef3833e", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "LAUNCHER_DIR = \"/opt/NeMo-Megatron-Launcher\"\n", + "SCRIPT_DIR = os.path.join(LAUNCHER_DIR, \"launcher_scripts/nemo_launcher/collections/dataprep_scripts/multimodal_dataprep\")\n", + "CONF_DIR = \"conf\"\n", + "DATA_DIR = \"dummy_data\"\n", + "os.makedirs(CONF_DIR, exist_ok=True)\n", + "os.makedirs(DATA_DIR, exist_ok=True)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "25d76419", + "metadata": {}, + "source": [ + "\n", + "## Stage 1: Download Parquet Files from HuggingFace\n", + ">**Alternative workflows:**\n", + ">- **If your dataset is not hosted on HuggingFace or your dataset does not contain .parquet files, please move on to Stage 2**\n", + ">- **If you want to experiment with local image and text files, please see Appendix 1 for a tutorial to create a WebDataset from local images, then move on to Stage 3**\n", + ">- **If you have a dataset in the WebDataset format already and only want to precache the embeddings, please move on to Stage 4**\n", + "\n", + "In this stage, we download the raw data files (.parquet format) from HuggingFace. The parquet files should contain the text captions and the urls to download each image. 
We then optionally subpartition the parquet file so that the next stage can be parallelized more efficiently.\n", + "\n", + "Script: download_parquet.py\n", + "\n", + "Arguments:\n", + "- `dataset_repo_id`: huggingface dataset repo id, in the format of {user_or_company}/{dataset_name}. See [here](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) for a list of datasets on HuggingFace.\n", + "- `output_dir`: output of this stage\n", + "- `parquet_subpartitions`: increase the number of partitions to reduce the image downloading time (next stage) of each task. Useful if the next stage is parallelized over multiple tasks. We will use 3 for this example to keep the run time of subsequent stages short.\n", + "- `parquet_pattern`: `glob` pattern to use to find the parquet files. Defaults to `*.parquet` (all files that end with the extension)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a06a904b", + "metadata": {}, + "outputs": [], + "source": [ + "! python $SCRIPT_DIR/download_parquet.py \\\n", + " dataset_repo_id='cuichenx/dummy-image-text-dataset' \\\n", + " output_dir=$DATA_DIR/parquet \\\n", + " parquet_subpartitions=3 \\\n", + " parquet_pattern='*.parquet'" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "ebfd2565", + "metadata": {}, + "source": [ + "**Milestone**: You should now see 3 files in the output directory `$DATA_DIR/parquet/dummy_dataset50000.parquet_parts`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3ccc6d7e", + "metadata": {}, + "outputs": [], + "source": [ + "! 
ls $DATA_DIR/parquet/dummy_dataset50000.parquet_parts | wc -l\n", + "# should output 3" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "56d5dbb9", + "metadata": {}, + "source": [ + "\n", + "## Stage 2: Download Image Files\n", + ">**Alternative workflows:**\n", + ">- **If your dataset is not hosted on HuggingFace or your dataset does not contain .parquet files, please consult the README page of [img2dataset](https://github.com/rom1504/img2dataset) and modify the command to call img2dataset.**\n", + "\n", + "In this stage, we extract the images and texts from the parquet files into the WebDataset format using an open-source tool, [img2dataset](https://github.com/rom1504/img2dataset). \n", + "This stage will typically benefit from a large degree of parallelism (e.g. thousands of tasks). We will pretend there are 3 tasks running (3 was set in the previous stage), and only work on the first of the 3 parquet subpartitions (i.e. shards) in this notebook.\n", + "\n", + "Script: download_images.py\n", + "\n", + "Environment variables (automatically set by SLURM if running with NeMo-Megatron-Launcher):\n", + "- `SLURM_ARRAY_TASK_COUNT`: total number of tasks, should be set to the number of parquet files in `$DATA_DIR/parquet/dummy_dataset50000.parquet_parts` (i.e. `parquet_subpartitions` x `num_parquets_downloaded`)\n", + "- `SLURM_ARRAY_TASK_ID`: id of the current task (0 <= SLURM_ARRAY_TASK_ID < SLURM_ARRAY_TASK_COUNT)\n", + "\n", + "Arguments:\n", + "- `input_dir`: parquet download dir from the previous stage.\n", + "- `output_dir`: output of this stage\n", + "- `parquet_pattern`: see stage 1\n", + "- `download_num_processes`: number of processes to use. This should be set to the number of CPUs in the machine\n", + "- `download_num_threads`: number of threads to use. This should be tuned to balance CPU usage, internet bandwidth and disk bandwidth. 
\n", + "- `img2dataset_additional_arguments`: see [img2dataset](https://github.com/rom1504/img2dataset) for the complete list of parameters and [here](https://github.com/rom1504/img2dataset/tree/main/dataset_examples) for some examples. In this example, we use encode_quality=95 for JPEG compression quality, and resize_mode=no to keep the original images on disk. You can also override these arguments to suit your own needs: input_format (default is parquet), caption_col (default is TEXT), url_col (default is URL)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72d7e0f8", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# pretend that we're the first task out of 3 tasks\n", + "! SLURM_ARRAY_TASK_ID=0 SLURM_ARRAY_TASK_COUNT=3 python $SCRIPT_DIR/download_images.py \\\n", + " input_dir=$DATA_DIR/parquet \\\n", + " output_dir=$DATA_DIR/tarfiles_raw \\\n", + " parquet_pattern='*.parquet' \\\n", + " download_num_processes=2 \\\n", + " download_num_threads=16 \\\n", + " \"img2dataset_additional_arguments={{encode_quality:95,resize_mode:no}}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "Note: In this dummy dataset, you will likely see a success rate of 1.000 (no failures). However, for real datasets, the success rate will usually be lower than 1.000." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "27cf740d", + "metadata": {}, + "source": [ + "**Milestone**: You should now see tar files (along with other files) in the output directory `$DATA_DIR/tarfiles_raw/part.0.parquet`. Inside each tar file, you should see the .jpg image, the corresponding .txt caption, and .json metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b1b6477e", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "! 
ls $DATA_DIR/tarfiles_raw/part.0.parquet | head -n 6" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f8572f2", + "metadata": {}, + "outputs": [], + "source": [ + "! tar -tf $DATA_DIR/tarfiles_raw/part.0.parquet/00000.tar | tail -n 6" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "09fea68d", + "metadata": {}, + "source": [ + "## Stage 3: Reorganize Tarfiles to Same Number of Image-Text Pairs" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2d0cbfc0", + "metadata": {}, + "source": [ + "*Note: This stage is required to train multimodal models in NeMo.*\n", + "\n", + "In this stage, we reorganize the contents of tar files from the download_images step, so that the tar files are uniform\n", + "(i.e. each containing an equal number (usually 1000) of training examples (image-text pairs)).\n", + "The tar files created from the download_images step are not uniform, because there is always a portion of images\n", + "that fail to download or are no longer available.\n", + "Uniform tar files are important if a sequential sampler is used during training (i.e. not an infinite sampler).\n", + "Uniform tar files are also important for precaching because a sequential sampler is used there.\n", + "\n", + "Script: reorganize_tar.py\n", + "\n", + "Environment variables (automatically set by SLURM if running with NeMo-Megatron-Launcher):\n", + "- `SLURM_ARRAY_TASK_COUNT`: total number of tasks, should be set to parquet_subpartitions x num_parquets_downloaded\n", + "- `SLURM_ARRAY_TASK_ID`: id of the current task (0 <= `SLURM_ARRAY_TASK_ID` < `SLURM_ARRAY_TASK_COUNT`)\n", + "\n", + "Arguments:\n", + "- `input_dir`: image download dir from the previous stage.\n", + "- `output_dir`: output of this stage\n", + "- `file_ext_in_tar`: target file extensions in each tar file to transfer to the reorganized tar files. 
In this example, we have .jpg, .txt, and .json in the downloaded tar files, but we will only keep the image and text and discard the .json metadata.\n", + "- `tar_chunk_size`: number of training examples in each output tar file\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc3d1e60", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "! SLURM_ARRAY_TASK_ID=0 SLURM_ARRAY_TASK_COUNT=1 python $SCRIPT_DIR/reorganize_tar.py \\\n", + " input_dir=$DATA_DIR/tarfiles_raw \\\n", + " output_dir=$DATA_DIR/tarfiles_reorganized \\\n", + " tar_chunk_size=1000 \\\n", + " file_ext_in_tar=[.jpg,.txt]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "95c77097", + "metadata": {}, + "source": [ + "**Milestone**: You should now see tar files (along with other files) in the output directory `$DATA_DIR/tarfiles_reorganized/`. Inside each tar file, you should see exactly 1000 pairs of .jpg image and .txt caption, for a total of 2000 files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4495428f", + "metadata": {}, + "outputs": [], + "source": [ + "! tar -tf $DATA_DIR/tarfiles_reorganized/task0000/00001.tar | wc -l\n", + "# should output 2000" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "09b83dd7", + "metadata": {}, + "source": [ + "## Stage 4: Precache Encodings\n", + "\n", + ">**Alternative workflows:**\n", + ">- **If you're only testing out the NeMo text2image models and do not care about good training performance, you can skip this step and move on to Stage 5.**\n", + "\n", + "### General Format\n", + "\n", + "Precaching refers to the offline computation of image and text encodings prior to training a model. 
This technique\n", + "is suitable for any model that uses pretrained, frozen encoders during training.\n", + "By using precached encodings, embeddings for image and text do not need to be recomputed in each epoch,\n", + "thereby significantly improving training throughput (up to 60% higher).\n", + "Precached encodings are saved in the format of WebDataset.\n", + "Each tar file contains one pickle file to store all the modality embeddings for each training example. Optionally,\n", + "the tar file may also include the original image or text files\n", + "\n", + "```\n", + "t0_r0_0.tar\n", + "|---- 00000.pickle\n", + "|---- 00000.jpg (optional)\n", + "|---- 00000.txt (optional)\n", + "|---- 00001.pickle\n", + "|---- 00001.jpg (optional)\n", + "|---- 00001.txt (optional)\n", + "...\n", + "```\n", + "Each pickle file stores one python dictionary, with key value pairs storing the embedding name and the embedding as a\n", + "numpy array.\n", + "\n", + "### Precaching Config\n", + "Configuration for precaching can be extensive and intricate for some models. To maintain clarity and ensure an\n", + "organized workflow, we utilize a separate YAML file for these configurations. The YAML file looks like this:\n", + "\n", + "```\n", + "encodings:\n", + " - modality: image\n", + " extension: jpg\n", + " key: autoencoderkl_image\n", + " precision: 16\n", + " encoder_config:\n", + " cls: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL\n", + " ... (kwargs to initialize the encoder)\n", + " - modality: text\n", + " extension: txt\n", + " key: clip-vit-large-patch14_text\n", + " precision: 32\n", + " store_pad_tokens: True\n", + " encoder_config:\n", + " cls: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder\n", + " ... 
(kwargs to initialize the encoder)\n", + "```\n", + "\n", + "In this YAML file, the encodings field specifies a list of embeddings to be saved in the pickle file.\n", + "Each entry can have the following attributes:\n", + "\n", + "\n", + "- `modality`: either image or text\n", + "- `extension`: file extension for this modality in the tar file (e.g. 'jpg', 'txt')\n", + "- `key`: dictionary key for the encoding. It is recommended to follow the format `{model_name}-{model_variant}_{modality}`, if applicable. e.g. `clip-vit-large-patch14_text`\n", + "- `precision`: precision of the stored tensors (32 or 16)\n", + "- `store_pad_tokens`: Whether to store the PAD tokens. Not storing PAD tokens can significantly reduce disk usage, but the training script must account for this. Ignored for image modality.\n", + "- `encoder_config`: This dictionary must contain `cls` which points to the location of the encoder class. The rest of the parameters are treated as kwargs to initiate the encoder class.\n", + " - Note: the encoder class must implement an `encode` or `__call__` function. If `store_pad_tokens`, this function must return the encoded tensor. Otherwise, this function must return a tuple of (encoded_tensor, mask). The mask is needed so the script knows which tokens are pad tokens and should be ignored for caching. A mask value of 1 denotes regular tokens, and 0 denotes pad tokens.\n", + "\n", + "\n", + "Note that it is not required to have only one encoding per modality, in the case of multiple encoders.\n", + "The `encodings` field is designed as a list to account for this. For example, it's possible to have one image embedding\n", + "and two text embeddings (e.g. one from CLIP and one from T5) and both are used during training. 
An example is shown below.\n", + "\n", + "```\n", + "encodings:\n", + " - modality: image\n", + " extension: jpg\n", + " key: image_emb\n", + " encoder_config:\n", + " cls: path.to.ImageEncoder\n", + " ...\n", + " - modality: text\n", + " extension: txt\n", + " key: text_emb_1\n", + " encoder_config:\n", + " cls: path.to.TextEncoder1\n", + " ...\n", + " - modality: text\n", + " extension: txt\n", + " key: text_emb_2\n", + " encoder_config:\n", + " cls: path.to.TextEncoder2\n", + " ...\n", + "```\n", + "\n", + "In this tutorial, we will show an example of precaching workflow for Stable Diffusion. " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "27b26036", + "metadata": {}, + "source": [ + "Let's download an example precaching config file ## TODO modify this path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2bd39f5", + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "! wget TODO_github_link/precache_sd.yaml -P $CONF_DIR/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "986045fe", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "from omegaconf import OmegaConf\n", + "precache_cfg = OmegaConf.load(os.path.join(CONF_DIR, \"precache_sd.yaml\"))\n", + "# visualize the config\n", + "print(OmegaConf.to_yaml(precache_cfg))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f5bee705", + "metadata": {}, + "source": [ + "There are a few things to note about this config file:\n", + "- `batch_size_per_GPU`: this should be set to as much as your GPU memory can fit\n", + "- `save_original_in_tar`: for SD, original images or text are not used during training (if using precached encodings), so we can leave this empty. If you want the original image and text copied into the tar file, you can set this to [image, text]. 
\n", + "\n", + "In this example, we need to download the weights of the image autoencoder from the HuggingFace [Stable Diffusion v1.5 repo](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/vae/diffusion_pytorch_model.bin). Text encoder weights will be downloaded automatically when the model `FrozenCLIPEmbedder` is initialized." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d6804d4", + "metadata": {}, + "outputs": [], + "source": [ + "! wget https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/vae/diffusion_pytorch_model.bin" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d2e5d6aa", + "metadata": {}, + "source": [ + "Then, we modify the `from_pretrained` field with the weights file, and save this config as a yaml file to disk.\n", + "We also adjust the config to use 1 GPU for this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d509be7c", + "metadata": {}, + "outputs": [], + "source": [ + "precache_cfg.encodings[0].encoder_config.from_pretrained = 'diffusion_pytorch_model.bin'\n", + "precache_cfg.lightning.devices=1\n", + "# precache_cfg.batch_size_per_GPU=8 # adjust if needed\n", + "\n", + "OmegaConf.save(precache_cfg, os.path.join(CONF_DIR, \"precache_sd_example.yaml\"))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "765d281e", + "metadata": {}, + "source": [ + "Now we can run the precaching script. 
\n", + "\n", + "Script: precache_encodings.py\n", + "\n", + "Environment variables (automatically set by SLURM if running with NeMo-Megatron-Launcher):\n", + "- `SLURM_ARRAY_TASK_COUNT`: total number of tasks, should be set to parquet_subpartitions x num_parquets_downloaded\n", + "- `SLURM_ARRAY_TASK_ID`: id of the current task (0 <= `SLURM_ARRAY_TASK_ID` < `SLURM_ARRAY_TASK_COUNT`)\n", + "\n", + "Arguments:\n", + "- `input_dir`: reorganized tar dir from the previous stage.\n", + "- `output_dir`: output of this stage\n", + "- `tar_chunk_size`: number of training examples in each output tar file\n", + "- `precache_config_path`: path to the precaching config file, as described above\n", + "\n", + "This stage will typically benefit from a large degree of parallelism (e.g. thousands of tasks). We will pretend that we are the first task out of 2 tasks. Since the input directory is already 1/3 of the full dataset, the result of this stage in the tutorial will be 1/6 of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63c9b44c", + "metadata": {}, + "outputs": [], + "source": [ + "! SLURM_ARRAY_TASK_ID=0 SLURM_ARRAY_TASK_COUNT=2 python $SCRIPT_DIR/precache_encodings.py \\\n", + " input_dir=$DATA_DIR/tarfiles_reorganized \\\n", + " output_dir=$DATA_DIR/tarfiles_precached \\\n", + " tar_chunk_size=1000 \\\n", + " precache_config_path=conf/precache_sd_example.yaml" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "76082b01", + "metadata": {}, + "source": [ + "**Milestone**: You should now see tar files (along with other files) in the output directory `$DATA_DIR/tarfiles_precached/`. Inside each tar file, you should see exactly 1000 .pickle files storing the image and text embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e93883ba", + "metadata": {}, + "outputs": [], + "source": [ + "! 
tar -tf `ls -d $DATA_DIR/tarfiles_precached/* | head -n 1` | wc -l\n", + "# should output 1000" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "6394e07f", + "metadata": {}, + "source": [ + "## Stage 5: Generate `wdinfo` Metadata\n", + "\n", + "This stage generates the metadata required by the NeMo multimodal training data pipeline. The metadata contains the chunk size, the total size of dataset, and most importantly, the list of tarfiles.\n", + "\n", + "The metadata will only include tarfiles with exactly `tar_chunk_size` (1000 in this tutorial) examples, and ignore/discard incomplete tar files. Incomplete tar files are the last tar file generated by each process which contain the leftover training examples, most likely less than `tar_chunk_size`. \n", + "\n", + "In the case of a high degree of parallelism, there can be a significant number of incomplete tarfiles leading to a waste of discarded training examples. Therefore, before creating the metadata file, this stage will also find all the incomplete tar files generated in the previous stage, and combine them in a single-process script so that there is at most one incomplete tarfile throughout the entire dataset.\n", + "\n", + "Script: generate_wdinfo.py\n", + "\n", + "Arguments:\n", + "- `input_dir`: output tar dir from stage 3 or stage 4.\n", + "- `output_wdinfo_path`: output of this stage\n", + "- `tar_chunk_size`: number of training examples in each output tar file\n", + "- `file_ext_in_tar`: see explanation in Stage 3. If you performed precaching without copying any files from the source tar (i.e. save_original_in_tar: null in precache_sd_example.yaml) then this should be [.pickle]\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b72a03a", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "! 
python $SCRIPT_DIR/generate_wdinfo.py \\\n", + " input_dir=$DATA_DIR/tarfiles_precached \\\n", + " output_wdinfo_path=$DATA_DIR/wdinfo.pkl \\\n", + " tar_chunk_size=1000 \\\n", + " file_ext_in_tar=[.pickle]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "a9c8e2ca", + "metadata": {}, + "source": [ + "**Milestone**: You should now see the wdinfo.pkl file generated. The content of the file is printed above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cc2e2a6", + "metadata": {}, + "outputs": [], + "source": [ + "! test -f $DATA_DIR/wdinfo.pkl && echo \"File exists\" || echo \"File does not exist\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "## Appendix 1: Create a WebDataset from Local Image Text Files\n", + "\n", + "\n", + "If you have image and text files already downloaded, you can quickly convert your dataset to the WebDataset format and proceed with Stages 3-5 of this tutorial without wasting any time on download. Simply follow the steps below.\n", + "\n", + "1. Manipulate your dataset so that each text caption is stored in a single file, and shares the same file path as the corresponding image, except the extension. The file path can contain subfolders. An example is shown below\n", + "\n", + " ```bash\n", + " > cd dataset\n", + " > find . -type f\n", + " ./train/n00001234/00010000.jpg ./train/n00001234/00010000.txt\n", + " ./train/n00001234/00010001.jpg ./train/n00001234/00010001.txt\n", + " ./train/n00001234/00010002.jpg ./train/n00001234/00010002.txt\n", + " ./train/n00001234/00010003.jpg ./train/n00001234/00010003.txt\n", + " ./train/n00001234/00010004.jpg ./train/n00001234/00010004.txt\n", + " ./train/n00001235/00010000.jpg ./train/n00001235/00010000.txt\n", + " ./train/n00001235/00010001.jpg ./train/n00001235/00010001.txt\n", + " ...\n", + " > cd ..\n", + " ```\n", + "\n", + "2. 
Run this command to create a tarball of the folder with sorted file names. It is important for WebDataset to have the image and text files in consecutive blocks on disk, hence the sorting is necessary.\n", + "\n", + "```bash\n", + "> tar --sort=name -cf dataset.tar dataset/\n", + "```\n", + "\n", + "For more information, please visit [Creating a WebDataset](https://github.com/webdataset/webdataset#creating-a-webdataset)\n", + "\n", + "After this, you can proceed with Stage 3 of the tutorial.\n", + "Note: if you can use a script to create folders with exactly `tar_chunk_size` (1000 in the tutorial) image-text pairs, and create multiple tarfiles each with `tar_chunk_size` pairs of data, then you can skip Stage 3 and proceed with Stage 4 of the tutorial." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/multimodal/README.md b/tutorials/multimodal/README.md new file mode 100644 index 000000000000..3da7c87d7c08 --- /dev/null +++ b/tutorials/multimodal/README.md @@ -0,0 +1,15 @@ +# NeMo MultiModal Tutorials + + +The goal of this collection of IPython notebooks is to familiarize users with NeMo multimodal offerings. By launching the latest NeMo container, users can easily step through the tutorials, conduct experiments, and build custom applications. + +#### Getting Started Checklist +* Register an NGC account +* Generate your NGC API key for pulling the NeMo container (please refer to this [video](https://youtu.be/yBNt4qSnn0k?feature=shared) for details) +* Make sure the container host system meets the minimal GPU requirement (i.e. 
NVIDIA A100 GPU) + +## What does this repository contain? +This repository contains the following resources: +* [Data Preparation](./Multimodal%20Data%20Preparation.ipynb) +* [Train And Infer Stable Diffusion Model](./Stable%20Diffusion%20Tutorial.ipynb) +* [Train DreamBooth Model](./DreamBooth%20Tutorial.ipynb) diff --git a/tutorials/multimodal/Stable Diffusion Tutorial.ipynb b/tutorials/multimodal/Stable Diffusion Tutorial.ipynb new file mode 100644 index 000000000000..ed794356f280 --- /dev/null +++ b/tutorials/multimodal/Stable Diffusion Tutorial.ipynb @@ -0,0 +1,275 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "d874e23f-9631-48e0-b635-84e7280bf07b", + "metadata": {}, + "source": [ + "# Stable Diffusion Training / Inference Tutorial\n", + "\n", + "### Note:\n", + "Currently, this notebook must be run in a NeMo container. An example command to launch the container:\n", + "\n", + "```\n", + "docker run --gpus all -it --rm -v :/opt/NeMo --shm-size=8g \\\n", + " -p 8888:8888 --ulimit memlock=-1 --ulimit \\\n", + " stack=67108864 \n", + "```\n", + "\n", + "## Introduction\n", + "\n", + "This notebook illustrates how to train and perform inference using Stable Diffusion with the NeMo Toolkit. For the sake of brevity, we've chosen to use Stable Diffusion as an example to demonstrate the foundational process of training and inferencing with Text2Img models. However, you can apply the same approach to other foundational Text2Img models, such as Imagen.\n", + "\n", + "The implementation of Stable Diffusion is based on [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752).\n", + "\n", + "This tutorial will guide you through the following topics:\n", + "\n", + "1. Training a Stable Diffusion model.\n", + "2. 
Performing inference with the trained model.\n", + "\n", + "## Datasets\n", + "\n", + "Please refer to [ADD LINK]() for how to prepare a training dataset for Stable Diffusion.\n", + "\n", + "For a pre-cached Stable Diffusion dataset, each webdataset tar file should, at a minimum, include the pickle files that store the pre-cached image and text features:\n", + "\n", + "```\n", + "t0_r0_0.tar\n", + "|---- 0000.pickle\n", + "|---- 0001.pickle\n", + "...\n", + "```\n", + "\n", + "For a non-precached Stable Diffusion dataset, each webdataset tar file should contain the raw texts and corresponding images:\n", + "\n", + "```\n", + "t0_r0_0.tar\n", + "|---- 0000.jpg\n", + "|---- 0000.txt\n", + "|---- 0001.jpg\n", + "|---- 0001.txt\n", + "...\n", + "```\n", + "\n", + "## Encoders Preparation\n", + "\n", + "Depending on whether you precache the dataset, you might also need to first download the image and/or text encoders.\n", + "\n", + "### Option 1: Training on Non-Precached Dataset (Use Encoders During Training)\n", + "\n", + "#### A. Prepare VAE\n", + "To download the default VAE for Stable Diffusion:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "730cd137-0fce-4bab-8ac7-219e5c55faf2", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "! wget https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/vae/diffusion_pytorch_model.bin\n", + "! mkdir -p /ckpts\n", + "! mv diffusion_pytorch_model.bin /ckpts/vae.bin" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "fef8b245-7cee-4048-a9ec-3ada90432a89", + "metadata": {}, + "source": [ + "The above commands download the default VAE weights from HuggingFace and save them to `/ckpts/vae.bin`.\n", + "\n", + "**Note**: if you want to customize the saved location, make sure it is also reflected in your training config.\n", + "#### B. 
Prepare Text Encoder\n", + "For the text encoder used in Stable Diffusion, you can either use the [HuggingFace CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14) or use NeMo's CLIP-ViT. NeMo Stable Diffusion also supports a native CLIP ViT model trained in the NeMo framework.\n", + "\n", + "Make sure the following settings are used in `cond_stage_config`:\n", + "\n", + "```\n", + " cond_stage_config:\n", + " # For compatibility with the previous version that uses HuggingFace CLIP model\n", + " _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder\n", + " version: openai/clip-vit-large-patch14\n", + " device: cuda\n", + " max_length: 77\n", + " capture_cudagraph_iters: ${model.capture_cudagraph_iters}\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e52057c4-83ee-4f21-a11c-11a5c367a0b8", + "metadata": {}, + "source": [ + "Alternatively, you can use the CLIP model in `.nemo` format. This can be achieved by using the provided NeMo script to download and convert the CLIP model via the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ada77920-06f5-43f3-bb26-82d9daabde8f", + "metadata": {}, + "outputs": [], + "source": [ + "! 
python examples/multimodal/foundation/clip/convert_external_clip_to_nemo.py \\\n", + " --arch ViT-L-14 \\\n", + " --version openai \\\n", + " --hparams_file /opt/NeMo/examples/multimodal/foundation/clip/conf/megatron_clip_VIT-L-14.yaml \\\n", + " --nemo_file /ckpts/openai.nemo" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "34a385ff-f8ff-4e64-bd6f-b814be388598", + "metadata": {}, + "source": [ + "When using a `.nemo` ViT model, you can use the default `cond_stage_config`:\n", + "\n", + "```\n", + " cond_stage_config:\n", + " _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenMegatronCLIPEmbedder\n", + " restore_from_path: /ckpts/openai.nemo\n", + " device: cuda\n", + " freeze: True\n", + " layer: \"last\"\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "8854eb7a-e822-43f6-a1d5-12357049485a", + "metadata": {}, + "source": [ + "\n", + "### Option 2: Training on Precached Dataset (Training UNet Only)\n", + "\n", + "When using a precached dataset (please refer to the [Dataset Tutorial](ADD_LINK) for details), each text feature and image feature is stored as a key-value pair in a `.pickle` file:\n", + "\n", + "```\n", + "{\n", + " image_key: torch.Tensor(),\n", + " text_key: torch.Tensor(),\n", + "}\n", + "```\n", + "\n", + "Make sure that, in the training config, `cond_stage_key` is associated with `text_key` and `first_stage_key` is associated with `image_key`.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5762427b-f60c-4dfd-8318-e55771b25354", + "metadata": {}, + "source": [ + "## Model Config Setup\n", + "\n", + "Now we will begin setting up the config file needed for Stable Diffusion training. We will use [sd_train.yaml]() as the template.\n", + "\n", + "1. Modify `model.data.train.dataset_path` so that it has all the webdataset info files you want to train on\n", + "2. Modify `model.data.webdataset.local_root_path` to point to your dataset path\n", + "3. 
Make sure the VAE path `model.first_stage_config.from_pretrained` is adjusted if using a non-precached dataset\n", + "4. Make sure the text encoder config is correct (detailed above)\n", + "5. Configure `exp_manager.exp_dir` as the experiment save directory\n", + "6. Configure `exp_manager.wandb_logger_kwargs` and/or `exp_manager.create_tensorboard_logger` if needed" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "70f858b3-f7d5-4678-b380-80582337bc23", + "metadata": {}, + "source": [ + "**Note**: Please refer to NeMo Toolkit Developer Guide's Stable Diffusion page for more details on in-depth customizations, including all available optimizations.\n", + "\n", + "## Training\n", + "\n", + "Once everything is set up, training Stable Diffusion is as simple as running:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "589e3a14-c881-4a56-b2bd-370653059dfc", + "metadata": {}, + "outputs": [], + "source": [ + "! torchrun /opt/NeMo/examples/multimodal/generative/stable_diffusion/sd_train.py trainer.max_steps=100 model.data.synthetic_data=True" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "892d72dd-c4d7-4ca4-a948-168e187af65c", + "metadata": {}, + "source": [ + "Intermediate checkpoints (during training) and the final checkpoint will be saved to the `exp_manager.exp_dir` folder. Note that here we use synthetic data for demo purposes." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "087c8b9a-92c3-43d3-86a3-bf7e848dfbd2", + "metadata": {}, + "source": [ + "## Inference\n", + "\n", + "Stable Diffusion inference needs a trained NeMo Stable Diffusion checkpoint, along with both the image encoder (VAE) and text encoder (CLIP). The checkpoint can be either a fully trained `.nemo` checkpoint or an intermediate checkpoint from training (typically in `.ckpt` format). Both `.nemo` and `.ckpt` checkpoints can be used for inference. 
For information on downloading the encoders, please refer to the previous section.\n", + "\n", + "### Inference Config Setup\n", + "\n", + "Now we will begin setting up the config file needed for Stable Diffusion inference. We will use [sd_infer.yaml]() as the template.\n", + "\n", + "We generally use [Classifier Free Guidance](https://arxiv.org/abs/2207.12598) for better visual quality, which can be set at `infer.unconditional_guidance_scale`.\n", + "\n", + "NeMo Stable Diffusion supports multiple samplers. Please refer to the developer guide for more details. Samplers can be set at `infer.sampler_type`.\n", + "\n", + "Inference supports a batch of text prompts, which can be set at `infer.prompts`. One can also generate a configurable number of images per prompt by setting `infer.num_images_per_prompt`. Generated images will be saved to `infer.out_path`.\n", + "\n", + "You will also need to set the model checkpoint path at `model.restore_from_path`.\n", + "\n", + "### Running the Inference\n", + "\n", + "Once everything is set up, Stable Diffusion inference is as simple as running:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e676c5d-d711-489e-8ab7-3ee20046d88d", + "metadata": {}, + "outputs": [], + "source": [ + "! torchrun /opt/NeMo/examples/multimodal/generative/stable_diffusion/sd_infer.py model.restore_from_path='/opt/NeMo/tutorials/multimodal/nemo_experiments/stable-diffusion-train/checkpoints/stable-diffusion-train.nemo'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}