From 2b2e62db3172491840deffb631914f9f8de67f59 Mon Sep 17 00:00:00 2001
From: yaoyu-33
Date: Wed, 10 Jul 2024 10:20:26 -0700
Subject: [PATCH] update docs

Signed-off-by: yaoyu-33
---
 docs/source/multimodal/mllm/checkpoint.rst | 114 ---------------------
 docs/source/multimodal/mllm/intro.rst      |   1 -
 docs/source/multimodal/vlm/checkpoint.rst  |  56 +++-------
 3 files changed, 17 insertions(+), 154 deletions(-)
 delete mode 100644 docs/source/multimodal/mllm/checkpoint.rst

diff --git a/docs/source/multimodal/mllm/checkpoint.rst b/docs/source/multimodal/mllm/checkpoint.rst
deleted file mode 100644
index d1fe7b651e66..000000000000
--- a/docs/source/multimodal/mllm/checkpoint.rst
+++ /dev/null
@@ -1,114 +0,0 @@
-Checkpoints
-===========
-
-In this section, we present four key functionalities of NVIDIA NeMo related to checkpoint management:
-
-1. **Checkpoint Loading**: Load local ``.nemo`` checkpoint files with the :code:`restore_from()` method.
-2. **Partial Checkpoint Conversion**: Convert partially-trained ``.ckpt`` checkpoints to the ``.nemo`` format.
-3. **Community Checkpoint Conversion**: Convert checkpoints from community sources, such as HuggingFace, into the ``.nemo`` format.
-4. **Model Parallelism Adjustment**: Modify model parallelism to efficiently train models that exceed the memory of a single GPU. NeMo employs both tensor (intra-layer) and pipeline (inter-layer) model parallelism. For background, see "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (`link <https://arxiv.org/abs/2104.04473>`_). The adjustment tool accommodates users who need to redeploy a model across a different number of GPUs due to memory constraints.
-
-Understanding Checkpoint Formats
---------------------------------
-
-A ``.nemo`` checkpoint is fundamentally a tar file that bundles the model configuration (given as a YAML file), model weights, and other pertinent artifacts such as tokenizer models or vocabulary files. This consolidated design streamlines sharing, loading, tuning, evaluating, and inference.
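-
-Because a ``.nemo`` file is just a tar archive, you can inspect one directly with standard tools. The following is a minimal sanity check; the member names in the comments are illustrative and vary by model:
-
-.. code-block:: bash
-
-    # List the contents of a .nemo checkpoint without extracting it.
-    tar -tvf model.nemo
-
-    # Typical members include a config YAML, weight files, and tokenizer artifacts, e.g.:
-    #   model_config.yaml    (model configuration)
-    #   model_weights.ckpt   (model weights)
-    #   tokenizer.model      (tokenizer artifacts)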
-
-A ``.ckpt`` file, on the other hand, is a product of PyTorch Lightning training. It stores model weights and optimizer states, and is generally used to resume training.
-
-Subsequent sections delve into each of the functionalities listed above, emphasizing the loading of fully trained checkpoints for evaluation or additional fine-tuning.
-
-
-Loading Local Checkpoints
--------------------------
-
-NeMo inherently saves any model's checkpoints in the ``.nemo`` format. To manually save a model at any stage:
-
-.. code-block:: python
-
-    model.save_to("<checkpoint_path>.nemo")
-
-To load a local ``.nemo`` checkpoint:
-
-.. code-block:: python
-
-    import nemo.collections.multimodal as nemo_multimodal
-
-    model = nemo_multimodal.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
-
-Replace ``<MODEL_BASE_CLASS>`` with the appropriate MM model class.
-
-Converting Local Checkpoints
-----------------------------
-
-The training script auto-converts only the final checkpoint into the ``.nemo`` format. To evaluate intermediate training checkpoints, convert them to ``.nemo`` first:
-
-.. code-block:: bash
-
-    python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size> * <pipeline_model_parallel_size> \
-        examples/multimodal/convert_ckpt_to_nemo.py \
-        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
-        --checkpoint_name <checkpoint_name> \
-        --nemo_file_path <path_to_output_nemo_file> \
-        --tensor_model_parallel_size <tensor_model_parallel_size> \
-        --pipeline_model_parallel_size <pipeline_model_parallel_size>
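-
-Here, ``--nproc_per_node`` should equal the product of the tensor and pipeline model parallel sizes. As a concrete, hypothetical example, a checkpoint trained with tensor parallelism 4 and pipeline parallelism 2 would be converted on 8 GPUs; all paths and the checkpoint name below are illustrative:
-
-.. code-block:: bash
-
-    # Hypothetical conversion of a TP=4, PP=2 checkpoint: 4 * 2 = 8 processes.
-    python -m torch.distributed.launch --nproc_per_node=8 \
-        examples/multimodal/convert_ckpt_to_nemo.py \
-        --checkpoint_folder /results/checkpoints \
-        --checkpoint_name megatron_multimodal--step=10000.ckpt \
-        --nemo_file_path /results/multimodal.nemo \
-        --tensor_model_parallel_size 4 \
-        --pipeline_model_parallel_size 2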
-
-Converting Community Checkpoints
---------------------------------
-
-NeVA Checkpoints
-^^^^^^^^^^^^^^^^
-
-Currently, the conversion mainly supports LLaVA checkpoints based on "llama-2 chat" checkpoints. As a reference, we'll consider the checkpoint `llava-llama-2-13b-chat-lightning-preview <https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview>`_.
-
-After downloading this checkpoint and saving it at ``/path/to/llava-llama-2-13b-chat-lightning-preview``, complete the following steps:
-
-Modifying the Tokenizer
-"""""""""""""""""""""""
-
-NeMo requires specific tokens to be added to the tokenizer model for peak performance. To modify an existing tokenizer located in ``/path/to/llava-llama-2-13b-chat-lightning-preview/tokenizer``, execute the following in the NeMo container:
-
-.. code-block:: bash
-
-    cd /opt/sentencepiece/src/
-    protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
-    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
-        --input_file /path/to/llava-llama-2-13b-chat-lightning-preview/tokenizer.model \
-        --output_file /path/to/llava-llama-2-13b-chat-lightning-preview/tokenizer_neva.model \
-        --is_userdefined \
-        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
-                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
-
-Checkpoint Conversion
-"""""""""""""""""""""
-
-For conversion:
-
-.. code-block:: bash
-
-    python examples/multimodal/mllm/neva/convert_hf_llava_to_neva.py \
-        --in-file /path/to/llava-llama-2-13b-chat-lightning-preview \
-        --out-file /path/to/neva-llava-llama-2-13b-chat-lightning-preview.nemo \
-        --tokenizer-model /path/to/llava-llama-2-13b-chat-lightning-preview/tokenizer_neva.model \
-        --conv-template llama_2
-
-
-Model Parallelism Adjustment
-----------------------------
-
-NeVA Checkpoints
-^^^^^^^^^^^^^^^^
-
-Adjust model parallelism with:
-
-.. code-block:: bash
-
-    python examples/nlp/language_modeling/megatron_change_num_partitions.py \
-        --model_file=/path/to/source.nemo \
-        --target_file=/path/to/target.nemo \
-        --tensor_model_parallel_size=??? \
-        --target_tensor_model_parallel_size=??? \
-        --pipeline_model_parallel_size=??? \
-        --target_pipeline_model_parallel_size=??? \
-        --model_class="nemo.collections.multimodal.models.multimodal_llm.neva.neva_model.MegatronNevaModel" \
-        --precision=32 \
-        --tokenizer_model_path=/path/to/tokenizer.model \
-        --tp_conversion_only
diff --git a/docs/source/multimodal/mllm/intro.rst b/docs/source/multimodal/mllm/intro.rst
index 0e76a9737a0f..48bfd56f9ae1 100644
--- a/docs/source/multimodal/mllm/intro.rst
+++ b/docs/source/multimodal/mllm/intro.rst
@@ -8,7 +8,6 @@ The endeavor to extend Language Models (LLMs) into multimodal domains by integra
 
    datasets
    configs
-   checkpoint
   neva
   video_neva
   sequence_packing
diff --git a/docs/source/multimodal/vlm/checkpoint.rst b/docs/source/multimodal/vlm/checkpoint.rst
index 996d9828f5aa..d984f1453510 100644
--- a/docs/source/multimodal/vlm/checkpoint.rst
+++ b/docs/source/multimodal/vlm/checkpoint.rst
@@ -35,58 +35,36 @@ To load a local ``.nemo`` checkpoint:
 
 Replace ``<MODEL_BASE_CLASS>`` with the appropriate MM model class.
 
-Converting Local Checkpoints
-----------------------------
-
-Only the last checkpoint is automatically saved in the ``.nemo`` format. To evaluate intermediate training checkpoints, a ``.nemo`` conversion might be necessary; for this, use the conversion script ``examples/multimodal/convert_ckpt_to_nemo.py``:
-
-.. code-block:: python
-
-    python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size> * <pipeline_model_parallel_size> \
-        examples/multimodal/convert_ckpt_to_nemo.py \
-        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
-        --checkpoint_name <checkpoint_name> \
-        --nemo_file_path <path_to_output_nemo_file> \
-        --tensor_model_parallel_size <tensor_model_parallel_size> \
-        --pipeline_model_parallel_size <pipeline_model_parallel_size>
-
 Converting Community Checkpoints
 --------------------------------
 
 CLIP Checkpoints
 ^^^^^^^^^^^^^^^^
 
-To migrate community checkpoints:
-
-.. code-block:: python
-
-    python examples/multimodal/foundation/clip/convert_external_clip_to_nemo.py \
-        --arch=ViT-H-14 \
-        --version=laion2b_s32b_b79k \
-        --hparams_file=path/to/saved.yaml \
-        --nemo_file_path=open_clip.nemo
+To migrate community checkpoints, use the following command:
+
+.. code-block:: bash
+
+    torchrun --nproc-per-node=1 /opt/NeMo/scripts/checkpoint_converters/convert_clip_hf_to_nemo.py \
+        --input_name_or_path=openai/clip-vit-large-patch14 \
+        --output_path=openai_clip.nemo \
+        --hparams_file=/opt/NeMo/examples/multimodal/vision_language_foundation/clip/conf/megatron_clip_VIT-L-14.yaml
 
 Ensure the NeMo hparams file has the correct model architectural parameters, placed at `path/to/saved.yaml`. An example can be found in `examples/multimodal/foundation/clip/conf/megatron_clip_config.yaml`.
 
-For OpenCLIP migrations, provide the architecture (`arch`) and version (`version`) according to the OpenCLIP `model list <https://github.com/mlfoundations/open_clip>`_. For Hugging Face conversions, set the version to `huggingface` and the architecture (`arch`) to the specific Hugging Face model identifier, e.g., `yuvalkirstain/PickScore_v1`.
-
-Model Parallelism Adjustment
-----------------------------
-
-CLIP Checkpoints
-^^^^^^^^^^^^^^^^
-
-To adjust model parallelism from the original model parallelism size to a new size (Note: NeMo CLIP currently only supports `pipeline_model_parallel_size=1`):
-
-.. code-block:: python
-
-    python examples/nlp/language_modeling/megatron_change_num_partitions.py \
-    --model_file=/path/to/source.nemo \
-    --target_file=/path/to/target.nemo \
-    --tensor_model_parallel_size=??? \
-    --target_tensor_model_parallel_size=??? \
-    --pipeline_model_parallel_size=-1 \
-    --target_pipeline_model_parallel_size=1 \
-    --precision=32 \
-    --model_class="nemo.collections.multimodal.models.clip.megatron_clip_models.MegatronCLIPModel" \
-    --tp_conversion_only
+After conversion, you can verify the model with the following command:
+
+.. code-block:: bash
+
+    wget https://upload.wikimedia.org/wikipedia/commons/0/0f/1665_Girl_with_a_Pearl_Earring.jpg
+    torchrun --nproc-per-node=1 /opt/NeMo/examples/multimodal/vision_language_foundation/clip/megatron_clip_infer.py \
+        model.restore_from_path=./openai_clip.nemo \
+        image_path=./1665_Girl_with_a_Pearl_Earring.jpg \
+        texts='["a dog", "a boy", "a girl"]'
+
+It should generate a high probability for the "a girl" tag. For example:
+
+.. code-block:: text
+
+    Given image's CLIP text probability: [('a dog', 0.0049710185), ('a boy', 0.002258187), ('a girl', 0.99277073)]
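+
+The same converter can be pointed at other Hugging Face CLIP checkpoints, provided the hparams file matches the model architecture. Below is a hypothetical sketch for ``openai/clip-vit-base-patch32``; the YAML filename here is an assumption, so substitute the config that actually matches your architecture:
+
+.. code-block:: bash
+
+    # Hypothetical: convert openai/clip-vit-base-patch32 with a matching config file.
+    torchrun --nproc-per-node=1 /opt/NeMo/scripts/checkpoint_converters/convert_clip_hf_to_nemo.py \
+        --input_name_or_path=openai/clip-vit-base-patch32 \
+        --output_path=openai_clip_b32.nemo \
+        --hparams_file=/opt/NeMo/examples/multimodal/vision_language_foundation/clip/conf/megatron_clip_VIT-B-32.yaml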