Update llama-3 PEFT notebook to download model from NGC (#9667)
* Update llama-3 PEFT notebook to download model from NGC

Signed-off-by: Shashank Verma <[email protected]>

* Fix broken link in llama-3 PEFT tutorial README

Signed-off-by: Shashank Verma <[email protected]>

* Fix broken code block in llama 3 PEFT tutorial README

Signed-off-by: Shashank Verma <[email protected]>

* Copy-edits to Llama-3 8B PEFT tutorial README

Signed-off-by: Shashank Verma <[email protected]>

* Fix broken link

Signed-off-by: Shashank Verma <[email protected]>

* Minor formatting fixes

Signed-off-by: Shashank Verma <[email protected]>

---------

Signed-off-by: Shashank Verma <[email protected]>
shashank3959 authored Jul 10, 2024
1 parent 14d42dc commit 3ab0a2a
Showing 2 changed files with 47 additions and 124 deletions.
54 changes: 25 additions & 29 deletions tutorials/llm/llama-3/README.rst
@@ -1,9 +1,9 @@
Llama 3 LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM
==========================================================================

`Llama 3 <https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/>`_ is an open source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens, and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants—base pretrained and instruction tuned.
`Llama 3 <https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/>`_ is an open-source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens, and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants—base pretrained and instruction tuned.

`Low-Rank Adaptation (LoRA) <https://arxiv.org/pdf/2106.09685>`__ has emerged as a popular Parameter Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.
`Low-Rank Adaptation (LoRA) <https://arxiv.org/pdf/2106.09685>`__ has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.

`NVIDIA NeMo
Framework <https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html>`__ provides tools to perform LoRA on Llama 3 to fit your use case, which can then be deployed using `NVIDIA NIM <https://www.nvidia.com/en-us/ai/>`__ for optimized inference on NVIDIA GPUs.
@@ -16,32 +16,34 @@ Framework <https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.htm
Figure 1: Steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM


| NIM supports seamless deployment of multiple LoRA adapters (aka “multi-LoRA”) over the same base model by dynamically loading the adapter weights based on incoming requests at runtime. This provides the flexibility to handle inputs from various tasks or use cases without the need for deploying a unique model for each individual use case. More information on NIM for LLMs can be found it its `documentation <https://docs.nvidia.com/nim/large-language-models latest/introduction.html>`__.
| NIM enables seamless deployment of multiple LoRA adapters (referred to as “multi-LoRA”) on the same base model. It dynamically loads the adapter weights based on incoming requests at runtime. This flexibility allows handling inputs from various tasks or use cases without deploying a unique model for each individual scenario. For further details, consult the `NIM documentation for LLMs <https://docs.nvidia.com/nim/large-language-models/latest/introduction.html>`__.
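To make the multi-LoRA idea concrete, here is a sketch of what a request looks like once NIM is serving (see the deployment section below). It assumes NIM is listening on localhost port 8000 and that a LoRA adapter named ``llama3-8b-pubmed-qa`` — the one trained in this tutorial — has been loaded; the ``model`` field selects the base model or any loaded adapter:

.. code:: bash

   # Illustrative request against a locally running NIM endpoint; the
   # "model" field picks the base model or a loaded LoRA adapter by name.
   curl -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "llama3-8b-pubmed-qa",
           "prompt": "QUESTION: Does high-dose vitamin D reduce fracture risk? ANSWER:",
           "max_tokens": 128
         }'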
Requirements
-------------

To proceed, ensure that you have met the following requirements:

* System Configuration
* Access to at least 1 NVIDIA GPU with a cumulative memory of at least 80GB, for example: 1 x H100-80GB or 1 x A100-80GB.
* A Docker-enabled environment, with `NVIDIA Container Runtime <https://developer.nvidia.com/container-runtime>`_ installed, which will make the container GPU-aware.
* `Additional NIM requirements <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#prerequisites>`_.

* Requested the necessary permission from Hugging Face and Meta to download `Meta-Llama-3-8B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`_. Then, you can use your Hugging Face `access token <https://huggingface.co/docs/hub/en/security-tokens>`_ to download the model, which we will then convert and customize with NeMo Framework.

* `Authenticate with NVIDIA NGC <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-authentication>`_, and download `NGC CLI Tool <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-cli-tool>`_.
* `Authenticate with NVIDIA NGC <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-authentication>`_, and download `NGC CLI Tool <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-cli-tool>`_. You will use this tool to download the model and customize it with NeMo Framework.
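If you have not used the NGC CLI before, authentication is typically a one-time step. As a sketch (prompts can differ slightly by CLI version), the tool interactively stores the API key generated from your NGC account:

.. code:: bash

   # Prompts for your NGC API key, org, and team, and saves them for later use
   ngc config set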


`Create a LoRA Adapter with NeMo Framework <./llama3-lora-nemofw.ipynb>`__
--------------------------------------------------------------------------

This notebook shows how to perform LoRA PEFT on **Llama 3 8B Instruct** using `PubMedQA <https://pubmedqa.github.io/>`__ with NeMo Framework. PubMedQA is a Question-Answering dataset for biomedical texts. You will use the NeMo Framework, which is available as a `docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`__.

To get started
^^^^^^^^^^^^^^
1. Download the `Llama 3 8B Instruct .nemo <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3-8b-instruct-nemo>`__ from NVIDIA NGC using the NGC CLI. The following command saves the ``.nemo`` format model in a folder named ``llama-3-8b-instruct-nemo_v1.0`` in the current directory. You can specify another path using the ``-d`` option in the CLI tool.

.. code:: bash

   ngc registry model download-version "nvidia/nemo/llama-3-8b-instruct-nemo:1.0"
Alternatively, you can download the model from `Hugging Face <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and convert it to the ``.nemo`` format using the Hugging Face to NeMo `Llama checkpoint conversion script <https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/ckpt_converters/user_guide.html#community-model-converter-user-guide>`__. If you'd like to skip this extra step, the ``.nemo`` model is available on NGC as mentioned above.
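If you take the Hugging Face route, the conversion is a single script invocation from inside the NeMo Framework container. The sketch below assumes the Hugging Face checkpoint was downloaded to ``./Meta-Llama-3-8B-Instruct``:

.. code:: bash

   # Convert the Hugging Face checkpoint into a distributed .nemo checkpoint
   python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
       --precision bf16 \
       --input_name_or_path=./Meta-Llama-3-8B-Instruct/ \
       --output_path=./Meta-Llama-3-8B-Instruct.nemo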

1. Run the container using the following command. It assumes that you have the notebook(s) available in the current working directory. If not, mount the appropriate folder to ``/workspace``.
2. Run the container using the following command. It is assumed that you have the notebook(s) and the ``llama-3-8b-instruct`` model available in the current directory. If not, mount the appropriate folder to ``/workspace``.

.. code:: bash
@@ -61,13 +63,13 @@ To get started
-v ${PWD}/results:/results \
nvcr.io/nvidia/nemo:$FW_VERSION bash
2. From within the container, start the Jupyter lab:
3. From within the container, start Jupyter Lab:

.. code:: bash

   jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
3. Then, navigate to `this notebook <./llama3-lora-nemofw.ipynb>`__.
4. Then, navigate to `this notebook <./llama3-lora-nemofw.ipynb>`__.


`Deploy Multiple LoRA Inference Adapters with NVIDIA NIM <./llama3-lora-deploy-nim.ipynb>`__
@@ -100,15 +102,11 @@ The following steps assume that you have authenticated with NGC and downloaded t
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
2. Prepare the LoRA model store
2. Prepare the LoRA model store.

After training is complete, that LoRA model checkpoint will be
created at
``./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo``,
assuming default paths in the first notebook weren’t modified.
After training is complete, the LoRA model checkpoint will be created at ``./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo``, assuming default paths in the first notebook weren’t modified.

To ensure model store is organized as expected, create a folder named
``llama3-8b-pubmed-qa``, and move your .nemo checkpoint there.
To ensure the model store is organized as expected, create a folder named ``llama3-8b-pubmed-qa``, and move your ``.nemo`` checkpoint there.

.. code:: bash
@@ -119,7 +117,7 @@ To ensure model store is organized as expected, create a folder named
The LoRA model store directory should have a structure like so - with the name of the model as a sub-folder that contains the .nemo file.
Ensure that the LoRA model store directory follows this structure: the model name(s) should be sub-folder(s) containing the ``.nemo`` file(s).

::

@@ -131,11 +129,10 @@ The LoRA model store directory should have a structure like so - with the name o
└── llama3-8b-pubmed-qa
└── megatron_gpt_peft_lora_tuning.nemo

The last one was just trained on the PubmedQA dataset in the previous
notebook.
The last one was just trained on the PubMedQA dataset in the previous notebook.


3. Set-up NIM
3. Set up NIM.

From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:

@@ -167,12 +164,11 @@ From your host OS environment, start the NIM docker container while mounting the
-p 8000:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
The first time you run the command, it will download the model and cache it in ``$NIM_CACHE_PATH`` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in `NIM configuration <https://docs.nvidia.com/nim/large-language-models/latest/configuration.html>`__ documentation.
The first time you run the command, it will download the model and cache it in ``$NIM_CACHE_PATH`` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the `NIM configuration <https://docs.nvidia.com/nim/large-language-models/latest/configuration.html>`__ documentation.
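Before moving on, you can verify the deployment by polling the service and listing the models it is serving. This is a sketch assuming the container is mapped to port 8000 as above; with a LoRA model store attached, the loaded adapters should appear alongside the base model:

.. code:: bash

   # Wait until the service reports ready
   curl http://localhost:8000/v1/health/ready

   # List the base model and any loaded LoRA adapters
   curl http://localhost:8000/v1/models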


4. Start the notebook
4. Start the notebook.

From another terminal, follow the same instructions as the previous
notebook to launch Jupyter Lab, and navigate to `this notebook <./llama3-lora-deploy-nim.ipynb>`__.
From another terminal, follow the same instructions as the previous notebook to launch Jupyter Lab, and then navigate to `this notebook <./llama3-lora-deploy-nim.ipynb>`__.

You can use the same NeMo Framework docker container which already has Jupyter Lab installed.
You can use the same NeMo Framework docker container, which has Jupyter Lab already installed.
117 changes: 22 additions & 95 deletions tutorials/llm/llama-3/llama3-lora-nemofw.ipynb
@@ -15,7 +15,7 @@
"source": [
"This notebook showcases performing LoRA PEFT on **Llama 3 8B** using [PubMedQA](https://pubmedqa.github.io/) with NeMo Framework. PubMedQA is a Question-Answering dataset for biomedical texts.\n",
"\n",
"> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. Instructions are available in the associated tutorial README."
"> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. **Instructions are available in the associated tutorial README to download the model and the container.**"
]
},
{
@@ -32,122 +32,49 @@
},
{
"cell_type": "markdown",
"id": "deb6a910-a05e-4ae1-aac4-56e5092be2b4",
"metadata": {
"tags": []
},
"source": [
"---\n",
"## Step-by-step instructions\n",
"\n",
"This notebook is structured into six steps:\n",
"1. Download Llama-3-8B-Instruct from Hugging Face\n",
"2. Convert Llama-3-8B-Instruct to NeMo format\n",
"3. Prepare the dataset\n",
"4. Run the PEFT finetuning script\n",
"5. Inference with NeMo Framework\n",
"6. Check the model accuracy\n"
]
},
{
"cell_type": "markdown",
"id": "e1f8f06d-aa9b-49cf-b50b-023967fc9e1a",
"id": "0b285d5a-d838-423b-9d6c-65add61f48ce",
"metadata": {},
"source": [
"### Step 1: Download the model from Hugging Face"
]
},
{
"cell_type": "markdown",
"id": "b5c50597-53e9-4604-9b86-af4c8e6b027e",
"metadata": {},
"source": [
"> `NOTE:` Access to Meta-Llama-3-8B-Instruct is gated. Before you proceed, ensure that you have a Hugging Face account, and have requested the necessary permission from Hugging Face and Meta to download the model on the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) page. Then, you can use your Hugging Face [access token](https://huggingface.co/docs/hub/en/security-tokens) to download the model in the following code snippet, which we will then convert and customize with NeMo Framework."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f14a2ea5-309b-4f78-8524-313043e9daeb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import huggingface_hub\n",
"\n",
"# Set your Hugging Face access token\n",
"huggingface_hub.login(\"<YOUR_HUGGINGFACE_ACCESS_TOKEN>\")"
"---\n",
"## Before you begin\n",
"Ensure that you have the `Meta Llama3 8B Instruct .nemo` model downloaded and the corresponding folder mounted to the container."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99125f50",
"id": "3057e525-7957-45c0-bedc-c347d4811081",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"os.makedirs(\"./Meta-Llama-3-8B-Instruct\" ,exist_ok=True)\n",
"huggingface_hub.snapshot_download(repo_id=\"meta-llama/Meta-Llama-3-8B-Instruct\", local_dir=\"Meta-Llama-3-8B-Instruct\", local_dir_use_symlinks=False)"
]
},
{
"cell_type": "markdown",
"id": "18d5a8a9-41db-4186-a51a-a89d0501e1c0",
"metadata": {},
"source": [
"The Llama-3-8B-Instruct model will be downloaded to `./Meta-Llama-3-8B-Instruct`"
"!ls /workspace/llama-3-8b-instruct-nemo_v1.0"
]
},
{
"cell_type": "markdown",
"id": "49fc4629",
"metadata": {},
"source": [
"### Step 2: Convert Llama-3-8B-Instruct to NeMo format\n",
"\n",
"Run the below code to convert the model to the NeMo format. \n",
"\n",
"The generated `.nemo` file uses distributed checkpointing and can be loaded with any Tensor Parallel (TP) or Pipeline Parallel (PP) combination without reshaping or splitting. For more information on parallelisms in NeMo, refer to [NeMo Framework documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55331dd3",
"id": "deb6a910-a05e-4ae1-aac4-56e5092be2b4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# clear any previous temporary weights dir if any\n",
"rm -r model_weights\n",
"---\n",
"## Step-by-step instructions\n",
"\n",
"python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \\\n",
" --precision bf16 \\\n",
" --input_name_or_path=./Meta-Llama-3-8B-Instruct/ \\\n",
" --output_path=./Meta-Llama-3-8B-Instruct.nemo"
]
},
{
"cell_type": "markdown",
"id": "fafb86d7-6254-42d4-b9aa-ab8a723f90c1",
"metadata": {},
"source": [
"This will create a .nemo model file in current working directory."
"This notebook is structured into four steps:\n",
"1. Prepare the dataset\n",
"2. Run the PEFT finetuning script\n",
"3. Inference with NeMo Framework\n",
"4. Check the model accuracy"
]
},
{
"cell_type": "markdown",
"id": "8ea5bd31",
"metadata": {},
"source": [
"### Step 3: Prepare the dataset\n",
"### Step 1: Prepare the dataset\n",
"\n",
"Download the PubMedQA dataset and run the pre-processing script in the cloned directory."
]
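For reference, the dataset preparation in the notebook boils down to cloning the PubMedQA repository and generating its train/validation/test splits. A sketch, assuming the standard layout of the ``pubmedqa`` repository (script names may change upstream):

.. code:: bash

   # Clone the PubMedQA repo and create the pqa-labeled splits
   git clone https://github.com/pubmedqa/pubmedqa.git
   cd pubmedqa/preprocess
   python split_dataset.py pqal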
@@ -288,7 +215,7 @@
"metadata": {},
"source": [
"\n",
"### Step 4: Run PEFT finetuning script for LoRA\n",
"### Step 2: Run PEFT finetuning script for LoRA\n",
"\n",
"NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that can abstract away some of the lower-level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo is essentially just running this script!\n",
"\n",
@@ -309,7 +236,7 @@
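The fold below elides the full training command; for orientation, it has roughly this shape. This is only a sketch with a few of the relevant options — the option names assume NeMo's default ``megatron_gpt_finetuning`` config, and the actual cell sets additional trainer, logging, and batch-size options:

.. code:: bash

   # Minimal LoRA fine-tuning invocation (illustrative, not the full cell)
   torchrun --nproc_per_node=1 \
     /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
       trainer.devices=1 \
       trainer.max_steps=50 \
       model.restore_from_path=${MODEL} \
       model.data.train_ds.file_names=${TRAIN_DS} \
       model.data.validation_ds.file_names=${VALID_DS} \
       model.peft.peft_scheme="lora" \
       exp_manager.exp_dir=./results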
"%%bash\n",
"\n",
"# Set paths to the model, train, validation and test sets.\n",
"MODEL=\"./Meta-Llama-3-8B-Instruct.nemo\"\n",
"MODEL=\"/workspace/llama-3-8b-instruct-nemo_v1.0/8b_instruct_nemo_bf16.nemo\"\n",
"TRAIN_DS=\"[./pubmedqa/data/pubmedqa_train.jsonl]\"\n",
"VALID_DS=\"[./pubmedqa/data/pubmedqa_val.jsonl]\"\n",
"TEST_DS=\"[./pubmedqa/data/pubmedqa_test.jsonl]\"\n",
@@ -377,7 +304,7 @@
"tags": []
},
"source": [
"### Step 5: Inference with NeMo Framework\n",
"### Step 3: Inference with NeMo Framework\n",
"\n",
"Running text generation within the framework is also possible by running a Python script. Note that this is more for testing and validation, not a full-fledged deployment solution like NVIDIA NIM."
]
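For reference, the elided cells wrap NeMo's generation script. A sketch of the kind of invocation involved — the script path and option names assume NeMo's default tuning configs, and the adapter path is the checkpoint produced in the previous step; adjust to your own paths:

.. code:: bash

   # Illustrative only: generate test-set answers with the base model + LoRA adapter
   python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
       model.restore_from_path=${MODEL} \
       model.peft.restore_from_path=./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo \
       model.data.test_ds.file_names=${TEST_DS} \
       model.data.test_ds.output_file_path_prefix=./results/pubmedqa_test_preds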
@@ -454,7 +381,7 @@
"id": "2fe048f9",
"metadata": {},
"source": [
"### Step 6: Check the model accuracy\n",
"### Step 4: Check the model accuracy\n",
"\n",
"Now that the results are in, let's read them and calculate the accuracy on the PubMedQA task. You can compare your accuracy results with the public leaderboard at https://pubmedqa.github.io/.\n",
"\n",
@@ -565,8 +492,8 @@
"source": [
"For the Llama-3-8B-Instruct model, you should see accuracy comparable to the below:\n",
"```\n",
"Accuracy 0.786000\n",
"Macro-F1 0.550305\n",
"Accuracy 0.792000\n",
"Macro-F1 0.594778\n",
"```"
]
}
