diff --git a/examples/llm/slimpajama/pretraining.ipynb b/examples/llm/slimpajama/pretraining.ipynb
index 9c16ac371e6f1..50484ee63c1a3 100644
--- a/examples/llm/slimpajama/pretraining.ipynb
+++ b/examples/llm/slimpajama/pretraining.ipynb
@@ -1,22 +1,19 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Pretraining using SlimPajama\n",
+    "\n",
+    "Let's see how to use the data generated by the [data pipeline notebook](./data_pipeline.ipynb) to pretrain a model. All we need to do is define a data module based on the generated data and use it to replace the mock data module provided by default in the [NeMo LLM recipes](../../../nemo/collections/llm/recipes/__init__.py)."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/usr/local/lib/python3.10/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
-      "  from .autonotebook import tqdm as notebook_tqdm\n",
-      "[NeMo W 2024-10-28 22:56:08 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
-      "  cm = get_cmap(\"Set1\")\n",
-      "  \n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "import nemo_run as run\n",
     "from typing import Optional\n",
@@ -25,6 +22,14 @@
     "from nemo.collections.common.tokenizers import SentencePieceTokenizer"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Defining the data module\n",
+    "To define the data module, use `llm.PreTrainingDataModule` and pass in the data paths and tokenizer. If you don't have either of these yet, refer to the [data pipeline notebook](./data_pipeline.ipynb). You can also check the data module's definition for the other supported parameters, such as `split`, `num_workers`, and `index_mapping_dir`."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -44,12 +49,21 @@
     "        global_batch_size=gbs,\n",
     "        micro_batch_size=mbs,\n",
     "        tokenizer=run.Config(SentencePieceTokenizer, model_path=\"/data/tokenizer/tokenizer.model\"),\n",
-    "        split=\"99990,8,2\",\n",
+    "        split=\"99,8,2\",\n",
     "        num_workers=2,\n",
     "        index_mapping_dir=\"/data/index_mapping\",\n",
     "    )"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Configuring the recipe and launching pretraining\n",
+    "Once the data module is defined, you can take an existing recipe and swap in the data module as shown below.\n",
+    "To learn more about recipes, refer to the [quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
@@ -108,6 +122,14 @@
     "    run.run(recipe, executor=executor)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Run pretraining\n",
+    "Now simply call the `run_pretraining` function to start pretraining on your local machine using torchrun."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
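
For reviewers who want the full picture without opening the notebook, here is a minimal sketch of how the added cells compose end to end, written as a single Python script. It assumes the notebook builds on one of the stock NeMo 2.0 pretrain recipes; the `llama3_8b` recipe choice, the dataset path, and the checkpoint directory below are illustrative stand-ins rather than values taken from this diff:

```python
import nemo_run as run

from nemo.collections import llm
from nemo.collections.common.tokenizers import SentencePieceTokenizer


def slimpajama(gbs: int = 16, mbs: int = 1, seq_length: int = 8192):
    """Build a PreTrainingDataModule config over the preprocessed SlimPajama shards."""
    return run.Config(
        llm.PreTrainingDataModule,
        # Illustrative path to the Megatron-format output of the data pipeline notebook.
        paths=["/data/slimpajama/concatenated_chunk1_text_document"],
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(SentencePieceTokenizer, model_path="/data/tokenizer/tokenizer.model"),
        split="99,8,2",  # train/validation/test weights, matching the fix in this diff
        num_workers=2,
        index_mapping_dir="/data/index_mapping",
    )


def run_pretraining():
    # Assumption: any pretrain recipe would work here; llama3_8b is only an example.
    recipe = llm.llama3_8b.pretrain_recipe(
        name="slimpajama_pretraining",
        dir="/checkpoints",  # illustrative checkpoint directory
        num_nodes=1,
        num_gpus_per_node=8,
    )
    # Swap the recipe's default mock data module for the real SlimPajama one.
    recipe.data = slimpajama(gbs=16, mbs=1, seq_length=8192)
    # Launch locally, one torchrun task per GPU.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)


if __name__ == "__main__":
    run_pretraining()
```

Replacing only `recipe.data` leaves the rest of the recipe (model, trainer, optimizer, logging) untouched, which is exactly the substitution the new markdown cells describe.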