
Add docs for pretraining notebook
Signed-off-by: Hemil Desai <[email protected]>
hemildesai committed Nov 6, 2024
1 parent 679379b commit be0dcae
Showing 1 changed file with 37 additions and 15 deletions.
52 changes: 37 additions & 15 deletions examples/llm/slimpajama/pretraining.ipynb
@@ -1,22 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pretraining using Slimpajama\n",
"\n",
"Let's see how we can use the data generated from the [data pipeline notebook](./data_pipeline.ipynb) to pretrain a model. All we need to do is define the data module based on the generated data and replace it with the mock data module provided by default in the [NeMo llm recipes](../../../nemo/collections/llm/recipes/__init__.py)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.10/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"[NeMo W 2024-10-28 22:56:08 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
" cm = get_cmap(\"Set1\")\n",
" \n"
]
}
],
"outputs": [],
"source": [
"import nemo_run as run\n",
"from typing import Optional\n",
@@ -25,6 +22,14 @@
"from nemo.collections.common.tokenizers import SentencePieceTokenizer"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Defining the data module\n",
+"To define the data module, we can use `llm.PreTrainingDataModule` and pass in the data paths and tokenizer. If you don't have either of the two, please refer to the [data pipeline notebook](./data_pipeline.ipynb). You can also look at the data module's definition for the other supported parameters, such as `split`, `num_workers`, and `index_mapping_dir`."
+]
+},
{
"cell_type": "code",
"execution_count": 2,
@@ -44,12 +49,21 @@
" global_batch_size=gbs,\n",
" micro_batch_size=mbs,\n",
" tokenizer=run.Config(SentencePieceTokenizer, model_path=\"/data/tokenizer/tokenizer.model\"),\n",
" split=\"99990,8,2\",\n",
" split=\"99,8,2\",\n",
" num_workers=2,\n",
" index_mapping_dir=\"/data/index_mapping\",\n",
" )"
]
},
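For context, here is a minimal sketch of what this partially collapsed cell might look like in full. The arguments visible in the diff (batch sizes, tokenizer path, `split`, `num_workers`, `index_mapping_dir`) are taken from the hunk above; the function name `slimpajama_data`, the default values, and the dataset path are assumptions based on the data pipeline notebook, not the exact committed code:

```python
import nemo_run as run
from nemo.collections import llm
from nemo.collections.common.tokenizers import SentencePieceTokenizer


def slimpajama_data(
    gbs: int = 16,           # global batch size (assumed default)
    mbs: int = 1,            # micro batch size (assumed default)
    seq_length: int = 8192,  # sequence length (assumed default)
) -> run.Config[llm.PreTrainingDataModule]:
    """Build a config for the SlimPajama data module."""
    return run.Config(
        llm.PreTrainingDataModule,
        # Indexed dataset produced by the data pipeline notebook;
        # this exact filename is an assumption.
        paths=["/data/slimpajama_megatron/concatenated_chunk1_text_document"],
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(
            SentencePieceTokenizer, model_path="/data/tokenizer/tokenizer.model"
        ),
        split="99,8,2",
        num_workers=2,
        index_mapping_dir="/data/index_mapping",
    )
```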
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Configuring the recipe and launching pretraining\n",
+"Once the data module is defined, you can take an existing recipe and replace its data module, as shown below.\n",
+"To learn more about recipes, refer to the [quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html)."
+]
+},
{
"cell_type": "code",
"execution_count": 3,
@@ -108,6 +122,14 @@
" run.run(recipe, executor=executor)"
]
},
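Most of this cell is collapsed in the diff; only the final `run.run(...)` call is visible. A hedged sketch of the pattern the added docs describe — start from an existing recipe, swap in the data module, and launch locally. The choice of the `llama3_8b` recipe, the checkpoint directory, and the GPU counts are illustrative assumptions:

```python
def run_pretraining():
    # Start from an existing pretraining recipe; llama3_8b is an illustrative
    # choice -- any recipe under nemo.collections.llm.recipes follows the same pattern.
    recipe = llm.llama3_8b.pretrain_recipe(
        name="slimpajama_pretraining",
        dir="/checkpoints",  # assumed checkpoint/log directory
        num_nodes=1,
        num_gpus_per_node=8,
    )
    # Replace the default mock data module with the SlimPajama one defined above.
    recipe.data = slimpajama_data(gbs=16, mbs=1, seq_length=8192)

    # A local executor that launches the job with torchrun, one task per GPU.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)
```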
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Run pretraining\n",
+"Now you can call the `run_pretraining` function to start pretraining on your local machine using torchrun."
+]
+},
{
"cell_type": "code",
"execution_count": null,
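The final cell is collapsed in this view; per the markdown above, it presumably just invokes the function. A minimal sketch:

```python
# Kick off pretraining on the local machine; nemo_run handles the torchrun launch.
run_pretraining()
```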
