@@ -0,0 +1,319 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Data Processing for NeMo 2.0 LLMs\n", | ||
"\n", | ||
"This tutorial will cover the steps to go from a raw pretraining dataset all the way to configuring the data module for pretraining using a NeMo 2.0 recipe.\n", | ||
"We will be using the [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B>) dataset for reference. We will also show how to exclude certain sources from the dataset, for instance, by default we will be excluding all data from the `RedPajamaBook` set.\n", | ||
"\n", | ||
"This tutorial involves four steps:\n", | ||
"\n", | ||
"1. Downloading data\n", | ||
"2. Extracting data\n", | ||
"3. Concatenating data\n", | ||
"4. Preprocessing data for NeMo 2.0/Megatron\n", | ||
"\n", | ||
"First, we'll define each step. Next, we will see how we can use NeMo-Run to execute the steps sequentially on your local workstation using Docker or on Slurm.\n", | ||
"\n", | ||
"### Pre-requisites\n", | ||
"This notebook assumes familiarity with [NeMo-Run](https://github.com/NVIDIA/NeMo-Run). Additionally, the docker execution and slurm execution steps require access to docker on your host and a remote slurm cluster respectively.\n", | ||
"Additionally, you will have to complete the following steps:\n", | ||
"\n", | ||
"1. Set HOST_DATA_PATH in the first cell to a parent folder on your workstation where you want to save the data.\n", | ||
"1. Create directories `HOST_DATA_PATH/tokenizer` and `HOST_DATA_PATH/slimpajama`.\n", | ||
"1. Download the Llama 3 `tokenizer.model` file either from [Huggingface](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) or https://www.llama.com/llama-downloads/ and place it at `{HOST_DATA_PATH}/tokenizer/tokenizer.model`.\n", | ||
" For HF, you can do it by running \n", | ||
" ```bash\n", | ||
" HF_TOKEN=... huggingface-cli download meta-llama/Llama-2-7B tokenizer.model --local-dir {HOST_DATA_PATH}/tokenizer/\n", | ||
" ```\n", | ||
"\n", | ||
"> [!NOTE]\n", | ||
"> All code for this tutorial can be found at https://github.com/NVIDIA/NeMo/tree/main/examples/llm/slimpajama." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import nemo_run as run\n", | ||
"\n", | ||
"from data.download import download_slimpajama\n", | ||
"from data.extract import run_extraction\n", | ||
"from data.preprocess import preprocess_data\n", | ||
"\n", | ||
"HOST_DATA_PATH = \"/home/hemild/dev/data\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Downloading Data\n", | ||
"\n", | ||
"First, we will configure the task to download data from Huggingface. We will use the Huggingface CLI for this. The function that configures the download script can be found [here](./data/download.py)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"download_task = download_slimpajama(\n", | ||
" include_pattern='--include \"train/chunk1/*_100*zst\"',\n", | ||
")\n", | ||
"\n", | ||
"# The configured script looks like below\n", | ||
"print(download_task.inline)" | ||
] | ||
}, | ||
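{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If you want to prototype the same download directly in Python, the `huggingface_hub` library offers an equivalent entry point. The snippet below is a minimal sketch, not the code `download_slimpajama` actually runs; the target directory is illustrative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal sketch: fetch the same subset with huggingface_hub instead of the\n", | ||
"# configured CLI script. This is an illustrative alternative, not what\n", | ||
"# download_slimpajama uses internally.\n", | ||
"from huggingface_hub import snapshot_download\n", | ||
"\n", | ||
"snapshot_download(\n", | ||
"    repo_id=\"cerebras/SlimPajama-627B\",\n", | ||
"    repo_type=\"dataset\",\n", | ||
"    allow_patterns=[\"train/chunk1/*_100*zst\"],  # same subset as include_pattern above\n", | ||
"    local_dir=f\"{HOST_DATA_PATH}/slimpajama\",\n", | ||
")" | ||
] | ||
}, | ||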
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Extracting Data\n", | ||
"\n", | ||
"The downloaded data is in compressed zst format. We need to extract it into jsonl files. For that, we will configure the `extract_data` function defined [here](./data/extract.py). This function also allows excluding certain sources. By default, we exclude all data from `RedPajamaBook` set, but this setting is configurable." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"run_extraction??" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"extract_task = run.Partial(run_extraction, data_dir=\"/data/slimpajama\")\n", | ||
"extract_task" | ||
] | ||
}, | ||
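{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For reference, decompressing a single `.zst` shard to JSONL is a straightforward stream decompression. The sketch below uses the `zstandard` package with illustrative file names; the actual `run_extraction` adds directory traversal and source exclusion on top of this." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal sketch of extracting one shard, assuming the `zstandard` package\n", | ||
"# is installed. Paths are illustrative; run_extraction handles the full\n", | ||
"# dataset and the RedPajamaBook exclusion.\n", | ||
"import zstandard as zstd\n", | ||
"\n", | ||
"def extract_shard(src: str, dst: str) -> None:\n", | ||
"    dctx = zstd.ZstdDecompressor()\n", | ||
"    with open(src, \"rb\") as fin, open(dst, \"wb\") as fout:\n", | ||
"        dctx.copy_stream(fin, fout)  # stream-decompress zst -> raw JSONL bytes\n", | ||
"\n", | ||
"# extract_shard(f\"{HOST_DATA_PATH}/slimpajama/train/chunk1/example.jsonl.zst\",\n", | ||
"#               f\"{HOST_DATA_PATH}/slimpajama/train/chunk1/example.jsonl\")" | ||
] | ||
}, | ||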
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Concatenating Data\n", | ||
"\n", | ||
"This optional step concatenates small jsonl files into a single large jsonl files. The example script is [here](./data/concat.sh) but feel free to change it based on your needs." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"concat_task = run.Script(\"/nemo_run/code/data/concat.sh\", args=[\"/data/slimpajama/train\", \"1\"])\n", | ||
"concat_task" | ||
] | ||
}, | ||
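{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In Python terms, the concatenation step just streams many small JSONL files into one larger file. Below is a minimal sketch with illustrative paths; the experiment itself runs `concat.sh`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal Python sketch of the concatenation step; the experiment runs\n", | ||
"# data/concat.sh instead. Paths and the output name are illustrative.\n", | ||
"import shutil\n", | ||
"from pathlib import Path\n", | ||
"\n", | ||
"def concat_jsonl(src_dir: str, out_file: str) -> None:\n", | ||
"    with open(out_file, \"wb\") as fout:\n", | ||
"        for path in sorted(Path(src_dir).glob(\"*.jsonl\")):\n", | ||
"            with open(path, \"rb\") as fin:\n", | ||
"                shutil.copyfileobj(fin, fout)  # JSONL files concatenate line-wise\n", | ||
"\n", | ||
"# concat_jsonl(f\"{HOST_DATA_PATH}/slimpajama/train/chunk1\",\n", | ||
"#              f\"{HOST_DATA_PATH}/slimpajama/train/concat_chunk1.jsonl\")" | ||
] | ||
}, | ||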
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Preprocessing Data\n", | ||
"\n", | ||
"This final step preprocesses the jsonl files to the bin and idx files required by NeMo and Megatron. It uses the `preprocess_data` function defined [here](./data/preprocess.py)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_data??" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_task = run.Partial(\n", | ||
" preprocess_data,\n", | ||
" data_dir=\"/data/slimpajama\",\n", | ||
" output_dir=\"/data/slimpajama_megatron\",\n", | ||
" tokenizer_model=\"/data/tokenizer/tokenizer.model\",\n", | ||
" tokenizer_library=\"sentencepiece\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_task" | ||
] | ||
}, | ||
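{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Conceptually, preprocessing tokenizes the `text` field of every JSONL record and serializes the token IDs into Megatron's indexed `.bin`/`.idx` format. The toy sketch below shows only the tokenization half using the `sentencepiece` package; the binary serialization is handled inside `preprocess_data`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Toy sketch of the tokenization that preprocessing performs. It does NOT\n", | ||
"# write the .bin/.idx files (preprocess_data does that); paths illustrative.\n", | ||
"import json\n", | ||
"\n", | ||
"import sentencepiece as spm\n", | ||
"\n", | ||
"sp = spm.SentencePieceProcessor(model_file=f\"{HOST_DATA_PATH}/tokenizer/tokenizer.model\")\n", | ||
"\n", | ||
"def tokenize_jsonl(path: str):\n", | ||
"    with open(path, encoding=\"utf-8\") as f:\n", | ||
"        for line in f:\n", | ||
"            yield sp.encode(json.loads(line)[\"text\"])  # token IDs per document" | ||
] | ||
}, | ||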
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Putting it all together\n", | ||
"\n", | ||
"Now that all the tasks are configured, lets define an executor to run them on and an experiment to run them sequeuntially. \n", | ||
"\n", | ||
"> [!NOTE]\n", | ||
"> Each task can be run individually, or in any combination. The notebook runs all tasks sequentially. To remove a task, just remove the corresponding `exp.add(...)` for that corresponding task.\n", | ||
"> This customization is handy if you already have `jsonl` files processed, for example, from NeMo-Curator." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Let's define a local executor to run the experiment locally.\n", | ||
"def docker_executor(host_data_path: str):\n", | ||
" packager = run.GitArchivePackager(subpath=\"examples/llm/slimpajama\") # This will package all code inside the folder. NOTE: only committed changes are packaged, so if you make a change, make sure to commit it.\n", | ||
" executor = run.DockerExecutor(\n", | ||
" packager=packager,\n", | ||
" ipc_mode=\"host\",\n", | ||
" shm_size=\"30g\",\n", | ||
" env_vars={\"PYTHONUNBUFFERED\": \"1\"},\n", | ||
" volumes=[f\"{host_data_path}:/data\"],\n", | ||
" container_image=\"python:3.11\",\n", | ||
" ulimits=[\"memlock:-1\", \"stack:67108864\"],\n", | ||
" )\n", | ||
" return executor" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"executor = docker_executor(host_data_path=\"/home/hemild/dev/data\")\n", | ||
"with run.Experiment(\"slimpajama-data-pipeline\") as exp:\n", | ||
" exp.add(download_task, name=\"download_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" # Use NeMo image for the remaining tasks\n", | ||
" executor.container_image = \"nvcr.io/nvidia/nemo:dev\"\n", | ||
" exp.add(extract_task, name=\"extract_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" # examples/llm/slimpajama is automatically mounted to /nemo_run/code\n", | ||
" exp.add(concat_task, name=\"concat_slimpajama\", executor=executor)\n", | ||
" exp.add(preprocess_task, name=\"preprocess_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" exp.run(sequential=True, tail_logs=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If the experiment runs succesfully, you will see the bin and idx files as shown below. These can directly be used in NeMo and Megatron Data Loaders." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!ls {HOST_DATA_PATH}/slimpajama_megatron" | ||
] | ||
}, | ||
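{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To close the loop on the goal stated at the top, the preprocessed prefix can then be plugged into a NeMo 2.0 data module for a pretraining recipe. The sketch below is only illustrative: the path prefix, sequence length, and batch sizes are assumptions, so check your output directory for the actual file prefix." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch of wiring the preprocessed data into a NeMo 2.0 pretraining recipe.\n", | ||
"# The path prefix and batch/sequence settings below are illustrative.\n", | ||
"from nemo.collections import llm\n", | ||
"from nemo.collections.common.tokenizers import SentencePieceTokenizer\n", | ||
"\n", | ||
"data_module = llm.PreTrainingDataModule(\n", | ||
"    paths=[f\"{HOST_DATA_PATH}/slimpajama_megatron/concatenated_chunk1_text_document\"],  # illustrative prefix\n", | ||
"    seq_length=2048,\n", | ||
"    micro_batch_size=1,\n", | ||
"    global_batch_size=8,\n", | ||
"    tokenizer=SentencePieceTokenizer(model_path=f\"{HOST_DATA_PATH}/tokenizer/tokenizer.model\"),\n", | ||
")" | ||
] | ||
}, | ||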
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You can also run the same experiment on a remote cluster like Slurm by replacing the docker executor with a slurm executor. A sample definition of slurm executor looks like\n", | ||
"\n", | ||
"```python\n", | ||
"def slurm_executor(\n", | ||
" user: str,\n", | ||
" host: str,\n", | ||
" remote_job_dir: str,\n", | ||
" account: str,\n", | ||
" partition: str,\n", | ||
" nodes: int,\n", | ||
" tasks_per_node: int,\n", | ||
" time: str = \"04:00:00\",\n", | ||
" custom_mounts: Optional[list[str]] = None,\n", | ||
" custom_env_vars: Optional[dict[str, str]] = None,\n", | ||
" container_image: str = \"nvcr.io/nvidia/nemo:dev\",\n", | ||
" retries: int = 0,\n", | ||
") -> run.SlurmExecutor:\n", | ||
" if not (user and host and remote_job_dir and account and partition and nodes and tasks_per_node):\n", | ||
" raise RuntimeError(\n", | ||
" \"Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function.\"\n", | ||
" )\n", | ||
"\n", | ||
" mounts = []\n", | ||
" if custom_mounts:\n", | ||
" mounts.extend(custom_mounts)\n", | ||
"\n", | ||
" env_vars = {\n", | ||
" \"NVIDIA_VISIBLE_DEVICES\": \"void\", # Might be needed for CPU only nodes with NeMo docker image\n", | ||
" }\n", | ||
" if custom_env_vars:\n", | ||
" env_vars |= custom_env_vars\n", | ||
"\n", | ||
" executor = run.SlurmExecutor(\n", | ||
" account=account,\n", | ||
" partition=partition,\n", | ||
" tunnel=run.SSHTunnel(\n", | ||
" user=user,\n", | ||
" host=host,\n", | ||
" job_dir=remote_job_dir,\n", | ||
" identity=\"/path/to/identity/file/for/ssh/to/cluster\", # OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password\n", | ||
" ),\n", | ||
" nodes=nodes,\n", | ||
" ntasks_per_node=tasks_per_node,\n", | ||
" mem=\"0\",\n", | ||
" exclusive=True,\n", | ||
" packager=run.GitArchivePackager(subpath=\"examples/llm/slimpajama\"),\n", | ||
" )\n", | ||
"\n", | ||
" executor.container_image = container_image\n", | ||
" executor.container_mounts = mounts\n", | ||
" executor.env_vars = env_vars\n", | ||
" executor.retries = retries\n", | ||
" executor.time = time\n", | ||
"\n", | ||
" return executor\n", | ||
"```" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |