@@ -0,0 +1,319 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Data Processing for NeMo 2.0 LLMs\n", | ||
"\n", | ||
"This tutorial will cover the steps to go from a raw pretraining dataset all the way to configuring the data module for pretraining using a NeMo 2.0 recipe.\n", | ||
"We will be using the [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B>) dataset for reference. We will also show how to exclude certain sources from the dataset, for instance, by default we will be excluding all data from the `RedPajamaBook` set.\n", | ||
"\n", | ||
"This tutorial involves four steps:\n", | ||
"\n", | ||
"1. Downloading data\n", | ||
"2. Extracting data\n", | ||
"3. Concatenating data\n", | ||
"4. Preprocessing data for NeMo 2.0/Megatron\n", | ||
"\n", | ||
"First, we'll define each step. Next, we will see how we can use NeMo-Run to execute the steps sequentially on your local workstation using Docker or on Slurm.\n", | ||
"\n", | ||
"### Pre-requisites\n", | ||
"This notebook assumes familiarity with [NeMo-Run](https://github.com/NVIDIA/NeMo-Run). Additionally, the docker execution and slurm execution steps require access to docker on your host and a remote slurm cluster respectively.\n", | ||
"Additionally, you will have to complete the following steps:\n", | ||
"\n", | ||
"1. Set HOST_DATA_PATH in the first cell to a parent folder on your workstation where you want to save the data.\n", | ||
"1. Create directories `HOST_DATA_PATH/tokenizer` and `HOST_DATA_PATH/slimpajama`.\n", | ||
"1. Download the Llama 3 `tokenizer.model` file either from [Huggingface](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) or https://www.llama.com/llama-downloads/ and place it at `{HOST_DATA_PATH}/tokenizer/tokenizer.model`.\n", | ||
" For HF, you can do it by running \n", | ||
" ```bash\n", | ||
" HF_TOKEN=... huggingface-cli download meta-llama/Llama-2-7B tokenizer.model --local-dir {HOST_DATA_PATH}/tokenizer/\n", | ||
" ```\n", | ||
"\n", | ||
"> [!NOTE]\n", | ||
"> All code for this tutorial can be found at https://github.com/NVIDIA/NeMo/tree/main/examples/llm/slimpajama." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import nemo_run as run\n", | ||
"\n", | ||
"from data.download import download_slimpajama\n", | ||
"from data.extract import run_extraction\n", | ||
"from data.preprocess import preprocess_data\n", | ||
"\n", | ||
"HOST_DATA_PATH = \"/home/hemild/dev/data\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Downloading Data\n", | ||
"\n", | ||
"First, we will configure the task to download data from Huggingface. We will use the Huggingface CLI for this. The function that configures the download script can be found [here](./data/download.py)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"download_task = download_slimpajama(\n", | ||
" include_pattern='--include \"train/chunk1/*_100*zst\"',\n", | ||
")\n", | ||
"\n", | ||
"# The configured script looks like below\n", | ||
"print(download_task.inline)" | ||
] | ||
}, | ||
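{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If you want to prototype the same download directly in Python, the `huggingface_hub` library offers an equivalent entry point. The snippet below is a minimal sketch, not the code `download_slimpajama` actually runs; the target directory is illustrative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal sketch: fetch the same subset with huggingface_hub instead of the\n", | ||
"# configured CLI script. This is an illustrative alternative, not what\n", | ||
"# download_slimpajama uses internally.\n", | ||
"from huggingface_hub import snapshot_download\n", | ||
"\n", | ||
"snapshot_download(\n", | ||
"    repo_id=\"cerebras/SlimPajama-627B\",\n", | ||
"    repo_type=\"dataset\",\n", | ||
"    allow_patterns=[\"train/chunk1/*_100*zst\"],  # same subset as include_pattern above\n", | ||
"    local_dir=f\"{HOST_DATA_PATH}/slimpajama\",\n", | ||
")" | ||
] | ||
}, | ||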
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Extracting Data\n", | ||
"\n", | ||
"The downloaded data is in compressed zst format. We need to extract it into jsonl files. For that, we will configure the `extract_data` function defined [here](./data/extract.py). This function also allows excluding certain sources. By default, we exclude all data from `RedPajamaBook` set, but this setting is configurable." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"run_extraction??" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"extract_task = run.Partial(run_extraction, data_dir=\"/data/slimpajama\")\n", | ||
"extract_task" | ||
] | ||
}, | ||
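{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For reference, decompressing a single `.zst` shard to JSONL is a straightforward stream decompression. The sketch below uses the `zstandard` package with illustrative file names; the actual `run_extraction` adds directory traversal and source exclusion on top of this." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal sketch of extracting one shard, assuming the `zstandard` package\n", | ||
"# is installed. Paths are illustrative; run_extraction handles the full\n", | ||
"# dataset and the RedPajamaBook exclusion.\n", | ||
"import zstandard as zstd\n", | ||
"\n", | ||
"def extract_shard(src: str, dst: str) -> None:\n", | ||
"    dctx = zstd.ZstdDecompressor()\n", | ||
"    with open(src, \"rb\") as fin, open(dst, \"wb\") as fout:\n", | ||
"        dctx.copy_stream(fin, fout)  # stream-decompress zst -> raw JSONL bytes\n", | ||
"\n", | ||
"# extract_shard(f\"{HOST_DATA_PATH}/slimpajama/train/chunk1/example.jsonl.zst\",\n", | ||
"#               f\"{HOST_DATA_PATH}/slimpajama/train/chunk1/example.jsonl\")" | ||
] | ||
}, | ||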
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Concatenating Data\n", | ||
"\n", | ||
"This optional step concatenates small jsonl files into a single large jsonl files. The example script is [here](./data/concat.sh) but feel free to change it based on your needs." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"concat_task = run.Script(\"/nemo_run/code/data/concat.sh\", args=[\"/data/slimpajama/train\", \"1\"])\n", | ||
"concat_task" | ||
] | ||
}, | ||
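{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In Python terms, the concatenation step just streams many small JSONL files into one larger file. Below is a minimal sketch with illustrative paths; the experiment itself runs `concat.sh`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal Python sketch of the concatenation step; the experiment runs\n", | ||
"# data/concat.sh instead. Paths and the output name are illustrative.\n", | ||
"import shutil\n", | ||
"from pathlib import Path\n", | ||
"\n", | ||
"def concat_jsonl(src_dir: str, out_file: str) -> None:\n", | ||
"    with open(out_file, \"wb\") as fout:\n", | ||
"        for path in sorted(Path(src_dir).glob(\"*.jsonl\")):\n", | ||
"            with open(path, \"rb\") as fin:\n", | ||
"                shutil.copyfileobj(fin, fout)  # JSONL files concatenate line-wise\n", | ||
"\n", | ||
"# concat_jsonl(f\"{HOST_DATA_PATH}/slimpajama/train/chunk1\",\n", | ||
"#              f\"{HOST_DATA_PATH}/slimpajama/train/concat_chunk1.jsonl\")" | ||
] | ||
}, | ||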
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Preprocessing Data\n", | ||
"\n", | ||
"This final step preprocesses the jsonl files to the bin and idx files required by NeMo and Megatron. It uses the `preprocess_data` function defined [here](./data/preprocess.py)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_data??" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_task = run.Partial(\n", | ||
" preprocess_data,\n", | ||
" data_dir=\"/data/slimpajama\",\n", | ||
" output_dir=\"/data/slimpajama_megatron\",\n", | ||
" tokenizer_model=\"/data/tokenizer/tokenizer.model\",\n", | ||
" tokenizer_library=\"sentencepiece\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"preprocess_task" | ||
] | ||
}, | ||
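{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Conceptually, preprocessing tokenizes the `text` field of every JSONL record and serializes the token IDs into Megatron's indexed `.bin`/`.idx` format. The toy sketch below shows only the tokenization half using the `sentencepiece` package; the binary serialization is handled inside `preprocess_data`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Toy sketch of the tokenization that preprocessing performs. It does NOT\n", | ||
"# write the .bin/.idx files (preprocess_data does that); paths illustrative.\n", | ||
"import json\n", | ||
"\n", | ||
"import sentencepiece as spm\n", | ||
"\n", | ||
"sp = spm.SentencePieceProcessor(model_file=f\"{HOST_DATA_PATH}/tokenizer/tokenizer.model\")\n", | ||
"\n", | ||
"def tokenize_jsonl(path: str):\n", | ||
"    with open(path, encoding=\"utf-8\") as f:\n", | ||
"        for line in f:\n", | ||
"            yield sp.encode(json.loads(line)[\"text\"])  # token IDs per document" | ||
] | ||
}, | ||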
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Putting it all together\n", | ||
"\n", | ||
"Now that all the tasks are configured, lets define an executor to run them on and an experiment to run them sequeuntially. \n", | ||
"\n", | ||
"> [!NOTE]\n", | ||
"> Each task can be run individually, or in any combination. The notebook runs all tasks sequentially. To remove a task, just remove the corresponding `exp.add(...)` for that corresponding task.\n", | ||
"> This customization is handy if you already have `jsonl` files processed, for example, from NeMo-Curator." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Let's define a local executor to run the experiment locally.\n", | ||
"def docker_executor(host_data_path: str):\n", | ||
" packager = run.GitArchivePackager(subpath=\"examples/llm/slimpajama\") # This will package all code inside the folder. NOTE: only committed changes are packaged, so if you make a change, make sure to commit it.\n", | ||
" executor = run.DockerExecutor(\n", | ||
" packager=packager,\n", | ||
" ipc_mode=\"host\",\n", | ||
" shm_size=\"30g\",\n", | ||
" env_vars={\"PYTHONUNBUFFERED\": \"1\"},\n", | ||
" volumes=[f\"{host_data_path}:/data\"],\n", | ||
" container_image=\"python:3.11\",\n", | ||
" ulimits=[\"memlock:-1\", \"stack:67108864\"],\n", | ||
" )\n", | ||
" return executor" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"executor = docker_executor(host_data_path=\"/home/hemild/dev/data\")\n", | ||
"with run.Experiment(\"slimpajama-data-pipeline\") as exp:\n", | ||
" exp.add(download_task, name=\"download_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" # Use NeMo image for the remaining tasks\n", | ||
" executor.container_image = \"nvcr.io/nvidia/nemo:dev\"\n", | ||
" exp.add(extract_task, name=\"extract_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" # examples/llm/slimpajama is automatically mounted to /nemo_run/code\n", | ||
" exp.add(concat_task, name=\"concat_slimpajama\", executor=executor)\n", | ||
" exp.add(preprocess_task, name=\"preprocess_slimpajama\", executor=executor)\n", | ||
"\n", | ||
" exp.run(sequential=True, tail_logs=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If the experiment runs succesfully, you will see the bin and idx files as shown below. These can directly be used in NeMo and Megatron Data Loaders." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!ls {HOST_DATA_PATH}/slimpajama_megatron" | ||
] | ||
}, | ||
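{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To close the loop on the goal stated at the top, the preprocessed prefix can then be plugged into a NeMo 2.0 data module for a pretraining recipe. The sketch below is only illustrative: the path prefix, sequence length, and batch sizes are assumptions, so check your output directory for the actual file prefix." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch of wiring the preprocessed data into a NeMo 2.0 pretraining recipe.\n", | ||
"# The path prefix and batch/sequence settings below are illustrative.\n", | ||
"from nemo.collections import llm\n", | ||
"from nemo.collections.common.tokenizers import SentencePieceTokenizer\n", | ||
"\n", | ||
"data_module = llm.PreTrainingDataModule(\n", | ||
"    paths=[f\"{HOST_DATA_PATH}/slimpajama_megatron/concatenated_chunk1_text_document\"],  # illustrative prefix\n", | ||
"    seq_length=2048,\n", | ||
"    micro_batch_size=1,\n", | ||
"    global_batch_size=8,\n", | ||
"    tokenizer=SentencePieceTokenizer(model_path=f\"{HOST_DATA_PATH}/tokenizer/tokenizer.model\"),\n", | ||
")" | ||
] | ||
}, | ||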
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You can also run the same experiment on a remote cluster like Slurm by replacing the docker executor with a slurm executor. A sample definition of slurm executor looks like\n", | ||
"\n", | ||
"```python\n", | ||
"def slurm_executor(\n", | ||
" user: str,\n", | ||
" host: str,\n", | ||
" remote_job_dir: str,\n", | ||
" account: str,\n", | ||
" partition: str,\n", | ||
" nodes: int,\n", | ||
" tasks_per_node: int,\n", | ||
" time: str = \"04:00:00\",\n", | ||
" custom_mounts: Optional[list[str]] = None,\n", | ||
" custom_env_vars: Optional[dict[str, str]] = None,\n", | ||
" container_image: str = \"nvcr.io/nvidia/nemo:dev\",\n", | ||
" retries: int = 0,\n", | ||
") -> run.SlurmExecutor:\n", | ||
" if not (user and host and remote_job_dir and account and partition and nodes and tasks_per_node):\n", | ||
" raise RuntimeError(\n", | ||
" \"Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function.\"\n", | ||
" )\n", | ||
"\n", | ||
" mounts = []\n", | ||
" if custom_mounts:\n", | ||
" mounts.extend(custom_mounts)\n", | ||
"\n", | ||
" env_vars = {\n", | ||
" \"NVIDIA_VISIBLE_DEVICES\": \"void\", # Might be needed for CPU only nodes with NeMo docker image\n", | ||
" }\n", | ||
" if custom_env_vars:\n", | ||
" env_vars |= custom_env_vars\n", | ||
"\n", | ||
" executor = run.SlurmExecutor(\n", | ||
" account=account,\n", | ||
" partition=partition,\n", | ||
" tunnel=run.SSHTunnel(\n", | ||
" user=user,\n", | ||
" host=host,\n", | ||
" job_dir=remote_job_dir,\n", | ||
" identity=\"/path/to/identity/file/for/ssh/to/cluster\", # OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password\n", | ||
" ),\n", | ||
" nodes=nodes,\n", | ||
" ntasks_per_node=tasks_per_node,\n", | ||
" mem=\"0\",\n", | ||
" exclusive=True,\n", | ||
" packager=run.GitArchivePackager(subpath=\"examples/llm/slimpajama\"),\n", | ||
" )\n", | ||
"\n", | ||
" executor.container_image = container_image\n", | ||
" executor.container_mounts = mounts\n", | ||
" executor.env_vars = env_vars\n", | ||
" executor.retries = retries\n", | ||
" executor.time = time\n", | ||
"\n", | ||
" return executor\n", | ||
"```" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |