Update MM Dataprep Tutorial #8410

Merged 1 commit on Feb 13, 2024
27 changes: 9 additions & 18 deletions tutorials/multimodal/Multimodal Data Preparation.ipynb
@@ -2,27 +2,19 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multimodal Dataset Preparation\n",
"\n",
"First step of pre-training any deep learning model is data preparation. This notebook will walk you through 5 stages of data preparation for training a multimodal model: \n",
"The first step of pre-training any deep learning model is data preparation. This notebook will walk you through the 5 stages of data preparation for training a multimodal model:\n",
"1. Download your Data\n",
"2. Extract Images and Text\n",
"3. Re-organize to ensure uniform text-image pairs\n",
"4. Precache Encodings\n",
"5. Generate Metadata required for training\n",
"\n",
"This notebook will show you how to prepare an image-text dataset into the [WebDataset](https://github.com/webdataset/webdataset) format. The Webdataset format is required to train all multimodal models in NeMo, such as Stable Diffusion and Imagen. \n",
"\n",
"This notebook is designed to demonstrate the different stages of multimodal dataset preparation. It is not meant to be used to process large-scale datasets since many stages are too time-consuming to run without parallelism. For large workloads, we recommend running the multimodal dataset preparation pipeline with the NeMo-Megatron-Launcher on multiple processors/GPUs. NeMo-Megatron-Launcher packs the same 5 scripts in this notebook into one runnable command and one config file to enable a smooth and a streamlined workflow.\n",
"\n",
"Depending on your use case, not all 5 stages need to be run. Please go to (TODO doc link) for an overview of the 5 stages.\n",
" \n",
"We will use a [dummy dataset](https://huggingface.co/datasets/cuichenx/dummy-image-text-dataset) as the dataset example throughout this notebook. This dataset is formatted as a table with one column storing the text captions, and one column storing the URL link to download the corresponding image. This is the same format as most common text-image datasets. The use of this dummy dataset is for demonstration purposes only. **Each user is responsible for checking the content of the dataset and the applicable licenses to determine if it is suitable for the intended use.**\n",
"\n",
"Let's first set up some paths."
]
"5. Generate Metadata required for training\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
@@ -58,13 +50,12 @@
"id": "c06f3527",
"metadata": {},
"source": [
"# Multimodal Dataset Preparation\n",
"\n",
"This notebook will show you how to prepare an image-text dataset into the [WebDataset](https://github.com/webdataset/webdataset) format. The Webdataset format is required to train all multimodal models in NeMo, such as Stable Diffusion and Imagen. \n",
"\n",
"This notebook is designed to demonstrate the different stages of multimodal dataset preparation. It is not meant to be used to process large-scale datasets since many stages are too time-consuming to run without parallelism. For large workloads, we recommend running the multimodal dataset preparation pipeline with the NeMo-Megatron-Launcher on multiple processors/GPUs. NeMo-Megatron-Launcher packs the same 5 scripts in this notebook into one runnable command and one config file to enable a smooth and a streamlined workflow.\n",
"\n",
"Depending on your use case, not all 5 stages need to be run. Please go to (TODO doc link) for an overview of the 5 stages.\n",
"Depending on your use case, not all 5 stages need to be run. Please go to [NeMo Multimodal Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/multimodal/text2img/datasets.html) for an overview of the 5 stages.\n",
" \n",
"We will use a [dummy dataset](https://huggingface.co/datasets/cuichenx/dummy-image-text-dataset) as the dataset example throughout this notebook. This dataset is formatted as a table with one column storing the text captions, and one column storing the URL link to download the corresponding image. This is the same format as most common text-image datasets. The use of this dummy dataset is for demonstration purposes only. **Each user is responsible for checking the content of the dataset and the applicable licenses to determine if it is suitable for the intended use.**\n",
"\n",
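
Review note on the hunk above: the notebook text says the image-text data ends up in WebDataset format. As a minimal sketch of what that layout looks like (assuming the usual WebDataset convention that files in a tar archive sharing a basename form one sample), the helper below packs caption/image pairs into a tar. The function names, the six-digit key, and the `.jpg`/`.txt` suffixes are illustrative assumptions, not taken from the tutorial's scripts.

```python
import io
import tarfile


def write_pairs_to_tar(pairs, fileobj):
    """Write (image_bytes, caption) pairs as <key>.jpg / <key>.txt tar members.

    Hypothetical helper for illustration; not part of NeMo or the tutorial.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for index, (image_bytes, caption) in enumerate(pairs):
            # The shared basename is what groups the two files into one sample.
            key = f"{index:06d}"
            members = ((".jpg", image_bytes), (".txt", caption.encode("utf-8")))
            for suffix, payload in members:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def list_members(tar_bytes):
    """Return the member names of an in-memory tar archive, in stored order."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.getnames()
```

Keeping the image and its caption adjacent under one key mirrors stage 3 of the list in the first hunk ("Re-organize to ensure uniform text-image pairs"): a sample with a missing half is easy to detect by comparing basenames.
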
@@ -413,7 +404,7 @@
"id": "27b26036",
"metadata": {},
"source": [
"Let's download an example precaching config file ## TODO modify this path"
"Let's download an example precaching config file"
]
},
{
@@ -425,7 +416,7 @@
},
"outputs": [],
"source": [
"! wget TODO_github_link/precache_sd.yaml -P $CONF_DIR/"
"! wget https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/master/launcher_scripts/conf/data_preparation/multimodal/precache_sd.yaml -P $CONF_DIR/"
]
},
{
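
Review note on the `wget` line in the last hunk: the new URL is a github.com `/blob/` address, which serves the HTML file viewer rather than the file itself, so `wget` would likely save a web page under the name `precache_sd.yaml`. A minimal sketch of deriving the raw-content URL follows. The helper name is hypothetical, and the sketch assumes the branch or tag name contains no `/` and that the file exists at that path on the referenced branch.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url):
    """Map a github.com '/blob/' page URL to its raw.githubusercontent.com form.

    Hypothetical helper; assumes the ref (branch or tag) has no '/' in it.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *path])
```

Passing the URL from the diff through this helper yields an address under `raw.githubusercontent.com/NVIDIA/NeMo-Megatron-Launcher/master/...`, which could replace the `blob` link in the notebook cell.
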