
Adding Megatron-Style input data pipelines #3

Closed · wants to merge 80 commits

TJ-Solergibert (Collaborator)

What does this PR do?

The current version of Nanotron's input pipelines is based on Hugging Face Datasets and relies on clm_preprocess, which tokenises and preprocesses the entire dataset at the start of training. Because the result is tied to sequence_length, it is difficult to reuse across different experiments.

I have developed new input data pipelines based on the ones included in Megatron. Since I didn't want Nanotron to lose its essence, I removed many functionalities we don't need (such as those related to BERT pretraining). What I mainly modified is the torch.utils.data.Dataset; we keep the same Sampler, Collator and DataLoader (with slight modifications), so the behavior of other modules such as the PipelineEngine is not altered at all. The previous pipeline based on Hugging Face Datasets also remains usable, since I added the script run_train_nanoset.py to launch training with the new pipeline. A minimal sketch of the underlying idea is shown below.
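To make the idea concrete, here is a minimal sketch (not the actual Nanoset implementation; the class name, file layout and uint16 dtype are assumptions) of a Dataset that slices fixed-length windows out of a pre-tokenised token stream on disk, so tokenisation happens once and sequence_length only matters at read time:

import numpy as np
import torch
from torch.utils.data import Dataset

class TokenWindowDataset(Dataset):
    def __init__(self, token_file: str, sequence_length: int):
        # Memory-map the preprocessed token stream instead of loading it into RAM.
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.sequence_length = sequence_length

    def __len__(self) -> int:
        # One extra token per sample is needed to build the shifted labels.
        return (len(self.tokens) - 1) // self.sequence_length

    def __getitem__(self, idx: int) -> dict:
        start = idx * self.sequence_length
        chunk = self.tokens[start : start + self.sequence_length + 1]
        return {
            "input_ids": torch.from_numpy(chunk[:-1].astype(np.int64)),
            "labels": torch.from_numpy(chunk[1:].astype(np.int64)),
        }

A Dataset of this shape can be consumed by the existing DataLoader stack, which is why the rest of the training loop is largely unaffected.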

Relevant details:

  • The new input pipelines work with the same files as Megatron. In this PR I include references to that project for carrying out the data preprocessing, although we could include the scripts here as well; they could be simplified and the necessary dependencies added.
  • I included a new configuration in the .yaml file called NanosetDatasetsArgs, which can replace PretrainDatasetsArgs. You only need to specify the path to the dataset (generated by Megatron's preprocess_data.py, without the extension, as they specify) and the distribution of the dataset samples across the train, valid, and test splits (e.g. 949,50,1).
  • The Nanoset will be the new dataset format. It is a lighter version of GPTDataset and MegatronDataset from Megatron.
  • To build the datasets, we use the NanosetBuilder, which, based on a NanosetConfig (NanosetDatasetsArgs plus other details), builds a Nanoset for each split; a rough sketch of the split computation follows this list. In this first version we only support a single dataset file, but I will add support for multiple files (BlendedNanoset), hence keeping the NanosetBuilder.
  • Each Nanoset contains an MMapIndexedDataset. This object lives in indexed_dataset.py and originally comes from fairseq; Megatron also includes it as-is.
  • I have added a page to the documentation with more details on the preprocessing needed to create the datasets and on the internal workings, similar to what is included in Megatron.
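As referenced above, here is a rough sketch of how split weights such as 949,50,1 could be turned into per-split sample ranges over a single dataset. compute_split_ranges is a hypothetical helper; the real NanosetBuilder works on a NanosetConfig and an MMapIndexedDataset rather than a plain sample count:

from typing import Dict, Tuple

def compute_split_ranges(split: str, num_samples: int) -> Dict[str, Tuple[int, int]]:
    # Parse the "train,valid,test" weights and normalize them,
    # so they do not need to sum to 1.
    weights = [float(w) for w in split.split(",")]
    fractions = [w / sum(weights) for w in weights]

    ranges, start = {}, 0
    for name, frac in zip(("train", "valid", "test"), fractions):
        end = min(start + int(round(frac * num_samples)), num_samples)
        ranges[name] = (start, end)
        start = end
    return ranges

# compute_split_ranges("949,50,1", 1_000_000)
# -> {'train': (0, 949000), 'valid': (949000, 999000), 'test': (999000, 1000000)}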

I think we should centralize the input data pipelines and perhaps move the dataloader.py file to another location. I have also marked several functions in this file with # Question: comments that I propose moving.

To use the Nanoset datasets, specify the data_path and split fields in config.data.dataset in the .yaml file and launch training with run_train_nanoset.py, in the same way as run_train.py.

data:
  dataset:
    data_path: /mloscratch/homes/solergib/s-ai/nanotron/datasets/llama2/europarl-gpt-llama2_text_document
    split: 949,50,1
  num_loading_workers: 0
  seed: 1234
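
For illustration, a hedged sketch of how the dataset entry above could be read into a config object on the Python side; the data_path and split fields come from this PR, while the dataclass shape, loading code and config file name are assumptions:

from dataclasses import dataclass
import yaml

@dataclass
class NanosetDatasetsArgs:
    data_path: str  # prefix produced by Megatron's preprocess_data.py (no extension)
    split: str      # comma-separated train,valid,test weights, e.g. "949,50,1"

with open("config.yaml") as f:  # hypothetical config file name
    cfg = yaml.safe_load(f)

dataset_args = NanosetDatasetsArgs(**cfg["data"]["dataset"])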

I've published the wandb logs of the different tests I have carried out, comparing the HF Datasets pipeline and the new Nanoset pipeline with 1 and 4 GPUs, as well as resuming training from a checkpoint.

This is a first version; I am open to any suggestions you can think of!

Toni

xrsrke and others added 30 commits March 3, 2024 13:26
- Enhance docstring and type hints in get_local_rank for clarity
- Simplify parameter names in get_global_rank for readability
- Update tests for get_global_rank
- Attempt to fix a bug related to get_local_rank
TJ-Solergibert pushed a commit to TJ-Solergibert/nanotron that referenced this pull request May 27, 2024
ischlag commented Jul 2, 2024

Already integrated upstream and resolved with the most recent sync.

ischlag closed this Jul 2, 2024