
Adding Megatron-Style input data pipelines #3

Closed · wants to merge 80 commits

TJ-Solergibert (Collaborator)

What does this PR do?

The current version of Nanotron's input pipelines is based on Hugging Face Datasets and relies on clm_preprocess, which tokenises and preprocesses the entire dataset at the start of training. Because the result is tied to sequence_length, it is difficult to reuse across different experiments.

I have developed new input data pipelines based on the ones included in Megatron. Since I didn't want Nanotron to lose its essence, I removed many functionalities we don't need (such as those related to BERT pretraining). What I mainly modified is the torch.utils.data.Dataset; we keep the same Sampler, Collator and DataLoader (with slight modifications), so the behavior of other modules such as the PipelineEngine is not altered at all. The previous pipeline based on Hugging Face Datasets also remains usable, since I added the script run_train_nanoset.py to launch training with the new pipeline. A minimal sketch of the underlying idea is shown below.
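To make the idea concrete, here is a minimal sketch (not the actual Nanoset implementation; the class name, file layout and uint16 dtype are assumptions) of a Dataset that slices fixed-length windows out of a pre-tokenised token stream on disk, so tokenisation happens once and sequence_length only matters at read time:

import numpy as np
import torch
from torch.utils.data import Dataset

class TokenWindowDataset(Dataset):
    def __init__(self, token_file: str, sequence_length: int):
        # Memory-map the preprocessed token stream instead of loading it into RAM.
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.sequence_length = sequence_length

    def __len__(self) -> int:
        # One extra token per sample is needed to build the shifted labels.
        return (len(self.tokens) - 1) // self.sequence_length

    def __getitem__(self, idx: int) -> dict:
        start = idx * self.sequence_length
        chunk = self.tokens[start : start + self.sequence_length + 1]
        return {
            "input_ids": torch.from_numpy(chunk[:-1].astype(np.int64)),
            "labels": torch.from_numpy(chunk[1:].astype(np.int64)),
        }

A Dataset of this shape can be consumed by the existing DataLoader stack, which is why the rest of the training loop is largely unaffected.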

Relevant details:

  • The new input pipelines work with the same files as Megatron. In this PR I include references to that project for carrying out the data preprocessing, although we could include the scripts here as well; they could be simplified and the necessary dependencies added.
  • I included a new configuration in the .yaml file called NanosetDatasetsArgs, which can replace PretrainDatasetsArgs. You only need to specify the path to the dataset (generated by Megatron's preprocess_data.py, without the extension, as they specify) and the distribution of the dataset samples across the train, valid, and test splits (e.g. 949,50,1).
  • The Nanoset will be the new dataset format. It is a lighter version of GPTDataset and MegatronDataset from Megatron.
  • To build the datasets, we use the NanosetBuilder, which, based on a NanosetConfig (NanosetDatasetsArgs plus other details), builds a Nanoset for each split; a rough sketch of the split computation follows this list. In this first version we only support a single dataset file, but I will add support for multiple files (BlendedNanoset), hence keeping the NanosetBuilder.
  • Each Nanoset contains an MMapIndexedDataset. This object lives in indexed_dataset.py and originally comes from fairseq; Megatron also includes it as-is.
  • I have added a page to the documentation with more details on the preprocessing needed to create the datasets and on the internal workings, similar to what is included in Megatron.
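As referenced above, here is a rough sketch of how split weights such as 949,50,1 could be turned into per-split sample ranges over a single dataset. compute_split_ranges is a hypothetical helper; the real NanosetBuilder works on a NanosetConfig and an MMapIndexedDataset rather than a plain sample count:

from typing import Dict, Tuple

def compute_split_ranges(split: str, num_samples: int) -> Dict[str, Tuple[int, int]]:
    # Parse the "train,valid,test" weights and normalize them,
    # so they do not need to sum to 1.
    weights = [float(w) for w in split.split(",")]
    fractions = [w / sum(weights) for w in weights]

    ranges, start = {}, 0
    for name, frac in zip(("train", "valid", "test"), fractions):
        end = min(start + int(round(frac * num_samples)), num_samples)
        ranges[name] = (start, end)
        start = end
    return ranges

# compute_split_ranges("949,50,1", 1_000_000)
# -> {'train': (0, 949000), 'valid': (949000, 999000), 'test': (999000, 1000000)}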

I think we should centralize the input data pipelines and perhaps move the dataloader.py file to another location. I have also marked several functions in this file with # Question: comments that I propose moving.

To use the Nanoset datasets, specify the data_path and split fields in config.data.dataset in the .yaml file and launch training with run_train_nanoset.py, in the same way as run_train.py.

data:
  dataset:
    data_path: /mloscratch/homes/solergib/s-ai/nanotron/datasets/llama2/europarl-gpt-llama2_text_document
    split: 949,50,1
  num_loading_workers: 0
  seed: 1234
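
For illustration, a hedged sketch of how the dataset entry above could be read into a config object on the Python side; the data_path and split fields come from this PR, while the dataclass shape, loading code and config file name are assumptions:

from dataclasses import dataclass
import yaml

@dataclass
class NanosetDatasetsArgs:
    data_path: str  # prefix produced by Megatron's preprocess_data.py (no extension)
    split: str      # comma-separated train,valid,test weights, e.g. "949,50,1"

with open("config.yaml") as f:  # hypothetical config file name
    cfg = yaml.safe_load(f)

dataset_args = NanosetDatasetsArgs(**cfg["data"]["dataset"])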

I've published the wandb logs of the different tests I have carried out, comparing the HF Datasets pipeline and the new Nanoset pipeline with 1 and 4 GPUs, as well as resuming training from a checkpoint.

This is a first version; I am open to any suggestions you can think of!

Toni

xrsrke and others added 30 commits March 3, 2024 13:26
- Enhance docstring and type hints in get_local_rank for clarity
- Simplify parameter names in get_global_rank for readability
- Update tests for get_global_rank
- Attempt to fix a bug related to get_local_rank
TJ-Solergibert pushed a commit to TJ-Solergibert/nanotron that referenced this pull request May 27, 2024
ischlag commented Jul 2, 2024

Already integrated upstream and resolved with the most recent sync.

ischlag closed this Jul 2, 2024