Adding Megatron-Style input data pipelines #3

Closed · wants to merge 80 commits

Commits
57036a4
require unit test
xrsrke Mar 3, 2024
242f387
Add get_global_rank method to ParallelContext class
0xkerem Mar 3, 2024
8dfdbf9
modify world_rank_matrix in trainer.py
0xkerem Mar 4, 2024
aedb306
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 4, 2024
ee5ce49
Add tests to validate return type of get_global_rank
0xkerem Mar 6, 2024
d62dbb7
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 6, 2024
3740022
Add/Update tests to validate return type of get_global_rank
0xkerem Mar 6, 2024
4cbada7
Update test for get_global_rank
0xkerem Mar 7, 2024
d0d0013
Update type test for get_global_rank
0xkerem Mar 7, 2024
9b7ab4b
Update get_global_rank for precise type check
0xkerem Mar 11, 2024
5717127
Bug fix attempt
0xkerem Mar 11, 2024
07afee9
Initial commit
TJ-Solergibert Mar 13, 2024
27a1c99
added megatron references to nanoset
TJ-Solergibert Mar 13, 2024
576de9a
fixed wrong check
TJ-Solergibert Mar 13, 2024
e6bebc3
fixed getLogger
TJ-Solergibert Mar 13, 2024
d59d6f8
fixed logging + torch.data.utils
TJ-Solergibert Mar 13, 2024
7ba817c
fixed bugs
TJ-Solergibert Mar 13, 2024
dab6b6c
Removed DistributedSamplerWithLoop args
TJ-Solergibert Mar 13, 2024
c73dbe7
fixed nanoset config generator
TJ-Solergibert Mar 13, 2024
04d797c
Added preprocessing scripts
TJ-Solergibert Mar 13, 2024
31ff34c
added nanoset docs
TJ-Solergibert Mar 13, 2024
da59185
deleted preprocessing files
TJ-Solergibert Mar 13, 2024
e19dd44
Resolving pre-commit-hook changes
TJ-Solergibert Mar 13, 2024
c167492
Improve Rank Functions and Tests
0xkerem Mar 14, 2024
d86deaf
Handle conflicts
0xkerem Mar 14, 2024
7b66174
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 14, 2024
48f94d5
added preprocessing tools
TJ-Solergibert Mar 16, 2024
dd4ad58
Resolving pre-commit-hook changes
TJ-Solergibert Mar 16, 2024
db52f3e
Deleted multimodal references & cleaned indexed_dataset.py
TJ-Solergibert Mar 16, 2024
29d2ee7
Added BlendedNanoset
TJ-Solergibert Mar 17, 2024
dac7205
clean
TJ-Solergibert Mar 17, 2024
903761b
Added merge_datasets script
TJ-Solergibert Mar 17, 2024
0910fbe
Improved docs
TJ-Solergibert Mar 17, 2024
5984cf4
Cleaned docs
TJ-Solergibert Mar 17, 2024
cde43c7
added run instructions
TJ-Solergibert Mar 18, 2024
9ae26d4
simplified build funct
TJ-Solergibert Mar 19, 2024
698c85b
simplified nanoset
TJ-Solergibert Mar 19, 2024
50a189f
fixed dacite UnionMatchError
TJ-Solergibert Mar 20, 2024
2904562
add loading dataloader based on training stages
xrsrke Mar 21, 2024
3936ccf
Added tests
TJ-Solergibert Mar 21, 2024
baad2f7
typos
TJ-Solergibert Mar 21, 2024
89c3a4e
fixed test for cuda backend
TJ-Solergibert Mar 21, 2024
2f14b78
destroy parallel_context on test end
TJ-Solergibert Mar 21, 2024
d223835
refractored test
TJ-Solergibert Mar 21, 2024
2e15ab8
fixed dataset dependency
TJ-Solergibert Mar 22, 2024
4924d54
Added tensor shape assertion
TJ-Solergibert Mar 22, 2024
b052a23
updated dependencies
TJ-Solergibert Mar 22, 2024
e6723f3
refactor
xrsrke Mar 22, 2024
b51fa4d
Merge pull request #113 from huggingface/xrsrke/training_stages_rebase
xrsrke Mar 23, 2024
0c5df84
Simplified preprocess_data.py
TJ-Solergibert Mar 24, 2024
b9cac6e
Added random generation of hyperparameters to build nanoset dataloade…
TJ-Solergibert Mar 24, 2024
a5041b7
Simplified Nanosets
TJ-Solergibert Mar 24, 2024
a5bef2c
Added Under the hood section to Docs
TJ-Solergibert Mar 25, 2024
454d86c
Added len assertion of Nanosets belonging to the BlendedNanoset to tests
TJ-Solergibert Mar 25, 2024
b1fd615
Cleaned indexed_dataset
TJ-Solergibert Mar 25, 2024
ce4e743
Deleted cpp helpers
TJ-Solergibert Mar 25, 2024
0a987b5
Merge pull request #94 from 0xkerem/refactor_parallel_context_world_r…
NouamaneTazi Mar 25, 2024
c0e35ea
add a single command that run all the tests
xrsrke Mar 25, 2024
f7a021b
Merge pull request #93 from huggingface/xrsrke/add_unit_test_as_an_re…
NouamaneTazi Mar 25, 2024
3dd5f62
First review fixes
TJ-Solergibert Mar 25, 2024
896473d
Fixed samples of the datasets and updated docs. Moved helpers of Nano…
TJ-Solergibert Mar 26, 2024
50acff8
moved LICENSE
TJ-Solergibert Mar 26, 2024
16f44d1
Added typing to preprocess_data & log_rank pathc
TJ-Solergibert Mar 26, 2024
384f337
Add check only input_pp_rank and output_pp_rank get Tensors
TJ-Solergibert Mar 26, 2024
abe4cd4
Merge remote-tracking branch 'upstream/main' into nanoset
TJ-Solergibert Mar 26, 2024
334fcf4
Added logging BlendedNanoset stats. Cant include typing due to import…
TJ-Solergibert Mar 27, 2024
531f802
Adapted code to loading dataloader based on training stages
TJ-Solergibert Mar 27, 2024
0acf47f
Added dacite commento to config, renamed get_sampler, added split sum…
TJ-Solergibert Mar 27, 2024
4026257
Added deterministic creation of Nanosets & BlendedNanoset in all proc…
TJ-Solergibert Mar 27, 2024
05be999
Patch for different datasets based on training stages
TJ-Solergibert Mar 27, 2024
f163c1d
Added support for multistage training with Nanosets
TJ-Solergibert Mar 27, 2024
fc55f6e
Fixed redundant logging
TJ-Solergibert Mar 27, 2024
acbc365
Add Nanoset/BlendedNanoset recovery from failure test
TJ-Solergibert Mar 28, 2024
e3190be
Refractored MMapIndexedDataset
TJ-Solergibert Mar 29, 2024
fb60c4f
Updated Tests & added Nanosets to run_train.py
TJ-Solergibert Apr 3, 2024
b0c5b09
Reducing number of tests
TJ-Solergibert Apr 3, 2024
ad21be0
Adding back @rerun_if_address_is_in_use() decorator
TJ-Solergibert Apr 3, 2024
3523ca3
Added nanosets flavour
TJ-Solergibert Apr 20, 2024
0c0f58d
Added world process group to NanosetBuilder (main_rank_first)
TJ-Solergibert Apr 20, 2024
b70720c
Modified preprocess_data
TJ-Solergibert Apr 20, 2024
1 change: 1 addition & 0 deletions .github/workflows/3d_parallelism_unit_tests.yaml
@@ -45,6 +45,7 @@ jobs:
pip install -e .
pip install -e .[dev]
pip install -e .[test]
pip install -e .[nanosets]

- name: Show installed libraries and their versions
run: pip freeze | tee installed.txt
16 changes: 16 additions & 0 deletions Makefile
@@ -0,0 +1,16 @@
# Run nanotron's tests and examples's tests
test:
pytest \
--color=yes \
--durations=0 \
--ignore tests/fp8 \
--verbose \
tests/

pip install -r examples/doremi/requirements.txt
pytest \
--color=yes \
--durations=0 \
--ignore tests/fp8 \
--verbose \
examples/doremi/tests/
4 changes: 4 additions & 0 deletions README.md
@@ -91,6 +91,10 @@ pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```

*As a part of making sure we aren't slowed down as the codebase grows, we will not merge a PR if the features it introduces do not have test coverage.*

We have extensions built on top of Nanotron, with their tests located in the `/examples` folder. Since VSCode defaults to discovering tests only in the `/tests` folder, please run the tests from both `/examples` and `/tests` to ensure your PR does not break these extensions. Run `make test` to execute all the nanotron tests as well as the tests in the `/examples` directory that need to pass.

Features we would like to add:
- [ ] Support `torch.compile`
- [ ] More optimized kernels
167 changes: 167 additions & 0 deletions docs/nanoset.md
@@ -0,0 +1,167 @@
# Nanosets

## Install
To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.
```
pip install -e '.[nanosets]'
```

## Data pre-processing

Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset type built on NumPy memory-mapped arrays. It also includes [`BlendedNanosets`](../src/nanotron/data/blended_nanoset.py), which combine several `Nanosets`.

To use these datasets, we first need to preprocess the data. The input can be either a column of a Hugging Face dataset or a `.json` file containing one text sample per line. For example:

<pre>
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
</pre>

The dataset is then processed into a mmap format for training using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. Below we show an example for processing a corpus with the Llama2 tokenizer.

<pre>
python tools/preprocess_data.py \
--input data/my_corpus.json \
--output-prefix data/processed-datasets/my-llama2-dataset \
--tokenizer-name-or-path meta-llama/Llama-2-7b-hf \
--num-workers 128
</pre>

In `--tokenizer-name-or-path`, we have to specify a tokenizer in the same way as we do when using `AutoTokenizer.from_pretrained(...)`.

The output will be a single file, named in this case `my-llama2-dataset_input_ids.npy`. We then specify this file in the `data_path` field of the config file.
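
As a quick sanity check, the output file can be opened as a memory-mapped NumPy array without loading it fully into RAM. This is only a sketch: the exact array layout written by `preprocess_data.py` (a flat 1-D array of token ids) is an assumption here.

```python
import numpy as np
from transformers import AutoTokenizer

# Memory-map the preprocessed token ids (assumed to be a flat 1-D array of ids).
tokens = np.load("data/processed-datasets/my-llama2-dataset_input_ids.npy", mmap_mode="r")
print(f"Total number of tokens: {len(tokens)}")

# Decode a small slice to eyeball that the preprocessing looks sane.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.decode(tokens[:64].tolist()))
```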

## Working with Nanosets

To work with Nanosets, we need to configure 3 arguments:
1. `split`: The proportions used to divide the dataset into train, valid, and test splits.
2. `path_to_cache`: Directory where Nanoset metadata is stored so it can be reused between runs.
3. `data_path`: The file (or files) that will compose the Nanoset. There are two ways to specify it:
1. If we specify a single path, we will create a `Nanoset`.
```yaml
data_stages:
- name: General purpose training (Nanoset)
start_training_step: 1
data:
dataset:
data_path: nanosets/SlimPajama-6B_input_ids.npy
split: 949,50,1
path_to_cache: .nanoset_cache
num_loading_workers: 0
seed: 1234
```
2. With a dictionary, we can create a `BlendedNanoset` where the keys are the paths to the dataset files and the values are the weights for each dataset.
```yaml
data_stages:
- name: General purpose training (BlendedNanoset)
start_training_step: 1
data:
dataset:
data_path:
nanoset/SlimPajama-6B_input_ids.npy: 0.8
nanoset/europarl_input_ids.npy: 0.2
split: 949,50,1
path_to_cache: .nanoset_cache
num_loading_workers: 0
seed: 1234
```

Finally, to use the Nanosets, launch the training with [`run_train_nanoset.py`](../run_train_nanoset.py).
```shell
torchrun --nproc-per-node 8 run_train_nanoset.py --config configs/nanoset_llama2.yaml
```

## Under the hood
### Number of samples

When using Nanosets, we specify the `data_path` to the preprocessed dataset and a `split`. This `split` value is used to divide the total number of tokens into train, valid, and test sets. The number of samples in each split is `number of tokens in split / sequence_length`.

For the train split, the number of samples consumed from the Nanoset is `number of train steps * global batch size`, so if this number is higher than the number of samples in the train split, the dataset's samples will be seen more than once (> 1 epoch). The valid and test splits are seen exactly once.

In the case of the `BlendedNanoset`, we also indicate the weight of each dataset, and data batches are constructed according to the specified proportions. The train split respects these proportions, with the number of samples per dataset computed in the same way as for `Nanosets`, so one dataset may be consumed for 3 epochs while another, larger dataset is consumed for only one. As with `Nanosets`, the valid and test splits are consumed exactly once.
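
As a small worked example of this bookkeeping (all numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers, only to illustrate the sample/epoch bookkeeping above.
tokens_in_train_split = 6_000_000_000  # tokens assigned to the train split
sequence_length = 4096
train_steps = 10_000
global_batch_size = 512  # e.g. dp * micro_batch_size * batch_accumulation_per_replica

train_samples = tokens_in_train_split // sequence_length  # ~1.46M samples in the split
consumed_samples = train_steps * global_batch_size        # 5.12M samples requested by training
epochs = consumed_samples / train_samples                 # ~3.5 epochs over the train split
print(train_samples, consumed_samples, round(epochs, 2))
```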

### Nanoset
A `Nanoset` is parameterized by the following variables:
- The underlying `MMapIndexedDataset` instance (`indexed_dataset`)
- The sequence length `S`
- The split indices `indexed_indices` (the contiguous subset of sample indices used for training, validation, and testing)
- The total number of samples `N` of the Nanoset that will be consumed during training. For the valid and test splits, the dataset is consumed only once
- The random seed `R`

The `Nanoset` creates a single index (`shuffle_index`) to map the indices of the Nanoset (0, 1, 2, ... `N`) to the indices of the `MMapIndexedDataset` for the specific split (`indexed_indices`).

In the train split, the shuffle index (`shuffle_index`) is a 1-D array mapping from _k_ to _j_ of length `n_concatenations * len(indexed_indices)`, where `n_concatenations` is defined as `(N // len(indexed_indices)) + 1`, so that `len(shuffle_index)` is always greater than `N`. For the valid and test splits, `len(shuffle_index) == len(indexed_indices)`. Before being concatenated into the full array, `shuffle_index` is shuffled according to `R`.
```
Given:

N = 70

indexed_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Then, for example:

shuffle_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Shuffle the indices -> shuffle_index = [19, 9, 3, 18, 15, 12, 5, 10, 17, 1, 4, 8, 11, 16, 13, 7, 2, 14, 6, 0]

n_concatenations = (70 // 20) + 1 = 4
shuffle_index = shuffle_index concatenated 4 times

len(shuffle_index) = 80 > N
```

To query the `Nanoset` for the k-th sample we do the following:
1. Use the `shuffle_index` to get the index _j_ of the sample in the `indexed_dataset`
```
j = shuffle_index[k]
```
2. To retrieve `S + 1` tokens from the `indexed_dataset`, we specify the `offset` (`j * sequence_length` for CausalLM) and the length (the number of tokens to extract)
```
offset = j * sequence_length
sample = indexed_dataset[offset:offset + sequence_length + 1]
```
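
Putting both steps together, here is a minimal, self-contained sketch of the indexing logic described above. The names (`TinyNanoset`, the flat `.npy` layout) are illustrative assumptions; the actual `Nanoset` in this PR also handles splits, caching, and edge cases that are omitted here.

```python
import numpy as np


class TinyNanoset:
    """Toy illustration of the shuffle_index construction and sample lookup."""

    def __init__(self, token_file: str, sequence_length: int,
                 indexed_indices: np.ndarray, num_samples: int, seed: int = 1234):
        self.tokens = np.load(token_file, mmap_mode="r")  # flat array of token ids (assumed layout)
        self.sequence_length = sequence_length

        # Shuffle the split indices once, then concatenate enough copies so that
        # len(shuffle_index) > num_samples (train split behaviour).
        rng = np.random.default_rng(seed)
        shuffled = rng.permutation(indexed_indices)
        n_concatenations = num_samples // len(indexed_indices) + 1
        self.shuffle_index = np.concatenate([shuffled] * n_concatenations)

    def __len__(self) -> int:
        return len(self.shuffle_index)

    def __getitem__(self, k: int) -> np.ndarray:
        # 1. Map the Nanoset index k to the sample index j of the split.
        j = self.shuffle_index[k]
        # 2. Read S + 1 tokens starting at the sample's offset (CausalLM: inputs + shifted labels).
        offset = j * self.sequence_length
        return np.array(self.tokens[offset: offset + self.sequence_length + 1])
```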

Despite the repeated indices in the `shuffle_index`, each sample is observed only once per epoch. We achieve this by deactivating shuffling in the `DistributedDataSampler`, so that the indices of the `shuffle_index` are consumed by the multiple processes in the order they appear. Note that the samples are already shuffled within the `shuffle_index` itself.
```
Given:

4 Processes loading data

[19, 9, 3, 18, 15, 12, 5, 10, 17, 1, 4, 8, 11, 16, 13, 7, 2, 14, 6, 0]

(P1) idx_list = [0, 4, 8, 12, 16, ...] -> shuffle_index[idx_list] = [19, 15, 17, 11, 2, 19, ...]
(P2) idx_list = [1, 5, 9, 13, 17, ...] -> shuffle_index[idx_list] = [9, 12, 1, 16, 14, 9, ...]
(P3) idx_list = [2, 6, 10, 14, 18, ...] -> shuffle_index[idx_list] = [3, 5, 4, 13, 6, 3, ...]
(P4) idx_list = [3, 7, 11, 15, 19, ...] -> shuffle_index[idx_list] = [18, 10, 8, 7, 0, 18, ...]
```
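
This ordered, per-rank consumption can be wired up with PyTorch's `DistributedSampler` by turning its own shuffling off. The following is only a sketch (the dataloader helpers added in this PR may differ); with `shuffle=False`, each rank reads the strided index pattern shown above.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler


def build_ordered_dataloader(dataset: Dataset, micro_batch_size: int) -> DataLoader:
    """Each rank reads indices rank, rank + world_size, rank + 2 * world_size, ...
    in the order the dataset already defines (the shuffle lives in shuffle_index)."""
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=False,  # keep the pre-shuffled shuffle_index order
    )
    return DataLoader(dataset, sampler=sampler, batch_size=micro_batch_size)
```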
### BlendedNanoset
The `BlendedNanoset` is parameterized by the following variables:
- The underlying `Nanoset` instances `D`
- The weights `W` (one per dataset)
- The number of samples `U`

The `BlendedNanoset` creates two "blending" indices to facilitate lookup: (1) The `dataset_index` and (2) the `dataset_sample_index`.

1. The `dataset_index` is a 1-D array of length `U` mapping each index _i_ to a dataset in `D`.
```
Given:

D = [d0, d1, d2, d3]
W = [0.1, 0.5, 0.3, 0.1]
U = 20

Then, for example:

dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
```
2. The `dataset_sample_index` is a 1-D array of length `U` mapping each index _i_ to a sample index within the dataset `D[dataset_index[i]]`.
```
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 2, 1, 3, 2, 4, 1, 5, 3, 6, 1, 7, 4, 8, 5, 9]
```
To query the `BlendedNanoset` for the k-th sample we do the following:
- Use the `dataset_index` to retrieve the corresponding dataset from `D` and the `dataset_sample_index` to retrieve the corresponding sample from that dataset.
```
sample = D[dataset_index[k]][dataset_sample_index[k]]
```
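
For illustration, below is a minimal sketch of one greedy, Megatron-style policy for building the two blending indices. The helper shipped with this PR may differ in its details, but applied to the example above this policy yields the same `dataset_index` and `dataset_sample_index`.

```python
import numpy as np


def build_blending_indices(weights: list[float], num_samples: int) -> tuple[np.ndarray, np.ndarray]:
    """Greedy blending: at each step pick the dataset lagging furthest behind its target share."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalise, just in case

    dataset_index = np.zeros(num_samples, dtype=np.int64)
    dataset_sample_index = np.zeros(num_samples, dtype=np.int64)
    counts = np.zeros(len(weights), dtype=np.int64)

    for i in range(num_samples):
        # Error between each dataset's target share and what it has contributed so far.
        errors = weights * max(i, 1) - counts
        d = int(np.argmax(errors))
        dataset_index[i] = d
        dataset_sample_index[i] = counts[d]  # next unused sample of dataset d
        counts[d] += 1

    return dataset_index, dataset_sample_index


# With D = [d0, d1, d2, d3], W = [0.1, 0.5, 0.3, 0.1] and U = 20 as in the example above:
dataset_index, dataset_sample_index = build_blending_indices([0.1, 0.5, 0.3, 0.1], 20)
print(dataset_index.tolist())
print(dataset_sample_index.tolist())
```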
103 changes: 103 additions & 0 deletions examples/config_nanoset.yaml
@@ -0,0 +1,103 @@
checkpoints:
checkpoint_interval: 200
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false

data_stages:
- name: General purpose training (Nanoset)
start_training_step: 1
data:
dataset:
data_path: datasets/testing_alpaca_small_input_ids.npy
split: 8,1,1
num_loading_workers: 0
seed: 1234

- name: Second purpose training (BlendedNanoset)
start_training_step: 15
data:
dataset:
data_path:
datasets/testing_alpaca_small_input_ids.npy: 0.8
datasets/yelp_review_full_input_ids.npy: 0.2
split: 6,2,2
num_loading_workers: 0
seed: 1234

general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: false
project: debug
run: tiny_llama_%date_%jobid
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 1
eos_token_id: 2
hidden_act: silu
hidden_size: 16
initializer_range: 0.02
intermediate_size: 64
is_llama_config: true
max_position_embeddings: 256
num_attention_heads: 4
num_hidden_layers: 2
num_key_value_heads: 4
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_scaling: null
tie_word_embeddings: true
use_cache: true
vocab_size: 50257
optimizer:
accumulate_grad_in_fp32: true
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 8
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 2
pp: 1
pp_engine: 1f1b
tp: 2
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: gpt2
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 32
train_steps: 100
val_check_interval: -1
41 changes: 29 additions & 12 deletions examples/config_tiny_llama.yaml
@@ -1,19 +1,36 @@
checkpoints:
checkpoint_interval: 10
checkpoints_path: /fsx/nouamane/projects/nanotron/checkpoints
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

data_stages:
- name: Stable Training Stage
start_training_step: 1
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42
- name: Annealing Phase
start_training_step: 10
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

general:
benchmark_csv_path: null
consumed_train_samples: null
@@ -87,5 +104,5 @@ tokens:
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 32
train_steps: 10
train_steps: 20
val_check_interval: -1
24 changes: 14 additions & 10 deletions examples/contributor-guide/debug_config_tiny_llama.yaml
@@ -4,16 +4,20 @@ checkpoints:
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

data_stages:
- name: General purpose training
start_training_step: 1
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42
general:
benchmark_csv_path: null
consumed_train_samples: null