Adding Megatron-Style input data pipelines #3

Closed · wants to merge 80 commits

Commits
57036a4
require unit test
xrsrke Mar 3, 2024
242f387
Add get_global_rank method to ParallelContext class
0xkerem Mar 3, 2024
8dfdbf9
modify world_rank_matrix in trainer.py
0xkerem Mar 4, 2024
aedb306
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 4, 2024
ee5ce49
Add tests to validate return type of get_global_rank
0xkerem Mar 6, 2024
d62dbb7
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 6, 2024
3740022
Add/Update tests to validate return type of get_global_rank
0xkerem Mar 6, 2024
4cbada7
Update test for get_global_rank
0xkerem Mar 7, 2024
d0d0013
Update type test for get_global_rank
0xkerem Mar 7, 2024
9b7ab4b
Update get_global_rank for precise type check
0xkerem Mar 11, 2024
5717127
Bug fix attempt
0xkerem Mar 11, 2024
07afee9
Initial commit
TJ-Solergibert Mar 13, 2024
27a1c99
added megatron references to nanoset
TJ-Solergibert Mar 13, 2024
576de9a
fixed wrong check
TJ-Solergibert Mar 13, 2024
e6bebc3
fixed getLogger
TJ-Solergibert Mar 13, 2024
d59d6f8
fixed logging + torch.data.utils
TJ-Solergibert Mar 13, 2024
7ba817c
fixed bugs
TJ-Solergibert Mar 13, 2024
dab6b6c
Removed DistributedSamplerWithLoop args
TJ-Solergibert Mar 13, 2024
c73dbe7
fixed nanoset config generator
TJ-Solergibert Mar 13, 2024
04d797c
Added preprocessing scripts
TJ-Solergibert Mar 13, 2024
31ff34c
added nanoset docs
TJ-Solergibert Mar 13, 2024
da59185
deleted preprocessing files
TJ-Solergibert Mar 13, 2024
e19dd44
Resolving pre-commit-hook changes
TJ-Solergibert Mar 13, 2024
c167492
Improve Rank Functions and Tests
0xkerem Mar 14, 2024
d86deaf
Handle conflicts
0xkerem Mar 14, 2024
7b66174
Merge branch 'huggingface:main' into refactor_parallel_context_world_…
0xkerem Mar 14, 2024
48f94d5
added preprocessing tools
TJ-Solergibert Mar 16, 2024
dd4ad58
Resolving pre-commit-hook changes
TJ-Solergibert Mar 16, 2024
db52f3e
Deleted multimodal references & cleaned indexed_dataset.py
TJ-Solergibert Mar 16, 2024
29d2ee7
Added BlendedNanoset
TJ-Solergibert Mar 17, 2024
dac7205
clean
TJ-Solergibert Mar 17, 2024
903761b
Added merge_datasets script
TJ-Solergibert Mar 17, 2024
0910fbe
Improved docs
TJ-Solergibert Mar 17, 2024
5984cf4
Cleaned docs
TJ-Solergibert Mar 17, 2024
cde43c7
added run instructions
TJ-Solergibert Mar 18, 2024
9ae26d4
simplified build funct
TJ-Solergibert Mar 19, 2024
698c85b
simplified nanoset
TJ-Solergibert Mar 19, 2024
50a189f
fixed dacite UnionMatchError
TJ-Solergibert Mar 20, 2024
2904562
add loading dataloader based on training stages
xrsrke Mar 21, 2024
3936ccf
Added tests
TJ-Solergibert Mar 21, 2024
baad2f7
typos
TJ-Solergibert Mar 21, 2024
89c3a4e
fixed test for cuda backend
TJ-Solergibert Mar 21, 2024
2f14b78
destroy parallel_context on test end
TJ-Solergibert Mar 21, 2024
d223835
refractored test
TJ-Solergibert Mar 21, 2024
2e15ab8
fixed dataset dependency
TJ-Solergibert Mar 22, 2024
4924d54
Added tensor shape assertion
TJ-Solergibert Mar 22, 2024
b052a23
updated dependencies
TJ-Solergibert Mar 22, 2024
e6723f3
refactor
xrsrke Mar 22, 2024
b51fa4d
Merge pull request #113 from huggingface/xrsrke/training_stages_rebase
xrsrke Mar 23, 2024
0c5df84
Simplified preprocess_data.py
TJ-Solergibert Mar 24, 2024
b9cac6e
Added random generation of hyperparameters to build nanoset dataloade…
TJ-Solergibert Mar 24, 2024
a5041b7
Simplified Nanosets
TJ-Solergibert Mar 24, 2024
a5bef2c
Added Under the hood section to Docs
TJ-Solergibert Mar 25, 2024
454d86c
Added len assertion of Nanosets belonging to the BlendedNanoset to tests
TJ-Solergibert Mar 25, 2024
b1fd615
Cleaned indexed_dataset
TJ-Solergibert Mar 25, 2024
ce4e743
Deleted cpp helpers
TJ-Solergibert Mar 25, 2024
0a987b5
Merge pull request #94 from 0xkerem/refactor_parallel_context_world_r…
NouamaneTazi Mar 25, 2024
c0e35ea
add a single command that run all the tests
xrsrke Mar 25, 2024
f7a021b
Merge pull request #93 from huggingface/xrsrke/add_unit_test_as_an_re…
NouamaneTazi Mar 25, 2024
3dd5f62
First review fixes
TJ-Solergibert Mar 25, 2024
896473d
Fixed samples of the datasets and updated docs. Moved helpers of Nano…
TJ-Solergibert Mar 26, 2024
50acff8
moved LICENSE
TJ-Solergibert Mar 26, 2024
16f44d1
Added typing to preprocess_data & log_rank pathc
TJ-Solergibert Mar 26, 2024
384f337
Add check only input_pp_rank and output_pp_rank get Tensors
TJ-Solergibert Mar 26, 2024
abe4cd4
Merge remote-tracking branch 'upstream/main' into nanoset
TJ-Solergibert Mar 26, 2024
334fcf4
Added logging BlendedNanoset stats. Cant include typing due to import…
TJ-Solergibert Mar 27, 2024
531f802
Adapted code to loading dataloader based on training stages
TJ-Solergibert Mar 27, 2024
0acf47f
Added dacite commento to config, renamed get_sampler, added split sum…
TJ-Solergibert Mar 27, 2024
4026257
Added deterministic creation of Nanosets & BlendedNanoset in all proc…
TJ-Solergibert Mar 27, 2024
05be999
Patch for different datasets based on training stages
TJ-Solergibert Mar 27, 2024
f163c1d
Added support for multistage training with Nanosets
TJ-Solergibert Mar 27, 2024
fc55f6e
Fixed redundant logging
TJ-Solergibert Mar 27, 2024
acbc365
Add Nanoset/BlendedNanoset recovery from failure test
TJ-Solergibert Mar 28, 2024
e3190be
Refractored MMapIndexedDataset
TJ-Solergibert Mar 29, 2024
fb60c4f
Updated Tests & added Nanosets to run_train.py
TJ-Solergibert Apr 3, 2024
b0c5b09
Reducing number of tests
TJ-Solergibert Apr 3, 2024
ad21be0
Adding back @rerun_if_address_is_in_use() decorator
TJ-Solergibert Apr 3, 2024
3523ca3
Added nanosets flavour
TJ-Solergibert Apr 20, 2024
0c0f58d
Added world process group to NanosetBuilder (main_rank_first)
TJ-Solergibert Apr 20, 2024
b70720c
Modified preprocess_data
TJ-Solergibert Apr 20, 2024
1 change: 1 addition & 0 deletions .github/workflows/3d_parallelism_unit_tests.yaml
@@ -45,6 +45,7 @@ jobs:
pip install -e .
pip install -e .[dev]
pip install -e .[test]
pip install -e .[nanosets]

- name: Show installed libraries and their versions
run: pip freeze | tee installed.txt
16 changes: 16 additions & 0 deletions Makefile
@@ -0,0 +1,16 @@
# Run nanotron's tests and examples's tests
test:
pytest \
--color=yes \
--durations=0 \
--ignore tests/fp8 \
--verbose \
tests/

pip install -r examples/doremi/requirements.txt
pytest \
--color=yes \
--durations=0 \
--ignore tests/fp8 \
--verbose \
examples/doremi/tests/
4 changes: 4 additions & 0 deletions README.md
@@ -91,6 +91,10 @@ pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```

*As a part of making sure we aren't slowed down as the codebase grows, we will not merge a PR if the features it introduces do not have test coverage.*

We have extensions built on top of Nanotron, with their tests located in the `/examples` folder. Since VSCode defaults to discovering tests only in the `/tests` folder, please run the tests from both `/examples` and `/tests` to ensure your PR does not break these extensions. Run `make test` to execute all the nanotron tests as well as the tests in the `/examples` directory that need to pass.

Features we would like to add:
- [ ] Support `torch.compile`
- [ ] More optimized kernels
167 changes: 167 additions & 0 deletions docs/nanoset.md
@@ -0,0 +1,167 @@
# Nanosets

## Install
To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.
```
pip install -e '.[nanosets]'
```

## Data pre-processing

Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset type built on NumPy memory-mapped arrays. It also includes [`BlendedNanosets`](../src/nanotron/data/blended_nanoset.py), which combine several `Nanosets`.

To use these datasets, we first need to preprocess the data. The input can be either a column of a Hugging Face dataset or a `.json` file containing one text sample per line. For example:

<pre>
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
</pre>

The dataset is then processed into a mmap format for training using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. Below we show an example for processing a corpus with the Llama2 tokenizer.

<pre>
python tools/preprocess_data.py \
--input data/my_corpus.json \
--output-prefix data/processed-datasets/my-llama2-dataset \
--tokenizer-name-or-path meta-llama/Llama-2-7b-hf \
--num-workers 128
</pre>

In `--tokenizer-name-or-path`, we have to specify a tokenizer in the same way as we do when using `AutoTokenizer.from_pretrained(...)`.

The output will be a single file, named in this case `my-llama2-dataset_input_ids.npy`. We then specify this file in the `data_path` field of the config file.
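
As a quick sanity check, the output file can be opened as a memory-mapped NumPy array without loading it fully into RAM. This is only a sketch: the exact array layout written by `preprocess_data.py` (a flat 1-D array of token ids) is an assumption here.

```python
import numpy as np
from transformers import AutoTokenizer

# Memory-map the preprocessed token ids (assumed to be a flat 1-D array of ids).
tokens = np.load("data/processed-datasets/my-llama2-dataset_input_ids.npy", mmap_mode="r")
print(f"Total number of tokens: {len(tokens)}")

# Decode a small slice to eyeball that the preprocessing looks sane.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.decode(tokens[:64].tolist()))
```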

## Working with Nanosets

To work with Nanosets, we need to configure 3 arguments:
1. `split`: The proportions used to divide the dataset into train, valid, and test splits.
2. `path_to_cache`: Directory where Nanoset metadata is stored so it can be reused between runs.
3. `data_path`: The file (or files) that will compose the Nanoset. There are two ways to specify it:
1. If we specify a single path, we will create a `Nanoset`.
```yaml
data_stages:
- name: General purpose training (Nanoset)
start_training_step: 1
data:
dataset:
data_path: nanosets/SlimPajama-6B_input_ids.npy
split: 949,50,1
path_to_cache: .nanoset_cache
num_loading_workers: 0
seed: 1234
```
2. With a dictionary, we can create a `BlendedNanoset` where the keys are the paths to the dataset files and the values are the weights for each dataset.
```yaml
data_stages:
- name: General purpose training (BlendedNanoset)
start_training_step: 1
data:
dataset:
data_path:
nanoset/SlimPajama-6B_input_ids.npy: 0.8
nanoset/europarl_input_ids.npy: 0.2
split: 949,50,1
path_to_cache: .nanoset_cache
num_loading_workers: 0
seed: 1234
```

Finally, to use the Nanosets, launch the training with [`run_train_nanoset.py`](../run_train_nanoset.py).
```shell
torchrun --nproc-per-node 8 run_train_nanoset.py --config configs/nanoset_llama2.yaml
```

## Under the hood
### Number of samples

When using Nanosets, we specify the `data_path` to the preprocessed dataset and a `split`. This `split` value is used to divide the total number of tokens into train, valid, and test sets. The number of samples in each split is `number of tokens in split / sequence_length`.

For the train split, the number of samples consumed from the Nanoset is `number of train steps * global batch size`, so if this number is higher than the number of samples in the train split, the dataset's samples will be seen more than once (> 1 epoch). The valid and test splits are seen exactly once.

In the case of the `BlendedNanoset`, we also indicate the weight of each dataset, and data batches are constructed according to the specified proportions. The train split respects these proportions, with the number of samples per dataset computed in the same way as for `Nanosets`, so one dataset may be consumed for 3 epochs while another, larger dataset is consumed for only one. As with `Nanosets`, the valid and test splits are consumed exactly once.
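
As a small worked example of this bookkeeping (all numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers, only to illustrate the sample/epoch bookkeeping above.
tokens_in_train_split = 6_000_000_000  # tokens assigned to the train split
sequence_length = 4096
train_steps = 10_000
global_batch_size = 512  # e.g. dp * micro_batch_size * batch_accumulation_per_replica

train_samples = tokens_in_train_split // sequence_length  # ~1.46M samples in the split
consumed_samples = train_steps * global_batch_size        # 5.12M samples requested by training
epochs = consumed_samples / train_samples                 # ~3.5 epochs over the train split
print(train_samples, consumed_samples, round(epochs, 2))
```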

### Nanoset
A `Nanoset` is parameterized by the following variables:
- The underlying `MMapIndexedDataset` instance (`indexed_dataset`)
- The sequence length `S`
- The split indices `indexed_indices` (the contiguous subset of sample indices used for training, validation, and testing)
- The total number of samples `N` of the Nanoset that will be consumed during training. For the valid and test splits, the dataset is consumed only once
- The random seed `R`

The `Nanoset` creates a single index (`shuffle_index`) to map the indices of the Nanoset (0, 1, 2, ... `N`) to the indices of the `MMapIndexedDataset` for the specific split (`indexed_indices`).

In the train split, the shuffle index (`shuffle_index`) is a 1-D array mapping from _k_ to _j_ of length `n_concatenations * len(indexed_indices)`, where `n_concatenations` is defined as `(N // len(indexed_indices)) + 1`, so that `len(shuffle_index)` is always greater than `N`. For the valid and test splits, `len(shuffle_index) == len(indexed_indices)`. Before being concatenated into the full array, `shuffle_index` is shuffled according to `R`.
```
Given:

N = 70

indexed_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Then, for example:

shuffle_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Shuffle the indices -> shuffle_index = [19, 9, 3, 18, 15, 12, 5, 10, 17, 1, 4, 8, 11, 16, 13, 7, 2, 14, 6, 0]

n_concatenations = (70 // 20) + 1 = 4
shuffle_index = shuffle_index concatenated 4 times

len(shuffle_index) = 80 > N
```

To query the `Nanoset` for the k-th sample we do the following:
1. Use the `shuffle_index` to get the index _j_ of the sample in the `indexed_dataset`
```
j = shuffle_index[k]
```
2. To retrieve `S + 1` tokens from the `indexed_dataset`, we specify the `offset` (`j * sequence_length` for CausalLM) and the length (the number of tokens to extract)
```
offset = j * sequence_length
sample = indexed_dataset[offset:offset + sequence_length + 1]
```
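
Putting both steps together, here is a minimal, self-contained sketch of the indexing logic described above. The names (`TinyNanoset`, the flat `.npy` layout) are illustrative assumptions; the actual `Nanoset` in this PR also handles splits, caching, and edge cases that are omitted here.

```python
import numpy as np


class TinyNanoset:
    """Toy illustration of the shuffle_index construction and sample lookup."""

    def __init__(self, token_file: str, sequence_length: int,
                 indexed_indices: np.ndarray, num_samples: int, seed: int = 1234):
        self.tokens = np.load(token_file, mmap_mode="r")  # flat array of token ids (assumed layout)
        self.sequence_length = sequence_length

        # Shuffle the split indices once, then concatenate enough copies so that
        # len(shuffle_index) > num_samples (train split behaviour).
        rng = np.random.default_rng(seed)
        shuffled = rng.permutation(indexed_indices)
        n_concatenations = num_samples // len(indexed_indices) + 1
        self.shuffle_index = np.concatenate([shuffled] * n_concatenations)

    def __len__(self) -> int:
        return len(self.shuffle_index)

    def __getitem__(self, k: int) -> np.ndarray:
        # 1. Map the Nanoset index k to the sample index j of the split.
        j = self.shuffle_index[k]
        # 2. Read S + 1 tokens starting at the sample's offset (CausalLM: inputs + shifted labels).
        offset = j * self.sequence_length
        return np.array(self.tokens[offset: offset + self.sequence_length + 1])
```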

Despite the repeated indices in the `shuffle_index`, each sample is observed only once per epoch. We achieve this by deactivating shuffling in the `DistributedDataSampler`, so that the indices of the `shuffle_index` are consumed by the multiple processes in the order they appear. Note that the samples are already shuffled within the `shuffle_index` itself.
```
Given:

4 Processes loading data

[19, 9, 3, 18, 15, 12, 5, 10, 17, 1, 4, 8, 11, 16, 13, 7, 2, 14, 6, 0]

(P1) idx_list = [0, 4, 8, 12, 16, ...] -> shuffle_index[idx_list] = [19, 15, 17, 11, 2, 19, ...]
(P2) idx_list = [1, 5, 9, 13, 17, ...] -> shuffle_index[idx_list] = [9, 12, 1, 16, 14, 9, ...]
(P3) idx_list = [2, 6, 10, 14, 18, ...] -> shuffle_index[idx_list] = [3, 5, 4, 13, 6, 3, ...]
(P4) idx_list = [3, 7, 11, 15, 19, ...] -> shuffle_index[idx_list] = [18, 10, 8, 7, 0, 18, ...]
```
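
This ordered, per-rank consumption can be wired up with PyTorch's `DistributedSampler` by turning its own shuffling off. The following is only a sketch (the dataloader helpers added in this PR may differ); with `shuffle=False`, each rank reads the strided index pattern shown above.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler


def build_ordered_dataloader(dataset: Dataset, micro_batch_size: int) -> DataLoader:
    """Each rank reads indices rank, rank + world_size, rank + 2 * world_size, ...
    in the order the dataset already defines (the shuffle lives in shuffle_index)."""
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=False,  # keep the pre-shuffled shuffle_index order
    )
    return DataLoader(dataset, sampler=sampler, batch_size=micro_batch_size)
```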
### BlendedNanoset
The `BlendedNanoset` is parameterized by the following variables:
- The underlying `Nanoset` instances `D`
- The weights `W` (one per dataset)
- The number of samples `U`

The `BlendedNanoset` creates two "blending" indices to facilitate lookup: (1) The `dataset_index` and (2) the `dataset_sample_index`.

1. The `dataset_index` is a 1-D array of length `U` mapping each index _i_ to a dataset in `D`.
```
Given:

D = [d0, d1, d2, d3]
W = [0.1, 0.5, 0.3, 0.1]
U = 20

Then, for example:

dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
```
2. The `dataset_sample_index` is a 1-D array of length `U` mapping each index _i_ to a sample index within the dataset `D[dataset_index[i]]`.
```
dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index = [0, 0, 0, 1, 0, 2, 1, 3, 2, 4, 1, 5, 3, 6, 1, 7, 4, 8, 5, 9]
```
To query the `BlendedNanoset` for the k-th sample we do the following:
- Use the `dataset_index` to retrieve the corresponding dataset from `D` and the `dataset_sample_index` to retrieve the corresponding sample from that dataset.
```
sample = D[dataset_index[k]][dataset_sample_index[k]]
```
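
For illustration, below is a minimal sketch of one greedy, Megatron-style policy for building the two blending indices. The helper shipped with this PR may differ in its details, but applied to the example above this policy yields the same `dataset_index` and `dataset_sample_index`.

```python
import numpy as np


def build_blending_indices(weights: list[float], num_samples: int) -> tuple[np.ndarray, np.ndarray]:
    """Greedy blending: at each step pick the dataset lagging furthest behind its target share."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalise, just in case

    dataset_index = np.zeros(num_samples, dtype=np.int64)
    dataset_sample_index = np.zeros(num_samples, dtype=np.int64)
    counts = np.zeros(len(weights), dtype=np.int64)

    for i in range(num_samples):
        # Error between each dataset's target share and what it has contributed so far.
        errors = weights * max(i, 1) - counts
        d = int(np.argmax(errors))
        dataset_index[i] = d
        dataset_sample_index[i] = counts[d]  # next unused sample of dataset d
        counts[d] += 1

    return dataset_index, dataset_sample_index


# With D = [d0, d1, d2, d3], W = [0.1, 0.5, 0.3, 0.1] and U = 20 as in the example above:
dataset_index, dataset_sample_index = build_blending_indices([0.1, 0.5, 0.3, 0.1], 20)
print(dataset_index.tolist())
print(dataset_sample_index.tolist())
```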
103 changes: 103 additions & 0 deletions examples/config_nanoset.yaml
@@ -0,0 +1,103 @@
checkpoints:
checkpoint_interval: 200
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false

data_stages:
- name: General purpose training (Nanoset)
start_training_step: 1
data:
dataset:
data_path: datasets/testing_alpaca_small_input_ids.npy
split: 8,1,1
num_loading_workers: 0
seed: 1234

- name: Second purpose training (BlendedNanoset)
start_training_step: 15
data:
dataset:
data_path:
datasets/testing_alpaca_small_input_ids.npy: 0.8
datasets/yelp_review_full_input_ids.npy: 0.2
split: 6,2,2
num_loading_workers: 0
seed: 1234

general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: false
project: debug
run: tiny_llama_%date_%jobid
seed: 42
step: null
lighteval: null
logging:
iteration_step_info_interval: 1
log_level: info
log_level_replica: info
model:
ddp_bucket_cap_mb: 25
dtype: bfloat16
init_method:
std: 0.025
make_vocab_size_divisible_by: 1
model_config:
bos_token_id: 1
eos_token_id: 2
hidden_act: silu
hidden_size: 16
initializer_range: 0.02
intermediate_size: 64
is_llama_config: true
max_position_embeddings: 256
num_attention_heads: 4
num_hidden_layers: 2
num_key_value_heads: 4
pad_token_id: null
pretraining_tp: 1
rms_norm_eps: 1.0e-05
rope_scaling: null
tie_word_embeddings: true
use_cache: true
vocab_size: 50257
optimizer:
accumulate_grad_in_fp32: true
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.0e-08
clip_grad: 1.0
learning_rate_scheduler:
learning_rate: 0.0003
lr_decay_starting_step: null
lr_decay_steps: 8
lr_decay_style: cosine
lr_warmup_steps: 2
lr_warmup_style: linear
min_decay_lr: 1.0e-05
torch_adam_is_fused: true
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 2
pp: 1
pp_engine: 1f1b
tp: 2
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_max_length: null
tokenizer_name_or_path: gpt2
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 32
train_steps: 100
val_check_interval: -1
41 changes: 29 additions & 12 deletions examples/config_tiny_llama.yaml
@@ -1,19 +1,36 @@
checkpoints:
checkpoint_interval: 10
checkpoints_path: /fsx/nouamane/projects/nanotron/checkpoints
checkpoints_path: checkpoints
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

data_stages:
- name: Stable Training Stage
start_training_step: 1
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42
- name: Annealing Phase
start_training_step: 10
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

general:
benchmark_csv_path: null
consumed_train_samples: null
@@ -87,5 +104,5 @@ tokens:
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 32
train_steps: 10
train_steps: 20
val_check_interval: -1
24 changes: 14 additions & 10 deletions examples/contributor-guide/debug_config_tiny_llama.yaml
@@ -4,16 +4,20 @@ checkpoints:
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42

data_stages:
- name: General purpose training
start_training_step: 1
data:
dataset:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: null
hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
hf_dataset_splits: train
text_column_name: completion
num_loading_workers: 1
seed: 42
general:
benchmark_csv_path: null
consumed_train_samples: null