# MaskLLM with C4 Dataset

## Dataset Download

Download the C4 subset `00000-00019`:

```bash
python scripts/data/download_c4.py
```

Output:

```
assets/data/c4
├── ...
└── en
    ├── c4-train.00000-of-01024.json
    ├── c4-train.00001-of-01024.json
    ├── c4-train.00002-of-01024.json
    ├── c4-train.00003-of-01024.json
    ├── c4-train.00004-of-01024.json
    ├── c4-train.00005-of-01024.json
    ├── c4-train.00006-of-01024.json
    ├── c4-train.00007-of-01024.json
    ├── c4-train.00008-of-01024.json
    ├── c4-train.00009-of-01024.json
    ├── c4-train.00010-of-01024.json
    ├── c4-train.00011-of-01024.json
    ├── c4-train.00012-of-01024.json
    ├── c4-train.00013-of-01024.json
    └── c4-train.00014-of-01024.json
    ...
```
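If you prefer to fetch the shards manually, the sketch below illustrates an equivalent download from the `allenai/c4` dataset on Hugging Face. The shard URL pattern and the gunzip step are assumptions about what `download_c4.py` automates; the script itself remains the supported path.

```bash
# Illustrative sketch only; download_c4.py is the supported entry point.
# Fetch and decompress the first 20 English C4 training shards.
mkdir -p assets/data/c4/en
for i in $(seq -w 0 19); do
    shard="c4-train.000${i}-of-01024.json"
    # Shard URL pattern assumed from the allenai/c4 dataset layout on Hugging Face.
    wget -q "https://huggingface.co/datasets/allenai/c4/resolve/main/en/${shard}.gz" \
        -O "assets/data/c4/en/${shard}.gz"
    gunzip -f "assets/data/c4/en/${shard}.gz"
done
```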

## Requirements

We use `pytorch:24.01-py3` as the base image. Please make sure you have Docker installed. More details can be found in [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

Install additional packages:

```bash
pip install nltk sentencepiece
```

## Pre-processing for LLaMA-2

```bash
bash scripts/data/prepare_c4_megatron_llama2.sh
```

Output:

```
assets/data/c4_llama2_pretokenized/
├── c4_llama2_00000_text_document.bin
├── c4_llama2_00000_text_document.idx
├── c4_llama2_00001_text_document.bin
├── c4_llama2_00001_text_document.idx
├── c4_llama2_00002_text_document.bin
├── c4_llama2_00002_text_document.idx
├── c4_llama2_00003_text_document.bin
├── c4_llama2_00003_text_document.idx
├── c4_llama2_00004_text_document.bin
├── c4_llama2_00004_text_document.idx
...
```
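Under the hood, this step tokenizes each JSON shard into Megatron's binary `.bin`/`.idx` format. The sketch below shows roughly what such a loop could look like with Megatron-LM's `tools/preprocess_data.py`; the tokenizer path, worker count, and tokenizer type are assumptions for illustration, not the exact contents of `prepare_c4_megatron_llama2.sh`.

```bash
# Illustrative sketch only; run from the Megatron-LM root (assumption).
# Tokenize each downloaded C4 shard into Megatron's indexed binary format.
mkdir -p assets/data/c4_llama2_pretokenized
for i in $(seq -w 0 19); do
    python tools/preprocess_data.py \
        --input "assets/data/c4/en/c4-train.000${i}-of-01024.json" \
        --output-prefix "assets/data/c4_llama2_pretokenized/c4_llama2_000${i}" \
        --tokenizer-type Llama2Tokenizer \
        --tokenizer-model ./assets/checkpoints/llama2_7b_hf/tokenizer.model \
        --workers 8 \
        --append-eod
done
```

Megatron's preprocessing appends `_text_document` to each output prefix, which matches the file names listed above.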

To use the pre-tokenized data in Megatron-LM, we provide a blending file `assets/c4-blend.sh` for training.
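Megatron-LM consumes a blend as alternating weight/prefix pairs passed to `--data-path`. A minimal sketch of what such a blending file might contain follows; the variable name, weights, and number of shards listed are illustrative, not the actual contents of `assets/c4-blend.sh`.

```bash
# Illustrative sketch only: a data blend is a list of "weight prefix" pairs.
C4_HOME="assets/data/c4_llama2_pretokenized"
DATA_BLEND="0.05 ${C4_HOME}/c4_llama2_00000_text_document \
0.05 ${C4_HOME}/c4_llama2_00001_text_document \
0.05 ${C4_HOME}/c4_llama2_00002_text_document"
# The blend is then passed to the pretraining script as: --data-path ${DATA_BLEND}
```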

## Pre-processing for LLaMA-3

Pre-processing for LLaMA-3 closely resembles that for LLaMA-2, albeit with a modified script. Notably, LLaMA-3 employs a new tokenizer: `tokenizer.model` is no longer used. Instead, the new `tokenizer.json` is loaded with `AutoTokenizer`. Thus, the script accepts a folder path, `--tokenizer-model ./assets/checkpoints/llama3_8b_hf`, to load the new tokenizer.

```bash
bash scripts/data/prepare_c4_megatron_llama3.sh
```

Output:

```
assets/data/c4_llama3_pretokenized/
├── c4_llama3_00000_text_document.bin
├── c4_llama3_00000_text_document.idx
├── c4_llama3_00001_text_document.bin
├── c4_llama3_00001_text_document.idx
├── c4_llama3_00002_text_document.bin
├── c4_llama3_00002_text_document.idx
├── c4_llama3_00003_text_document.bin
├── c4_llama3_00003_text_document.idx
├── c4_llama3_00004_text_document.bin
├── c4_llama3_00004_text_document.idx
...
```
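For reference, here is a single-shard sketch of the modified tokenizer arguments, assuming Megatron-LM's `HuggingFaceTokenizer` type is what loads the folder via `AutoTokenizer`; the exact flags used by `prepare_c4_megatron_llama3.sh` may differ.

```bash
# Illustrative sketch only: the tokenizer folder (not a tokenizer.model file)
# is passed so that tokenizer.json is loaded through AutoTokenizer.
python tools/preprocess_data.py \
    --input assets/data/c4/en/c4-train.00000-of-01024.json \
    --output-prefix assets/data/c4_llama3_pretokenized/c4_llama3_00000 \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model ./assets/checkpoints/llama3_8b_hf \
    --workers 8 \
    --append-eod
```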

The blending file can also be found at `assets/c4-blend-llama3.sh`.

## Pre-processing for LLaMA-3.1

```bash
bash scripts/data/prepare_c4_megatron_llama3.1.sh
```