Improve RedPajama download tutorial (#647)
carmocca authored Oct 16, 2023
1 parent 39adff8 commit c1c618c
Showing 1 changed file with 15 additions and 14 deletions.
29 changes: 15 additions & 14 deletions tutorials/pretrain_redpajama.md
@@ -28,25 +28,34 @@ the smaller [RedPajama-1T-Sample](https://huggingface.co/datasets/togethercomput

You can download the data using git lfs:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com):
git lfs install
```

```bash
# The full 1 trillion token dataset:
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T
```

```bash
# The 1 billion token subset
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample
```
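
Optionally, you can sanity-check the clone by counting the downloaded `jsonl` files. A minimal sketch in Python, assuming the target directories used above (the expected file counts are given in the next section):

```python
from pathlib import Path

# Point this at whichever clone you made above
data_dir = Path("data/RedPajama-Data-1T-Sample")

# RedPajama shards are stored as jsonl files
jsonl_files = sorted(data_dir.rglob("*.jsonl"))
print(f"Found {len(jsonl_files)} jsonl files under {data_dir}")
```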

## Prepare RedPajama for training

The full dataset consists of 2084 `jsonl` files (the sample dataset contains 11). In order to start pretraining lit-gpt
on it, you need to read, tokenize, and write the data in binary chunks. This will leverage the `PackedDataset`
streaming dataset that comes with lit-gpt. You will need to have the tokenizer config available:

```bash
pip install huggingface_hub sentencepiece

python scripts/download.py --repo_id meta-llama/Llama-2-7b-hf --access_token your_hf_token
```

Then, run

```bash
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ --destination_path data/lit-redpajama
@@ -62,7 +71,7 @@ for the sample dataset.

In the above, we assume that you will use the same tokenizer as LLaMA, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a vocabulary size of 32,000 will work here.
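
If you want to train your own tokenizer instead, here is a minimal sketch using the `sentencepiece` Python package; the corpus path and output prefix are hypothetical placeholders, not part of this tutorial:

```python
import sentencepiece as spm

# Hypothetical paths: replace with your own text corpus and output prefix
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",        # plain-text training file(s)
    model_prefix="my_tokenizer",  # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=32000,             # match the vocabulary size expected by the model
    model_type="bpe",             # LLaMA-style tokenizers are BPE-based
)

# Quick check that the trained model loads and reports the expected vocab size
sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
print(sp.vocab_size())  # 32000
```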

The preparation script will take a while to run, so time for :tea:. (Preparing the 1B sample takes about 45 minutes.)
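
The prepared data is written as binary `.bin` chunks in the destination directory, which `PackedDataset` then streams during pretraining. Below is a minimal sketch of how such chunks might be loaded; the module path and constructor arguments are assumptions and may differ between lit-gpt versions:

```python
import glob

from torch.utils.data import DataLoader

from lit_gpt.packed_dataset import PackedDataset  # assumed module path

# Binary chunks written by scripts/prepare_redpajama.py (destination path from above)
filenames = sorted(glob.glob("data/lit-redpajama/*.bin"))

dataset = PackedDataset(
    filenames,
    n_chunks=4,        # number of chunk files kept in memory at a time
    block_size=2048,   # length of each streamed token sequence
    shuffle=True,
    seed=42,
)
dataloader = DataLoader(dataset, batch_size=4)

batch = next(iter(dataloader))
print(batch.shape)  # e.g. torch.Size([4, 2048])
```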

## Pretraining

@@ -95,15 +104,7 @@ The currently supported model names are contained in the [config.py](https://git
You can

1) either search this file for lines containing "name =",
2) or run `python scripts/download.py` without additional command line arguments

Keep in mind that the original LLaMA training for the 7B model required 83k A100 80GB
hours, so you'll need access to a cluster.
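
For a rough sense of scale, here is a back-of-envelope estimate assuming a hypothetical 64-GPU cluster, perfect linear scaling, and no overhead:

```python
a100_hours = 83_000   # approximate A100 80GB hours quoted above for the 7B model
num_gpus = 64         # hypothetical cluster size

wall_clock_hours = a100_hours / num_gpus
print(f"{wall_clock_hours:.0f} hours, i.e. about {wall_clock_hours / 24:.0f} days")  # ~1297 hours, ~54 days
```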
