Improve RedPajama download tutorial (#647)
carmocca authored Oct 16, 2023
1 parent 39adff8 commit c1c618c
Showing 1 changed file with 15 additions and 14 deletions.
29 changes: 15 additions & 14 deletions tutorials/pretrain_redpajama.md
@@ -28,25 +28,34 @@ the smaller [RedPajama-1T-Sample](https://huggingface.co/datasets/togethercomput

You can download the data using git lfs:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com):
git lfs install
```

```bash
# The full 1 trillion token dataset:
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T
```

```bash
# The 1 billion token subset
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample
```
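
Optionally, you can sanity-check the clone by counting the downloaded `jsonl` files. A minimal sketch in Python, assuming the target directories used above (the expected file counts are given in the next section):

```python
from pathlib import Path

# Point this at whichever clone you made above
data_dir = Path("data/RedPajama-Data-1T-Sample")

# RedPajama shards are stored as jsonl files
jsonl_files = sorted(data_dir.rglob("*.jsonl"))
print(f"Found {len(jsonl_files)} jsonl files under {data_dir}")
```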

## Prepare RedPajama for training

The full dataset consists of 2084 `jsonl` files (the sample dataset contains 11). In order to start pretraining lit-gpt
on it, you need to read, tokenize, and write the data in binary chunks. This will leverage the `PackedDataset`
streaming dataset that comes with lit-gpt. You will need to have the tokenizer config available:

```bash
pip install huggingface_hub sentencepiece

python scripts/download.py --repo_id meta-llama/Llama-2-7b-hf --access_token your_hf_token
```

Then, run

```bash
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ --destination_path data/lit-redpajama
@@ -62,7 +71,7 @@ for the sample dataset.

In the above, we assume that you will use the same tokenizer as LLaMA, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a vocabulary size of 32,000 will work here.
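
If you want to train your own tokenizer instead, here is a minimal sketch using the `sentencepiece` Python package; the corpus path and output prefix are hypothetical placeholders, not part of this tutorial:

```python
import sentencepiece as spm

# Hypothetical paths: replace with your own text corpus and output prefix
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",        # plain-text training file(s)
    model_prefix="my_tokenizer",  # writes my_tokenizer.model / my_tokenizer.vocab
    vocab_size=32000,             # match the vocabulary size expected by the model
    model_type="bpe",             # LLaMA-style tokenizers are BPE-based
)

# Quick check that the trained model loads and reports the expected vocab size
sp = spm.SentencePieceProcessor(model_file="my_tokenizer.model")
print(sp.vocab_size())  # 32000
```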

The preparation script will take a while to run, so time for :tea:. (Preparing the 1B sample takes about 45 minutes.)
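
The prepared data is written as binary `.bin` chunks in the destination directory, which `PackedDataset` then streams during pretraining. Below is a minimal sketch of how such chunks might be loaded; the module path and constructor arguments are assumptions and may differ between lit-gpt versions:

```python
import glob

from torch.utils.data import DataLoader

from lit_gpt.packed_dataset import PackedDataset  # assumed module path

# Binary chunks written by scripts/prepare_redpajama.py (destination path from above)
filenames = sorted(glob.glob("data/lit-redpajama/*.bin"))

dataset = PackedDataset(
    filenames,
    n_chunks=4,        # number of chunk files kept in memory at a time
    block_size=2048,   # length of each streamed token sequence
    shuffle=True,
    seed=42,
)
dataloader = DataLoader(dataset, batch_size=4)

batch = next(iter(dataloader))
print(batch.shape)  # e.g. torch.Size([4, 2048])
```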

## Pretraining

@@ -95,15 +104,7 @@ The currently supported model names are contained in the [config.py](https://git
You can

1) either search this file for lines containing "name =",
2) or run `python scripts/download.py` without additional command line arguments

Keep in mind that the original LLaMA training for the 7B model required 83k A100 80GB
hours, so you'll need access to a cluster.
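
For a rough sense of scale, here is a back-of-envelope estimate assuming a hypothetical 64-GPU cluster, perfect linear scaling, and no overhead:

```python
a100_hours = 83_000   # approximate A100 80GB hours quoted above for the 7B model
num_gpus = 64         # hypothetical cluster size

wall_clock_hours = a100_hours / num_gpus
print(f"{wall_clock_hours:.0f} hours, i.e. about {wall_clock_hours / 24:.0f} days")  # ~1297 hours, ~54 days
```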
