Custom datasets examples for pretraining and continued pretraining (#1276)

Co-authored-by: Carlos Mocholí <[email protected]>
rasbt and carmocca authored Apr 13, 2024
1 parent 4e731c0 commit 683b3cb
Showing 1 changed file with 40 additions and 10 deletions.
50 changes: 40 additions & 10 deletions README.md
@@ -105,24 +105,54 @@ litgpt chat \
--checkpoint_dir out/phi-2-lora/final
```

&nbsp;

### Pretrain an LLM
-Train an LLM from scratch on your own data via [pretraining](tutorials/pretrain.md):
+Train an LLM from scratch on your own data via pretraining:

```bash
+mkdir -p custom_texts
+curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
+curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
+
+# 1) Download a tokenizer
+litgpt download \
+  --repo_id EleutherAI/pythia-160m \
+  --tokenizer_only True
+
+# 2) Pretrain the model
+litgpt pretrain \
+  --model_name pythia-160m \
+  --tokenizer_dir checkpoints/EleutherAI/pythia-160m \
+  --data TextFiles \
+  --data.train_data_path "custom_texts/" \
+  --train.max_tokens 10_000_000 \
+  --out_dir out/custom-model
+
+# 3) Chat with the model
+litgpt chat \
+  --checkpoint_dir out/custom-model/final
+```
+
+Specialize an already pretrained model by training on custom data:

```
+mkdir -p custom_texts
+curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
+curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
 # 1) Download a pretrained model
-litgpt download --repo_id microsoft/phi-2
+litgpt download --repo_id EleutherAI/pythia-160m
-# 2) Finetune the model
+# 2) Continue pretraining the model
 litgpt pretrain \
-  --initial_checkpoint_dir checkpoints/microsoft/phi-2 \
-  --data Alpaca2k \
-  --out_dir out/custom-phi-2
+  --model_name pythia-160m \
+  --initial_checkpoint_dir checkpoints/EleutherAI/pythia-160m \
+  --data TextFiles \
+  --data.train_data_path "custom_texts/" \
+  --train.max_tokens 10_000_000 \
+  --out_dir out/custom-model
 # 3) Chat with the model
 litgpt chat \
-  --checkpoint_dir out/phi-2-lora/final
+  --checkpoint_dir out/custom-model/final
```

&nbsp;
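
Both README examples above read every file under `custom_texts/` and cap training at `--train.max_tokens 10_000_000`. Before starting a run, it can help to check that the download step actually produced text and to get a rough sense of the corpus size. The following is a minimal sketch only: `count_words` is a hypothetical helper and is not part of litgpt, and a word count is only a loose proxy for the token count, which depends on the tokenizer.

```shell
# Hypothetical helper (not part of litgpt): print the total number of
# whitespace-separated words across all .txt files under a directory.
count_words() {
  # find passes the matching files to cat in batches; wc -w counts the words
  # of the concatenated stream; tr strips any padding that wc adds.
  find "$1" -type f -name '*.txt' -exec cat {} + | wc -w | tr -d ' '
}

# Report the corpus size only if the directory from the examples exists.
if [ -d custom_texts ]; then
  echo "custom_texts: $(count_words custom_texts) words"
fi
```

A result of 0 usually means the `curl` downloads failed or the files were saved somewhere other than `custom_texts/`.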