Custom datasets examples for pretraining and continued pretraining #1276

Merged · 4 commits · Apr 13, 2024 · Changes from all commits
README.md: 50 changes (40 additions, 10 deletions)

@@ -105,24 +105,54 @@ litgpt chat \
--checkpoint_dir out/phi-2-lora/final
```

 

### Pretrain an LLM
- Train an LLM from scratch on your own data via [pretraining](tutorials/pretrain.md):
+ Train an LLM from scratch on your own data via pretraining:

```bash
mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download \
--repo_id EleutherAI/pythia-160m \
--tokenizer_only True

# 2) Pretrain the model
litgpt pretrain \
--model_name pythia-160m \
--tokenizer_dir checkpoints/EleutherAI/pythia-160m \
--data TextFiles \
--data.train_data_path "custom_texts/" \
--train.max_tokens 10_000_000 \
--out_dir out/custom-model

# 3) Chat with the model
litgpt chat \
--checkpoint_dir out/custom-model/final
```
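The `--train.max_tokens 10_000_000` budget above is a small smoke-test run rather than a full pretraining job. Since `TextFiles` trains on whatever plain-text files sit in `--data.train_data_path`, the toy corpus can be grown by dropping more `.txt` files into `custom_texts/`. A minimal sketch; the extra Gutenberg URL (`book3.txt`) is an illustrative assumption, not part of this PR:

```bash
# Hypothetical extra corpus file; TextFiles picks up every plain-text file
# placed in custom_texts/ on the next run.
curl https://www.gutenberg.org/cache/epub/84/pg84.txt --output custom_texts/book3.txt

# Rough corpus-size sanity check (word count) before launching the run
wc -w custom_texts/*.txt
```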

Specialize an already pretrained model by training on custom data:

```
mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a pretrained model
- litgpt download --repo_id microsoft/phi-2
+ litgpt download --repo_id EleutherAI/pythia-160m

- # 2) Finetune the model
+ # 2) Continue pretraining the model
litgpt pretrain \
- --initial_checkpoint_dir checkpoints/microsoft/phi-2 \
- --data Alpaca2k \
- --out_dir out/custom-phi-2
+ --model_name pythia-160m \
+ --initial_checkpoint_dir checkpoints/EleutherAI/pythia-160m \
+ --data TextFiles \
+ --data.train_data_path "custom_texts/" \
+ --train.max_tokens 10_000_000 \
+ --out_dir out/custom-model

# 3) Chat with the model
litgpt chat \
- --checkpoint_dir out/phi-2-lora/final
+ --checkpoint_dir out/custom-model/final
```
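Here `--initial_checkpoint_dir` loads the downloaded weights instead of initializing them randomly, so `--model_name` must match the architecture of that checkpoint. As a sketch of the same recipe pointed at the phi-2 checkpoint the old example used (assuming `phi-2` is a valid `--model_name`; every other flag is taken from the run above):

```bash
# Sketch: continued pretraining of phi-2 on the same custom corpus.
# Assumes "phi-2" is a valid --model_name matching the checkpoint below.
litgpt download --repo_id microsoft/phi-2

litgpt pretrain \
  --model_name phi-2 \
  --initial_checkpoint_dir checkpoints/microsoft/phi-2 \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-phi-2
```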

 