Commit 53008ac

Merge branch 'Lightning-AI:main' into correct-output-name
yutsai84 authored Apr 2, 2024
2 parents 23d4e9b + 3a3ae43 commit 53008ac
Showing 4 changed files with 89 additions and 4 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -27,7 +27,7 @@

 Optimized and efficient code: Flash Attention v2, multi-GPU support via fully-sharded data parallelism, [optional CPU offloading](tutorials/oom.md#do-sharding-across-multiple-gpus), and [TPU and XLA support](extensions/xla).

 [Pretraining](tutorials/pretrain_tinyllama.md), [finetuning](tutorials/finetune.md), and [inference](tutorials/inference.md) in various precision settings: FP32, FP16, BF16, and FP16/FP32 mixed.
 [Pretraining](tutorials/pretrain.md), [finetuning](tutorials/finetune.md), and [inference](tutorials/inference.md) in various precision settings: FP32, FP16, BF16, and FP16/FP32 mixed.

 [Configuration files](config_hub) for great out-of-the-box performance.

@@ -37,7 +37,7 @@

 [Exporting](tutorials/convert_lit_models.md) to other popular model weight formats.

 Many popular datasets for [pretraining](tutorials/pretrain_tinyllama.md) and [finetuning](tutorials/prepare_dataset.md), and [support for custom datasets](tutorials/prepare_dataset.md#preparing-custom-datasets-for-instruction-finetuning).
 Many popular datasets for [pretraining](tutorials/pretrain.md) and [finetuning](tutorials/prepare_dataset.md), and [support for custom datasets](tutorials/prepare_dataset.md#preparing-custom-datasets-for-instruction-finetuning).

 Readable and easy-to-modify code to experiment with the latest research ideas.

@@ -114,7 +114,7 @@ For more information, refer to the [download](tutorials/download_model_weights.m

## Finetuning and pretraining

 LitGPT supports [pretraining](tutorials/pretrain_tinyllama.md) and [finetuning](tutorials/finetune.md) to optimize models on existing or custom datasets. Below is an example showing how to finetune a model with LoRA:
 LitGPT supports [pretraining](tutorials/pretrain.md) and [finetuning](tutorials/finetune.md) to optimize models on existing or custom datasets. Below is an example showing how to finetune a model with LoRA:

```bash
# 1) Download a pretrained model
@@ -336,7 +336,7 @@ If you have general questions about building with LitGPT, please [join our Disco
Tutorials and in-depth feature documentation can be found below:
- Finetuning, incl. LoRA, QLoRA, and Adapters ([tutorials/finetune.md](tutorials/finetune.md))
- Pretraining ([tutorials/pretrain_tinyllama.md](tutorials/pretrain_tinyllama.md))
- Pretraining ([tutorials/pretrain.md](tutorials/pretrain.md))
- Model evaluation ([tutorials/evaluation.md](tutorials/evaluation.md))
- Supported and custom datasets ([tutorials/prepare_dataset.md](tutorials/prepare_dataset.md))
- Quantization ([tutorials/quantize.md](tutorials/quantize.md))
1 change: 1 addition & 0 deletions tutorials/0_to_litgpt.md
@@ -125,6 +125,7 @@ litgpt pretrain --help

**More information and additional resources**

- [tutorials/pretrain](./pretrain.md): General information about pretraining in LitGPT
- [tutorials/pretrain_tinyllama](./pretrain_tinyllama.md): A tutorial for pretraining a 1.1B TinyLlama model on 3 trillion tokens
- [config_hub/pretrain](../config_hub/pretrain): Pre-made config files for pretraining that work well out of the box
- Project templates in reproducible environments with multi-GPU and multi-node support:
65 changes: 65 additions & 0 deletions tutorials/pretrain.md
@@ -0,0 +1,65 @@
# Pretrain LLMs with LitGPT


This document explains how to pretrain LLMs using LitGPT.

 
## The Pretraining API

You can pretrain models in LitGPT using the `litgpt pretrain` API, starting from any of the available architectures. To list them, call `litgpt pretrain` without any additional arguments:

```bash
litgpt pretrain
```

Shown below is an abbreviated list:

```
ValueError: Please specify --model_name <model_name>. Available values:
Camel-Platypus2-13B
...
Gemma-2b
...
Llama-2-7b-hf
...
Mixtral-8x7B-v0.1
...
pythia-14m
```

For demonstration purposes, we can pretrain a 14 million-parameter Pythia model on the small TinyStories dataset using the [debug.yaml config file](https://github.com/Lightning-AI/litgpt/blob/main/config_hub/pretrain/debug.yaml) as follows:

```bash
litgpt pretrain \
--model_name pythia-14m \
--config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/pretrain/debug.yaml
```




&nbsp;
## Pretrain a 1.1B TinyLlama model

You can find an end-to-end LitGPT tutorial for pretraining a TinyLlama model using LitGPT [here](pretrain_tinyllama.md).


&nbsp;
## Optimize LitGPT pretraining with Lightning Thunder

[Lightning Thunder](https://github.com/Lightning-AI/lightning-thunder) is a source-to-source compiler for PyTorch, which is fully compatible with LitGPT. In experiments, Thunder resulted in a 40% speed-up compared to using regular PyTorch when finetuning a 7B Llama 2 model.

For more information, see the [Lightning Thunder extension README](https://github.com/Lightning-AI/lightning-thunder).
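
As a rough illustration of what Thunder does (a sketch only, not part of the LitGPT pretraining command, and assuming the `lightning-thunder` package is installed and exposes `thunder.jit`), a plain PyTorch module can be compiled like this:

```python
# Minimal sketch: compiling a stand-in PyTorch module with Lightning Thunder.
# The model below is a placeholder, not the LitGPT model definition.
import torch
import thunder

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
compiled_model = thunder.jit(model)  # source-to-source compilation of the module

x = torch.randn(8, 1024)
y = compiled_model(x)  # runs the Thunder-optimized forward pass
```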


&nbsp;
## Project templates

The following [Lightning Studio](https://lightning.ai/lightning-ai/studios) templates provide LitGPT pretraining projects in reproducible environments with multi-GPU and multi-node support:
&nbsp;

| | |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <p align="left">[Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) <br> [<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/3.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | [Pretrain LLMs - TinyLlama 1.1B](https://lightning.ai/lightning-ai/studios/pretrain-llms-tinyllama-1-1b) <br> <p align="left">[<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/4.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/pretrain-llms-tinyllama-1-1b) |
| [Continued Pretraining with TinyLlama 1.1B](https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b) <br> <p align="left">[<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/1.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b) | |
| |
19 changes: 19 additions & 0 deletions tutorials/pretrain_tinyllama.md
@@ -5,6 +5,7 @@ This tutorial will walk you through pretraining [TinyLlama](https://github.com/j
> [!TIP]
> To get started with zero setup, clone the [TinyLlama studio on Lightning AI](https://lightning.ai/lightning-ai/studios/llm-pretrain-tinyllama-1-1b).
&nbsp;
## What's TinyLlama?

 [TinyLlama](https://github.com/jzhang38/TinyLlama/) is architecturally the same as Meta AI's Llama 2, but it only has 1.1B parameters and is instead trained for multiple epochs on a mix of the [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) and [Starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) datasets.
@@ -26,6 +27,7 @@ Here is a quick fact sheet:

(this table was sourced from the author's [README](https://github.com/jzhang38/TinyLlama/))

&nbsp;
## Download datasets

You can download the data using git lfs:
@@ -42,6 +44,7 @@ git clone https://huggingface.co/datasets/bigcode/starcoderdata data/starcoderda

Around 1.2 TB of disk space is required to store both datasets.

&nbsp;
## Prepare the datasets for training

To start pretraining with LitGPT on this data, you need to read, tokenize, and write it in binary chunks. This leverages the `litdata` optimization pipeline and its streaming dataset, as sketched below.
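
As a rough conceptual sketch of this step (assuming litdata's `optimize` function; the actual prepare scripts shipped with LitGPT differ in their details, and the paths and toy tokenizer below are placeholders), the tokenize-and-chunk idea looks roughly like this:

```python
# Hypothetical sketch of the litdata optimization step, not the actual LitGPT
# prepare script: tokenize raw text files and write them as binary chunks.
from pathlib import Path

import torch
from litdata import optimize


def tokenize(filepath: str):
    # Placeholder tokenizer: the real pipeline uses the Llama/TinyLlama
    # SentencePiece tokenizer rather than a whitespace split.
    text = Path(filepath).read_text(encoding="utf-8")
    yield torch.tensor([hash(token) % 32000 for token in text.split()])


if __name__ == "__main__":
    optimize(
        fn=tokenize,                              # applied to every input file
        inputs=[str(p) for p in Path("data/raw-text").glob("*.txt")],
        output_dir="data/optimized",              # binary chunks are written here
        chunk_bytes="64MB",                       # target size of each chunk on disk
    )
```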
@@ -95,6 +98,7 @@ python litgpt/data/prepare_slimpajama.py \
If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
The above assumes that you will be using the same tokenizer as used in Llama/TinyLlama, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will do here.
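
For illustration, a minimal sketch of training such a tokenizer with the `sentencepiece` Python package could look like the following; the corpus path and model prefix are placeholders, not files from this tutorial:

```python
# Minimal sketch: training a SentencePiece tokenizer with a 32000-token
# vocabulary, matching the vocabulary size expected by this pipeline.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/tokenizer_corpus.txt",  # plain-text corpus, one sentence per line
    model_prefix="custom_tokenizer",    # writes custom_tokenizer.model and .vocab
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="custom_tokenizer.model")
print(sp.encode("Pretraining small language models on TinyStories.", out_type=int))
```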

&nbsp;
## Pretraining

Running the pretraining script with its default settings requires at least 8 A100 GPUs.
@@ -139,6 +143,7 @@ Last, logging is kept minimal in the script, but for long-running experiments we
As an example, we included WandB (set `--logger_name=wandb`) to show how you can integrate any experiment tracking framework.
For reference, [here are the loss curves for our reproduction](https://api.wandb.ai/links/awaelchli/y7pzdpwy).

&nbsp;
## Resume training

The checkpoints saved during pretraining contain all the information to resume if needed.
@@ -151,6 +156,7 @@ litgpt pretrain \
```
**Important:** Each checkpoint is a directory. Point to the directory, not the 'lit_model.pth' file inside of it.

&nbsp;
## Export checkpoints

After training is completed, you can convert the checkpoint to a format that can be loaded for evaluation, inference, finetuning, etc.
@@ -172,3 +178,16 @@ checkpoints/tiny-llama/final
```

You can then use this checkpoint folder to run [evaluation](evaluation.md), [inference](inference.md), or [finetuning](finetune_lora.md), or to [process the checkpoint further](convert_lit_models.md).


&nbsp;
## Project templates

The following [Lightning Studio](https://lightning.ai/lightning-ai/studios) templates provide LitGPT pretraining projects in reproducible environments with multi-GPU and multi-node support:
&nbsp;

| | |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <p align="left">[Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) <br> [<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/3.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | [Pretrain LLMs - TinyLlama 1.1B](https://lightning.ai/lightning-ai/studios/pretrain-llms-tinyllama-1-1b) <br> <p align="left">[<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/4.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/pretrain-llms-tinyllama-1-1b) |
| [Continued Pretraining with TinyLlama 1.1B](https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b) <br> <p align="left">[<img src="https://pl-public-data.s3.amazonaws.com/assets_litgpt/readme/1.webp" width="300"></p>](https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b) | |
| |
