Merge branch 'main' into litgpt-eval
rasbt authored Apr 3, 2024
2 parents 052d097 + 70218de commit e6f8dc3
Showing 67 changed files with 1,939 additions and 440 deletions.
8 changes: 3 additions & 5 deletions .github/workflows/cpu-tests.yml
@@ -37,7 +37,7 @@ jobs:
- uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

@@ -46,9 +46,7 @@ jobs:

- name: Install minimal dependencies
run: |
-        # uv pip install . is not yet supported, only `-e .`
-        # https://github.com/astral-sh/uv/issues/1896
-        uv pip install --system -e .
+        uv pip install --system .
uv pip list
# make sure all modules are still importable with only the minimal dependencies available
modules=$(
@@ -61,7 +59,7 @@ jobs:
- name: Install all dependencies
run: |
-        uv pip install --system -e '.[all,test]' 'lm_eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@115206dc89dad67b8b'
+        uv pip install --system '.[all,test]' 'lm_eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@115206dc89dad67b8b'
uv pip list
- name: Run tests
32 changes: 21 additions & 11 deletions README.md
@@ -27,21 +27,20 @@

 Optimized and efficient code: Flash Attention v2, multi-GPU support via fully-sharded data parallelism, [optional CPU offloading](tutorials/oom.md#do-sharding-across-multiple-gpus), and [TPU and XLA support](extensions/xla).

- [Pretraining](tutorials/pretraining.md), [finetuning](tutorials/finetune.md), and [inference](tutorials/inference.md) in various precision settings: FP32, FP16, BF16, and FP16/FP32 mixed.
+ [Pretraining](tutorials/pretrain.md), [finetuning](tutorials/finetune.md), and [inference](tutorials/inference.md) in various precision settings: FP32, FP16, BF16, and FP16/FP32 mixed.

 [Configuration files](config_hub) for great out-of-the-box performance.

 Efficient finetuning: [LoRA](tutorials/finetune_lora.md), [QLoRA](tutorials/finetune_lora.md), [Adapter](tutorials/finetune_adapter.md), and [Adapter v2](tutorials/finetune_adapter.md).

 [Quantization](tutorials/quantize.md): 4-bit floats, 8-bit integers, and double quantization.

- [Exporting](https://github.com/Lightning-AI/litgpt/blob/wip/tutorials/convert_lit_models.md) to other popular model weight formats.
+ [Exporting](tutorials/convert_lit_models.md) to other popular model weight formats.

- Many popular datasets for [pretraining](tutorials/pretrain_tinyllama.md) and [finetuning](tutorials/prepare_dataset.md), and [support for custom datasets](tutorials/prepare_dataset.md#preparing-custom-datasets-for-instruction-finetuning).
+ Many popular datasets for [pretraining](tutorials/pretrain.md) and [finetuning](tutorials/prepare_dataset.md), and [support for custom datasets](tutorials/prepare_dataset.md#preparing-custom-datasets-for-instruction-finetuning).

 Readable and easy-to-modify code to experiment with the latest research ideas.


 
<br>
&nbsp;
@@ -59,8 +58,6 @@ The following [Lightning Studio](https://lightning.ai/lightning-ai/studios) temp





&nbsp;
<br>
&nbsp;
@@ -107,9 +104,17 @@ For more information, refer to the [download](tutorials/download_model_weights.m


&nbsp;

> [!NOTE]
> We recommend starting with the **[Zero to LitGPT: Getting Started with Pretraining, Finetuning, and Using LLMs](tutorials/0_to_litgpt.md)** tutorial if you are new to LitGPT.


&nbsp;

## Finetuning and pretraining

-LitGPT supports [pretraining](tutorials/pretrain_tinyllama.md) and [finetuning](tutorials/finetune.md) to optimize models on existing or custom datasets. Below is an example showing how to finetune a model with LoRA:
+LitGPT supports [pretraining](tutorials/pretrain.md) and [finetuning](tutorials/finetune.md) to optimize models on existing or custom datasets. Below is an example showing how to finetune a model with LoRA:

```bash
# 1) Download a pretrained model
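# (The diff collapses the remaining commands of this example. A plausible
# completion, sketched from the LitGPT CLI of this period — the model name
# and output paths below are illustrative assumptions:)
litgpt download --repo_id microsoft/phi-2

# 2) Finetune the model with LoRA
litgpt finetune lora \
  --checkpoint_dir checkpoints/microsoft/phi-2 \
  --out_dir out/phi-2-lora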
```

@@ -134,7 +139,7 @@ LitGPT also allows users to use configuration files in YAML format instead of sp

```bash
litgpt finetune lora \
-  --config https://github.com/Lightning-AI/litgpt/blob/wip/config_hub/finetune/llama-2-7b/lora.yaml
+  --config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/finetune/llama-2-7b/lora.yaml
```

For added convenience, you can also manually override config file settings via the CLI:
@@ -146,7 +151,7 @@ litgpt finetune lora \
--lora_r 4
```

-You can browse the available configuration files [here](https://github.com/Lightning-AI/litgpt/tree/main/config_hub).
+You can browse the available configuration files [here](config_hub).

&nbsp;

@@ -324,8 +329,14 @@ If you have general questions about building with LitGPT, please [join our Disco
## Tutorials, how-to guides, and docs
> [!NOTE]
> We recommend starting with the **[Zero to LitGPT: Getting Started with Pretraining, Finetuning, and Using LLMs](tutorials/0_to_litgpt.md)** tutorial if you are new to LitGPT.

Tutorials and in-depth feature documentation can be found below:
- Finetuning, incl. LoRA, QLoRA, and Adapters ([tutorials/finetune.md](tutorials/finetune.md))
-- Pretraining ([tutorials/pretrain_tinyllama.md](tutorials/pretrain_tinyllama.md))
+- Pretraining ([tutorials/pretrain.md](tutorials/pretrain.md))
- Model evaluation ([tutorials/evaluation.md](tutorials/evaluation.md))
- Supported and custom datasets ([tutorials/prepare_dataset.md](tutorials/prepare_dataset.md))
- Quantization ([tutorials/quantize.md](tutorials/quantize.md))
@@ -401,4 +412,3 @@ If you use LitGPT in your research, please cite the following work:
## License

LitGPT is released under the [Apache 2.0](https://github.com/Lightning-AI/litgpt/blob/main/LICENSE) license.

45 changes: 36 additions & 9 deletions config_hub/finetune/README.md
@@ -1,6 +1,6 @@
## Config files

The table below lists the performance you can expect from the provided config files. Note that you can achieve lower memory consumption by lowering the micro batch size as needed. In addition, you can lower the rank (`lora_r`) in the LoRA configuration files and disable LoRA for certain layers (for example, setting `lora_projection` and other LoRA layer-specific parameters to `false`).
For more information, see the [Dealing with out-of-memory (OOM) errors](../../tutorials/oom.md) guide on lowering the memory requirements.

&nbsp;
@@ -11,29 +11,56 @@ For more information, see the [Dealing with out-of-memory (OOM) errors](../../tu
| Config | Size | Dataset | Epochs | Val loss | Peak memory | Max seq length | Micro batch size | Precision | Training runtime |
| ------ | ---- | ------- | ------ | -------- | ----------- | -------------- | ---------------- | --------- | ---------------- |
| falcon-7b/lora.yaml | 7B | Alpaca 2k | 4 | 0.945 | 16.69 GB | 512 | 2 | bfloat16 | 24.88 min (1xA10G) |
| falcon-7b/qlora.yaml | 7B | Alpaca 2k | 4 | 0.993 | 9.44 GB | 512 | 2 | bfloat16 | 50.76 min (1xA10G) |
| | | | | | | | | | |
-| gemma-2b/lora.yaml | 2B | Alpaca 2k | 3 | 1.476 | 12.62 GB | 512 | 2 | bfloat16 | 18.31 min (1xA10G) |
-| gemma-2b/qlora.yaml | 2B | Alpaca 2k | 3 | 1.626 | 11.51 GB | 512 | 2 | bfloat16 | 25.29 min (1xA10G) |
-| gemma-2b/full.yaml | 2B | Alpaca 2k | 0.35 | 1.046 | 18.47 GB | 512 | 2 | bfloat16 | 16.79 min (2xA10G) |
+| gemma-2b/lora.yaml | 2B | Alpaca 2k | 2 | 1.476 | 12.62 GB | 512 | 2 | bfloat16 | 9.29 min (1xA10G) |
+| gemma-2b/qlora.yaml | 2B | Alpaca 2k | 2 | 0.981 | 11.59 GB | 512 | 2 | bfloat16 | 12.90 min (1xA10G) |
+| gemma-2b/full.yaml | 2B | Alpaca 2k | 0.35 | 0.990 | 17.43 GB | 512 | 1 | bfloat16 | 13.61 min (4xA10G) |
| | | | | | | | | | |
+| gemma-7b/lora.yaml | 7B | Alpaca 2k | 2 | 0.903 | 25.30 GB | 512 | 1 | bfloat16 | 11.47 min (1xA100) |
+| gemma-7b/qlora.yaml | 7B | Alpaca 2k | 2 | 0.951 | 17.31 GB | 512 | 1 | bfloat16 | 23.46 min (1xA100) |
| | | | | | | | | | |
| llama-2-7b/lora.yaml | 7B | Alpaca 2k | 4 | 0.802 | 19.77 GB | 512 | 2 | bfloat16 | 32.75 min (A10G) |
| llama-2-7b/qlora.yaml | 7B | Alpaca 2k | 4 | 0.814 | 13.68 GB | 512 | 2 | bfloat16 | 45.68 min (A10G) |
| llama-2-7b/full.yaml | 7B | Alpaca 2k | 1 | 0.941 | 26.81 GB | 512 | 4 | bfloat16 | 1.78 min (4xA100) |
| | | | | | | | | | |
-| mistral-7b/lora.yaml | 7B | Alpaca 2k | 4 | 0.796 | 20.65 GB | 512 | 2 | bfloat16 | 31.04 min (1xA10G) |
-| mistral-7b/qlora.yaml | 7B | Alpaca 2k | 4 | 0.803 | 14.29 GB | 512 | 2 | bfloat16 | 44.69 min (1xA10G) |
+| mistral-7b/lora.yaml (v0.1) | 7B | Alpaca 2k | 4 | 0.796 | 20.65 GB | 512 | 2 | bfloat16 | 31.04 min (1xA10G) |
+| mistral-7b/qlora.yaml (v0.1) | 7B | Alpaca 2k | 4 | 0.803 | 14.29 GB | 512 | 2 | bfloat16 | 44.69 min (1xA10G) |
| | | | | | | | | | |
+| mistral-7b-v0.2/lora.yaml | 7B | Alpaca 2k | 4 | 0.801 | 20.65 GB | 512 | 2 | bfloat16 | 30.96 min (1xA10G) |
+| mistral-7b-v0.2/qlora.yaml | 7B | Alpaca 2k | 4 | 0.813 | 14.29 GB | 512 | 2 | bfloat16 | 44.68 min (1xA10G) |
| | | | | | | | | | |
| phi-2/lora.yaml | 2B | Alpaca 2k | 1 | 0.832 | 13.98 GB | 512 | 4 | bfloat16 | 3.82 min (1xA10G) |
| phi-2/qlora.yaml | 2B | Alpaca 2k | 1 | 0.846 | 14.27 GB | 512 | 4 | bfloat16 | 4.55 min (1xA10G) |
| phi-2/full.yaml | 2B | Alpaca 2k | 1 | 0.937 | 14.44 GB | 512 | 4 | bfloat16 | 13.00 min (1xA10G) |
| | | | | | | | | | |
-| stablelm-base-alpha-3b/lora.yaml | 7B | Alpaca 2k | 4 | 1.367 | 8.58 GB | 512 | 2 | bfloat16 | 13.02 min (1xA10G) |
-| stablelm-base-alpha-3b/qlora.yaml | 7B | Alpaca 2k | 4 | 1.392 | 5.24 GB | 512 | 2 | bfloat16 | 25.71 min (1xA10G) |
-| stablelm-base-alpha-3b/full.yaml | 7B | Alpaca 2k | 1 | 1.494 | 21.23 GB | 512 | 1 | bfloat16 | 72.72 min (2xA10G) |
+| stablelm-base-alpha-3b/lora.yaml | 3B | Alpaca 2k | 4 | 1.367 | 8.58 GB | 512 | 2 | bfloat16 | 13.02 min (1xA10G) |
+| stablelm-base-alpha-3b/qlora.yaml | 3B | Alpaca 2k | 4 | 1.392 | 5.24 GB | 512 | 2 | bfloat16 | 25.71 min (1xA10G) |
+| stablelm-base-alpha-3b/full.yaml | 3B | Alpaca 2k | 1 | 1.494 | 21.23 GB | 512 | 1 | bfloat16 | 72.72 min (2xA10G) |
| | | | | | | | | | |
| tiny-llama/lora.yaml | 1.1B | Alpaca 2k | 3 | 1.038 | 13.50 GB | 512 | 8 | bfloat16 | 8.06 min (1xA10G) |
| tiny-llama/qlora.yaml | 1.1B | Alpaca 2k | 3 | 1.056 | 16.24 GB | 512 | 8 | bfloat16 | 8.74 min (1xA10G) |
| tiny-llama/full.yaml | 1.1B | Alpaca 2k | 1 | 1.105 | 14.10 GB | 512 | 4 | bfloat16 | 2.59 min (1xA10G) |

&nbsp;
## Extending the context length

If you require a longer sequence length than the one used in a given config file, you can either edit `max_seq_length` in the config file or override it when running the finetuning command, for example by passing `--max_seq_length 4096`, as in the sketch below.
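A sketch of the command-line override (the config path here is one of the files from the table above):

```bash
litgpt finetune lora \
  --config config_hub/finetune/phi-2/lora.yaml \
  --max_seq_length 4096
```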

&nbsp;
## Training on GPUs without bfloat16 support

If you are training on GPUs without bfloat16 support, you need to change the `precision` option to `16-true` (16-bit floating point precision) or `16-mixed` (16/32-bit mixed precision):

```bash
litgpt finetune lora \
--config config_hub/finetune/phi-2/lora.yaml \
--precision 16-true
```
or

```bash
litgpt finetune lora \
--config config_hub/finetune/phi-2/lora.yaml \
--precision 16-mixed
```

Note that `16-true` is more compute- and memory-efficient, but it can sometimes lead to training convergence issues. In that case, it's recommended to use `16-mixed`.
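If you are unsure whether your GPU supports bfloat16, you can query the PyTorch runtime that LitGPT builds on (a quick check; assumes a CUDA GPU):

```bash
# Prints True on bfloat16-capable GPUs (e.g., A100/A10G), False otherwise
python -c "import torch; print(torch.cuda.is_bf16_supported())"
```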
8 changes: 4 additions & 4 deletions config_hub/finetune/gemma-2b/full.yaml
@@ -9,7 +9,7 @@ out_dir: out/finetune/full-gemma-2b
precision: bf16-true

# How many devices/GPUs to use. (type: Union[int, str], default: 1)
-devices: 1
+devices: 4

# Data-related arguments. If not provided, the default is ``litgpt.data.Alpaca``.
data:
@@ -32,7 +32,7 @@ train:
log_interval: 1

# Number of samples between optimizer steps across data-parallel ranks (type: int, default: 128)
-  global_batch_size: 6
+  global_batch_size: 16

# Number of samples per data-parallel rank (type: int, default: 4)
micro_batch_size: 1
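(With the new values above — `devices: 4`, `global_batch_size: 16`, and `micro_batch_size: 1` — each optimizer step accumulates 16 / (1 × 4) = 4 micro-batches per device, following the relationship global batch = micro batch × devices × accumulation steps implied by the comments above.)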
@@ -41,13 +41,13 @@ train:
lr_warmup_steps: 100

# Number of epochs to train on (type: Optional[int], default: 5)
-  epochs: 3
+  epochs: 1

# Total number of tokens to train on (type: Optional[int], default: null)
max_tokens:

# Limits the number of optimizer steps to run. (type: Optional[int], default: null)
-  max_steps:
+  max_steps: 50
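(With both `epochs: 1` and `max_steps: 50` set, training stops at whichever limit is reached first — `max_steps` caps the number of optimizer steps, per the comment above.)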

# Limits the length of samples. Off by default (type: Optional[int], default: null)
max_seq_length: 512
4 changes: 2 additions & 2 deletions config_hub/finetune/gemma-2b/lora.yaml
@@ -15,7 +15,7 @@ quantize:
devices: 1

# The LoRA rank. (type: int, default: 8)
-lora_r: 16
+lora_r: 8

# The LoRA alpha. (type: int, default: 16)
lora_alpha: 16
@@ -71,7 +71,7 @@ train:
lr_warmup_steps: 200

# Number of epochs to train on (type: Optional[int], default: 5)
-  epochs: 4
+  epochs: 2

# Total number of tokens to train on (type: Optional[int], default: null)
max_tokens:
2 changes: 1 addition & 1 deletion config_hub/finetune/gemma-2b/qlora.yaml
@@ -71,7 +71,7 @@ train:
lr_warmup_steps: 200

# Number of epochs to train on (type: Optional[int], default: 5)
-  epochs: 4
+  epochs: 2

# Total number of tokens to train on (type: Optional[int], default: null)
max_tokens:
122 changes: 122 additions & 0 deletions config_hub/finetune/gemma-7b/lora.yaml
@@ -0,0 +1,122 @@

# The path to the base model's checkpoint directory to load for finetuning. (type: <class 'Path'>, default: checkpoints/stabilityai/stablelm-base-alpha-3b)
checkpoint_dir: checkpoints/google/gemma-7b

# Directory in which to save checkpoints and logs. (type: <class 'Path'>, default: out/lora)
out_dir: out/finetune/lora-gemma-7b

# The precision to use for finetuning. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-true

# If set, quantize the model with this algorithm. See ``tutorials/quantize.md`` for more information. (type: Optional[Literal['nf4', 'nf4-dq', 'fp4', 'fp4-dq', 'int8-training']], default: null)
quantize:

# How many devices/GPUs to use. (type: Union[int, str], default: 1)
devices: 1

# The LoRA rank. (type: int, default: 8)
lora_r: 16

# The LoRA alpha. (type: int, default: 16)
lora_alpha: 16
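# Note: with lora_r 16 and lora_alpha 16, the effective LoRA scaling factor
# (alpha / r, the common convention in LoRA implementations) is 1.0.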

# The LoRA dropout value. (type: float, default: 0.05)
lora_dropout: 0.1

# Whether to apply LoRA to the query weights in attention. (type: bool, default: True)
lora_query: true

# Whether to apply LoRA to the key weights in attention. (type: bool, default: False)
lora_key: true

# Whether to apply LoRA to the value weights in attention. (type: bool, default: True)
lora_value: true

# Whether to apply LoRA to the output projection in the attention block. (type: bool, default: False)
lora_projection: true

# Whether to apply LoRA to the weights of the MLP in the attention block. (type: bool, default: False)
lora_mlp: true

# Whether to apply LoRA to output head in GPT. (type: bool, default: False)
lora_head: true

# Data-related arguments. If not provided, the default is ``litgpt.data.Alpaca``.
data:
class_path: litgpt.data.Alpaca2k
init_args:
mask_prompt: false
val_split_fraction: 0.03847
prompt_style: alpaca
ignore_index: -100
seed: 42
num_workers: 4

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

# Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
save_interval: 800

# Number of iterations between logging calls (type: int, default: 1)
log_interval: 1

# Number of samples between optimizer steps across data-parallel ranks (type: int, default: 128)
global_batch_size: 6

# Number of samples per data-parallel rank (type: int, default: 4)
micro_batch_size: 1

# Number of iterations with learning rate warmup active (type: int, default: 100)
lr_warmup_steps: 200

# Number of epochs to train on (type: Optional[int], default: 5)
epochs: 2

# Total number of tokens to train on (type: Optional[int], default: null)
max_tokens:

# Limits the number of optimizer steps to run. (type: Optional[int], default: null)
max_steps:

# Limits the length of samples. Off by default (type: Optional[int], default: null)
max_seq_length: 512

# Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: null)
tie_embeddings:

# (type: float, default: 0.0003)
learning_rate: 0.0002

# (type: float, default: 0.02)
weight_decay: 0.0

# (type: float, default: 0.9)
beta1: 0.9

# (type: float, default: 0.95)
beta2: 0.95

# (type: Optional[float], default: null)
max_norm:

# (type: float, default: 6e-05)
min_lr: 6.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

# Number of optimizer steps between evaluation calls (type: int, default: 100)
interval: 25

# Number of tokens to generate (type: Optional[int], default: 100)
max_new_tokens: 100

# Number of iterations (type: int, default: 100)
max_iters: 100

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: csv)
logger_name: csv

# The random seed to use for reproducibility. (type: int, default: 1337)
seed: 1337