From 66e68c3bc678b020041d7259620fea2f48036efa Mon Sep 17 00:00:00 2001
From: rasbt
Date: Fri, 10 May 2024 11:12:41 -0500
Subject: [PATCH 1/2] Explain dataset options

---
 tutorials/prepare_dataset.md | 40 ++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/tutorials/prepare_dataset.md b/tutorials/prepare_dataset.md
index 055b769bac..2feea3db0b 100644
--- a/tutorials/prepare_dataset.md
+++ b/tutorials/prepare_dataset.md
@@ -50,6 +50,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help Alpaca` to list additional dataset-specific command line options.
+
 #### Truncating datasets

 By default, the finetuning scripts will determine the size of the longest tokenized sample in the dataset to determine the block size. However, if you are willing to truncate a few examples in the training set, you can reduce the computational resource requirements significantly. For instance you can set a sequence length threshold via `--train.max_seq_length`. We can determine an appropriate maximum sequence length by considering the distribution of the data sample lengths shown in the histogram below.
@@ -73,8 +76,20 @@ For comparison, the Falcon 7B model requires 23.52 GB of memory for the original

 [Alpaca-2k](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) is a smaller, 2000-sample subset of Alpaca described above.

+```bash
+litgpt finetune lora \
+  --data Alpaca2k \
+  --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
+```
+
+> [!TIP]
+> Use `litgpt finetune --data.help Alpaca2k` to list additional dataset-specific command line options.
+
+The Alpaca-2k dataset distribution is shown below.
+
+
 ### Alpaca-GPT4

 The Alpaca-GPT4 was built by using the prompts of the original Alpaca dataset and generate the responses via GPT 4. The
@@ -88,6 +103,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help AlpacaGPT4` to list additional dataset-specific command line options.
+
 The Alpaca-GPT4 dataset distribution is shown below.

@@ -108,6 +126,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help Alpaca` to list additional dataset-specific command line options.
+
 The Alpaca Libre dataset distribution is shown below.

@@ -136,6 +157,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help Deita` to list additional dataset-specific command line options.
+
 Deita contains multiturn conversations. By default, only the first instruction-response pairs from each
 of these multiturn conversations are included. If you want to override this behavior and include the follow-up instructions
 and responses, set `--data.include_multiturn_conversations True`, which will include all multiturn conversations as regular
@@ -172,6 +196,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help Dolly` to list additional dataset-specific command line options.
+
 The Dolly dataset distribution is shown below.

@@ -226,6 +253,9 @@ litgpt finetune lora \
   --train.max_seq_length 1500
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help LongForm` to list additional dataset-specific command line options.
+
 &nbsp;

 ### LIMA

 The LIMA dataset is a collection of 1,000 carefully curated prompts and responses, as described in the [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) paper. The dataset is sourced from three community Q&A websites: Stack Exchange, wikiHow, and the Pushshift Reddit Dataset. In addition, it also contains prompts and answers written and collected by the authors of the LIMA paper.
@@ -242,6 +272,10 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help LIMA` to list additional dataset-specific command line options.
+
+
 LIMA contains a handful of multiturn conversations. By default, only the first instruction-response pairs from each
 of these multiturn conversations are included. If you want to override this behavior and include the follow-up instructions
 and responses, set `--data.include_multiturn_conversations True`.
@@ -283,6 +317,9 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+> [!TIP]
+> Use `litgpt finetune --data.help FLAN` to list additional dataset-specific command line options.
+
 You can find a list of all 66 supported subsets [here](https://huggingface.co/datasets/Muennighoff/flan).

 &nbsp;
@@ -363,6 +400,9 @@ litgpt finetune lora \

 You can also pass a directory containing a `train.json` and `val.json` to `--data.json_path` to define a fixed train/val split.

+> [!TIP]
+> Use `litgpt finetune --data.help JSON` to list additional dataset-specific command line options.
+
 &nbsp;

 ### Preparing Custom Datasets Using DataModule

From 725e6ab89e4e860ae3d54d78c192c2c509a7183d Mon Sep 17 00:00:00 2001
From: rasbt
Date: Fri, 10 May 2024 11:15:04 -0500
Subject: [PATCH 2/2] add whitespace

---
 tutorials/prepare_dataset.md | 40 ++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/tutorials/prepare_dataset.md b/tutorials/prepare_dataset.md
index 2feea3db0b..a20108b702 100644
--- a/tutorials/prepare_dataset.md
+++ b/tutorials/prepare_dataset.md
@@ -30,6 +32,8 @@ For the following examples, we will focus on finetuning with the `litgpt/finetun
 However, the same steps apply to all other models and finetuning scripts.
 Please read the [tutorials/finetune_*.md](.) documents for more information about finetuning models.

+&nbsp;
+
 > [!IMPORTANT]
 > By default, the maximum sequence length is obtained from the model configuration file. In case you run into out-of-memory errors, especially in the cases of LIMA and Dolly,
 > you can try to lower the context length by setting the `--train.max_seq_length` parameter, for example, `litgpt finetune lora --train.max_seq_length 256`. For more information on truncating datasets, see the *Truncating datasets* section in the Alpaca section near the top of this article.
@@ -50,9 +52,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help Alpaca` to list additional dataset-specific command line options.

+&nbsp;
+
 #### Truncating datasets

 By default, the finetuning scripts will determine the size of the longest tokenized sample in the dataset to determine the block size. However, if you are willing to truncate a few examples in the training set, you can reduce the computational resource requirements significantly. For instance you can set a sequence length threshold via `--train.max_seq_length`. We can determine an appropriate maximum sequence length by considering the distribution of the data sample lengths shown in the histogram below.
@@ -82,9 +88,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help Alpaca2k` to list additional dataset-specific command line options.

+&nbsp;
+
 The Alpaca-2k dataset distribution is shown below.
@@ -103,9 +113,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help AlpacaGPT4` to list additional dataset-specific command line options.

+&nbsp;
+
 The Alpaca-GPT4 dataset distribution is shown below.

@@ -126,9 +140,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help Alpaca` to list additional dataset-specific command line options.

+&nbsp;
+
 The Alpaca Libre dataset distribution is shown below.

@@ -157,9 +175,14 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
+
 > [!TIP]
 > Use `litgpt finetune --data.help Deita` to list additional dataset-specific command line options.

+&nbsp;
+
 Deita contains multiturn conversations. By default, only the first instruction-response pairs from each
 of these multiturn conversations are included. If you want to override this behavior and include the follow-up instructions
 and responses, set `--data.include_multiturn_conversations True`, which will include all multiturn conversations as regular
@@ -196,9 +219,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help Dolly` to list additional dataset-specific command line options.

+&nbsp;
+
 The Dolly dataset distribution is shown below.

@@ -253,11 +280,15 @@ litgpt finetune lora \
   --train.max_seq_length 1500
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help LongForm` to list additional dataset-specific command line options.

 &nbsp;

+&nbsp;
+
 ### LIMA

 The LIMA dataset is a collection of 1,000 carefully curated prompts and responses, as described in the [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) paper. The dataset is sourced from three community Q&A websites: Stack Exchange, wikiHow, and the Pushshift Reddit Dataset. In addition, it also contains prompts and answers written and collected by the authors of the LIMA paper.
@@ -272,9 +303,12 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help LIMA` to list additional dataset-specific command line options.

+&nbsp;

 LIMA contains a handful of multiturn conversations. By default, only the first instruction-response pairs from each
 of these multiturn conversations are included. If you want to override this behavior and include the follow-up instructions
 and responses, set `--data.include_multiturn_conversations True`.
@@ -317,9 +351,13 @@ litgpt finetune lora \
   --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
 ```

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help FLAN` to list additional dataset-specific command line options.

+&nbsp;
+
 You can find a list of all 66 supported subsets [here](https://huggingface.co/datasets/Muennighoff/flan).

 &nbsp;
@@ -400,6 +438,8 @@ litgpt finetune lora \

 You can also pass a directory containing a `train.json` and `val.json` to `--data.json_path` to define a fixed train/val split.

+&nbsp;
+
 > [!TIP]
 > Use `litgpt finetune --data.help JSON` to list additional dataset-specific command line options.