docs

huggingface · Oct 3, 2024 · 0920b83
1 parent ed8e0c1
commit 0920b83
Showing 1 changed file with 96 additions and 36 deletions.
diff --git a/docs/source/tasks/llm_finetuning.mdx b/docs/source/tasks/llm_finetuning.mdx
@@ -1,60 +1,120 @@
 # LLM Finetuning
 
-With AutoTrain, you can easily finetune large language models (LLMs) on your own data!
+With AutoTrain, you can easily finetune large language models (LLMs) on your own data. 
+You can use AutoTrain to finetune LLMs for a variety of tasks, such as text generation, text classification, 
+and text summarization. You can also use AutoTrain to finetune LLMs for specific use cases, such as chatbots, 
+question-answering systems, and code generation, as well as basic tasks like classic text generation.
 
-AutoTrain supports the following types of LLM finetuning:
-
-- Causal Language Modeling (CLM)
-- Masked Language Modeling (MLM) [Coming Soon]
 
 ## Data Preparation
 
-LLM finetuning accepts data in CSV format.
+LLM finetuning accepts data in CSV and JSONL formats. JSONL is the preferred format.
+How data is formatted depends on the task you are training the LLM for.
 
-### Data Format For SFT / Generic Trainer
+### Classic Text Generation
 
-For SFT / Generic Trainer, the data should be in the following format:
+For text generation, the data should be in the following format:
 
 | text                                                          |
 |---------------------------------------------------------------|
-| human: hello \n bot: hi nice to meet you                      |
-| human: how are you \n bot: I am fine                          |
-| human: What is your name? \n bot: My name is Mary             |
-| human: Which is the best programming language? \n bot: Python |
+| wikipedia is a free online encyclopedia                       |
+| it is a collaborative project                                 |
+| that anyone can edit                                          |
+| wikipedia is the largest and most popular general reference work on the internet |
+
+An example dataset for this format can be found here: [stas/openwebtext-10k](https://huggingface.co/datasets/stas/openwebtext-10k)
+
+Example tasks:
+- Text generation
+- Code completion
+
+Compatible trainers:
+- SFT Trainer
+- Generic Trainer
+
+### Chatbot / question-answering / code generation / function calling
+
+For this task, you can use CSV or JSONL data. If you format the data yourself (adding start and end tokens, etc.), either format works.
+If you do not want to format the data yourself and want the `--chat-template` parameter to format it for you, you must use JSONL format.
+JSONL is the preferred format in either case.
+
+To train a chatbot, each message in your data has a `content` and a `role`. Some models support the `system` role as well.
+
+Here is an example of a chatbot dataset (single sample):
+
+```
+[ { "content": "Help write a letter of 100 -200 words to my future self for Kyra, reflecting on her goals and aspirations.", "role": "user" }, { "content": "Dear Future Self,\n\nI hope you're happy and proud of what you've achieved. As I write this, I'm excited to think about our goals and how far you've come. One goal was to be a machine learning engineer. I hope you've worked hard and become skilled in this field. Keep learning and innovating. Traveling was important to us. I hope you've seen different places and enjoyed the beauty of our world. Remember the memories and lessons. Starting a family mattered to us. If you have kids, treasure every moment. Be patient, loving, and grateful for your family.\n\nTake care of yourself. Rest, reflect, and cherish the time you spend with loved ones. Remember your dreams and celebrate what you've achieved. Your determination brought you here. I'm excited to see the person you've become, the impact you've made, and the love and joy in your life. Embrace opportunities and keep dreaming big.\n\nWith love,\nKyra", "role": "assistant" } ]
+```
 
+As you can see, each message has `content` and `role` fields. The `role` can be `user`, `assistant`, or `system`.
+This data is, however, not formatted for training. You can use the `--chat-template` parameter to format the data during training.
 
-An example dataset for this format can be found here: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
+`--chat-template` supports the following kinds of templates:
+- `none` (default)
+- `zephyr`
+- `chatml`
+- `tokenizer`: use the chat template defined in the tokenizer config
+
+A multi-line sample is also shown below:
+
+```json
+[{"content": "hello", "role": "user"}, {"content": "hi nice to meet you", "role": "assistant"}]
+[{"content": "how are you", "role": "user"}, {"content": "I am fine", "role": "assistant"}]
+[{"content": "What is your name?", "role": "user"}, {"content": "My name is Mary", "role": "assistant"}]
+[{"content": "Which is the best programming language?", "role": "user"}, {"content": "Python", "role": "assistant"}]
+.
+.
+.
+```
+
+An example dataset for this format can be found here: [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
+
+If you don't want to format the data using `--chat-template`, you can format the data yourself and use the following format:
+
+```
+<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHelp write a letter of 100 -200 words to my future self for Kyra, reflecting on her goals and aspirations.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nDear Future Self,\n\nI hope you're happy and proud of what you've achieved. As I write this, I'm excited to think about our goals and how far you've come. One goal was to be a machine learning engineer. I hope you've worked hard and become skilled in this field. Keep learning and innovating. Traveling was important to us. I hope you've seen different places and enjoyed the beauty of our world. Remember the memories and lessons. Starting a family mattered to us. If you have kids, treasure every moment. Be patient, loving, and grateful for your family.\n\nTake care of yourself. Rest, reflect, and cherish the time you spend with loved ones. Remember your dreams and celebrate what you've achieved. Your determination brought you here. I'm excited to see the person you've become, the impact you've made, and the love and joy in your life. Embrace opportunities and keep dreaming big.\n\nWith love,\nKyra<|eot_id|>
+```
+
+A sample multi-line dataset is shown below:
+
+```
+| text                                                          |
+|---------------------------------------------------------------|
+| <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nhi nice to meet you<|eot_id|> |
+| <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhow are you<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am fine<|eot_id|> |
+| <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy name is Mary<|eot_id|> |
+| <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 03 Oct 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhich is the best programming language?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nPython<|eot_id|> |
+.
+.
+.
+```
 
-For SFT/Generic training, your dataset must have a `text` column
+An example dataset for this format can be found here: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
 
-### Data Format For Reward Trainer
+In the examples above, we have seen only two turns: one from the user and one from the assistant. However, you can have multiple turns from the user and assistant in a single sample.
 
-For Reward Trainer, the data should be in the following format:
+Chat models can be trained using the following trainers:
 
-| text                                                          | rejected_text                                                     |
-|---------------------------------------------------------------|-------------------------------------------------------------------|
-| human: hello \n bot: hi nice to meet you                      | human: hello \n bot: leave me alone                               |
-| human: how are you \n bot: I am fine                          | human: how are you \n bot: I am not fine                          |
-| human: What is your name? \n bot: My name is Mary             | human: What is your name? \n bot: Whats it to you?                |
-| human: Which is the best programming language? \n bot: Python | human: Which is the best programming language? \n bot: Javascript |
+- SFT Trainer:
+    - requires only `text` column
+    - example dataset: [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
 
-For Reward Trainer, your dataset must have a `text` column (aka chosen text) and a `rejected_text` column.
+- Generic Trainer:
+    - requires only `text` column
+    - example dataset: [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
 
-### Data Format For DPO/ORPO Trainer
+- Reward Trainer:
+    - requires `text` and `rejected_text` columns
+    - example dataset: [trl-lib/ultrafeedback_binarized](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized)
 
-For DPO/ORPO Trainer, the data should be in the following format:
+- DPO Trainer:
+    - requires `prompt`, `text`, and `rejected_text` columns
+    - example dataset: [trl-lib/ultrafeedback_binarized](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized)
 
-| prompt                                  | text                | rejected_text      |
-|-----------------------------------------|---------------------|--------------------|
-| hello                                   | hi nice to meet you | leave me alone     |
-| how are you                             | I am fine           | I am not fine      |
-| What is your name?                      | My name is Mary     | Whats it to you?   |
-| What is your name?                      | My name is Mary     | I dont have a name |
-| Which is the best programming language? | Python              | Javascript         |
-| Which is the best programming language? | Python              | C++                |
-| Which is the best programming language? | Java                | C++                |
+- ORPO Trainer:
+    - requires `prompt`, `text`, and `rejected_text` columns
+    - example dataset: [trl-lib/ultrafeedback_binarized](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized)
 
-For DPO/ORPO Trainer, your dataset must have a `prompt` column, a `text` column (aka chosen text) and a `rejected_text` column.
+The only difference between the data format for reward trainer and DPO/ORPO trainer is that the reward trainer requires only `text` and `rejected_text` columns, while the DPO/ORPO trainer requires an additional `prompt` column.
 
 
-For all tasks, you can use both CSV and JSONL files!
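
The diff above replaces the `human: ... \n bot: ...` samples in a single `text` column with lists of `content`/`role` messages. For data still in the old layout, a migration could be sketched roughly as below. This is only an illustration and not part of the commit: `ROLE_MAP`, `legacy_to_messages`, and `to_jsonl` are hypothetical names, and the `human` to `user` and `bot` to `assistant` mapping is an assumption. The output writes one bare JSON array per line, mirroring the multi-line sample in the diff; the diff does not say whether AutoTrain expects the list wrapped in a named column, so check that before training.

```python
import json

# Assumed mapping from the legacy speaker labels to chat roles (not from the commit).
ROLE_MAP = {"human": "user", "bot": "assistant"}


def legacy_to_messages(text):
    """Turn 'human: ... \\n bot: ...' into a list of content/role dicts."""
    messages = []
    # Accept both a real newline and a literal backslash-n as the turn separator.
    for line in text.replace("\\n", "\n").split("\n"):
        speaker, sep, content = line.strip().partition(":")
        role = ROLE_MAP.get(speaker.strip().lower()) if sep else None
        if role is not None:
            messages.append({"content": content.strip(), "role": role})
        elif messages and line.strip():
            # No recognised speaker label: treat the line as a
            # continuation of the previous turn.
            messages[-1]["content"] += "\n" + line.strip()
    return messages


def to_jsonl(rows):
    """Serialize each converted conversation as one JSON array per line."""
    return "\n".join(json.dumps(legacy_to_messages(row)) for row in rows)


# Two rows taken from the removed table in the diff.
legacy_rows = [
    "human: hello \\n bot: hi nice to meet you",
    "human: how are you \\n bot: I am fine",
]
print(to_jsonl(legacy_rows))
```

Lines without a recognised speaker label are appended to the previous turn, so multi-paragraph replies survive the conversion. Rows that contain no `human:` or `bot:` label at all produce an empty list and should be reviewed by hand.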