text clf/reg
abhishekkrthakur committed Oct 4, 2024
1 parent 82d5a86 commit 8e6c2ac
Showing 6 changed files with 55 additions and 275 deletions.
4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -27,8 +27,8 @@
- sections:
- local: tasks/llm_finetuning
title: LLM Finetuning
- local: tasks/text_classification
title: Text Classification
- local: tasks/text_classification_regression
title: Text Classification/Regression
- local: tasks/extractive_qa
title: Extractive QA
- local: tasks/sentence_transformer
95 changes: 0 additions & 95 deletions docs/source/params/llm_finetuning_params.bck

This file was deleted.

3 changes: 0 additions & 3 deletions docs/source/params/text_classification_params.bck

This file was deleted.

28 changes: 14 additions & 14 deletions docs/source/tasks/llm_finetuning.mdx
@@ -12,12 +12,12 @@
Config file task names:
- `llm-dpo`: DPO trainer
- `llm-orpo`: ORPO trainer

# Data Preparation
## Data Preparation

LLM finetuning accepts data in CSV and JSONL formats. JSONL is the preferred format.
How data is formatted depends on the task you are training the LLM for.

## Classic Text Generation
### Classic Text Generation

For text generation, the data should be in the following format:
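
For instance (sample texts invented for illustration), JSONL data with a single `text` column would look like:

```json
{"text": "hello, how are you?"}
{"text": "the weather is nice today"}
```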

@@ -38,7 +38,7 @@
Compatible trainers:
- SFT Trainer
- Generic Trainer

## Chatbot / question-answering / code generation / function calling
### Chatbot / question-answering / code generation / function calling

For this task, you can use CSV or JSONL data. If you are formatting the data yourself (adding start, end tokens, etc.), you can use CSV or JSONL format.
If you do not want to format the data yourself and want `--chat-template` parameter to format the data for you, you must use JSONL format.
@@ -146,9 +146,9 @@
Chat models can be trained using the following trainers:
The only difference between the data format for reward trainer and DPO/ORPO trainer is that the reward trainer requires only `text` and `rejected_text` columns, while the DPO/ORPO trainer requires an additional `prompt` column.
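
As an illustration (values invented), a DPO/ORPO sample in JSONL form could look like the line below; for the reward trainer you would drop the `prompt` key:

```json
{"prompt": "What is the capital of France?", "text": "The capital of France is Paris.", "rejected_text": "France has no capital."}
```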


# Training
## Training

## Local Training
### Local Training

Locally the training can be performed by using `autotrain --config config.yaml` command. The `config.yaml` file should contain the following parameters:

@@ -222,7 +222,7 @@ $ autotrain --config config.yaml

More example config files for finetuning different types of LLMs and different tasks can be found [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/llm_finetuning).
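
For orientation, a minimal SFT config might look like the sketch below; the key names follow the example configs in that repository, but treat them as assumptions and start from a real config file there:

```yaml
task: llm-sft                          # one of the task names listed at the top
base_model: meta-llama/Llama-3.2-1B    # any causal LM on the Hub (assumed example)
project_name: my-llm-finetune
log: tensorboard
data:
  path: data/                          # folder containing train.jsonl
  train_split: train
  chat_template: null                  # or a template such as tokenizer / chatml
  column_mapping:
    text_column: text
params:
  block_size: 1024
  epochs: 1
  batch_size: 1
  lr: 2e-5
  peft: true                           # LoRA finetuning
hub:
  push_to_hub: false
```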

## Training in Hugging Face Spaces
### Training in Hugging Face Spaces

If you are training in Hugging Face Spaces, everything is the same as local training:

@@ -232,13 +232,13 @@
In the UI, you need to make sure you select the right model, the dataset and the

Once you are happy with the parameters, you can click on the `Start Training` button to start the training process.

# Parameters
## Parameters

## LLM Fine Tuning Parameters
### LLM Fine Tuning Parameters

[[autodoc]] trainers.clm.params.LLMTrainingParams

## Task specific parameters
### Task specific parameters


The length parameters used for different trainers can be different. Some require more context than others.
@@ -257,7 +257,7 @@

**NOTE**: Not following these constraints will result in errors or NaN losses.

### Generic Trainer
#### Generic Trainer

```
--add_eos_token, --add-eos-token
@@ -271,7 +271,7 @@
Default is 1024
```
### SFT Trainer
#### SFT Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -282,7 +282,7 @@
Default is 1024
```
### Reward Trainer
#### Reward Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -293,7 +293,7 @@
Default is 1024
```
### DPO Trainer
#### DPO Trainer
```
--dpo-beta DPO_BETA, --dpo-beta DPO_BETA
@@ -314,7 +314,7 @@
Completion length to use, for orpo: encoder-decoder models only
```
### ORPO Trainer
#### ORPO Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -1,18 +1,21 @@
# Text Classification
# Text Classification & Regression

Training a text classification model with AutoTrain is super-easy! Get your data ready in
Training a text classification/regression model with AutoTrain is super-easy! Get your data ready in
proper format and then with just a few clicks, your state-of-the-art model will be ready to
be used in production.

Config file task names:
- `text_classification``
- `text_classification`
- `text-classification`
- `text_regression`
- `text-regression`

# Data Format
## Data Format

Text classification supports datasets in both CSV and JSONL formats.
Text classification/regression supports datasets in both CSV and JSONL formats.

### CSV Format

## CSV Format
Let's train a model for classifying the sentiment of a movie review. The data should be
in the following CSV format:

@@ -29,8 +32,18 @@
As you can see, we have two columns in the CSV file. One column is the text and
is the label. The label can be any string. In this example, we have two labels: `positive`
and `negative`. You can have as many labels as you want.

And if you would like to train a model for scoring a movie review on a scale of 1-5. The data can be as follows:

```csv
text,target
"this movie is great",4.9
"this movie is bad",1.5
.
.
.
```
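
One way to generate such a file is Python's standard `csv` module; this sketch (the file name `train.csv` is chosen arbitrarily) writes the regression-style rows shown above:

```python
import csv

rows = [
    ("this movie is great", 4.9),
    ("this movie is bad", 1.5),
]

# Write the header AutoTrain expects, then one row per sample.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["text", "target"])
    writer.writerows(rows)

# Read it back to confirm the layout.
with open("train.csv", newline="", encoding="utf-8") as f:
    print(next(csv.reader(f)))  # prints ['text', 'target']
```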

## JSONL Format
### JSONL Format
Instead of CSV you can also use JSONL format. The JSONL format should be as follows:

@@ -41,21 +54,27 @@
```json
{"text": "this movie is great", "target": "positive"}
{"text": "this movie is bad", "target": "negative"}
.
.
.
```

## Columns
and for regression:

```json
{"text": "this movie is great", "target": 4.9}
{"text": "this movie is bad", "target": 1.5}
```

### Column Mapping / Names

Your CSV dataset must have two columns: `text` and `target`.
If your column names are different from `text` and `target`, you can map your dataset columns to the AutoTrain column names.
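
For example, if your file uses `review` and `sentiment` instead, a config-file mapping could look like this sketch (key names assumed from AutoTrain's example configs):

```yaml
data:
  path: data/                 # folder containing your CSV/JSONL
  train_split: train
  column_mapping:
    text_column: review       # maps your column to AutoTrain's `text`
    target_column: sentiment  # maps your column to AutoTrain's `target`
```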

# Training
## Training

## Local Training
### Local Training

To train a text classification model locally, you can use the `autotrain --config config.yaml` command.
To train a text classification/regression model locally, you can use the `autotrain --config config.yaml` command.

Here is an example of a `config.yaml` file for training a text classification model:

```yaml
task: text_classification
task: text_classification # or text_regression
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-imdb-finetuned
log: tensorboard
```

@@ -109,14 +128,20 @@
To train the model, run the following command:

```
$ autotrain --config config.yaml
```

## Training on Hugging Face Spaces
You can find example config files for text classification and regression [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_classification) and [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_regression), respectively.

### Training on Hugging Face Spaces

The parameters for training on Hugging Face Spaces are the same as for local training.
If you are using your own dataset, select "Local" as dataset source and upload your dataset.
In the following screenshot, we are training a text classification model using the `google-bert/bert-base-uncased` model on the IMDB dataset.

![AutoTrain Text Classification on Hugging Face Spaces](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/autotrain_text_classification.png)

# Parameters
For text regression, all you need to do is select "Text Regression" as the task and everything else remains the same (except the data, of course).

## Training Parameters

Training parameters for text classification and regression are the same.

[[autodoc]] trainers.text_classification.params.TextClassificationParams
