text clf/reg
abhishekkrthakur committed Oct 4, 2024
1 parent 82d5a86 commit 8e6c2ac
Showing 6 changed files with 55 additions and 275 deletions.
4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -27,8 +27,8 @@
- sections:
- local: tasks/llm_finetuning
title: LLM Finetuning
- local: tasks/text_classification
title: Text Classification
- local: tasks/text_classification_regression
title: Text Classification/Regression
- local: tasks/extractive_qa
title: Extractive QA
- local: tasks/sentence_transformer
95 changes: 0 additions & 95 deletions docs/source/params/llm_finetuning_params.bck

This file was deleted.

3 changes: 0 additions & 3 deletions docs/source/params/text_classification_params.bck

This file was deleted.

28 changes: 14 additions & 14 deletions docs/source/tasks/llm_finetuning.mdx
@@ -12,12 +12,12 @@
Config file task names:
- `llm-dpo`: DPO trainer
- `llm-orpo`: ORPO trainer

# Data Preparation
## Data Preparation

LLM finetuning accepts data in CSV and JSONL formats. JSONL is the preferred format.
How data is formatted depends on the task you are training the LLM for.

## Classic Text Generation
### Classic Text Generation

For text generation, the data should be in the following format:
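
For instance (sample texts invented for illustration), JSONL data with a single `text` column would look like:

```json
{"text": "hello, how are you?"}
{"text": "the weather is nice today"}
```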

@@ -38,7 +38,7 @@
Compatible trainers:
- SFT Trainer
- Generic Trainer

## Chatbot / question-answering / code generation / function calling
### Chatbot / question-answering / code generation / function calling

For this task, you can use CSV or JSONL data. If you are formatting the data yourself (adding start, end tokens, etc.), you can use CSV or JSONL format.
If you do not want to format the data yourself and want `--chat-template` parameter to format the data for you, you must use JSONL format.
@@ -146,9 +146,9 @@
Chat models can be trained using the following trainers:
The only difference between the data format for reward trainer and DPO/ORPO trainer is that the reward trainer requires only `text` and `rejected_text` columns, while the DPO/ORPO trainer requires an additional `prompt` column.
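
As an illustration (values invented), a DPO/ORPO sample in JSONL form could look like the line below; for the reward trainer you would drop the `prompt` key:

```json
{"prompt": "What is the capital of France?", "text": "The capital of France is Paris.", "rejected_text": "France has no capital."}
```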


# Training
## Training

## Local Training
### Local Training

Locally the training can be performed by using `autotrain --config config.yaml` command. The `config.yaml` file should contain the following parameters:

@@ -222,7 +222,7 @@ $ autotrain --config config.yaml

More example config files for finetuning different types of LLMs and different tasks can be found [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/llm_finetuning).
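
For orientation, a minimal SFT config might look like the sketch below; the key names follow the example configs in that repository, but treat them as assumptions and start from a real config file there:

```yaml
task: llm-sft                          # one of the task names listed at the top
base_model: meta-llama/Llama-3.2-1B    # any causal LM on the Hub (assumed example)
project_name: my-llm-finetune
log: tensorboard
data:
  path: data/                          # folder containing train.jsonl
  train_split: train
  chat_template: null                  # or a template such as tokenizer / chatml
  column_mapping:
    text_column: text
params:
  block_size: 1024
  epochs: 1
  batch_size: 1
  lr: 2e-5
  peft: true                           # LoRA finetuning
hub:
  push_to_hub: false
```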

## Training in Hugging Face Spaces
### Training in Hugging Face Spaces

If you are training in Hugging Face Spaces, everything is the same as local training:

@@ -232,13 +232,13 @@
In the UI, you need to make sure you select the right model, the dataset and the

Once you are happy with the parameters, you can click on the `Start Training` button to start the training process.

# Parameters
## Parameters

## LLM Fine Tuning Parameters
### LLM Fine Tuning Parameters

[[autodoc]] trainers.clm.params.LLMTrainingParams

## Task specific parameters
### Task specific parameters


The length parameters used for different trainers can be different. Some require more context than others.
@@ -257,7 +257,7 @@

**NOTE**: Not following these constraints will result in errors or NaN losses.

### Generic Trainer
#### Generic Trainer

```
--add_eos_token, --add-eos-token
@@ -271,7 +271,7 @@
Default is 1024
```
### SFT Trainer
#### SFT Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -282,7 +282,7 @@
Default is 1024
```
### Reward Trainer
#### Reward Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -293,7 +293,7 @@
Default is 1024
```
### DPO Trainer
#### DPO Trainer
```
--dpo-beta DPO_BETA, --dpo-beta DPO_BETA
@@ -314,7 +314,7 @@
Completion length to use, for orpo: encoder-decoder models only
```
### ORPO Trainer
#### ORPO Trainer
```
--block_size BLOCK_SIZE, --block-size BLOCK_SIZE
@@ -1,18 +1,21 @@
# Text Classification
# Text Classification & Regression

Training a text classification model with AutoTrain is super-easy! Get your data ready in
Training a text classification/regression model with AutoTrain is super-easy! Get your data ready in
proper format and then with just a few clicks, your state-of-the-art model will be ready to
be used in production.

Config file task names:
- `text_classification``
- `text_classification`
- `text-classification`
- `text_regression`
- `text-regression`

# Data Format
## Data Format

Text classification supports datasets in both CSV and JSONL formats.
Text classification/regression supports datasets in both CSV and JSONL formats.

### CSV Format

## CSV Format
Let's train a model for classifying the sentiment of a movie review. The data should be
in the following CSV format:

@@ -29,8 +32,18 @@
As you can see, we have two columns in the CSV file. One column is the text and
is the label. The label can be any string. In this example, we have two labels: `positive`
and `negative`. You can have as many labels as you want.

And if you would like to train a model for scoring a movie review on a scale of 1-5. The data can be as follows:

```csv
text,target
"this movie is great",4.9
"this movie is bad",1.5
.
.
.
```
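
One way to generate such a file is Python's standard `csv` module; this sketch (the file name `train.csv` is chosen arbitrarily) writes the regression-style rows shown above:

```python
import csv

rows = [
    ("this movie is great", 4.9),
    ("this movie is bad", 1.5),
]

# Write the header AutoTrain expects, then one row per sample.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["text", "target"])
    writer.writerows(rows)

# Read it back to confirm the layout.
with open("train.csv", newline="", encoding="utf-8") as f:
    print(next(csv.reader(f)))  # prints ['text', 'target']
```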

## JSONL Format
### JSONL Format
Instead of CSV you can also use JSONL format. The JSONL format should be as follows:

@@ -41,21 +54,27 @@
```json
{"text": "this movie is great", "target": "positive"}
{"text": "this movie is bad", "target": "negative"}
.
.
.
```

## Columns
and for regression:

```json
{"text": "this movie is great", "target": 4.9}
{"text": "this movie is bad", "target": 1.5}
```

### Column Mapping / Names

Your CSV dataset must have two columns: `text` and `target`.
If your column names are different from `text` and `target`, you can map your dataset columns to the AutoTrain column names.
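
For example, if your file uses `review` and `sentiment` instead, a config-file mapping could look like this sketch (key names assumed from AutoTrain's example configs):

```yaml
data:
  path: data/                 # folder containing your CSV/JSONL
  train_split: train
  column_mapping:
    text_column: review       # maps your column to AutoTrain's `text`
    target_column: sentiment  # maps your column to AutoTrain's `target`
```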

# Training
## Training

## Local Training
### Local Training

To train a text classification model locally, you can use the `autotrain --config config.yaml` command.
To train a text classification/regression model locally, you can use the `autotrain --config config.yaml` command.

Here is an example of a `config.yaml` file for training a text classification model:

```yaml
task: text_classification
task: text_classification # or text_regression
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-imdb-finetuned
log: tensorboard
```

@@ -109,14 +128,20 @@
To train the model, run the following command:

```
$ autotrain --config config.yaml
```

## Training on Hugging Face Spaces
You can find example config files for text classification and regression [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_classification) and [here](https://github.com/huggingface/autotrain-advanced/tree/main/configs/text_regression), respectively.

### Training on Hugging Face Spaces

The parameters for training on Hugging Face Spaces are the same as for local training.
If you are using your own dataset, select "Local" as dataset source and upload your dataset.
In the following screenshot, we are training a text classification model using the `google-bert/bert-base-uncased` model on the IMDB dataset.

![AutoTrain Text Classification on Hugging Face Spaces](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/autotrain_text_classification.png)

# Parameters
For text regression, all you need to do is select "Text Regression" as the task and everything else remains the same (except the data, of course).

## Training Parameters

Training parameters for text classification and regression are the same.

[[autodoc]] trainers.text_classification.params.TextClassificationParams
