
Commit

update (#630)
abhishekkrthakur authored May 8, 2024
1 parent a1b6fd8 commit 831afc2
Showing 40 changed files with 1,048 additions and 830 deletions.
40 changes: 32 additions & 8 deletions docs/source/_toctree.yml
@@ -1,19 +1,27 @@
- sections:
- local: index
title: 🤗 AutoTrain
- local: getting_started
title: Installation
- local: cost
title: How much does it cost?
- local: support
title: Get help and support
- local: faq
title: Frequently Asked Questions
title: Getting Started
- sections:
- local: starting_ui
title: Starting the UI
- local: starting_cli
title: Starting the CLI
title: Starting AutoTrain
- local: quickstart_spaces
title: Quickstart
title: AutoTrain on Hugging Face Spaces
- sections:
- local: quickstart
title: Quickstart
- local: config
title: Configurations
title: Use AutoTrain Locally
- sections:
- local: col_map
title: Understanding Column Mapping
title: Miscellaneous
- sections:
- local: text_classification
title: Text Classification
Expand All @@ -31,4 +39,20 @@
title: Token Classification
- local: tabular
title: Tabular
title: Tasks
title: Data Formats
- sections:
- local: text_classification_params
title: Text Classification & Regression
- local: llm_finetuning_params
title: LLM Finetuning
- local: image_classification_params
title: Image Classification
- local: dreambooth_params
title: DreamBooth
- local: seq2seq_params
title: Seq2Seq
- local: token_classification_params
title: Token Classification
- local: tabular_params
title: Tabular
title: Parameters
162 changes: 162 additions & 0 deletions docs/source/col_map.mdx
@@ -0,0 +1,162 @@
# Understanding Column Mapping

Column mapping is a critical setup step in AutoTrain that tells the system
what role each column in your dataset plays. Whether it's a tabular
dataset, text classification data, or another type, precise column mapping
ensures that AutoTrain processes each dataset element correctly.

## How Column Mapping Works

AutoTrain has no way of knowing what the columns in your dataset represent.
AutoTrain requires a clear understanding of each column's function within
your dataset to train models effectively. This is managed through a
straightforward mapping system in the user interface, represented as a dictionary.
Here's a typical example:

```
{"text": "text", "label": "target"}
```

In this example, the `text` column in your dataset corresponds to the text data
AutoTrain uses for processing, and the `target` column is treated as the
label for training.

Don't worry, though: column mapping is exactly how you give AutoTrain that understanding.
If your data is already in the AutoTrain format, you don't need to change the column mapping.
If not, you can easily map the columns in your dataset to the correct AutoTrain format.

In the UI, you will see column mapping as a dictionary:

```
{"text": "text", "label": "target"}
```

Here, the column `text` in your dataset is mapped to the AutoTrain column `text`,
and the column `target` in your dataset is mapped to the AutoTrain column `label`.

Let's say you are training a text classification model and your dataset has the following columns:

```
full_text, target_sentiment
"this movie is great", positive
"this movie is bad", negative
```

You can map these columns to the AutoTrain format as follows:

```
{"text": "full_text", "label": "target_sentiment"}
```

If your dataset has the columns: `text` and `label`, you don't need to change the column mapping.

Let's take a look at column mappings for each task:

## LLM

Note: For all LLM tasks, if the text column(s) contain samples in chat format (dict or JSON)
rather than plain text, you should use the `chat_template` parameter. Read more about it in
the LLM Parameters section.
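For reference, a chat-formatted sample (the kind that needs `chat_template`) is typically a list of role/content messages. The field names below follow the common convention and are illustrative:

```
[
  {"role": "user", "content": "What is AutoTrain?"},
  {"role": "assistant", "content": "AutoTrain is a tool for training models without writing code."}
]
```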


### SFT / Generic Trainer

```
{"text": "text"}
```

`text`: The column in your dataset that contains the text data.


### Reward / ORPO Trainer

```
{"text": "text", "rejected_text": "rejected_text"}
```

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.

### DPO Trainer

```
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
```

`prompt`: The column in your dataset that contains the prompt data.

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.


## Text Classification & Regression, Seq2Seq

For text classification and regression, the column mapping should be as follows:

```
{"text": "dataset_text_column", "label": "dataset_target_column"}
```

`text`: The column in your dataset that contains the text data.

`label`: The column in your dataset that contains the target variable.


## Token Classification


```
{"text": "tokens", "label": "tags"}
```

`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings.

`label`: The column in your dataset that contains the tags. These tags must be a list of strings.

For token classification, if you are using a CSV, make sure that the columns are stringified lists.
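As a sketch of what "stringified lists" means in practice, the snippet below writes a tiny token-classification CSV where each cell is a list serialized as a string (here via `json.dumps`; the exact serialization AutoTrain expects is an assumption, and the sample tokens/tags are made up):

```python
import csv
import io
import json

# Hypothetical example rows: tokens and their matching tags.
rows = [
    (["Hugging", "Face", "is", "great"], ["B-ORG", "I-ORG", "O", "O"]),
]

# Write a CSV whose "tokens" and "tags" cells are stringified lists.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["tokens", "tags"])
for tokens, tags in rows:
    writer.writerow([json.dumps(tokens), json.dumps(tags)])

# Reading the CSV back recovers real lists from the stringified cells.
buf.seek(0)
record = next(csv.DictReader(buf))
recovered_tokens = json.loads(record["tokens"])
recovered_tags = json.loads(record["tags"])
```

Each list in a row must have the same length, since every token needs exactly one tag.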

## Tabular Classification & Regression

```
{"id": "id", "label": ["target"]}
```

`id`: The column in your dataset that contains the unique identifier for each row.

`label`: The column in your dataset that contains the target variable. This should be a list of strings.

For a single target column, you can pass a list with a single element.

For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements.
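For example, a hypothetical multi-label setup with three target columns (`label1`, `label2`, and `label3` are placeholder names) would look like:

```
{"id": "id", "label": ["label1", "label2", "label3"]}
```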


## DreamBooth LoRA

Dreambooth doesn't require column mapping.

## Image Classification

For image classification, the column mapping should be as follows:

```
{"image": "image_column", "label": "label_column"}
```

Image classification requires column mapping only when you are using a dataset from Hugging Face Hub.
For uploaded datasets, leave column mapping as it is.

## Ensuring Accurate Mapping

To ensure your model trains correctly:

- Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset.

- Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings).

- Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand.

By following these guidelines and using the provided examples as templates,
you can effectively instruct AutoTrain on how to interpret and handle your
data for various machine learning tasks. This process is fundamental for
achieving optimal results from your model training endeavors.
65 changes: 65 additions & 0 deletions docs/source/config.mdx
@@ -0,0 +1,65 @@
# AutoTrain Configs

AutoTrain config files are the way to train models with AutoTrain locally.

Once you have installed AutoTrain Advanced, you can use the following command to train models using AutoTrain config files:

```bash
$ export HF_USERNAME=your_hugging_face_username
$ export HF_TOKEN=your_hugging_face_write_token

$ autotrain --config path/to/config.yaml
```

Example configurations for all tasks can be found in the `configs` directory of
the [AutoTrain Advanced GitHub repository](https://github.com/huggingface/autotrain-advanced).

Here is an example of an AutoTrain config file:

```yaml
task: llm
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: autotrain-llama3-8b-orpo
log: tensorboard
backend: local

data:
path: argilla/distilabel-capybara-dpo-7k-binarized
train_split: train
valid_split: null
chat_template: chatml
column_mapping:
text_column: chosen
rejected_text_column: rejected

params:
trainer: orpo
block_size: 1024
model_max_length: 2048
max_prompt_length: 512
epochs: 3
batch_size: 2
lr: 3e-5
peft: true
quantization: int4
target_modules: all-linear
padding: right
optimizer: adamw_torch
scheduler: linear
gradient_accumulation: 4
mixed_precision: bf16

hub:
username: ${HF_USERNAME}
token: ${HF_TOKEN}
push_to_hub: true
```

In this config, we are finetuning the `meta-llama/Meta-Llama-3-8B-Instruct` model
on the `argilla/distilabel-capybara-dpo-7k-binarized` dataset using the `orpo`
trainer for 3 epochs with a batch size of 2 and a learning rate of `3e-5`.
More information on the available parameters can be found in the *Data Formats and Parameters* section.

In case you don't want to push the model to the Hub, you can set `push_to_hub` to `false`
in the config file. If you are not pushing the model to the Hub, the username and token are
not required. Note: they may still be needed if you are trying to access gated models or datasets.
41 changes: 35 additions & 6 deletions docs/source/cost.mdx
@@ -1,11 +1,40 @@
# How much does it cost?

AutoTrain provides you with best models which are deployable with just a few clicks.
Unlike other services, we don't own your models. Once the training is done, you can download them and use them anywhere you want.
AutoTrain offers an accessible approach to model training, providing deployable models
with just a few clicks. Understanding the cost involved is essential to planning and
executing your projects efficiently.

You will be charged per minute based on the hardware you choose.

Pricing information is available in the [pricing](https://huggingface.co/pricing#spaces) section.
## Local Usage

Please note that in order to use AutoTrain, you need to have a valid payment method on file.
You can add your payment method in the [billing](https://huggingface.co/settings/billing) section.
When you choose to use AutoTrain locally on your own hardware, there is no cost.
This option is ideal for those who prefer to manage their own infrastructure and
do not require the scalability that cloud resources offer.

## Using AutoTrain on Hugging Face Spaces

**Pay-As-You-Go**: Costs for using AutoTrain in Hugging Face Spaces are based on the
computing resources you consume. This flexible pricing structure ensures you only pay
for what you use, making it cost-effective and scalable for projects of any size.


**Ownership and Portability**: Unlike some other platforms, AutoTrain does not retain
ownership of your models. Once training is complete, you are free to download and
deploy your models wherever you choose, giving you flexibility and control over all of your assets.

### Pricing Details

**Resource-Based Billing**: Charges are accrued per minute according to the type of hardware
utilized during training. This means you can scale your resource usage based on the
complexity and needs of your projects.

For a detailed breakdown of the costs associated with using Hugging Face Spaces,
please refer to the [pricing](https://huggingface.co/pricing#spaces) section on our website.

To access the paid features of AutoTrain, you must have a valid payment method on file.
You can manage your payment options and view your billing information in
the [billing section of your Hugging Face account settings](https://huggingface.co/settings/billing).

By offering both free and flexible paid options, AutoTrain ensures that users can choose
the most suitable model training solution for their needs, whether they are experimenting
on a local machine or scaling up operations on Hugging Face Spaces.
