# Understanding Column Mapping

Column mapping is a critical setup step in AutoTrain that informs the system
about the roles of different columns in your dataset. Whether it's a tabular
dataset, text classification data, or another type, precise column mapping
ensures that AutoTrain processes each dataset element correctly.

## How Column Mapping Works

AutoTrain has no way of knowing on its own what the columns in your dataset
represent; it requires a clear understanding of each column's function in order
to train models effectively. This is managed through a straightforward mapping
system in the user interface, represented as a dictionary:

```
{"text": "text", "label": "target"}
```

Here, the column `text` in your dataset is mapped to the AutoTrain column
`text`, and the column `target` in your dataset is mapped to the AutoTrain
column `label`. If your data is already in AutoTrain format, you don't need to
change the column mapping; if not, you can easily map the columns in your
dataset to the correct AutoTrain format.

Let's say you are training a text classification model and your dataset has the following columns:

```
full_text, target_sentiment
"this movie is great", positive
"this movie is bad", negative
```

You can map these columns to the AutoTrain format as follows:

```
{"text": "full_text", "label": "target_sentiment"}
```

If your dataset already has the columns `text` and `label`, you don't need to change the column mapping.

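Conceptually, the mapping simply connects your dataset's column names to the names AutoTrain expects. A minimal sketch of the idea (not AutoTrain's actual internals) using pandas:

```python
import pandas as pd

# Toy dataset with the columns from the example above.
df = pd.DataFrame({
    "full_text": ["this movie is great", "this movie is bad"],
    "target_sentiment": ["positive", "negative"],
})

# Column mapping as shown in the UI: {autotrain_column: dataset_column}.
mapping = {"text": "full_text", "label": "target_sentiment"}

# Invert the mapping and rename, so the columns match what AutoTrain expects.
df = df.rename(columns={v: k for k, v in mapping.items()})
print(list(df.columns))  # ['text', 'label']
```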
Let's take a look at the column mappings for each task:

## LLM

Note: For all LLM tasks, if the text column(s) are not already plain formatted text, i.e. they
contain samples in chat format (dict or JSON), you should use the `chat_template` parameter.
Read more about it in the LLM Parameters section.

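As an illustration only (a hypothetical sample, not an AutoTrain-specific schema), a chat-format cell typically holds a list of role/content messages rather than plain text:

```
[{"role": "user", "content": "What is AutoTrain?"},
 {"role": "assistant", "content": "A tool for training models without writing code."}]
```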
### SFT / Generic Trainer

```
{"text": "text"}
```

`text`: The column in your dataset that contains the text data.

### Reward / ORPO Trainer

```
{"text": "text", "rejected_text": "rejected_text"}
```

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.

### DPO Trainer

```
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
```

`prompt`: The column in your dataset that contains the prompt data.

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.

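For instance, if your preference dataset has the columns `prompt`, `chosen`, and `rejected` (hypothetical names, chosen here to match a common preference-data layout), the mapping would be:

```
{"prompt": "prompt", "text": "chosen", "rejected_text": "rejected"}
```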
## Text Classification & Regression, Seq2Seq

For text classification, regression, and seq2seq tasks, the column mapping should be as follows:

```
{"text": "dataset_text_column", "label": "dataset_target_column"}
```

`text`: The column in your dataset that contains the text data.

`label`: The column in your dataset that contains the target variable.

## Token Classification

```
{"text": "tokens", "label": "tags"}
```

`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings.

`label`: The column in your dataset that contains the tags. These tags must be a list of strings.

For token classification, if you are using a CSV, make sure that the columns are stringified lists.

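For example, such a CSV can be written as follows (an illustrative sketch; the file name is arbitrary and the column names are the ones from the mapping above):

```python
import csv

# Each row holds a list of tokens and a parallel list of tags,
# stored as stringified lists so they survive the round trip through CSV.
rows = [
    (["I", "love", "Paris"], ["O", "O", "B-LOC"]),
    (["Hello", "world"], ["O", "O"]),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tokens", "tags"])
    for tokens, tags in rows:
        # str() produces e.g. "['I', 'love', 'Paris']" -- a stringified list.
        writer.writerow([str(tokens), str(tags)])
```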
## Tabular Classification & Regression

```
{"id": "id", "label": ["target"]}
```

`id`: The column in your dataset that contains the unique identifier for each row.

`label`: The column(s) in your dataset that contain the target variable. This should be a list of strings.

For a single target column, you can pass a list with a single element.

For multiple target columns, e.g. a multi-label classification task, you can pass a list with multiple elements.

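For example, a multi-label task with two hypothetical target columns could be mapped as:

```
{"id": "id", "label": ["target_1", "target_2"]}
```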
## DreamBooth LoRA

DreamBooth doesn't require column mapping.

## Image Classification

For image classification, the column mapping should be as follows:

```
{"image": "image_column", "label": "label_column"}
```

Image classification requires column mapping only when you are using a dataset from the Hugging Face Hub.
For uploaded datasets, leave the column mapping as it is.

## Ensuring Accurate Mapping

To ensure your model trains correctly:

- Verify column names: double-check that the names used in the mapping dictionary accurately reflect those in your dataset.

- Format appropriately: especially in token classification, ensure your data format matches expectations (e.g., lists of strings).

- Update mappings for new datasets: each new dataset may require its own mapping based on its structure and the task at hand.

By following these guidelines and using the provided examples as templates,
you can effectively instruct AutoTrain on how to interpret and handle your
data for various machine learning tasks. This process is fundamental to
achieving optimal results from your model training.
# AutoTrain Configs

AutoTrain configs are the way to train models using AutoTrain locally.

Once you have installed AutoTrain Advanced, you can use the following command to train models with an AutoTrain config file:

```bash
$ export HF_USERNAME=your_hugging_face_username
$ export HF_TOKEN=your_hugging_face_write_token

$ autotrain --config path/to/config.yaml
```

Example configurations for all tasks can be found in the `configs` directory of
the [AutoTrain Advanced GitHub repository](https://github.com/huggingface/autotrain-advanced).

Here is an example of an AutoTrain config file:

```yaml
task: llm
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: autotrain-llama3-8b-orpo
log: tensorboard
backend: local

data:
  path: argilla/distilabel-capybara-dpo-7k-binarized
  train_split: train
  valid_split: null
  chat_template: chatml
  column_mapping:
    text_column: chosen
    rejected_text_column: rejected

params:
  trainer: orpo
  block_size: 1024
  model_max_length: 2048
  max_prompt_length: 512
  epochs: 3
  batch_size: 2
  lr: 3e-5
  peft: true
  quantization: int4
  target_modules: all-linear
  padding: right
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 4
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```

In this config, we are fine-tuning the `meta-llama/Meta-Llama-3-8B-Instruct` model
on the `argilla/distilabel-capybara-dpo-7k-binarized` dataset using the `orpo`
trainer for 3 epochs, with a batch size of 2 and a learning rate of `3e-5`.
More information on the available parameters can be found in the *Data Formats and Parameters* section.

In case you don't want to push the model to the Hub, you can set `push_to_hub` to `false` in the config file.
If you are not pushing the model to the Hub, the username and token are not required. Note: they may still be
needed if you are trying to access gated models or datasets.
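For a purely local run, the `hub` section of the config above could be reduced to the following (a sketch, assuming no gated models or datasets are involved, so no credentials are needed):

```yaml
hub:
  push_to_hub: false
```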
# How much does it cost?

AutoTrain offers an accessible approach to model training, providing deployable models
with just a few clicks. Understanding the costs involved is essential for planning and
executing your projects efficiently.

## Local Usage

When you choose to use AutoTrain locally on your own hardware, there is no cost.
This option is ideal for those who prefer to manage their own infrastructure and
do not require the scalability that cloud resources offer.

## Using AutoTrain on Hugging Face Spaces

**Pay-As-You-Go**: Costs for using AutoTrain on Hugging Face Spaces are based on the
computing resources you consume. This flexible pricing structure ensures you only pay
for what you use, making it cost-effective and scalable for projects of any size.

**Ownership and Portability**: Unlike some other platforms, AutoTrain does not retain
ownership of your models. Once training is complete, you are free to download and
deploy your models wherever you choose, giving you flexibility and control over all your assets.

### Pricing Details

**Resource-Based Billing**: Charges accrue per minute according to the type of hardware
used during training. This means you can scale your resource usage based on the
complexity and needs of your projects.

For a detailed breakdown of the costs associated with using Hugging Face Spaces,
please refer to the [pricing](https://huggingface.co/pricing#spaces) section on our website.

To access the paid features of AutoTrain, you must have a valid payment method on file.
You can manage your payment options and view your billing information in the
[billing section of your Hugging Face account settings](https://huggingface.co/settings/billing).

By offering both free and flexible paid options, AutoTrain ensures that users can choose
the most suitable model training solution for their needs, whether they are experimenting
on a local machine or scaling up operations on Hugging Face Spaces.