update docs #356

Merged
merged 1 commit into from
Nov 22, 2023
14 changes: 5 additions & 9 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,21 +8,17 @@
- local: support
title: Get help and support
title: Get started
- sections:
- local: model_choice
title: Model Selection
- local: param_choice
title: Parameter Selection
title: Selecting Models and Parameters
- sections:
- local: text_classification
title: Text Classification
- local: llm_finetuning
title: LLM Finetuning
title: Text Tasks
- sections:
- local: image_classification
title: Image Classification
- local: dreambooth
title: DreamBooth
title: Image Tasks
- local: seq2seq
title: Seq2Seq
- local: tabular
title: Tabular
title: Data Formats
6 changes: 1 addition & 5 deletions docs/source/dreambooth.mdx
Expand Up @@ -2,15 +2,11 @@

DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views.

![DreamBooth Teaser](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth1.jpeg)

## Data Preparation

The data format for DreamBooth training is simple. All you need is images of a concept (e.g. a person) and a concept token.

![DreamBooth Training](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth2.png)

To train a DreamBooth model, select an appropriate model from the hub.
When choosing a model from the hub, make sure you select an image size that is compatible with the model.

The next steps are to add the jobs and click on "Start Training" to start training the model(s).
Your concept token is `prompt` in the parameters section.
6 changes: 5 additions & 1 deletion docs/source/getting_started.mdx
Expand Up @@ -18,4 +18,8 @@ We are constantly adding new features and tasks to AutoTrain Advanced. Its alway

Please note that "restarting" a space will not update it to the latest version. You need to "Factory reboot" the space to update it to the latest version.

And now we are all set and we can start with our first project!
And now we are all set and we can start with our first project!

# Understanding the UI

![autotrain-space-template](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/ui.png)
6 changes: 0 additions & 6 deletions docs/source/image_classification.mdx
Expand Up @@ -31,9 +31,3 @@ Some points to keep in mind:
- There should not be any other folders inside the zip folder.

When train.zip is decompressed, it creates two folders: cats and dogs. These are the two categories for classification. The images for both categories are in their respective folders. You can have as many categories as you want.

## Training

Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button.

![Image Classification](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/image_classification_1.png)
50 changes: 30 additions & 20 deletions docs/source/llm_finetuning.mdx
Expand Up @@ -7,37 +7,47 @@ AutoTrain supports the following types of LLM finetuning:
- Causal Language Modeling (CLM)
- Masked Language Modeling (MLM) [Coming Soon]

For LLM finetuning, only the Hugging Face Hub model choice is available.
Users need to select the model they want to finetune from the Hugging Face Hub and either choose the parameters themselves (Manual Parameter Selection)
or use AutoTrain's Auto Parameter Selection to automatically select the best parameters for the task.

## Data Preparation

LLM finetuning accepts data in CSV format.
There are two modes for LLM finetuning: `generic` and `chat`.
An example dataset with both formats in the same dataset can be found here: https://huggingface.co/datasets/tatsu-lab/alpaca

### Generic
### Data Format For SFT / Generic Trainer

For SFT / Generic Trainer, the data should be in the following format:

| text |
|------------------------------|
| This is the first sentence. |
| This is the second sentence. |

In generic mode, only one column is required: `text`.
The user can take care of how the data is formatted for the task.
A sample instance for this format is presented below:
An example dataset for this format can be found here: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
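
As a rough illustration of the single `text` column requirement, the sketch below flattens instruction-style records into one string per row and writes them out as CSV. The prompt template, the `build_text` and `to_sft_csv` helpers, and the record keys are assumptions made for illustration; AutoTrain does not prescribe them.

```python
import csv
import io


def build_text(instruction: str, response: str, context: str = "") -> str:
    """Join the parts of one example into a single string (hypothetical template)."""
    parts = [f"### Instruction: {instruction}"]
    if context:
        parts.append(f"### Input: {context}")
    parts.append(f"### Response: {response}")
    return "\n\n".join(parts)


def to_sft_csv(records: list[dict]) -> str:
    """Return CSV content with exactly one column named `text`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text"])
    writer.writeheader()
    for record in records:
        writer.writerow({"text": build_text(**record)})
    return buffer.getvalue()


records = [
    {
        "instruction": "Evaluate this sentence for spelling and grammar mistakes",
        "context": "He finnished his meal and left the resturant",
        "response": "He finished his meal and left the restaurant.",
    },
]
sft_csv = to_sft_csv(records)
```

The `csv` module quotes fields that contain line breaks, so a multi-line prompt still ends up in a single `text` cell.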

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
For SFT/Generic training, your dataset must have a `text` column.

### Instruction: Evaluate this sentence for spelling and grammar mistakes
### Data Format For Reward Trainer

### Input: He finnished his meal and left the resturant
For Reward Trainer, the data should be in the following format:

### Response: He finished his meal and left the restaurant.
```
| text | rejected_text |
|---------------------------------------------------------------|-------------------------------------------------------------------|
| human: hello \n bot: hi nice to meet you | human: hello \n bot: leave me alone |
| human: how are you \n bot: I am fine | human: how are you \n bot: I am not fine |
| human: What is your name? \n bot: My name is Mary | human: What is your name? \n bot: What's it to you? |
| human: Which is the best programming language? \n bot: Python | human: Which is the best programming language? \n bot: Javascript |

![Generic LLM Finetuning](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/llm_1.png)
For Reward Trainer, your dataset must have a `text` column (aka chosen text) and a `rejected_text` column.
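
A quick local check of the header row can catch a missing column before upload. The following is a minimal sketch using only the standard library; the `missing_columns` helper is hypothetical and not part of AutoTrain.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "rejected_text"}


def missing_columns(csv_content: str, required: set[str]) -> set[str]:
    """Return the required column names that the CSV header lacks."""
    reader = csv.DictReader(io.StringIO(csv_content))
    header = set(reader.fieldnames or [])
    return required - header


# Two-column sample in the shape of the table above (literal "\n" kept as text).
sample = (
    "text,rejected_text\n"
    '"human: hello \\n bot: hi nice to meet you","human: hello \\n bot: leave me alone"\n'
)
reward_missing = missing_columns(sample, REQUIRED_COLUMNS)
```

An empty result means both required columns are present. The same helper could be reused for other trainers by passing a different set of names.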

Please note that the format above is for instruction finetuning. You can also finetune on any other format, for example generic finetuning. The data can be changed according to your requirements.
### Data Format For DPO Trainer

For DPO Trainer, the data should be in the following format:

## Training
| prompt | text | rejected_text |
|-----------------------------------------|---------------------|--------------------|
| hello | hi nice to meet you | leave me alone |
| how are you | I am fine | I am not fine |
| What is your name? | My name is Mary | What's it to you? |
| What is your name? | My name is Mary | I don't have a name |
| Which is the best programming language? | Python | Javascript |
| Which is the best programming language? | Python | C++ |
| Which is the best programming language? | Java | C++ |

Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button.
For DPO Trainer, your dataset must have a `prompt` column, a `text` column (aka chosen text) and a `rejected_text` column.
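
The table above repeats a prompt across several chosen/rejected pairs. One plausible way to produce rows like that, sketched here under the assumption that you start from a ranked list of answers per prompt, is to pair each answer with every answer ranked below it. The `preference_rows` helper and the ranking are hypothetical.

```python
import csv
import io
from itertools import combinations


def preference_rows(prompt: str, ranked_answers: list[str]) -> list[dict]:
    """Pair every answer with each lower-ranked answer (best answer first)."""
    return [
        {"prompt": prompt, "text": chosen, "rejected_text": rejected}
        for chosen, rejected in combinations(ranked_answers, 2)
    ]


rows = preference_rows(
    "Which is the best programming language?",
    ["Python", "Java", "C++"],  # hypothetical ranking, best first
)

# Serialize with the three column names the DPO format requires.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prompt", "text", "rejected_text"])
writer.writeheader()
writer.writerows(rows)
dpo_csv = buffer.getvalue()
```

Because `itertools.combinations` preserves input order, the first element of each pair is always the higher-ranked answer and lands in the `text` (chosen) column.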
24 changes: 0 additions & 24 deletions docs/source/model_choice.mdx

This file was deleted.

25 changes: 0 additions & 25 deletions docs/source/param_choice.mdx

This file was deleted.

19 changes: 19 additions & 0 deletions docs/source/seq2seq.mdx
@@ -0,0 +1,19 @@
# Seq2Seq

Seq2Seq is a task that involves converting a sequence of words into another sequence of words.
It is used in machine translation, text summarization, and question answering.

## Data Format

```csv
text,target
"this movie is great","dieser Film ist großartig"
"this movie is bad","dieser Film ist schlecht"
.
.
.
```

## Columns

Your CSV dataset must have two columns: `text` and `target`.
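
For illustration, a file in this shape can be written with Python's standard `csv` module. This is a minimal sketch: the `train.csv` file name and the temporary directory are assumptions, and UTF-8 is chosen so that characters such as `ß` survive the round trip.

```python
import csv
import tempfile
from pathlib import Path

pairs = [
    ("this movie is great", "dieser Film ist großartig"),
    ("this movie is bad", "dieser Film ist schlecht"),
]

out_path = Path(tempfile.mkdtemp()) / "train.csv"  # hypothetical file name

# newline="" lets the csv module control line endings itself.
with out_path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    writer.writerow(["text", "target"])
    writer.writerows(pairs)

# Read the file back to confirm the header and the non-ASCII text.
with out_path.open(newline="", encoding="utf-8") as handle:
    loaded = list(csv.DictReader(handle))
```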
2 changes: 1 addition & 1 deletion docs/source/support.mdx
Expand Up @@ -6,7 +6,7 @@ To get help and support for autotrain, there are 3 ways:

- [Ask in the Hugging Face Forum](https://discuss.huggingface.co/c/autotrain/16).

- [Email us](mailto:[email protected]) directly.
- [Email us](mailto:[email protected]) directly (Enterprise users and billing questions only).


Please don't forget to mention your username and project name if you have a specific question about your project.
44 changes: 44 additions & 0 deletions docs/source/tabular.mdx
@@ -0,0 +1,44 @@
# Tabular Classification / Regression

Using AutoTrain, you can train a model to classify or regress tabular data easily.
All you need to do is select from a list of models and upload your dataset.
Parameter tuning is done automatically.

## Models

The following models are available for tabular classification / regression.

- xgboost
- random_forest
- ridge
- logistic_regression
- svm
- extra_trees
- gradient_boosting
- adaboost
- decision_tree
- knn


## Data Format

```csv
id,category1,category2,feature1,target
1,A,X,0.3373961604172684,1
2,B,Z,0.6481718720511972,0
3,A,Y,0.36824153984054797,1
4,B,Z,0.9571551589530464,1
5,B,Z,0.14035078041264515,1
6,C,X,0.8700872583584364,1
7,A,Y,0.4736080452737105,0
8,C,Y,0.8009107519796442,1
9,A,Y,0.5204774795512048,0
10,A,Y,0.6788795301189603,0
.
.
.
```

## Columns

Your CSV dataset must include two columns: `id` and `target`.
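
A small pre-upload check along these lines can confirm that both columns exist. This is only a sketch: the `check_tabular_csv` helper is hypothetical, and the uniqueness check on `id` is an assumption about what a sensible identifier column looks like, not a documented AutoTrain rule.

```python
import csv
import io


def check_tabular_csv(csv_content: str) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    reader = csv.DictReader(io.StringIO(csv_content))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}" for name in ("id", "target") if name not in header
    ]
    if problems:
        return problems
    # Assumption: ids are expected to be unique per row.
    ids = [row["id"] for row in reader]
    if len(ids) != len(set(ids)):
        problems.append("duplicate values in id column")
    return problems


sample = (
    "id,category1,category2,feature1,target\n"
    "1,A,X,0.3373961604172684,1\n"
    "2,B,Z,0.6481718720511972,0\n"
)
tabular_problems = check_tabular_csv(sample)
```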
18 changes: 3 additions & 15 deletions docs/source/text_classification.mdx
Expand Up @@ -10,7 +10,7 @@ Let's train a model for classifying the sentiment of a movie review. The data sh
in the following CSV format:

```csv
review,sentiment
text,target
"this movie is great",positive
"this movie is bad",negative
.
Expand Down Expand Up @@ -41,18 +41,6 @@ for chunk in pd.read_csv('example.csv', chunksize=chunk_size):
i += 1
```

Once the data has been uploaded, you have to select the proper column mapping
## Columns

## Column Mapping

![Column Mapping](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/text_classification_1.png)

In our example, the text column is called `review` and the label column is called `sentiment`.
Thus, we have to select `review` for the text column and `sentiment` for the label column.
Please note that, if column mapping is not done correctly, the training will fail.


## Training

Once you have uploaded the data, selected the column mapping, and set the hyperparameters (AutoTrain or Manual mode), you can add the jobs.
To start the training, click on the `Start Training` button.
Your CSV dataset must have two columns: `text` and `target`.
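
If an existing file uses other header names, such as `review` and `sentiment`, one way to adapt it is to rewrite only the header row. The sketch below uses the standard `csv` module; the `rename_columns` helper and the column map are assumptions for illustration.

```python
import csv
import io

COLUMN_MAP = {"review": "text", "sentiment": "target"}  # old name -> required name


def rename_columns(csv_content: str, column_map: dict[str, str]) -> str:
    """Rewrite the header row, leaving every data row unchanged."""
    reader = csv.reader(io.StringIO(csv_content))
    header = next(reader)
    new_header = [column_map.get(name, name) for name in header]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(new_header)
    writer.writerows(reader)  # remaining rows are copied as-is
    return buffer.getvalue()


original = (
    "review,sentiment\n"
    '"this movie is great",positive\n'
    '"this movie is bad",negative\n'
)
renamed = rename_columns(original, COLUMN_MAP)
```

Headers that are not in the map are passed through unchanged, so the helper is safe to run on a file that already uses `text` and `target`.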
2 changes: 1 addition & 1 deletion src/autotrain/cli/run_llm.py
Expand Up @@ -45,7 +45,7 @@ def register_subcommand(parser: ArgumentParser):
},
{
"arg": "--train_split",
"help": "Test dataset split to use",
"help": "Train dataset split to use",
"required": False,
"type": str,
"default": "train",
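
To show why the help string matters, here is a minimal sketch of how an argument dictionary shaped like the one in this hunk could be registered with `argparse`. The `ARG_LIST` and `register_args` names are hypothetical; the actual registration code in `run_llm.py` is not shown in this diff and may differ.

```python
from argparse import ArgumentParser

# One entry shaped like the dictionary in the hunk above.
ARG_LIST = [
    {
        "arg": "--train_split",
        "help": "Train dataset split to use",
        "required": False,
        "type": str,
        "default": "train",
    },
]


def register_args(parser: ArgumentParser, arg_list: list[dict]) -> None:
    """Add each spec to the parser, passing every key except `arg` as a keyword."""
    for spec in arg_list:
        options = {key: value for key, value in spec.items() if key != "arg"}
        parser.add_argument(spec["arg"], **options)


parser = ArgumentParser(prog="example")
register_args(parser, ARG_LIST)
default_args = parser.parse_args([])
custom_args = parser.parse_args(["--train_split", "training"])
```

The `help` value is what users see in `--help` output, so a string that says "Test" for a train split would be misleading even though parsing behaves the same.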
Binary file added static/ui.png
10 changes: 9 additions & 1 deletion templates/index.html
Expand Up @@ -49,6 +49,13 @@
</header>

<div class="form-container max-w-lg mx-auto mt-10 p-6 shadow-2xl">
<div class="block text-sm font-normal text-gray-700">AutoTrain Advanced is a no-code solution that
allows you to train machine learning models in just a few clicks. Please note that you must upload data in the correct
format for the project to be created.
For help regarding proper data format and pricing, click <a href="https://hf.co/docs/autotrain" target="_blank"
class="text-blue-700">here</a>.
</div>
<hr class="h-px my-4 bg-gray-200 border-b-2 dark:bg-gray-700">
<form action="#" method="post" class="space-y-4" enctype="multipart/form-data">
<!-- <div class="columns-2">
<div>
Expand Down Expand Up @@ -133,7 +140,8 @@
<li class="me-2" role="presentation">
<button class="inline-block p-4 border-b-2 rounded-t-lg" id="valid-data-tab"
data-tabs-target="#valid-data" type="button" role="tab" aria-controls="valid-data"
aria-selected="false">Validation Data (optional)</button>
aria-selected="false">Validation Data
(optional)</button>
</li>
</ul>
</div>