huggingface · abhishekkrthakur · Nov 22, 2023 · Nov 22, 2023
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -8,21 +8,17 @@
   - local: support
     title: Get help and support
   title: Get started
-- sections:
-  - local: model_choice
-    title: Model Selection
-  - local: param_choice
-    title: Parameter Selection
-  title: Selecting Models and Parameters
 - sections: 
   - local: text_classification
     title: Text Classification
   - local: llm_finetuning
     title: LLM Finetuning
-  title: Text Tasks
-- sections:
   - local: image_classification
     title: Image Classification
   - local: dreambooth
     title: DreamBooth
-  title: Image Tasks
+  - local: seq2seq
+    title: Seq2Seq
+  - local: tabular
+    title: Tabular
+  title: Data Formats
diff --git a/docs/source/dreambooth.mdx b/docs/source/dreambooth.mdx
@@ -2,15 +2,11 @@
 
 DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views.
 
-![DreamBooth Teaser](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth1.jpeg)
-
 ## Data Preparation
 
 The data format for DreamBooth training is simple. All you need is images of a concept (e.g. a person) and a concept token.
 
-![DreamBooth Training](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth2.png)
-
 To train a dreambooth model, please select an appropriate model from the hub.
 When choosing a model from the hub, please make sure you select the correct image size compatible with the model.
 
-The next steps are to add the jobs and click on "Start Training" to start training the model(s).
+Your concept token is the `prompt` value in the parameters section.
diff --git a/docs/source/getting_started.mdx b/docs/source/getting_started.mdx
@@ -18,4 +18,8 @@ We are constantly adding new features and tasks to AutoTrain Advanced. Its alway
 
 Please note that "restarting" a space will not update it to the latest version. You need to "Factory reboot" the space to update it to the latest version.
 
-And now we are all set and we can start with our first project!
+And now we are all set and we can start with our first project!
+
+# Understanding the UI
+
+![autotrain-space-template](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/ui.png)
diff --git a/docs/source/image_classification.mdx b/docs/source/image_classification.mdx
@@ -31,9 +31,3 @@ Some points to keep in mind:
 - There should not be any other folders inside the zip folder.
 
 When train.zip is decompressed, it creates two folders: cats and dogs. these are the two categories for classification. The images for both categories are in their respective folders. You can have as many categories as you want.
-
-## Training
-
-Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button.
-
-![Image Classification](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/image_classification_1.png)
diff --git a/docs/source/llm_finetuning.mdx b/docs/source/llm_finetuning.mdx
@@ -7,37 +7,47 @@ AutoTrain supports the following types of LLM finetuning:
 - Causal Language Modeling (CLM)
 - Masked Language Modeling (MLM) [Coming Soon]
 
-For LLM finetuning, only Hugging Face Hub model choice is available. 
-User needs to select a model from Hugging Face Hub, that they want to finetune and select the parameters on their own (Manual Parameter Selection),
-or use AutoTrain's Auto Parameter Selection to automatically select the best parameters for the task.
-
 ## Data Preparation
 
 LLM finetuning accepts data in CSV format.
-There are two modes for LLM finetuning: `generic` and `chat`.
-An example dataset with both formats in the same dataset can be found here: https://huggingface.co/datasets/tatsu-lab/alpaca
 
-### Generic
+### Data Format For SFT / Generic Trainer
+
+For SFT / Generic Trainer, the data should be in the following format:
+
+| text |
+| This is the first sentence. |
+| This is the second sentence. |
 
-In generic mode, only one column is required: `text`.
-The user can take care of how the data is formatted for the task.
-A sample instance for this format is presented below:
+An example dataset for this format can be found here: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
 
-```
-Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. 
+For SFT/Generic training, your dataset must have a `text` column.
 
-### Instruction: Evaluate this sentence for spelling and grammar mistakes 
+### Data Format For Reward Trainer
 
-### Input: He finnished his meal and left the resturant 
+For Reward Trainer, the data should be in the following format:
 
-### Response: He finished his meal and left the restaurant.
-```
+| text                                                          | rejected_text                                                     |
+|---------------------------------------------------------------|-------------------------------------------------------------------|
+| human: hello \n bot: hi nice to meet you                      | human: hello \n bot: leave me alone                               |
+| human: how are you \n bot: I am fine                          | human: how are you \n bot: I am not fine                          |
+| human: What is your name? \n bot: My name is Mary             | human: What is your name? \n bot: Whats it to you?                |
+| human: Which is the best programming language? \n bot: Python | human: Which is the best programming language? \n bot: Javascript |
 
-![Generic LLM Finetuning](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/llm_1.png)
+For Reward Trainer, your dataset must have a `text` column (aka chosen text) and a `rejected_text` column.
 
-Please note that above is the format for instruction finetuning. You can also finetune on any other format as you want, for example generic finetuning. The data can be changed according to the requirements.
+### Data Format For DPO Trainer
 
+For DPO Trainer, the data should be in the following format:
 
-## Training
+| prompt                                  | text                | rejected_text      |
+|-----------------------------------------|---------------------|--------------------|
+| hello                                   | hi nice to meet you | leave me alone     |
+| how are you                             | I am fine           | I am not fine      |
+| What is your name?                      | My name is Mary     | Whats it to you?   |
+| What is your name?                      | My name is Mary     | I dont have a name |
+| Which is the best programming language? | Python              | Javascript         |
+| Which is the best programming language? | Python              | C++                |
+| Which is the best programming language? | Java                | C++                |
 
-Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button.
+For DPO Trainer, your dataset must have a `prompt` column, a `text` column (aka chosen text) and a `rejected_text` column.
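
Not part of the patch: a minimal sketch of how a CSV with the three DPO columns named above could be produced using Python's standard `csv` module. The `PAIRS` list and the `dpo_csv` helper are hypothetical names chosen for illustration; only the column names `prompt`, `text`, and `rejected_text` come from the documentation in this diff.

```python
import csv
import io

# Hypothetical preference data: (prompt, chosen answer, rejected answer).
PAIRS = [
    ("hello", "hi nice to meet you", "leave me alone"),
    ("how are you", "I am fine", "I am not fine"),
]

# Column names taken from the DPO Trainer section of the docs.
DPO_COLUMNS = ["prompt", "text", "rejected_text"]


def dpo_csv(pairs):
    """Return CSV text with a prompt/text/rejected_text header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(DPO_COLUMNS)
    for prompt, chosen, rejected in pairs:
        writer.writerow([prompt, chosen, rejected])
    return buffer.getvalue()


print(dpo_csv(PAIRS))
```

Using `csv.writer` rather than manual string joining means values containing commas or quotes are escaped automatically.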
diff --git a/docs/source/model_choice.mdx b/docs/source/model_choice.mdx
diff --git a/docs/source/param_choice.mdx b/docs/source/param_choice.mdx
diff --git a/docs/source/seq2seq.mdx b/docs/source/seq2seq.mdx
@@ -0,0 +1,19 @@
+# Seq2Seq
+
+Seq2Seq is a task that involves converting a sequence of words into another sequence of words. 
+It is used in machine translation, text summarization, and question answering.
+
+## Data Format
+
+```csv
+text,target
+"this movie is great","dieser Film ist großartig"
+"this movie is bad","dieser Film ist schlecht"
+.
+.
+.
+```
+
+## Columns
+
+Your CSV dataset must have two columns: `text` and `target`.
diff --git a/docs/source/support.mdx b/docs/source/support.mdx
@@ -6,7 +6,7 @@ To get help and support for autotrain, there are 3 ways:
 
 - [Ask in the Hugging Face Forum](https://discuss.huggingface.co/c/autotrain/16).
 
-- [Email us](mailto:[email protected]) directly.
+- [Email us](mailto:[email protected]) directly (Enterprise users and billing questions only).
 
 
 Please don't forget to mention your username and project name if you have a specific question about your project.
diff --git a/docs/source/tabular.mdx b/docs/source/tabular.mdx
@@ -0,0 +1,44 @@
+# Tabular Classification / Regression
+
+Using AutoTrain, you can easily train a classification or regression model on tabular data.
+All you need to do is select from a list of models and upload your dataset.
+Parameter tuning is done automatically.
+
+## Models
+
+The following models are available for tabular classification / regression.
+
+- xgboost
+- random_forest
+- ridge
+- logistic_regression
+- svm
+- extra_trees
+- gradient_boosting
+- adaboost
+- decision_tree
+- knn
+
+
+## Data Format
+
+```csv
+id,category1,category2,feature1,target
+1,A,X,0.3373961604172684,1
+2,B,Z,0.6481718720511972,0
+3,A,Y,0.36824153984054797,1
+4,B,Z,0.9571551589530464,1
+5,B,Z,0.14035078041264515,1
+6,C,X,0.8700872583584364,1
+7,A,Y,0.4736080452737105,0
+8,C,Y,0.8009107519796442,1
+9,A,Y,0.5204774795512048,0
+10,A,Y,0.6788795301189603,0
+.
+.
+.
+```
+
+## Columns
+
+Your CSV dataset must include at least two columns: `id` and `target`.
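
Not part of the patch: a minimal sketch, assuming the `id` and `target` requirement stated above, of how a tabular CSV header could be checked before upload. `REQUIRED` and `missing_columns` are hypothetical names, not AutoTrain APIs.

```python
import csv
import io

# The two columns the tabular docs name as mandatory.
REQUIRED = {"id", "target"}


def missing_columns(csv_text, required=REQUIRED):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return sorted(set(required) - set(header))


sample = "id,category1,category2,feature1,target\n1,A,X,0.3373961604172684,1\n"
print(missing_columns(sample))                    # []
print(missing_columns("id,feature1\n1,0.5\n"))    # ['target']
```

Extra feature columns such as `category1` or `feature1` are not flagged; only the absence of a required name is reported.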
diff --git a/docs/source/text_classification.mdx b/docs/source/text_classification.mdx
@@ -10,7 +10,7 @@ Let's train a model for classifying the sentiment of a movie review. The data sh
 in the following CSV format:
 
 ```csv
-review,sentiment
+text,target
 "this movie is great",positive
 "this movie is bad",negative
 .
@@ -41,18 +41,6 @@ for chunk in pd.read_csv('example.csv', chunksize=chunk_size):
     i += 1
 ```
 
-Once the data has been uploaded, you have to select the proper column mapping
+## Columns
 
-## Column Mapping
-
-![Column Mapping](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/text_classification_1.png)
-
-In our example, the text column is called `review` and the label column is called `sentiment`.
-Thus, we have to select `review` for the text column and `sentiment` for the label column.
-Please note that, if column mapping is not done correctly, the training will fail.
-
-
-## Training
-
-Once you have uploaded the data, selected the column mapping, and set the hyperparameters (AutoTrain or Manual mode), you can add the jobs.
-To start the training, click on the `Start Training` button.
+Your CSV dataset must have two columns: `text` and `target`.
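
Not part of the patch: since this change replaces the column-mapping step with fixed `text` and `target` column names, an existing CSV that uses the earlier `review,sentiment` header would need its header renamed. A minimal sketch with Python's standard `csv` module follows; `RENAMES` and `rename_columns` are hypothetical names for illustration.

```python
import csv
import io

# Hypothetical mapping from the old example header to the names now required.
RENAMES = {"review": "text", "sentiment": "target"}


def rename_columns(csv_text, renames=RENAMES):
    """Rewrite only the header row; data values are left unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


legacy = 'review,sentiment\n"this movie is great",positive\n"this movie is bad",negative\n'
print(rename_columns(legacy))
```

Column names that are not in the mapping pass through untouched, so the helper is safe to run on a file that already uses `text` and `target`.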
diff --git a/src/autotrain/cli/run_llm.py b/src/autotrain/cli/run_llm.py
@@ -45,7 +45,7 @@ def register_subcommand(parser: ArgumentParser):
             },
             {
                 "arg": "--train_split",
-                "help": "Test dataset split to use",
+                "help": "Train dataset split to use",
                 "required": False,
                 "type": str,
                 "default": "train",

diff --git a/static/ui.png b/static/ui.png
diff --git a/templates/index.html b/templates/index.html
@@ -49,6 +49,13 @@
   </header>
 
   <div class="form-container max-w-lg mx-auto mt-10 p-6 shadow-2xl">
+    <div class="block text-sm font-normal text-gray-700">AutoTrain Advanced is a no-code solution that
+      allows you to train machine learning models in just a few clicks. Please note that you must upload data in the correct
+      format for the project to be created.
+      For help regarding proper data format and pricing, click <a href="https://hf.co/docs/autotrain" target="_blank"
+        class="text-blue-700">here</a>.
+    </div>
+    <hr class="h-px my-4 bg-gray-200 border-b-2 dark:bg-gray-700">
     <form action="#" method="post" class="space-y-4" enctype="multipart/form-data">
       <!-- <div class="columns-2">
         <div>
@@ -133,7 +140,8 @@
             <li class="me-2" role="presentation">
               <button class="inline-block p-4 border-b-2 rounded-t-lg" id="valid-data-tab"
                 data-tabs-target="#valid-data" type="button" role="tab" aria-controls="valid-data"
-                aria-selected="false">Validation Data (optional)</button>
+                aria-selected="false">Validation Data
+                (optional)</button>
             </li>
           </ul>
         </div>