
Commit

update (#630)
abhishekkrthakur authored May 8, 2024
1 parent a1b6fd8 commit 831afc2
Showing 40 changed files with 1,048 additions and 830 deletions.
40 changes: 32 additions & 8 deletions docs/source/_toctree.yml
@@ -1,19 +1,27 @@
- sections:
- local: index
title: 🤗 AutoTrain
- local: getting_started
title: Installation
- local: cost
title: How much does it cost?
- local: support
title: Get help and support
- local: faq
title: Frequently Asked Questions
title: Getting Started
- sections:
- local: starting_ui
title: Starting the UI
- local: starting_cli
title: Starting the CLI
title: Starting AutoTrain
- local: quickstart_spaces
title: Quickstart
title: AutoTrain on Hugging Face Spaces
- sections:
- local: quickstart
title: Quickstart
- local: config
title: Configurations
title: Use AutoTrain Locally
- sections:
- local: col_map
title: Understanding Column Mapping
title: Miscellaneous
- sections:
- local: text_classification
title: Text Classification
Expand All @@ -31,4 +39,20 @@
title: Token Classification
- local: tabular
title: Tabular
title: Tasks
title: Data Formats
- sections:
- local: text_classification_params
title: Text Classification & Regression
- local: llm_finetuning_params
title: LLM Finetuning
- local: image_classification_params
title: Image Classification
- local: dreambooth_params
title: DreamBooth
- local: seq2seq_params
title: Seq2Seq
- local: token_classification_params
title: Token Classification
- local: tabular_params
title: Tabular
title: Parameters
162 changes: 162 additions & 0 deletions docs/source/col_map.mdx
@@ -0,0 +1,162 @@
# Understanding Column Mapping

Column mapping is a critical setup step in AutoTrain that tells the system
what role each column in your dataset plays. Whether it's a tabular
dataset, text classification data, or another type, precise column mapping
ensures that AutoTrain processes each dataset element correctly.

## How Column Mapping Works

AutoTrain has no way of knowing what the columns in your dataset represent.
AutoTrain requires a clear understanding of each column's function within
your dataset to train models effectively. This is managed through a
straightforward mapping system in the user interface, represented as a dictionary.
Here's a typical example:

```
{"text": "text", "label": "target"}
```

In this example, the `text` column in your dataset corresponds to the text data
AutoTrain uses for processing, and the `target` column is treated as the
label for training.

Don't worry, though: column mapping is exactly how you give AutoTrain that understanding.
If your data is already in the AutoTrain format, you don't need to change the column mapping.
If not, you can easily map the columns in your dataset to the correct AutoTrain format.

In the UI, you will see column mapping as a dictionary:

```
{"text": "text", "label": "target"}
```

Here, the column `text` in your dataset is mapped to the AutoTrain column `text`,
and the column `target` in your dataset is mapped to the AutoTrain column `label`.

Let's say you are training a text classification model and your dataset has the following columns:

```
full_text, target_sentiment
"this movie is great", positive
"this movie is bad", negative
```

You can map these columns to the AutoTrain format as follows:

```
{"text": "full_text", "label": "target_sentiment"}
```

If your dataset has the columns: `text` and `label`, you don't need to change the column mapping.

Let's take a look at column mappings for each task:

## LLM

Note: For all LLM tasks, if the text column(s) contain samples in chat format (dict or JSON)
rather than plain text, you should use the `chat_template` parameter. Read more about it in
the LLM Parameters section.
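For reference, a chat-formatted sample (the kind that needs `chat_template`) is typically a list of role/content messages. The field names below follow the common convention and are illustrative:

```
[
  {"role": "user", "content": "What is AutoTrain?"},
  {"role": "assistant", "content": "AutoTrain is a tool for training models without writing code."}
]
```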


### SFT / Generic Trainer

```
{"text": "text"}
```

`text`: The column in your dataset that contains the text data.


### Reward / ORPO Trainer

```
{"text": "text", "rejected_text": "rejected_text"}
```

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.

### DPO Trainer

```
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
```

`prompt`: The column in your dataset that contains the prompt data.

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.


## Text Classification & Regression, Seq2Seq

For text classification and regression, the column mapping should be as follows:

```
{"text": "dataset_text_column", "label": "dataset_target_column"}
```

`text`: The column in your dataset that contains the text data.

`label`: The column in your dataset that contains the target variable.


## Token Classification


```
{"text": "tokens", "label": "tags"}
```

`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings.

`label`: The column in your dataset that contains the tags. These tags must be a list of strings.

For token classification, if you are using a CSV, make sure that the columns are stringified lists.
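As a sketch of what "stringified lists" means in practice, the snippet below writes a tiny token-classification CSV where each cell is a list serialized as a string (here via `json.dumps`; the exact serialization AutoTrain expects is an assumption, and the sample tokens/tags are made up):

```python
import csv
import io
import json

# Hypothetical example rows: tokens and their matching tags.
rows = [
    (["Hugging", "Face", "is", "great"], ["B-ORG", "I-ORG", "O", "O"]),
]

# Write a CSV whose "tokens" and "tags" cells are stringified lists.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["tokens", "tags"])
for tokens, tags in rows:
    writer.writerow([json.dumps(tokens), json.dumps(tags)])

# Reading the CSV back recovers real lists from the stringified cells.
buf.seek(0)
record = next(csv.DictReader(buf))
recovered_tokens = json.loads(record["tokens"])
recovered_tags = json.loads(record["tags"])
```

Each list in a row must have the same length, since every token needs exactly one tag.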

## Tabular Classification & Regression

```
{"id": "id", "label": ["target"]}
```

`id`: The column in your dataset that contains the unique identifier for each row.

`label`: The column in your dataset that contains the target variable. This should be a list of strings.

For a single target column, you can pass a list with a single element.

For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements.
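For example, a hypothetical multi-label setup with three target columns (`label1`, `label2`, and `label3` are placeholder names) would look like:

```
{"id": "id", "label": ["label1", "label2", "label3"]}
```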


## DreamBooth LoRA

Dreambooth doesn't require column mapping.

## Image Classification

For image classification, the column mapping should be as follows:

```
{"image": "image_column", "label": "label_column"}
```

Image classification requires column mapping only when you are using a dataset from Hugging Face Hub.
For uploaded datasets, leave column mapping as it is.

## Ensuring Accurate Mapping

To ensure your model trains correctly:

- Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset.

- Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings).

- Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand.

By following these guidelines and using the provided examples as templates,
you can effectively instruct AutoTrain on how to interpret and handle your
data for various machine learning tasks. This process is fundamental for
achieving optimal results from your model training endeavors.
65 changes: 65 additions & 0 deletions docs/source/config.mdx
@@ -0,0 +1,65 @@
# AutoTrain Configs

AutoTrain config files are the way to train models with AutoTrain locally.

Once you have installed AutoTrain Advanced, you can use the following command to train models using AutoTrain config files:

```bash
$ export HF_USERNAME=your_hugging_face_username
$ export HF_TOKEN=your_hugging_face_write_token

$ autotrain --config path/to/config.yaml
```

Example configurations for all tasks can be found in the `configs` directory of
the [AutoTrain Advanced GitHub repository](https://github.com/huggingface/autotrain-advanced).

Here is an example of an AutoTrain config file:

```yaml
task: llm
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: autotrain-llama3-8b-orpo
log: tensorboard
backend: local

data:
path: argilla/distilabel-capybara-dpo-7k-binarized
train_split: train
valid_split: null
chat_template: chatml
column_mapping:
text_column: chosen
rejected_text_column: rejected

params:
trainer: orpo
block_size: 1024
model_max_length: 2048
max_prompt_length: 512
epochs: 3
batch_size: 2
lr: 3e-5
peft: true
quantization: int4
target_modules: all-linear
padding: right
optimizer: adamw_torch
scheduler: linear
gradient_accumulation: 4
mixed_precision: bf16

hub:
username: ${HF_USERNAME}
token: ${HF_TOKEN}
push_to_hub: true
```

In this config, we are finetuning the `meta-llama/Meta-Llama-3-8B-Instruct` model
on the `argilla/distilabel-capybara-dpo-7k-binarized` dataset using the `orpo`
trainer for 3 epochs with a batch size of 2 and a learning rate of `3e-5`.
More information on the available parameters can be found in the *Data Formats and Parameters* section.

In case you don't want to push the model to the Hub, you can set `push_to_hub` to `false`
in the config file. If you are not pushing the model to the Hub, the username and token are
not required. Note: they may still be needed if you are trying to access gated models or datasets.
41 changes: 35 additions & 6 deletions docs/source/cost.mdx
@@ -1,11 +1,40 @@
# How much does it cost?

AutoTrain provides you with best models which are deployable with just a few clicks.
Unlike other services, we don't own your models. Once the training is done, you can download them and use them anywhere you want.
AutoTrain offers an accessible approach to model training, providing deployable models
with just a few clicks. Understanding the cost involved is essential to planning and
executing your projects efficiently.

You will be charged per minute based on the hardware you choose.

Pricing information is available in the [pricing](https://huggingface.co/pricing#spaces) section.
## Local Usage

Please note that in order to use AutoTrain, you need to have a valid payment method on file.
You can add your payment method in the [billing](https://huggingface.co/settings/billing) section.
When you choose to use AutoTrain locally on your own hardware, there is no cost.
This option is ideal for those who prefer to manage their own infrastructure and
do not require the scalability that cloud resources offer.

## Using AutoTrain on Hugging Face Spaces

**Pay-As-You-Go**: Costs for using AutoTrain in Hugging Face Spaces are based on the
computing resources you consume. This flexible pricing structure ensures you only pay
for what you use, making it cost-effective and scalable for projects of any size.


**Ownership and Portability**: Unlike some other platforms, AutoTrain does not retain
ownership of your models. Once training is complete, you are free to download and
deploy your models wherever you choose, giving you flexibility and control over all of your assets.

### Pricing Details

**Resource-Based Billing**: Charges are accrued per minute according to the type of hardware
utilized during training. This means you can scale your resource usage based on the
complexity and needs of your projects.

For a detailed breakdown of the costs associated with using Hugging Face Spaces,
please refer to the [pricing](https://huggingface.co/pricing#spaces) section on our website.

To access the paid features of AutoTrain, you must have a valid payment method on file.
You can manage your payment options and view your billing information in
the [billing section of your Hugging Face account settings](https://huggingface.co/settings/billing).

By offering both free and flexible paid options, AutoTrain ensures that users can choose
the most suitable model training solution for their needs, whether they are experimenting
on a local machine or scaling up operations on Hugging Face Spaces.
