diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index f00a96415d..20484d0605 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -8,21 +8,17 @@ - local: support title: Get help and support title: Get started -- sections: - - local: model_choice - title: Model Selection - - local: param_choice - title: Parameter Selection - title: Selecting Models and Parameters - sections: - local: text_classification title: Text Classification - local: llm_finetuning title: LLM Finetuning - title: Text Tasks -- sections: - local: image_classification title: Image Classification - local: dreambooth title: DreamBooth - title: Image Tasks + - local: seq2seq + title: Seq2Seq + - local: tabular + title: Tabular + title: Data Formats diff --git a/docs/source/dreambooth.mdx b/docs/source/dreambooth.mdx index 8bbec3d76e..4eb1f82351 100644 --- a/docs/source/dreambooth.mdx +++ b/docs/source/dreambooth.mdx @@ -2,15 +2,11 @@ DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views. -![DreamBooth Teaser](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth1.jpeg) - ## Data Preparation The data format for DreamBooth training is simple. All you need is images of a concept (e.g. a person) and a concept token. -![DreamBooth Training](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/dreambooth2.png) - To train a dreambooth model, please select an appropriate model from the hub. When choosing a model from the hub, please make sure you select the correct image size compatible with the model. -The next steps are to add the jobs and click on "Start Training" to start training the model(s). \ No newline at end of file +Set your concept token as `prompt` in the parameters section. 
\ No newline at end of file diff --git a/docs/source/getting_started.mdx b/docs/source/getting_started.mdx index f79abc3be0..634e0fa1f5 100644 --- a/docs/source/getting_started.mdx +++ b/docs/source/getting_started.mdx @@ -18,4 +18,8 @@ We are constantly adding new features and tasks to AutoTrain Advanced. Its alway Please note that "restarting" a space will not update it to the latest version. You need to "Factory reboot" the space to update it to the latest version. -And now we are all set and we can start with our first project! \ No newline at end of file +And now we are all set and we can start with our first project! + +# Understanding the UI + +![autotrain-space-template](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/ui.png) diff --git a/docs/source/image_classification.mdx b/docs/source/image_classification.mdx index c27a7f2b5c..582ac7e133 100644 --- a/docs/source/image_classification.mdx +++ b/docs/source/image_classification.mdx @@ -31,9 +31,3 @@ Some points to keep in mind: - There should not be any other folders inside the zip folder. When train.zip is decompressed, it creates two folders: cats and dogs. these are the two categories for classification. The images for both categories are in their respective folders. You can have as many categories as you want. - -## Training - -Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button. 
- -![Image Classification](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/image_classification_1.png) \ No newline at end of file diff --git a/docs/source/llm_finetuning.mdx b/docs/source/llm_finetuning.mdx index c92850a8e5..760c0ae67c 100644 --- a/docs/source/llm_finetuning.mdx +++ b/docs/source/llm_finetuning.mdx @@ -7,37 +7,47 @@ AutoTrain supports the following types of LLM finetuning: - Causal Language Modeling (CLM) - Masked Language Modeling (MLM) [Coming Soon] -For LLM finetuning, only Hugging Face Hub model choice is available. -User needs to select a model from Hugging Face Hub, that they want to finetune and select the parameters on their own (Manual Parameter Selection), -or use AutoTrain's Auto Parameter Selection to automatically select the best parameters for the task. - ## Data Preparation LLM finetuning accepts data in CSV format. -There are two modes for LLM finetuning: `generic` and `chat`. -An example dataset with both formats in the same dataset can be found here: https://huggingface.co/datasets/tatsu-lab/alpaca -### Generic +### Data Format For SFT / Generic Trainer + +For SFT / Generic Trainer, the data should be in the following format: + +| text | +| This is the first sentence. | +| This is the second sentence. | -In generic mode, only one column is required: `text`. -The user can take care of how the data is formatted for the task. -A sample instance for this format is presented below: +An example dataset for this format can be found here: https://huggingface.co/datasets/timdettmers/openassistant-guanaco -``` -Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. 
+For SFT/Generic training, your dataset must have a `text` column. -### Instruction: Evaluate this sentence for spelling and grammar mistakes +### Data Format For Reward Trainer -### Input: He finnished his meal and left the resturant +For Reward Trainer, the data should be in the following format: -### Response: He finished his meal and left the restaurant. -``` +| text | rejected_text | +|---------------------------------------------------------------|-------------------------------------------------------------------| +| human: hello \n bot: hi nice to meet you | human: hello \n bot: leave me alone | +| human: how are you \n bot: I am fine | human: how are you \n bot: I am not fine | +| human: What is your name? \n bot: My name is Mary | human: What is your name? \n bot: Whats it to you? | +| human: Which is the best programming language? \n bot: Python | human: Which is the best programming language? \n bot: Javascript | -![Generic LLM Finetuning](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/llm_1.png) +For Reward Trainer, your dataset must have a `text` column (aka chosen text) and a `rejected_text` column. -Please note that above is the format for instruction finetuning. You can also finetune on any other format as you want, for example generic finetuning. The data can be changed according to the requirements. +### Data Format For DPO Trainer +For DPO Trainer, the data should be in the following format: -## Training +| prompt | text | rejected_text | +|-----------------------------------------|---------------------|--------------------| +| hello | hi nice to meet you | leave me alone | +| how are you | I am fine | I am not fine | +| What is your name? | My name is Mary | Whats it to you? | +| What is your name? | My name is Mary | I dont have a name | +| Which is the best programming language? | Python | Javascript | +| Which is the best programming language? | Python | C++ | +| Which is the best programming language? 
| Java | C++ | -Once you have your data ready and jobs added, you can start training your model by clicking the "Start Training" button. \ No newline at end of file +For DPO Trainer, your dataset must have a `prompt` column, a `text` column (aka chosen text) and a `rejected_text` column. diff --git a/docs/source/model_choice.mdx b/docs/source/model_choice.mdx deleted file mode 100644 index e7f3b26fb6..0000000000 --- a/docs/source/model_choice.mdx +++ /dev/null @@ -1,24 +0,0 @@ -# Model Choice - -AutoTrain can automagically select the best models for your task! However, you are also -allowed to choose the models you want to use. You can choose the most appropriate models -from the Hugging Face Hub. - -![autotrain-model-choice](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/model_choice_1.png) - -## AutoTrain Model Choice - -To let AutoTrain choose the best models for your task, you can use the "AutoTrain" -in the "Model Choice" section. Once you choose AutoTrain mode, you no longer need to worry about model and parameter selection. -AutoTrain will automatically select the best models (and parameters) for your task. - -## Manual Model Choice - -To choose the models manually, you can use the "HuggingFace Hub" in the "Model Choice" section. -For example, if you want to use if you are training a text classification task and want to choose Deberta V3 Base for your task -from https://huggingface.co/microsoft/deberta-v3-base, -You can choose "HuggingFace Hub" and then write the model name: `microsoft/deberta-v3-base` in the model name field. - -![hub-model-choice](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/hub_model_choice.png) - -Please note that if you are selecting a hub model, you should make sure that it is compatible with your task, otherwise the training will fail. 
\ No newline at end of file diff --git a/docs/source/param_choice.mdx b/docs/source/param_choice.mdx deleted file mode 100644 index 14439ab1e1..0000000000 --- a/docs/source/param_choice.mdx +++ /dev/null @@ -1,25 +0,0 @@ -# Parameter Choice - -Just like model choice, you can choose the parameters for your job in two ways: AutoTrain and Manual. - -## AutoTrain Mode - -In the AutoTrain mode, the parameters for your task-model pair will be chosen automagically. -If you choose "AutoTrain" as model choice, you get the AutoTrain mode as the only option. -If you choose "HuggingFace Hub" as model choice, you get the the option to choose between AutoTrain and Manual mode for parameter choice. - -An example of AutoTrain mode for a text classification task is shown below: - -![AutoTrain Parameter Choice](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/param_choice_1.png) - -For most of the tasks in AutoTrain parameter selection mode, you will get "Number of Models" as the only parameter to choose. Some tasks like test-classification might ask you about the language of the dataset. -The more the number of models, the better the final results might be but it might be more expensive too! - -## Manual Mode - -Manual model can be used only when you choose "HuggingFace Hub" as model choice. In this mode, you can choose the parameters for your task-model pair manually. -An example of Manual mode for a text classification task is shown below: - -![Manual Parameter Choice](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/param_choice_2.png) - -In the manual mode, you have to add the jobs on your own. So, carefully select your parameters, click on "Add Job" and 💥. 
\ No newline at end of file diff --git a/docs/source/seq2seq.mdx b/docs/source/seq2seq.mdx new file mode 100644 index 0000000000..3459cd041e --- /dev/null +++ b/docs/source/seq2seq.mdx @@ -0,0 +1,19 @@ +# Seq2Seq + +Seq2Seq is a task that involves converting a sequence of words into another sequence of words. +It is used in machine translation, text summarization, and question answering. + +## Data Format + +```csv +text,target +"this movie is great","dieser Film ist großartig" +"this movie is bad","dieser Film ist schlecht" +. +. +. +``` + +## Columns + +Your CSV dataset must have two columns: `text` and `target`. diff --git a/docs/source/support.mdx b/docs/source/support.mdx index 0a180c3f8e..ae68375419 100644 --- a/docs/source/support.mdx +++ b/docs/source/support.mdx @@ -6,7 +6,7 @@ To get help and support for autotrain, there are 3 ways: - [Ask in the Hugging Face Forum](https://discuss.huggingface.co/c/autotrain/16). -- [Email us](mailto:autotrain@hf.co) directly. +- [Email us](mailto:autotrain@hf.co) directly (Enterprise users and billing questions only). Please don't forget to mention your username and project name if you have a specific question about your project. diff --git a/docs/source/tabular.mdx b/docs/source/tabular.mdx new file mode 100644 index 0000000000..45374c9bf5 --- /dev/null +++ b/docs/source/tabular.mdx @@ -0,0 +1,44 @@ +# Tabular Classification / Regression + +Using AutoTrain, you can train a model to classify or regress tabular data easily. +All you need to do is select from a list of models and upload your dataset. +Parameter tuning is done automatically. + +## Models + +The following models are available for tabular classification / regression. 
+ +- xgboost +- random_forest +- ridge +- logistic_regression +- svm +- extra_trees +- gradient_boosting +- adaboost +- decision_tree +- knn + + +## Data Format + +```csv +id,category1,category2,feature1,target +1,A,X,0.3373961604172684,1 +2,B,Z,0.6481718720511972,0 +3,A,Y,0.36824153984054797,1 +4,B,Z,0.9571551589530464,1 +5,B,Z,0.14035078041264515,1 +6,C,X,0.8700872583584364,1 +7,A,Y,0.4736080452737105,0 +8,C,Y,0.8009107519796442,1 +9,A,Y,0.5204774795512048,0 +10,A,Y,0.6788795301189603,0 +. +. +. +``` + +## Columns + +Your CSV dataset must include two columns: `id` and `target`. diff --git a/docs/source/text_classification.mdx b/docs/source/text_classification.mdx index d01bd16bfa..364b07ed85 100644 --- a/docs/source/text_classification.mdx +++ b/docs/source/text_classification.mdx @@ -10,7 +10,7 @@ Let's train a model for classifying the sentiment of a movie review. The data sh in the following CSV format: ```csv -review,sentiment +text,target "this movie is great",positive "this movie is bad",negative . @@ -41,18 +41,6 @@ for chunk in pd.read_csv('example.csv', chunksize=chunk_size): i += 1 ``` -Once the data has been uploaded, you have to select the proper column mapping +## Columns -## Column Mapping - -![Column Mapping](https://raw.githubusercontent.com/huggingface/autotrain-advanced/main/static/text_classification_1.png) - -In our example, the text column is called `review` and the label column is called `sentiment`. -Thus, we have to select `review` for the text column and `sentiment` for the label column. -Please note that, if column mapping is not done correctly, the training will fail. - - -## Training - -Once you have uploaded the data, selected the column mapping, and set the hyperparameters (AutoTrain or Manual mode), you can add the jobs. -To start the training, click on the `Start Training` button. +Your CSV dataset must have two columns: `text` and `target`. 
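With the column-mapping step dropped from the docs above, each task now expects fixed column names, so a quick header check before upload may save a failed run. The following is a minimal sketch, assuming a local CSV and only the Python standard library. `REQUIRED_COLUMNS`, its task labels, and `missing_columns` are hypothetical names introduced for illustration and are not part of AutoTrain.

```python
import csv
import io

# Required columns per task, as stated in the docs changed by this diff.
# The dictionary keys are illustrative labels, not AutoTrain identifiers.
REQUIRED_COLUMNS = {
    "text_classification": {"text", "target"},
    "seq2seq": {"text", "target"},
    "tabular": {"id", "target"},
    "llm_sft": {"text"},
    "llm_reward": {"text", "rejected_text"},
    "llm_dpo": {"prompt", "text", "rejected_text"},
}


def missing_columns(csv_file, task):
    """Return the sorted list of required columns absent from the CSV header."""
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS[task] - present)


# Usage with an in-memory CSV that still uses the old header names.
# A real file would be opened with open(path, newline="", encoding="utf-8").
sample = io.StringIO('review,sentiment\n"this movie is great",positive\n')
print(missing_columns(sample, "text_classification"))  # prints ['target', 'text']
```

The check reads only the header row, so it stays cheap even on large files. It does not validate the cell contents.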
diff --git a/src/autotrain/cli/run_llm.py b/src/autotrain/cli/run_llm.py index 216e11677c..53bb2cb985 100644 --- a/src/autotrain/cli/run_llm.py +++ b/src/autotrain/cli/run_llm.py @@ -45,7 +45,7 @@ def register_subcommand(parser: ArgumentParser): }, { "arg": "--train_split", - "help": "Test dataset split to use", + "help": "Train dataset split to use", "required": False, "type": str, "default": "train", diff --git a/static/ui.png b/static/ui.png new file mode 100644 index 0000000000..e8fdc1a4d2 Binary files /dev/null and b/static/ui.png differ diff --git a/templates/index.html b/templates/index.html index 976ab36259..09d8e408e1 100644 --- a/templates/index.html +++ b/templates/index.html @@ -49,6 +49,13 @@