Welcome to ModelTrainSet, your one-stop shop for creating custom datasets and training machine learning models! Whether you're a data scientist, a machine learning engineer, or just someone who likes to play with big data and bigger models, ModelTrainSet has got your back!
ModelTrainSet is like a Swiss Army knife for your data needs. It can:
- 📥 Load data from various sources (JSON, CSV, Excel, XML, SQL, Git/Jira, Twitter)
- 🧹 Clean and process your data
- 🎨 Format your data for different ML tasks
- 🚀 Train models using the latest techniques
It's perfect for when you need to wrangle your data into shape and then teach a model to do tricks with it!
Before you hop on the ModelTrainSet express, make sure you have:
- Python 3.7+ installed (we're not cavemen, after all)
- Git (for version control and looking cool)
- Access to a Jira instance (if you're into that sort of thing)
- Linux for training (blame Triton)
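If you'd rather check your ticket before boarding, here is a minimal sketch that tests the prerequisites above using only the standard library. The function name `check_prerequisites` is made up for this example and is not part of ModelTrainSet; Jira access is left out because it depends on your own instance.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 7)):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # Compare (major, minor) against the minimum listed above.
        "python": sys.version_info[:2] >= min_python,
        # Git counts as available if an executable is found on PATH.
        "git": shutil.which("git") is not None,
        # Linux is listed above as a requirement for training.
        "linux": platform.system() == "Linux",
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```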
We've upgraded our luggage handling system! Now you can choose between the classic pip setup or our new first-class Conda/Mamba experience.
- If you haven't already, install Miniconda or Anaconda. For an even faster setup, install Mamba.
- Clone our luxury liner:
  git clone https://github.com/muddylemon/ModelTrainSet.git
  cd ModelTrainSet
- Create and activate your environment.
  Using Conda:
  conda env create -f environment.yml
  conda activate modeltrainset
  Or, for a faster setup with Mamba:
  mamba env create -f environment.yml
  mamba activate modeltrainset
- You're all set! Enjoy your first-class ML journey!
If you experience any problems with the automatic setup, you can try the following manual steps:
conda create --name modeltrainset python=3.10
conda activate modeltrainset
conda install pytorch cudatoolkit torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install xformers -c xformers
pip install bitsandbytes
pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"
pip install transformers datasets accelerate tqdm pyyaml nltk pandas openpyxl sqlalchemy gitpython jira python-dotenv peft trl
Replace `conda` with `mamba` in the above commands if you're using Mamba for faster installation.
If you prefer the classic experience, follow these steps:
- Clone this bad boy:
  git clone https://github.com/muddylemon/ModelTrainSet.git
  cd ModelTrainSet
- Set up your virtual environment (because we're responsible adults):
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the necessities:
  pip install -r requirements.txt
For detailed instructions on how to use ModelTrainSet, check out our comprehensive tutorial. It covers everything from creating datasets to training your own models!
- Fill In Missing Word style datasets
- TextTriplets style datasets
- Rewriter style datasets
- Text Completion style datasets
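To give a feel for the first style in the list, here is a rough sketch of how a fill-in-the-missing-word example could be built from a sentence. Everything in it is an assumption made for illustration: the function name, the `____` mask, and the `input`/`output` field names are not ModelTrainSet's confirmed schema.

```python
def make_fill_in_example(sentence, index, mask="____"):
    """Blank out the word at `index` and return a prompt/answer pair."""
    words = sentence.split()
    if not 0 <= index < len(words):
        raise IndexError("index must point at a word in the sentence")
    answer = words[index]
    masked = words[:index] + [mask] + words[index + 1:]
    # "input" and "output" are placeholder field names, not a confirmed schema.
    return {"input": " ".join(masked), "output": answer}


example = make_fill_in_example("every day is training day", 3)
print(example)
```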
Want to add a new stop on the ModelTrainSet line? Here's how:
- Create new loader, processor, or formatter classes in `dataset_creator/`.
- Add a new creator class in `dataset_creator/creators/`.
- Update `get_creator()` in `main.py` to recognize your new creation.
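The three steps above could look roughly like the sketch below. Treat it as a shape, not the real thing: the class names, method names (`load`, `process`, `format`, `create`), the `CREATORS` registry, and the `"uppercase"` style are all invented for this example, and the actual base classes and `get_creator()` signature in the repository may differ.

```python
class LineLoader:
    """Hypothetical loader: splits raw text into lines."""

    def load(self, source):
        return source.splitlines()


class StripProcessor:
    """Hypothetical processor: trims whitespace and drops empty lines."""

    def process(self, items):
        return [item.strip() for item in items if item.strip()]


class UppercaseFormatter:
    """Hypothetical formatter: pairs each line with its uppercase form."""

    def format(self, items):
        return [{"input": item, "output": item.upper()} for item in items]


class UppercaseCreator:
    """Hypothetical creator that chains loader, processor, and formatter."""

    def __init__(self):
        self.loader = LineLoader()
        self.processor = StripProcessor()
        self.formatter = UppercaseFormatter()

    def create(self, source):
        items = self.loader.load(source)
        items = self.processor.process(items)
        return self.formatter.format(items)


# Assumed registry; the real get_creator() in main.py may be organized differently.
CREATORS = {"uppercase": UppercaseCreator}


def get_creator(style):
    """Return a creator instance for the given style name."""
    try:
        return CREATORS[style]()
    except KeyError:
        raise ValueError(f"Unknown dataset style: {style!r}") from None


print(get_creator("uppercase").create(" all aboard \n\n next stop "))
```

Keeping the lookup in one dictionary means a new stop only needs one new entry, and an unknown style fails with a clear error instead of a silent `None`.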
For more details on contributing to ModelTrainSet, please read our contribution guide.
If you find yourself in a dark tunnel:
- Check your Python version (`python --version`).
- Make sure you've installed all the requirements (`pip install -r requirements.txt`).
- Double-check your config file. Typos are the bane of every data scientist's existence!
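If you suspect a requirement slipped off the train, a small helper like the one below can report which packages are not installed. It is a sketch under a few assumptions: the name `missing_requirements` is made up here, it relies on `importlib.metadata` (in the standard library from Python 3.8), it only checks that a package is present, not that its version matches, and it skips direct URL requirements.

```python
import re
from importlib import metadata  # standard library from Python 3.8


def missing_requirements(lines):
    """Return names from requirements-style lines that are not installed."""
    missing = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        # Skip blanks, pip options (-r, -e, ...), and direct URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        # Keep only the package name, dropping extras, versions, and markers.
        name = re.split(r"[\s<>=!~\[;@]", line, maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# The package name below is deliberately fake, so it shows up as missing.
sample = ["# core dependencies", "", "definitely-not-installed-pkg-xyz>=1.0"]
print(missing_requirements(sample))
```

To check the real file, pass it in as an open file object, for example `missing_requirements(open("requirements.txt"))`.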
Contributions are welcome! Whether you're fixing bugs, adding features, or just making our jokes funnier, we'd love to have you on board! Check out our contribution guide to get started.
This project is licensed under the MIT License - see the LICENSE file for details. (It's basically "use it however you want, just don't blame us if something goes wrong".)
Remember, in the world of ModelTrainSet, every day is training day! Now go forth and model responsibly! 🚂💨