Repository containing code for training/finetuning HuggingFace transformers with wandb logging.
I created this repository to fine-tune HuggingFace LMs in a distributed environment with DeepSpeed (something an .ipynb notebook does not support).
If you're coming from the paper: Generation, Distillation and Evaluation of Motivational Interviewing-Style Reflections with a Foundational Language Model (https://aclanthology.org/2024.eacl-long.75.pdf), welcome!
This code was used to train and test the distilled reflection models described in that paper. Feel free to email me at [email protected] if you have any questions.
- Create a config in the configs/ directory for the type of model you want to train (use an existing config as a template)
- Edit the scripts/run.sh bash script to point to your config and any desired flags for trainHFDS.py (I use run.sh to set up hyperparameter sweeps so I can queue all the jobs in the background)
- The script can train on a single GPU or distributed across multiple GPUs with DeepSpeed
- If you want to use wandb, set the WANDB_KEY environment variable to your wandb API key (see the sketch after these steps)
- Run run.sh with the appropriate Python environment and get to training!
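For context on the WANDB_KEY step, here is a minimal sketch of how a training script can pick up that variable and log in to wandb. The exact handling in this repository may differ; this only illustrates the idea.

```python
# Sketch of consuming the WANDB_KEY environment variable before training.
# Illustrative only; the repository's scripts may handle this differently.
import os

import wandb

key = os.environ.get("WANDB_KEY")
if key is not None:
    wandb.login(key=key)   # authenticate so runs/sweeps are logged to wandb
else:
    print("WANDB_KEY not set; skipping wandb logging")
```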
YAML files specifying model setup, hyperparameters, training, validation, wandb runs/sweeps, and inference.
This directory contains many configs I have used for my research at UofT. Please use them as examples for your own training.
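As a rough illustration, a config like these can be loaded in Python with PyYAML. The file name and keys below are hypothetical and only show the general pattern, not the exact schema used by the configs in this repository.

```python
# Hypothetical example of loading a training config from configs/.
# The key names (model_name, learning_rate) are assumptions for illustration.
import yaml

with open("configs/example_config.yaml") as f:
    config = yaml.safe_load(f)

print(config.get("model_name"))      # e.g. a HuggingFace model identifier
print(config.get("learning_rate"))   # e.g. an optimizer hyperparameter
```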
Directory to store training/validation data. A .gitkeep file is included here since most of the data we fine-tune with is private.
Directory to store trained/untrained model weights. A .gitkeep file is included here since the weights are too large to commit and do not belong in an open-source repository.
Python notebook (.ipynb) files that I use to run inference on models during testing. This code can be used for reference but is in this repository mainly for ease of access.
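For reference, the kind of inference these notebooks perform looks roughly like the following sketch. The model name and prompt are placeholders, not the repository's actual setup.

```python
# Minimal generation sketch; model name and prompt are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Example prompt for testing a fine-tuned model:",
                   return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```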
The main body of this repository. Here is a file-by-file breakdown:
A language model fine-tuning script with a complete pipeline. This script:
- checks CUDA availability
- loads a .yaml config and dataset
- initializes the tokenizer
- tokenizes the datasets
- initializes a HuggingFace model (with optimizer, scheduler, and dataloader)
- initializes a HuggingFace trainer
- initializes the wandb logger
- trains and saves the model weights
This is all done via a main driver function and is meant to be run through the scripts/run.sh script. A simplified sketch of the pipeline is shown below.
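The sketch below is illustrative only: the config keys, file paths, and defaults are assumptions, and trainHFDS.py itself adds DeepSpeed integration, wandb setup, and other details not shown here.

```python
# Minimal sketch of a config-driven HuggingFace fine-tuning pipeline.
# Not the exact trainHFDS.py code; config keys and paths are assumptions.
import torch
import yaml
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)


def main(config_path="configs/example_config.yaml"):
    print("CUDA available:", torch.cuda.is_available())             # CUDA check

    with open(config_path) as f:                                     # load config
        cfg = yaml.safe_load(f)

    tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])     # tokenizer
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    dataset = load_dataset("json", data_files=cfg["train_file"])     # dataset

    def tokenize(batch):                                              # tokenization
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True)

    model = AutoModelForCausalLM.from_pretrained(cfg["model_name"])  # model

    args = TrainingArguments(                                         # trainer setup
        output_dir=cfg["output_dir"],
        num_train_epochs=cfg.get("epochs", 3),
        per_device_train_batch_size=cfg.get("batch_size", 8),
        report_to="wandb",                                            # wandb logging
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()                                                   # train
    trainer.save_model(cfg["output_dir"])                             # save weights


if __name__ == "__main__":
    main()
```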
Helper functions used by trainHFDS.py (sketched below):
- config loading
- setting all seeds for training
- formatting time for logging
- CUDA and GPU diagnostics
- wandb initialization
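Below is a hedged sketch of what such helpers often look like. The function names and signatures are assumptions, not the repository's exact API.

```python
# Illustrative versions of the kinds of helpers described above.
# Names and signatures are assumptions, not the repository's exact API.
import random
import time

import numpy as np
import torch
import wandb
import yaml


def load_config(path):
    """Load a YAML training config into a dict."""
    with open(path) as f:
        return yaml.safe_load(f)


def set_all_seeds(seed):
    """Seed Python, NumPy, and PyTorch (CPU and GPU) for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def format_time(seconds):
    """Format a duration in seconds as hh:mm:ss for log messages."""
    return time.strftime("%H:%M:%S", time.gmtime(seconds))


def cuda_diagnostics():
    """Print basic CUDA/GPU information."""
    print("CUDA available:", torch.cuda.is_available())
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")


def init_wandb(cfg):
    """Start a wandb run (assumes login via WANDB_KEY has already happened)."""
    return wandb.init(project=cfg.get("wandb_project", "hf-finetuning"),
                      config=cfg)
```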
A separate training script created for fine-tuning GPT-2. This code is similar to trainHFDS.py but does not use a HuggingFace Trainer class; the training loop is written manually.
A class declaration used by scripts/gptTrain.py for fine-tuning GPT-2.
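A rough sketch of this pattern (a small dataset class plus a manually written training loop) is shown below. The dataset class, hyperparameters, and data format are assumptions for illustration, not the exact gptTrain.py code.

```python
# Minimal sketch of manual GPT-2 fine-tuning without the HuggingFace Trainer.
# Dataset class, hyperparameters, and data format are illustrative only.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          get_linear_schedule_with_warmup)


class TextDataset(Dataset):
    """Wraps a list of strings as fixed-length token id tensors."""
    def __init__(self, texts, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, max_length=max_length,
                                   padding="max_length", return_tensors="pt")

    def __len__(self):
        return self.encodings["input_ids"].size(0)

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}


def train(texts, epochs=3, lr=5e-5, batch_size=4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

    loader = DataLoader(TextDataset(texts, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=epochs * len(loader))

    model.train()
    for epoch in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # For causal LM, labels are the input ids; padding positions
            # could additionally be masked out of the loss.
            outputs = model(**batch, labels=batch["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")

    model.save_pretrained("models/gpt2-finetuned")
```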
The main script for launching model training. Training jobs can be queued here using either single- or multi-GPU setups.