There are two ways to train or fine-tune BERT- or GPT-like models:
- on a supervised downstream task
- in an unsupervised way on a corpus
Supervised training / fine-tuning requires a ground-truth dataset.
=> We're going to work on the unsupervised approach.
We can leverage the scripts provided by Hugging Face:
- for Masked Language Models:
  - run_mlm.py
  - BERT, RoBERTa, DistilBERT and others
- for Causal Language Models:
  - run_clm.py
  - GPT, GPT-2
- for Permuted Language Models:
  - run_plm.py
  - XLNet
Most of the code within the scripts cited above is devoted to:
- handling both the PyTorch and TensorFlow versions
- passing arguments via 3 classes (see the parsing sketch just below):
  - ModelArguments: class defined in the script.
    Arguments pertaining to which model, config and tokenizer we are going to fine-tune.
  - DataTrainingArguments: class defined in the script.
    Arguments pertaining to what data we are going to feed the model for training and eval.
  - TrainingArguments: class imported from the Hugging Face library.
    Arguments pertaining to the actual training / fine-tuning of the model.
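To make the argument plumbing concrete, here is a minimal parsing sketch following the same dataclass pattern as run_mlm.py; the fields shown are a small illustrative subset, not the full set used by the real script:

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class ModelArguments:
    # which model / config / tokenizer we are going to fine-tune
    model_name_or_path: Optional[str] = field(default=None)


@dataclass
class DataTrainingArguments:
    # what data we are going to feed the model for training and eval
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)
    max_seq_length: int = field(default=128)


# TrainingArguments (imported from transformers) covers the actual training loop settings
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()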
The core of the script is organized along the following steps (a condensed end-to-end sketch follows this list):
- loading the data through the datasets module with
  load_dataset(data_args.dataset_name, data_args.dataset_config_name)
- loading the appropriate config, tokenizer and model
  config = AutoConfig.from_pretrained(...)
  tokenizer = AutoTokenizer.from_pretrained(...)
  model = AutoModelForMaskedLM.from_config(config)  # or .from_pretrained(...) when starting from a checkpoint
- tokenizing the data
  For each example, the tokenizer returns:
  - input_ids: each token's index within the vocabulary
  - an attention_mask such as [1,1,1,1,1,0,0,0,0], marking real tokens vs. padding
  - depending on the model and options, extra fields such as token_type_ids or special_tokens_mask
- the data collator handles the random masking of tokens
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)
- and finally the training / fine-tuning takes place
  - the trainer is instantiated
    trainer = Trainer(...)
  - the training runs
    trainer.train()
  - the model is saved
    trainer.save_model()
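Putting the steps above together, here is a condensed, self-contained sketch of the MLM fine-tuning flow. It is a simplified stand-in for run_mlm.py, not the full script: the dataset, checkpoint and hyperparameters below are placeholders chosen to make the example runnable.

from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # any masked-LM checkpoint works here

# 1. load the data (a public dataset here; see below for local train/validation files)
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2. load the appropriate config, tokenizer and model
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, config=config)

# 3. tokenize the data: adds input_ids and attention_mask columns
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=["text"]
)

# 4. the data collator randomly masks 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 5. instantiate the Trainer, train, and save the fine-tuned model
training_args = TrainingArguments(output_dir="results/", max_steps=5000, save_steps=1000)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("results/")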
To fine-tune on our own data, specify the path to the training file and to the validation file (a sketch of how the script loads these files follows the notes below):
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --max_seq_length 128 \
    --line_by_line \
    --train_file "path_to_train_file" \
    --validation_file "path_to_validation_file" \
    --do_train \
    --do_eval \
    --max_steps 5000 \
    --save_steps 1000 \
    --output_dir "results/"
- distilbert-base-uncased is a lighter alternative to bert-base-uncased (DistilBERT is smaller than BERT)
- every save_steps steps, a checkpoint is saved in a new directory under output_dir. This can quickly eat up all the space on the disk, so increase --save_steps to write fewer checkpoints.
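Under the hood, --train_file / --validation_file end up in a load_dataset call on the generic "text" loader (the script picks the loader from the file extension, and --line_by_line treats each line as a separate training example). A rough sketch of that loading step, with hypothetical local paths:

from datasets import load_dataset

# hypothetical local paths; replace them with your own files
data_files = {
    "train": "path_to_train_file",
    "validation": "path_to_validation_file",
}
raw_datasets = load_dataset("text", data_files=data_files)

# each example is a dict with a single "text" field holding one line of the file
print(raw_datasets["train"][0])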