There are two ways to train or fine-tune BERT- or GPT-like models:
- on a supervised downstream task
- in an unsupervised way on a corpus
Supervised training / fine-tuning requires a ground-truth dataset.
=> We're going to work on the unsupervised approach.
We can leverage the scripts provided by Hugging Face:
- for Masked Language Models:
  - run_mlm.py
  - BERT, RoBERTa, DistilBERT and others
- for Causal Language Models:
  - run_clm.py
  - GPT, GPT-2
- for Permuted Language Models:
  - run_plm.py
  - XLNet
Most of the code within the scripts cited above is devoted to:
- handling both the PyTorch and TensorFlow versions
- passing arguments via 3 classes (see the parsing sketch just below):
  - ModelArguments: class defined in the script.
    Arguments pertaining to which model, config and tokenizer we are going to fine-tune.
  - DataTrainingArguments: class defined in the script.
    Arguments pertaining to what data we are going to feed the model for training and eval.
  - TrainingArguments: class imported from the Hugging Face library.
    Arguments pertaining to the actual training / fine-tuning of the model.
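To make the argument plumbing concrete, here is a minimal parsing sketch following the same dataclass pattern as run_mlm.py; the fields shown are a small illustrative subset, not the full set used by the real script:

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class ModelArguments:
    # which model / config / tokenizer we are going to fine-tune
    model_name_or_path: Optional[str] = field(default=None)


@dataclass
class DataTrainingArguments:
    # what data we are going to feed the model for training and eval
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)
    max_seq_length: int = field(default=128)


# TrainingArguments (imported from transformers) covers the actual training loop settings
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()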
The core of the script is organized along the following steps (a condensed end-to-end sketch follows this list):
- loading the data through the datasets module with
  load_dataset(data_args.dataset_name, data_args.dataset_config_name)
- loading the appropriate config, tokenizer and model
  config = AutoConfig.from_pretrained(...)
  tokenizer = AutoTokenizer.from_pretrained(...)
  model = AutoModelForMaskedLM.from_config(config)  # or .from_pretrained(...) when starting from a checkpoint
- tokenizing the data
  For each example, the tokenizer returns:
  - input_ids: each token's index within the vocabulary
  - an attention_mask such as [1,1,1,1,1,0,0,0,0], marking real tokens vs. padding
  - depending on the model and options, extra fields such as token_type_ids or special_tokens_mask
- the data collator handles the random masking of tokens
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)
- and finally the training / fine-tuning takes place
  - the trainer is instantiated
    trainer = Trainer(...)
  - the training runs
    trainer.train()
  - the model is saved
    trainer.save_model()
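Putting the steps above together, here is a condensed, self-contained sketch of the MLM fine-tuning flow. It is a simplified stand-in for run_mlm.py, not the full script: the dataset, checkpoint and hyperparameters below are placeholders chosen to make the example runnable.

from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # any masked-LM checkpoint works here

# 1. load the data (a public dataset here; see below for local train/validation files)
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2. load the appropriate config, tokenizer and model
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, config=config)

# 3. tokenize the data: adds input_ids and attention_mask columns
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=["text"]
)

# 4. the data collator randomly masks 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 5. instantiate the Trainer, train, and save the fine-tuned model
training_args = TrainingArguments(output_dir="results/", max_steps=5000, save_steps=1000)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("results/")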
To fine-tune on our own data, specify the path to the training file and to the validation file (a sketch of how the script loads these files follows the notes below):
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --max_seq_length 128 \
    --line_by_line \
    --train_file "path_to_train_file" \
    --validation_file "path_to_validation_file" \
    --do_train \
    --do_eval \
    --max_steps 5000 \
    --save_steps 1000 \
    --output_dir "results/"
- distilbert-base-uncased is a lighter alternative to bert-base-uncased (DistilBERT is smaller than BERT)
- every save_steps steps, a checkpoint is saved in a new directory under output_dir. This can quickly eat up all the space on the disk, so increase --save_steps to write fewer checkpoints.
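Under the hood, --train_file / --validation_file end up in a load_dataset call on the generic "text" loader (the script picks the loader from the file extension, and --line_by_line treats each line as a separate training example). A rough sketch of that loading step, with hypothetical local paths:

from datasets import load_dataset

# hypothetical local paths; replace them with your own files
data_files = {
    "train": "path_to_train_file",
    "validation": "path_to_validation_file",
}
raw_datasets = load_dataset("text", data_files=data_files)

# each example is a dict with a single "text" field holding one line of the file
print(raw_datasets["train"][0])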