# Fine-tuning BERT on a language modeling task

There are two ways to train or fine-tune BERT- or GPT-like models:

  • on a supervised downstream task
  • in an unsupervised way on a raw text corpus

Supervised training / fine-tuning requires a labeled ground-truth dataset.

=> We're going to work on the unsupervised approach, i.e. masked language modeling on a raw corpus.

## Model fine-tuning

We can leverage the masked language modeling example script provided by Hugging Face, `run_mlm.py`.

Most of the code within that script is devoted to:

  • handling both the PyTorch and TensorFlow versions

  • passing arguments via 3 dataclasses (see the parsing sketch after this list):

    • ModelArguments: defined in the script.

      Arguments pertaining to which model, config and tokenizer we are going to fine-tune.

    • DataTrainingArguments: defined in the script.

      Arguments pertaining to what data we are going to feed our model for training and evaluation.

    • TrainingArguments: imported from transformers.

      Arguments pertaining to the actual training / fine-tuning of the model.
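
Concretely, the example scripts parse these three dataclasses with `HfArgumentParser`. A minimal sketch of that pattern (the fields shown here are an illustrative subset, not the full definitions from `run_mlm.py`):

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class ModelArguments:
    # Which model / config / tokenizer to fine-tune (illustrative subset of fields).
    model_name_or_path: Optional[str] = field(default=None)


@dataclass
class DataTrainingArguments:
    # Which data to feed the model for training and evaluation (illustrative subset).
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)


# One parser handles all three dataclasses; command-line flags map onto their fields.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
```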

The core of the script is organized as follows (a condensed end-to-end sketch follows this list):

  1. loading the data through the `datasets` library with

    `load_dataset(data_args.dataset_name, data_args.dataset_config_name)`

  2. loading the appropriate config, tokenizer and model

    • config = AutoConfig.from_pretrained
    • tokenizer = AutoTokenizer.from_pretrained
    • model = AutoModelForMaskedLM.from_pretrained when fine-tuning a pretrained checkpoint (the script only falls back to AutoModelForMaskedLM.from_config(config) when training from scratch)
  3. tokenizing the data

    For each example, the tokenizer returns 3 elements:

    • input_ids: the indices of the tokens in the tokenizer's vocabulary
    • token_type_ids: the segment ids used by BERT
    • attention_mask: a mask such as [1,1,1,1,1,0,0,0,0] distinguishing real tokens from padding
  4. the data collator handles the random masking of tokens

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)

  5. and finally the training / fine-tuning takes place

    • the trainer is instantiated with the model, training arguments, tokenized datasets and data collator: trainer = Trainer(...)
    • the training runs with trainer.train()
  6. the model is saved with trainer.save_model()
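
Putting steps 1-6 together, here is a condensed, self-contained sketch of the same flow (assuming `bert-base-uncased`, hypothetical local `train.txt` / `valid.txt` files and default hyperparameters; the real `run_mlm.py` adds argument parsing, caching, evaluation and many more options):

```python
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. load the data (here from hypothetical local text files; a dataset name from the hub also works)
raw_datasets = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})

# 2. load the config, tokenizer and pretrained model
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased", config=config)

# 3. tokenize the data: each example becomes input_ids, token_type_ids and attention_mask
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = raw_datasets.map(tokenize, batched=True, remove_columns=["text"])

# 4. the data collator randomly masks 15% of the tokens at batch time
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 5. training / fine-tuning
training_args = TrainingArguments(output_dir="results/", max_steps=5000, save_steps=1000)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()

# 6. save the final model (and tokenizer) to output_dir
trainer.save_model()
```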

## Fine-tuning

To fine-tune on our own data, specify the path to the training file and to the validation file:

    python run_mlm.py \
        --model_name_or_path bert-base-uncased \
        --max_seq_length 128 \
        --line_by_line \
        --train_file "path_to_train_file" \
        --validation_file "path_to_validation_file" \
        --do_train \
        --do_eval \
        --max_steps 5000 \
        --save_steps 1000 \
        --output_dir "results/"

Increasing `--save_steps` makes the script write checkpoints less often, which saves disk space (see Notes below).
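
Once training finishes, the fine-tuned model can be reloaded from the output directory for a quick sanity check. A small sketch, assuming the final model and tokenizer were saved to `results/` as in the command above:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# load the fine-tuned weights and tokenizer written to results/
tokenizer = AutoTokenizer.from_pretrained("results/")
model = AutoModelForMaskedLM.from_pretrained("results/")

# probe the model with a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Paris is the {tokenizer.mask_token} of France."))
```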

## Notes

  • DistilBERT is smaller than BERT and can be used instead for faster training (e.g. --model_name_or_path distilbert-base-uncased)
  • every save_steps steps, a checkpoint of the model is saved in a new directory under output_dir. This can quickly eat up all the space on the disk.