What is learned from books always feels shallow; to truly understand, one must practice it oneself.
Step 0. PyTorch warmup
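As a warmup, a minimal sketch of an end-to-end PyTorch training loop on random data; the model, shapes, and hyper-parameters below are arbitrary placeholders, not part of the plan itself.

```python
# Minimal PyTorch warmup: define a model and run a training loop on random data.
# All shapes and hyper-parameters are arbitrary placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(16, 32)              # fake batch of 16 samples
    y = torch.randint(0, 10, (16,))      # fake labels
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```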
Step 1. Transformer (1 week)
targets
- a Transformer implementation from scratch
requirements
- the PyTorch Transformer API (fine-grained modules)
- benchmarks against PyTorch's implementation (see the attention sketch after this step)
- the Transformer paper
outputs
- a Transformer matching the reported performance
- implementation notes/wiki
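One way to check a from-scratch implementation against the PyTorch API is to compare individual components numerically. A minimal sketch for scaled dot-product attention follows; the tensor shapes are illustrative assumptions.

```python
# Scaled dot-product attention from scratch, checked against PyTorch's built-in.
# Shapes (batch=2, heads=4, seq=8, head_dim=16) are arbitrary placeholders.
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # (batch, heads, seq, head_dim) -> (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
ours = attention(q, k, v)
ref = F.scaled_dot_product_attention(q, k, v)   # available in PyTorch >= 2.0
print(torch.allclose(ours, ref, atol=1e-5))
```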
Step 2. core models (1 week)
targets
- the BERT implementation
- optimization skills (1 GPU + gradient accumulation; see the sketch after this step)
requirements
- data pre-processing pipeline:
- Huggingface tokenizers + raw data --> pre-processed data --> truncated samples --> MLM/NSP training data (JSON)
- Huggingface BERT-base retrained on 10 million samples (for reference, full BERT-base pre-training takes about 4 days on 8 V100 GPUs), with learning curves and MLM/NSP accuracies
- 1 GPU with small batch sizes or gradient accumulation
- 8 GPUs with the official BERT setting
- the BERT paper
outputs
- a BERT-base model matching the reported benchmark performance
- implementation notes/wiki
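For the single-GPU setting, a minimal sketch of gradient accumulation, which emulates a large effective batch by summing gradients over several small micro-batches before each optimizer step; the model, data, and hyper-parameters are placeholders.

```python
# Gradient accumulation: effective batch = accum_steps x micro-batch size.
# Model, data, and hyper-parameters below are arbitrary placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 8                          # 8 micro-batches per optimizer step

optimizer.zero_grad()
for step in range(80):
    x = torch.randn(4, 128)              # small micro-batch that fits in memory
    y = torch.randint(0, 2, (4,))
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()      # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```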
Step 3. core peripherals (2 weeks)
targets
- strategies for building vocabularies (WordPiece / SentencePiece / BPE tokenizers; see the sketch after this step)
- strategies for positional embeddings
- masking, sampling, ...
requirements
- raw data
- Huggingface tokenizer API
- data preprocessing papers
outputs
- data pre-processing pipelines
- implementation notes/wiki
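A minimal sketch of building a WordPiece vocabulary from raw text with the Huggingface tokenizers library; the corpus file name, vocabulary size, and special tokens are placeholder assumptions.

```python
# Train a WordPiece vocabulary on raw text with the Huggingface `tokenizers` library.
# "corpus.txt" and the vocabulary size are placeholder assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("wordpiece.json")

print(tokenizer.encode("a quick tokenizer sanity check").tokens)
```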
Step 4. fine tune (1 week)
targets
- fine-tuning skills
requirements
- GLUE evaluation toolkits (see the fine-tuning sketch after this step)
outputs
- fine-tuned BERTs matching the reported GLUE performance
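A minimal sketch of fine-tuning BERT on one GLUE task (SST-2 here) with the Huggingface transformers and datasets libraries; the checkpoint, task, and hyper-parameters are placeholder assumptions.

```python
# Fine-tune BERT on a GLUE task (SST-2) with Huggingface transformers + datasets.
# Checkpoint name, task, and hyper-parameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(output_dir="sst2-bert", per_device_train_batch_size=32,
                         learning_rate=2e-5, num_train_epochs=3)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
print(trainer.evaluate())
```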
Step 5. fm with decoders (2 weeks)
targets
- fm with decoders (GPT, UniLM, BART, T5)
requirements
- raw data
- the Huggingface API (see the sketch after this step)
outputs
- fm models matching the reported benchmark performance
- implementation notes/wiki
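A minimal sketch for getting started with a decoder-style model through the Huggingface API, here GPT-2 text generation; the checkpoint, prompt, and decoding parameters are placeholder assumptions.

```python
# Load a decoder-only model (GPT-2) through the Huggingface API and generate text.
# Checkpoint, prompt, and decoding parameters are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```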
more fm models
- XLNet (different pre-training objectives)
- TinyBERT, ALBERT (parameter sharing and compression)
- RoBERTa (data scale-up)
more pre-training, fine-tuning tricks
- learning rates (layer-wise learning rates, warmup; see the sketch after this list)
- training under resource constraints (early exit, gradient accumulation to emulate larger batch sizes)
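A minimal sketch of layer-wise learning rates plus linear warmup for BERT fine-tuning; the decay factor, step counts, and checkpoint are placeholder assumptions, and only the encoder layers and classifier are grouped here for brevity.

```python
# Layer-wise learning rates (lower layers get more decayed LRs) + linear warmup,
# sketched for a Huggingface BERT model. Decay factor, warmup steps, and total
# steps are placeholder assumptions; the pooler is omitted for brevity.
import torch
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
base_lr, decay = 2e-5, 0.95
num_layers = len(model.bert.encoder.layer)            # 12 for BERT-base

groups = [{"params": model.bert.embeddings.parameters(),
           "lr": base_lr * decay ** num_layers}]      # embeddings: most decayed
for i, layer in enumerate(model.bert.encoder.layer):
    groups.append({"params": layer.parameters(),
                   "lr": base_lr * decay ** (num_layers - 1 - i)})
groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(groups, lr=base_lr)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500,
                                            num_training_steps=10_000)
# inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```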
more data pre-processing
- backdoor injection