Repository for the paper "On the data requirements of probing"
This repo was developed with the following package versions:
```
transformers==4.3.2
wandb==0.10.30
torch==1.8.1
torchtext==0.9.1
torchvision==0.9.1
spacy==3.0.6
tensorboard==2.4.1
sentence-transformers==1.1.1
```
- Preprocess the embeddings:

  ```
  python preprocess_data.py
  ```

  There are currently preprocessors for SentEval, CATS, and oLMpics; the paper only reports experiments on SentEval (fixed-class problems). For the corrupted models, use `preprocess_corrupted_bert.py` instead.
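  A minimal sketch of what this step amounts to, assuming a frozen encoder from `sentence-transformers` (the model name, sentences, and output path are illustrative, not the repo's exact `preprocess_data.py`):

  ```python
  import torch
  from sentence_transformers import SentenceTransformer

  # Example pair in the style of the SentEval bigram_shift task.
  sentences = ["This is the original sentence .",
               "This the is original sentence ."]

  model = SentenceTransformer("bert-base-uncased")      # frozen encoder
  embeddings = model.encode(sentences, convert_to_tensor=True)
  torch.save(embeddings, "embeddings.pt")               # later consumed by the probe
  ```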
- Run the probing experiments (on Slurm):

  ```bash
  size_per_class=128
  project_path=<path_to_github>/probing_dataset

  python run_senteval.py \
      --project_path ${project_path} \
      --model bert --task bigram_shift --seed 0 \
      --even_distribute --train_size_per_class ${size_per_class} --val_size_per_class ${size_per_class} \
      --lr_list 1e-4 5e-4 1e-3 5e-3 1e-2 \
      --bs_list 8 16 32 64 \
      --use_cuda --probe_metric "others" \
      --wandb_id_file_path "/checkpoint/$USER/$SLURM_JOB_ID/wandb_id.txt" \
      --checkpoint "/checkpoint/$USER/$SLURM_JOB_ID/checkpoint.ckpt" \
      --ray_tune_result_path "${project_path}/results" \
      --resume
  ```
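  For orientation, here is a minimal sketch of the probing step itself, i.e. a linear classifier trained on frozen embeddings (illustrative only, with placeholder data; the repo's `run_senteval.py` and `engine.py` add the hyperparameter search over `lr_list`/`bs_list`, checkpointing, and wandb logging on top of this):

  ```python
  import torch
  import torch.nn as nn

  X = torch.randn(256, 768)            # placeholder frozen embeddings
  y = torch.randint(0, 2, (256,))      # placeholder binary labels

  probe = nn.Linear(768, 2)            # linear probe over two classes
  opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

  for _ in range(20):                  # a few full-batch gradient steps
      opt.zero_grad()
      loss = nn.functional.cross_entropy(probe(X), y)
      loss.backward()
      opt.step()

  acc = (probe(X).argmax(-1) == y).float().mean().item()
  print(f"train accuracy: {acc:.3f}")
  ```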
- Download the probing results from [wandb.ai](https://wandb.ai). The logged results include both the performance metrics and the test predictions.
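  A minimal sketch of pulling these runs programmatically with the wandb public API (the `entity/project` path is a placeholder; substitute your own):

  ```python
  import wandb

  api = wandb.Api()
  runs = api.runs("your-entity/probing_dataset")  # placeholder entity/project
  for run in runs:
      # run.config holds the CLI arguments; run.summary holds final metrics.
      print(run.name, run.config.get("task"), run.summary)
  ```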
- Head to the corresponding ipynb in the `notebooks` directory to further analyze the results:
  - `theory_vs_experiments.ipynb`: Experiment 4.2
  - `power_curves.ipynb`: Experiments 4.3 - 4.6
Helper files:

- `learning_theory.py`
- `power_analysis.py`: Based on the repo of Card et al. (2020).
- `load_data.py`: Data-loading utilities.
- `BayesianLayers.py`: Used for variational MDL probing.
- `engine.py`: Engine for probing classification.
- `notebooks/worse_finetuning.ipynb`: Notebook for corruption-pretraining Transformer LMs.
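For context, here is a minimal sketch of the simulation-based power analysis idea from Card et al. (2020) that `power_analysis.py` builds on (the accuracies, test size, and test statistic are illustrative assumptions, not the repo's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def power_accuracy_diff(acc_a=0.80, acc_b=0.75, n_test=500, n_sims=2000):
    """Estimate the power to detect a gap between two probes' accuracies:
    simulate test sets of size n_test and count how often a two-proportion
    z-test rejects at alpha = 0.05 (two-sided)."""
    detected = 0
    for _ in range(n_sims):
        a = rng.binomial(n_test, acc_a)   # correct predictions, probe A
        b = rng.binomial(n_test, acc_b)   # correct predictions, probe B
        p_pool = (a + b) / (2 * n_test)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n_test)
        detected += abs(a - b) / n_test / se > 1.96
    return detected / n_sims

print(power_accuracy_diff())  # well below the conventional 0.8 power target
```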
Citation:

```bibtex
@inproceedings{zhu_etal_data_2022,
  title = {{On the data requirements of probing}},
  author = {Zhu, Zining and Wang, Jixuan and Li, Bai and Rudzicz, Frank},
  year = {2022},
  url = {https://aclanthology.org/2022.findings-acl.326/},
  booktitle = {{Findings of the Association for Computational Linguistics: ACL 2022}},
}
```