This repository contains the code for the EMNLP 2022 paper *Sparse Teachers Can Be Dense with Knowledge*.
**************************** Updates ****************************
- 19/10/2022: We released our paper and code. Check it out!
Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.
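One natural reading of this interpolation is a convex combination of the two scores; the exact definition is given in the paper, so the following is only a hedged sketch, with the tradeoff weight corresponding to the `--lam` argument described later in this README:

```latex
% Hedged sketch, not necessarily the paper's exact formula: the knowledgeable
% score of a teacher parameter \theta as a convex combination of its
% expressiveness score E(\theta) and student-friendliness score F(\theta),
% traded off by \lambda.
\mathrm{Knowledgeable}(\theta) = \lambda \, E(\theta) + (1 - \lambda) \, F(\theta)
```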
## Requirements

- PyTorch
- Numpy
- Transformers
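A minimal way to install these is shown below; exact versions are not pinned here, so treat this as a sketch and adjust for your environment:

```bash
# Minimal environment setup; pin versions as needed for your setup.
pip install torch numpy transformers
```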
## Dataset

Get the GLUE data through the link and put it into the corresponding directories. For example, the MRPC dataset should be placed in `datasets/mrpc`.
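Assuming each task follows the same pattern as `datasets/mrpc` (only that path is stated explicitly, so the other folder names are an assumption), the layout could be prepared roughly like this:

```bash
# Hypothetical layout: one folder per GLUE task under datasets/.
# Only datasets/mrpc is confirmed above; the rest follow the same pattern.
mkdir -p datasets/{rte,mrpc,stsb,sst2,qnli,qqp,mnli}
```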
## Training

Training is carried out through several scripts. We provide example scripts as follows.
### Finetuning

We provide an example of finetuning `bert-base-uncased` on RTE in `scripts/run_finetuning_rte.sh`. We explain some important arguments below; an example invocation follows the list:
- `--model_type`: variant to use, should be `ft` in this case.
- `--model_path`: pretrained language model to start with, should be `bert-base-uncased` in this case and can be others as you like.
- `--task_name`: task to use, should be chosen from `rte`, `mrpc`, `stsb`, `sst2`, `qnli`, `qqp`, `mnli`, and `mnlimm`.
- `--data_type`: input format to use, defaults to `combined`.
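You can run the provided script directly, or adapt a command along the lines of the sketch below. The Python entry point name (`run_finetuning.py`) is hypothetical; only the flags above are documented, so check the script for the actual command:

```bash
# Run the provided example script:
bash scripts/run_finetuning_rte.sh

# Or, a hedged sketch of an equivalent call (entry point name is hypothetical):
python run_finetuning.py \
    --model_type ft \
    --model_path bert-base-uncased \
    --task_name rte \
    --data_type combined
```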
We also provide finetuned checkpoints from `bert-base-uncased` and `bert-large-uncased` as follows:
Model | Checkpoint | Model | Checkpoint |
---|---|---|---|
bert-base-rte | huggingface | bert-large-rte | huggingface |
bert-base-mrpc | huggingface | bert-large-mrpc | huggingface |
bert-base-stsb | huggingface | bert-large-stsb | huggingface |
bert-base-sst2 | huggingface | bert-large-sst2 | huggingface |
bert-base-qnli | huggingface | bert-large-qnli | huggingface |
bert-base-qqp | huggingface | bert-large-qqp | huggingface |
bert-base-mnli | huggingface | bert-large-mnli | huggingface |
bert-base-mnlimm | huggingface | bert-large-mnlimm | huggingface |
### Pruning

We provide an example of pruning a finetuned checkpoint on RTE in `scripts/run_pruning_rte.sh`. The arguments should be self-explanatory.
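To reproduce this step, running the provided script should suffice:

```bash
# Prune the RTE-finetuned checkpoint using the provided example script.
bash scripts/run_pruning_rte.sh
```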
### Distillation

We provide an example of distilling a finetuned teacher into a layer-dropped or parameter-pruned student on RTE in `scripts/run_distillation_rte.sh`. We explain some important arguments below; an example invocation follows the list:
- `--model_type`: variant to use, should be `kd` in this case.
- `--teacher_model_path`: teacher model to use, should be the path to the finetuned teacher checkpoint.
- `--student_model_path`: student model to initialize, should be the path to the pruned/finetuned teacher checkpoint, depending on how you would like to initialize the student.
- `--student_sparsity`: student sparsity, should be set if you would like to use a parameter-pruned student, e.g., 70; otherwise, leave this argument blank.
- `--student_layer`: student layer count, should be set if you would like to use a layer-dropped student, e.g., 4.
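As above, you can run the provided script, or adapt a command like the sketch below (the entry point `run_distillation.py` is hypothetical; the flags and the placeholder paths illustrate a parameter-pruned student):

```bash
# Run the provided example script:
bash scripts/run_distillation_rte.sh

# Or, a hedged sketch with a parameter-pruned student (entry point name is
# hypothetical -- check the script for the actual one):
python run_distillation.py \
    --model_type kd \
    --teacher_model_path path/to/finetuned-teacher \
    --student_model_path path/to/pruned-teacher \
    --student_sparsity 70
```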
### Teacher Sparsification

We provide an example of sparsifying the teacher based on the student on RTE in `scripts/run_sparsification_rte.sh`. We explain some important arguments below; an example invocation follows the list:
- `--model_type`: variant to use, should be `kd` in this case.
- `--teacher_model_path`: teacher model to use, should be the path to the finetuned teacher checkpoint.
- `--student_model_path`: student model to use, should be the path to the distilled student checkpoint.
- `--student_sparsity`: student sparsity, should be set if you would like to use a parameter-pruned student, e.g., 70; otherwise, leave this argument blank.
- `--student_layer`: student layer count, should be set if you would like to use a layer-dropped student, e.g., 4.
- `--lam`: the knowledgeableness tradeoff term that balances expressiveness and student-friendliness.
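Again, the provided script can be run as-is, or adapted along these lines (the entry point `run_sparsification.py` is hypothetical, and the `--lam` value is only a placeholder, not a recommended setting):

```bash
# Run the provided example script:
bash scripts/run_sparsification_rte.sh

# Or, a hedged sketch (entry point name and --lam value are placeholders):
python run_sparsification.py \
    --model_type kd \
    --teacher_model_path path/to/finetuned-teacher \
    --student_model_path path/to/distilled-student \
    --student_sparsity 70 \
    --lam 0.5
```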
### Rewinding

We provide an example of rewinding the student on RTE in `scripts/run_rewinding_rte.sh`. We explain some important arguments below; an example invocation follows the list:
- `--model_type`: variant to use, should be `kd` in this case.
- `--teacher_model_path`: teacher model to use, should be the path to the sparsified teacher checkpoint.
- `--student_model_path`: student model to initialize, should be the path to the pruned/finetuned teacher checkpoint, depending on how you would like to initialize the student.
- `--student_sparsity`: student sparsity, should be set if you would like to use a parameter-pruned student, e.g., 70; otherwise, leave this argument blank.
- `--student_layer`: student layer count, should be set if you would like to use a layer-dropped student, e.g., 4.
- `--lam`: the knowledgeableness tradeoff term that balances expressiveness and student-friendliness; here it is only used for folder names.
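As before, run the provided script, or adapt a command like the sketch below (the entry point `run_rewinding.py` is hypothetical; note that the teacher path now points to the sparsified teacher produced in the previous step):

```bash
# Run the provided example script:
bash scripts/run_rewinding_rte.sh

# Or, a hedged sketch (entry point name and --lam value are placeholders):
python run_rewinding.py \
    --model_type kd \
    --teacher_model_path path/to/sparsified-teacher \
    --student_model_path path/to/pruned-teacher \
    --student_sparsity 70 \
    --lam 0.5
```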
## Contact

If you have any questions related to the code or the paper, feel free to email Chen ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to describe the problem in detail so we can help you better and more quickly!
## Citation

Please cite our paper if you use the code in your work:
@inproceedings{yang2022sparse,
title={Sparse Teachers Can Be Dense with Knowledge},
author={Yang, Yi and Zhang, Chen and Song, Dawei},
booktitle={EMNLP},
year={2022}
}