
# Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)

This page contains pointers to pre-trained models as well as instructions on how to train new models for our paper.

## Citation

```bibtex
@inproceedings{wu2018pay,
  title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
  author = {Felix Wu and Angela Fan and Alexei Baevski and Yann Dauphin and Michael Auli},
  booktitle = {International Conference on Learning Representations},
  year = {2019},
  url = {https://openreview.net/forum?id=SkVhlh09tX},
}
```

## Translation

### Pre-trained models

For some datasets we release models without GLUs, which are faster at inference.

| Description | Dataset | Model | Test set(s) |
| --- | --- | --- | --- |
| LightConv (without GLUs) | IWSLT14 German-English | download (.tar.bz2) | IWSLT14 test: download (.tar.bz2) |
| DynamicConv (without GLUs) | IWSLT14 German-English | download (.tar.bz2) | IWSLT14 test: download (.tar.bz2) |
| LightConv (without GLUs) | WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
| DynamicConv (without GLUs) | WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
| LightConv | WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
| DynamicConv | WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
| LightConv | WMT14 English-French | download (.tar.bz2) | newstest2014: download (.tar.bz2) |
| DynamicConv | WMT14 English-French | download (.tar.bz2) | newstest2014: download (.tar.bz2) |
| LightConv | WMT17 Chinese-English | download (.tar.bz2) | newstest2017: download (.tar.bz2) |
| DynamicConv | WMT17 Chinese-English | download (.tar.bz2) | newstest2017: download (.tar.bz2) |

### Preprocessing the training datasets

Please follow the instructions in `examples/translation/README.md` to preprocess the data.
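
For reference, the IWSLT14 De-En preparation from that README looks roughly like the sketch below; check `examples/translation/README.md` for the current version of these commands.

```sh
# Download and tokenize IWSLT14 De-En (from examples/translation)
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..

# Binarize the data into data-bin/iwslt14.tokenized.de-en
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en --workers 20
```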

### Training and evaluation options

To use a model without GLU, set `--encoder-glu 0 --decoder-glu 0`. For LightConv, use `--encoder-conv-type lightweight --decoder-conv-type lightweight`; otherwise the default is DynamicConv. For the best BLEU results, the length penalty (`--lenpen`) may need to be tuned manually. The flag combinations are summarized in the snippet below.
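
As a quick reference, these are the flag combinations implied by the options above (a summary of the paragraph, not additional required flags):

```sh
# DynamicConv with GLU (default): no extra flags needed
# DynamicConv without GLU:
#   --encoder-glu 0 --decoder-glu 0
# LightConv with GLU:
#   --encoder-conv-type lightweight --decoder-conv-type lightweight
# LightConv without GLU:
#   --encoder-conv-type lightweight --decoder-conv-type lightweight \
#   --encoder-glu 0 --decoder-glu 0
```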

### IWSLT14 De-En

Training and evaluating DynamicConv (without GLU) on a GPU:

```sh
# Training
SAVE="save/dynamic_conv_iwslt"
mkdir -p $SAVE
CUDA_VISIBLE_DEVICES=0 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en \
    --clip-norm 0 --optimizer adam --lr 0.0005 \
    --source-lang de --target-lang en --max-tokens 4000 --no-progress-bar \
    --log-interval 100 --min-lr '1e-09' --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler inverse_sqrt \
    --ddp-backend=no_c10d \
    --max-update 50000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --keep-last-epochs 10 \
    -a lightconv_iwslt_de_en --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 0 --decoder-glu 0

# Average the last 10 epoch checkpoints before decoding
python scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 128 --beam 4 --remove-bpe --lenpen 1 --gen-subset test --quiet
```
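
Since `--lenpen` may need manual tuning for the best BLEU, a simple sweep on the validation set can be used to pick it. This is a sketch, not part of the original recipe:

```sh
# Sweep the length penalty on the validation set and keep the value with the best
# BLEU (uses the SAVE directory and averaged checkpoint from the commands above).
for LENPEN in 0.5 0.7 0.9 1.0 1.1 1.3; do
    echo "lenpen=${LENPEN}"
    CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path "${SAVE}/checkpoint_last10_avg.pt" \
        --batch-size 128 --beam 4 --remove-bpe --lenpen $LENPEN \
        --gen-subset valid --quiet
done
```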

### WMT16 En-De

Training and evaluating DynamicConv (with GLU) on WMT16 En-De using the cosine scheduler on one machine with 8 V100 GPUs:

```sh
# Training
SAVE="save/dynamic_conv_wmt16en2de"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
    data-bin/wmt16_en_de_bpe32k --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 16 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-07 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 \
    --t-mult 1 --lr-period-updates 20000 \
    --arch lightconv_wmt_en_de_big --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt16.en-de.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_best.pt" \
    --batch-size 128 --beam 5 --remove-bpe --lenpen 0.5 --gen-subset test > wmt16_gen.txt
bash scripts/compound_split_bleu.sh wmt16_gen.txt
```
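
If you train on fewer than 8 GPUs, `--update-freq` can be increased proportionally to keep the effective batch size roughly the same. The IWSLT recipe above also averages the last 10 epoch checkpoints before decoding; the sketch below shows the same optional step here (it is not part of the original WMT16 instructions, which decode with `checkpoint_best.pt`):

```sh
# Optional: average the last 10 epoch checkpoints and decode with the averaged model
# (not part of the original WMT16 recipe; uses the SAVE directory from above).
python scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt16.en-de.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 128 --beam 5 --remove-bpe --lenpen 0.5 --gen-subset test > wmt16_gen.txt
bash scripts/compound_split_bleu.sh wmt16_gen.txt
```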

### WMT14 En-Fr

Training DynamicConv (with GLU) on WMT14 En-Fr using the cosine scheduler on one machine with 8 V100 GPUs:

```sh
# Training
SAVE="save/dynamic_conv_wmt14en2fr"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
    data-bin/wmt14_en_fr --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 16 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-07 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 \
    --t-mult 1 --lr-period-updates 70000 \
    --arch lightconv_wmt_en_fr_big --save-dir $SAVE \
    --dropout 0.1 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt14.en-fr.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_best.pt" \
    --batch-size 128 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test
```
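
`fairseq-generate` prints a corpus-level BLEU summary at the end of decoding. To keep the hypotheses on disk and still see the score, something like the following works (a sketch, not part of the original recipe):

```sh
# Save the full generation output and print only the final BLEU summary line
# (the original recipe simply writes everything to stdout).
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt14.en-fr.joined-dict.newstest2014 \
    --path "${SAVE}/checkpoint_best.pt" \
    --batch-size 128 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test \
    | tee wmt14_gen.txt | grep "^Generate"
```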