by Yuki Arase (Osaka University)
This repository provides an implementation of the phrase alignment method based on the constrained tree edit distance.
ESPADA is now available online!
Yuki Arase and Jun'ichi Tsujii. 2020. Compositional Phrase Alignment and Beyond. in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1611-1623.
Please make sure to install all the dependencies listed in requirements.txt.
SPADE and ESPADA datasets are downloadable from LDC (Linguistic Data Consortium).
Please place these corpora (xml files) in the data directory (the current repository provides only samples for debugging).
The following labels correspond to models described in the paper.
- BERTTrainer: BERT model with simple fine-tuning
- BERT1F: The proposed model (BERT+SimMatrix)
- BERT1E: The proposed model using [CLS] instead of SimMatrix (BERT+[CLS])
Trained models are distributed at Zenodo. Note that the ALIR and ALIP values of these models differ slightly from those reported in the paper, because the paper reports the average performance of 10 models initialized with different random seeds.
Please set the hyper-parameters as you like.
python ./fine-tune_bert.py --out_dir ../model/ --model_type BERT1F --pooling mean --train_epoch 100 --early_stop 5 --margin 1.0 --lr 3e-05 --ft_bert
python ./main.py --out_dir ../out/ --model_dir ../model/ --model_name BERT1F_TripletMarginLoss_margin-1.0_lr-3e-05_mean_100_ft-bert-base-uncased.pkl --pooling mean --null_thresh 0.8
If you want to output alignments, add the --decode option.
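The value passed to --model_name is the file name of a trained model inside --model_dir. As a minimal sketch, assuming the trained models are saved as .pkl files directly under that directory (as the example file name above suggests), the candidates can be listed like this. The helper name `list_trained_models` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def list_trained_models(model_dir: str) -> List[str]:
    """Return the names of .pkl files directly under model_dir, sorted by name.

    Assumption: trained models are stored as *.pkl files in this directory.
    """
    directory = Path(model_dir)
    if not directory.is_dir():
        # Nothing to list if the directory has not been created yet.
        return []
    return sorted(p.name for p in directory.glob("*.pkl") if p.is_file())


if __name__ == "__main__":
    # ../model/ is the --out_dir used in the fine-tuning command above.
    for name in list_trained_models("../model/"):
        print(name)
```

Any printed name can then be copied into the --model_name argument.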
python ./baseline_wo_ted.py --out_dir ../out/ --model_dir ../model/ --model_name BERT1F_TripletMarginLoss_margin-1.0_lr-3e-05_mean_100_ft-bert-base-uncased.pkl --pooling mean --null_thresh 0.6
Download a FastText model of your choice and specify the path to the model with --model_dir.
python ./main.py --out_dir ../out/ --model_name FastText --model_dir ../fasttext/crawl-300d-2M-subword.bin --pooling mean --null_thresh 0.8
python ./baseline_wo_ted.py --out_dir ../out/ --model_name FastText --model_dir ../fasttext/crawl-300d-2M-subword.bin --pooling mean --null_thresh 0.75
- Parse your dataset with the Enju parser. Make sure to set the -xml flag to obtain outputs in XML format.
- Rename your source and target xml files to follow the rule: s-(\d+).xml and t-(\d+).xml, where (\d+) is the index of a pair. E.g., s-001.xml and t-001.xml.
- Place your xml files in ../data/name_of_your_xml_dir/ and change the path in the decode() function in main.py.
- Run main.py with the --decode flag.
- Alignment results will be saved in the output directory: index.txt contains the alignment results without postprocessing and PP_index.txt contains the alignments with postprocessing.
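The naming rule above can be checked before running main.py. The following is a minimal sketch, not part of this repository: the helper names `pair_file_names` and `find_pairs` are hypothetical, and the three-digit zero padding is an assumption taken from the s-001.xml example.

```python
import re
from pathlib import Path
from typing import Dict, Tuple

# Naming rule from the steps above: s-(\d+).xml for source, t-(\d+).xml for target.
PAIR_NAME = re.compile(r"^([st])-(\d+)\.xml$")


def pair_file_names(index: int, width: int = 3) -> Tuple[str, str]:
    """Return (source, target) file names for one pair, e.g. s-001.xml and t-001.xml.

    Assumption: indices are zero-padded to `width` digits, as in the example.
    """
    number = f"{index:0{width}d}"
    return f"s-{number}.xml", f"t-{number}.xml"


def find_pairs(xml_dir: str) -> Dict[str, Tuple[Path, Path]]:
    """Map each pair index to its (source, target) files.

    Files that do not match the naming rule, and indices that lack
    either the source or the target file, are skipped.
    """
    found: Dict[str, Dict[str, Path]] = {}
    for path in Path(xml_dir).iterdir():
        match = PAIR_NAME.match(path.name)
        if match:
            side, index = match.groups()
            found.setdefault(index, {})[side] = path
    return {
        index: (sides["s"], sides["t"])
        for index, sides in sorted(found.items())
        if "s" in sides and "t" in sides
    }
```

Calling `find_pairs("../data/name_of_your_xml_dir/")` and comparing the number of returned pairs with the size of your dataset is a quick way to spot misnamed or missing files.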
For your reference, data/Enju/ provides simple example inputs (Enju xml files) and out/alignment/ provides the corresponding alignment outputs.
If you use our code in your projects, please cite the following paper.
Yuki Arase and Jun'ichi Tsujii. 2020. Compositional Phrase Alignment and Beyond. in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1611-1623.
@inproceedings{arase-tsujii-2020-compositional,
title = "Compositional Phrase Alignment and Beyond",
author = "Arase, Yuki and
Tsujii, Jun{'}ichi",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.125",
doi = "10.18653/v1/2020.emnlp-main.125",
pages = "1611--1623"
}