This project provides the inference library and evaluation scripts for WSPAlign.
Create a conda environment with:

```bash
conda create -n wspalign-infereval python=3.8
conda activate wspalign-infereval
```
Then install a PyTorch build compatible with your machine (see the official installation guide at https://pytorch.org/). For example, run:

```bash
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
```
Finally, install transformers with:

```bash
pip install transformers
```
We use spaCy to split sentences into words. For now, this library supports six languages. Declare your source and target languages with `--src_lang` and `--tgt_lang`. For the language abbreviations, refer to the following table:
| Language abbreviation | Language |
|---|---|
| en | English |
| ja | Japanese |
| zh | Chinese |
| fr | French |
| de | German |
| ro | Romanian |
Install spaCy and the language packages with the following commands:

```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
python -m spacy download de_core_news_sm
python -m spacy download ja_core_news_sm
python -m spacy download ro_core_news_sm
```
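The correspondence between the `--src_lang`/`--tgt_lang` codes and the spaCy packages installed above can be sketched as a small lookup table (the helper function below is our own illustration, not part of this repository):

```python
# Mapping from the language codes accepted by --src_lang/--tgt_lang
# to the spaCy pipeline packages downloaded above.
SPACY_MODELS = {
    "en": "en_core_web_sm",   # English
    "ja": "ja_core_news_sm",  # Japanese
    "zh": "zh_core_web_sm",   # Chinese
    "fr": "fr_core_news_sm",  # French
    "de": "de_core_news_sm",  # German
    "ro": "ro_core_news_sm",  # Romanian
}

def spacy_model_for(lang: str) -> str:
    """Return the spaCy package name for a supported language code."""
    try:
        return SPACY_MODELS[lang]
    except KeyError:
        raise ValueError(
            f"Unsupported language: {lang!r}; choose from {sorted(SPACY_MODELS)}"
        )
```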
Please refer to https://spacy.io/ for more information. You can easily use languages other than the above six, but note that for now we do not provide a fine-tuned WSPAligner for other languages; for those, WSPAlign can only align in a zero-shot way with our pre-trained models.
Now align words in two sentences with the following example:

```bash
python inference.py --model_name_or_path qiyuw/WSPAlign-ft-kftt --src_lang ja --src_text="私は猫が好きです。" --tgt_lang en --tgt_text="I like cats."
```
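Conceptually, the aligner's job for the example above is to link corresponding words across the two sentences. A minimal sketch of that output shape (the pairs below are hand-written for illustration only; the actual output format of `inference.py` may differ):

```python
# A word alignment can be represented as a set of (source_word, target_word)
# links. These pairs are hand-chosen to show the kind of result a word
# aligner produces for the example sentence pair.
src_words = ["私", "は", "猫", "が", "好き", "です", "。"]
tgt_words = ["I", "like", "cats", "."]

alignment = [("私", "I"), ("好き", "like"), ("猫", "cats"), ("。", ".")]

# Sanity check: every aligned word occurs in its own sentence.
assert all(s in src_words and t in tgt_words for s, t in alignment)
```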
Use our model checkpoints with Hugging Face:

| Model List | Description |
|---|---|
| qiyuw/WSPAlign-xlm-base | Pre-trained on XLM-RoBERTa |
| qiyuw/WSPAlign-mbert-base | Pre-trained on mBERT |
| qiyuw/WSPAlign-ft-kftt | Fine-tuned with the English-Japanese KFTT dataset |
| qiyuw/WSPAlign-ft-deen | Fine-tuned with a German-English dataset |
| qiyuw/WSPAlign-ft-enfr | Fine-tuned with an English-French dataset |
| qiyuw/WSPAlign-ft-roen | Fine-tuned with a Romanian-English dataset |
Note: For Japanese, Chinese, and other Asian languages, we recommend using mBERT-based models such as `qiyuw/WSPAlign-mbert-base` or `qiyuw/WSPAlign-ft-kftt` for better performance, as discussed in the original paper: WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction (ACL 2023).
| Dataset list | Description |
|---|---|
| qiyuw/wspalign_acl2023_eval | Evaluation data used in the paper |
| qiyuw/wspalign_test_data | Test dataset for evaluation |
The construction of the evaluation dataset can be found at word_align.
Go to `evaluate/` for evaluation. Run `download_dataset.sh` to download all the above datasets.

Then download `aer.py` from lilt/alignment-scripts with:

```bash
wget https://raw.githubusercontent.com/lilt/alignment-scripts/master/scripts/aer.py
```

We made minor modifications to `aer.py` to avoid execution errors; run

```bash
patch -p0 aer.py aer.patch
```

to update the original script.
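For reference, the Alignment Error Rate that `aer.py` reports is computed from a predicted alignment A, a set of sure links S, and a set of possible links P (with S a subset of P). The toy implementation below is our own illustration of the standard formula and the Pharaoh-style `i-j` link format, not the script from lilt/alignment-scripts:

```python
def parse_links(line):
    """Parse Pharaoh-style alignment links like '0-0 1-2' into a set of pairs."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)."""
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    return 1.0 - (a_s + a_p) / (len(predicted) + len(sure))

pred = parse_links("0-0 1-1 2-2")
sure = parse_links("0-0 2-2")
poss = sure | parse_links("1-1")  # possible links include all sure links

print(aer(pred, sure, poss))  # 0.0: every predicted link is allowed, every sure link found
```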
The project also provides evaluation scripts for pre-trained and fine-tuned WSPAlign models. For details of the pre-training and fine-tuning of WSPAlign, please refer to the WSPAlign project.
After running `zeroshot.sh` with your trained model specified, you will get the predicted alignments stored in `[YOUR OUTPUT DIR]/nbest_predictions_.json` (e.g., `/data/local/qiyuw/WSPAlign/experiments-zeroshot-2023-08-03/zeroshot/deen/nbest_predictions_.json`).
Then go to `evaluate/` and run:

```bash
bash post_evaluate.sh [YOUR OUTPUT DIR]/nbest_predictions_.json [LANG] [TOKENIZER]
```

The script will take care of the alignment transformation and evaluation. `[LANG]` can be chosen from `[deen, kftt, roen, enfr]`, and `[TOKENIZER]` can be chosen from `[BERT, ROBERTA]`.
See `evaluate/post_evaluate.sh` for details.
If you use our code or model, please cite our paper:

```bibtex
@inproceedings{wu-etal-2023-wspalign,
    title = "{WSPA}lign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction",
    author = "Wu, Qiyu and Nagata, Masaaki and Tsuruoka, Yoshimasa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.621",
    pages = "11084--11099",
}
```
This software is released under the NTT License; see LICENSE.txt.