This repository contains training scripts and instructions for reproducing our systems submitted to the NEWS 2018 Shared Task on Transliteration of Named Entities, as described in R. Grundkiewicz and K. Heafield, "Neural Machine Translation Techniques for Named Entity Transliteration", NEWS 2018, ACL 2018.
Citation:
@InProceedings{grundkiewicz-heafield:2018:NEWS2018,
author = {Grundkiewicz, Roman and Heafield, Kenneth},
title = {Neural Machine Translation Techniques for Named Entity Transliteration},
booktitle = {Proceedings of the Seventh Named Entities Workshop},
month = {July},
year = {2018},
address = {Melbourne, Australia},
publisher = {Association for Computational Linguistics},
pages = {89--94},
url = {http://www.aclweb.org/anthology/W18-2413}
}
- Download and compile Marian in tools/marian-dev:
    cd tools
    git clone https://github.com/marian-nmt/marian-dev
    mkdir marian-dev/build
    cd marian-dev/build
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make -j8
    cd ../../..
If needed, please refer to the official Marian documentation at https://marian-nmt.github.io/docs
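Before moving on, it can help to confirm that the build produced the Marian executables. The following is a minimal sketch, not part of the repository: `check_marian` is a hypothetical helper, and the list of binaries it looks for (`marian`, `marian-decoder`, `marian-scorer`) is an assumption about what the scripts in `experiments` need.

```shell
# Hypothetical helper: report which Marian binaries are missing from a build directory.
# The default path follows the instructions above; the list of binaries is an assumption.
check_marian() {
    local build_dir="${1:-tools/marian-dev/build}"
    local missing=0 bin
    for bin in marian marian-decoder marian-scorer; do
        if [ ! -x "$build_dir/$bin" ]; then
            echo "missing: $build_dir/$bin" >&2
            missing=1
        fi
    done
    # Returns 0 only if every binary was found and is executable.
    return "$missing"
}
```

Run from the repository root, `check_marian && echo "Marian build looks complete"` prints the message only when all three files exist and are executable.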
- Download data sets 01-04 from http://workshop.colips.org/news2018/dataset.html and unzip them into datasets.
- Prepare training and development data:
    cd experiments
    ./prepare-data.sh
- Train baseline systems, specifying the GPU device(s) and one or more language directions, e.g.:
    ./train.sh '0 1' EnVi EnCh ChEn
  Each system will be an ensemble of 4 deep RNN models rescored by 2 right-to-left models.
  The evaluation scores can be collected by running:
    ./show-results.sh
- A text file can be translated using the translate.sh script, for example:
    head data/EnVi.dev.src | ./translate.sh EnVi file.tmp 0 > file.out
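To translate the development sets of several language directions in one go, the same call can be wrapped in a loop. This is only a sketch: `translate_dev_sets` is a hypothetical helper, and it assumes that the three arguments of `translate.sh` are the language direction, a temporary file, and a GPU device (mirroring the example above), and that `data/<direction>.dev.src` exists for every direction passed in. The output file names are likewise illustrative.

```shell
# Hypothetical helper: translate the dev set of each given direction.
# Usage: translate_dev_sets <gpu> <direction>...
# Assumes translate.sh reads source text on stdin and writes output to stdout,
# as in the example above.
translate_dev_sets() {
    local gpu="$1"
    shift
    local dir
    for dir in "$@"; do
        # Argument order copied from the README example: direction, temp file, GPU.
        ./translate.sh "$dir" "$dir.tmp" "$gpu" < "data/$dir.dev.src" > "$dir.dev.out" || return 1
    done
}
```

For instance, `translate_dev_sets 0 EnVi EnCh` would write `EnVi.dev.out` and `EnCh.dev.out` to the current directory.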
- Prepare synthetic data with the back-translation or forward-translation method:
    ./prepare-synthetic-data.sh
  The systems can be re-trained with additional data by replacing original folders and re-running the training script, e.g.:
    mv data data.original
    mv synthetic data
    mv models models.baseline
    ./train.sh '0 1' EnVi EnCh ChEn
    ./show-results.sh
For the EnVi system, this should display results similar to the following:
                                          ACC     Fscore  MRR     MAPref
    models.baseline/EnVi.1                0.4680  0.8742  0.5582  0.4680
    models.baseline/EnVi.2                0.4900  0.8806  0.5693  0.4900
    models.baseline/EnVi.3                0.4580  0.8744  0.5521  0.4580
    models.baseline/EnVi.4                0.4600  0.8692  0.5543  0.4600
    models.baseline/EnVi.ens              0.4740  0.8783  0.5649  0.4740
    models.baseline/EnVi.ens.r2l          0.4800  0.8815  0.5767  0.4800
    models.baseline/EnVi.ens.r2l.rescore  0.4880  0.8830  0.5777  0.4880
    models.baseline/EnVi.r2l.1            0.4520  0.8710  0.5548  0.4520
    models.baseline/EnVi.r2l.2            0.4860  0.8759  0.5791  0.4860
    models/EnVi.1                         0.4980  0.8856  0.5838  0.4980
    models/EnVi.2                         0.4860  0.8833  0.5771  0.4860
    models/EnVi.3                         0.4860  0.8836  0.5785  0.4860
    models/EnVi.4                         0.4980  0.8854  0.5833  0.4980
    models/EnVi.ens                       0.5000  0.8865  0.5859  0.5000
    models/EnVi.ens.r2l                   0.4820  0.8858  0.5817  0.4820
    models/EnVi.ens.r2l.rescore           0.5020  0.8884  0.5905  0.5020
    models/EnVi.r2l.1                     0.4800  0.8843  0.5789  0.4800
    models/EnVi.r2l.2                     0.4920  0.8876  0.5860  0.4920
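To see at a glance how much each re-trained system gains over its baseline, the two groups of rows can be paired by system name. The sketch below is not part of the repository: `compare_results` is a hypothetical helper, and it assumes that each result row has the form `<path> <ACC> <Fscore> <MRR> <MAPref>`, as in the listing above.

```shell
# Hypothetical helper: read result rows on stdin and print, for each system present
# under both models.baseline/ and models/, the baseline ACC, the re-trained ACC,
# and the difference between them.
compare_results() {
    awk '
        # Rows from the baseline run; lines without exactly 5 fields (e.g. the header) are skipped.
        NF == 5 && $1 ~ /^models\.baseline\// {
            name = $1; sub(/^models\.baseline\//, "", name)
            base[name] = $2
        }
        # Rows from the re-trained run.
        NF == 5 && $1 ~ /^models\// {
            name = $1; sub(/^models\//, "", name)
            retrained[name] = $2
        }
        END {
            for (name in retrained)
                if (name in base)
                    printf "%s %.4f %.4f %+.4f\n", name, base[name], retrained[name], retrained[name] - base[name]
        }
    ' | sort
}
```

If `show-results.sh` writes its table to standard output, `./show-results.sh | compare_results` would print one line per system, for example the name, both ACC values, and the signed change for `EnVi.ens.r2l.rescore`.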