Encoder-decoder model with Luong attention, using two LSTM layers with 500 hidden units on both the encoder and decoder side. The vocabulary size is 50,000 on both the source (English) and target (Dutch) side. The model is trained on the training portion of the TED dataset (https://wit3.fbk.eu/mt.php?release=2017-01-trnmted), with a maximum sequence length of 50.

Encoder-Decoder model with attention

Ideas

  • Create another (possibly better) measure than BLEU: compute e.g. the Euclidean distance or cosine similarity of context vectors (see the sketch below this list).
  • Check whether this new measure agrees with what BLEU finds.
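
A minimal sketch of such a measure, assuming the context vectors for a sentence pair are available as NumPy arrays (how they are extracted from the model is still an open point):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity of two context vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Euclidean distance of two context vectors: 0.0 means identical vectors.
    return float(np.linalg.norm(a - b))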

TO DO

  • Find out how the model handles the encoder and decoder: there is only one .pt file?
  • Generalize 'bleu.py' (accept command-line arguments, etc.): it is currently hardcoded for the dev set.
  • Calculate the BLEU score on the train set (due to unclear instructions it was computed on the dev set, roughly 1000 entries).
  • Calculate the BLEU score on the test set (same reason: so far only computed on the dev set).

DONE

  • Find out exactly what data has been used for training (description on Blackboard has changed)
  • Calculate BLEU score on dev set

Open questions

  • BLEU score: should the references be wrapped in a list? The plot makes more sense when the references are not in a list (see the note below this list).
  • Is the idea good? Would it be sufficient for the project?
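
For context on the references question: NLTK's corpus_bleu expects one list of references per hypothesis, even when there is only a single reference per sentence. A minimal sketch (assuming bleuscore.py uses NLTK; it may compute BLEU differently):

from nltk.translate.bleu_score import corpus_bleu

hypotheses = [['the', 'cat', 'is', 'on', 'a', 'mat']]
# Each hypothesis gets a list of references, even if there is only one.
references = [[['the', 'cat', 'is', 'on', 'the', 'mat']]]
print(corpus_bleu(references, hypotheses))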

Project / code description

  • The model is trained on data/train.tags.en-nl.en and data/train.tags.en-nl.nl (approximately 25,000 entries).

  • The dev set consists of IWSLT17.TED.dev2010.en-nl.en.xml and IWSLT17.TED.dev2010.en-nl.nl.xml (approximately 1,000 entries).

  • The test set consists of IWSLT17.TED.tst2017.mltlng.en-nl.en.xml and IWSLT17.TED.tst2017.mltlng.nl-en.nl.xml (approximately 1,250 entries).

  • When running the code, make sure you are in the repository root folder.

Commands to preprocess the TED data (dev), both English and Dutch:

python xml_preprocess.py IWSLT17.TED.dev2010.en-nl.en.xml en_dev.txt
python xml_preprocess.py IWSLT17.TED.dev2010.en-nl.nl.xml nl_dev.txt
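
For reference, the kind of extraction such a preprocessing step performs could look roughly like the sketch below, assuming the dev/test XML wraps each sentence in <seg> tags (the actual xml_preprocess.py may differ, and the train .tags files would need their metadata lines stripped instead):

import re
import sys

def extract_segments(xml_path, out_path):
    # Write the text inside each <seg> element to its own output line.
    with open(xml_path, encoding='utf-8') as f_in, open(out_path, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            match = re.search(r'<seg[^>]*>(.*?)</seg>', line)
            if match:
                f_out.write(match.group(1).strip() + '\n')

if __name__ == '__main__':
    extract_segments(sys.argv[1], sys.argv[2])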

Commands to preprocess the TED data (tst), both English and Dutch:

python xml_preprocess.py IWSLT17.TED.tst2017.mltlng.en-nl.en.xml en_tst.txt
python xml_preprocess.py IWSLT17.TED.tst2017.mltlng.nl-en.nl.xml nl_tst.txt

Commands to preprocess the TED data (train), both English and Dutch:

python xml_preprocess.py train.tags.en-nl.en en_train.txt
python xml_preprocess.py train.tags.en-nl.nl nl_train.txt
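
After preprocessing, a training run matching the architecture described above (two LSTM layers, 500 hidden units, 50,000-word vocabularies, maximum sequence length 50) could look roughly as follows with the legacy OpenNMT-py preprocess.py and train.py scripts. This is a sketch with assumed output paths, not the exact command that produced the checkpoint, and flag names differ between OpenNMT-py versions:

python OpenNMT-py/preprocess.py -train_src en_train.txt -train_tgt nl_train.txt -valid_src en_dev.txt -valid_tgt nl_dev.txt -save_data data/ted -src_vocab_size 50000 -tgt_vocab_size 50000 -src_seq_length 50 -tgt_seq_length 50
python OpenNMT-py/train.py -data data/ted -save_model OpenNMT-py/trained_models/ted -layers 2 -rnn_size 500 -rnn_type LSTM -global_attention general -optim sgd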

Command to translate en.txt (English source sentences, one per line) using the trained model (.pt file); translations are written to preds.txt:

python OpenNMT-py/translate.py -model OpenNMT-py/trained_models/ted_sgd_acc_55.43_ppl_12.39_e11.pt -src en.txt -output preds.txt -replace_unk -verbose

Command to calculate BLEU score and show plot:

python bleuscore.py

(This script should be extended to accept command-line arguments; see the sketch below.)
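
One way the script could be generalized, as a sketch using NLTK corpus BLEU over two plain-text files (the existing bleuscore.py may compute the score and the plot differently):

import argparse
from nltk.translate.bleu_score import corpus_bleu

def main():
    parser = argparse.ArgumentParser(description='Corpus BLEU for plain-text files.')
    parser.add_argument('references', help='file with one reference translation per line')
    parser.add_argument('hypotheses', help='file with one system translation per line')
    args = parser.parse_args()

    with open(args.references, encoding='utf-8') as f:
        refs = [[line.split()] for line in f]  # one single-reference list per sentence
    with open(args.hypotheses, encoding='utf-8') as f:
        hyps = [line.split() for line in f]

    print('BLEU: %.4f' % corpus_bleu(refs, hyps))

if __name__ == '__main__':
    main()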

Note that some folders inside OpenNMT-py can be ignored (they are there for educational purposes): we do not use data, test, and some other files.

In trained_models there are two models; the one whose name starts with ted is the one we need.

The txtdata folder was only used for experimentation and can be ignored.
