In development, all of this can change.
Parapipeline is a pipeline for POS tagging of texts in multiple languages, sentence alignment, word alignment, and transliteration.
Make sure you have the following programs installed:
- Python 3.8 or later (not tested on Python 3.7 and earlier)
- wget
- All prerequisites for Hunalign
- Polyglot for transliteration requires `python-numpy` and `libicu-dev` (`apt-get install python-numpy libicu-dev`)
- git-lfs. Run `git lfs install`.
Run `make` to install necessary packages, compile taggers, aligners, download models, ...
The installation process has not been thoroughly tested on various systems (it should work on Ubuntu 18.04). If you encounter an error, it is most likely caused by a missing prerequisite.
There are scripts `tag`, `transliterate`, `align`, `wordalign` and `run`.
All scripts have the same arguments as `run`.
All scripts expect line-delimited sentences in UTF-8 encoded files.
The name of these files is `NAME_LANG[_ID][.ext]`, where `NAME` is arbitrary text not containing `_`, `LANG` is an ISO 639-3 language code, the optional `ID` distinguishes between multiple variants of the same text (e.g. different translations), and `.ext` is also optional.
Input names must be unique when compared case-insensitively (if you have both a file NAME and a file Name, it will cause errors).
See the files in the `examples` folder for some example input files.
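
The naming convention above can be sketched as a regular expression. The snippet below is a minimal illustration, not the pipeline's own code: `FILENAME_RE`, `ParsedName`, `parse_filename`, and `find_name_collisions` are hypothetical names, and it assumes that `LANG` is three lowercase letters and that `ID` contains neither `_` nor `.`.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set

# Hypothetical pattern for NAME_LANG[_ID][.ext]; the pipeline may parse names differently.
FILENAME_RE = re.compile(
    r"(?P<name>[^_]+)_(?P<lang>[a-z]{3})(?:_(?P<id>[^_.]+))?(?P<ext>\..+)?"
)


class ParsedName(NamedTuple):
    name: str
    lang: str
    variant_id: Optional[str]
    ext: Optional[str]


def parse_filename(filename: str) -> Optional[ParsedName]:
    """Split a file name into its parts, or return None if it does not fit the pattern."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        return None
    return ParsedName(match["name"], match["lang"], match["id"], match["ext"])


def find_name_collisions(filenames: List[str]) -> List[List[str]]:
    """Return groups of NAME spellings that differ only in letter case."""
    spellings: Dict[str, Set[str]] = defaultdict(set)
    for filename in filenames:
        parsed = parse_filename(filename)
        if parsed is not None:
            spellings[parsed.name.lower()].add(parsed.name)
    return [sorted(group) for group in spellings.values() if len(group) > 1]
```

Running `find_name_collisions` over the input list before starting the pipeline is one way to catch names such as `Hobbit` and `hobbit` that would otherwise clash.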
The `run` script outputs tagged texts in XML files and, when possible, also sentence and word alignment files in XML.
See `examples/outputs` for example output of this pipeline.
Aligned sentences are represented by the `link` tag.
- Attribute `type` denotes the number of sentences from source and target text.
- Attribute `xtargets` is the alignment itself: `6 7;5` meaning sentences `6`, `7` from source are aligned with sentence `5` from target.
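
As an illustration of how the `xtargets` value of a sentence alignment can be read, here is a minimal sketch. The function name `parse_sentence_xtargets` is hypothetical, and the sketch assumes the value always contains exactly one `;` separating the source side from the target side. Sentence identifiers are kept as strings, since the real files may use identifiers that are not plain integers.

```python
from typing import List, Tuple


def parse_sentence_xtargets(xtargets: str) -> Tuple[List[str], List[str]]:
    """Split an xtargets value such as "6 7;5" into source and target sentence ids."""
    # Assumption: exactly one ";" separates the source side from the target side.
    source_part, target_part = xtargets.split(";")
    # Each side is a whitespace-separated list of sentence identifiers.
    return source_part.split(), target_part.split()


source_ids, target_ids = parse_sentence_xtargets("6 7;5")
```

For the example value, the source list has two entries and the target list has one, which are the sentence counts that the `type` attribute describes.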
In word alignment files, each `link` represents an "aligned block" - aligned sentences.
Attribute `xtargets` contains a space-separated list of alignments. For example, `1:2;3:4` means that word `2` from sentence `1` in the source text is aligned with word `4` in sentence `3` in the target text.
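
The word alignment format can be read in a similar way. The following is a minimal sketch under the assumption that every item has the shape `SENTENCE:WORD;SENTENCE:WORD`; the function name `parse_word_xtargets` and the type alias `WordRef` are hypothetical and not part of the pipeline.

```python
from typing import List, Tuple

# (sentence id, word id), both kept as strings exactly as they appear in the attribute.
WordRef = Tuple[str, str]


def parse_word_xtargets(xtargets: str) -> List[Tuple[WordRef, WordRef]]:
    """Parse a value such as "1:2;3:4 1:3;3:5" into (source, target) word pairs."""
    pairs = []
    for item in xtargets.split():
        # Assumption: each item is "SRC_SENTENCE:SRC_WORD;TGT_SENTENCE:TGT_WORD".
        source, target = item.split(";")
        source_sentence, source_word = source.split(":")
        target_sentence, target_word = target.split(":")
        pairs.append(((source_sentence, source_word), (target_sentence, target_word)))
    return pairs


pairs = parse_word_xtargets("1:2;3:4")
```

Each returned pair holds the source word first and the target word second, so the example value yields word `2` of sentence `1` paired with word `4` of sentence `3`.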

    usage: run.py [-h] [-o OUTPUT_DIR] N [N ...]

    Run pipeline.

    positional arguments:
      N                     List of files to be processed. Format:
                            NAME_LANG[_ID][.ext], for example Hobbit_eng.txt

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                            Directory to which to write the output files.

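
For readers who want to see how such a command line is typically declared, the sketch below builds an `argparse` parser that accepts the same arguments as the usage text above. It is a reconstruction for illustration only: the actual `run.py` may be written differently, and `build_parser` and the destination name `files` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the arguments listed in the usage text."""
    parser = argparse.ArgumentParser(prog="run.py", description="Run pipeline.")
    # One or more input files, shown as N in the help output.
    parser.add_argument(
        "files",
        metavar="N",
        nargs="+",
        help="List of files to be processed. "
        "Format: NAME_LANG[_ID][.ext], for example Hobbit_eng.txt",
    )
    # Optional output directory; argparse derives the OUTPUT_DIR metavar itself.
    parser.add_argument(
        "-o",
        "--output_dir",
        help="Directory to which to write the output files.",
    )
    return parser


args = build_parser().parse_args(["-o", "out", "Hobbit_eng.txt", "Hobbit_ces.txt"])
```

Here `args.files` holds the two input names and `args.output_dir` holds `out`; when `-o` is omitted, `args.output_dir` is `None` in this sketch.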
Done | Language | Done | Language | Done | Language |
---|---|---|---|---|---|
✅ | Afrikaans | ✅ | French | ✅ | Norwegian |
✅ | Albanian | ✅ | Georgian | ✅ | Polish |
✅ | Armenian | ✅ | German | ✅ | Portuguese |
✅ | Belarusian | ✅ | Hebrew | ✅ | Romanian |
❌ | Bosnian | ✅ | Hungarian | ✅ | Russian |
✅ | Bulgarian | ✅ | Italian | ✅ | Serbian |
✅ | Catalan | ✅ | Japanese | ✅ | Slovak |
✅ | Chinese | ❌ | Kashubian | ✅ | Slovenian |
✅ | Croatian | ✅ | Korean | ✅ | Spanish |
✅ | Czech | ✅ | Latvian | ✅ | Swedish |
✅ | Danish | ✅ | Lithuanian | ✅ | Turkish |
✅ | Dutch | ❌ | Lower Sorbian | ✅ | Ukrainian |
✅ | English | ✅ | Macedonian | ✅ | Upper Sorbian |
✅ | Estonian | ✅ | Modern Greek | ❌ | Yiddish |
✅ | Finnish | ❌ | Molise Slavic | | |
- Edit `.config/config.json`, follow the structure of the other languages in the file to add a new one.
  - For treetagger, the `.par` file has to be in `./pipeline/taggers/treetagger/lib/`.
  - For UDPipe, the `.udpipe` file has to be in `./pipeline/taggers/udpipe/models/`. Note that this uses UDPipe version 1; UDPipe version 2 models will not work.
This section is about upgrading models.
In order to update UDPipe models, change the `models` section of `./pipeline/taggers/Makefile` to download the desired models (and extract them ...).
Then change `config/config.json` so that each language which uses UDPipe points to the correct filename.
Change `./pipeline/taggers/treetagger/Makefile` to download the version of treetagger you wish to use.
You can also add scripts to download more models and so on.
- UDPipe: Straka Milan, Hajič Jan, Straková Jana. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016
- Treetagger: Helmut Schmid (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK
- Hunalign: D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages. In Proceedings of the RANLP 2005, pages 590-596.
- BTagger: https://github.com/agesmundo/BTagger
- Georgian Treetagger model comes from here: http://corpus.leeds.ac.uk/serge/mocky/ka.par
- Eflomal: https://github.com/robertostling/eflomal
CC BY-NC-SA
This work mainly depends on trained UDPipe models, which are licensed under CC BY-NC-SA.