UP_Vietnamese-VTB

Data Format

This data is in conllup format, which adds user-defined columns to the 10 original columns of the CoNLL-U format (from UD). Our data consists of four columns: the original ID column plus three additional columns, UP:PRED, UP:ARGHEADS, and UP:ARGSPANS.

ID (column 1) is the token ID, consistent with the corresponding UD sentence.
UP:PRED (column 11) contains the predicate sense label for this predicate. The sense provides roleset-specific meanings for each of its arguments, as defined in the English PropBank.
UP:ARGHEADS (column 12) contains the argument heads for the arguments of this predicate. Each argument is in the format label:token_id. Arguments are separated by the pipe character (|).
UP:ARGSPANS (column 13) contains the argument spans for the arguments of this predicate. Each argument is in the format label:start_token_id-end_token_id. Arguments are separated by the pipe character (|).
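
As an illustration of the two argument formats described above, the following is a minimal Python sketch of how the UP:ARGHEADS and UP:ARGSPANS values could be parsed. The function names are hypothetical and are not part of the tools repository. The sketch assumes that an empty field is written as `_` (as in CoNLL-U), that token IDs are plain integers, and that a span may occasionally appear as a single ID without a hyphen.

```python
from typing import List, Tuple


def parse_arg_heads(value: str) -> List[Tuple[str, int]]:
    """Parse a UP:ARGHEADS value such as 'ARG0:3|ARGM-TMP:7'."""
    if value in ("", "_"):  # assumption: '_' marks an empty field, as in CoNLL-U
        return []
    heads = []
    for item in value.split("|"):
        # Split on the last colon so the label itself is left untouched.
        label, _, token_id = item.rpartition(":")
        heads.append((label, int(token_id)))  # assumption: integer token IDs
    return heads


def parse_arg_spans(value: str) -> List[Tuple[str, int, int]]:
    """Parse a UP:ARGSPANS value such as 'ARG0:1-3|ARGM-TMP:7-9'."""
    if value in ("", "_"):
        return []
    spans = []
    for item in value.split("|"):
        label, _, span = item.rpartition(":")
        start, sep, end = span.partition("-")
        # Assumption: a span without a hyphen covers a single token.
        spans.append((label, int(start), int(end if sep else start)))
    return spans
```

Splitting on the last colon, and only then on the hyphen inside the span, keeps labels that themselves contain a hyphen (for example ARGM-TMP) intact.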

We provide a Python script that combines such a UP file with its corresponding UD file to produce the desired 13-column .conllup file. The script is available in the tools repository at up2/merge_ud_up.py. Follow this procedure:

Download UD from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4611
git clone https://github.com/UniversalPropositions/UP_Vietnamese-VTB.git
git clone https://github.com/UniversalPropositions/tools.git
cd tools
Set up tools as per the instructions in README.md.
python3 up2/merge_ud_up.py --input_ud=<ud_path_to_UD_Vietnamese-VTB>/ --input_up=<up_path_to_UP_Vietnamese-VTB>/ --output=<path_to_UD+UP_Vietnamese-VTB>/
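
After the merge, it can be useful to confirm that every token line in the output really has 13 columns. The following is a minimal sketch of such a check, not part of the tools repository. The function name is hypothetical, and the sketch assumes the usual CoNLL-U conventions: tab-separated columns, comment lines starting with `#`, and blank lines between sentences.

```python
from typing import Iterable, List, Tuple

EXPECTED_COLUMNS = 13  # 10 CoNLL-U columns + UP:PRED, UP:ARGHEADS, UP:ARGSPANS


def find_bad_lines(
    lines: Iterable[str], expected: int = EXPECTED_COLUMNS
) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for token lines with the wrong width."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Skip sentence separators and comment/metadata lines.
        if not line or line.startswith("#"):
            continue
        n_cols = len(line.split("\t"))
        if n_cols != expected:
            bad.append((line_no, n_cols))
    return bad


# Usage (the path is a placeholder for a file in your own output directory):
# with open("path/to/merged.conllup", encoding="utf-8") as f:
#     print(find_bad_lines(f))
```

An empty result means every token line has the expected width. It does not verify that the UD and UP token IDs were aligned correctly.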

Known data quality issues

Parser quality

Because the underlying parser sometimes identifies the wrong lemma for certain verbs, and because frame files are named after the lemma in the target language, some frame filenames may not make sense in that language.

Language peculiarities

In languages where the subject or object can be omitted, role labels may be transferred incorrectly. One potential cause is incorrect word alignment.
AUX (be, have, do) in EN is likely to be misaligned with tokens in other languages. In EN, these auxiliaries are used to construct tenses (perfect perfective), polarity, etc., but other languages represent tense and polarity differently.

