UD parser
Mika Hämäläinen edited this page Jan 26, 2021 · 3 revisions
First you need to parse the CoNLL-U formatted file into a UD collection.
import codecs
from uralicNLP.ud_tools import UD_collection
ud = UD_collection(codecs.open("file.conllu", encoding="utf-8"))
You can loop over the sentences in a UD collection and over the words in each sentence.
for sentence in ud:
    for word in sentence:
        print(word.pos, word.lemma, word.get_attribute("deprel"))
To parse an individual sentence, pass its CoNLL-U string to parse_sentence.
from uralicNLP.ud_tools import parse_sentence
conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_"
sentence = parse_sentence(conl)
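To make the input string above easier to read, here is a minimal standard-library sketch that splits the same CoNLL-U block into its ten tab-separated columns. It does not use uralicNLP and is not how parse_sentence works internally; the helper name split_conllu and the lowercase field names are assumptions chosen for illustration (the CoNLL-U specification names the columns ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC).

```python
# Stdlib-only illustration of the CoNLL-U layout; NOT uralicNLP's implementation.
# Lowercase field names are an assumption mirroring the ten CoNLL-U columns.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_"


def split_conllu(block):
    """Hypothetical helper: return (comments, tokens) for one sentence block."""
    comments, tokens = [], []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines between sentences
        if line.startswith("#"):
            comments.append(line[1:].strip())  # sentence-level comment line
        else:
            # One token per line, columns separated by tabs
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return comments, tokens


comments, tokens = split_conllu(conl)
for token in tokens:
    print(token["id"], token["form"], token["lemma"],
          token["upos"], token["head"], token["deprel"])
```

Each token line carries exactly ten columns, and an underscore marks an empty value, which is why the last two columns of both tokens are "_".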
Both a UD collection and a UD sentence can be searched for matching entries.
sentences = ud.find_sentences(query={"lemma": "kissa"})  # finds all sentences with the lemma kissa
for sentence in sentences:
    word = sentence.find(query={"lemma": "kissa"})
    print(word[0].get_attribute("form"))  # prints the form of the first word kissa in each matching sentence
If find and find_sentences are called without arguments, they return everything (all words or all sentences, respectively). The query can contain any of the fields specified in the CoNLL-U format description. Query values can be either strings or matchable regex patterns.
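The query behaviour described above can be sketched on plain dictionaries. This is only an illustration of the idea, not uralicNLP's code: the helper matches_query is hypothetical, and it assumes that a string value must equal the field exactly and that a regex value is a compiled pattern applied with match. The library's actual matching rules may differ.

```python
import re


def matches_query(token, query):
    """Hypothetical helper: True if every field in the query matches the token.

    Assumption: strings are compared for equality, anything else is treated
    as a compiled regex and applied with .match(). An empty query matches all.
    """
    for field, expected in query.items():
        value = token.get(field, "")
        if isinstance(expected, str):
            if value != expected:
                return False
        elif expected.match(value) is None:
            return False
    return True


# Stand-in tokens; a real UD sentence would come from the parser instead.
tokens = [
    {"form": "Toinen", "lemma": "toinen", "deprel": "nummod"},
    {"form": "palkinto", "lemma": "palkinto", "deprel": "root"},
]

exact = [t for t in tokens if matches_query(t, {"lemma": "palkinto"})]
by_regex = [t for t in tokens if matches_query(t, {"form": re.compile(r"^[A-ZÅÄÖ]")})]
everything = [t for t in tokens if matches_query(t, {})]

print([t["form"] for t in exact])
print([t["form"] for t in by_regex])
print(len(everything))
```

Combining several fields in one query dictionary narrows the result, since every field has to match in this sketch.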
UralicNLP is an open-source Python library by Mika Hämäläinen