SemFi, SemUr (legacy stuff)
SemFi is a collection of Finnish words and their syntactic relations. SemFi stores the strength of the syntactic relations between words. SemUr is a collection of automatically translated versions of SemFi for other Uralic languages.
To download the SemFi database from the command line:
python -m uralicNLP.download --languages fin --semfi
Alternatively, download the semantic database from within Python:
from uralicNLP import semfi
semfi.download("fin")
Use semfi.supported_languages() to list the supported languages.
You can look up the information SemFi stores about a word by giving its lemma and part of speech (POS).
semfi.get_word("kissa","N", "fin")
>> {'word': u'kissa', 'compund': 0, 'pos': u'N', 'frequency': 23214, 'relative_frequency': 0.000172062683057, 'id': u'kissa_N'}
You can also list all homonyms of a lemma without specifying the POS.
semfi.get_words("kuusi", "fin")
>> [{'word': u'kuusi', 'compund': 0, 'pos': u'N', 'frequency': 3823, 'relative_frequency': 2.83361608221e-05, 'id': u'kuusi_N'}, {'word': u'kuusi', 'compund': 0, 'pos': u'Num', 'frequency': 19897, 'relative_frequency': 0.000147477005461, 'id': u'kuusi_Num'}]
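When a lemma has several readings, one simple way to choose between them is to compare the frequency field of the returned entries. The sketch below works on a copy of the output shown above, so it runs without the database installed. The helper most_frequent_reading is hypothetical and not part of uralicNLP.

```python
# Sample entries copied from the get_words("kuusi", "fin") output shown above.
entries = [
    {'word': 'kuusi', 'compund': 0, 'pos': 'N', 'frequency': 3823,
     'relative_frequency': 2.83361608221e-05, 'id': 'kuusi_N'},
    {'word': 'kuusi', 'compund': 0, 'pos': 'Num', 'frequency': 19897,
     'relative_frequency': 0.000147477005461, 'id': 'kuusi_Num'},
]


def most_frequent_reading(entries):
    """Hypothetical helper (not part of uralicNLP).

    Returns the entry with the highest 'frequency', or None for an empty list.
    """
    if not entries:
        return None
    return max(entries, key=lambda entry: entry['frequency'])


best = most_frequent_reading(entries)
print(best['id'], best['pos'])  # prints: kuusi_Num Num
```

With the database downloaded, the same helper could presumably be applied directly to the list returned by semfi.get_words("kuusi", "fin").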
word = semfi.get_word("näätä","N", "fin")
semfi.get_all_relations(word, "fin", sort=True) #lists all related words
>> [{'zscore': 6.84208734905, 'frequency': 9, 'relation': u'ROOT', 'word2': {'word': u'olla', 'compund': 0, 'pos': u'V', 'frequency': 5301968, 'relative_frequency': 0.0392983044525, 'id': u'olla_V'}, 'relative_frequency': 0.1125, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}]
semfi.get_by_relation(word, "dobj", "fin", sort=True) #lists words with a given syntactic relation
>> [{'zscore': 0, 'frequency': 1, 'relation': u'dobj', 'word2': {'word': u'tai', 'compund': 0, 'pos': u'C', 'frequency': 783, 'relative_frequency': 5.80361337268e-06, 'id': u'tai_C'}, 'relative_frequency': 1, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]
word2 = semfi.get_word("syödä","V", "fin")
semfi.get_by_word(word, word2, "fin")
>> [{'zscore': 1.48741029327, 'frequency': 3, 'relation': u'ROOT', 'word2': {'word': u'syödä', 'compund': 0, 'pos': u'V', 'frequency': 128242, 'relative_frequency': 0.000950532549347, 'id': u'syödä_V'}, 'relative_frequency': 0.0375, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]
SemFi provides several methods for finding related words: get_all_relations lists every relation of a word, get_by_relation lists the words linked to it by a given syntactic relation, and get_by_word lists the relations between two words. Results can be sorted by frequency by passing sort=True.
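The relation lists returned by these methods are plain lists of dictionaries, so they can be filtered with ordinary Python. The sketch below keeps only the relations above a z-score threshold, strongest first. It assumes that the zscore field is the strength measure of a relation. The helper strongest_relations, its parameters, and the trimmed sample data (reduced to the fields the helper reads) are illustrative and not part of uralicNLP.

```python
def strongest_relations(relations, min_zscore=1.0, limit=5):
    """Hypothetical helper (not part of uralicNLP).

    Keeps relations whose 'zscore' is at least min_zscore and returns up to
    `limit` (word2 id, relation label, zscore) tuples, highest zscore first.
    """
    kept = [rel for rel in relations if rel['zscore'] >= min_zscore]
    kept.sort(key=lambda rel: rel['zscore'], reverse=True)
    return [(rel['word2']['id'], rel['relation'], rel['zscore'])
            for rel in kept[:limit]]


# Trimmed copies of relation entries from the example outputs above.
relations = [
    {'zscore': 0, 'frequency': 1, 'relation': 'dobj',
     'word2': {'id': 'tai_C'}},
    {'zscore': 1.48741029327, 'frequency': 3, 'relation': 'ROOT',
     'word2': {'id': 'syödä_V'}},
    {'zscore': 6.84208734905, 'frequency': 9, 'relation': 'ROOT',
     'word2': {'id': 'olla_V'}},
]

for word_id, relation, zscore in strongest_relations(relations):
    print(word_id, relation, zscore)
```

The threshold of 1.0 is an arbitrary choice for illustration; a suitable cut-off depends on the task.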
If you use SemFi or SemUr, please cite the following publication:
Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15)
UralicNLP is an open-source Python library by Mika Hämäläinen