wikidict-wordlist - Wikipedia Monolingual Reference Data (wordlists)

This repository makes available a collection of wordlists derived from article titles in various language Wikipedias. The data has been extracted from Wikidata.

Data

The data directory contains subdirectories organized by ISO language code.

The basic filename pattern is [ISO]-wordlist_wiki.txt, where [ISO] is the ISO code of the target language. All available languages are listed below.
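As a rough illustration of the filename pattern, the following Python sketch builds the expected filename for a language code and reads a wordlist into memory. Only the filename pattern comes from this README. The directory layout (`data/[ISO]/`), the UTF-8 encoding, the one-entry-per-line format, and the helper names `wordlist_filename`, `wordlist_path`, and `load_wordlist` are assumptions made for the example; check them against the actual files.

```python
from pathlib import Path


def wordlist_filename(iso):
    """Return the filename for a language code, per the documented pattern."""
    return f"{iso}-wordlist_wiki.txt"


def wordlist_path(data_dir, iso):
    """Build a path to a wordlist.

    Assumption: each language has its own subdirectory named after its code,
    i.e. <data_dir>/<ISO>/<ISO>-wordlist_wiki.txt.
    """
    return Path(data_dir) / iso / wordlist_filename(iso)


def load_wordlist(path):
    """Read a wordlist file into a list of entries.

    Assumptions: the file is UTF-8 encoded and holds one entry per line.
    Blank lines are skipped.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

Under those assumptions, `load_wordlist(wordlist_path("data", "cy"))` would return the Welsh entries as a list of strings.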

Available languages

Language code Language name
af Afrikaans
am Amharic
ang Anglo-Saxon
ar Arabic
arc Aramaic
bg Bulgarian
bi Bislama
bn Bengali
bo Tibetan
br Breton
bs Bosnian
ca Catalan
cdo Min Dong
chr Cherokee
chy Cheyenne
cr Cree
cs Czech
cy Welsh
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
ff Fula
fi Finnish
fr French
ga Irish
gan Gan
gd Scottish Gaelic
gu Gujarati
gv Manx
ha Hausa
hak Hakka
haw Hawaiian
he Hebrew
hi Hindi
hr Croatian
ht Haitian
hu Hungarian
hy Armenian
id Indonesian
ig Igbo
is Icelandic
it Italian
iu Inuktitut
ja Japanese
jbo Lojban
jv Javanese
ka Georgian
kg Kongo
ki Kikuyu
kl Greenlandic
km Khmer
ko Korean
la Latin
lg Luganda
lo Lao
lt Lithuanian
lv Latvian
mg Malagasy
mi Maori
mn Mongolian
ms Malay
mt Maltese
nah Nahuatl
ne Nepali
nl Dutch
nn Norwegian (Nynorsk)
no Norwegian
nv Navajo
ny Chichewa
oc Occitan
pa Punjabi
pi Pali
pl Polish
ps Pashto
pt Portuguese
qu Quechua
ro Romanian
ru Russian
sa Sanskrit
se Northern Sami
sh Serbo-Croatian
sk Slovak
sl Slovenian
sn Shona
so Somali
sq Albanian
sr Serbian
sv Swedish
sw Kiswahili
ta Tamil
te Telugu
th Thai
tl Tagalog
tpi Tok Pisin
tr Turkish
ug Uyghur
uk Ukrainian
ur Urdu
vi Vietnamese
wo Wolof
wuu Wu
xh Xhosa
yi Yiddish
yo Yoruba
za Zhuang
zh Chinese (Mandarin)
zh_classical Classical Chinese
zh_min_nan Min Nan
zh_yue Cantonese
zu Zulu

Statistics

Wordlist size

Language # of entries
af 33599
am 11014
ang 2977
ar 446845
arc 1829
bg 225573
bi 490
bn 59121
bo 2929
br 49865
bs 64229
ca 438072
cdo 2909
chr 492
chy 710
cr 70
cs 327321
cy 52130
da 196279
de 1787961
el 136650
en 4798378
eo 209308
es 1346715
et 124124
eu 203027
fa 744454
ff 464
fi 363265
fr 1862431
ga 35768
gan 14253
gd 15561
gu 27615
gv 4723
ha 518
hak 4123
haw 2009
he 209505
hi 120411
hr 139555
ht 45669
hu 323069
hy 161719
id 338477
ig 1075
is 39429
it 1183116
iu 383
ja 951498
jbo 1179
jv 45722
ka 118968
kg 868
ki 311
kl 1839
km 4713
ko 446200
la 111691
lg 179
lo 1913
lt 173148
lv 58016
mg 77182
mi 2579
mn 18668
ms 245936
mt 2981
nah 10519
ne 24961
nl 1812937
nn 117294
no 403749
nv 3887
ny 170
oc 88788
pa 14042
pi 2759
pl 1088821
ps 5148
pt 866567
qu 18494
ro 264609
ru 1461243
sa 12256
se 7216
sh 284238
sk 269048
sl 132095
sn 1671
so 2760
sq 53553
sr 351888
sv 1954061
sw 26694
ta 80394
te 63860
th 134176
tl 57983
tpi 1336
tr 247607
ug 2596
uk 638342
ur 125182
vi 1241500
wo 1636
wuu 5032
xh 319
yi 12575
yo 35053
za 808
zh 804107
zh_classical 3855
zh_min_nan 14851
zh_yue 32062
zu 689

Top ten wordlists by number of entries

Language # of entries
en 4798378
sv 1954061
fr 1862431
nl 1812937
de 1787961
ru 1461243
es 1346715
vi 1241500
it 1183116
pl 1088821
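
A ranking like the one above can be derived from the per-language entry counts by sorting them in descending order. The snippet below is a minimal sketch, not the script used to produce this table. The helper name `top_n` is hypothetical, and the `sample` dictionary holds only a handful of counts copied from the wordlist size table.

```python
def top_n(counts, n=10):
    """Return the n (language, entries) pairs with the most entries."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# A few counts taken from the wordlist size table, for illustration only.
sample = {
    "cr": 70,
    "en": 4798378,
    "fr": 1862431,
    "lg": 179,
    "sv": 1954061,
}

for code, entries in top_n(sample, n=3):
    print(code, entries)
```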

License

According to the Wikidata website:

All structured data from the main and property namespace is available under the Creative Commons CC0 License

The data in this repository is therefore made available under the same Creative Commons CC0 License used by the Wikidata project. All of the data has been derived from the Wikidata JSON database dumps.