Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Introduction

This repository contains information about the Glot500 model, data, and code.

- Glot500-m is an extended version of XLM-R-base that covers more than 500 languages, compared to XLM-R's 104. Glot500-m is available at huggingface-models.
- Glot2000-c comprises corpora for over 2000 languages. Glot500-c is a subset of Glot2000-c covering over 500 languages, which includes the languages with more than 30,000 sentences.
- The part of the Glot500-c dataset that we can redistribute is available at huggingface-dataset. For a more complete version of the data, fill out the data request form.

Glot500-m

You can use this model directly with a pipeline for masked language modeling:

>>> ! pip install transformers
>>> ! pip install sentencepiece

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input, output_hidden_states=True)

Glot500-m Evaluation

We provide an in-depth evaluation of the Glot500-m model and baselines in our paper. Each number is an average over head languages, tail languages, or all languages. See the paper for detailed results per task and language. Glot500-m outperforms XLM-R-B (base) on all tasks for tail languages and on all tasks except POS for head languages. It also outperforms XLM-R-L (large) for tail languages. The best result per task/language set is in bold.

	tail	tail	tail	head	head	head	all	all	all
	XLM-R-B	XLM-R-L	Glot500-m	XLM-R-B	XLM-R-L	Glot500-m	XLM-R-B	XLM-R-L	Glot500-m
Pseudoperplexity	304.2	168.6	12.2	12.5	8.4	11.8	247.8	136.4	11.64
Sentence Retrieval Tatoeba (Top 10 Acc.)	32.6	33.6	59.8	66.2	71.1	75.0	56.6	60.4	70.7
Sentence Retrieval Bible (Top 10 Acc.)	7.4	7.1	43.2	54.2	58.3	59.0	19.3	20.1	47.3
Text Classification (F1)	13.7	13.9	46.6	51.3	60.5	54.7	23.3	25.8	48.7
NER (F1)	47.5	51.8	60.7	61.8	66.0	63.9	55.3	59.5	62.4
POS (F1)	41.7	43.5	62.3	76.4	78.4	76.0	65.8	67.7	71.8
Roundtrip Alignment (Acc.)	2.57	3.13	4.45	3.42	4.06	5.46	2.77	3.34	4.68
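
To make the tail-language comparison concrete, the following sketch recomputes the gap between Glot500-m and XLM-R-B from the tail columns of the table above. The dictionary layout and the helper names `tail_gain` and `glot_is_best` are illustrative choices, not part of the repository. It assumes pseudoperplexity is the only lower-is-better metric.

```python
# Tail-language scores copied from the table above: (XLM-R-B, XLM-R-L, Glot500-m).
TAIL_SCORES = {
    "Pseudoperplexity": (304.2, 168.6, 12.2),
    "Sentence Retrieval Tatoeba (Top 10 Acc.)": (32.6, 33.6, 59.8),
    "Sentence Retrieval Bible (Top 10 Acc.)": (7.4, 7.1, 43.2),
    "Text Classification (F1)": (13.7, 13.9, 46.6),
    "NER (F1)": (47.5, 51.8, 60.7),
    "POS (F1)": (41.7, 43.5, 62.3),
    "Roundtrip Alignment (Acc.)": (2.57, 3.13, 4.45),
}

# Assumption: pseudoperplexity is the only metric where lower is better.
LOWER_IS_BETTER = {"Pseudoperplexity"}


def tail_gain(task):
    """Return Glot500-m's improvement over XLM-R-B on tail languages (positive = better)."""
    xlmr_b, _xlmr_l, glot = TAIL_SCORES[task]
    gain = xlmr_b - glot if task in LOWER_IS_BETTER else glot - xlmr_b
    return round(gain, 2)


def glot_is_best(task):
    """Check whether Glot500-m has the best tail-language score for a task."""
    xlmr_b, xlmr_l, glot = TAIL_SCORES[task]
    if task in LOWER_IS_BETTER:
        return glot < min(xlmr_b, xlmr_l)
    return glot > max(xlmr_b, xlmr_l)


for task in TAIL_SCORES:
    print(f"{task}: gain over XLM-R-B = {tail_gain(task)}, best = {glot_is_best(task)}")
```

The same pattern applies to the head and all columns by swapping in the corresponding numbers.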

Glot500-c

This is an overview of the corpora included in Glot500-c, as presented in our paper. Glot500-c will be sent via email upon filling out the data request form. The part that we can redistribute is available at huggingface-dataset. For more information, check out the table below.

Disclaimer Please note that, while the data sources utilized in this study do not explicitly prohibit the reuse of data for research purposes, some sources do have copyright statements indicating that such use is permissible, while others do not. Additionally, certain sources prohibit the redistribution of data. As such, data from these sources is omitted from the published version of Glot500-c.
As regards the ND (NoDerivs) constraint for some datasets, we only change the format of the container while preserving the original contents. The first column of the table indicates the availability of each corpus in the downloadable Glot500-c (yes/no/partially).

We request that all users of Glot500-c cite the original creators of the datasets and comply with each dataset's license. A BibTeX file is available.

If you are a dataset owner and wish to update any part of this overview, or do not want your dataset to be included in Glot500-c, please send us an email at [email protected] .

Glot500-c overview table:

Available	Dataset	Related Papers	Languages	Domain / Notes	Data collection / Verification method	License
Partially	1000Langs	-	1500 languages	Religious	Web-crawled	Apache License 2.0
Yes	Add	Link	arz, afb, ajp, apc	Dialects, arabic commentaries	Annotated	Freely available for research purposes
Yes	AfriBERTa	Link	amh, hau, ibo, orm, pcm, som, swa, tir, yor	mostly BBC, some Common Crawl		Apache License 2.0
Yes	AfroMAFT	Link ; Link	afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, sna, som, sot, swa, xho, yor, zul	Language Adaptation Corpus		https://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
Yes	AI4Bharat	Link	pan, hin, ben, ori, asm, guj, mar, kan, tel, mal, tam	News, magazine, blog posts	Automatically curated	CC BY-NC-SA 4.0
Yes	AIFORTHAI-LotusCorpus	-	tha	Large vOcabulary Thai continUous Speech recognition (LOTUS) corpus		CC BY-NC-SA 3.0 TH, 2005 Copyright by National Electronics and Computer Technology Center (NECTEC). For more information, visit http://www.nectec.or.th/rdi/lotus
Yes	Akuapem	-	aka	Parallel sentences	Verified by native speakers	CC-BY 4.0
Yes	Anuvaad	-	hin, ben, tam, mal, tel, kan, mar, pan, guj, asm, urd, ori	Various domains (General, Legal, Education, Healthcare, Automobile, News)		CC-BY 4.0
Yes	AraBench	Link	arz, apc, afb, ary	Translations of 'travelling phrases', blogs, tv transcripts, Bible	Available Dialectal Arabic-English resources and with curated evaluation sets	Apache License 2.0
Yes	AUTSHUMATO	-	tsn, tso	South African government domain		Creative Commons Attribution 2.5 South Africa License
Yes	Bianet	Link	kur, eng, tur	Parallel news corpus	Automatically curated	CC-BY-SA 4.0 open license
Yes	BLOOM	Link	aaa, abc, ada, adq, aeu, agq, ags, ahk, aia, ajz, aka, ame, amp, amu, ann, aph, awa, awb, azn, azo, bag, bam, baw, bax, bbk, bcc, bce, bec, bef, bfd, bfm, bfn, bgf, bho, bhs, bis, bjn, bjr, bkc, bkh, bkm, bkx, bob, bod, boz, bqm, bra, brb, bri, brv, bss, bud, buo, bwt, bwx, bxa, bya, bze, bzi, cak, cbr, cgc, chd, chp, cim, clo, cmo, csw, cuh, cuv, dag, ddg, ded, dig, dje, dmg, dnw, dtp, dtr, dty, dug, eee, ekm, enb, enc, ewo, fli, fon, fub, fuh, gal, gbj, gou, gsw, guc, guz, gwc, hao, hbb, hig, hil, hla, hna, hre, hro, idt, ilo, ino, isu, jgo, jmx, jra, kak, kam, kau, kbq, kbx, kby, kek, ken, khb, kik, kin, kjb, kmg, kmr, kms, kmu, kqr, krr, ksw, kvt, kwd, kwu, kwx, kxp, kyq, laj, lan, lbr, lfa, lgg, lgr, lhm, lhu, lkb, llg, lmp, lns, loh, lsi, lts, lug, luy, lwl, mai, mam, mdr, mfh, mfj, mgg, mgm, mgo, mgq, mhx, miy, mkz, mle, mlk, mlw, mmu, mne, mnf, mnw, mot, mqj, mrn, mry, msb, muv, mve, mxu, myk, myx, mzm, nas, nco, new, nge, ngn, nhx, njy, nla, nlv, nod, nsk, nsn, nso, nst, nuj, nwe, nwi, nxa, nxl, nyo, nyu, nza, odk, oji, oki, omw, ozm, pae, pag, pbt, pce, pcg, pdu, pea, pex, pis, pkb, pmf, pnz, psp, pwg, qaa, qub, quc, quf, quz, qve, qvh, qvm, qvo, qxh, rel, rnl, roo, rue, rug, saq, sat, sdk, sea, sgd, shn, sml, snk, snl, sox, sps, ssn, stk, sxb, syw, taj, tbj, tdb, tdg, tdt, teo, tet, the, thk, thl, thy, tio, tkd, tnl, tnn, tnp, tnt, tod, tom, tpi, tpl, tpu, tsb, tsn, tso, tuv, tuz, tvs, udg, unr, ven, vif, war, wbm, wbr, wms, wni, wnk, wtk, xkg, xmd, xmg, xmm, xog, xty, yas, yav, ybb, ybh, ybi, ydd, yea, yet, yin, ymp, zaw, zlm, zuh	Web	Crawl from Internet and filtering	CC BY 4.0
Yes	CMU_Haitian_Creole	-	hat, eng	Medical domain phrases and sentences in English translated into Haitian Creole by Eriksen Translations, Inc.	Curated	http://www.speech.cs.cmu.edu/haitian/text/COPYING
Yes	CC100	Link ; Link	asm, ful, grn, lim, lin, lug, nso, orm, que, roh, srd, ssw, tsn, wol	Web	Crawl from Internet	Statistical Machine Translation at the University of Edinburgh makes no claims of intellectual property on the work of preparation of the corpus. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
Yes	CCNet	Link	Multiple languages	Multiple domains	Datasets from Common Crawl	MIT License
Yes	Clarin (subset)	-	Multiple languages	Multiple domains	Multiple	CC-BY 4.0
Yes	CORP.NCHLT	-	nde, nso, sot, ssw, tsn, tso, ven, xho, zul	Various	Various	Creative Commons Attribution 2.5 South Africa License
Yes	DART	Link	arz, afb, acm, apc, ary	Tweets	Annotators involved also for quality control	Publicly available
Yes	Earthlings	Link	acu, afr, amh, amu, asm, aze, bel, ben, bod, bus, cak, cbc, cbs, cbv, ceb, chv, coe, crn, csb, cym, des, div, dop, epo, eus, fao, gle, glg, guj, gum, gym, hat, hbs, hye, ido, ilo, ipi, isl, jav, kab, kal, kan, kaz, khm, kir, knv, kpr, kur, kyc, kyq, lao, lez, lus, maa, mal, mar, maz, mkd, mlg, mlp, mon, mop, mpx, mri, mya, myy, nep, opm, ori, pan, pck, pir, poh, ptu, pus, que, sab, sah, scn, sin, sja, sme, snd, som, srd, srm, sua, swa, tat, tbc, tbz, tca, tel, tgk, tgl, tpi, tuk, ubu, udm, uig, urd, uzb, wal, wln, wol, yid, yor	Subset of CommonCrawl	Crawl from Internet and filtering	GNU-GPL v.3 License
Yes	Flores200	Link	ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, als_Latn, amh_Ethi, apc_Arab, arb_Arab, arb_Latn, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gaz_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, kaz_Cyrl, kbp_Latn, kea_Latn, khk_Cyrl, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kmr_Latn, knc_Arab, knc_Latn, kon_Latn, kor_Hang, lao_Laoo, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, lvs_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Arab, min_Latn, mkd_Cyrl, mlt_Latn, mni_Beng, mos_Latn, mri_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pbt_Arab, pes_Arab, plt_Latn, pol_Latn, por_Latn, prs_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Olck, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, taq_Latn, taq_Tfng, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zsm_Latn, zul_Latn	Misc	Human annotated	CC-BY-SA 4.0
	FrenchEwe	-	ewe, fra	Parallel sentences	Annotated	CC-BY 4.0
Yes	FFR	Link	fon, fra	Parallel sentences	Clean curated corpora	MIT License and Licence Creative Commons Attribution - No Commercial Use - Sharing under the Same Conditions 4.0 International.
Yes	GiossaMedia	Link ; Link	spa, grn	Parallel sentences, news and social media	Automatically curated	also used by NLLB, freely available
Yes	Glosses	Link	256 languages	Disambiguated glosses	Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata.	CC BY-NC-SA 3.0
Yes	Habibi	Link	arz, afb, acm, ary, apd, apc	Song lyrics	Collected from the Web	Freely available for research purposes
Yes	Hindialect	Link	anp, awa, ben, bgc, bhb, bhd, bho, bjj, bns, bra, gbm, guj, hin, hne, kfq, kfy, mag, mar, mis, mup, noe, pan, raj, san	script all in Devanagari	folksongs	CC BY-NC-SA 4.0
Yes	HornMT	-	aar, amh, eng, orm, som, tir	multi-way parallel corpus		CC-BY 4.0
Yes	IITB	Link	eng, hin	Collected from different sources and corpora	Automatically collected	CC-BY-NC 4.0
Yes	Indiccorp	Link	asm, ben, guj, kan, mal, mar, ory, pan, tel	Web	Web crawled	CC BY-NC-SA 4.0
Yes	isiZulu	-	zul, eng	English sentences, sampled from News Crawl datasets that were translated into isiZulu	Annotated	CC BY 4.0
Yes	JESC	Link	eng, jpn	Movie and tv subtitles	Web-crawled	CC-BY-NC 4.0
Yes	JParaCrawl	Link	eng, jpn	Various domains	Web crawled, automatically aligned	Custom License
No	JW	-		Religious	Web crawled	Private
Yes	KinyaSMT	Link	kin, eng	Bible+other	Automatically translated	GNU General Public License v3.0
Yes	LeipzigData	Link	aar, ace, ach, aka, als, als-al, als-sqi, anw, arg, arz, asm, ast, aym, aze, azj, azj-az, bak, bam, ban, ban-id, bar, bcl, bem, bew, bih, bik, bjn, bjn-id, bod, bos, bpy, bua, bug, cdo, ceb, che, chv, ckb, cos, csb, diq, div, div-mv, dsb, dyu, ekk, emk, eml, ewe, ext, fao, fao-fo, fon, frr, fuc, ful, gan, glk, glv, gom, grn, gsw, gsw-ch, guj, hat, hat-ht, hbs, hbs-rs, hif, hil, hsb, ibb, ibo, ido, ile, ilo, ina, kab, kal, kal-gl, kas, kbd, kde, kea, khk, kik, kin, kng, knn, knn-in, koi, kom, kon, krc, ksh, ksw, lad, lgg, lim, lim-nl, lin, lmo, ltz, ltz-lu, lug, lup, lus, lus-in, lvs, mad, mad-id, mai, mhr, min, min-id, mkw, mlt, mos, mri, mri-nz, mrj, mwl, myv, mzn, nan, nap-tara, nav, nbl, ndo, nds, nds-nl, new, ngl, nno, nno-no, nob, nob-com, nob-no, nso, nso-za, nya, nyn, oci, oci-fr, orm, oss, pag, pam, pap, pcm, pfl, plt, pms, pnb, pnt, pus, roh, roh-ch, rom, rue, rue-ua, run, sah, san, scn, sco, seh, sgs, sin, skr, sme, sme-no, smi, sna, sna-zw, snd, snk, som, sot, sot-za, srd, ssw, ssw-za, suk, sun, sun-id, sus, swa, swh, szl, tat, tel, tem, tgk, tgk-tj, tgk-uz, tgl, tir, tiv, tsn, tsn-bw, tsn-za, tso, tso-za, tuk, tuk-tm, tum, tyv, udm, uig, uzb, uzn-uz, vec, vec-br, vec-hr, ven, ven-za, vls, vol, vro, war, wln, wol, wuu, xmf, ydd, yid, yor, zea, zha, zsm, zul, zul-za	Wikipedia, News, WebCrawl corpora of different years	Crawl from Internet	CC BY-NC-SA 3.0
Yes	Lindat	-	Multiple languages	Multiple	Multiple	CC-BY-NC 4.0
Yes	Lingala_Song_Lyrics	-	fra, lin	Scraped from the website www.ndombolo.co, which has almost 30 songs in Lingala with their French translations	Web scraped	also used by NLLB, freely available
	Lyrics	-	aar, abq, adq, ady, agx, aih, ain, aka, akk, ale, ami, ang, arg, arn, arp, asm, ast, aym, bak, bam, bci, bft, bfy, bgc, bhb, bho, bik, bis, bns, bod, bsk, bvd, bya, cab, cbk, cha, che, chg, cho, chr, chv, ckm, cnr, com, cor, cre, crh, csb, ctg, dak, dng, doi, dua, dum, dyu, dzo, enm, evn, ewe, ewo, ext, fao, fij, fon, frm, fro, fur, gag, gbm, gil, gla, glg, glk, gmh, goh, gon, got, gqn, grc, grt, hif, hil, hlb, hne, hop, hsb, ido, ina, inh, ist, izh, jam, jbo, kab, kas, kbd, kca, kdr, kea, kfy, kha, kik, kin, kio, kir, kjh, kmb, kok, kom, kon, krc, krl, kru, ksh, kum, lad, lbj, ldd, lij, lin, lki, lkt, lmo, ltg, lzh, lzz, mag, mah, mai, mbx, mby, min, mjw, mnc, mni, mnk, mns, moh, mos, mrg, mus, mwl, mxi, nan, nap, nav, nds, new, nio, niu, nog, non, nys, oci, odt, ohu, orm, ory, ota, pag, pap, pau, pcd, pcm, pdt, pjt, pli, pnt, pot, que, qya, raj, rar, rhg, roh, rom, rop, rtm, rup, sag, sah, sat, scn, sco, sdc, sel, sgh, sgs, sjn, skr, slr, smn, srn, ssw, sux, syl, szl, tah, tat, tbh, tcy, tet, tir, tlh, tpi, tsn, tuk, twe, twi, tyv, tzo, udm, uig, uki, ulk, unr, vec, ven, vep, vot, wbl, wol, wym, xal, xmf, xno, xxb, yux, zap, zha, zpu, zun, zza	Song lyrics	Web-crawled
Yes	MaCoCu	Link	mlt		Crawl from Internet and filtering	CC0 - No Rights Reserved
Yes	Makerere MT Corpus	-	lug, eng	Parallel sentences	Annotated	CC BY 4.0
Yes	Masakhane MT Corpus	-	African languages	Multiple domains	Multiple	MIT License
Yes	Mburisano_Covid	-	afr, eng, nde, sot, ssw, tsn, tso, ven, xho, zul	Corpus with limited domain	Manually translated	CC BY 3.0
Yes	MC4	Link	aze, ceb, cos, fil, guj, hat, haw, hmn, ibo, ltz, mlt, mri, nya, smo, sna, sot, sun, tgk, yor, zul	Web	Crawl from Internet	ODC-By
Yes	Menyo20K	Link	yor, eng	Parallel, multidomain	Various sources: news articles (JW), TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators	Non-commercial use
Yes	Minangkabau corpora	Link	min_Latn, ind	Parallel sentences	Annotated	MIT License
Yes	MoT	Link	kin, lin, nde, orm, bod, tir	Data collected from Voice of America (VOA) news websites		MIT License
Partially	MTData	Link	Multiple languages	Various sources		Multiple licenses (check spreadsheet)
Yes	Nart/abkhaz	-	abk	multiple sources		Creative Commons Universal Public Domain License
Yes	Ndc without informant codes		dan, fao, isl, ovd, swe	Nordic Dialect Corpus comprises recorded speech data from the Nordic countries, in languages that belong to the North Germanic language family.	Various	CC BY-NC-SA 4.0
Yes	NLLB_seed	Link	ace_Arab, ace_Latn, ary, arz, bam, ban, bho, bja_Arab, bjn_Latn, bug, crh, dik, dzo, fur, fuv, grn, hne, kas_Latn, kas_Deva, knc_Arab, knc_Latn, lij, lim, lmo, ltg, mag, mni, mri, nus, prs, pbt, scn, shn, srd, szl, taq_Tfng, taq_Latn, tzm, vec	Collection of topics in different fields of knowledge and human activity	Professionally-translated sentences in the Wikipedia domain	CC-BY-SA 4.0
	OfisPublik	Link ; Link	bre	Texts from the Ofis Publik ar Brezhoneg (Breton Language Board) provided by Francis Tyers
Partially	OPUS	Link		Collection of translated texts from the web	Automatically collected	Multiple licenses (check spreadsheet)
Yes	OSCAR	Link	als, arg, arz, asm, ast, ava, aze, bak, bho, bod, bos, bpy, bxr, ceb, che, chv, ckb, cor, diq, div, dsb, eml, gom, grn, guj, hbs, hsb, ido, ilo, ina, jbo, kom, krc, lez, lim, lmo, ltz, mai, mhr, min, mlt, mrj, mzn, nah, nds, new, nno, oci, oss, pms, pnb, que, sah, scn, sun, tat, tgk, tuk, vol, war, wln, wuu, xal, xmf, yor	Web crawled	Crawl from Internet and filtering	CC BY 4.0
Yes	ParaCrawl (subset)	Link	eng, ukr	Various domains	Web-crawled	CC0
Upon direct request	Parallel Bible Corpus	Link		Religious	Automatically collected	You can contact Michael Cysouw, Philipps University of Marburg, to request access to the PBC for academic purposes.
Yes	Parallel Corpora for Ethiopian Languages	Link	amh, orm, tir	Parallel sentences, religious domain	Automatically curated	CC-BY 4.0
Yes	Phontron	-	eng, jpn	Wikipedia	Annotated	CC-BY-SA 3.0
Yes	QADI	Link	afb, abv, arq, arz, acm, apc, ary, acx, ajp, apd, aeb	Tweets	Tweets	Apache License 2.0
Yes	Quechua-IIC	Link	que	multiple sources		Apache License 2.0
Yes	Shami	Link	apc, ajp	Several topics from regular conversations such as politics, education, society, health care, house keeping and others	Automatic and manual approaches	Apache License 2.0
Yes	SLI_GalWeb.1.0	Link	glg	Galician political party, newspaper, government official website	Crawling data from many Web data sources	CC BY 4.0
Yes	Stanford NLP: nmt	Link	eng, deu, cze
Partially	StatMT	-	Multiple languages	Various sources	Various sources	Multiple licenses (check spreadsheet)
Yes	Tatoeba	-	abk, acm, ady, afb, afh, afr, aii, ain, ajp, akl, aln, alt, amh, ang, aoz, apc, ara, arg, arq, ary, arz, asm, ast, avk, awa, ayl, aym, aze, bak, bal, bam, ban, bar, bcl, bel, ben, ber, bfz, bho, bis, bjn, bod, bom, bos, bre, brx, bua, bul, bvy, bzt, cat, cay, cbk, ceb, ces, cha, che, chg, chn, cho, chr, chv, cjy, ckb, ckt, cmn, cmo, cor, cos, cpi, crh, crk, crs, csb, cycl, cym, cyo, dan, deu, diq, div, dng, drt, dsb, dtp, dws, egl, ell, emx, eng, enm, epo, est, eus, evn, ewe, ext, fao, fij, fin, fkv, fra, frm, fro, frr, fry, fuc, fur, fuv, gaa, gag, gan, gbm, gcf, gil, gla, gle, glg, glv, gom, gos, got, grc, grn, gsw, guc, guj, hak, hat, hau, haw, hax, hbo, hdn, heb, hif, hil, hin, hnj, hoc, hrv, hrx, hsb, hsn, hun, hye, iba, ibo, ido, igs, iii, ike, ile, ilo, ina, ind, isl, ita, izh, jam, jav, jbo, jdt, jpa, jpn, kaa, kab, kal, kam, kan, kas, kat, kaz, kek, kha, khm, kin, kir, kiu, kjh, klj, kmr, knc, koi, kor, kpv, krc, krl, ksh, kum, kxi, kzj, laa, lad, lao, lat, ldn, lfn, lij, lim, lin, lit, liv, lkt, lld, lmo, lou, ltg, ltz, lug, lut, lvs, lzh, lzz, mad, mah, mai, mal, mar, max, mdf, mfa, mfe, mgm, mhr, mic, mik, min, mkd, mlg, mlt, mnc, mni, mnr, mnw, moh, mon, mri, mrj, mus, mvv, mwl, mww, mya, myv, nah, nan, nau, nav, nch, nds, new, ngt, ngu, niu, nld, nlv, nnb, nno, nob, nog, non, nov, npi, nst, nus, nya, nys, oar, oci, ofs, oji, ood, ori, orv, osp, oss, osx, ota, otk, pag, pal, pam, pan, pap, pau, pcd, pdc, pes, pfl, phn, pli, pms, pnb, pol, por, ppl, prg, pus, quc, que, qxq, qya, rap, rel, rhg, rif, roh, rom, ron, rue, run, rus, ryu, sag, sah, san, sat, scn, sco, sdh, sgs, shi, shs, shy, sin, sjn, skr, slk, slv, sma, sme, smo, sna, snd, som, sot, spa, sqi, srd, srn, srp, ssw, stq, sun, sux, swc, swe, swg, swh, syc, szl, tah, tam, tat, tel, tet, tgk, tgl, tha, thv, tig, tir, tkl, tlh, tly, tmr, tmw, toi, tok, ton, tpi, tpw, tsn, tso, tts, tuk, tur, tvl, tyv, tzl, udm, uig, ukr, umb, urd, urh, uzb, vec, vep, vie, vol, vro, war, wln, wol, wuu, xal, xho, xmf, xqa, yid, yor, yua, yue, zea, zgh, zlm, zsm, zul, zza	180922 version	Voluntary contributions of thousands of members	CC-BY 2.0 FR, CC0 1.0 Universal (more info)
Yes	TeDDi	Link	abk, aey, amp, ape, apu, arn, arz, ayz, bmi, bsk, bsn, cha, ckt, crk, dgz, dni, fij, gni, gry, gug, gyd, hae, hau, hix, hnj, imn, jac, kal, kan, kew, kgo, khk, kio, kjq, kut, laj, lue, lvk, mig, mph, mya, myh, myp, mzh, naq, ote, pav, plt, pwn, qvi, ram, rap, rma, sag, spp, swh, tiw, tml, tzm, vma, wba, wic, wyb, xsu, yad, yaq, yor, zoc, zul	Collection of different sources (see paper)	Language identification and filtering	CC BY-NC-SA 4.0
Yes	TICO	Link	amh, ara, ben, ckb, din, eng, fas, fra, fuv, hau, hin, ind, khm, knc, kmr, lug, lin, mar, msa, mya, npi, nus, orm, prs, por, pus, rus, kinn, som, spa, swh, tam, tir_et, tir_er, tgl, urd, zho, zul	COVID-19 materials for a variety of the world’s languages	Annotated	CC0 1.0 Universal
Yes	TIL	Link	aze, bak, chv, eng, kaz, kir, rus, tuk, tur, tat, uig, uzb	Large-scale parallel corpus combining most of the public datasets for 22 Turkic languages	Automatically collected	CC BY-NC-SA 4.0
Yes	Tilde	Link		Various domains	Automatically curated	CC-BY 4.0
Yes	W2C	-	122 languages	Corpus	Automatically collected from wikipedia and the web	CC BY-SA 3.0
Yes	WAT 2020	https://arxiv.org/abs/2008.04550	Asian languages	Multiple domains	Collection of corpora	CC-BY-NC 4.0
Yes	Wikipedia	-	aar, abk, ace, ady, aka, als, ang, arc, arg, arz, asm, ast, atj, ava, aym, aze, bak, bam, bar, bcl, ben, bih, bis, bjn, bod, bos, bpy, bre, bug, bul, bxr, cbk, cdo, ceb, cha, che, cho, chr, chu, chv, chy, ckb, cor, cos, cre, crh, csb, din, diq, div, dsb, dty, dzo, eml, ewe, ext, fao, fij, frp, frr, ful, fur, gag, gan, glg, glk, glv, gom, gor, got, grn, guj, hak, hat, haw, hbs, hif, hmo, hsb, ibo, ido, iii, iku, ile, ilo, ina, inh, ipk, isl, jam, jbo, jpn, kaa, kab, kal, kas, kbd, kbp, kik, kin, koi, kom, kon, krc, ksh, kua, lad, lbe, lez, lfn, lij, lim, lin, lmo, lrc, ltg, ltz, lug, lzh, mah, mai, mdf, mhr, min, mlt, mri, mrj, mus, mwl, myv, mzn, nah, nan, nap, nau, nav, ndo, nds, new, nno, nov, nrm, nso, nya, oci, olo, orm, oss, pag, pam, pan, pap, pcd, pdc, pfl, pih, pli, pms, pnb, pnt, que, rmy, roh, rue, run, rup, rus, sag, sah, sat, scn, sco, sgs, sme, smo, sna, sot, srd, srn, ssw, stq, sun, szl, tah, tat, tcy, tet, tgk, tir, ton, tpi, tsn, tso, tuk, tum, twi, tyv, udm, vec, ven, vep, vls, vol, vro, war, wln, wol, wuu, xal, xmf, yor, yue, zea, zha, zul	20221001	Wikipedia	CC BY-NC-SA 3.0
Yes	WikiMatrix	Link	85 languages	Wikipedia	Automatically curated	CC-BY-SA
Yes	Workshop on NER for South and South East Asian Languages	Link	ben, ori, urd		Annotated	Data can be freely used for non-profit research work under the Creative Commons License.
	XhosaNavy	Link	xho, eng	South African Navy parallel corpus
Yes	XLSum	Link	aze, guj, ibo, orm, run, tir, yor	BBC		CC BY-NC-SA 4.0
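
Because the first column records availability, the overview can be filtered programmatically. The sketch below assumes the table has been saved as tab-separated text with the header row shown above. The function name `corpora_by_availability` and the three-row sample are illustrative and not part of the repository.

```python
import csv
import io


def corpora_by_availability(tsv_text, status):
    """Return dataset names whose 'Available' column equals `status`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows with an empty availability cell are treated as not matching any status.
    return [row["Dataset"] for row in reader if (row["Available"] or "").strip() == status]


# Three rows copied from the overview table, used only as sample input.
SAMPLE = (
    "Available\tDataset\tRelated Papers\tLanguages\tDomain / Notes\t"
    "Data collection / Verification method\tLicense\n"
    "Partially\t1000Langs\t-\t1500 languages\tReligious\tWeb-crawled\tApache License 2.0\n"
    "Yes\tAkuapem\t-\taka\tParallel sentences\tVerified by native speakers\tCC-BY 4.0\n"
    "No\tJW\t-\t\tReligious\tWeb crawled\tPrivate\n"
)

print(corpora_by_availability(SAMPLE, "Yes"))
```

Quoting is disabled so that apostrophes and quotation marks inside cells are read literally.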

Training and Evaluation Code

Prerequisites

We use two settings due to package conflict:

Major: Python 3.9, requirements.txt
Evaluation: Python 3.6, evaluation/requirements.txt

Data preparation

To train both the tokenizer and the model of Glot500-m, we need to prepare a balanced corpus covering all languages.

Go to 'preprocessing/' and run:

bash merge_files.sh

Specify --data_directory with the directory containing the data for each language and --save_directory with the directory where the merged file should be saved. For Glot500, we set --scale 1 for training the tokenizer and --scale 30 for continued pretraining of the model.
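
This section does not spell out how merge_files.sh balances languages or what exactly --scale multiplies. As a rough illustration of what balancing can mean, the sketch below uses exponentially smoothed sampling, a common technique in multilingual pretraining in which language i is sampled with probability proportional to n_i ** alpha. This is an assumed technique, not a description of the script. The function name, the value alpha=0.3, and the sentence counts are all hypothetical.

```python
def smoothed_sampling_probs(sentence_counts, alpha=0.3):
    """Map per-language sentence counts to sampling probabilities p_i ∝ n_i ** alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    weights = {lang: count ** alpha for lang, count in sentence_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical counts: one high-resource and one low-resource language.
counts = {"high_resource": 1_000_000, "low_resource": 30_000}

raw_share = counts["low_resource"] / sum(counts.values())
smoothed_share = smoothed_sampling_probs(counts)["low_resource"]

# With alpha < 1 the low-resource language is sampled more often than its raw share.
print(f"raw share: {raw_share:.3f}, smoothed share: {smoothed_share:.3f}")
```

Setting alpha to 1.0 reproduces the raw proportions, while smaller values flatten the distribution toward uniform.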

Vocabulary Extension

Go to 'tokenization/' and run:

bash train.sh

Specify --input_fname with the merged data file for training the tokenizer and --save_directory with the directory for saving the final tokenizer.

Continued Pretraining

Go to 'modeling/' and run:

bash train_bash.sh

Specify train_file with the merged data file for continued pretraining of the model, --tokenizer_name with the trained Huggingface-style tokenizer, --output_dir with the directory for saving logs and checkpoints during training, and --cache_dir with the directory for saving the Huggingface cache.

Evaluation

Download Datasets

For downloading datasets for NER, POS, and Sentence Retrieval Tatoeba, first go to 'evaluation/download_data' and create a download folder with mkdir -p download. You then need to manually download panx_dataset (for NER) from here (note that it will download as AmazonPhotos.zip) to the download directory. Finally, run the following command under 'evaluation/download_data' to download and process the datasets:

bash download_data.sh
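
Since the NER archive has to be fetched by hand, a small pre-flight check can catch a missing folder or archive before the script runs. The sketch below relies only on the paths named above (the download folder and AmazonPhotos.zip). The function name `missing_download_inputs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_download_inputs(download_data_dir):
    """Return a list of problems that would block `bash download_data.sh`."""
    download_dir = Path(download_data_dir) / "download"
    problems = []
    if not download_dir.is_dir():
        problems.append(f"missing folder: {download_dir} (create it with mkdir -p download)")
    elif not (download_dir / "AmazonPhotos.zip").is_file():
        archive = download_dir / "AmazonPhotos.zip"
        problems.append(f"missing archive: {archive} (manual panx_dataset download)")
    return problems


if __name__ == "__main__":
    # Run from the repository root; an empty output means both inputs are in place.
    for problem in missing_download_inputs("evaluation/download_data"):
        print(problem)
```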

To download the datasets for Sentence Retrieval Bible and Round-Trip Alignment, contact Michael Cysouw, Philipps University of Marburg, to request access to the Parallel Bible Corpus for academic purposes.

Sequence Labeling

For NER evaluation, go to 'evaluation/tagging' and run:

bash evaluate_ner.sh

Set DATA_DIR to the directory of the NER dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.

For POS evaluation, go to 'evaluation/tagging' and run:

bash evaluate_pos.sh

Set DATA_DIR to the directory of the POS dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.

Sentence Retrieval

For Sentence Retrieval Tatoeba evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_tatoeba.sh

Set DATA_DIR to the directory of the Sentence Retrieval Tatoeba dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.

For Sentence Retrieval Bible evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_bible.sh

Set DATA_DIR to the directory of the Sentence Retrieval Bible dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.
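
The Tatoeba and Bible retrieval scores are reported as top-10 accuracy. As a rough illustration of that metric, and not the repository's evaluation script, the sketch below counts a query as correct when its parallel sentence is among the k candidates with the highest cosine similarity. It assumes query i is parallel to candidate i. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(queries, candidates, k=10):
    """Fraction of queries whose gold candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine(query, candidates[j]),
            reverse=True,
        )
        hits += i in ranked[:k]
    return hits / len(queries)


# Toy embeddings: the first two queries align with their gold candidates, the third does not.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
print(top_k_accuracy(queries, candidates, k=1))
```

In practice the vectors would be sentence embeddings derived from the model's hidden states, and k would be 10 to match the reported metric.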

Round-Trip Alignment

For Round-Trip Alignment evaluation, go to 'evaluation/round-trip' and run:

python evaluate_roundtrip.py

Citation

If you find our model, data or the overview of data useful for your research, please cite:

@inproceedings{imanigooghari-etal-2023-glot500,
	title        = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
	author       = {ImaniGooghari, Ayyoob  and Lin, Peiqin  and Kargaran, Amir Hossein  and Severini, Silvia  and Jalili Sabet, Masoud  and Kassner, Nora  and Ma, Chunlan  and Schmid, Helmut  and Martins, Andr{\'e}  and Yvon, Fran{\c{c}}ois  and Sch{\"u}tze, Hinrich},
	year         = 2023,
	month        = jul,
	booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	publisher    = {Association for Computational Linguistics},
	address      = {Toronto, Canada},
	pages        = {1082--1117},
	url          = {https://aclanthology.org/2023.acl-long.61}
}

Acknowledgements

This repository is built on top of transformers and xtreme.
