MozArt: a multilingual dataset of parallel cloze examples with annotator demographics.
The data is provided as one JSONL file per language, {lang}_data_with_annotations.jsonl, where {lang} is the ISO 639-1 code of the language (en, es, de, fr). Each line corresponds to one sentence completed by one annotator and contains the following fields:
- s_id: str, sentence identifier
- text: str, sentence text
- true_mask: str, word masked in the sentence
- tag: str, part-of-speech tag of the masked word
- mask: str, word given by the annotator to fill the gap
- u_id: str, annotator identifier
- native: int, binary variable; 1 if the annotator is a native speaker of the target language, 0 otherwise
- nonnative: int, binary variable; 1 if the annotator is not a native speaker of the target language, 0 otherwise
- male: int, binary variable; 1 if the annotator self-reported as male, 0 otherwise
- female: int, binary variable; 1 if the annotator self-reported as female, 0 otherwise
- age: int, annotator's age at the time of completion
- country_of_birth: str, annotator's country of birth
- current_country_of_residence: str, annotator's country of residence at the time of completion
- first_language: str, annotator's first language
- fluent_languages: List[str], languages the annotator reported being fluent in
- nationality: str, annotator's nationality
- time_taken: float, time taken to complete the task (ms)
Note that attributes the annotator did not voluntarily provide are filled with null.
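As a quick-start, the snippet below sketches one way to load a language file and split completions by a demographic attribute. It is a minimal sketch: the file name follows the {lang}_data_with_annotations.jsonl pattern and the field names follow the schema above; the path and variable names are illustrative.

import json

# Read all completions for one language; each line is a standalone JSON object.
with open("en_data_with_annotations.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Voluntarily omitted attributes are stored as null, which json.loads maps to
# None, so guard demographic filters against missing values.
native = [r for r in records if r.get("native") == 1]
nonnative = [r for r in records if r.get("nonnative") == 1]

print(f"{len(records)} completions: {len(native)} native, {len(nonnative)} non-native")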
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
If you use our dataset, please cite our paper (COLING 2022):
@inproceedings{cabello-piqueras-sogaard-2022-pretrained,
title = "Are Pretrained Multilingual Models Equally Fair across Languages?",
author = "Cabello Piqueras, Laura and
S{\o}gaard, Anders",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.318",
pages = "3597--3605",
abstract = "Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalisation. However, with their wide-spread application in the wild and downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt {--}mBERT, XLM-R, and mT5{--} and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., exhibiting near-equal risk for Spanish, but high levels of disparity for German.",
}
Main resources:
- Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
- Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.
- The shared task of the NAACL 2006 Workshop on Statistical Machine Translation.