This dataset was created by translating the SNLI dataset into Turkish with Amazon Translate. SNLI consists of English sentence pairs and is used for natural language inference tasks. The official website for the dataset can be found at NLI-TR.
This dataset consists of 570K machine-translated Turkish sentence pairs. The sentences were originally written in English by humans. The pairs are manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
Each pair is annotated with annotator labels, a caption ID, a gold label, and a pair ID. The annotator labels are the collection of annotations provided by different annotators. The caption ID is a unique identifier for each sentence1, taken from the original Flickr30k example. The gold label is the label chosen by the majority of annotators; where no majority exists, it is '-'. The pair ID is a unique identifier for each sentence pair.
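To make the gold-label convention concrete, here is a minimal sketch of the majority vote in Python; the helper name is hypothetical and not part of any dataset tooling:

```python
from collections import Counter

def majority_gold_label(annotator_labels):
    """Return the label a strict majority of annotators chose, or '-' if none.

    With five annotators this requires at least three matching labels,
    mirroring how SNLI's gold_label field is assigned.
    """
    label, votes = Counter(annotator_labels).most_common(1)[0]
    return label if votes > len(annotator_labels) / 2 else "-"

# Annotations from the example record shown below:
print(majority_gold_label(
    ["neutral", "contradiction", "contradiction", "neutral", "neutral"]
))  # -> neutral (3 of 5 annotators agree)
```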
Examples from the original SNLI dataset and the SNLI-TR dataset are given below.
An example from SNLI:
```json
{
"annotator_labels":[ "neutral", "contradiction", "contradiction", "neutral", "neutral" ],
"captionID":"2677109430.jpg#1",
"gold_label":"neutral",
"pairID":"2677109430.jpg#1r1n",
"sentence1":"This church choir sings to the masses as they sing joyous songs from the book at a church.",
"sentence1_binary_parse":"( ( This ( church choir ) ) ( ( ( sings ( to ( the masses ) ) ) ( as ( they ( ( sing ( joyous songs ) ) ( from ( ( the book ) ( at ( a church ) ) ) ) ) ) ) ) . ) )",
"sentence1_parse":"(ROOT (S (NP (DT This) (NN church) (NN choir)) (VP (VBZ sings) (PP (TO to) (NP (DT the) (NNS masses))) (SBAR (IN as) (S (NP (PRP they)) (VP (VBP sing) (NP (JJ joyous) (NNS songs)) (PP (IN from) (NP (NP (DT the) (NN book)) (PP (IN at) (NP (DT a) (NN church))))))))) (. .)))",
"sentence2":"The church has cracks in the ceiling.",
"sentence2_binary_parse":"( ( The church ) ( ( has ( cracks ( in ( the ceiling ) ) ) ) . ) )",
"sentence2_parse":"(ROOT (S (NP (DT The) (NN church)) (VP (VBZ has) (NP (NP (NNS cracks)) (PP (IN in) (NP (DT the) (NN ceiling))))) (. .)))"
}
```
The corresponding Turkish translation in SNLI-TR:
```json
{
"annotator_labels": [ "neutral", "contradiction", "contradiction", "neutral", "neutral" ],
"captionID": "2677109430.jpg#1",
"gold_label": "neutral",
"pairID": "2677109430.jpg#1r1n",
"sentence1": "Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.",
"sentence2": "Kilisenin tavanında çatlaklar var."
}
```
field | dtype |
---|---|
annotator_labels | string list |
captionID | string |
gold_label | string |
pairID | string |
sentence1 | string |
sentence2 | string |
Training pairs | Validation pairs | Test pairs | Total pairs |
---|---|---|---|
550,152 | 10,000 | 10,000 | 570,152 |
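For reference, a short sketch of reading such records in Python. The file name below is an assumption (the data is distributed in SNLI-style JSON Lines format, one record per line with the fields listed above); adjust it to the actual release file:

```python
import json
from collections import Counter

# Assumed file name for the training split; SNLI-TR follows the
# SNLI convention of one JSON record per line.
path = "snli_tr_1.0_train.jsonl"

label_counts = Counter()
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        label_counts[record["gold_label"]] += 1

# Expect roughly balanced entailment / contradiction / neutral counts,
# plus a small number of '-' records that lack a majority label.
print(label_counts)
```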
Large annotated datasets in NLP are overwhelmingly in English, which is an obstacle to progress in other languages. Obtaining new annotated resources for each task in each language would be prohibitively expensive. Since modern machine translation systems are robust, they can instead be used to translate existing English NLI datasets into Turkish.
The first sentences (premises) were collected from the Flickr30k image captioning dataset, and the second sentences (hypotheses) were written in English by crowdworkers. Both sentences were then translated into Turkish using Amazon Translate.
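The translation step can be sketched with the AWS SDK for Python. This is an illustrative snippet using boto3's Translate client, not the authors' exact pipeline; the region choice, credential setup, and lack of batching are assumptions:

```python
import boto3

# Amazon Translate client; assumes AWS credentials are configured.
# The region below is an arbitrary choice for illustration.
translate = boto3.client("translate", region_name="us-east-1")

def to_turkish(sentence: str) -> str:
    """Translate one English sentence to Turkish via Amazon Translate."""
    response = translate.translate_text(
        Text=sentence,
        SourceLanguageCode="en",
        TargetLanguageCode="tr",
    )
    return response["TranslatedText"]

# Translate both sides of the example pair shown above.
pair = {
    "sentence1": "This church choir sings to the masses as they sing "
                 "joyous songs from the book at a church.",
    "sentence2": "The church has cracks in the ceiling.",
}
translated = {key: to_turkish(text) for key, text in pair.items()}
print(translated)
```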
SNLI used Amazon Mechanical Turk for data annotation. In each individual task (each HIT), a worker was presented with premise scene descriptions from a pre-existing corpus and asked to supply a hypothesis for each of the three labels (entailment, neutral, and contradiction), forcing the data to be balanced among these classes. About 2,500 workers contributed.
A detailed description of the dataset annotations can be found at the SNLI Website.
There are two aspects to measuring the quality of the dataset:
- Quality of the SNLI dataset
- Quality of the translation from SNLI to SNLI-TR
The quality of the SNLI dataset is discussed on the SNLI Website; the quality of the translation is discussed here.
The validation of the translation process is explained in the paper as follows:
"For expert evaluation, we grouped the translations into example sets of four sentences. We distributed the sets to the experts so that each set (and sentence) was examined by five randomly chosen experts and each expert co-examined approximately the same number of sets with each other expert. Each expert evaluated the translation by (i) grading the translation quality between 1 and 5 (inclusive; 5 the best) and (ii) checking if the translation altered the semantic relation. We distributed an annotation guide to the team to standardize the criteria. In total, 500 example sets (2,000 translated sentences) were examined by five experts, yielding 10,000 annotations. The annotation-level analysis compares each new annotation with the gold label on the original English example. The majority-level analysis compares only the majority label (if any) of the five new annotations with the English gold label"
The results of the validation process are given in the following table; translation quality is reported as the mean grade with the standard deviation in parentheses:
Split | Translation Quality | Annotation-level Label Consistency | Majority-level Label Consistency |
---|---|---|---|
Train | 4.55 (0.78) | 92.62% | 98.67% |
Dev | 4.46 (0.90) | 90.53% | 95.33% |
Test | 4.45 (0.86) | 87.87% | 94.00% |
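To make the two consistency measures concrete, here is a small sketch that computes both from per-example expert annotations. The data layout and the handling of examples without a majority are assumptions for illustration, not the authors' evaluation code:

```python
from collections import Counter

def label_consistency(examples):
    """Compute (annotation-level, majority-level) label consistency.

    `examples` is a list of (gold_label, expert_labels) pairs, where
    expert_labels are the five re-annotations of the translated pair.
    """
    ann_hits = ann_total = maj_hits = maj_total = 0
    for gold, expert_labels in examples:
        # Annotation level: compare every individual annotation to gold.
        ann_hits += sum(label == gold for label in expert_labels)
        ann_total += len(expert_labels)
        # Majority level: compare only the majority label, if one exists.
        top, votes = Counter(expert_labels).most_common(1)[0]
        if votes > len(expert_labels) / 2:
            maj_hits += top == gold
            maj_total += 1
    return ann_hits / ann_total, maj_hits / maj_total

# Toy example: 4 of 5 annotations match gold, and the majority matches.
examples = [("neutral", ["neutral", "neutral", "neutral",
                         "contradiction", "neutral"])]
print(label_consistency(examples))  # -> (0.8, 1.0)
```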
Published by Emrah Budur, Rıza Özçelik, Tunga Güngör, and Christopher Potts.
Please cite the following paper (arXiv) if you find this dataset useful:
Emrah Budur, Rıza Özçelik, Tunga Güngör and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
@inproceedings{budur-etal-2020-data,
title = "Data and {R}epresentation for {T}urkish {N}atural {L}anguage {I}nference",
author = {Budur, Emrah and
{\"O}z{\c{c}}elik, R{\i}za and
G{\"u}ng{\"o}r, Tunga and
Potts, Christopher},
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.662",
doi = "10.18653/v1/2020.emnlp-main.662",
pages = "8253--8267"
}
Uploaded and documented by Arda Goktogan: ardagoktogan gmail com