Skip to content

Latest commit

 

History

History
135 lines (92 loc) · 7.02 KB

TDD-C-202202-CC-002.md

File metadata and controls

135 lines (92 loc) · 7.02 KB

SNLI-TR

This dataset is created by translating SNLI using Amazon Translate. The SNLI dataset is created using English sentences and it is used for natural language inference tasks. The official website for the dataset can be found at NLI-TR.

Dataset Details

This dataset consists of 570K machine-translated Turkish sentences pairs. The sentences were originally human-written in English. The pairs are manually labeled for balanced classification with the labels entailment, contradiction, and neutral.

Each pair is annotated with an annotator label, caption id, gold label, and pair id. Annotator labels stand for the collection of annotations decided by different annotators. Caption ID is a unique identifier for each sentence1 from the original Flickr30k example. The gold label is the label chosen by the majority of annotators. Where no majority exists, this is ‘-’. Pair id is a unique id for each sentences pair.

Samples

Examples from the original SNLI dataset and SNLI-TR dataset are given below.

An example from SNLI:

{
"annotator_labels":[ "neutral", "contradiction", "contradiction", "neutral", "neutral" ],
"captionID":"2677109430.jpg#1",
"gold_label":"neutral",
"pairID":"2677109430.jpg#1r1n",
"sentence1":"This church choir sings to the masses as they sing joyous songs from the book at a church.",
"sentence1_binary_parse":"( ( This ( church choir ) ) ( ( ( sings ( to ( the masses ) ) ) ( as ( they ( ( sing ( joyous songs ) ) ( from ( ( the book ) ( at ( a church ) ) ) ) ) ) ) ) . ) )",
"sentence1_parse":"(ROOT (S (NP (DT This) (NN church) (NN choir)) (VP (VBZ sings) (PP (TO to) (NP (DT the) (NNS masses))) (SBAR (IN as) (S (NP (PRP they)) (VP (VBP sing) (NP (JJ joyous) (NNS songs)) (PP (IN from) (NP (NP (DT the) (NN book)) (PP (IN at) (NP (DT a) (NN church))))))))) (. .)))",
"sentence2":"The church has cracks in the ceiling.",
"sentence2_binary_parse":"( ( The church ) ( ( has ( cracks ( in ( the ceiling ) ) ) ) . ) )",
"sentence2_parse":"(ROOT (S (NP (DT The) (NN church)) (VP (VBZ has) (NP (NP (NNS cracks)) (PP (IN in) (NP (DT the) (NN ceiling))))) (. .)))"
}

The corresponding Turkish translation in SNLI-TR:

{
"annotator_labels": [ "neutral", "contradiction", "contradiction", "neutral", "neutral" ],
"captionID": "2677109430.jpg#1",
"gold_label": "neutral",
"pairID": "2677109430.jpg#1r1n",
"sentence1": "Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.",
"sentence2": "Kilisenin tavanında çatlaklar var."
}

Fields

field dtype
annotator_labels string list
captionID string
gold_label string
pairID string
sentence1 string
sentence2 string

Splits

Training pairs Validation pairs Test pairs Total pairs
550152 10000 10000 570152

Dataset Creation

Curation Rationale

Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. Since the machine translation systems are robust, they can be used for translating English NLI datasets to Turkish.

Data Source

The first sentences are collected from the Flickr30k image captioning dataset and the second sentences are human-written in English. Then both sentences are translated to Turkish using Amazon Translate.

Annotations

SNLI used Amazon Mechanical Turk for data annotation. In each individual task (each HIT), a worker was presented with premise scene descriptions from a pre-existing corpus, and asked to supply hypotheses for each of our three labels—entailment, neutral, and contradiction—forcing the data to be balanced among these classes. About 2,500 workers contributed.

The detailed description of the dataset annotations can be found at the SNLI Website.

Quality

There are two steps for measuring the quality of the dataset:

  1. Quality of SNLI dataset
  2. Quality of Translation from SNLI to SNLI-TR

Quality of SNLI dataset and details can be found at SNLI Website, and quality of the translation will be discussed here.

The validation of translation process is explained in the paper as follows:

"For expert evaluation, we grouped the translations into example sets of four sentences. We distributed the sets to the experts so that each set (and sentence) was examined by five randomly chosen experts and each expert co-examined approximately the same number of sets with each other expert. Each expert evaluated the translation by (i) grading the translation quality between 1 and 5 (inclusive; 5 the best) and (ii) checking if the translation altered the semantic relation. We distributed an annotation guide to the team to standardize the criteria. In total, 500 example sets (2,000 translated sentences) were examined by five experts, yielding 10,000 annotations. The annotation-level analysis compares each new annotation with the gold label on the original English example. The majority-level analysis compares only the majority label (if any) of the five new annotations with the English gold label"

The results of validation process is given in the following table:

Translation Quality Annotation-level Label Consistency Majority-level Label Consistency
Train 4.55 (0.78) 92.62% 98.67%
Dev 4.46 (0.90) 90.53% 95.33%
Test 4.45 (0.86) 87.87% 94.00%

Additional Information

Dataset Curators

Published by Emrah Budur, Rıza Özçelik, Tunga Gung, Christopher Potts

Citation Information

Please cite the following paper (arXiv) if you found this dataset useful:

Emrah Budur, Rıza Özçelik, Tunga Güngör and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. Proceedings of EMNLP.

@inproceedings{budur-etal-2020-data,
    title = "Data and {R}epresentation for {T}urkish {N}atural {L}anguage {I}nference",
    author = {Budur, Emrah  and
      {\"O}z{\c{c}}elik, R{\i}za  and
      Gungor, Tunga  and
      Potts, Christopher},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.662",
    doi = "10.18653/v1/2020.emnlp-main.662",
    pages = "8253--8267"
}

Contact

Uploaded and documented by Arda Goktogan: ardagoktogan gmail com