Skip to content

Latest commit

 

History

History
97 lines (56 loc) · 3.9 KB

TDD-C-202206-CC-023.md

File metadata and controls

97 lines (56 loc) · 3.9 KB

Bilkent Information Retrieveal Collection

This dataset is a collection of newspaper documents. It can be used for Information Retrieveal studies. It contains documents, queries, and query relevants. You can reach the paper and GitHub Repositories of this dataset via given links.

Dataset Details

There are 408,305 documents gathered from the news articles of Turkish newspaper Milliyet from years 2001-2005. Average number of words for each document before stop-word elimination is given as 234.

There are 72 ad hoc queries. To determine the query relevants, pooling concept is used. The assessors evaluated the documents at the pool and rest of the documents, ones that are not in the pool, assumed to be irrelevant. Query relevants contains the pool documents for each query, including the result of the relevance assessment by assessors in the last column. (0: irrelevant, 1: relevant) Detailed information about the pooling and TREC approach can be reached by here.

Samples

<DOC>
 <DOCNO> 50000 </DOCNO>
 <SOURCE> Milliyet v.01 </SOURCE>
 <URL> www.milliyet.com.tr/2001/11/01/son/soneko02.html </URL>
 <DATE> 2001/11/01/ </DATE>
 <TIME>  </TIME>
 <AUTHOR>  </AUTHOR>
 <HEADLINE>
Kapalıçarşı'da döviz fiyatları güne kaç liradan başladı.. Tıklayın 
 </HEADLINE>
 <TEXT>
     İstanbul serbest piyasada dolar 1 milyon 595 bin lira, mark ise 736 bin lira satış fiyatıyla güne başladı. 
     Kapalıçarşı'da dolar 1 milyon 585 bin liradan alınıp 1 milyon 595 bin liradan satılıyor. 728 bin liradan alınan markın satış fiyatı ise 736 bin lira olarak belirlendi. 
     Serbest piyasada dünkü kapanışta dolar 1 milyon 602 bin, mark ise 739 bin lira olmuştu.
 </TEXT>
</DOC>

Fields

field dtype
DOCNO integer
SOURCE string
URL string
DATE string
AUTHOR string
HEADLINE string
TEXT string

Dataset Creation

Curation Rationale

The dataset is motivated by the desire to advance information retrieval studies in Turkish language.

Data Source

The authors gathered the documents from columns and news articles of Turkish newspaper Milliyet from the years between 2001-2005.

Annotations

The documents are collected automatically. The query relevance assessments are done by experienced web users. The assessors are not required to expertise on the topic they pick.

Personal and Sensitive Information

All documents are published in the newspaper, thus, are not expected to contain any personal/sensitive information.

Considerations

Social Impact of Dataset

This dataset is part of an effort to encourage information retrieveal research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.

Discussion of Biases

The data included here are from writers of Turkish newspaper 'Milliyet'. Some percentage of these documents might contain columns with biased point of view.

Additional Information

Dataset Curators

Published by Fazli Can, Seyit Kocberber, Erman Balcik, Cihan Kaynak, H. Cagdas Ocalan, and Onur M. Vursavas.

Citation Information

If you use this dataset, please cite the following paper:

Can, F., Kocberber, S., Balcik, E., Kaynak, C., Ocalan, H. C., & Vursavas, O. M. (2008). Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology, 59(3), 407-421.

Contact

Documented by Burak Kizil mkizil19 ku edu tr and reviewed/uploaded by Arda Goktogan: ardagoktogan gmail com