This dataset is a collection of newspaper documents intended for Information Retrieval studies. It contains documents, queries, and query relevance judgments. The paper and the GitHub repositories for this dataset can be reached via the given links.
There are 408,305 documents gathered from news articles of the Turkish newspaper Milliyet, covering the years 2001-2005. The average number of words per document before stop-word elimination is reported as 234.
There are 72 ad hoc queries. The query relevance judgments were determined by pooling: the assessors evaluated the documents in the pool, and the remaining documents, those not in the pool, are assumed to be irrelevant. The query relevance file contains the pooled documents for each query, with the assessors' relevance judgment in the last column (0: irrelevant, 1: relevant). Detailed information about pooling and the TREC approach can be found here.
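The pooling convention above can be illustrated with a minimal sketch. It assumes the relevance file is whitespace-separated with the query id in the first column, the document id in the second-to-last column, and the 0/1 judgment in the last column; only the position of the judgment is stated in this card, so the rest of the layout is an assumption. The function names `load_qrels` and `precision_at_k` and the sample lines are hypothetical.

```python
def load_qrels(lines):
    """Map each query id to the set of document ids judged relevant.

    Assumed layout (not confirmed by the dataset card): query id first,
    document id second-to-last, 0/1 judgment last.
    """
    relevant = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        query_id, doc_id, judgment = parts[0], parts[-2], parts[-1]
        judged_relevant = relevant.setdefault(query_id, set())
        if judgment == "1":
            judged_relevant.add(doc_id)
    return relevant


def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Precision at rank k; documents outside the pool count as irrelevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Made-up judgments for illustration only.
sample_lines = ["1 0 50000 1", "1 0 50001 0", "2 0 50002 1"]
qrels = load_qrels(sample_lines)
# "99999" is not in the pool, so it is treated as irrelevant.
score = precision_at_k(["50000", "99999", "50001"], qrels["1"], 3)
```

Because unjudged documents are simply absent from the relevant set, the "not in the pool means irrelevant" rule needs no special handling in the metric.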
<DOC>
<DOCNO> 50000 </DOCNO>
<SOURCE> Milliyet v.01 </SOURCE>
<URL> www.milliyet.com.tr/2001/11/01/son/soneko02.html </URL>
<DATE> 2001/11/01/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE>
Kapalıçarşı'da döviz fiyatları güne kaç liradan başladı.. Tıklayın
</HEADLINE>
<TEXT>
İstanbul serbest piyasada dolar 1 milyon 595 bin lira, mark ise 736 bin lira satış fiyatıyla güne başladı.
Kapalıçarşı'da dolar 1 milyon 585 bin liradan alınıp 1 milyon 595 bin liradan satılıyor. 728 bin liradan alınan markın satış fiyatı ise 736 bin lira olarak belirlendi.
Serbest piyasada dünkü kapanışta dolar 1 milyon 602 bin, mark ise 739 bin lira olmuştu.
</TEXT>
</DOC>
field | dtype |
---|---|
DOCNO | integer |
SOURCE | string |
URL | string |
DATE | string |
AUTHOR | string |
HEADLINE | string |
TEXT | string |
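As a rough illustration of how records in the format shown above might be read into the listed fields, here is a minimal sketch using regular expressions. It assumes each record is wrapped in `<DOC>` ... `</DOC>` and that fields are non-nested uppercase tags, as in the sample; the function name `parse_documents` and the shortened sample text are hypothetical, and file encoding is left to the caller.

```python
import re

# One record per <DOC> ... </DOC> block; <DOCNO> does not match "<DOC>".
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# A field is an uppercase tag closed by the same tag name.
FIELD_RE = re.compile(r"<(?P<tag>[A-Z]+)>(?P<value>.*?)</(?P=tag)>", re.DOTALL)


def parse_documents(raw):
    """Return a list of dicts, one per document, keyed by tag name."""
    documents = []
    for doc_match in DOC_RE.finditer(raw):
        record = {
            tag: value.strip()
            for tag, value in FIELD_RE.findall(doc_match.group(1))
        }
        # DOCNO is listed as an integer in the field table.
        if record.get("DOCNO", "").isdigit():
            record["DOCNO"] = int(record["DOCNO"])
        documents.append(record)
    return documents


# Shortened, made-up record in the same layout as the sample document.
sample = """<DOC>
<DOCNO> 50000 </DOCNO>
<SOURCE> Milliyet v.01 </SOURCE>
<DATE> 2001/11/01/ </DATE>
<TIME> </TIME>
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
Example body text.
</TEXT>
</DOC>"""

docs = parse_documents(sample)
```

Regular expressions are used here instead of an XML parser because the collection has no single root element and the text may contain characters that strict XML parsing would reject. Empty fields such as `TIME` come back as empty strings.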
The dataset is motivated by the desire to advance information retrieval studies in the Turkish language.
The authors gathered the documents from columns and news articles published in the Turkish newspaper Milliyet between 2001 and 2005.
The documents were collected automatically. The query relevance assessments were made by experienced web users. The assessors were not required to have expertise in the topics they picked.
All documents were published in the newspaper and are therefore not expected to contain any personal or sensitive information.
This dataset is part of an effort to encourage information retrieval research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.
The data included here come from writers of the Turkish newspaper 'Milliyet'. Some of these documents may be columns written from a biased point of view.
Published by Fazli Can, Seyit Kocberber, Erman Balcik, Cihan Kaynak, H. Cagdas Ocalan, and Onur M. Vursavas.
If you use this dataset, please cite the following paper:
Can, F., Kocberber, S., Balcik, E., Kaynak, C., Ocalan, H. C., & Vursavas, O. M. (2008). Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology, 59(3), 407-421.
Documented by Burak Kizil: mkizil19 ku edu tr
and reviewed/uploaded by Arda Goktogan: ardagoktogan gmail com