This repository contains various corpora and textual data that I often use while experimenting with statistical models. Storing data like this in git isn't ideal, but it is convenient.
20 Newsgroups is a well-known (perhaps even overused) dataset consisting of 20,000 Usenet postings from the early 1990s. The documents are evenly distributed across 20 different newsgroups on a variety of topics. Note though that many of the newsgroups are similar in nature (e.g. talk.religion.misc and soc.religion.christian).
Each subdirectory of the dataset names a specific newsgroup. Each file in the subdirectory is a posting. Note that each file has a header that describes the document metadata in more detail. This header should not be included as part of the posting, and can be skipped by reading the file after the first blank line.
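The header-skipping step described above can be sketched in Python as follows. This is a minimal illustration, not code from this repository: the function names and the Latin-1 encoding are assumptions, and a posting with no blank line is returned unchanged.

```python
def strip_header(text):
    """Return everything after the first blank line of a posting."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # The first blank (or whitespace-only) line ends the header.
        if not line.strip():
            return "\n".join(lines[i + 1:])
    # No blank line found: assume there is no header to skip.
    return text


def read_posting_body(path, encoding="latin-1"):
    """Read one posting file and drop its header (encoding is an assumption)."""
    with open(path, encoding=encoding) as f:
        return strip_header(f.read())
```

Only the first blank line is treated as the separator, so blank lines inside the posting itself are preserved.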
This dataset is also available in a form with duplicate documents removed. The duplicates are usually due to cross posting between newsgroups.
This is a collection of roughly 40,000 Amazon product reviews. The txt file contains the text of the reviews: each line consists of a document id followed by the review text. Each line of the review file consists of a document id (corresponding to the txt file) and the number of stars (1-5).
The Enron dataset includes emails that were made public during the Federal Energy Regulatory Commission investigation of the company. Subsequently, the Linguistic Data Consortium labeled a subset of the emails. We include just those emails that have annotations.
This dataset is the King James Version of the Holy Bible. Each line consists of a single verse of text. The line format is as follows:
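One way to pair each review with its rating is to index both files by document id. The sketch below assumes that the id is separated from the rest of the line by whitespace and that ids are unique; neither is confirmed here, and the function names are hypothetical.

```python
def parse_id_lines(lines):
    """Map document id -> rest of the line, splitting on the first whitespace run."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)
        result[parts[0]] = parts[1] if len(parts) > 1 else ""
    return result


def join_reviews(text_lines, rating_lines):
    """Return {doc_id: (stars, review_text)} for ids present in both inputs."""
    texts = parse_id_lines(text_lines)
    ratings = parse_id_lines(rating_lines)
    return {
        doc_id: (int(ratings[doc_id]), text)
        for doc_id, text in texts.items()
        if doc_id in ratings
    }
```

Both arguments can be open file objects, since the functions only iterate over lines.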
Book.Chapter.Verse VerseText
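A line in this format can be split into its parts roughly as follows. This sketch assumes the reference and the verse text are separated by whitespace and that chapter and verse are integers; the `Verse` type and `parse_verse_line` name are illustrative only.

```python
from typing import NamedTuple


class Verse(NamedTuple):
    book: str
    chapter: int
    verse: int
    text: str


def parse_verse_line(line):
    """Parse one 'Book.Chapter.Verse VerseText' line into a Verse."""
    ref, text = line.strip().split(None, 1)
    # Split from the right so a book name containing '.' stays intact.
    book, chapter, verse = ref.rsplit(".", 2)
    return Verse(book, int(chapter), int(verse), text)
```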
The dataset also contains cross references within the Bible, with some slight cleanup and reformatting applied. Each line in the cross reference file contains a cross reference of the format:
Book.Chapter.Verse Book.Chapter.Verse
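The cross reference lines can be collected into a lookup table, as in the sketch below. It assumes the two references are whitespace-separated and treats the first as the source and the second as the target; that direction is an assumption, as are the function names.

```python
from collections import defaultdict


def parse_ref(ref):
    """Turn 'Book.Chapter.Verse' into a (book, chapter, verse) tuple."""
    book, chapter, verse = ref.rsplit(".", 2)
    return (book, int(chapter), int(verse))


def load_cross_references(lines):
    """Map each source reference to the list of references it points to."""
    xrefs = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        xrefs[parse_ref(source)].append(parse_ref(target))
    return dict(xrefs)
```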
The Ambient (AMBIguous ENTries) dataset contains a set of topics, sub-topics and ranked search results with snippets for the purpose of testing web search clustering. It has 44 topics, each with 100 results.
Moresque (MORE Sense-tagged QUEries) is a complement to Ambient. It contains 114 topics, each with 100 results.
This dataset contains the full text of the State of the Union speeches up to 2010. Each file contains a single speech, with the filename giving the president and the speech number.
Stopwords contains various stopword lists. Each list gives one stopword per line. Each stopword has been lowercased and stripped of punctuation.
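Because the lists are lowercased and stripped of punctuation, tokens need the same normalization before lookup. The sketch below shows one way to do that; the function names are hypothetical, and only ASCII punctuation from `string.punctuation` is removed.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def load_stopwords(lines):
    """Build a set of stopwords from an iterable with one stopword per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Normalize tokens like the lists (lowercase, no punctuation) and filter."""
    kept = []
    for token in tokens:
        normalized = token.lower().translate(_PUNCT_TABLE)
        if normalized and normalized not in stopwords:
            kept.append(normalized)
    return kept
```

Passing an open file object to `load_stopwords` works, since it only iterates over lines.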