Data

This repository contains various corpora and textual data that I often use while experimenting with various statistical models (see https://github.com/jlund3/modelt). Storing data like this in git isn't ideal, but it is convenient.

Newsgroups

20 Newsgroups is a well known (perhaps even over used) dataset consisting of 20,000 usenet postings from the early 90's. The documents are equally distributed from 20 different newsgroups on a variety of topics. Note though that many of the newsgroups are similar in nature (e.g. talk.region.misc and soc.religion.christian).

Each subdirectory of the dataset names a specific newsgroup. Each file in the subdirectory is a posting. Note that each file has a header that describes the document metadata in more detail. This header should not be included as part of the posting, and can be skipped by reading the file after the first blank line.

This dataset is also available in a form with duplicate documents removed. The duplicates are usually due to cross posting between newsgroups.

Amazon

This is a collection of roughly 40,000 product reviews on Amazon. The txt file contains the text of the reviews. Each line consists of a document id followed by the review text. Each line of the review file consists of a document id (corresponding to the txt file), and the number of starts (1-5).

Enron

The Enron dataset includes emails which were made public during the Federal Energy Regulatory Commision investigation of the company. Subsequently, the Linguistic Data Consortium labeled a subset of the emails. We include just those emails which have annotations.

Bible

This dataset is the King James version of The Holy Bible. Each line consists of a single verse of text. The line format is as follows:

Book.Chapter.Verse VerseText

The dataset also contains cross references within the bible from http://www.openbible.info/labs/cross-references/, along with some slight cleanup and reformatting. Each line in the cross reference file contains a cross reference of the format:

Book.Chapter.Verse Book.Chapter.Verse

Ambient

The Ambient (AMBIgous ENTries) dataset contains a set of topics, sub-topics and ranked search results with snippets for the purpose of testing web search clustering. It has 44 topics, each with 100 results.

Moresque

Moresque (MORE Sense-tagged QUEries) is a complement to Ambient. It contains 114 topics, each with 100 results.

State of the Union

This dataset contains all of the text of the state of the union speeches up to 2010. Each file contains a single speech, with the filename giving the president and the speech number.

Stopwords

Stopwords contains various stopword lists. Each list gives one stopword per line. Each stopword has been lowercased and stripped of punctuation.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
amazon		amazon
ambiant		ambiant
bible		bible
cade		cade
constraints		constraints
enron		enron
moresque		moresque
newsgroups-dedup		newsgroups-dedup
newsgroups		newsgroups
scripts		scripts
sotu		sotu
stopwords		stopwords
test		test
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data

Newsgroups

Amazon

Enron

Bible

Ambient

Moresque

State of the Union

Stopwords

About

Releases

Packages

Contributors 2

Languages

byu-aml-lab/data

Folders and files

Latest commit

History

Repository files navigation

Data

Newsgroups

Amazon

Enron

Bible

Ambient

Moresque

State of the Union

Stopwords

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages