Skip to content

Latest commit

 

History

History
115 lines (99 loc) · 3.21 KB

Twitter.md

File metadata and controls

115 lines (99 loc) · 3.21 KB

Twitter

Twitter sentiment dataset by Nick Sanders. Downloaded from Sentiment140 site. It is large dataset for the Sentiment Analysis task. Every tweets falls in either three categories positive(4), negative(0) or neutral(2).It contains 1600000 training examples and 498 testing examples.

Structure of dataset: documents/tweets, sentences, words, characters

This whole dataset is divided into four categories which can accessed by giving corresponding keywords:
train_pos : positive polarity sentiment train set examples (default)
train_neg : negative polarity sentiment train set examples
test_pos : positive polarity sentiment test set examples
test_neg : negative polarity sentiment test set examples

To get rid of unwanted levels, flatten_levels function from MultiResolutionIterators.jl can be used.

Example:

#Using "test_pos" keyword for getting positive polarity sentiment examples

julia> dataset_test_pos = load(Twitter("test_pos"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)

julia> using Base.Iterators

julia> tweets = collect(take(dataset_test_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
 [["@", "stellargirl", "I", "loooooooovvvvvveee", "my", "Kindle", "2", "."], ["Not", "that", "the", "DX", "is", "cool", ",", "but", "the", "2", "is", "fantastic", "in", "its", "own", "right", "."]]
 [["Reading", "my", "kindle", "2", "..", "."], ["Love", "it..", "."], ["Lee", "childs", "is", "good", "read", "."]]

 julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
40-element Array{String,1}:
 "@"
 "stellargirl"
 "I"
 "loooooooovvvvvveee"
 "my"
 "Kindle"
 "2"
 "."
 "Not"
 "that"
 "the"
 "DX"
 "is"
 "cool"
 ","
 "but"
 
 "Reading"
 "my"
 "kindle"
 "2"
 ".."
 "."
 "Love"
 "it.."
 "."
 "Lee"
 "childs"
 "is"
 "good"
 "read"
 "."

#Using "train_pos" category to get positive polarity sentiment examples

julia> dataset_train_pos = load(Twitter()) #no need to specify category because "train_pos" is default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)

julia> using Base.Iterators

julia> tweets = collect(take(dataset_train_pos, 4))
4-element Array{Array{Array{String,1},1},1}:
[["I", "LOVE", "@", "Health", "4", "UandPets", "u", "guys", "r", "the", "best", "!", "!"]]
[["im", "meeting", "up", "with", "one", "of", "my", "besties", "tonight", "!"], ["Cant", "wait", "!", "!"], ["-", "GIRL", "TALK", "!", "!"]]
[["@", "DaRealSunisaKim", "Thanks", "for", "the", "Twitter", "add", ",", "Sunisa", "!"],
["I", "got", "to", "meet", "you", "once", "at", "a", "HIN", "show"    "in", "the", "DC", "area", "and", "you", "were", "a", "sweetheart", "."]]
[["Being", "sick", "can", "be", "really", "cheap", "when", "it", "hurts", "too"    "eat", "real", "food", "Plus", ",", "your", "friends", "make", "you", "soup"]]

julia> flatten_levels(tweets, (!lvls)(Twitter, :words))|>full_consolidate
85-element Array{String,1}: "I" "LOVE" "@" "Health"
 "4"
 "UandPets"
 "u"
 "guys"
 "r"
 "the"
 "best"
 "!"
 "!"
 "im"
 "meeting"
 "up"
 
 "it"
 "hurts"
 "too"
 "much"
 "to"
 "eat"
 "real"
 "food"
 "Plus"
 ","
 "your"
 "friends"
 "make"
 "you"
 "soup"