The `politicaltweets` R package provides functions to preprocess tweets data and to classify tweets according to whether or not they are political, based on a pre-trained ensemble classifier.
```r
remotes::install_github("haukelicht/politicaltweets")
```
Note that all but one of the package's dependencies are distributed via CRAN.
The one exception is the `laserize` package, which can be installed from GitHub.
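For example, you can install the GitHub dependency first and then the package itself (a sketch; the repository path for `laserize` is an assumption, since only the `politicaltweets` path is stated above):

```r
# install remotes if needed
install.packages("remotes")

# install the laserize dependency from GitHub
# (note: the repository path "haukelicht/laserize" is an assumption)
remotes::install_github("haukelicht/laserize")

# then install politicaltweets itself
remotes::install_github("haukelicht/politicaltweets")
```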
To classify tweets, five steps are required:

1. Query the tweet data from the Twitter API using `rtweet`'s `lookup_statuses()` (with `parse = TRUE`).
2. Pass the parsed tweets data to argument `x` of `create_tweet_features()` to create a data frame of tweet features.
3. Pass the parsed tweets data to argument `x` of `create_tweet_text_representations()` with `.compute.pcs = FALSE` to obtain tweet text embedding representations.¹
4. Combine the tweet features and text representation objects in a data frame.
5. Pass the resulting data frame to argument `x` of `classify_tweets()`.
A minimal working example:
```r
library(dplyr)
library(politicaltweets)

# instead of querying data from the Twitter API (step 1),
# below we use a prototypical tweets data frame
glimpse(tweets.df.prototype)

# step 2
tfeats <- create_tweet_features(tweets.df.prototype, .as.data.table = FALSE)

# step 3
ttreps <- create_tweet_text_representations(tweets.df.prototype, .compute.pcs = FALSE)

# step 4
temp <- as_tibble(tfeats) %>%
  left_join(mutate(as_tibble(ttreps$ics), status_id = rownames(ttreps$ics)))

# step 5
preds <- classify_tweets(temp, .debug = TRUE)

# inspect the result
cbind(temp[, c("text", "lang")], preds)
```
All functions exported by `politicaltweets` expect that data passed to their argument `x` conforms to the naming and typing conventions for tweets data frames set by the `rtweet` package.
A prototypical tweets data frame is distributed with the `politicaltweets` package; see `?tweets.df.prototype`.
(Moreover, `politicaltweets::required.tweets.df.cols` maps required columns to their accepted classes.)
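As a quick sanity check before classification, you can compare a data frame's columns against these requirements. A minimal sketch, assuming `required.tweets.df.cols` is a named object mapping required column names to their accepted classes as described above:

```r
library(politicaltweets)

x <- tweets.df.prototype

# columns required by the package but absent from `x`
missing_cols <- setdiff(names(required.tweets.df.cols), names(x))
if (length(missing_cols) > 0)
  stop("missing required columns: ", paste(missing_cols, collapse = ", "))
```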
By default, `classify_tweets()` uses a list of four pre-trained models (see `?constituent.models` for details) and "blends" them into an ensemble classifier using `blend.by = "PR-AUC"` (i.e., maximizing the area under the precision-recall curve).
More generally, `classify_tweets()` can handle two types of model inputs:

- Lists of pre-trained base learner models: if the input to argument `model` is a 'caretList' object (i.e., a list of pre-trained base learners), the base learners are first "blended" into a greedy ensemble classifier, and the resulting ensemble model is then used to classify the samples in `x`.
- Pre-trained ensemble classifiers: if the input to argument `model` is a 'caretEnsemble' object, this ensemble model is used directly to classify the samples in `x`.
Thus you can train your own base learners or ensemble classifier and pass them to `classify_tweets()` via the `model` argument.
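A sketch of both call patterns (the objects `my.learners` and `my.ensemble` are hypothetical placeholders for models you have trained yourself; only the `model` argument itself is documented above):

```r
library(politicaltweets)
library(caretEnsemble)

# option 1: a 'caretList' of pre-trained base learners;
# classify_tweets() first blends them into a greedy ensemble
# preds <- classify_tweets(x, model = my.learners)

# option 2: a pre-trained 'caretEnsemble' object, used directly
# preds <- classify_tweets(x, model = my.ensemble)
```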
Footnotes

1. It obtains tweet text LASER embedding representations using the `laserize` package and projects the tweets' LASER representations onto a pre-defined independent component space. ↩