First, create a virtual environment in the root directory by running:
python3 -m venv venv
Then activate the virtual environment with:
source venv/bin/activate
(To leave the virtual environment, run deactivate.)
Install all the dependencies with:
pip install -r requirements.txt
Also make sure to download NLTK's corpora by running these lines in a Python interpreter:
import nltk
nltk.download()
the spaCy model:
python -m spacy download es_core_news_sm
and the spaCy custom lemmatizer files:
python -m spacy_spanish_lemmatizer download wiki
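A minimal Python sketch to check that the NLTK data and the spaCy Spanish model are in place; the sample sentence and the choice of the stopwords corpus are just illustrations, not part of the repo's scripts:

```python
# Quick sanity check that the downloads above succeeded.
import nltk
import spacy

nltk.download("stopwords")           # one corpus as an example; nltk.download() opens the full picker
nlp = spacy.load("es_core_news_sm")  # Spanish model installed above
doc = nlp("El brote de COVID-19 en España")
print([(token.text, token.lemma_) for token in doc])
```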
(For language detection, see this repo.)
Rename sample_credentials.json to credentials.json, and fill in the four credentials from your Twitter app.
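A minimal sketch of loading those credentials in Python; the key names below are the usual four values for a Twitter app and are assumptions here, so check sample_credentials.json for the exact field names this repo uses:

```python
# Hypothetical example: read the four Twitter credentials from credentials.json.
# The key names are assumptions; see sample_credentials.json for the real ones.
import json

with open("credentials.json") as f:
    creds = json.load(f)

consumer_key = creds["consumer_key"]
consumer_secret = creds["consumer_secret"]
access_token = creds["access_token"]
access_token_secret = creds["access_token_secret"]
```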
(Not tested in this fork) Run
bokeh serve --show real-time-twitter-trend-discovery.py --args <tw> <top_n_words> <*save_history>
where <tw> is the time window within which tweets are treated as a batch, <top_n_words> is the number of words with the highest IDF scores to show, and <*save_history> is an optional boolean indicating whether to dump the history. Make sure the API credentials are properly stored in the credentials.json file.
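As an illustration of the "highest IDF scores" idea (a sketch with scikit-learn, not the script's actual implementation), one could rank the words of a batch of tweets like this:

```python
# Illustrative sketch (assumed, not the repo's code): rank the words of a batch
# of tweets by IDF and keep the top_n_words highest-scoring ones.
from sklearn.feature_extraction.text import TfidfVectorizer

batch = [
    "el confinamiento empieza hoy",
    "nuevas medidas por el brote",
    "el brote sigue creciendo",
]
top_n_words = 5

vectorizer = TfidfVectorizer()
vectorizer.fit(batch)
# idf_ is aligned with the vocabulary; sort descending by IDF.
# (On scikit-learn < 1.0, use get_feature_names() instead.)
pairs = sorted(zip(vectorizer.idf_, vectorizer.get_feature_names_out()), reverse=True)
print([word for _, word in pairs[:top_n_words]])
```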
(Not tested in this fork) To train a topic model and visualize the news in 2-D space, run
python topic_20news.py --n_topics <n_topics> --n_iter <n_iter> --top_n <top_n> --threshold <threshold>
where <n_topics> is the number of topics to select (default 20), <n_iter> is the number of iterations for training an LDA model (default 500), <top_n> is the number of top keywords to display (default 5), and <threshold> is the threshold probability for topic assignment (default 0.0).
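A rough sketch of what such a training step looks like with the lda package (linked below); the toy document-term matrix and variable names are illustrative assumptions, not the script's code:

```python
# Illustrative sketch only: train an LDA model and extract top keywords,
# mirroring the --n_topics / --n_iter / --top_n / --threshold parameters.
import numpy as np
import lda

X = np.random.randint(0, 5, size=(100, 500))   # toy document-term counts
vocab = [f"word{i}" for i in range(500)]

n_topics, n_iter, top_n, threshold = 20, 500, 5, 0.0

model = lda.LDA(n_topics=n_topics, n_iter=n_iter, random_state=1)
model.fit(X)

# Top keywords per topic.
for k, dist in enumerate(model.topic_word_):
    top_words = [vocab[i] for i in np.argsort(dist)[::-1][:top_n]]
    print(f"topic {k}: {', '.join(top_words)}")

# Topic assignment: keep the most likely topic only if it passes the threshold.
assignments = [t.argmax() if t.max() > threshold else -1 for t in model.doc_topic_]
```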
(Not tested in this fork) To scrape tweets and save them to disk for later use, run
python scrape_tweets.py
If the script is interrupted, just re-run the same command and it will keep collecting new tweets. The script gets roughly 1,000 English tweets per minute, or about 1.5 million per day. Make sure the API credentials are properly stored in the credentials.json file.
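For orientation only, a hedged sketch of what such a scraper could look like with a tweepy 3.x-style streaming client; this is not the repo's scrape_tweets.py, and the credential key names are assumptions:

```python
# Hedged sketch (assumes tweepy 3.x): stream English tweets and append them to disk.
import json
import tweepy

with open("credentials.json") as f:
    creds = json.load(f)  # key names are assumptions; see sample_credentials.json

auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
auth.set_access_token(creds["access_token"], creds["access_token_secret"])


class SaveListener(tweepy.StreamListener):
    def on_status(self, status):
        with open("tweets.jsonl", "a") as out:
            out.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        return status_code != 420  # stop on rate-limit disconnects


stream = tweepy.Stream(auth=auth, listener=SaveListener())
stream.sample(languages=["en"])
```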
First make sure you have accumulated some tweets (in this fork we prefer https://github.com/Jefferson-Henrique/GetOldTweets-python, saving them in CSV format), then run
python topic_tweets.py --raw_tweet_dir <raw_tweet_dir> --num_train_tweet <num_train_tweet> --n_topics <n_topics> --n_iter <n_iter> --top_n <top_n> --threshold <threshold> --num_example <num_example> --start_date <start_date> --end_date <end_date> --scope <scope> --lang <lang> --eval_n_topics <eval_n_topics>
where <raw_tweet_dir> is a folder containing raw tweet files, <num_train_tweet> is the number of tweets used for training an LDA model, <n_topics> is the number of topics to select (default 20), <n_iter> is the number of iterations for training an LDA model (default 1500), <top_n> is the number of top keywords to display (default 8), <threshold> is the threshold probability for topic assignment (default 0.0), and <num_example> is the number of tweets to show on the plot (default 5000). The same applies to topic_profiles.py.
Extra parameters for topic_tweets.py: <start_date> and <end_date> to filter the data by date, and <scope> to merge with a CSV file of Spain users (default SPA). Also <lang> to filter by language (es [stable], es_gn and gn [pre-alpha]), and <eval_n_topics> if you want to evaluate the optimal number of topics.
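One simple way to compare candidate numbers of topics, sketched here with the lda package as an assumption (the repo may use a different evaluation, e.g. the tmtoolkit metrics linked below), is to look at the final log-likelihood of each fitted model:

```python
# Hedged sketch (not the repo's evaluation code): fit LDA models with several
# candidate topic counts and compare their final log-likelihoods.
import numpy as np
import lda

X = np.random.randint(0, 5, size=(200, 500))  # toy document-term counts

for n_topics in (10, 20, 30, 40):
    model = lda.LDA(n_topics=n_topics, n_iter=500, random_state=1)
    model.fit(X)
    print(n_topics, model.loglikelihood())
```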
Four .csv files (a hedged merging sketch follows this list):
- a tweets file, with columns: 'tweet_id', 'tweet', 'date', 'user_id'
- a detected-language file, with columns: 'tweet_id', 'lang'
- a user file for a particular location (Spain in our case), with column: 'id_str' (then merged with 'user_id')
- and an extra file to check locations.
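A minimal pandas sketch of how these files could be combined and filtered by date, language, and Spain users; the file names, example dates, and exact join logic are assumptions, not the repo's code:

```python
# Hedged sketch: merge the tweets, language, and Spain-user CSVs and filter
# by date and language, roughly mirroring --start_date/--end_date/--scope/--lang.
import pandas as pd

tweets = pd.read_csv("tweets.csv")            # 'tweet_id', 'tweet', 'date', 'user_id'
langs = pd.read_csv("lang_detected.csv")      # 'tweet_id', 'lang'
spain_users = pd.read_csv("spain_users.csv")  # 'id_str'

df = tweets.merge(langs, on="tweet_id")
df = df.merge(spain_users, left_on="user_id", right_on="id_str")  # keep Spain users only

df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2020-02-01") & (df["date"] <= "2020-04-01")]  # example dates
df = df[df["lang"] == "es"]
print(len(df), "tweets after filtering")
```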
For reproducibility, tweet_ids and dates are available here.
Please cite the paper Discovering topics in Twitter about the COVID-19 outbreak in Spain:
@article{PLN6333,
author = {Marvin M. Agüero-Torales and David Vilares and Antonio G. López-Herrera},
title = {Discovering topics in Twitter about the COVID-19 outbreak in Spain},
journal = {Procesamiento del Lenguaje Natural},
volume = {66},
number = {0},
year = {2021},
keywords = {COVID-19, Twitter, social networks, topic modeling},
abstract = {In this work, we apply topic modeling to study what users have been discussing in Twitter during the beginning of the COVID-19 pandemic. More particularly, we explore the period of time that includes three differentiated phases of the COVID-19 crisis in Spain: the pre-crisis time, the outbreak, and the beginning of the lockdown. To do so, we first collect a large corpus of Spanish tweets and clean them. Then, we cluster the tweets into topics using a Latent Dirichlet Allocation model, and define generative and discriminative routes to later extract the most relevant keywords and sentences for each topic. Finally, we provide an exhaustive qualitative analysis about how such topics correspond to the situation in Spain at different stages of the crisis.},
issn = {1989-7553},
url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6333},
pages = {177--190}
}
- https://github.com/lda-project/lda
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
- https://datascience.blog.wzb.eu/2017/11/09/topic-modeling-evaluation-in-python-with-tmtoolkit/
- https://github.com/WZBSocialScienceCenter/tmtoolkit
- https://github.com/starry9t/TopicLabel
- https://towardsdatascience.com/%EF%B8%8F-topic-modelling-going-beyond-token-outputs-5b48df212e06