This repo contains R scripts for cleaning and preparing text data for further analysis. I will also provide simple templates for some popular text analysis methods, such as Word2Vec and topic modeling (LDA or structural topic modeling, STM).
In general, my text-cleaning process is as follows:
- remove emojis
- remove URLs
- remove text in language(s) that you don't use in the final analysis
- remove spam
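
The cleaning steps above can be sketched in base R roughly as follows. This is only an illustration: `clean_text`, the regular expressions, the Latin-letter threshold, and the spam keywords are all assumptions made for this example, not the rules used by the scripts in this repo. The language step in particular is a crude heuristic that keeps mostly-Latin text; a dedicated language detector would likely be more reliable.

```r
# Hypothetical helper: every pattern and threshold here is an illustrative assumption.
clean_text <- function(x,
                       spam_pattern = "free money|click here",
                       min_latin_share = 0.5) {
  x <- enc2utf8(x)

  # 1. Remove emojis: common emoji blocks plus joiner/variation characters
  #    (not an exhaustive list of emoji code points)
  emoji <- "[\\x{1F000}-\\x{1FAFF}\\x{2600}-\\x{27BF}\\x{FE0F}\\x{200D}]"
  x <- gsub(emoji, "", x, perl = TRUE)

  # 2. Remove URLs, then tidy up the whitespace left behind
  x <- gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
  x <- trimws(gsub("\\s+", " ", x, perl = TRUE))

  # 3. Drop texts in unwanted languages (crude proxy: share of basic Latin
  #    letters among all non-space characters)
  n_latin <- nchar(gsub("[^A-Za-z]", "", x, perl = TRUE))
  n_total <- nchar(gsub("\\s", "", x, perl = TRUE))
  x <- x[n_latin / pmax(n_total, 1) >= min_latin_share]

  # 4. Drop spam (keyword match, case-insensitive)
  x[!grepl(spam_pattern, x, ignore.case = TRUE)]
}

# Made-up example documents
docs <- c(
  "Check this out https://example.com \U0001F600 great phone",
  "FREE MONEY click here now",
  "\u4eca\u5929\u5929\u6c14\u5f88\u597d",
  "I love this camera"
)
cleaned <- clean_text(docs)
print(cleaned)
```

Filtering by language before spam keeps the spam keywords language-specific; the order can be swapped if spam is easier to spot in the raw text.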
Description of text data
- top words
- bigrams
- trigrams
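
As a rough sketch of these descriptive counts, the base R snippet below tallies the most frequent words, bigrams, and trigrams. The helper names (`tokenize`, `ngrams`, `top_ngrams`), the simple tokenizer, and the toy sentences are assumptions for illustration only; n-grams are built within each document so they never span two documents.

```r
# Hypothetical helpers for illustration; not functions from this repo.

# Lowercase a document and split it on anything that is not a letter or apostrophe
tokenize <- function(doc) {
  tokens <- strsplit(tolower(doc), "[^a-z']+")[[1]]
  tokens[nzchar(tokens)]
}

# Build all n-grams (as space-joined strings) from one token vector
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all documents and return the k most frequent
top_ngrams <- function(docs, n = 1, k = 10) {
  grams <- unlist(lapply(docs, function(d) ngrams(tokenize(d), n)))
  head(sort(table(grams), decreasing = TRUE), k)
}

# Made-up example documents
toy_docs <- c("the cat sat on the mat",
              "the cat ate the fish",
              "a dog sat on the mat")

top_words    <- top_ngrams(toy_docs, n = 1)
top_bigrams  <- top_ngrams(toy_docs, n = 2)
top_trigrams <- top_ngrams(toy_docs, n = 3)
print(top_words)
```

In practice, stopwords such as "the" would usually be removed before counting top words, which this sketch does not do.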
Topic modeling
- LDA
- STM
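
A minimal template for both models might look like the sketch below. It assumes the CRAN packages `tm`, `topicmodels`, and `stm` are installed (`slam` comes with `tm`); `fit_lda` and `fit_stm` are hypothetical wrapper names, and the preprocessing options, number of topics, and toy texts are placeholders rather than the settings used in this repo.

```r
# Sketch only: assumes tm, topicmodels and stm are installed.
# fit_lda() and fit_stm() are hypothetical wrappers, not functions from this repo.

fit_lda <- function(texts, k = 2, seed = 1) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  dtm <- tm::DocumentTermMatrix(
    corpus,
    control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
  )
  # LDA() requires every document to contain at least one term
  dtm <- dtm[slam::row_sums(dtm) > 0, ]
  topicmodels::LDA(dtm, k = k, control = list(seed = seed))
}

fit_stm <- function(texts, meta = NULL, k = 2, prevalence = NULL, seed = 1) {
  # textProcessor() lowercases, removes stopwords and stems by default
  processed <- stm::textProcessor(texts, metadata = meta, verbose = FALSE)
  # prepDocuments() drops rare terms and keeps the metadata aligned with the documents
  prepped <- stm::prepDocuments(processed$documents, processed$vocab,
                                processed$meta, verbose = FALSE)
  stm::stm(documents = prepped$documents, vocab = prepped$vocab, K = k,
           prevalence = prevalence, data = prepped$meta,
           seed = seed, verbose = FALSE)
}

# Made-up example documents; far too small for meaningful topics
topic_texts <- c("cats and dogs are popular pets",
                 "dogs chase cats around the garden",
                 "pets need food and water daily",
                 "stocks and bonds are common investments",
                 "investors buy stocks when markets fall",
                 "bonds pay interest to investors")

lda_fit <- NULL
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {
  lda_fit <- fit_lda(topic_texts, k = 2)
  print(topicmodels::get_terms(lda_fit, 3))  # top 3 terms per topic
}
```

`fit_stm` is only defined here, not run, because STM generally needs a larger corpus than this toy example. With real data, document-level covariates can be passed through `meta` and a formula such as `~ source` (a hypothetical column) through `prevalence`, and `stm::labelTopics()` can then be used to inspect the fitted topics.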
Word2Vec
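
A minimal Word2Vec template could look like the following sketch. It assumes the CRAN package `word2vec` is installed; the hyperparameters and the toy sentences are placeholders chosen so that the example runs quickly, not recommended settings.

```r
# Sketch only: assumes the CRAN package word2vec is installed.
# Made-up example documents; real corpora should be far larger.
w2v_texts <- tolower(c("the cat sat on the mat",
                       "the dog sat on the rug",
                       "the cat chased the dog",
                       "the dog chased the cat"))

embeddings <- NULL
if (requireNamespace("word2vec", quietly = TRUE)) {
  model <- word2vec::word2vec(
    x = w2v_texts,
    type = "skip-gram",  # or "cbow"
    dim = 10,            # embedding size (placeholder)
    window = 3,          # context window (placeholder)
    iter = 20,           # training iterations (placeholder)
    min_count = 1,       # keep every word, only sensible for a toy corpus
    threads = 1
  )
  embeddings <- as.matrix(model)  # one row per vocabulary entry
  print(dim(embeddings))
  # Words closest to "cat" in the embedding space
  print(predict(model, newdata = "cat", type = "nearest", top_n = 3))
}
```

On a corpus this small the nearest neighbours are essentially noise; the point is only to show the train, extract, and query steps.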