---
title: "<br><small> Creating [LDAvis](https://cran.r-project.org/web/packages/LDAvis/index.html) - A Tutorial </small>"
subtitle: "<small> <br>[Marcin Kosiński](http://r-addict.com/About.html) </small>"
author: "<small><a href='https://r-addict.com'><i class='fa fa-comment'></i></a> <a href='https://stackoverflow.com/users/3857701'><i class='fa fa-stack-overflow'></i></a> <a href='https://github.com/MarcinKosinski'><i class='fa fa-github'></i></a> <a href='mailto:[email protected]'><i class='fa fa-envelope-o'></i></a></small><br>"
date: October 24, 2016
output:
revealjs::revealjs_presentation:
theme: black
highlight: pygments
self_contained: false
center: true
reveal_options:
slideNumber: true
---
```{r, include=FALSE}
# edit main page with file.edit(system.file("rmarkdown/templates/revealjs_presentation/resources/default.html",package="revealjs"))
# line 126
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
library(knitr)
options(width = 80)
opts_chunk$set(
  comment = "",
  fig.width = 12,
  message = FALSE,
  warning = FALSE,
  tidy.opts = list(
    keep.blank.line = TRUE,
    width.cutoff = 80
  ),
  eval = FALSE
)
```
# Motivation
## Motivation
<small>
- Text mining is a new challenge for machine learning practitioners.
- The increased interest in text mining is driven by the growing number of internet users and by the rapid growth of internet data, of which an estimated *80% is text*.
- Extracting information from articles, news, posts and comments has become a desirable skill, but what is even more needed are tools for diagnosing and visualizing text mining models.
- Such visualizations make it easier to understand the insights a model provides and offer a convenient interface for presenting your research results to a wider audience.
</small>
# LDA overview
## LDA overview
From [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
<small>
> In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
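In symbols, the generative process behind this description can be sketched roughly as follows (the notation is mine, not from the quoted article): each topic $k$ draws a term distribution $\phi_k$, each document $d$ draws a topic mixture $\theta_d$, and every word is generated by first picking a topic and then a term from that topic.
$$
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\eta), \qquad
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
$$
The hyperparameters $\alpha$ and $\eta$ reappear later when we fit the model.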
</small>
# Data
## Show Case on R-Bloggers
<small>
For this presentation I have used articles from [R-Bloggers](http://r-bloggers.com/).
They can be downloaded from [this repository](https://github.com/MarcinKosinski/r-bloggers-harvesting).
The data harvesting process is explained in the post [LDAvis Show Case on R-Bloggers](http://r-addict.com/2016/06/21/LDAvis-RBloggers.html).
```{r}
library(RSQLite)
db.conn <-
dbConnect(
dbDriver("SQLite"),
"r-bloggers.db"
)
posts <- dbGetQuery(db.conn,
"SELECT text from posts")
dbDisconnect(db.conn)
```
</small>
## Data preparation 1
<small>
Normally I would use the `LDA()` function from the [topicmodels](https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf) package to fit an LDA model, because its input can be of class `DocumentTermMatrix`, an object from the [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) package.
A `DocumentTermMatrix` is very convenient for working with text data ([see this post by Norbert Ryciak](http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/)) because the `tm_map` function can be applied to all documents at once for stop word removal, lowercasing and removal of terms that do not occur in at least x% of documents.
However, I haven't seen `LDAvis` examples for models generated with the topicmodels package, so we will use the traditional approach to text processing.
[Stemming](https://en.wikipedia.org/wiki/Lemmatisation) and stop word removal were already performed during data collection.
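For reference, a minimal sketch of the `tm`/`topicmodels` route described above might look like the chunk below; the cleaning steps, the sparsity threshold and the object names are my own assumptions, not the code used for this post. For this presentation we instead stick with the manual tokenization that follows.
```{r}
library(tm)
library(topicmodels)
# build a corpus from the raw post texts and clean it with tm_map
corpus <- VCorpus(VectorSource(posts[, 1]))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# document-term matrix; drop terms that appear in fewer than ~1% of documents
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)
# fit an LDA model directly on the DocumentTermMatrix
lda_tm <- LDA(dtm, k = 20)
```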
```{r}
## the following fragment of code in this section
## is motivated by
## http://cpsievert.github.io/LDAvis/reviews/reviews.html
# tokenize on space and output as a list:
doc.list <- strsplit(posts[, 1], "[[:space:]]+")
# compute the table of terms:
term.table <- table(unlist(doc.list))
term.table <- sort(term.table, decreasing = TRUE)
# remove terms that occur 5 times or fewer:
term.table <- term.table[term.table > 5]
vocab <- names(term.table)
```
</small>
## Data preparation 2
<small>
The `lda.collapsed.gibbs.sampler()` function from the `lda` package has a less convenient input format (compared to `LDA()` from the `topicmodels` package), so I basically reused [cpsievert](https://github.com/cpsievert)'s snippets.
```{r}
# now put the documents into the format required by the lda package:
get.terms <- function(x) {
index <- match(x, vocab)
index <- index[!is.na(index)]
rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc.list, get.terms)
# Compute some statistics related to the data set:
D <- length(documents) # number of documents (3740)
W <- length(vocab) # number of terms in the vocab (18,536)
doc.length <- sapply(documents,
function(x) sum(x[2, ]))
# number of tokens per document [312, 288, 170, 436, 291, ...]
N <- sum(doc.length) # total number of tokens in the data (546,827)
term.frequency <- as.integer(term.table)
# frequencies of terms in the corpus [8939, 5544, 2411, 2410, 2143, ...]
```
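To get a feel for the `lda` input format, it may help to inspect a single converted document (just a quick sketch, not part of the original pipeline): each element of `documents` is a 2-row integer matrix whose first row holds 0-based indices into `vocab` and whose second row holds the corresponding counts.
```{r}
# first converted document: a 2 x N integer matrix
str(documents[[1]])
# map the 0-based indices back to the actual terms
head(vocab[documents[[1]][1, ] + 1])
```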
</small>
# Fitting the Model
## Fitting the Model
<small>
From the `lda` package documentation:
> ... [this function] takes sparsely represented input documents, performs inference, and returns point estimates of the latent parameters using the state at the last iteration of Gibbs sampling.
```{r}
# MCMC and model tuning parameters:
K <- 20
G <- 5000
alpha <- 0.02
eta <- 0.02
# Fit the model:
library(lda)
set.seed(456)
fit <- lda.collapsed.gibbs.sampler(
documents = documents, K = K,
vocab = vocab, num.iterations = G,
alpha = alpha, eta = eta,
initial = NULL, burnin = 0,
compute.log.likelihood = TRUE
)
```
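Because `compute.log.likelihood = TRUE`, a quick sanity check of the fit is possible. The chunk below is only a sketch of what one might inspect: the `lda` package stores the log-likelihood trace in `fit$log.likelihoods` and provides `top.topic.words()` for a first glance at the topics.
```{r}
# trace of the log likelihood across Gibbs iterations
plot(fit$log.likelihoods[1, ], type = "l",
     xlab = "iteration", ylab = "log likelihood")
# ten highest-scoring words for each of the K topics
top.topic.words(fit$topics, num.words = 10, by.score = TRUE)
```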
</small>
## Getting the Model
<small>
```{r}
library(archivist)
saveToRepo(fit, repoDir = "../Museum")
```
The computations take a very long time, so in case you would like to
get the model faster, I have archived my model on GitHub with the help of the
[archivist](http://r-bloggers.com/r-hero-saves-backup-city-with-archivist-and-github/) package.
You can easily load this model into R with
```{r}
archivist::aread('MarcinKosinski/Museum/fa93abf0ff93a7f6f3f0c42b7935ad4d') -> fit
```
</small>
# LDAvis use case
## LDAvis use case
<small>
If you google it properly, you'll find that the LDAvis description is:
> Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.
[Detailed vignette about LDAvis input and output can be found here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf).
</small>
## LDAvis Preparations
<small>
To visualize the result using LDAvis, we'll need estimates of the document-topic distributions, which we denote by the D×K matrix theta, and the set of topic-term distributions, which we denote by the K×W matrix phi.
```{r}
# document-topic distributions: normalize each column (document) of the
# K x D matrix of topic counts, then transpose to get a D x K matrix
theta <- t(apply(fit$document_sums + alpha, 2,
                 function(x) x / sum(x)))
# topic-term distributions: normalize each topic's term counts,
# then transpose to get a K x W matrix
phi <- t(apply(t(fit$topics) + eta, 2,
               function(x) x / sum(x)))
```
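As a quick sanity check (a sketch, not part of the original code), the dimensions should match what `createJSON()` expects: `theta` is D×K and `phi` is K×W.
```{r}
dim(theta)  # expected: D x K (documents x topics)
dim(phi)    # expected: K x W (topics x vocabulary terms)
```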
```{r}
library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
phi = phi,
theta = theta,
doc.length = doc.length,
vocab = vocab,
term.frequency = term.frequency
)
serVis(json, out.dir = 'vis',
open.browser = FALSE)
```
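To preview the visualization locally instead of publishing it, one option (an assumption about your setup, not part of the original workflow) is to let `serVis()` open a browser directly, or to serve the already generated `vis` directory, for example with the `servr` package.
```{r}
# open the visualization in the default browser straight away
serVis(json)
# ... or serve the already generated 'vis' directory
servr::httd("vis")
```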
</small>
## Result
<small>
The result is published at
[http://r-addict.com/r-bloggers-harvesting/](http://r-addict.com/r-bloggers-harvesting/),
where you can explore the Intertopic Distance Map (computed via multidimensional scaling)
and the top N most relevant terms for each topic.
</small>