-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
117 lines (88 loc) · 5.56 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# topicdoc
<!-- badges: start -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[![CRAN status](https://www.r-pkg.org/badges/version/topicdoc)](https://CRAN.R-project.org/package=topicdoc)
[![cran checks](https://cranchecks.info/badges/summary/topicdoc)](https://cran.r-project.org/web/checks/check_results_topicdoc.html)
[![Codecov test coverage](https://codecov.io/gh/doug-friedman/topicdoc/branch/master/graph/badge.svg)](https://app.codecov.io/gh/doug-friedman/topicdoc?branch=master)
[![R-CMD-check](https://github.com/doug-friedman/topicdoc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/doug-friedman/topicdoc/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
Like a (good) doctor, the goal of topicdoc is to help diagnose issues with your topic models through a variety of
different diagnostics and metrics. There are a lot of great R packages for fitting topic models, but not a lot that
help evaluate their fit. This package seeks to fill that void with functions that easily allow you to run those
diagnostics/metrics efficiently on a variety of topic models.
Currently, only topic models from the `topicmodels` library are supported. However, I'm targeting those from the `stm` library next.
## Installation
You can install the CRAN version of topicdoc with:
```r
install.packages("topicdoc")
```
You can install the development version of topicdoc from [Github](https://github.com/doug-friedman/topicdoc) with:
``` r
remotes::install_github("doug-friedman/topicdoc")
```
## Example
This is a simple use case - using an LDA model with the Associated Press Dataset in `topicmodels` which is made up of AP articles from 1988.
```{r example}
library(topicdoc)
library(topicmodels)
data("AssociatedPress")
lda_ap4 <- LDA(AssociatedPress,
control = list(seed = 33), k = 4)
# See the top 10 terms associated with each of the topics
terms(lda_ap4, 10)
# Calculate all diagnostics for each topic in the topic model
diag_df <- topic_diagnostics(lda_ap4, AssociatedPress)
diag_df
# ...or calculate them individually
topic_size(lda_ap4)
```
It's a lot easier to interpret the output if you put it all together in a nice plot.
```{r example_plot, dpi=300, fig.width=10}
library(ggplot2)
library(dplyr, warn.conflicts = F)
library(tidyr)
library(stringr)
diag_df <- diag_df %>%
mutate(topic_label = terms(lda_ap4, 5) %>%
apply(2, paste, collapse = ", "),
topic_label = paste(topic_num, topic_label, sep = " - "))
diag_df %>%
gather(diagnostic, value, -topic_label, -topic_num) %>%
ggplot(aes(x = topic_num, y = value,
fill = str_wrap(topic_label, 25))) +
geom_bar(stat = "identity") +
facet_wrap(~diagnostic, scales = "free") +
labs(x = "Topic Number", y = "Diagnostic Value",
fill = "Topic Label", title = "All Topic Model Diagnostics")
```
These diagnostics help provide a more rigorous confirmation of our intuition about identifying "good" versus "bad" topics in a topic model.
Almost, immediately you'd deduce that Topic 1 looks odd as it contains a lot of generic words (e.g. "two", "three", "like"). Using the diagnostics provided, you can see that Topic 1 has the largest topic size and highest document prominence indicating that it's incorporating many more tokens and documents than the others. The low TF/DF distance and exclusivity confirm our susipicion about the generic nature of the topic.
Alternatively, you can see that Topic 2 - the one centered around financial news - is the most coherent and exclusive topic.
## Diagnostics/Metrics Included
| Diagnostic/Metric | Function | Description |
|:-----------------------------------------------:|:-------------------:|:-------------------------------------------:|
| topic size | `topic_size` | Total (weighted) number of tokens per topic |
| mean token length | `mean_token_length` | Average number of characters for the top tokens per topic |
| distance from corpus distribution | `dist_from_corpus` | Distance of a topic's token distribution from the overall corpus token distribution |
| distance between token and document frequencies | `tf_df_dist` | Distance between a topic's token and document distributions |
| document prominence | `doc_prominence` | Number of unique documents where a topic appears
| topic coherence | `topic_coherence` | Measure of how often the top tokens in each topic appear together in the same document |
| topic exclusivity | `topic_exclusivity` | Measure of how unique the top tokens in each topic are compared to the other topics |
## Key References
The following documents provide key references for the diagnostics/metrics included in the package.
* [Jordan Boyd-Graber, David Mimno, and David Newman, 2014.
Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements.
CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.](http://www.people.fas.harvard.edu/~airoldi/pub/books/b02.AiroldiBleiEroshevaFienberg2014HandbookMMM/Ch12_MMM2014.pdf)
* [MALLET Topic Model Diagnostics](https://mallet.cs.umass.edu/diagnostics.php)