---
layout: page
title: Cinema and Clouds
subtitle: Courtney Stowers | March 9, 2020
---
<style type="text/css"> .inline { background-color: #f7f7f7; border:solid 1px #B0B0B0; } .error { font-weight: bold; color: #FF0000; } .warning { font-weight: bold; } .message { font-style: italic; } .source, .output, .warning, .error, .message { padding: 1em; } .source { background-color: #f5f5f5; } .left { text-align: left; } .right { text-align: right; } .center { text-align: center; } .hl.num { color: #AF0F91; } .hl.str { color: #317ECC; } .hl.com { color: #AD95AF; font-style: italic; } .hl.opt { color: #000000; } .hl.std { color: #585858; } .hl.kwa { color: #295F94; font-weight: bold; } .hl.kwb { color: #B05A65; } .hl.kwc { color: #55aa55; } .hl.kwd { color: #BC5A65; font-weight: bold; }

#toc_container { background: #f9f9f9 none repeat scroll 0 0; border: 1px solid #aaa; display: table; font-size: 95%; margin-bottom: 1em; padding: 20px; width: auto; float: left; vertical-align: text-top; }

.toc_title { font-weight: 700; text-align: center; }

.toc_list { text-align: center; }

#toc_container li, #toc_container ul, #toc_container ul li{ list-style: outside none none !important; } </style>

<title>Cinema and Clouds</title>

Introduction

Keywords article pic

Although data science projects often employ large amounts of numeric data, some projects examine patterns within text and require a different set of tools. In this code-through tutorial, we are going to explore several packages in R that enable researchers to analyze qualitative data sets and discover cool patterns. We are also going to create two word clouds based on movie plot summaries.

In order to complete this tutorial, you will need access to R or RStudio. If you are not familiar with either of these programs or would like a refresher on the basics, check out the first steps of my previous code-through here. I walk you through the entire set-up process, beginning with the software.

If you're all set to go with R and RStudio, then proceed below:



Set-Up

Library of Packages

Before we begin analyzing our data and creating word clouds, we need to load a few packages into our 'library'. Specifically, we need the following packages:

library(dplyr)  # helps organize our data
library(kableExtra)  # creates elegant tables for output
library(quanteda)  # processes our textual data for analysis
library(wordcloud2)  # creates the wordclouds

You can easily install all of these packages with the following code:

`install.packages("nameofpackage")`
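
Since this tutorial relies on four packages, it can help to check which of them are still missing before installing anything. The block below is a minimal sketch in base R; the helper name `missing_packages()` is our own invention for illustration and does not come from any package.

```r
# Packages loaded above in this tutorial
required <- c("dplyr", "kableExtra", "quanteda", "wordcloud2")

# Hypothetical helper (not from any package): returns the names in `pkgs`
# that do not appear in `installed`
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)

# Report what is missing instead of installing automatically
if (length(to_install) > 0) {
  message("Still needed: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are already installed.")
}
```

If `to_install` is not empty, passing it to `install.packages(to_install)` installs everything in one call.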


Data

Now that we have our packages installed, the next thing that we need is our dataset. For the purposes of this code-through we are going to be using a dataset of over 5000 movies from IMDB that we will access from [data.world](https://data.world/studentoflife/imdb-top-250-lists-and-5000-or-so-data-records){:target="_blank" rel="noopener"}.

![](/charts/pic1.png)

We will click the blue `Explore this dataset` button and then scroll down to select the `IMDBdata_MainData.csv` dataset.

Next, we will click the download button where we will select the download URL for R.

![](/charts/pic2.png)

Data.World provides us with the exact code we need to get started and import our data. Therefore, we will copy and paste it into a code block as follows:

df <- read.csv("https://query.data.world/s/rr46ndg7fyne54q7oonmvzxbaxg3zn", header = TRUE,
    stringsAsFactors = FALSE)

Preview Dataset

A quick peek at the column names of the dataset reveals the various fields available for us to use to explore the 5,000+ movies in the set.

colnames(df)
 [1] "Title"          "Year"           "Rated"          "Released"      
 [5] "Runtime"        "Genre"          "Director"       "Writer"        
 [9] "Actors"         "Plot"           "Language"       "Country"       
[13] "Awards"         "Poster"         "Ratings.Source" "Ratings.Value" 
[17] "Metascore"      "imdbRating"     "imdbVotes"      "imdbID"        
[21] "Type"           "DVD"            "BoxOffice"      "Production"    
[25] "Website"        "Response"       "tomatoURL"     

For ease of reference, we will change all of these column names to lowercase with the following function:

colnames(df) <- tolower(colnames(df))
colnames(df)
 [1] "title"          "year"           "rated"          "released"      
 [5] "runtime"        "genre"          "director"       "writer"        
 [9] "actors"         "plot"           "language"       "country"       
[13] "awards"         "poster"         "ratings.source" "ratings.value" 
[17] "metascore"      "imdbrating"     "imdbvotes"      "imdbid"        
[21] "type"           "dvd"            "boxoffice"      "production"    
[25] "website"        "response"       "tomatourl"     

For our analysis, we will focus specifically on the movie titles, genres, and plots, so we will create a smaller dataset with only these variables:

dat <- df[c("title", "genre", "plot")]

A preview of this dataset gives us a glimpse into what we're working with:

head(dat) %>% kable() %>% kable_styling()
| title | genre | plot |
|---|---|---|
| Code Name: K.O.Z. | Crime, Mystery | A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government. |
| Saving Christmas | Comedy, Family | Kirk is enjoying the annual Christmas party extravaganza thrown by his sister until he realizes he needs to help out Christian, his brother-in-law, who has a bad case of the bah-humbugs. ... |
| Superbabies: Baby Geniuses 2 | Comedy, Family, Sci-Fi | A group of smart-talking toddlers find themselves at the center of a media mogul's experiment to crack the code to baby talk. The toddlers must race against time for the sake of babies everywhere. |
| Daniel der Zauberer | Comedy, Crime, Fantasy | Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols. |
| Manos: The Hands of Fate | Horror | A family gets lost on the road and stumbles upon a hidden, underground, devil-worshiping cult led by the fearsome Master and his servant Torgo. |
| Pledge This! | Comedy | At South Beach University, a beautiful sorority president takes in a group of unconventional freshman girls seeking acceptance into her house. |

Now we will create an even smaller dataset of only romance movies using the `grep()` function, which allows us to search through text for specific keywords. Here, we will search for all movies that list romance as their only genre or as one of their genres:

grep(pattern = "romance", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>%
    kable() %>% kable_styling()
| x |
|---|
| Horror, Romance, Thriller |
| Comedy, Romance, Sport |
| Comedy, Romance |
| Comedy, Musical, Romance |
| Drama, Romance, Thriller |
| Drama, Music, Romance |

Using the `grepl()` function we will find all 927 movies that belong to the romance genre:

romance <- grepl("romance", dat$genre, ignore.case = T)
sum(romance)
[1] 927
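
To make the difference between `grep()` and `grepl()` concrete, here is a small sketch on a made-up vector of genre strings (the four values are illustrative and not rows from the dataset). It shows why `grepl()` is the one we sum and use as a row filter.

```r
# Toy genre strings in the same comma-separated style as the dataset
genre <- c("Comedy, Romance", "Crime, Mystery", "Horror, Romance, Thriller", "Drama")

# grep() returns the positions of the matching elements ...
grep("romance", genre, ignore.case = TRUE)

# ... or the matching strings themselves when value = TRUE
grep("romance", genre, ignore.case = TRUE, value = TRUE)

# grepl() returns one TRUE/FALSE per element, so it can be summed to count
# matches and used directly to subset rows
is_romance <- grepl("romance", genre, ignore.case = TRUE)
sum(is_romance)
genre[is_romance]
```

Because `grepl()` always returns a logical vector as long as its input, it lines up one-to-one with the rows of a data frame.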

We will use this criterion to segment out the romance movies:

dat.romance <- dat[romance, c("title", "genre", "plot")]
dat.romance %>% head(10) %>% kable() %>% kable_styling()
|  | title | genre | plot |
|---|---|---|---|
| 9 | Birdemic: Shock and Terror | Horror, Romance, Thriller | A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic? |
| 10 | Dream.net | Comedy, Romance, Sport | Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ... |
| 14 | The Hottie & the Nottie | Comedy, Romance | A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend. |
| 16 | From Justin to Kelly | Comedy, Musical, Romance | A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing. |
| 23 | Ben & Arthur | Drama, Romance, Thriller | A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church. |
| 32 | Glitter | Drama, Music, Romance | A young singer dates a disc jockey who helps her get into the music business, but their relationship become complicated as she ascends to super stardom. |
| 36 | Space Mutiny | Action, Adventure, Romance | A pilot is the only hope to stop the mutiny of a spacecraft by its security crew, who plot to sell the crew of the ship into slavery. |
| 51 | Gigli | Comedy, Crime, Romance | The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage. |
| 80 | A Story About Love | Romance | Two young people stand on a street corner in a run-down part of New York, kissing. Despite the lawlessness of the district they are left unmolested. A short distance away walk Maria and ... |
| 97 | The Bat People | Horror, Romance | After being bitten by a bat in a cave, a doctor undergoes an accelerating transformation into a man-bat, which ruins his vacation and causes considerable distress for his wife. |

Now that we have a collection of romance movie titles, we will focus on examining the plots of the movies to see if there are any similarities or common keywords across the summaries:

corp.romance <- corpus(dat.romance, docid_field = "title", text_field = "plot")
corp.romance
Corpus consisting of 927 documents and 1 docvar.

corp.romance[1:5] %>% kable() %>% kable_styling()
|  | x |
|---|---|
| Birdemic: Shock and Terror | A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic? |
| Dream.net | Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ... |
| The Hottie & the Nottie | A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend. |
| From Justin to Kelly | A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing. |
| Ben & Arthur | A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church. |

# summarize corpus
summary(corp.romance)[1:10, ] %>% kable() %>% kable_styling()
| Text | Types | Tokens | Sentences | genre |
|---|---|---|---|---|
| Birdemic: Shock and Terror | 32 | 36 | 2 | Horror, Romance, Thriller |
| Dream.net | 31 | 40 | 2 | Comedy, Romance, Sport |
| The Hottie & the Nottie | 21 | 23 | 1 | Comedy, Romance |
| From Justin to Kelly | 27 | 29 | 1 | Comedy, Musical, Romance |
| Ben & Arthur | 30 | 32 | 1 | Drama, Romance, Thriller |
| Glitter | 28 | 28 | 1 | Drama, Music, Romance |
| Space Mutiny | 24 | 30 | 1 | Action, Adventure, Romance |
| Gigli | 27 | 32 | 1 | Comedy, Crime, Romance |
| A Story About Love | 32 | 39 | 3 | Romance |
| The Bat People | 27 | 32 | 1 | Horror, Romance |

# Process text

# trim any sentences with fewer than 1 token from the plot summaries
corp.romance <- corpus_trim(corp.romance, what = "sentences", min_ntoken = 1)
corp.romance
Corpus consisting of 927 documents and 1 docvar.
# remove punctuation
tokens.romance <- tokens(corp.romance, what = "word", remove_punct = TRUE)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "A"          "horde"      "of"         "mutated"    "birds"     
 [6] "descends"   "upon"       "the"        "quiet"      "town"      
[11] "of"         "Half"       "Moon"       "Bay"        "California"
[16] "With"       "the"        "death"      "toll"       "rising"    
[21] "Two"        "citizens"   "manage"     "to"         "fight"     
[26] "back"       "but"        "will"       "they"       "survive"   
[31] "Birdemic"  

Dream.net :
 [1] "Regina" "the" "once" "popular" "girl"
 [6] "has" "to" "make" "new" "friends"
[11] "at" "her" "new" "conservative" "school"
[16] "Problems" "arrive" "when" "she" "becomes"
[21] "enemies" "with" "Lívia" "the" "school's"
[26] "queen" "bee" "and" "falls" "in"
[31] "love" "with"

The Hottie & the Nottie :
 [1] "A" "woman" "agrees" "to" "go"
 [6] "on" "a" "date" "with" "a"
[11] "man" "only" "if" "he" "finds"
[16] "a" "suitor" "for" "her" "unattractive"
[21] "best" "friend"

From Justin to Kelly :
 [1] "A" "waitress" "from" "Texas" "and"
 [6] "a" "college" "student" "from" "Pennsylvania"
[11] "meet" "during" "spring" "break" "in"
[16] "Fort" "Lauderdale" "Florida" "and" "come"
[21] "together" "through" "their" "shared" "love"
[26] "of" "singing"

Ben & Arthur :
 [1] "A" "pair" "of" "recently" "married"
 [6] "gay" "men" "are" "threatened" "by"
[11] "one" "of" "the" "partners" "brother"
[16] "a" "religious" "fanatic" "who" "plots"
[21] "to" "murder" "them" "after" "being"
[26] "ostracized" "by" "his" "church"

Glitter :
 [1] "A" "young" "singer" "dates" "a"
 [6] "disc" "jockey" "who" "helps" "her"
[11] "get" "into" "the" "music" "business"
[16] "but" "their" "relationship" "become" "complicated"
[21] "as" "she" "ascends" "to" "super"
[26] "stardom"

# convert to lower case
tokens.romance <- tokens_tolower(tokens.romance, keep_acronyms = TRUE)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "a"          "horde"      "of"         "mutated"    "birds"     
 [6] "descends"   "upon"       "the"        "quiet"      "town"      
[11] "of"         "half"       "moon"       "bay"        "california"
[16] "with"       "the"        "death"      "toll"       "rising"    
[21] "two"        "citizens"   "manage"     "to"         "fight"     
[26] "back"       "but"        "will"       "they"       "survive"   
[31] "birdemic"  

Dream.net :
 [1] "regina" "the" "once" "popular" "girl"
 [6] "has" "to" "make" "new" "friends"
[11] "at" "her" "new" "conservative" "school"
[16] "problems" "arrive" "when" "she" "becomes"
[21] "enemies" "with" "lívia" "the" "school's"
[26] "queen" "bee" "and" "falls" "in"
[31] "love" "with"

The Hottie & the Nottie :
 [1] "a" "woman" "agrees" "to" "go"
 [6] "on" "a" "date" "with" "a"
[11] "man" "only" "if" "he" "finds"
[16] "a" "suitor" "for" "her" "unattractive"
[21] "best" "friend"

From Justin to Kelly :
 [1] "a" "waitress" "from" "texas" "and"
 [6] "a" "college" "student" "from" "pennsylvania"
[11] "meet" "during" "spring" "break" "in"
[16] "fort" "lauderdale" "florida" "and" "come"
[21] "together" "through" "their" "shared" "love"
[26] "of" "singing"

Ben & Arthur :
 [1] "a" "pair" "of" "recently" "married"
 [6] "gay" "men" "are" "threatened" "by"
[11] "one" "of" "the" "partners" "brother"
[16] "a" "religious" "fanatic" "who" "plots"
[21] "to" "murder" "them" "after" "being"
[26] "ostracized" "by" "his" "church"

Glitter :
 [1] "a" "young" "singer" "dates" "a"
 [6] "disc" "jockey" "who" "helps" "her"
[11] "get" "into" "the" "music" "business"
[16] "but" "their" "relationship" "become" "complicated"
[21] "as" "she" "ascends" "to" "super"
[26] "stardom"

In order to make the data set more precise, we will remove stopwords (very common words such as articles, pronouns, and prepositions) from the data set:

tokens.romance <- tokens_remove(tokens.romance, c(stopwords("english"), "nbsp"),
    padding = F)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "horde"      "mutated"    "birds"      "descends"   "upon"      
 [6] "quiet"      "town"       "half"       "moon"       "bay"       
[11] "california" "death"      "toll"       "rising"     "two"       
[16] "citizens"   "manage"     "fight"      "back"       "survive"   
[21] "birdemic"  

Dream.net :
 [1] "regina" "popular" "girl" "make" "new"
 [6] "friends" "new" "conservative" "school" "problems"
[11] "arrive" "becomes" "enemies" "lívia" "school's"
[16] "queen" "bee" "falls" "love"

The Hottie & the Nottie :
[1] "woman" "agrees" "go" "date" "man"
[6] "finds" "suitor" "unattractive" "best" "friend"

From Justin to Kelly :
 [1] "waitress" "texas" "college" "student" "pennsylvania"
 [6] "meet" "spring" "break" "fort" "lauderdale"
[11] "florida" "come" "together" "shared" "love"
[16] "singing"

Ben & Arthur :
 [1] "pair" "recently" "married" "gay" "men"
 [6] "threatened" "one" "partners" "brother" "religious"
[11] "fanatic" "plots" "murder" "ostracized" "church"

Glitter :
 [1] "young" "singer" "dates" "disc" "jockey"
 [6] "helps" "get" "music" "business" "relationship"
[11] "become" "complicated" "ascends" "super" "stardom"
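
To see what the filtering step is doing under the hood, here is a rough base R sketch of the same idea on one plot summary. The four-word `toy_stopwords` list is an assumption made for illustration only; the real `stopwords("english")` list is far longer.

```r
# One plot summary from the preview above, shortened to a single clause
plot_text <- "A woman agrees to go on a date with a man"

# Lowercase the text, then split it into individual words
tokens <- strsplit(tolower(plot_text), " ", fixed = TRUE)[[1]]

# Illustrative stopword list only (an assumption, not quanteda's list)
toy_stopwords <- c("a", "to", "on", "with")

# Keep every token that is NOT in the stopword list
kept <- tokens[!tokens %in% toy_stopwords]
kept
```

The surviving words line up with the start of the filtered tokens shown for The Hottie & the Nottie: what is left carries the content of the sentence.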

In order to consolidate similar words, we will stem the data set:

# stem the words in the token list:
tokens.romance <- tokens_wordstem(tokens.romance)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "hord"       "mutat"      "bird"       "descend"    "upon"      
 [6] "quiet"      "town"       "half"       "moon"       "bay"       
[11] "california" "death"      "toll"       "rise"       "two"       
[16] "citizen"    "manag"      "fight"      "back"       "surviv"    
[21] "birdem"    

Dream.net :
 [1] "regina" "popular" "girl" "make" "new" "friend" "new"
 [8] "conserv" "school" "problem" "arriv" "becom" "enemi" "lívia"
[15] "school" "queen" "bee" "fall" "love"

The Hottie & the Nottie :
[1] "woman" "agre" "go" "date" "man" "find"
[7] "suitor" "unattract" "best" "friend"

From Justin to Kelly :
 [1] "waitress" "texa" "colleg" "student" "pennsylvania"
 [6] "meet" "spring" "break" "fort" "lauderdal"
[11] "florida" "come" "togeth" "share" "love"
[16] "sing"

Ben & Arthur :
 [1] "pair" "recent" "marri" "gay" "men" "threaten"
 [7] "one" "partner" "brother" "religi" "fanat" "plot"
[13] "murder" "ostrac" "church"

Glitter :
 [1] "young" "singer" "date" "disc" "jockey"
 [6] "help" "get" "music" "busi" "relationship"
[11] "becom" "complic" "ascend" "super" "stardom"

We can see which words commonly occur together:

# find frequently co-occurring words (typically compound words)
# note: in quanteda 3.0 and later, textstat_frequency() lives in the quanteda.textstats package
ngram.romance <- tokens_ngrams(tokens.romance, n = 2) %>% dfm()
ngram.romance %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()
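| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| fall_love | 57 | 1 | 57 | all |
| new_york | 37 | 2 | 36 | all |
| young_man | 32 | 3 | 32 | all |
| young_woman | 28 | 4 | 28 | all |
| high_school | 28 | 4 | 27 | all |
| best_friend | 27 | 6 | 26 | all |
| world_war | 14 | 7 | 14 | all |
| york_citi | 12 | 8 | 12 | all |
| true_love | 10 | 9 | 10 | all |
| war_ii | 9 | 10 | 9 | all |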

ngram.romance3 <- tokens_ngrams(tokens.romance, n = 3) %>% dfm()
ngram.romance3 %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()
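| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| new_york_citi | 12 | 1 | 12 | all |
| world_war_ii | 9 | 2 | 9 | all |
| two_best_friend | 5 | 3 | 5 | all |
| woman_fall_love | 4 | 4 | 4 | all |
| fall_love_woman | 4 | 4 | 4 | all |
| meet_fall_love | 3 | 6 | 3 | all |
| drama_center_around | 3 | 6 | 3 | all |
| high_school_crush | 3 | 6 | 3 | all |
| young_woman_find | 3 | 6 | 3 | all |
| experi_chang_live | 3 | 6 | 3 | all |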
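
For readers curious about how features like `fall_love` or `meet_fall_love` come about, the sketch below builds n-grams by hand in base R. The function `make_ngrams()` and the four sample stems are hypothetical and only mimic the idea behind `tokens_ngrams()`; this is not how quanteda implements it.

```r
# Hypothetical helper: join every run of n neighbouring tokens with "_"
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

# Made-up sequence of stemmed tokens with stopwords already removed
stems <- c("meet", "fall", "love", "woman")

make_ngrams(stems, n = 2)  # pairs of neighbouring stems
make_ngrams(stems, n = 3)  # triples of neighbouring stems
```

Because the stopwords were dropped before this step, stems that were separated by a stopword in the original sentence (as in "fall in love") end up side by side, which is why `fall_love` appears as a bigram.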

Finally we can see which words are most common in the romance movie plot summaries:

# the tokens were already stemmed above, so no extra stemming is needed here
tokens.romance %>% dfm() %>% topfeatures() %>% kable() %>% kable_styling()
|  | x |
|---|---|
| love | 182 |
| young | 146 |
| woman | 144 |
| man | 138 |
| life | 112 |
| fall | 93 |
| find | 93 |
| friend | 91 |
| two | 90 |
| new | 88 |

Unsurprisingly(!!), love is the most frequently used word, followed by young, woman, and man.


Word Clouds

Now that we've previewed our data step-by-step, we can create a word cloud. We will repeat some of our previous steps in a more automated format using the `tm` package. If you do not have it installed, you can install it via `install.packages("tm")`.

library(tm)  # package for processing text

Romance Movies

For our first word cloud, we will create a new dataset containing only the plots from our romance movies:

# Create a vector containing only the text
romance.plot <- dat.romance$plot

Then we will create a new Corpus consisting solely of the plot:

# Create a corpus
romance.docs <- Corpus(VectorSource(romance.plot))
romance.docs
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 927

Afterwards, we will go through the same steps that we did above to clean our data set:

romance.docs <- romance.docs %>% tm_map(removeNumbers) %>% tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
romance.docs <- tm_map(romance.docs, content_transformer(tolower))
romance.docs <- tm_map(romance.docs, removeWords, stopwords("english"))
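
As a rough picture of what these cleaning steps do to a single string, here is a base R stand-in. The function `clean_text()` is our own hypothetical helper and only approximates `removeNumbers`, `removePunctuation`, `stripWhitespace`, and `tolower`; it is not how `tm` implements them.

```r
# Hypothetical helper that approximates the tm cleaning steps on plain strings
clean_text <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- tolower(x)                    # convert to lowercase
  x <- gsub("[[:space:]]+", " ", x)  # collapse repeated whitespace
  trimws(x)                          # remove leading/trailing spaces
}

# Made-up sentence in the style of a plot summary
clean_text("In 1999, Two  citizens fight back!")
```

One design point worth noticing: punctuation is removed before the stopwords are, so a contraction in the text loses its apostrophe before it is compared against the stopword list.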

And once again find our most frequently used words:

romance.dtm <- TermDocumentMatrix(romance.docs)
romance.matrix <- as.matrix(romance.dtm)
romance.words <- sort(rowSums(romance.matrix), decreasing = TRUE)
romance.df <- data.frame(word = names(romance.words), freq = romance.words)
romance.df %>% head(10) %>% kable() %>% kable_styling()
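|  | word | freq |
|---|---|---|
| love | love | 174 |
| young | young | 146 |
| woman | woman | 133 |
| man | man | 132 |
| life | life | 110 |
| two | two | 90 |
| new | new | 88 |
| friends | friends | 62 |
| one | one | 62 |
| family | family | 62 |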
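
The term-document matrix can feel abstract, so here is a tiny hand-built version in base R. The three "documents" are made up, and the construction with `table()` is only a sketch of the idea, not what `TermDocumentMatrix()` does internally.

```r
# Three toy "documents", already cleaned and split into words (made-up data)
docs <- list(
  c("love", "young", "woman"),
  c("love", "man"),
  c("young", "love")
)

# Every distinct term becomes a row; every document becomes a column
terms <- sort(unique(unlist(docs)))
tdm <- vapply(
  docs,
  function(d) as.integer(table(factor(d, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

# Row sums give the total count per term, sorted from most to least frequent
word_totals <- sort(rowSums(tdm), decreasing = TRUE)
freq_df <- data.frame(word = names(word_totals), freq = word_totals)
freq_df
```

This also explains why each word appears twice in the table above: the data frame inherits its row names from the named frequency vector, and the `word` column repeats them.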

Finally, we will create a color palette and *voilà!* we've created a word cloud based on over 900 romance movie plots:

pallete <- mutate(romance.df, color = cut(freq, breaks = c(0, 40, 80, 120, 160, Inf),
    labels = c("#FFCAE5", "#BF1168", "#EA72C4", "#FF167F", "#A40000"), include.lowest = TRUE))

romance.word.cloud <- wordcloud2(data = pallete, color = pallete$color)
romance.word.cloud

![](/charts/romancewordcloud.png)
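
To see how `cut()` assigns a color to each word, the sketch below bins five of the romance frequencies from the table above using the same break points. The vector names `shades` and `word_colors` are our own, and each label is written as a hex color string beginning with `#`.

```r
# Five word frequencies taken from the romance table above
freq <- c(love = 174, young = 146, life = 110, two = 90, friends = 62)

# One hex color per bin, from the lowest bin to the highest
shades <- c("#FFCAE5", "#BF1168", "#EA72C4", "#FF167F", "#A40000")

# Bins are (0,40], (40,80], (80,120], (120,160], (160,Inf];
# include.lowest = TRUE also places a value of exactly 0 in the first bin
bins <- cut(freq, breaks = c(0, 40, 80, 120, 160, Inf), labels = shades,
            include.lowest = TRUE)

# cut() returns a factor, so convert it to plain color strings
word_colors <- as.character(bins)
data.frame(word = names(freq), freq = freq, color = word_colors)
```

Words in the same frequency band share a color, so `life` (110) and `two` (90) both land in the 80 to 120 bin.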

If we would like to save our widget, we can save it as an HTML, PNG, or PDF file.

# save it in html
library("htmlwidgets")
library(webshot)

saveWidget(romance.word.cloud, "romancewordcloud.html", selfcontained = F)

# and in png or pdf
webshot("romancewordcloud.html", "romancewordcloud.png", delay = 60, vwidth = 960, vheight = 480)


Crime Movies

Finally, for a comparison word cloud, we will create a second data set for crime movies to find out how keywords for this genre differ from romance movies:

grep(pattern = "crime", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>%
    kable() %>% kable_styling()
| x |
|---|
| Crime, Mystery |
| Comedy, Crime, Fantasy |
| Action, Crime, Drama |
| Comedy, Crime, Romance |
| Crime, Drama, Music |
| Comedy, Crime, Family |

We find that there are 930 movies with crime listed as a genre:

crime <- grepl("crime", dat$genre, ignore.case = T)
sum(crime)
[1] 930

Once again, we will use this criterion to segment out the crime movies:

dat.crime <- dat[crime, c("title", "genre", "plot")]
dat.crime %>% head(10) %>% kable() %>% kable_styling()
|  | title | genre | plot |
|---|---|---|---|
| 1 | Code Name: K.O.Z. | Crime, Mystery | A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government. |
| 4 | Daniel der Zauberer | Comedy, Crime, Fantasy | Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols. |
| 46 | Final Justice | Action, Crime, Drama | Homicidal Sheriff Thomas Jefferson Geronimo is tasked with escorting a mobster to Malta; when the prisoner escapes, Geronimo goes rogue to catch him. |
| 51 | Gigli | Comedy, Crime, Romance | The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage. |
| 63 | Girl in Gold Boots | Crime, Drama, Music | A young woman leaves her job as a waitress and travels to Los Angeles, where she strives to become the top star in the glamorous world of go-go dancing. |
| 69 | Baby Geniuses | Comedy, Crime, Family | Scientist hold talking, super-intelligent babies captive, but things take a turn for the worse when a mix-up occurs between a baby genius and its twin. |
| 75 | Tees Maar Khan | Comedy, Crime | Posing as a movie producer, a conman attempts to trick an entire village into helping him rob a treasure-laden train. |
| 86 | Mitchell | Action, Crime, Drama | A sleazy, incompetent detective tries to simultaneously take down heroin dealers and a socialite who murdered a burglar. |
| 96 | I Accuse My Parents | Crime, Drama | Young man goes to work for gangsters to impress his nightclub-singer girlfriend. |
| 99 | Ghosts Can't Do It | Comedy, Crime, Fantasy | Elderly Scott kills himself after a heart attack wrecks his body, but then comes back as a ghost and convinces his loving young hot wife Kate to pick and kill a young man in order for Scott to possess his body and be with her again. |

Now we extract the crime plots:

# Create a vector containing only the text
crime.plot <- dat.crime$plot

Create a crime corpus:

# Create a corpus
crime.docs <- Corpus(VectorSource(crime.plot))
crime.docs
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 930

crime.docs <- crime.docs %>% tm_map(removeNumbers) %>% tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
crime.docs <- tm_map(crime.docs, content_transformer(tolower))
crime.docs <- tm_map(crime.docs, removeWords, stopwords("english"))

crime.dtm <- TermDocumentMatrix(crime.docs)
crime.matrix <- as.matrix(crime.dtm)
crime.words <- sort(rowSums(crime.matrix), decreasing = TRUE)
crime.df <- data.frame(word = names(crime.words), freq = crime.words)
crime.df %>% head(10) %>% kable() %>% kable_styling()
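|  | word | freq |
|---|---|---|
| man | man | 109 |
| young | young | 89 |
| two | two | 86 |
| police | police | 83 |
| one | one | 77 |
| life | life | 69 |
| murder | murder | 60 |
| new | new | 59 |
| crime | crime | 57 |
| drug | drug | 52 |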

For crime movies, we find that our most frequent word is man, and young comes in second place again. The third most frequently used word here is 'two', but police is not far behind in fourth place, as would be expected in a crime movie.

Now we create a new color palette, and just like that we have a second word cloud based on over 900 crime movie plots:

crime.pallete <- mutate(crime.df, color = cut(freq, breaks = c(0, 25, 50, 75, 100,
    Inf), labels = c("#C9D1D6", "#616D7A", "#5B83AD", "#1E1E1E", "#13365B"), include.lowest = TRUE))

crime.word.cloud <- wordcloud2(data = crime.pallete, color = crime.pallete$color)
crime.word.cloud

![](/charts/crimewordcloud.png)

# save it in html
library("htmlwidgets")
library(webshot)

saveWidget(crime.word.cloud, "crimewordcloud.html", selfcontained = F)

# and in png or pdf
webshot("crimewordcloud.html", "crimewordcloud.png", delay = 60, vwidth = 960, vheight = 480)


Comparison and Conclusion:

Now that we have completed this tutorial, we have been able to discover some similarities in individual words between the two word clouds: man, young, two, one, life, and new rank among the ten most frequent words for both genres.

However, the collective data from both sets of plots clearly displays how some of the same keywords can carry very different meanings between genres. We have also learned two methods for processing data. The first method, using `grep()`, `grepl()`, and `quanteda`, is quite a bit longer than the `tm` method, but it has the perk of letting us see our data at every stage. By contrast, `tm` does not provide as much insight into the textual analysis, but for quick projects like these word clouds, it makes the process much easier!
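
The overlap between the two genres can also be checked with base R set operations. The sketch below simply copies the two top-ten lists from the `tm` frequency tables above into vectors; the variable names are our own.

```r
# Top ten words from the romance and crime frequency tables above
romance_top <- c("love", "young", "woman", "man", "life",
                 "two", "new", "friends", "one", "family")
crime_top <- c("man", "young", "two", "police", "one",
               "life", "murder", "new", "crime", "drug")

# Words that appear in both top-ten lists
shared <- intersect(romance_top, crime_top)

# Words that appear in only one of the two lists
romance_only <- setdiff(romance_top, crime_top)
crime_only <- setdiff(crime_top, romance_top)

shared
romance_only
crime_only
```

Six of the ten words are shared, so it is the remaining four in each list (love, woman, friends, and family versus police, murder, crime, and drug) that set the two genres apart.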


Resources

Now that you have created your first word cloud, I encourage you to try out more of the features yourself! The following links contain tutorials and resources to guide you each step of the way:

  • Create a word cloud with R: This tutorial from 'towards data science' by Celine Van den Rul served as the inspiration for this code-through and includes great step-by-step details on all of the features of `wordcloud2` as well as its more basic predecessor, `wordcloud`. Both packages have unique benefits!

  • Wordcloud: This article focuses on the first `wordcloud` package and has lots of great insight.

  • R Graph Gallery: Wordcloud2: The R Graph Gallery has excellent tutorials on various data visualization strategies in R and is a wonderful guide for `wordcloud2`.

  • CRAN also has wonderful resources for `wordcloud2` available for view here in PDF form and here in website form.

  • Webshot is a neat package that allows us to create static saved images of the ever-changing word clouds that we create.

  • Coolors is a great site to find colors for your word cloud palettes.

  • Last, but certainly not least, Data.World is an excellent place to find data on just about any subject, and the pre-configured R URL download links are especially helpful!


About The Author

This code-through tutorial was created by Courtney Stowers for CPP 527 Data Science 2 in the Master of Science in Program Evaluation and Data Analytics program at Arizona State University. She can be contacted at: [email protected].