added files to create presidential_speech data #40

Open: wants to merge 2 commits into base: develop

Conversation

jjn13
Collaborator

@jjn13 jjn13 commented Sep 17, 2018

Created a raw-data directory with code for creating the presidential_speech dataset.

@michaelweylandt
Member

We don't need the pyc files, and I don't think we need the HTML scrape files. (I don't think this needs to be fully bit-for-bit reproducible, just illustrative.) Thoughts?

@jjn13
Collaborator Author

jjn13 commented Sep 17, 2018

That sounds fine. I'll drop the pyc files and raw speeches, and resubmit.

@michaelweylandt
Member

We also don't need the rds file (that's not the "raw" data)

wrd.var <- apply(dtm.mat.log,2,var)
top.wrd.var <- names(sort(wrd.var,decreasing = TRUE)[1:75])
dtm.mat.log <- dtm.mat.log[,colnames(dtm.mat.log) %in% top.wrd.var]
saveRDS(dtm.mat.log,"presidential_speech.rds")
Member

Missing line ending

library(SnowballC)
library(parallel)
library(Matrix)
library(tidyverse)
Member

Are all of these dependencies used? I only see tm, stringr, and tidyverse used below.

@@ -0,0 +1,11 @@
#!/bin/bash
Member

This file appears to be identical to move_inaug_results.sh. Is that intentional?

class InaugTextSpider(scrapy.Spider):
    name = "inaug_text"
    allowed_domains = ["http://www.presidency.ucsb.edu/"]
    start_urls = (
Member

Add a note explaining what these URLs are and how to keep this list up to date (if possible)
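
A side note on the spider excerpt above, offered as a sketch rather than a required change: Scrapy documents allowed_domains as a list of bare domain names, so an entry that includes a scheme and trailing slash is likely to be ignored by the offsite filter. Assuming the intent is to restrict crawling to the site shown in the diff, the host can be derived with the standard library. The helper name bare_domain is hypothetical and not part of this PR.

```python
from urllib.parse import urlparse


def bare_domain(url_or_domain):
    """Return the host part of a URL, or the input itself if it is already a bare domain.

    Hypothetical helper for illustration; it is not part of the PR.
    """
    parsed = urlparse(url_or_domain)
    # urlparse only populates hostname when the string contains "//",
    # so fall back to the input (minus any stray slashes) otherwise.
    return parsed.hostname or url_or_domain.strip("/")


# The URL below is the value currently used in the spider.
allowed_domains = [bare_domain("http://www.presidency.ucsb.edu/")]
```

Writing the bare domain as a literal would work just as well; the helper only shows where the value comes from.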

name = "sou_text"
allowed_domains = ["http://www.presidency.ucsb.edu"]
start_urls = (
    'http://www.presidency.ucsb.edu/ws/index.php?pid=123408',
Member

Explanatory note needed.
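
One possible shape for such a note, as a minimal sketch: keep the page identifiers in a commented list and build the URLs from it. This assumes every entry differs only in its pid query parameter and that each pid identifies one speech page; neither is confirmed by the excerpt. The names BASE_URL, SPEECH_PIDS, and build_start_urls are hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the URL shown in the diff.
BASE_URL = "http://www.presidency.ucsb.edu/ws/index.php"

# Assumption: each pid identifies one speech page on the site.
# To keep the list current, append the pid of each newly published speech.
SPEECH_PIDS = [
    123408,  # the only pid visible in this excerpt; the others are elided
]


def build_start_urls(pids):
    """Build one start URL per pid (hypothetical helper, not part of the PR)."""
    return tuple(f"{BASE_URL}?{urlencode({'pid': pid})}" for pid in pids)


start_urls = build_start_urls(SPEECH_PIDS)
```

Keeping the identifiers separate from the URL template gives the explanatory comment a single obvious home and makes additions a one-line change.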

@michaelweylandt
Member

@jjn13 Any updates on this PR?

@michaelweylandt added the Documentation (Documentation-related issues) label Oct 11, 2018