R package to scrape content like mission statements and social media handles from nonprofit websites
- input website URL
- specify node types
- get tidy dataset with URL + node content
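# install the development version from GitHub
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )
library( webscraper )
library( dplyr )     # for the %>% pipe
library( pander )    # for pander()
url <- "https://www.mainewelfaredirectors.org"
dat <- get_p_node_data( url )
head( dat ) %>% pander( style="grid" )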
| tag | text |
| p | General Assistance \| Referral \| Advocacy |
| p | Establish and promote equitable, efficient and standardized administration of General Assistance. |
| p | Encourage the professional development, growth, and knowledge base of those who administer General Assistance. |
| p | Advocate for the municipalities and citizens that we serve. |
| p | Actively promote and present our program needs to the Legislature and citizens by creating a greater public awareness of the importance and the benefits of equitable, efficient and standardized General Assistance administration. |
| p | Click here to download current MWDA Bylaws |

| URL | page |
| http://mainewelfaredirectors.org/association.htm | association |

| xpath |
| /html/body/table/tr[2]/td/p |
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[1]/p |
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[2]/p |
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[3]/p |
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[4]/p |
| /html/body/table/tr[3]/td[2]/strong/blockquote/p[1] |
To collect data from multiple URLs at a time:
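head( sample.urls )                               # data frame with one website per row in ORGURL
urls <- sample.urls$ORGURL
results.list <- lapply( urls, get_p_node_data )   # one data frame per URL
d <- dplyr::bind_rows( results.list )             # stack into a single data frame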
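A long list of URLs will usually contain some dead or slow sites, and one error inside lapply() stops the whole run. The sketch below shows one way to guard against that. It assumes get_p_node_data() signals an ordinary R error on failure, which is not confirmed here; safe_scrape() and fake_scraper() are hypothetical helpers and not part of webscraper.

```r
# Hypothetical wrapper: return NULL instead of stopping when a scrape fails.
safe_scrape <- function( url, scraper )
{
  tryCatch( scraper( url ),
            error = function( e ) {
              message( "skipping ", url, ": ", conditionMessage( e ) )
              NULL
            } )
}

# Stand-in for get_p_node_data() so the sketch runs offline.
fake_scraper <- function( url )
{
  if( grepl( "bad", url ) ) stop( "site not reachable" )
  data.frame( URL = url, tag = "p", text = "example text",
              stringsAsFactors = FALSE )
}

urls <- c( "http://good-one.org", "http://bad-one.org", "http://good-two.org" )
results.list <- lapply( urls, safe_scrape, scraper = fake_scraper )

# Drop the failed URLs (NULL entries), then stack the rest.
results.list <- Filter( Negate( is.null ), results.list )
d <- do.call( rbind, results.list )
```

With the real package, the same call would pass scraper = get_p_node_data.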
The package collects data following this process:
1. user provides a URL
2. normalize the URL
- save original URL
- create normalized version ("http://some-name.com") >> normalize_url()
- create root url from normalized >> create_root_url()
3. check URL status >> check_url_status()
- exists & active --> visit page (4)
- exists & not responding --> try root domain
- does not exist --> try root domain
4. visit page if URL is active
- if redirected, capture redirect
-----> table 1 captures URL status
5. load website
- catalog all internal links on the landing page for snowball sample
- search for contact info (not yet implemented)
- social media sites
- social media handles
- email?
-----> table 2 captures links and handles
6. build node list on current page
- capture data for specified node types
- capture node meta-data (page position, tag attributes) for context
7. drill down and repeat for one level of links
- at completion return nodes table
-----> table 3 contains page content atomized by tags
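Step 2 of the process above can be sketched in base R. This is only an illustration of the idea, not the package's normalize_url() code: the helper names are hypothetical, and the rules (force http, drop "www.", drop trailing slashes) are assumptions inferred from the "http://some-name.com" example.

```r
# Hypothetical helpers illustrating step 2; not the package's own code.
sketch_normalize_url <- function( x )
{
  # simplification: lowercases the whole string, although paths can be case-sensitive
  x <- tolower( trimws( x ) )
  x <- sub( "^[a-z]+://", "", x )   # drop any existing scheme
  x <- sub( "^www\\.", "", x )      # drop a leading www.
  x <- sub( "/+$", "", x )          # drop trailing slashes
  paste0( "http://", x )
}

sketch_root_url <- function( normalized )
{
  # keep the scheme and host, discard path, query, and fragment
  sub( "^(http://[^/?#]+).*$", "\\1", normalized )
}

raw  <- " HTTPS://www.Some-Name.com/about/staff.html "
norm <- sketch_normalize_url( raw )   # "http://some-name.com/about/staff.html"
root <- sketch_root_url( norm )       # "http://some-name.com"
```

Keeping the raw string alongside norm and root matches the "save original URL" step, so each version can be tested separately.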
Functions should return the following tables:
table-01 (URL status)
- org name
- org id: unique key
- original_URL: raw string with unvalidated text of web domain
- normalized_URL: cleaned and put in standard http format
- root_URL: root domain parsed from the normalized URL
- redirect_URL: capture redirects
- url_version: original, normalized, redirect, or root version that was tested
- domain_status: VALID, EXISTS (but http.status = F), DNE (does not exist)
- active_URL: best domain version to use
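The domain_status and active_URL fields could be derived along these lines. This is a sketch under stated assumptions: both function names are hypothetical, and the rule "use the first version that tests VALID, in the order the versions are supplied" is an assumption rather than documented package behavior. In practice the two logical inputs might come from RCurl::url.exists() and !httr::http_error().

```r
# Hypothetical classifier for the domain_status field.
#   exists   : does the domain exist at all?
#   responds : does it answer with a non-error HTTP status?
classify_domain <- function( exists, responds )
{
  if( !exists )   return( "DNE" )
  if( !responds ) return( "EXISTS" )
  "VALID"
}

# Hypothetical rule: the first version that tests VALID becomes active_URL.
choose_active_url <- function( versions, status )
{
  valid <- unname( versions[ status == "VALID" ] )
  if( length( valid ) == 0 ) return( NA_character_ )
  valid[ 1 ]
}

versions <- c( original   = "some-name.com/old-page",
               normalized = "http://some-name.com/old-page",
               root       = "http://some-name.com" )

# Made-up test results for the three versions, for illustration only.
status <- c( classify_domain( exists = FALSE, responds = FALSE ),
             classify_domain( exists = TRUE,  responds = FALSE ),
             classify_domain( exists = TRUE,  responds = TRUE ) )

active <- choose_active_url( versions, status )
```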
table-03 ( get_p_nodes function )
- org name
- org id (unique key)
- date of data capture
- "active" URL root
- subdomain
- xpath
- node type
- attributes (html class, id, etc)
- node text
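To make the table-03 fields concrete, here is a minimal sketch that builds the node-level columns from an inline HTML string using xml2. It is not the package's get_p_node_data() implementation; the HTML snippet and the "name=value" attribute format are made up for illustration, and the org name, org id, and URL columns are left out.

```r
library( xml2 )

# Made-up page content so the sketch runs offline.
html <- '<html><body>
  <h1>About</h1>
  <p class="mission">We serve the community.</p>
  <ul><li><p id="goal-1">Advocate for members.</p></li></ul>
</body></html>'

page  <- read_html( html )
nodes <- xml_find_all( page, "//p" )    # node type to collect

# Collapse each node's attributes into one "name=value" string.
attr.string <- sapply( xml_attrs( nodes ), function( a ) {
  paste( names( a ), a, sep = "=", collapse = "; " )
} )

dat <- data.frame( date       = Sys.Date(),
                   xpath      = xml_path( nodes ),
                   tag        = xml_name( nodes ),
                   attributes = attr.string,
                   text       = xml_text( nodes, trim = TRUE ),
                   stringsAsFactors = FALSE )
```

Other node types would only change the XPath expression, for example "//ul" or "//td".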
table-02 (social media) - one-to-many (many accounts for one org)
- org id
- working url
- subdomain
- social media type (twitter, linkedin, facebook)
- ID - handle or account id
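Since the process list marks the contact-info search as not yet implemented, the following is only one possible approach to filling table-02: match the links found on a page against known social media domains. The function name, the regular expression, and the example links are all hypothetical.

```r
# Hypothetical helper: pull social media accounts out of a vector of links.
sketch_social_handles <- function( links )
{
  pattern <- "^https?://(www\\.)?(twitter|facebook|linkedin)\\.com/([^?#]+)"
  hits    <- regmatches( links, regexec( pattern, links ) )
  hits    <- hits[ lengths( hits ) > 0 ]     # keep only links that matched
  if( length( hits ) == 0 )
    return( data.frame( type = character( 0 ), handle = character( 0 ),
                        stringsAsFactors = FALSE ) )
  # each hit holds: full match, optional "www.", site name, rest of path
  data.frame( type   = sapply( hits, `[`, 3 ),
              handle = sub( "/+$", "", sapply( hits, `[`, 4 ) ),
              stringsAsFactors = FALSE )
}

links <- c( "https://twitter.com/exampleorg",
            "http://www.facebook.com/ExampleOrg/",
            "https://www.linkedin.com/company/example-org",
            "https://example.org/contact" )

social <- sketch_social_handles( links )
```

Each matched link becomes one row, which fits the one-to-many layout: several accounts can share a single org id.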
A typical workflow might start by processing all raw URLs in a list to generate table-01, which would report active versus dead or misspelled domains.
Table-01 could then serve as the sample frame for subsequent steps.
The user would specify the site depth to probe and the types of nodes to collect. Level 0 is the domain landing page, level 1 is all links on the landing page, level 2 is all new links found on level 1 pages, and so on. Typical node types are p (paragraphs), ul (lists), and th/td (tables).
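The depth levels can be illustrated with a small breadth-first traversal. This is a sketch, not the package's crawler: crawl_levels() is a hypothetical name, and get_links stands for any function that takes one URL and returns the internal links on that page. A toy site map replaces real pages so the example runs offline.

```r
# Hypothetical breadth-first crawl that records the level of each page.
crawl_levels <- function( start, get_links, max.depth = 1 )
{
  visited  <- data.frame( url = start, level = 0, stringsAsFactors = FALSE )
  frontier <- start
  for( depth in seq_len( max.depth ) )
  {
    if( length( frontier ) == 0 ) break
    found <- unique( unlist( lapply( frontier, get_links ) ) )
    new   <- setdiff( found, visited$url )   # only links not seen at a shallower level
    if( length( new ) > 0 )
      visited <- rbind( visited,
                        data.frame( url = new, level = depth,
                                    stringsAsFactors = FALSE ) )
    frontier <- new
  }
  visited
}

# Toy site map standing in for real pages.
site <- list( home    = c( "about", "contact" ),
              about   = c( "home", "staff" ),
              contact = character( 0 ),
              staff   = character( 0 ) )
fake_links <- function( url ) site[[ url ]]

pages <- crawl_levels( "home", fake_links, max.depth = 2 )
```

Here "staff" is reached only through "about", so it is assigned level 2, while the link back to "home" is ignored because that page was already recorded at level 0.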
webscraper depends on the following packages:
library( dplyr ) # data wrangling
library( pander ) # document creation
library( xml2 ) # xml manipulation
library( RCurl ) # for url.exists
library( httr ) # for http_error
library( stringr ) # for str_extract
library( rvest ) # web scraping in R
# update documentation
setwd( "webscraper" )
devtools::document()
# update code
setwd( ".." )
devtools::install( "webscraper" )
(1) Install tools:
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
# if devtools is not working try
# devtools::build_github_devtools()
(2) Create package skeleton:
# has been deprecated - needs update
# setwd()
devtools::create( "pe" )
(3) Document:
Complete the roxygen comments for each function in the R files, then:
setwd( "pe" )
devtools::document()
(4) Install:
setwd( ".." )
devtools::install( "pe" )
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )