An R package to scrape content such as mission statements and social media handles from nonprofit websites:
- input a website URL
- specify node types
- get a tidy dataset with the URL + node content
Install:
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )
library( webscraper )
library( dplyr )    # provides the %>% pipe
library( pander )   # used below to print results as grid tables
url <- "https://www.mainewelfaredirectors.org"
dat <- get_p_node_data( url )
head( dat ) %>% pander( style = "grid" )
+-----+--------------------------------+
| tag | text |
+=====+================================+
| p | General Assistance | Referral |
| | | Advocacy |
+-----+--------------------------------+
| p | Establish and promote |
| | equitable, efficient and |
| | standardized administration of |
| | General Assistance. |
+-----+--------------------------------+
| p | Encourage the professional |
| | development, growth, and |
| | knowledge base of those who |
| | administer General Assistance. |
+-----+--------------------------------+
| p | Advocate for the |
| | municipalities and citizens |
| | that we serve. |
+-----+--------------------------------+
| p | Actively promote and present |
| | our program needs to the |
| | Legislature and citizens by |
| | creating a greater public |
| | awareness of the importance |
| | and the benefits of equitable, |
| | efficient and standardized |
| | General Assistance |
| | administration. |
+-----+--------------------------------+
| p | Click here to download current |
| | MWDA Bylaws |
+-----+--------------------------------+
+--------------------------------------------------+-------------+
| URL | page |
+==================================================+=============+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
+-----------------------------------------------------------+
| xpath |
+===========================================================+
| /html/body/table/tr[2]/td/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[1]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[2]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[3]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[4]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/p[1] |
+-----------------------------------------------------------+
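Because the result is a tidy data frame, standard dplyr and stringr verbs work on it. For example, to keep only paragraphs that look like mission language (a sketch that assumes the tag and text columns shown above; the keywords are only illustrative):

library( dplyr )
library( stringr )

mission <- dat %>%
  filter( tag == "p" ) %>%                                              # paragraph nodes only
  filter( str_detect( tolower( text ), "mission|promote|advocate" ) )   # example keywords

mission$text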
To collect data from multiple sites at once:
load_test_urls()   # loads the sample.urls test data frame
head( sample.urls )
urls <- sample.urls$ORGURL
results.list <- lapply( urls, get_p_node_data )
d <- dplyr::bind_rows( results.list )
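With many URLs, one unreachable site can stop the whole loop. A defensive variant (a sketch; it assumes get_p_node_data() throws an error on unreachable pages):

# return NULL for a failing site instead of stopping the batch
safe_get <- function( url )
{
  tryCatch( get_p_node_data( url ), error = function( e ) NULL )
}

results.list <- lapply( urls, safe_get )
d <- dplyr::bind_rows( results.list )   # NULL elements are dropped automatically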
The package collects data following this process:
1. user provides a URL
2. normalize the URL (see the sketch after this list)
   - save the original URL
   - create a normalized version ("http://some-name.com") >> normalize_url()
   - create the root URL from the normalized version >> create_root_url()
3. check the URL status >> check_url_status()
   - exists & active --> load the website (step 4)
   - exists & not responding --> try the root domain
   - does not exist --> try the root domain
4. visit the page if the URL is active
   - if redirected, capture the redirect
   -----> table 1 captures URL status
5. load the website
   - catalog all internal links on the landing page for the snowball sample
   - search for contact info (not yet implemented)
     - social media sites
     - social media handles
     - email?
   -----> table 2 captures links and handles
6. build the node list for the current page
   - capture data for the specified node types
   - capture node metadata (page position, tag attributes) for context
7. drill down and repeat for one level of links
   - on completion, return the nodes table
   -----> table 3 contains page content atomized by tags
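To make steps 2 and 3 concrete, here is a minimal standalone sketch of URL normalization and a status check. It does not reuse the package internals (normalize_url(), create_root_url(), and check_url_status() may behave differently); the helper names below are illustrative only.

library( httr )

# force a bare domain string into a standard "http://..." form
normalize <- function( url )
{
  url <- trimws( tolower( url ) )
  if( ! grepl( "^https?://", url ) ){ url <- paste0( "http://", url ) }
  return( url )
}

# keep only the scheme and host, dropping any path
root <- function( url )
{
  parsed <- httr::parse_url( url )
  return( paste0( parsed$scheme, "://", parsed$hostname ) )
}

# TRUE if the site responds without an HTTP error
is_active <- function( url )
{
  ! tryCatch( http_error( GET( url, timeout( 10 ) ) ),
              error = function( e ) TRUE )
}

normalize( "Some-Name.com/about.html" )     # "http://some-name.com/about.html"
root( "http://some-name.com/about.html" )   # "http://some-name.com"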
Functions should return the following tables:
table-01
- org name
- org id: unique key
- original_URL: raw string with unvalidated text of web domain
- normalized_URL: cleaned and put in standard http format
- root_URL: normalize first, then parse lowest level
- redirect_URL: capture redirects
- url_version: original, normalized, redirect, or root version that was tested
- domain_status: VALID, EXISTS (but http.status = F), DNE (does not exist)
- active_URL: best domain version to use
table-03 ( get_p_nodes function )
- org name
- org id (unique key)
- date of data capture
- "active" URL root
- subdomain
- xpath
- node type
- attributes (html class, id, etc)
- node text
table-02 (social media) - one-to-many (many accounts for one org)
- org id
- working url
- subdomain
- social media type (twitter, linkedin, facebook)
- ID - handle or account id
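Because table-02 is one-to-many, combining it with table-01 is a standard key join. A sketch with made-up rows (the column names here are illustrative, not the final schema):

library( dplyr )

table01 <- tibble( org.id     = c( 1, 2 ),
                   active_URL = c( "http://org-one.org", "http://org-two.org" ) )

table02 <- tibble( org.id            = c( 1, 1, 2 ),
                   social.media.type = c( "twitter", "facebook", "twitter" ),
                   handle            = c( "@org_one", "org.one.page", "@org_two" ) )

# one row per social media account, with each org's active URL attached
left_join( table02, table01, by = "org.id" )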
A typical workflow might start by processing all raw URLs in a list to generate table-01, which would report active versus dead or misspelled domains.
Table-01 could then serve as the sample frame for subsequent steps.
The user would specify things like the site depth to probe (level 0 is the domain landing page, level 1 is all links on the landing page, level 2 is all new links on level-1 pages, etc.) and the types of nodes to collect (typically things like p = paragraphs, ul = lists, th/td = tables), as in the sketch below.
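Such a call could look like the following. Note that this interface is hypothetical: neither the function name nor the nodes/depth arguments are part of the current package.

# hypothetical interface, shown for illustration only
d <- get_node_data( url   = "http://mainewelfaredirectors.org",
                    nodes = c( "p", "ul", "th", "td" ),   # node types to collect
                    depth = 1 )                           # landing page + pages it links to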
webscraper utilizes the following dependencies:
library( dplyr ) # data wrangling
library( pander ) # document creation
library( xml2 ) # xml manipulation
library( RCurl ) # for url.exists
library( httr ) # for http_error
library( stringr ) # for str_extract
library( rvest ) # web scraping in R
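If any of these are missing, they can be installed in one call:

install.packages( c( "dplyr", "pander", "xml2", "RCurl", "httr", "stringr", "rvest" ) )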
# update documentation
setwd( "webscraper" )
devtools::document()
# update code
setwd( ".." )
devtools::install( "webscraper" )
Includes task lists:
- [ ] Task 01
- [ ] Task 02
- [ ] Task 03
https://help.github.com/en/github/managing-your-work-on-github/about-task-lists
(1) Install tools:
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
# if devtools is not working try
# devtools::build_github_devtools()
(2) Create package skeleton:
# devtools::create() has been deprecated; usethis::create_package() is the current equivalent
library(devtools)
has_devel()   # check that a working development environment is available
# setwd() to the parent directory first
devtools::create( "pe" )
(3) Document:
Complete roxygen comments in R files for functions, then:
setwd( "pe" )
document()
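For reference, a roxygen block might look like the sketch below; the parameter description and tags are placeholders, not the package's actual documentation.

#' Scrape paragraph nodes from a website
#'
#' @param url A character string giving the site to scrape.
#' @return A data frame with one row per node.
#' @export
get_p_node_data <- function( url )
{
  # ... function body ...
}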
(4) Install:
setwd( ".." )
devtools::install( "pe" )
# or install directly from GitHub:
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )