webscraper

R package to scrape content like mission statements and social media handles from nonprofit websites

  • input a website URL
  • specify node types to collect
  • get a tidy dataset with the URL + node content

Usage

Install:

devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )
library( webscraper )

Demo: get_p_node_data()

library( dplyr )     # provides the %>% pipe
library( pander )    # renders the grid tables below

url <- "https://www.mainewelfaredirectors.org"
dat <- get_p_node_data( url )
head( dat ) %>% pander( style = "grid" )
+-----+--------------------------------+
| tag |              text              |
+=====+================================+
|  p  | General Assistance | Referral  |
|     |           | Advocacy           |
+-----+--------------------------------+
|  p  |     Establish and promote      |
|     |    equitable, efficient and    |
|     | standardized administration of |
|     |      General Assistance.       |
+-----+--------------------------------+
|  p  |   Encourage the professional   |
|     |    development, growth, and    |
|     |  knowledge base of those who   |
|     | administer General Assistance. |
+-----+--------------------------------+
|  p  |        Advocate for the        |
|     |  municipalities and citizens   |
|     |         that we serve.         |
+-----+--------------------------------+
|  p  |  Actively promote and present  |
|     |    our program needs to the    |
|     |  Legislature and citizens by   |
|     |   creating a greater public    |
|     |  awareness of the importance   |
|     | and the benefits of equitable, |
|     |   efficient and standardized   |
|     |       General Assistance       |
|     |        administration.         |
+-----+--------------------------------+
|  p  | Click here to download current |
|     |          MWDA Bylaws           |
+-----+--------------------------------+

+--------------------------------------------------+-------------+
|                       URL                        |    page     |
+==================================================+=============+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+
| http://mainewelfaredirectors.org/association.htm | association |
+--------------------------------------------------+-------------+

+-----------------------------------------------------------+
|                           xpath                           |
+===========================================================+
|                /html/body/table/tr[2]/td/p                |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[1]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[2]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[3]/p |
+-----------------------------------------------------------+
| /html/body/table/tr[3]/td[2]/strong/blockquote/ul/li[4]/p |
+-----------------------------------------------------------+
|    /html/body/table/tr[3]/td[2]/strong/blockquote/p[1]    |
+-----------------------------------------------------------+

Sample Org Dataset

To collect data from multiple organizations at a time:

load_test_urls()                  # loads the sample.urls data frame
head( sample.urls )
urls <- sample.urls$ORGURL
results.list <- lapply( urls, get_p_node_data )
d <- dplyr::bind_rows( results.list )

Process Overview

The package collects data following this process:

1. user provides a URL
2. normalize the URL
  - save the original URL
  - create a normalized version ("http://some-name.com")  >> normalize_url()
  - create the root URL from the normalized version  >> creat_root_url()
3. check the URL status >> check_url_status()
  - exists & active           --> load website (step 4)
  - exists & not responding   --> try root domain
  - does not exist            --> try root domain
4. visit the page if the URL is active (see the sketch after this list)
  - if redirected, capture the redirect

----->  table 1 captures URL status

5. load the website
  - catalog all internal links on the landing page for a snowball sample
  - search for contact info (not yet implemented)
    - social media sites
    - social media handles
    - email?

----->  table 2 captures links and handles

6. build the node list on the current page
  - capture data for specified node types
  - capture node metadata (page position, tag attributes) for context
7. drill down and repeat for one level of links
  - on completion, return the nodes table
  
----->  table 3 contains page content atomized by tags
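
A hedged sketch of steps 2-4 for a single URL, using the function names given above and assuming check_url_status() returns TRUE for a live URL (the actual return value may differ):

raw.url  <- "WWW.Some-Name.COM/index.html"
norm.url <- normalize_url( raw.url )       # e.g. "http://some-name.com/index.html"
root.url <- creat_root_url( norm.url )     # e.g. "http://some-name.com"

# fall back to the root domain if the full URL is not responding
if( check_url_status( norm.url ) ){
  active.url <- norm.url
} else {
  active.url <- root.url
}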

Functions should return the following tables:

table-01

  • org name
  • org id: unique key
  • original_URL: raw, unvalidated string as provided
  • normalized_URL: cleaned and put in standard http format
  • root_URL: the normalized URL parsed down to the root domain
  • redirect_URL: capture redirects
  • url_version: original, normalized, redirect, or root version that was tested
  • domain_status: VALID, EXISTS (but http.status = F), DNE (does not exist)
  • active_URL: best domain version to use
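
For illustration, one row of table-01 might look like the following (all values invented; the org id format is hypothetical):

data.frame(
  org.name       = "Maine Welfare Directors Association",
  org.id         = "org-00001",                         # hypothetical key format
  original_URL   = "WWW.MaineWelfareDirectors.ORG",
  normalized_URL = "http://mainewelfaredirectors.org",
  root_URL       = "http://mainewelfaredirectors.org",
  redirect_URL   = NA,
  url_version    = "normalized",
  domain_status  = "VALID",
  active_URL     = "http://mainewelfaredirectors.org"
)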

table-03 ( get_p_node_data function )

  • org name
  • org id (unique key)
  • date of data capture
  • "active" URL root
  • subdomain
  • xpath
  • node type
  • attributes (html class, id, etc)
  • node text

table-02 (social media) - one-to-many (many accounts for one org)

  • org id
  • working url
  • subdomain
  • social media type (twitter, linkedin, facebook)
  • ID - handle or account id
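
Social media capture is not yet implemented (step 5 above), but a minimal sketch of the idea using stringr::str_extract, with a hypothetical extract_twitter_handle() helper:

library( stringr )

# pull the handle out of a twitter profile link found on a page
extract_twitter_handle <- function( link ){
  str_extract( link, "(?<=twitter\\.com/)[A-Za-z0-9_]+" )
}

extract_twitter_handle( "https://twitter.com/SomeNonprofit" )
# [1] "SomeNonprofit"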

A typical workflow might start by processing all raw URLs in a list to generate table-01, which would report active versus dead or misspelled domains.

Table-01 could then serve as the sample frame for subsequent steps.

The user would specify things like site depth to probe (level 0 is the domain landing page, level 1 is all links on the landing page, level 2 is all new links on level 1 pages, etc.), and the types of nodes to collect (typically things like p=paragraphs, ul=lists, th/td=tables).
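
A hedged sketch of that workflow, assuming check_url_status() returns a one-row data frame per URL (its actual return format may differ):

# build table-01 from the raw URL list
raw.urls    <- sample.urls$ORGURL
status.list <- lapply( raw.urls, check_url_status )
table.01    <- dplyr::bind_rows( status.list )

# use the live domains as the sample frame and collect p nodes
active.urls <- table.01$active_URL[ table.01$domain_status == "VALID" ]
table.03    <- dplyr::bind_rows( lapply( active.urls, get_p_node_data ) )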

Dependencies

webscraper relies on the following packages:

library( dplyr )     # data wrangling 
library( pander )    # document creation 
library( xml2 )      # xml manipulation 
library( RCurl )     # for url.exists
library( httr )      # for http_error
library( stringr )   # for str_extract
library( rvest )     # web scraping in R

Project Management for Package Dev

Before Syncing New Code to GitHub

# update documentation
setwd( "webscraper" )
devtools::document()
# update code
setwd( ".." )
devtools::install( "webscraper" )

Includes Task Lists

  • Task 01
  • Task 02
  • Task 03

See https://help.github.com/en/github/managing-your-work-on-github/about-task-lists

Package Build Steps

(1) Install tools:

install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
# if devtools is not working try
# devtools::build_github_devtools() 

(2) Create package skeleton:

# note: devtools::create() has been deprecated;
# usethis::create_package() is the current replacement
library( devtools )
has_devel()          # verify the build toolchain works
# setwd()
devtools::create( "webscraper" )

(3) Document:

Complete roxygen comments in R files for functions, then:

setwd( "webscraper" )
document()

(4) Install:

setwd( ".." )
devtools::install( "webscraper" )    # install from the local source directory
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )    # or from GitHub
