An R package to scrape content such as mission statements and social media handles from nonprofit websites:
- input a website URL
- specify node types
- get a tidy dataset with the URL + node content
The package operates by doing the following:
- User provides an org URL ( create_table_01() )
- Clean and parse the URL (see the sketch after this outline):
  - save the original version ( create_table_01() )
  - create a normalized version, e.g. "http://some-name.com" ( normalize_url() )
  - create the root URL from the normalized version ( creat_root_url() )
  - check the URL status ( check_url_status() ); known issue: RCurl::url.exists( "WWW.BACASMAINE.ORG" ) throws an error
  - possible results:
    - exists & active --> load the website (next step)
    - exists & not responding --> try the root domain
    - does not exist --> try the root domain
  - identify the active host URL
  - if redirected, capture the redirect
  -----> table 1 returned here
- Load the website:
  - catalog all internal links on the landing page for the snowball sample
  - search for contact info (not yet implemented):
    - social media sites
    - social media handles
    - (email?)
  -----> table 3 returned here (not yet implemented)
  - build the node list for the current page
  - capture node data
  - create node metadata
  - drill down and repeat (how many levels?)
  - at completion, return the nodes table (table-02)
  -----> table 2 returned here
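As a rough illustration of the clean-and-parse step, here is a minimal sketch of URL normalization and root extraction. The _sketch helpers below are illustrative assumptions, not the package's actual normalize_url() / creat_root_url() code:

# illustrative sketch only -- not the package's internal code
normalize_url_sketch <- function( url )
{
  url <- tolower( trimws( url ) )             # lowercase and trim whitespace
  if( ! grepl( "^https?://", url ) )          # add a scheme if one is missing
  { url <- paste0( "http://", url ) }
  return( url )
}

root_url_sketch <- function( url )
{
  url <- normalize_url_sketch( url )
  # keep scheme + host only, dropping any path such as /gmfra/gmfraindex.htm
  gsub( "^(https?://[^/]+).*$", "\\1", url )
}

normalize_url_sketch( "WWW.BACASMAINE.ORG" )                # "http://www.bacasmaine.org"
root_url_sketch( "HTTP://GMFD.ORG/GMFRA/GMFRAINDEX.HTM" )   # "http://gmfd.org"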
table-01 (create_table_01 function)
- org name
- org id (unique key)
- original_URL - raw web domain reported by the org
- normalized_URL
- redirected_URL
- root_URL - the root of the original URL, e.g. http://www.pets.com (normalize first, then parse down to the root domain)
- active_URL (replaces tested_URL) - the URL version that works
- url_version - which version of the URL works: original, normalized, redirect, or root
- domain_status (replaces URL.Exists, HTTP.Status, and valid) - VALID, EXISTS (responds but the HTTP status check fails), or DNE (does not exist)
The function tries the URLs in the following order (a sketch of this fallback loop follows the list):
- original url
- normalized url
- redirected url
- root url
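A minimal sketch of that fallback order using httr; the candidate URLs below are placeholder values, and create_table_01() may implement this differently:

# illustrative sketch of the fallback order -- placeholder values
library( httr )

candidate.urls <- c( original   = "HTTP://GMFD.ORG/GMFRA/GMFRAINDEX.HTM",
                     normalized = "http://gmfd.org/gmfra/gmfraindex.htm",
                     redirected = NA,
                     root       = "http://gmfd.org" )

active_url  <- NA
url_version <- NA

for( i in seq_along( candidate.urls ) )
{
  if( is.na( candidate.urls[[ i ]] ) ) next            # skip versions that were never created
  ok <- tryCatch( ! http_error( candidate.urls[[ i ]] ),
                  error = function( e ) FALSE )         # treat connection failures as not working
  if( ok )
  {
    active_url  <- candidate.urls[[ i ]]                # first version that responds wins
    url_version <- names( candidate.urls )[ i ]
    break
  }
}

active_url
url_version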
table-02 ( get_p_nodes() function ); a sketch of this kind of node capture follows the field list
- org name
- org id (unique key)
- date of data capture
- "active" URL root
- subdomain
- xpath
- node type
- attributes (html class, etc)
- node text
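For context, a minimal sketch of how node-level data of roughly this shape can be collected with rvest and xml2; this is an illustration, not the actual get_p_nodes() implementation:

# illustrative sketch of node capture -- not the actual get_p_nodes() code
library( rvest )
library( xml2 )

url   <- "HTTP://GMFD.ORG/GMFRA/GMFRAINDEX.HTM"
page  <- read_html( url )
nodes <- html_nodes( page, "p" )                       # all <p> nodes on the page

d <- data.frame( url        = url,
                 xpath      = xml_path( nodes ),       # xpath to each node
                 node.type  = xml_name( nodes ),       # tag name, e.g. "p"
                 attributes = sapply( xml_attrs( nodes ),
                                      function( a ) paste( names( a ), a, sep = "=", collapse = ";" ) ),
                 node.text  = xml_text( nodes ),
                 stringsAsFactors = FALSE )

head( d )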
table-03 (social media) - one-to-many (an org can have many accounts); a sketch follows the field list
- org id
- working url
- subdomain
- social media type (twitter, linkedin, facebook)
- ID - handle or account id
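Since table-03 is not yet implemented, this is only a rough sketch of how social media links and handles could be pulled from a page's anchor tags; the pattern matching below is an assumption, not package code:

# illustrative sketch of social media capture -- table-03 is not yet implemented
library( rvest )
library( stringr )

page  <- read_html( "HTTP://GMFD.ORG/GMFRA/GMFRAINDEX.HTM" )
links <- html_attr( html_nodes( page, "a" ), "href" )                 # all anchor hrefs
links <- links[ ! is.na( links ) ]

sm <- links[ str_detect( links, "twitter\\.com|facebook\\.com|linkedin\\.com" ) ]

data.frame( url    = sm,
            type   = str_extract( sm, "twitter|facebook|linkedin" ),  # social media type
            handle = str_extract( sm, "(?<=\\.com/)[^/?#]+" ),        # first path segment after .com/
            stringsAsFactors = FALSE )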
Install:

devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )
library( webscraper )

# for package developers: rebuild documentation
setwd( "webscraper" )
devtools::document()

# for package developers: re-install after code changes
setwd( ".." )
devtools::install( "webscraper" )

# example URL
url <- "HTTP://GMFD.ORG/GMFRA/GMFRAINDEX.HTM"
dat <- get_p_node_data( url )
head( as.data.frame( dat ) )
We use the following packages:
library( dplyr ) # data wrangling
library( pander ) # document creation
library( xml2 ) # xml manipulation
library( RCurl ) # for url.exists
library( httr ) # for http_error
library( stringr ) # for str_extract
library( rvest ) # web scraping in R
For a small sample, try:
load_test_urls()
head( sample.urls )
URLs <- sample.urls$ORGURL
create_table_01( URLs[1] )
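To build table-01 for the full sample (assuming create_table_01() returns a one-row data frame per URL), the results can be stacked with dplyr:

# assuming create_table_01() returns one row per URL, stack the results
results  <- lapply( URLs, create_table_01 )
table.01 <- dplyr::bind_rows( results )
head( table.01 )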
Task list (see https://help.github.com/en/github/managing-your-work-on-github/about-task-lists):
- [ ] Task 01
- [ ] Task 02
- [ ] Task 03
(1) Install tools:
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
# if devtools is not working try
# devtools::build_github_devtools()
(2) Create package skeleton:
# note: has been deprecated - needs update
library(devtools)
has_devel()   # check that a working development environment is available
# setwd() to the directory where the package folder should be created
devtools::create( "pe" )
(3) Document:
Complete the roxygen comments for the functions in the R files, then:
setwd( "pe" )
document()
(4) Install:
setwd( ".." )
devtools::install( "pe" )
devtools::install_github( "Nonprofit-Open-Data-Collective/webscraper" )