Step-02-POBsandGaps.Rmd

---
title: "Step 2: Identifying POBs and data gaps"
date: "09/06/2020"
output:
  html_document:
    theme: united
    df_print: paged
    highlight: tango
    smart: false
    toc: yes
    toc_float: yes
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=F, message=F, fig.width=8)
```

# Introduction

In this script we will work with the **NONPROFIT-2014-2019.rds** and **PEOPLE-2014-2019.rds** to identify addresses that are PO Boxes and identifying the main data gaps in the address data. New datasets will be stored as **NONPROFIT-2014-2019v2.rds** and **PEOPLE-2014-2019v2.rds**. 

**STEPS**

(1) Identifying POBs
(2) Exploring address data gaps

**PACKAGES**
```{r}
library( dplyr )
library(tidyverse)
library( tidyr )
library( pander )
library( httr )
library( stringr )

#update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd) 
dir() 
```


# 1. Identifying PO Boxes

We will be looking for POB (Post Office Boxes). Might also find APOs (Army Post Office), FPOs (Fleet Post Office) and DPOs (Diplomatic Post Office).

## 1.1 Developing a regular expression code to identify PO Boxes

```{r, echo= F, eval= FALSE}
#this code will not be run beacuase it was used to test the regular expressions

test <- c("1 BOX3290", 
          "2 BOX 389",
          "3 PO 384902", 
          "4 BOX 83209", 
          "5 PObox 5432", 
          "6 POB 3903", 
          "7 PO   394",
          "8 PO B 290", 
          "9 PO  BOX    3892", 
          "10 PO 3892",
          "11 BOX   338",
          "12 POB. 293",
          "13 POST OFFICE BOX 329023", 
          "14 P. O. Box 962", 
          "15 P.O. BOX2030", 
          "16 PO BOX. 2829", 
          "17 29 PREBLE ROAD NUMBER  201, BOWDOINHAM, ME, 4008", 
          "18 01 EAST FOREST HOME PO 398", 
          "19 030 POST OAK BLVD UNIT 601",
          "20 testtest", 
          "21 PO this is street")

x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", test,ignore.case=TRUE)
test[x]
```


## 1.2 Identifying POBs in NPO dataset

Loading file and creating a POB variable to flag addresses that are pobs.

```{r}
#loading data
npo <- readRDS("Data/2_InputData/NONPROFITS-2014-2019.rds")       

#creating a variable for pob flags
npo$pob <- NA
```

Identifying the pobs using a regular expression code
```{r}
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
pob <- npo$Address[which(x)]
head(pob, 5)
```

Addressess identified as POBs are sometimes very lengthy... 
```{r}
#including a character length var
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)

#arranging it
pobs <- arrange(pobs, desc(pob.len))
head(pobs)

#plotting
x <- table(pobs$pob.len)
barplot(x)
pander(x[3:11])
```
Most NPO POB addresses are clusterd around the `r median(pobs$pob.len, na.rm = TRUE)` character length. 

After taking a closer look, we  notice that lengthy POB addresses have a mix of both POB info and a regular address.

Take a look at POBs with 25 characters
```{r}
head(pobs[pobs$pob.len==25,], 5)
```

A POB flag and address character length var will be added to the actual dataset
```{r}
#char length
npo$add.len <- nchar(npo$Address)

#pob flag
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
npo$pob <- as.numeric(x) #adding a flag for pobs identified
```

How many pobs?
```{r}
x <- as.data.frame(prop.table(table(npo$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
```

We will merge the ADDRESS, CITY, STATE and ZIP variable to create an INPUT ADDRESS var in the NPO dataset for geocoding.
```{r}
#creating an Input Address variable
npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ") 
```

Saving the npo file
```{r}
saveRDS(npo, "Data/2_InputData/NONPROFITS-2014-2019v2.rds")       
```


## 1.3 Identifying POBs in BOARD MEMBERS dataset

Loading file and creating a pob variable to flag addresses that are pobs.
```{r}
#loading data
ppl <- readRDS("Data/2_InputData/PEOPLE-2014-2019.rds")   

#creating a variable for pob flags
ppl$pob <- NA
```

Identifying the pobs using a regular expression code
```{r}
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
pob <- ppl$Address[which(x)]
head(pob, 5)
```

Similar to the NPO addresses, some POBs are mixedwith regular address info:
```{r}
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)
pobs <- arrange(pobs, desc(pob.len))
head(pobs)

#plotting
x <- table(pobs$pob.len)
barplot(x)
pander(x[4:12])
```
Most ppl POB addresses are also clusterd around the `r median(pobs$pob.len, na.rm = TRUE)` character length. 

Adding POB flag and add.len var
```{r}
ppl$add.len <- nchar(ppl$Address)

x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
ppl$pob <- as.numeric(x) #adding a flag for pobs identified
```

How many pobs?
```{r}
x <- as.data.frame(prop.table(table(ppl$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
```

We will merge the ADDRESS, CITY, STATE and ZIP variables to create an INPUT ADDRESS var in the ppl dataset for geocoding.
```{r}
#creating an Input Address variable
ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
```

Saving rds
```{r}
saveRDS(ppl, "Data/2_InputData/PEOPLE-2014-2019v2.rds")
```


# 2. Exploring Address data gaps

In this section we will look into the Addresses and try to identify any problem with the data. 
We will also create an address_input variable which will be used for geocoding.

## 2.1 NonProfit dataset

The variables that contain address informaiton are:

* "Address"
* "City"
* "State"
* "Zip"               

Subsetting a smaller file to explore addresses, removing pobs
```{r}
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- npo[,nmz]

#dropping all pobs
x <- add$pob == 0
add <- add[x,]
```

* NPO **ADDRESSES** have `r sum(is.na(add$Address))` NAs and `r sum(add$Address %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA"))` blanks

Using the Address character length variable (add.len) to see the distribution and determine when an address is short enough to assume it is unintelligible.
```{r}
x <- table(add$add.len)
barplot(x)
x[13:21]
```
NPO Addresses are clustered around `r median(add$add.len, na.rm = TRUE)` characters

Addresses with length 1-2 are unintelligible and will probably have to be geocoded using zip codes or city centers
```{r}
x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)
```

Looking at those with length 3-4, they still look unintelligible
```{r}
x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)
```

Addresses with length 5-6 also seem to be numbers with not much sense
```{r}
x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)
```

However, looking at those with length 7-8, we start to recognize proper addressess.
```{r}
x <- which(add$add.len %in% c(7,8))
head(add$Address[x],15)
```

Looking for NAs within the other relevant columns

* **CITY** variable has `r sum(is.na(add$City))` NAs and `r sum(add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA"))` blanks

City var has some unintelligible values 
```{r}
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
```

And some values that seem to be too short or too long 
```{r}
x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))

x <- arrange(x, len)
head(x$City,20)
tail(x$City,15)
```

* **STATE** variable has `r sum(is.na(add$State))` NAs and `r sum(add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA"))`

State variable seems more standard 
```{r}
sort(unique(add$State))
```

* **ZIP** variable has `r sum(is.na(add$Zip))` NAs and `r sum(add$Zip %in% (add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA")))`


## 2.2 Board Member dataset

Subsetting a smaller file, and removing pobs
```{r}
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- ppl[,nmz]

#dropping all pobs
x <- add$pob == 0
add <- add[x,]
```

* PPL **ADDRESSES** have `r sum(is.na(add$Address))` NAs and `r sum(add$Address %in% (add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks

Using the Address character length variable (add.len) to see the distribution
```{r}
x <- table(add$add.len)
barplot(x)
x[13:22]
```
PPL Addresses are clustered around `r median(add$add.len, na.rm = TRUE)` characters.

Similar to NPO data, shorter addresses are unintelligeble. 
```{r}
x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)

x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)

x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)
```

And, again, Addresses with length 7-8 are recognizable as proper addressess.
```{r}
x <- which(add$add.len %in% c(7,8))
head(add$Address[x],20)
```

Looking for NAs within the other relevant columns

* **CITY** variable has `r sum(is.na(add$City))` NAs and `r sum(add$City %in% (add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks

City var has some unintelligible values 
```{r}
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
```

And some values that seem to be too short or too long 
```{r}
x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))

x <- arrange(x, len)
head(x$City,20)
tail(x$City,15)
```

* **STATE** variable has `r sum(is.na(add$State))` NAs and `r sum(add$State %in% (add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks

State variable seems more standard 
```{r}
sort(unique(add$State))
```

* **ZIP** variable has `r sum(is.na(add$Zip))` NAs and `r sum(add$Zip %in% (add$City %in% c("", " ", "  ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks