-
Notifications
You must be signed in to change notification settings - Fork 0
/
Step-02-POBsandGaps.Rmd
376 lines (293 loc) · 10.2 KB
/
Step-02-POBsandGaps.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
---
title: "Step 2: Identifying POBs and data gaps"
date: "09/06/2020"
output:
html_document:
theme: united
df_print: paged
highlight: tango
smart: false
toc: yes
toc_float: yes
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=F, message=F, fig.width=8)
```
# Introduction
In this script we will work with the **NONPROFIT-2014-2019.rds** and **PEOPLE-2014-2019.rds** to identify addresses that are PO Boxes and identifying the main data gaps in the address data. New datasets will be stored as **NONPROFIT-2014-2019v2.rds** and **PEOPLE-2014-2019v2.rds**.
**STEPS**
(1) Identifying POBs
(2) Exploring address data gaps
**PACKAGES**
```{r}
library( dplyr )
library(tidyverse)
library( tidyr )
library( pander )
library( httr )
library( stringr )
#update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd)
dir()
```
# 1. Identifying PO Boxes
We will be looking for POB (Post Office Boxes). Might also find APOs (Army Post Office), FPOs (Fleet Post Office) and DPOs (Diplomatic Post Office).
## 1.1 Developing a regular expression code to identify PO Boxes
```{r, echo= F, eval= FALSE}
#this code will not be run beacuase it was used to test the regular expressions
test <- c("1 BOX3290",
"2 BOX 389",
"3 PO 384902",
"4 BOX 83209",
"5 PObox 5432",
"6 POB 3903",
"7 PO 394",
"8 PO B 290",
"9 PO BOX 3892",
"10 PO 3892",
"11 BOX 338",
"12 POB. 293",
"13 POST OFFICE BOX 329023",
"14 P. O. Box 962",
"15 P.O. BOX2030",
"16 PO BOX. 2829",
"17 29 PREBLE ROAD NUMBER 201, BOWDOINHAM, ME, 4008",
"18 01 EAST FOREST HOME PO 398",
"19 030 POST OAK BLVD UNIT 601",
"20 testtest",
"21 PO this is street")
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", test,ignore.case=TRUE)
test[x]
```
## 1.2 Identifying POBs in NPO dataset
Loading file and creating a POB variable to flag addresses that are pobs.
```{r}
#loading data
npo <- readRDS("Data/2_InputData/NONPROFITS-2014-2019.rds")
#creating a variable for pob flags
npo$pob <- NA
```
Identifying the pobs using a regular expression code
```{r}
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
pob <- npo$Address[which(x)]
head(pob, 5)
```
Addressess identified as POBs are sometimes very lengthy...
```{r}
#including a character length var
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)
#arranging it
pobs <- arrange(pobs, desc(pob.len))
head(pobs)
#plotting
x <- table(pobs$pob.len)
barplot(x)
pander(x[3:11])
```
Most NPO POB addresses are clusterd around the `r median(pobs$pob.len, na.rm = TRUE)` character length.
After taking a closer look, we notice that lengthy POB addresses have a mix of both POB info and a regular address.
Take a look at POBs with 25 characters
```{r}
head(pobs[pobs$pob.len==25,], 5)
```
A POB flag and address character length var will be added to the actual dataset
```{r}
#char length
npo$add.len <- nchar(npo$Address)
#pob flag
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
npo$pob <- as.numeric(x) #adding a flag for pobs identified
```
How many pobs?
```{r}
x <- as.data.frame(prop.table(table(npo$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
```
We will merge the ADDRESS, CITY, STATE and ZIP variable to create an INPUT ADDRESS var in the NPO dataset for geocoding.
```{r}
#creating an Input Address variable
npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ")
```
Saving the npo file
```{r}
saveRDS(npo, "Data/2_InputData/NONPROFITS-2014-2019v2.rds")
```
## 1.3 Identifying POBs in BOARD MEMBERS dataset
Loading file and creating a pob variable to flag addresses that are pobs.
```{r}
#loading data
ppl <- readRDS("Data/2_InputData/PEOPLE-2014-2019.rds")
#creating a variable for pob flags
ppl$pob <- NA
```
Identifying the pobs using a regular expression code
```{r}
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
pob <- ppl$Address[which(x)]
head(pob, 5)
```
Similar to the NPO addresses, some POBs are mixedwith regular address info:
```{r}
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)
pobs <- arrange(pobs, desc(pob.len))
head(pobs)
#plotting
x <- table(pobs$pob.len)
barplot(x)
pander(x[4:12])
```
Most ppl POB addresses are also clusterd around the `r median(pobs$pob.len, na.rm = TRUE)` character length.
Adding POB flag and add.len var
```{r}
ppl$add.len <- nchar(ppl$Address)
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
ppl$pob <- as.numeric(x) #adding a flag for pobs identified
```
How many pobs?
```{r}
x <- as.data.frame(prop.table(table(ppl$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
```
We will merge the ADDRESS, CITY, STATE and ZIP variables to create an INPUT ADDRESS var in the ppl dataset for geocoding.
```{r}
#creating an Input Address variable
ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
```
Saving rds
```{r}
saveRDS(ppl, "Data/2_InputData/PEOPLE-2014-2019v2.rds")
```
# 2. Exploring Address data gaps
In this section we will look into the Addresses and try to identify any problem with the data.
We will also create an address_input variable which will be used for geocoding.
## 2.1 NonProfit dataset
The variables that contain address informaiton are:
* "Address"
* "City"
* "State"
* "Zip"
Subsetting a smaller file to explore addresses, removing pobs
```{r}
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- npo[,nmz]
#dropping all pobs
x <- add$pob == 0
add <- add[x,]
```
* NPO **ADDRESSES** have `r sum(is.na(add$Address))` NAs and `r sum(add$Address %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA"))` blanks
Using the Address character length variable (add.len) to see the distribution and determine when an address is short enough to assume it is unintelligible.
```{r}
x <- table(add$add.len)
barplot(x)
x[13:21]
```
NPO Addresses are clustered around `r median(add$add.len, na.rm = TRUE)` characters
Addresses with length 1-2 are unintelligible and will probably have to be geocoded using zip codes or city centers
```{r}
x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)
```
Looking at those with length 3-4, they still look unintelligible
```{r}
x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)
```
Addresses with length 5-6 also seem to be numbers with not much sense
```{r}
x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)
```
However, looking at those with length 7-8, we start to recognize proper addressess.
```{r}
x <- which(add$add.len %in% c(7,8))
head(add$Address[x],15)
```
Looking for NAs within the other relevant columns
* **CITY** variable has `r sum(is.na(add$City))` NAs and `r sum(add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA"))` blanks
City var has some unintelligible values
```{r}
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
```
And some values that seem to be too short or too long
```{r}
x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))
x <- arrange(x, len)
head(x$City,20)
tail(x$City,15)
```
* **STATE** variable has `r sum(is.na(add$State))` NAs and `r sum(add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA"))`
State variable seems more standard
```{r}
sort(unique(add$State))
```
* **ZIP** variable has `r sum(is.na(add$Zip))` NAs and `r sum(add$Zip %in% (add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA")))`
## 2.2 Board Member dataset
Subsetting a smaller file, and removing pobs
```{r}
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- ppl[,nmz]
#dropping all pobs
x <- add$pob == 0
add <- add[x,]
```
* PPL **ADDRESSES** have `r sum(is.na(add$Address))` NAs and `r sum(add$Address %in% (add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks
Using the Address character length variable (add.len) to see the distribution
```{r}
x <- table(add$add.len)
barplot(x)
x[13:22]
```
PPL Addresses are clustered around `r median(add$add.len, na.rm = TRUE)` characters.
Similar to NPO data, shorter addresses are unintelligeble.
```{r}
x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)
x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)
x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)
```
And, again, Addresses with length 7-8 are recognizable as proper addressess.
```{r}
x <- which(add$add.len %in% c(7,8))
head(add$Address[x],20)
```
Looking for NAs within the other relevant columns
* **CITY** variable has `r sum(is.na(add$City))` NAs and `r sum(add$City %in% (add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks
City var has some unintelligible values
```{r}
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
```
And some values that seem to be too short or too long
```{r}
x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))
x <- arrange(x, len)
head(x$City,20)
tail(x$City,15)
```
* **STATE** variable has `r sum(is.na(add$State))` NAs and `r sum(add$State %in% (add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks
State variable seems more standard
```{r}
sort(unique(add$State))
```
* **ZIP** variable has `r sum(is.na(add$Zip))` NAs and `r sum(add$Zip %in% (add$City %in% c("", " ", " ", "-", ".", "-", "--", "------", "-----", "NA")))` blanks