[
{
"objectID": "resources.html",
"href": "resources.html",
"title": "R@URBAN",
"section": "",
"text": "Free Books\n\nIntro\n\nR for Data Science by Garrett Grolemund and Hadley Wickham\n\n\n\nData Viz\n\nggplot2: Elegant Graphics for Data Analysis by Hadley Wickham\nData Visualization - A practical introduction by Kieran Healy\n\n\n\n*down\n\nR Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund\nblogdown: Creating Websites with R Markdown by Yihui Xie, Amber Thomas, and Alison Presmanes Hill\nbookdown: Authoring Books and Technical Documents with R Markdown by Yihui Xie\n\n\n\nStatistics\n\nLearning Statistics with R by Danielle Navarro\nIntroduction to Econometrics with R by Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer\nAn Introduction to Bayesian Thinking by Merlise Clyde et. al.\nStatistical Inference via Data Science by Chester Ismay and Albert Y. Kim\n\n\n\nMachine Learning\n\nHands-On Machine Learning with R by Bradley Boehmke & Brandon Greenwell\nFeature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson\n\n\n\nMapping and Geospatial Analysis\n\nGeocomputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow\n\n\n\nText Analysis\n\nText Mining with R A Tidy Approach by Julia Silge and David Robinson\n\n\n\nProgramming\n\nAdvanced R by Hadley Wickham\nR Packages by Hadley Wickham\nMaster Spark with R by Javier Luraschi, Kevin Kuo, and Edgar Ruiz\nFunctional programming and unit testing for data munging with R by Bruno Rodrigues\n\n\n\n\nWebsites\n\nRStudio Essentials\nRStudio Education\nR Cheat Sheets\nAndrew Heiss’ free Data Viz Course"
},
{
"objectID": "getting-data.html#introduction",
"href": "getting-data.html#introduction",
"title": "R@URBAN",
"section": "Introduction",
"text": "Introduction\nThis guide outlines some useful workflows for pulling data sets commonly used by the Urban Institute."
},
{
"objectID": "getting-data.html#librarytidycensus",
"href": "getting-data.html#librarytidycensus",
"title": "R@URBAN",
"section": "library(tidycensus)",
"text": "library(tidycensus)\nlibrary(tidycensus) by Kyle Walker (complete intro here) is the best tool for accessing some Census data sets in R from the Census Bureau API. The package returns tidy data frames and can easily pull shapefiles by adding geometry = TRUE.\nYou will need to apply for a Census API key and add it to your R session. Don’t add your API key to your script and don’t add it to a GitHub repository!\nHere is a simple example for one state with shapefiles:\n\nlibrary(tidyverse)\nlibrary(purrr)\nlibrary(tidycensus)\n\n# pull median household income and shapefiles for Census tracts in Alabama\nget_acs(geography = \"tract\", \n variables = \"B19013_001\", \n state = \"01\",\n year = 2015,\n geometry = TRUE,\n progress = FALSE)\n\nSimple feature collection with 1181 features and 5 fields (with 1 geometry empty)\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: -88.47323 ymin: 30.22333 xmax: -84.88908 ymax: 35.00803\nGeodetic CRS: NAD83\nFirst 10 features:\n GEOID NAME variable\n1 01003010500 Census Tract 105, Baldwin County, Alabama B19013_001\n2 01003011501 Census Tract 115.01, Baldwin County, Alabama B19013_001\n3 01009050500 Census Tract 505, Blount County, Alabama B19013_001\n4 01015981901 Census Tract 9819.01, Calhoun County, Alabama B19013_001\n5 01025957700 Census Tract 9577, Clarke County, Alabama B19013_001\n6 01025958002 Census Tract 9580.02, Clarke County, Alabama B19013_001\n7 01031011000 Census Tract 110, Coffee County, Alabama B19013_001\n8 01033020500 Census Tract 205, Colbert County, Alabama B19013_001\n9 01037961200 Census Tract 9612, Coosa County, Alabama B19013_001\n10 01039961700 Census Tract 9617, Covington County, Alabama B19013_001\n estimate moe geometry\n1 41944 8100 MULTIPOLYGON (((-87.80249 3...\n2 41417 14204 MULTIPOLYGON (((-87.71719 3...\n3 40055 8054 MULTIPOLYGON (((-86.75735 3...\n4 NA NA MULTIPOLYGON (((-86.01323 3...\n5 32708 4806 MULTIPOLYGON (((-88.18049 3...\n6 29048 14759 MULTIPOLYGON (((-87.98623 
3...\n7 44732 7640 MULTIPOLYGON (((-85.92018 3...\n8 49052 6543 MULTIPOLYGON (((-87.76733 3...\n9 31957 9954 MULTIPOLYGON (((-86.46069 3...\n10 32697 6021 MULTIPOLYGON (((-86.6998 31...\n\n\nSmaller geographies like Census tracts can only be pulled state-by-state. This example demonstrates how to iterate across FIPS codes to pull Census tracts for multiple states. The process is as follows:\n\nPick the variables of interest\nCreate a vector of state FIPS codes for the states of interest\nCreate a custom function that works on a single state FIPS code\nIterate the function along the vector of state FIPS codes with map_df() from library(purrr)\n\nHere is an example that pulls median household income at the Census tract level for multiple states:\n\n# variables of interest\nvars <- c(\n \"B19013_001\" # median household income estimate\n)\n\n# states of interest: alabama, alaska, arizona\nstate_fips <- c(\"01\", \"02\", \"04\")\n \n# create a custom function that works for one state\nget_income <- function(state_fips) {\n \n income_data <- get_acs(geography = \"tract\", \n variables = vars, \n state = state_fips,\n year = 2015)\n \n return(income_data)\n \n}\n\n# iterate the function\nmap_df(.x = state_fips, # iterate along the vector of state fips codes\n .f = get_income) # apply get_income() to each fips_code \n\n# A tibble: 2,874 × 5\n GEOID NAME variable estimate moe\n <chr> <chr> <chr> <dbl> <dbl>\n 1 01001020100 Census Tract 201, Autauga County, Alabama B19013_… 61838 11900\n 2 01001020200 Census Tract 202, Autauga County, Alabama B19013_… 32303 13538\n 3 01001020300 Census Tract 203, Autauga County, Alabama B19013_… 44922 5629\n 4 01001020400 Census Tract 204, Autauga County, Alabama B19013_… 54329 7003\n 5 01001020500 Census Tract 205, Autauga County, Alabama B19013_… 51965 6935\n 6 01001020600 Census Tract 206, Autauga County, Alabama B19013_… 63092 9585\n 7 01001020700 Census Tract 207, Autauga County, Alabama B19013_… 34821 7867\n 8 01001020801 Census Tract 
208.01, Autauga County, Ala… B19013_… 73728 2447\n 9 01001020802 Census Tract 208.02, Autauga County, Ala… B19013_… 60063 8602\n10 01001020900 Census Tract 209, Autauga County, Alabama B19013_… 41287 7857\n# … with 2,864 more rows\n\n\nlibrary(tidycensus) works well with library(tidyverse) and enables access to geospatial data, but it is limited to only some Census Bureau data sets. The next package has less functionality but allows for accessing any data available on the Census API."
},
{
"objectID": "getting-data.html#librarycensusapi",
"href": "getting-data.html#librarycensusapi",
"title": "R@URBAN",
"section": "library(censusapi)",
"text": "library(censusapi)\nlibrary(censusapi) by Hannah Recht (complete intro here) can access any published table that is accessible through the Census Bureau API. A full listing is available here.\nYou will need to apply for a Census API key and add it to your R session. Don’t add your API key to your script and don’t add it to a GitHub repository!\nHere is a simple example that pulls median household income and its margin of error for Census tracts in Alabama:\n\nlibrary(tidyverse)\nlibrary(purrr)\nlibrary(censusapi)\nvars <- c(\n \"B19013_001E\", # median household income estimate\n \"B19013_001M\" # median household income margin of error\n)\n\ngetCensus(name = \"acs/acs5\",\n vars = vars, \n region = \"tract:*\",\n regionin = \"state:01\",\n vintage = 2015) %>%\n as_tibble()\n\n# A tibble: 1,181 × 5\n state county tract B19013_001E B19013_001M\n <chr> <chr> <chr> <dbl> <dbl>\n 1 01 103 005109 29644 4098\n 2 01 103 005106 35864 3443\n 3 01 103 005107 66739 5468\n 4 01 103 005108 64632 9804\n 5 01 103 005701 46306 7926\n 6 01 103 005702 47769 12939\n 7 01 105 686800 30662 7299\n 8 01 009 050102 43325 9484\n 9 01 009 050300 37548 9655\n10 01 009 050700 46452 5167\n# … with 1,171 more rows\n\n\nSmaller geographies like Census tracts can only be pulled state-by-state. This example demonstrates how to iterate across FIPS codes to pull Census tracts for multiple states. 
The process is as follows:\n\nPick the variables of interest\nCreate a vector of state FIPS codes for the states of interest\nCreate a custom function that works on a single state FIPS code\nIterate the function along the vector of state FIPS codes with map_df() from library(purrr)\n\nHere is an example that pulls median household income at the Census tract level for multiple states:\n\n# variables of interest\nvars <- c(\n \"B19013_001E\", # median household income estimate\n \"B19013_001M\" # median household income margin of error\n)\n\n# states of interest: alabama, alaska, arizona\nstate_fips <- c(\"01\", \"02\", \"04\")\n \n# create a custom function that works for one state\nget_income <- function(state_fips) {\n \n income_data <- getCensus(name = \"acs/acs5\", \n vars = vars, \n region = \"tract:*\",\n regionin = paste0(\"state:\", state_fips),\n vintage = 2015)\n \n return(income_data)\n \n}\n\n# iterate the function\nmap_df(.x = state_fips, # iterate along the vector of state fips codes\n .f = get_income) %>% # apply get_income() to each fips_code \n as_tibble() \n\n# A tibble: 2,874 × 5\n state county tract B19013_001E B19013_001M\n <chr> <chr> <chr> <dbl> <dbl>\n 1 01 103 005109 29644 4098\n 2 01 103 005106 35864 3443\n 3 01 103 005107 66739 5468\n 4 01 103 005108 64632 9804\n 5 01 103 005701 46306 7926\n 6 01 103 005702 47769 12939\n 7 01 105 686800 30662 7299\n 8 01 009 050102 43325 9484\n 9 01 009 050300 37548 9655\n10 01 009 050700 46452 5167\n# … with 2,864 more rows"
},
{
"objectID": "optimization.html#learn-lapplypurrrmap",
"href": "optimization.html#learn-lapplypurrrmap",
"title": "R@URBAN",
"section": "Learn lapply/purrr::map",
"text": "Learn lapply/purrr::map\nLearning the lapply (and variants) function from Base R or the map (and variants) function from the purrr package is the first step in learning to run R code in parallel. Once you understand how lapply and map work, running your code in parallel will be simple.\nSay you have a vector of numbers and want to find the square root of each one (ignore for now that sqrt is vectorized, which will be covered later). You could write a for loop and iterate over each element of the vector:\n\nx <- c(1, 4, 9, 16)\n\nout <- vector(\"list\", length(x))\nfor (i in seq_along(x)) {\n out[[i]] <- sqrt(x[[i]])\n}\nunlist(out)\n\n[1] 1 2 3 4\n\n\nThe lapply function essentially handles the overhead of constructing a for loop for you. The syntax is:\n\nlapply(X, FUN, ...)\n\nlapply will then take each element of X and apply the FUNction to it. Our simple example then becomes:\n\nx <- c(1, 4, 9, 16)\nout <- lapply(x, sqrt)\nunlist(out)\n\n[1] 1 2 3 4\n\n\nThose working within the tidyverse may use map from the purrr package equivalently:\n\nlibrary(purrr)\nx <- c(1, 4, 9, 16)\nout <- map(x, sqrt)\nunlist(out)\n\n[1] 1 2 3 4"
},
{
"objectID": "optimization.html#motivating-example",
"href": "optimization.html#motivating-example",
"title": "R@URBAN",
"section": "Motivating Example",
"text": "Motivating Example\nOnce you are comfortable with lapply and/or map, running the same code in parallel takes just an additional line of code.\nFor lapply users, the future.apply package contains an equivalent future_lapply function. Just be sure to call plan(multiprocess) beforehand, which will handle the back-end orchestration needed to run in parallel.\n\n# install.packages(\"future.apply\")\nlibrary(future.apply)\nplan(multiprocess)\nout <- future_lapply(x, sqrt)\nunlist(out)\n\n[1] 1 2 3 4\n\n\nFor purrr users, the furrr (i.e., future purrr) package includes an equivalent future_map function:\n\n# install.packages(\"furrr\")\nlibrary(furrr)\nplan(multiprocess)\ny <- future_map(x, sqrt)\nunlist(y)\n\n[1] 1 2 3 4\n\n\nHow much faster did this simple example run in parallel?\n\nlibrary(future.apply)\nplan(multiprocess)\n\nx <- c(1, 4, 9, 16)\n\nmicrobenchmark::microbenchmark(\n sequential = lapply(x, sqrt),\n parallel = future_lapply(x, sqrt),\n unit = \"s\"\n)\n\nUnit: seconds\n expr min lq mean median uq\n sequential 0.000001626 0.000001876 0.00000314277 0.000002271 0.0000036255\n parallel 0.022703376 0.023005355 0.02763271818 0.023331897 0.0255120635\n max neval\n 0.000031875 100\n 0.338857459 100\n\n\nParallelization was actually slower. In this case, the overhead of setting the code to run in parallel far outweighed any performance gain. In general, parallelization works well on long-running & compute intensive jobs."
},
{
"objectID": "optimization.html#a-somewhat-more-complex-example",
"href": "optimization.html#a-somewhat-more-complex-example",
"title": "R@URBAN",
"section": "A (somewhat) More Complex Example",
"text": "A (somewhat) More Complex Example\nIn this example we’ll use the diamonds dataset from ggplot2 and perform a kmeans cluster. We’ll use lapply to iterate the number of clusters from 2 to 5:\n\ndf <- ggplot2::diamonds\ndf <- dplyr::select(df, -c(cut, color, clarity))\n\ncenters = 2:5\n\nsystem.time(\n lapply(centers, \n function(x) kmeans(df, centers = x, nstart = 500)\n )\n )\n\n user system elapsed \n 27.413 0.291 27.997 \n\n\nA now running the same code in parallel:\n\nlibrary(future.apply)\nplan(multiprocess)\n\nsystem.time(\n future_lapply(centers, \n function(x) kmeans(df, centers = x, nstart = 500)\n )\n )\n\n user system elapsed \n 0.634 0.136 13.164 \n\n\nWhile we didn’t achieve perfect scaling, we still get a nice bump in execution time."
},
{
"objectID": "optimization.html#additional-packages",
"href": "optimization.html#additional-packages",
"title": "R@URBAN",
"section": "Additional Packages",
"text": "Additional Packages\nFor the sake of ease and brevity, this guide focused on the futures framework for parallelization. However, you should be aware that there are a number of other ways to parallelize your code.\n\nThe parallel Package\nThe parallel package is included in your base R installation. It includes analogues of the various apply functions:\n\nparLapply\nmclapply - not available on Windows\n\nThese functions generally require more setup, especially on Windows machines.\n\n\nThe doParallel Package\nThe doParallel package builds off of parallel and is useful for code that uses for loops instead of lapply. Like the parallel package, it generally requires more setup, especially on Windows machines.\n\n\nMachine Learning - caret\nFor those running machine learning models, the caret package can easily leverage doParallel to speed up the execution of multiple models. Lifting the example from the package documentation:\n\nlibrary(doParallel)\ncl <- makePSOCKcluster(5) # number of cores to use\nregisterDoParallel(cl)\n\n## All subsequent models are then run in parallel\nmodel <- train(y ~ ., data = training, method = \"rf\")\n\n## When you are done:\nstopCluster(cl)\n\nBe sure to check out the full documentation for more detail."
},
{
"objectID": "optimization.html#object-size",
"href": "optimization.html#object-size",
"title": "R@URBAN",
"section": "Object Size",
"text": "Object Size\nThe type of your data can have a big impact on the size of your data frame when you are dealing with larger files. There are four main types of atomic vectors in R:\n\nlogical\ninteger\ndouble (also called numeric)\ncharacter\n\nEach of these data types occupies a different amount of space in memory - logical and integer vectors use 4 bytes per element, while a double will occupy 8 bytes. R uses a global string pool, so character vectors are hard to estimate, but will generally take up more space for element.\nConsider the following example:\n\nx <- 1:100\npryr::object_size(x)\n\n680 B\n\npryr::object_size(as.double(x))\n\n680 B\n\npryr::object_size(as.character(x))\n\n1.32 kB\n\n\nAn incorrect data type can easily cost you a lot of space in memory, especially at scale. This often happens when reading data from a text or csv file - data may have a format such as c(1.0, 2.0, 3.0) and will be read in as a numeric column, when integer is more appropriate and compact.\nYou may also be familiar with factor variables within R. Essentially a factor will represent your data as integers, and map them back to their character representation. This can save memory when you have a compact and unique level of factors:\n\nx <- sample(letters, 10000, replace = TRUE)\npryr::object_size(as.character(x))\n\n81.50 kB\n\npryr::object_size(as.factor(x))\n\n42.10 kB\n\n\nHowever if each element is unique, or if there is not a lot of overlap among elements, than the overhead will make a factor larger than its character representation:\n\npryr::object_size(as.factor(letters))\n\n2.22 kB\n\npryr::object_size(as.character(letters))\n\n1.71 kB"
},
{
"objectID": "optimization.html#cloud-computing",
"href": "optimization.html#cloud-computing",
"title": "R@URBAN",
"section": "Cloud Computing",
"text": "Cloud Computing\nSometimes, you will have data that are simply too large to ever fit on your local desktop machine. If that is the case, then the Elastic Cloud Computing Environment from the Office of Technology and Data Science can provide you with easy access to powerful analytic tools for computationally intensive project.\nThe Elastic Cloud Computing Environment allows researchers to quickly spin-up an Amazon Web Services (AWS) Elastic Cloud Compute (EC2) instance. These instances offer increased memory to read in large datasets, along with additional CPUs to provide the ability to process data in parallel at an impressive scale.\n\n\n\nInstance\nCPU\nMemory (GB)\n\n\n\n\nDesktop\n8\n16\n\n\nc5.4xlarge\n16\n32\n\n\nc5.9xlarge\n36\n72\n\n\nc5.18xlarge\n72\n144\n\n\nx1e.8xlarge\n32\n976\n\n\nx1e.16xlarge\n64\n1952\n\n\n\nFeel free to contact Kyle Ueyama (kueyama@urban.org) if this would be useful for your project."
},
{
"objectID": "optimization.html#for-loops-and-vector-allocation",
"href": "optimization.html#for-loops-and-vector-allocation",
"title": "R@URBAN",
"section": "For Loops and Vector Allocation",
"text": "For Loops and Vector Allocation\nA refrain you will often hear is that for loops in R are slow and need to be avoided at all costs. This is not true! Rather, an improperly constructed loop in R can bring the execution of your program to a near standstill.\nA common for loop structure may look something like:\n\nx <- 1:100\nout <- c()\nfor (i in x) {\n out <- c(out, sqrt(x))\n }\n\nThe bottleneck in this loop is with the allocation of the vector out. Every time we iterate over an item in x and append it to out, R makes a copy of all the items already in out. As the size of the loop grows, your code will take longer and longer to run.\nA better practice is to pre-allocate out to be the correct length, and then insert the results as the loop runs.\n\nx <- 1:100\nout <- rep(NA, length(x))\nfor (i in seq_along(x)) {\n out[i] <- sqrt(x[i])\n}\n\nA quick benchmark shows how much more efficient a loop with a pre-allocated results vector is:\n\nbad_loop <- function(x) {\n out <- c()\n for (i in x) {\n out <- c(out, sqrt(x))\n }\n}\n\ngood_loop <- function(x) {\n out <- rep(NA, length(x))\n for (i in seq_along(x)) {\n out[i] <- sqrt(x[i])\n }\n}\n\nx <- 1:100\nmicrobenchmark::microbenchmark(\n bad_loop(x),\n good_loop(x)\n)\n\nUnit: microseconds\n expr min lq mean median uq max neval\n bad_loop(x) 664.751 695.438 1385.65403 748.501 1567.1465 8629.917 100\n good_loop(x) 9.667 10.063 39.98773 10.918 13.8545 2660.168 100\n\n\nAnd note how performance of the “bad” loop degrades as the loop size grows.\n\ny <- 1:250\n\nmicrobenchmark::microbenchmark(\n bad_loop(y),\n good_loop(y)\n)\n\nUnit: microseconds\n expr min lq mean median uq max\n bad_loop(y) 11267.625 11322.79 12504.00940 11361.8965 13006.9385 56963.375\n good_loop(y) 22.793 23.23 30.39238 32.4385 35.7505 57.251\n neval\n 100\n 100"
},
{
"objectID": "optimization.html#vectorized-functions",
"href": "optimization.html#vectorized-functions",
"title": "R@URBAN",
"section": "Vectorized Functions",
"text": "Vectorized Functions\nMany functions in R are vectorized, meaning they can accept an entire vector (and not just a single value) as input. The sqrt function from the prior examples is one:\n\nx <- c(1, 4, 9, 16)\nsqrt(x)\n\n[1] 1 2 3 4\n\n\nThis removes the need to use lapply or a for loop. Vectorized functions in R are generally written in a compiled language like C, C++, or FORTRAN, which makes their implementation faster.\n\nx <- 1:100\nmicrobenchmark::microbenchmark(\n lapply(x, sqrt),\n sqrt(x)\n)\n\nUnit: nanoseconds\n expr min lq mean median uq max neval\n lapply(x, sqrt) 19292 20209 21158.94 20709 21542.0 33750 100\n sqrt(x) 375 417 741.11 501 917.5 4542 100"
},
{
"objectID": "index.html#r-users-group",
"href": "index.html#r-users-group",
"title": "R@URBAN",
"section": "R Users Group",
"text": "R Users Group\nThis website contains resources for using R at the Urban Institute for analysis, visualization, mapping, and more. Click on the links above to get started learning about R!\nThe Urban Institute R Users Group is committed to exposing researchers to the joy and power of R; developing beginner, intermediate, and advanced R skills; encouraging and supporting novel applications of R to public policy research; and building a diverse and mutually supportive community of R Users.\n\n\ngif credits: Allison Horst"
},
{
"objectID": "index.html#sign-up-for-list-serv",
"href": "index.html#sign-up-for-list-serv",
"title": "R@URBAN",
"section": "Sign up for List Serv!",
"text": "Sign up for List Serv!\nPlease fill out the following form to receive email updates about upcoming RUG events and trainings. We promise not to spam your inbox:\n\n\n\n\n\n\n\n\n\n\nFill out this Smartsheet form to unsubscribe from the RUG List Serv."
},
{
"objectID": "index.html#contact-info",
"href": "index.html#contact-info",
"title": "R@URBAN",
"section": "Contact Info",
"text": "Contact Info\nPlease don’t hesitate to contact Aaron Williams (awilliams@urban.org) or Amy Rogin (arogin@urban.org) with any thoughts or questions about R at the Urban Institute."
},
{
"objectID": "index.html#r-lunch-labs",
"href": "index.html#r-lunch-labs",
"title": "R@URBAN",
"section": "R Lunch Labs",
"text": "R Lunch Labs\nThe Urban Institute R Users Group hosts weekly lunch labs. R Lunch Labs are hands-on trainings for R users of all skill levels and soon-to-be R users. Each meeting begins with a 5-10 minute quick tip. Afterwards, attendees break into small groups and work on a range of topics including introduction to R, data management and plotting, mapping, and machine learning. Most users bring laptops, but there are a few extras for users without laptops.\nWe have currently paused R Lunch Labs, but they will be back soon! If you have an idea for a topic you want to present informally at a lunch lab, please let us know!"
},
{
"objectID": "intro-to-r.html#introduction",
"href": "intro-to-r.html#introduction",
"title": "R@URBAN",
"section": "Introduction",
"text": "Introduction\n\nR is one of two premier programming languages for data science and one of the fastest growing programming languages. Created by researchers for researchers (with some help from software engineers), R offers rich, intuitive tools that make it perfect for visualization, public policy analysis, econometrics, geospatial analysis, and statistics.\nR doesn’t come in a box. R was never wrapped in cellophane and it definitely isn’t purchased at a store. R’s pricelessness and open-source development are two of its greatest strengths, but it can often leave new users without the anchor of the box and booklet often provided with proprietary software.\nThis guide is meant to be an on-ramp for soon-to-be R Users and a fill-in-the-gap guide for existing R Users. It starts with the most basic question, “what is R?” and progresses to advanced topics like organizing analyses. Along the way it even demonstrates how to read XKCD comics in R.\nR boasts a strong community in the world and inside the Urban Institute. Please don’t hesitate to contact Aaron Williams (awilliams@urban.org) or Kyle Ueyama (kueyama@urban.org) with thoughts or questions about R."
},
{
"objectID": "intro-to-r.html#what-is-r",
"href": "intro-to-r.html#what-is-r",
"title": "R@URBAN",
"section": "What is R?",
"text": "What is R?\n\n\nSource\nR is a free, open-source software for statistical computing. It is known for intuitive, crisp graphics and an extensive, growing library of statistical and analytic methods. Above all, R boasts an enthusiastic community of developers, instructors, and users.\nThe copyright and documentation for R is held by a not-for-profit organization called The R Foundation.\n\nSource, Fair use\nRStudio is a free, open-source integrated development environment (IDE) that runs on top of R. In practice, R users almost exclusively open RStudio and rarely directly open R.\nRStudio is developed by a for-profit company called RStudio. RStudio, the company, employs some of the R community’s most prolific, open-source developers and creates many open-source tools and trainings.\nWhile R code can be written in any text editor, the RStudio IDE is a powerful tool with a console, syntax-highlighting, and debugging tools. This cheatsheet outlines the power of RStudio."
},
{
"objectID": "intro-to-r.html#installation-and-updates",
"href": "intro-to-r.html#installation-and-updates",
"title": "R@URBAN",
"section": "Installation and Updates",
"text": "Installation and Updates\n\n\nWhen should you update?\nAll Urban computers should come pre-installed with R and Rstudio. However your R version may be out of date and require updating. We recommend having at least R version 3.6.0 or higher. You can check what version of R you have installed by opening Rstudio and submitting the following line of code to the console: R.Version()$version.string.\nIf you’re working on a personal computer, you may not have R or Rstudio installed. So follow this guide to install both on your computer.\n\n\nUpdating/Installing R\n\nVisit https://cran.r-project.org/bin/windows/base/. The latest R version will be the downloadable link at the top. As of 1/1/2020, that R version is 3.6.2. Click on the link at the top and download the R-x.x.x-win.exe file.\nOpen the R-x.x.x-win.exe` file. Click next, accept all the defaults, and install R. After R has been installed, click the Finish button. You should not need admin privileges for this.\nCheck that your version of R has been updated in Rstudio. If Rstudio is already open, first close it. Then open Rstudio and retype in R.Version()$version.string. You should see an updated version number printed out on the console.\nTest that R packages are loading as expected. Packages you already had installed should continue to work with newer versions of R. But in some cases, you may need to re-install the packages to work properly with new versions of R.\n\n\n\nUpdating/Installing Rstudio\n\nOpen Rstudio and go to Help > Check for Updates to see if RStudio is up-to-date\nIf it is out-of-date, download the appropriate update.\nBefore you run the installer, contact IT at helpdesk@urban.org for administrative approval as the program requires admin access.\nRun the installer and accept all defaults.\n\nMoving forward, RStudio will automatically and regularly update on Windows computers at the Urban Institute."
},
{
"objectID": "intro-to-r.html#learning-r",
"href": "intro-to-r.html#learning-r",
"title": "R@URBAN",
"section": "Learning R",
"text": "Learning R\n\n\nWhat to Learn\nThere is often more than one way to accomplish a goal in R because of the language’s flexibility. At first, this flexibility can be overwhelming. That’s why it is useful to pick and master one set of tools in R before branching out and learning everything R.\nFortunately, Hadley Wickham’s tidyverse offers a comprehensive set of tools for data analysis that are good for both beginners and experts. The tidyverse is self-described as “an opinionated collection of R packages designed for data science.” The tidyverse consists of almost two dozen clear and concise tools for every part of an analysis workflow. At first, focus on the function read_csv() for loading data, the package dplyr for manipulating data, and the package ggplot2 for plotting.\nHere’s a quick example that reads a .csv, filters the data, and creates a publishable column plot in just fifteen lines of code:\n\n# load packages and source the Urban Institute ggplot2 theme\nlibrary(tidyverse) # contains read_csv, library(dplyr), and library(ggplot2)\nlibrary(urbnthemes)\n\nset_urbn_defaults(style = \"print\")\n\n# read bankdata.csv\nbank <- read_csv(\"intro-to-r/data/bankdata.csv\") \n\nbank_subset <- bank %>%\n # filter to observations of unmarried mothers less than age 30\n filter(married == \"NO\" & age < 30) %>% \n # drop all variables except children and income\n select(children, income) \n\n# plot!\nbank_subset %>%\n ggplot(mapping = aes(x = children, y = income)) +\n geom_bar(stat = \"summary\", fun.y = \"mean\") +\n scale_y_continuous(expand = c(0, 0), labels = scales::dollar) +\n labs(title = \"Mean income\",\n subtitle = \"Unmarried mothers less than age 30\",\n caption = \"Urban Institute analysis of bank data\",\n x = \"Number of children\",\n y = \"Income\")\n\n\n\n\n\n\nResources for Learning\nR for Data Science by Hadley Wickham and Garrett Grolemund is the best print resource for learning R and the tidyverse. 
The book is available online for free and begins with visualization which is motivating and practical. R for Data Science contains dozens of worthwhile exercises but no solutions guide. Please check your solutions against the Urban Institute r4ds solutions guide on GitHub and please contribute if the exercise isn’t already in the guide!\nRStudio publishes a number of cheat sheets that cover the tidyverse. The main cheat sheets can be accessed in RStudio at Help > Cheat Sheets. Additional cheat sheets are accessible here on the RStudio website.\nDavid Robinson, a data scientist from Data Camp, has a new video course about the tidyverse. Few people know as much about R and communicate as effectively as David Robinson.\nAdvanced R by Hadley Wickham is a good resource for new R users that have experience with other programming languages and computer science. It is available online for free.\n\n\nLibrary\nIt’s easy to feel overwhelmed by the frenetic development of the extended R universe. Books are an invaluable resource for slowing down and focusing on fully-formed ideas.\nAaron Williams (awilliams@urban.org) has a number of books that can be checked out:\n\nThe Art of R Programming\nggplot2\nEfficient R Programming (Online!)\nText Mining with R (Online!)\nReasoning with Data\nPractical Statistics for Data Scientists\n\n\n\nBuilt-in Data Sets\nR has many built-in data sets that are useful for practice and even more data sets are accessible through R packages.\nSubmitting data() shows a list of all available data sets. 
cars and iris are two classic sets that are used in many examples.\nlibrary(tidyverse) loads many more “tidy” data sets including diamonds and starwars.\n\nlibrary(tidyverse)\nstarwars %>%\n count(species) %>%\n arrange(desc(n)) %>%\n head()\n\n# A tibble: 6 × 2\n species n\n <chr> <int>\n1 Human 35\n2 Droid 6\n3 <NA> 4\n4 Gungan 3\n5 Kaminoan 2\n6 Mirialan 2\n\n\nlibrary(dslabs) by Rafael Irizarry includes varied data sets that are intentionally imperfect, which makes them useful for practice. Students of econometrics will enjoy library(wooldridge). It loads 105 data sets from Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge. Now you can practice estimating your hedonic pricing models in R!\n\nlibrary(wooldridge)\nlibrary(tidyverse)\nlibrary(urbnthemes)\n\nset_urbn_defaults(style = \"print\")\n\nas_tibble(hprice1) %>%\n ggplot(aes(x = sqrft, y = price)) +\n geom_point() +\n scale_y_continuous(expand = c(0, 0), limits = c(0, 800)) +\n labs(title = '\"hprice1\" data from Wooldridge') \n\n\n\n\n\n\nGetting Help\nEven the best R programmers spend hours each week searching the Internet for answers. Here are some of the best ways to find answers:\nSubmit ? and any function name without parentheses (ex. ?mean) to see the function documentation in RStudio.\nWhen Googling, set the search range to the last year to avoid out-of-date solutions and to focus on up-to-date practices.\nStack Overflow contains numerous solutions. Add [r] to any search to limit results to R. If a problem is particularly perplexing, it is simple to submit questions. Exercise caution when submitting questions because the Stack Overflow community has strict norms about questions and loose norms about respecting novices.\nRStudio Community is a new forum for R Users. 
It has a smaller back catalog than Stack Overflow but users are friendlier than on Stack Overflow.\nFinally, Aaron Williams (awilliams@urban.org) from IBP and Kyle Ueyama (kueyama@urban.org) from IT are available to solve problems, offer guidance, and share R enthusiasm.\n\n\nCRAN Task Views\nR has sub-communities, frameworks, and tools focused on different subject-matter and methodological areas. CRAN Task Views is invaluable for understanding these communities and finding the best frameworks and tools for different disciplines in R.\nCRAN Task Views has 35 pages focused on subcategories of R ranging from econometrics to natural language processing. Each page is maintained by a subject-matter expert and contains methods, packages, books, and mailing lists that are useful for researchers.\nThe econometrics page alone contains detailed information on basic linear regression, microeconometrics, instrumental variables, panel data models, further regression models, time series data and models, data sets, CRAN packages, articles, books, and more."
},
{
"objectID": "intro-to-r.html#r-code",
"href": "intro-to-r.html#r-code",
"title": "R@URBAN",
"section": "R Code",
"text": "R Code\n\nIt’s time to start writing R code. Remember, most R users never open R and exclusively use RStudio. Go ahead and open R once to admire its dated text editor. Then, close R and never directly open it again. Now, open RStudio.\n\nSubmitting Code\nRStudio has four main panels: code editor (top left by default), R console (bottom left by default), environment and history (top right by default), and files, plots, packages, help, and viewer pane (bottom right by default).\nThere are two main ways to submit code:\n\nType code to the right of in the R console and hit enter. Note: R won’t create a long-term record of this code.\nClick in the top left to create a new R script in the code editor panel. Type code in the script. Highlight desired code and either click run the in top right of the code editor panel or type Ctrl/command-enter to run code. Scripts can be saved, so they are the best way to write code that will be used again.\n\nFor practice, submit state.name in the R console to create a vector with all fifty state names (sorry statehood advocates, no Washington, D.C.). Next, create a script, paste state.name, highlight the text, and click run at the top right of the code editor. 
You should get the same output both times.\n\nstate.name\n\n [1] \"Alabama\" \"Alaska\" \"Arizona\" \"Arkansas\" \n [5] \"California\" \"Colorado\" \"Connecticut\" \"Delaware\" \n [9] \"Florida\" \"Georgia\" \"Hawaii\" \"Idaho\" \n[13] \"Illinois\" \"Indiana\" \"Iowa\" \"Kansas\" \n[17] \"Kentucky\" \"Louisiana\" \"Maine\" \"Maryland\" \n[21] \"Massachusetts\" \"Michigan\" \"Minnesota\" \"Mississippi\" \n[25] \"Missouri\" \"Montana\" \"Nebraska\" \"Nevada\" \n[29] \"New Hampshire\" \"New Jersey\" \"New Mexico\" \"New York\" \n[33] \"North Carolina\" \"North Dakota\" \"Ohio\" \"Oklahoma\" \n[37] \"Oregon\" \"Pennsylvania\" \"Rhode Island\" \"South Carolina\"\n[41] \"South Dakota\" \"Tennessee\" \"Texas\" \"Utah\" \n[45] \"Vermont\" \"Virginia\" \"Washington\" \"West Virginia\" \n[49] \"Wisconsin\" \"Wyoming\" \n\n\n\n\nSyntax\nThere are five fundamental pieces of syntax in R.\n\n<- is the assignment operator. An object created on the right side of an assignment operator is assigned to a name on the left side of an assignment operator. Assignment operators are important for saving the consequences of operations and functions. Operations without assignment operators will typically be printed to the console but not saved.\n# begins a comment. Comments are useful for explaining decisions in scripts. As Hadley Wickham notes in the Tidyverse style guide, “In code, use comments to explain the ‘why’ not the ‘what’ or ‘how’.”\nc() combines similar vectors into larger vectors. For example, c(1, 2, 3) is a numeric vector of length three made up of three numeric vectors of length one.\n? in front of any function name without parentheses returns function documentation. For example, ?mean.\n%>% from library(magrittr) and library(tidyverse) is the “pipe operator”. It passes the output from one function to another function. 
This is useful because strings of operations can be “piped” together instead of each individual operation needing to be assigned to an object.\n\n\n\nVectors\nVectors are the fundamental piece of data in R.\nMost vectors are one-dimensional collections of logicals, integers, doubles, characters, factors, dates, or date-times. Vectors can’t mix types.\nR has six commonl\nScalars don’t exist in R; a single value is a vector of length one.\n\n1:10 > 5\n\n [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE\n\n\ntodo(aaron): link to Hadley Wickham talk\n\n\nData frames\nData frames are combinations of equal-length vectors.\nData analysis in R is built around the data frame.\nIn tidy data, each observation forms a row, each variable forms a column, and each observational unit forms a table.\n\n\nMissing values\nR stores missing values as NA. A single NA in a calculation can cause the entire result to return as NA.\n\nsum(c(2, 2, NA))\n\n[1] NA\n\n\nThe contagiousness of NA is good: it makes users explicitly acknowledge dropping missing values with na.rm = TRUE.\n\nsum(c(2, 2, NA), na.rm = TRUE)\n\n[1] 4\n\n\n== NA does not test for missing values. Instead, use is.na().\n\nis.na() and math with booleans\ncomplete.cases\n\n\n\nFunctions\nFunctions in R are collections of code that, when called, perform certain actions. R contains hundreds of functions and thousands more functions can be accessed through packages.\nMost functions take arguments. For example, the function mean() has arguments x, trim, na.rm, and .... The first argument in most functions, in this case x, is an input object. Arguments can be passed to functions by name or position. mean(c(1, 2, 3)) is equivalent to mean(x = c(1, 2, 3)).\nNotice how the other three arguments were skipped. Most arguments in functions have default values. The best way to see default values is to submit the function name with a question mark, like ?mean. 
In this case, trim = 0, na.rm = FALSE, and no further arguments were passed through with ....\nIn the previous example, the c() function was nested inside of the mean() function. It is also possible to assign a vector of 1, 2, and 3 to a name and pass the name to the mean function.\n\napples <- c(1, 2, 3)\n\nmean(apples)\n\nR is a functional programming language. In addition to having many pre-made functions like mean(), R has powerful tools for creating and manipulating custom functions. This is useful because:\n\nIt avoids tedious and error-prone copying-and-pasting and makes iterating processes simple;\nIt is a powerful way to organize sets of operations;\nIt is a standardized way to save code for later and to share operations with others.\n\nThis last bullet is key to the package system in R.\n\n\nPackages\nOpening RStudio automatically loads “base R”, a fundamental collection of code and functions that handles simple operations like math and system management. R can be extended with collections of code and functions developed by the R community called packages. This sounds wild, but most packages are created and maintained by some of the best statisticians and developers in the world.\nMost packages can be installed with install.packages(\"dplyr\"), where the string between the quotation marks is the name of the package. Packages installed with install.packages() come from CRAN and must pass certain checks for performance and documentation. Popular packages on CRAN, like dplyr, have as much support, standards, and quality as code in proprietary software packages like Stata or SAS, if not more.\nIt is possible, but less common, to install packages from places like GitHub. This is less secure and the functionality of the packages is more likely to change over time. install.packages() need only be run once per version of a package per machine and should rarely be included in .R scripts.\nPackages are loaded once per R session with the function library(). 
It is a good idea to include library(package-name) at the top of scripts for each package used in the script. This way it is obvious at the top of the script which packages are installed and loaded.\nNote: install.packages() uses quoted package names and library() uses unquoted package names.\nFor practice, submit the following three lines of code to install RXKCD, load library(RXKCD), and get a random XKCD comic.\n\ninstall.packages(\"RXKCD\")\nlibrary(RXKCD)\ngetXKCD(\"random\")\n\n\n\n\n\n\nPackages are frequently updated, especially around the time R versions change. The easiest way to update packages is Tools > Check for Package Updates in RStudio.\nOccasionally, two loaded packages will have functions with identical names. Any conflicts will be announced when loading packages. See how filter() and lag() from library(tidyverse) and library(stats) conflict:\n In this case, the tidyverse functions are usually favored. If there is ever a conflict or any doubt about which function is used, use the package name and :: to directly call the function. For example, dplyr::select(apples). :: can also be used to call a function without loading the entire package.\n\n\nCRAN\nThe Comprehensive R Archive Network (CRAN) contains almost 12,000 packages contributed over the last two decades by a range of developers. New packages are added to CRAN almost every day.\nCRAN enables R to have all of the benefits of open-source development and the security and predictability of proprietary statistical packages like SAS and Stata. CRAN weds the benefits of broad-based, real-time package development with certain standards for functionality and documentation. Methods and tools make it to R before SAS or Stata, if they ever make it to SAS or Stata, but have standards that generally exceed Python or other open-source languages. 
(See: Malicious Libraries Found on Python Package Index (PyPI))\nBecause of CRAN’s long history and R’s place in the statistics community, CRAN contains many methods that can’t be accessed, much less duplicated, using proprietary software. In addition to being useful now, this also ensures that R isn’t a temporary fad and will have staying power because of the challenge of replicating or besting CRAN.\nR’s extensible design is important, but most tasks can be accomplished with a handful of packages:\n\nggplot2 data visualization\ndplyr data management\ntidyr data tidying\nreadr data import\npurrr functional programming\ntibble data frames\nhms times\nstringr character strings\nlubridate dates/times\n\nforcats factors\nDBI databases\nhaven SPSS, SAS, and Stata files\nreadxl .xls and .xlsx\nmodelr simple modeling within a pipeline\nbroom turning models into tidy data\ntidyverse loads all of the packages listed up to this point; see Hadley Wickham’s “tidyverse”"
},
{
"objectID": "intro-to-r.html#organizing-analyses",
"href": "intro-to-r.html#organizing-analyses",
"title": "R@URBAN",
"section": "Organizing Analyses",
"text": "Organizing Analyses\n\nThis section outlines how to organize an analysis to get the most out of R. Newer users may want to skip this section and work through R for Data Science until they understand library(readr), library(dplyr), and library(ggplot2).\n\nProjects\nOrganizing scripts, files, and data is one of the most important steps to creating a clear and reproducible analysis.\nR Projects, proper noun, are the best way to organize an analysis. They have several advantages:\n\nThey make it possible to concurrently run multiple RStudio sessions.\nThey allow for project-specific RStudio settings.\nThey integrate well with Git version control.\nThey are the “node” of relative file paths. (more on this in a second)\n\nBefore setting up an R Project, go to Tools > Global Options and uncheck “Restore most recently opened project at startup”.\n\nEvery new analysis in R should start with an R Project. First, create a directory that holds all data, scripts, and files for the analysis. Storing files and data in a sub-directories is encouraged. For example, data can be stored in a folder called data/.\nNext, click “New Project…” in the top right corner.\n\nWhen prompted, turn your recently created “Existing Directory” into a project.\n\nUpon completion, the name of the R Project should now be displayed in the top right corner of RStudio where it previously displayed “Project: (None)”. Once opened, .RProj files do not need to be saved. Double-clicking .Rproj files in the directory is now the best way to open RStudio. This will allow for the concurrent use of multiple R sessions and ensure the portability of file paths. Once an RStudio project is open, scripts can be opened by double-clicking individual files in the computer directory or clicking files in the “Files” tab in the top right of RStudio.\nR Projects make code highly portable because of the way they handle file paths. Here are a few rules:\n\nFilepaths\nNever use \\ in file paths in R. 
\\ is an escape character in R strings and will complicate an analysis. Fortunately, RStudio understands / in file paths regardless of operating system.\nNever use setwd() in R. It is unnecessary, it makes code unreproducible across machines, and it is rude to collaborators. R Projects create a better framework for file paths. Simply treat the directory where the R Project lives as the working directory and directories inside of that directory as sub-directories.\nFor example, say there’s a .Rproj called starwars-analysis.Rproj in a directory called starwars-analysis. If there is a .csv in that folder called jedi.csv, the file can be loaded with read_csv(\"jedi.csv\") instead of read_csv(\"H:/ibp/analyses/starwars-analysis/jedi.csv\"). If that file is in a sub-directory of starwars-analysis called data, it can be loaded with read_csv(\"data/jedi.csv\"). The same concepts hold for writing data and graphics.\nThis simplifies code and makes it portable because all relative filepaths will be identical on all computers. To share an analysis, simply send the entire directory to a collaborator or share it with GitHub.\nHere’s an example directory:\n\n\n\nIt isn’t always possible to avoid absolute file paths because of the many different ways the Urban Institute stores data. Avoid absolute paths when possible and be deliberate about where analyses live in relation to where data live.\nFinally, it’s good practice to include a README in the same directory as the .Rproj. The README should outline the purpose and the directories and can include information about how to contribute, licenses, dependencies, and acknowledgements. This GitHub page is a good README template.\nCheck out R for Data Science by Hadley Wickham and Garrett Grolemund for a more thorough explanation of this workflow. 
Jenny Bryan also has a good blogpost about avoiding setwd().\n\n\n\nNaming Conventions\nNaming functions, objects, variables, files, and scripts is one of the toughest and least-taught dimensions of computer programming. Better names can add clarity to code, save time and effort, and minimize errors caused by accidentally overwriting existing functions or other objects.\n\nThere are only two hard things in Computer Science: cache invalidation and naming things. ~ Phil Karlton\n\n\nFunctions and Other Objects\nR is case-sensitive.\nObjects in R can be named anything - even unicode characters. But just because something can be named anything doesn’t mean it should.\nMost functions and objects in R are lowerCamelCase, period.separated, or underscore_separated. As an individual or team, it’s important to pick a style and stick with it, but as this article from 2012 shows, there isn’t much consistency across the R community. Hadley Wickham’s tidyverse uses underscores, so expect to see some consolidation into this style.\nIn general, it’s good practice to name functions with verbs and other objects with nouns.\nVariable and object names that start with numbers, have spaces, or use peculiar syntax require back-ticks.\n\nselect(urban, `R Users Group`)\n\n\nurban$`R Users Group`\n\nFinally, it’s possible to overwrite existing functions and other objects in R with the assignment operator. Don’t give vectors or data frames the same names as existing functions and don’t overwrite existing functions with custom functions.\n\n\nFiles\nNaming conventions for scripts and files are probably the most overlooked dimension in programming and analysis. The first three bullets from this section come from this rich slide deck by Jenny Bryan. This may seem pedantic, but picking a file naming convention now can save a bunch of time and headaches in the future.\n1) Machine readable\nCreate file names that are easily machine readable. 
Use all lower case letters and skip punctuation other than delimiters. Use underscores as characters for splitting the file name. For example, stringr::str_split_fixed(\"2018-01-10_r-introduction_machine-readable-example_01.csv\", \"[_\\\\.]\", 5) splits the file name on underscores and periods and returns date, project, file name, file number, and file type. This information can then be stored and sorted in a data frame.\n2) Human readable\nCreate file names that are human readable. The example from above is informative without any machine interpretation.\n3) Plays well with default ordering\nIt is often useful to include date or sequence numbers in script and file names. For example, include 2018-01-10 for data collected on January 10th, 2018 or include 3 for the third script in a sequence of five .R programs. Starting file names with the date or sequence numbers means files will show up in a logical order by default. Be sure to use ISO 8601 standard for dates (YYYY-MM-DD).\n4) Don’t Use File Names for Version Control\nVersion control with file names is unwieldy and usually results in names that are barely human readable and definitely not machine readable.\n\n“2018-01-10_r-introduction_machine-readable-example_01_v2_for-aaron_after-review_before-submission.R”\n\nIterations usually don’t iterate sensibly. For example, why were “v1” and “v2” abandoned for “for-aaron”, “after-review”, and “before-submission”? Furthermore, version control with file names is poor for concurrent work and merging.\nThe next section will outline the optimal tool for version control.\n\n\n\nVersion Control\nThe workflow outlined above integrates perfectly with version control like Git and distributed version control repository hosting services like GitHub.\nVersion control is a system for recording changes to files over time. Version control is built around repositories. In this case, the folder containing the .Rproj is the perfect directory to use as a repository. 
A handful of simple commands are used to track and commit changes to text files (.R, .Rmd, etc.) and data. This record is valuable for testing alternatives, communicating with others and your future self, and documenting progress on projects.\nGitHub is a distributed repository system built on top of Git. GitHub has a number of valuable tools for collaboration and project management. In particular, it makes concurrent collaboration on code simpler with branches and has a slick system for issues. Here are the branches and issues for the Urban Institute R Graphics Guide. It also has free web hosting for websites like the website you are reading right now. GitHub has a quick guide that is a good place to start learning Git.\nThe Urban Institute has a number of legacy models and code bases that span years and have been touched by scores of brilliant researchers. The future value of a record of all code changes and development is borderline unthinkable.\n\n\nCoding Style\n\n“Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read.” ~Hadley Wickham (2014)\n\ngood coding style is like using correct punctuation you can manage without it but it sure makes thing easier to read\nThe details of a coding style are less important than consistently sticking to that style. Be flexible when working with collaborators so the style doesn’t change inside an analysis.\nHere are three good sources for inspiration:\n\nTidyverse Style Guide\nGoogle’s R Style Guide\nHadley Wickham’s R Style Guide"
},
{
"objectID": "intro-to-r.html#putting-it-all-together",
"href": "intro-to-r.html#putting-it-all-together",
"title": "R@URBAN",
"section": "Putting it All Together",
"text": "Putting it All Together\n\nR can augment or replace a traditional proprietary statistical packages like SAS or Stata with a few extra bells and whistles, but hopefully this guide and other resources show a fuller vision for developing reproducible, accurate, and collaborative analyses.1\nThis research pipeline, to use the phrase by Roger Peng, Jeff Leek, and Brian Caffo, combines the best of traditional economic and social policy research, computer science/software development, and statistics.2 Here are the rules:\n\n1) No steps in an analysis are done by hand and all steps are recorded with executable scripts.\nIt is common to use executable scripts to estimate a regression equation or to tabulate weighted summary statistics. But for some reason, other steps like file management, data munging, and visualization are often done “by hand”. Good science demands that every step of an analysis is recorded - and if possible - with executable scripts.\nFortunately, it is possible to script most steps in R from downloading data from the Internet and accessing APIs to visualizations and drafting manuscripts. This may be challenging at first, but it will save time and result in better research in the long run.\n\n\n2) All code is entirely reproducible and portable.\nExecutable scripts are for communicating with other researchers and our future selves. Scripts lose value if they aren’t portable and can’t be reproduced in the future or by others. Recording every step with execuatble scripts is a start, but scripts aren’t valuable if they require expensive proprietary software,or if researchers have to significantly alter scripts to run an analysis.\nOpen source software, like R, promotes accessibility, portability, and reproducibility. 
Also, be sure to avoid setwd() and use relative filepaths.\n\n\n3) Local and collaborative version control is used and all repositories include all code and a README.\nUse local version control like Git and a distributed version control repository hosting service like GitHub to track changes and share analyses. The version control should include all scripts and meta information about the analysis in a README.\n\n\n4) Raw data and tidy analytic data are stored in a collaborative location with a code book.\nMany raw data are already stored in collaborative locations like BLS.gov and don’t need to be duplicated. Tidy analytic data, like the data used to estimate a regression equation, should be stored in a collaborative location. This is good practice, but is less essential if executable scripts are flawless and reproducible. Researcher-entered data and data from less-stable sources should be stored in raw and analytic forms.\nSmall data sets can be stored on GitHub without issue. Larger data sets should be stored in collaborative locations accessible by scripting languages. This is only possible for public data and best-practices for private data are less established.\nSave codebooks for data sets as text files or PDFs in repositories. 
Creating codebooks for user-entered data or variables created in executable scripts is often worth the time.\n\n\n5) Code review and issue tracking are used to improve accuracy and computational efficiency.\nGetting stronger programmers and/or methodologists to review code is valuable for limiting programming and analytic mistakes, improving computational efficiency, and learning.\nGitHub issues is a powerful tool for managing, discussing, and collaborating on code.\n\n\n6) Projects rely heavily on literate statistical programming and standard means of distribution for execution, validation, and publishing.\nLiterate statistical programming is the combination of natural language explanations for humans and executable code in one document. The idea was created by Donald Knuth and is embodied by R Markdown.\nR Markdown combines text chunks, code chunks, and output chunks in one script that can be “knitted” using library(knitr) to create PDFs, books, .htmls, and websites like the website where this guide lives.\nThis workflow combines the analytic and narrative process in a tool that is flexible, scalable, reproducible, and less error-prone. R Markdown documents can be used for executing programs, validating models and analyses, and publishing. These documents can be submitted to many academic journals and shared easily with GitHub pages.\n\n\n7) Software versions and dependencies are recorded and all software is cited in publications.\nsessionInfo() reports the R version, locale, packages used, and other important information about an R session. citation() creates a text and BibTeX entry of the citation for R. citation(<package-name>) creates a text and BibTeX entry for R packages. library(packrat) (outlined here) is a tool for saving R dependencies."
},
{
"objectID": "intro-to-r.html#bibliography-and-references",
"href": "intro-to-r.html#bibliography-and-references",
"title": "R@URBAN",
"section": "Bibliography and References",
"text": "Bibliography and References\n\nHadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse\nHadley Wickham and Garrett Grolemund (2017). R For Data Science http://r4ds.had.co.nz/\nHadley Wickham (2014). Advanced R http://adv-r.had.co.nz/Style.html\nHilary S. Parker (2017. Opinionated Analysis Development https://www.rstudio.com/resources/videos/opinionated-analysis-development/\nJenny Bryan (2017).\nProject-oriented workflow https://www.tidyverse.org/articles/2017/12/workflow-vs-script/\nJenny Bryan (2015). naming things. http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf\nJJ Allaire, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng and Winston Chang (2017). rmarkdown: Dynamic Documents for R. R package version 1.8. https://CRAN.R-project.org/package=rmarkdown\nJustin M. Shea (2017). wooldridge: 105 Data Sets from “Introductory Econometrics: A Modern Approach” by Jeffrey M. Wooldridge. R package version 1.2.0. https://CRAN.R-project.org/package=wooldridge\nRoger Peng Reproducible Research Part 2 https://www.coursera.org/learn/reproducible-research/lecture/abevs/reproducible-research-concepts-and-ideas-part-2\nYihui Xie (2017). knitr: A General-Purpose Package for Dynamic Report Generation in R. 
R package version 1.18.\n\nsessionInfo()\n\nR version 4.1.2 (2021-11-01)\nPlatform: x86_64-apple-darwin17.0 (64-bit)\nRunning under: macOS Big Sur 10.16\n\nMatrix products: default\nBLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib\nLAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib\n\nlocale:\n[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n\nattached base packages:\n[1] stats graphics grDevices utils datasets methods base \n\nother attached packages:\n [1] RXKCD_1.9.2 wooldridge_1.4-2 urbnthemes_0.0.2 forcats_0.5.1 \n [5] stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 readr_2.1.1 \n [9] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1 \n\nloaded via a namespace (and not attached):\n [1] httr_1.4.2 bit64_4.0.5 vroom_1.5.7 jsonlite_1.7.2 \n [5] modelr_0.1.8 assertthat_0.2.1 cellranger_1.1.0 yaml_2.2.1 \n [9] ggrepel_0.9.1 Rttf2pt1_1.3.9 pillar_1.7.0 backports_1.4.0 \n[13] glue_1.6.2 extrafontdb_1.0 digest_0.6.29 rvest_1.0.2 \n[17] colorspace_2.0-2 htmltools_0.5.2 plyr_1.8.6 pkgconfig_2.0.3 \n[21] broom_0.7.10 haven_2.4.3 scales_1.1.1 jpeg_0.1-9 \n[25] tzdb_0.2.0 generics_0.1.2 farver_2.1.0 ellipsis_0.3.2 \n[29] withr_2.4.3 cli_3.2.0 RJSONIO_1.3-1.6 magrittr_2.0.3 \n[33] crayon_1.5.1 readxl_1.4.0.9000 evaluate_0.14 fs_1.5.1 \n[37] fansi_1.0.3 xml2_1.3.3 tools_4.1.2 hms_1.1.1 \n[41] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.1.2 \n[45] rlang_1.0.2 grid_4.1.2 rstudioapi_0.13 htmlwidgets_1.5.4\n[49] labeling_0.4.2 rmarkdown_2.11 gtable_0.3.0 DBI_1.1.1 \n[53] R6_2.5.1 gridExtra_2.3 lubridate_1.8.0 knitr_1.36 \n[57] fastmap_1.1.0 bit_4.0.4 extrafont_0.17 utf8_1.2.2 \n[61] stringi_1.7.6 parallel_4.1.2 Rcpp_1.0.8 vctrs_0.4.1 \n[65] png_0.1-7 dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.28"
},
{
"objectID": "graphics-guide.html#urban-institute-r-graphics-guide",
"href": "graphics-guide.html#urban-institute-r-graphics-guide",
"title": "R@URBAN",
"section": "Urban Institute R Graphics Guide",
"text": "Urban Institute R Graphics Guide\n\nR is a powerful, open-source programming language and environment. R excels at data management and munging, traditional statistical analysis, machine learning, and reproducible research, but it is probably best known for its graphics. This guide contains examples and instructions for popular and lesser-known plotting techniques in R. It also includes instructions for using urbnthemes, the Urban Institute’s R package for creating near-publication-ready plots with ggplot2. If you have any questions, please don’t hesitate to contact Aaron Williams (awilliams@urban.org) or Kyle Ueyama (kueyama@urban.org).\n\nBackground\nlibrary(urbnthemes) makes ggplot2 output align more closely with the Urban Institute’s Data Visualization style guide. This package does not produce publication ready graphics. Visual styles must still be edited using your project/paper’s normal editing workflow.\nExporting charts as a pdf will allow them to be more easily edited. See the Saving Plots section for more information.\nThe theme has been tested against ggplot2 version 3.0.0. It will not function properly with older versions of ggplot2\n\n\nUsing library(urbnthemes)\nRun the following code to install or update urbnthemes:\ninstall.packages(\"remotes\")\nremotes::install_github(\"UrbanInstitute/urbnthemes\")\nRun the following code at the top of each script:\nlibrary(tidyverse)\nlibrary(urbnthemes)\n\nset_urbn_defaults(style = \"print\")\n\n\nInstalling Lato\nYour Urban computer may not have the Lato font installed. If it is not installed, please install the free Lato font from Google. Below are step by step instructions:\n\nDownload the Lato font (as a zip file).\nUnzip the file on your computer.\nFor each .ttf file in the unzipped Lato/ folder, double click the file and click Install (on Windows) or Install Font (on Mac).\nImport and register Lato into R by running urbnthemes::lato_import() in the console once. 
Be patient as this may take a few minutes!\nTo confirm installation, run urbnthemes::lato_test(). If this is successful, you’re done and Lato will automatically be used when creating plots with library(urbnthemes). You only need to install Lato once per computer.\n\nWaffle charts with glyphs require fontawesome. fontawesome_test() and fontawesome_install() are the fontawesome versions of the above functions. Be sure to install fontawesome from here first.\n\n\nGrammar of Graphics and Conventions\nHadley Wickham’s ggplot2 is based on Leland Wilkinson’s The Grammar of Graphics and Wickham’s A Layered Grammar of Graphics. The layered grammar of graphics is a structured way of thinking about the components of a plot, which then lend themselves to the simple structure of ggplot2.\n\nData are what are visualized in a plot, and mappings are directions for how data are mapped in a plot in a way that can be perceived by humans.\n\nGeoms are representations of the actual data like points, lines, and bars.\nStats are statistical transformations that represent summaries of the data like histograms.\nScales map values in the data space to values in the aesthetic space. Scales draw legends and axes.\nCoordinate Systems describe how geoms are mapped to the plane of the graphic.\n\nFacets break the data into meaningful subsets like small multiples.\nThemes control the finer points of a plot such as fonts, font sizes, and background colors.\n\nMore information: ggplot2: Elegant Graphics for Data Analysis\n\n\nTips and Tricks\n\nggplot2 expects data to be in data frames or tibbles. It is preferable for the data frames to be “tidy” with each variable as a column, each observation as a row, and each observational unit as a separate table. dplyr and tidyr contain concise and effective tools for “tidying” data.\nR allows function arguments to be called explicitly by name and implicitly by position. 
The coding examples in this guide only contain named arguments for clarity.\nGraphics will sometimes render differently on different operating systems. This is because anti-aliasing is activated in R on Mac and Linux but not activated in R on Windows. This won’t be an issue once graphics are saved.\nContinuous x-axes have ticks. Discrete x-axes do not have ticks. Use remove_ticks() to remove ticks."
},
{
"objectID": "graphics-guide.html#bar-plots",
"href": "graphics-guide.html#bar-plots",
"title": "R@URBAN",
"section": "Bar Plots",
"text": "Bar Plots\n\n\nOne Color\n\nmtcars %>%\n count(cyl) %>%\n ggplot(mapping = aes(x = factor(cyl), y = n)) +\n geom_col() +\n geom_text(mapping = aes(label = n), vjust = -1) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Cylinders\",\n y = NULL) +\n remove_ticks() +\n remove_axis() \n\n\n\n\n\n\nOne Color (Rotated)\nThis example introduces coord_flip() and remove_axis(axis = \"x\", flip = TRUE). remove_axis() is from library(urbnthemes) and creates a custom theme for rotated bar plots.\n\nmtcars %>%\n count(cyl) %>%\n ggplot(mapping = aes(x = factor(cyl), y = n)) +\n geom_col() +\n geom_text(mapping = aes(label = n), hjust = -1) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Cylinders\",\n y = NULL) + \n coord_flip() +\n remove_axis(axis = \"x\", flip = TRUE)\n\n\n\n\n\n\nThree Colors\nThis is identical to the previous plot except colors and a legend are added with fill = cyl. Turning x into a factor with factor(cyl) skips 5 and 7 on the x-axis. Adding fill = cyl without factor() would have created a continuous color scheme and legend.\n\nmtcars %>%\n mutate(cyl = factor(cyl)) %>%\n count(cyl) %>%\n ggplot(mapping = aes(x = cyl, y = n, fill = cyl)) +\n geom_col() +\n geom_text(mapping = aes(label = n), vjust = -1) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Cylinders\",\n y = NULL) +\n remove_ticks() +\n remove_axis()\n\n\n\n\n\n\nStacked Bar Plot\nAn additional aesthetic can easily be added to bar plots by adding fill = categorical variable to the mapping. 
Here, transmission type subsets each bar showing the count of cars with different numbers of cylinders.\n\nmtcars %>%\n mutate(am = factor(am, labels = c(\"Automatic\", \"Manual\")),\n cyl = factor(cyl)) %>% \n group_by(am) %>%\n count(cyl) %>%\n group_by(cyl) %>%\n arrange(desc(am)) %>%\n mutate(label_height = cumsum(n)) %>%\n ggplot() +\n geom_col(mapping = aes(x = cyl, y = n, fill = am)) +\n geom_text(aes(x = cyl, y = label_height - 0.5, label = n, color = am)) +\n scale_color_manual(values = c(\"white\", \"black\")) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Cylinders\",\n y = NULL) + \n remove_ticks() +\n remove_axis() +\n guides(color = FALSE)\n\n\n\n\n\n\nStacked Bar Plot With Position = Fill\nThe previous examples used geom_col(), which takes a y value for bar height. This example uses geom_bar(), which counts the observations in each group and uses those counts as bar heights. In this example, position = \"fill\" in geom_bar() changes the y-axis from count to the proportion of each bar.\n\nmtcars %>%\n mutate(am = factor(am, labels = c(\"Automatic\", \"Manual\")),\n cyl = factor(cyl)) %>% \n ggplot() +\n geom_bar(mapping = aes(x = cyl, fill = am), position = \"fill\") +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1)), labels = scales::percent) +\n labs(x = \"Cylinders\",\n y = NULL) + \n remove_ticks() +\n guides(color = FALSE)\n\n\n\n\n\n\nDodged Bar Plot\nSubsetted bar charts in ggplot2 are stacked by default. 
position = \"dodge\" in geom_col() expands the bar chart so the bars appear next to each other.\n\nmtcars %>%\n mutate(am = factor(am, labels = c(\"Automatic\", \"Manual\")),\n cyl = factor(cyl)) %>%\n group_by(am) %>%\n count(cyl) %>%\n ggplot(mapping = aes(cyl, y = n, fill = factor(am))) +\n geom_col(position = \"dodge\") +\n geom_text(aes(label = n), position = position_dodge(width = 0.7), vjust = -1) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Cylinders\",\n y = NULL) + \n remove_ticks() +\n remove_axis()\n\n\n\n\n\n\nLollipop plot/Cleveland dot plot\nLollipop plots and Cleveland dot plots are minimalist alternatives to bar plots. The key to both plots is to order the data based on the continuous variable using arrange() and then turn the discrete variable into a factor with the ordered levels of the continuous variable using mutate(). This step “stores” the order of the data.\n\nLollipop plot\n\nmtcars %>%\n rownames_to_column(\"model\") %>%\n arrange(mpg) %>%\n mutate(model = factor(model, levels = .$model)) %>%\n ggplot(aes(mpg, model)) +\n geom_segment(aes(x = 0, xend = mpg, y = model, yend = model)) + \n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)), limits = c(0, 40)) +\n labs(x = \"Miles Per Gallon\", \n y = NULL)\n\n\n\n\n\n\nCleveland dot plot\n\nmtcars %>%\n rownames_to_column(\"model\") %>%\n arrange(mpg) %>%\n mutate(model = factor(model, levels = .$model)) %>%\n ggplot(aes(mpg, model)) +\n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)), limits = c(0, 40)) +\n labs(x = \"Miles Per Gallon\", \n y = NULL)\n\n\n\n\n\n\n\nDumbbell plot"
},
{
"objectID": "graphics-guide.html#scatter-plots",
"href": "graphics-guide.html#scatter-plots",
"title": "R@URBAN",
"section": "Scatter Plots",
"text": "Scatter Plots\n\n\nOne Color Scatter Plot\nScatter plots are useful for showing relationships between two or more variables. Use scatter_grid() from library(urbnthemes) to easily add vertical grid lines for scatter plots.\n\nmtcars %>%\n ggplot(mapping = aes(x = wt, y = mpg)) +\n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 6),\n breaks = 0:6) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 40),\n breaks = 0:8 * 5) +\n labs(x = \"Weight (thousands of pounds)\",\n y = \"City MPG\") +\n scatter_grid()\n\n\n\n\n\n\nHigh-Density Scatter Plot with Transparency\nLarge numbers of observations can sometimes make scatter plots tough to interpret because points overlap. Adding alpha = with a number between 0 and 1 adds transparency to points and clarity to plots. Now it’s easy to see that jewelry stores are probably rounding up but not rounding down carats!\n\ndiamonds %>%\n ggplot(mapping = aes(x = carat, y = price)) +\n geom_point(alpha = 0.05) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 6),\n breaks = 0:6) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 20000),\n breaks = 0:4 * 5000,\n labels = scales::dollar) +\n labs(x = \"Carat\",\n y = \"Price\") +\n scatter_grid()\n\n\n\n\n\n\nHex Scatter Plot\nSometimes transparency isn’t enough to bring clarity to a scatter plot with many observations. 
As n increases into the hundreds of thousands and even millions, geom_hex can be one of the best ways to display relationships between two variables.\n\ndiamonds %>%\n ggplot(mapping = aes(x = carat, y = price)) +\n geom_hex(mapping = aes(fill = ..count..)) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 6),\n breaks = 0:6) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 20000),\n breaks = 0:4 * 5000,\n labels = scales::dollar) +\n scale_fill_gradientn(labels = scales::comma) + \n labs(x = \"Carat\",\n y = \"Price\") +\n scatter_grid() +\n theme(legend.position = \"right\",\n legend.direction = \"vertical\")\n\n\n\n\n\n\nScatter Plots With Random Noise\nSometimes scatter plots have many overlapping points but a reasonable number of observations. geom_jitter adds a small amount of random noise so points are less likely to overlap. width and height control the amount of noise that is added. In the following before-and-after, notice how many more points are visible after adding jitter.\n\nBefore\n\nmpg %>%\n ggplot(mapping = aes(x = displ, y = cty)) +\n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 8),\n breaks = 0:8) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 40),\n breaks = 0:4 * 10) +\n labs(x = \"Displacement\",\n y = \"City MPG\") +\n scatter_grid()\n\n\n\n\n\n\nAfter\n\nset.seed(2017)\nmpg %>%\n ggplot(mapping = aes(x = displ, y = cty)) +\n geom_jitter() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 8),\n breaks = 0:8) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 40),\n breaks = 0:4 * 10) +\n labs(x = \"Displacement\",\n y = \"City MPG\") +\n scatter_grid()\n\n\n\n\n\n\n\nScatter Plots with Varying Point Size\nWeights and populations can be mapped in scatter plots to the size of the points. 
Here, the number of households in each state is mapped to the size of each point using aes(size = hhpop). Note: ggplot2::geom_point() is used instead of geom_point().\n\nurbnmapr::statedata %>%\n ggplot(mapping = aes(x = medhhincome, y = horate)) +\n ggplot2::geom_point(mapping = aes(size = hhpop), alpha = 0.3) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(30000, 80000),\n breaks = 3:8 * 10000,\n labels = scales::dollar) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 0.8),\n breaks = 0:4 * 0.2) +\n scale_radius(range = c(3, 15),\n breaks = c(2500000, 7500000, 12500000), \n labels = scales::comma) +\n labs(x = \"Household income\",\n y = \"Homeownership rate\") +\n scatter_grid() +\n theme(plot.margin = margin(r = 20))\n\n\n\n\n\n\nScatter Plots with Fill\nA third aesthetic can be added to scatter plots. Here, color signifies the number of cylinders in each car. Before ggplot() is called, cyl is relabeled with mutate() from library(dplyr) and the piping operator %>%.\n\nmtcars %>%\n mutate(cyl = paste(cyl, \"cylinders\")) %>%\n ggplot(aes(x = wt, y = mpg, color = cyl)) +\n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 6),\n breaks = 0:6) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 40),\n breaks = 0:8 * 5) +\n labs(x = \"Weight (thousands of pounds)\",\n y = \"City MPG\") +\n scatter_grid()"
},
{
"objectID": "graphics-guide.html#line-plots",
"href": "graphics-guide.html#line-plots",
"title": "R@URBAN",
"section": "Line Plots",
"text": "Line Plots\n\n\neconomics %>%\n ggplot(mapping = aes(x = date, y = unemploy)) +\n geom_line() +\n scale_x_date(expand = expand_scale(mult = c(0.002, 0)), \n breaks = \"10 years\",\n limits = c(as.Date(\"1961-01-01\"), as.Date(\"2020-01-01\")),\n date_labels = \"%Y\") +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n breaks = 0:4 * 4000,\n limits = c(0, 16000),\n labels = scales::comma) +\n labs(x = \"Year\", \n y = \"Number Unemployed (1,000s)\")\n\n\n\n\n\nLine Plots With Multiple Lines\n\nlibrary(gapminder)\n\ngapminder %>%\n filter(country %in% c(\"Australia\", \"Canada\", \"New Zealand\")) %>%\n mutate(country = factor(country, levels = c(\"Canada\", \"Australia\", \"New Zealand\"))) %>%\n ggplot(aes(year, gdpPercap, color = country)) +\n geom_line() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n breaks = c(1952 + 0:12 * 5), \n limits = c(1952, 2007)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n breaks = 0:8 * 5000,\n labels = scales::dollar, \n limits = c(0, 40000)) +\n labs(x = \"Year\",\n y = \"Per capita GDP (US dollars)\")\n\n\n\n\nPlotting more than one variable can be useful for seeing the relationship of variables over time, but it takes a small amount of data munging.\nThis is because ggplot2 wants data in a “long” format instead of a “wide” format for line plots with multiple lines. gather() and spread() from the tidyr package make switching back and forth between “long” and “wide” painless. Essentially, variable names go into “key” and variable values go into “value”. 
Then ggplot2 turns the different levels of the key variable (here, the stock indices DAX, SMI, CAC, and FTSE) into colors.\n\nas_tibble(EuStockMarkets) %>%\n mutate(date = time(EuStockMarkets)) %>%\n gather(key = \"key\", value = \"value\", -date) %>%\n ggplot(mapping = aes(x = date, y = value, color = key)) +\n geom_line() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(1991, 1999), \n breaks = c(1991, 1993, 1995, 1997, 1999)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n breaks = 0:4 * 2500,\n labels = scales::dollar, \n limits = c(0, 10000)) + \n labs(x = \"Date\",\n y = \"Value\")\n\n\n\n\n\n\nStep plot\ngeom_line() connects coordinates with the shortest possible straight line. Sometimes step plots are necessary because y values don’t change between coordinates. For example, the upper bound of the Federal Funds Rate is set at regular intervals and remains constant until it is changed.\n\n# downloaded from FRED on 2018-12-06\n\n# https://fred.stlouisfed.org/series/DFEDTARU\n\nfed_fund_rate <- read_csv(\n \"date, fed_funds_rate\n 2014-01-01,0.0025\n 2015-12-16,0.0050\n 2016-12-14,0.0075\n 2017-03-16,0.0100\n 2017-06-15,0.0125\n 2017-12-14,0.0150\n 2018-03-22,0.0175\n 2018-06-14,0.0200\n 2018-09-27,0.0225\n 2018-12-06,0.0225\")\n\nfed_fund_rate %>%\n ggplot(mapping = aes(x = date, y = fed_funds_rate)) + \n geom_step() +\n scale_x_date(expand = expand_scale(mult = c(0.002, 0)), \n breaks = \"1 year\",\n limits = c(as.Date(\"2014-01-01\"), as.Date(\"2019-01-01\")),\n date_labels = \"%Y\") +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n breaks = c(0, 0.01, 0.02, 0.03),\n limits = c(0, 0.03),\n labels = scales::percent) + \n labs(x = \"Date\",\n y = \"Upper-bound of the Federal Funds Rate\")\n\n\n\n\n\n\nPath plot\nThe Beveridge curve is a macroeconomic plot that displays a relationship between the unemployment rate and the vacancy rate. 
Movements along the curve indicate changes in the business cycle and horizontal shifts of the curve suggest structural changes in the labor market.\nLines in Beveridge curves do not monotonically move from left to right. Therefore, it is necessary to use geom_path().\n\n# seasonally-adjusted, quarterly vacancy rate - JOLTS\n# seasonally-adjusted, quarterly unemployment rate - CPS\n\n# pulled from FRED on April 11, 2018. \n\nlibrary(ggrepel)\n\nbeveridge <- read_csv(\n \"quarter, vacanacy_rate, unempoyment_rate\n 2006-01-01,0.0310,0.0473\n 2006-04-01,0.0316,0.0463\n 2006-07-01,0.0313,0.0463\n 2006-10-01,0.0310,0.0443\n 2007-01-01,0.0323,0.0450\n 2007-04-01,0.0326,0.0450\n 2007-07-01,0.0316,0.0466\n 2007-10-01,0.0293,0.0480\n 2008-01-01,0.0286,0.0500\n 2008-04-01,0.0280,0.0533\n 2008-07-01,0.0253,0.0600\n 2008-10-01,0.0220,0.0686\n 2009-01-01,0.0196,0.0826\n 2009-04-01,0.0180,0.0930\n 2009-07-01,0.0176,0.0963\n 2009-10-01,0.0180,0.0993\n 2010-01-01,0.0196,0.0983\n 2010-04-01,0.0220,0.0963\n 2010-07-01,0.0216,0.0946\n 2010-10-01,0.0220,0.0950\n 2011-01-01,0.0226,0.0903\n 2011-04-01,0.0236,0.0906\n 2011-07-01,0.0250,0.0900\n 2011-10-01,0.0243,0.0863\n 2012-01-01,0.0270,0.0826\n 2012-04-01,0.0270,0.0820\n 2012-07-01,0.0266,0.0803\n 2012-10-01,0.0260,0.0780\n 2013-01-01,0.0276,0.0773\n 2013-04-01,0.0280,0.0753\n 2013-07-01,0.0280,0.0723\n 2013-10-01,0.0276,0.0693\n 2014-01-01,0.0290,0.0666\n 2014-04-01,0.0323,0.0623\n 2014-07-01,0.0326,0.0610\n 2014-10-01,0.0330,0.0570\n 2015-01-01,0.0350,0.0556\n 2015-04-01,0.0366,0.0540\n 2015-07-01,0.0373,0.0510\n 2015-10-01,0.0360,0.0500\n 2016-01-01,0.0386,0.0493\n 2016-04-01,0.0383,0.0486\n 2016-07-01,0.0383,0.0493\n 2016-10-01,0.0363,0.0473\n 2017-01-01,0.0366,0.0466\n 2017-04-01,0.0390,0.0433\n 2017-07-01,0.0406,0.0430\n 2017-10-01,0.0386,0.0410\")\n\nlabels <- beveridge %>%\n filter(lubridate::month(quarter) == 1)\n\nbeveridge %>%\n ggplot() +\n geom_path(mapping = aes(x = unempoyment_rate, y = vacanacy_rate), alpha = 0.5) +\n 
geom_point(data = labels, mapping = aes(x = unempoyment_rate, y = vacanacy_rate)) +\n geom_text_repel(data = labels, mapping = aes(x = unempoyment_rate, y = vacanacy_rate, label = lubridate::year(quarter))) + \n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0.04, 0.1),\n labels = scales::percent) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)), \n breaks = c(0, 0.01, 0.02, 0.03, 0.04, 0.05),\n limits = c(0, 0.05),\n labels = scales::percent) + \n labs(x = \"Seasonally-adjusted unemployment rate\",\n y = \"Seasonally-adjusted vacancy rate\") + \n scatter_grid()\n\n\n\n\n\n\nSlope plots\n\n# https://www.bls.gov/lau/\nlibrary(ggrepel)\n\nunemployment <- tibble(\n time = c(\"October 2009\", \"October 2009\", \"October 2009\", \"August 2017\", \"August 2017\", \"August 2017\"),\n rate = c(7.4, 7.1, 10.0, 3.9, 3.8, 6.4),\n state = c(\"Maryland\", \"Virginia\", \"Washington, D.C.\", \"Maryland\", \"Virginia\", \"Washington, D.C.\")\n)\n\nlabel <- tibble(label = c(\"October 2009\", \"August 2017\"))\noctober <- filter(unemployment, time == \"October 2009\")\naugust <- filter(unemployment, time == \"August 2017\")\n\nunemployment %>%\n mutate(time = factor(time, levels = c(\"October 2009\", \"August 2017\")),\n state = factor(state, levels = c(\"Washington, D.C.\", \"Maryland\", \"Virginia\"))) %>%\n ggplot() + \n geom_line(aes(time, rate, group = state, color = state), show.legend = FALSE) +\n geom_point(aes(x = time, y = rate, color = state)) +\n labs(subtitle = \"Unemployment Rate\") +\n theme(axis.ticks.x = element_blank(),\n axis.title.x = element_blank(),\n axis.ticks.y = element_blank(),\n axis.title.y = element_blank(), \n axis.text.y = element_blank(),\n panel.grid.major.y = element_blank(),\n panel.grid.minor.y = element_blank(),\n panel.grid.major.x = element_blank(),\n axis.line = element_blank()) +\n geom_text_repel(data = october, mapping = aes(x = time, y = rate, label = as.character(rate)), nudge_x = -0.06) + \n 
geom_text_repel(data = august, mapping = aes(x = time, y = rate, label = as.character(rate)), nudge_x = 0.06)"
},
{
"objectID": "graphics-guide.html#univariate",
"href": "graphics-guide.html#univariate",
"title": "R@URBAN",
"section": "Univariate",
"text": "Univariate\n\nThere are a number of ways to explore the distributions of univariate data in R. Some methods, like strip charts, show all data points. Other methods, like the box and whisker plot, show selected data points that communicate key values like the median and 25th percentile. Finally, some methods don’t show any of the underlying data but calculate density estimates. Each method has advantages and disadvantages, so it is worthwhile to understand the different forms. For more information, read 40 years of boxplots by Hadley Wickham and Lisa Stryjewski.\n\nStrip Chart\nStrip charts, the simplest univariate plot, show the distribution of values along one axis. Strip charts work best with variables that have plenty of variation. If not, the points tend to cluster on top of each other. Even if the variable has plenty of variation, it is often important to add transparency to the points with alpha = so overlapping values are visible.\n\nmsleep %>%\n ggplot(aes(x = sleep_total, y = factor(1))) +\n geom_point(alpha = 0.2, size = 5) +\n labs(y = NULL) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 25), \n breaks = 0:5 * 5) +\n scale_y_discrete(labels = NULL) +\n labs(title = \"Total Sleep Time of Different Mammals\",\n x = \"Total sleep time (hours)\",\n y = NULL) +\n theme(axis.ticks.y = element_blank())\n\n\n\n\n\n\nStrip Chart with Highlighting\nBecause strip charts show all values, they are useful for showing where selected points lie in the distribution of a variable. The clearest way to do this is by adding geom_point() twice with filter() in the data argument. 
This way, the highlighted values show up on top of unhighlighted values.\n\nggplot() +\n geom_point(data = filter(msleep, name != \"Red fox\"), \n aes(x = sleep_total, \n y = factor(1)),\n alpha = 0.2, \n size = 5,\n color = \"grey50\") +\n geom_point(data = filter(msleep, name == \"Red fox\"),\n aes(x = sleep_total, \n y = factor(1), \n color = name),\n alpha = 0.8,\n size = 5) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 25), \n breaks = 0:5 * 5) + \n scale_y_discrete(labels = NULL) +\n labs(title = \"Total Sleep Time of Different Mammals\",\n x = \"Total sleep time (hours)\",\n y = NULL) +\n guides(color = guide_legend(title = NULL)) +\n theme(axis.ticks.y = element_blank())\n\n\n\n\n\n\nSubsetted Strip Chart\nAdd a y variable to see the distributions of the continuous variable in subsets of a categorical variable.\n\nlibrary(forcats)\n\nmsleep %>%\n filter(!is.na(vore)) %>%\n mutate(vore = fct_recode(vore, \n \"Insectivore\" = \"insecti\",\n \"Omnivore\" = \"omni\", \n \"Herbivore\" = \"herbi\", \n \"Carnivore\" = \"carni\"\n )) %>%\n ggplot(aes(x = sleep_total, y = vore)) +\n geom_point(alpha = 0.2, size = 5) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 25), \n breaks = 0:5 * 5) + \n labs(title = \"Total Sleep Time of Different Mammals by Diet\",\n x = \"Total sleep time (hours)\",\n y = NULL) +\n theme(axis.ticks.y = element_blank())\n\n\n\n\n\n\nHistograms\nHistograms divide the distribution of a variable into n equal-sized bins and then count and display the number of observations in each bin. Histograms are sensitive to bin width. 
As ?geom_histogram notes, “You should always override [the default binwidth] value, exploring multiple widths to find the best to illustrate the stories in your data.”\n\nggplot(data = diamonds, mapping = aes(x = depth)) + \n geom_histogram(bins = 100) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 100)) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2)), labels = scales::comma) +\n labs(x = \"Depth\",\n y = \"Count\")\n\n\n\n\n\n\nBoxplots\nBoxplots were invented in the 1970s by John Tukey. Instead of showing the underlying data or binned counts of the underlying data, they focus on important values like the 25th percentile, median, and 75th percentile.\n\nInsectSprays %>%\n ggplot(mapping = aes(x = spray, y = count)) +\n geom_boxplot() +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2))) +\n labs(x = \"Type of insect spray\",\n y = \"Number of dead insects\") +\n remove_ticks()\n\n\n\n\n\n\nSmoothed Kernel Density Plots\nContinuous variables with smooth distributions are sometimes better represented with smoothed kernel density estimates than histograms or boxplots. geom_density() computes and plots a kernel density estimate. 
Notice the lumps around integers and halves in the following distribution because of rounding.\n\ndiamonds %>%\n ggplot(mapping = aes(carat)) +\n geom_density(color = NA) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, NA)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2))) +\n labs(x = \"Carat\",\n y = \"Density\")\n\n\n\n\n\ndiamonds %>%\n mutate(cost = ifelse(price > 5500, \"More than $5,500 +\", \"$0 to $5,500\")) %>%\n ggplot(mapping = aes(carat, fill = cost)) +\n geom_density(alpha = 0.25, color = NA) +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, NA)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Carat\",\n y = \"Density\")\n\n\n\n\n\n\nRidgeline Plots\nRidgeline plots are partially overlapping smoothed kernel density plots faceted by a categorical variable that pack a lot of information into one elegant plot.\n\nlibrary(ggridges)\n\nggplot(diamonds, mapping = aes(x = price, y = cut)) +\n geom_density_ridges(fill = \"#1696d2\") +\n labs(x = \"Price\",\n y = \"Cut\")\n\n\n\n\n\n\nViolin Plots\nViolin plots are symmetrical displays of smooth kernel density plots.\n\nInsectSprays %>%\n ggplot(mapping = aes(x = spray, y = count, fill = spray)) +\n geom_violin(color = NA) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2))) +\n labs(x = \"Type of insect spray\",\n y = \"Number of dead insects\") +\n remove_ticks()\n\n\n\n\n\n\nBean Plot\nIndividual outliers and important summary values are not visible in violin plots or smoothed kernel density plots. 
Bean plots, created by Peter Kampstra in 2008, are violin plots with data shown as small lines in a one-dimensional strip plot and larger lines for the mean.\n\nmsleep %>%\n filter(!is.na(vore)) %>%\n mutate(vore = fct_recode(vore, \n \"Insectivore\" = \"insecti\",\n \"Omnivore\" = \"omni\", \n \"Herbivore\" = \"herbi\", \n \"Carnivore\" = \"carni\"\n )) %>%\n ggplot(aes(x = vore, y = sleep_total, fill = vore)) +\n stat_summary(fun.y = \"mean\",\n colour = \"black\", \n size = 30,\n shape = 95,\n geom = \"point\") +\n geom_violin(color = NA) +\n geom_jitter(width = 0,\n height = 0.05,\n alpha = 0.4,\n shape = \"-\",\n size = 10,\n color = \"grey50\") +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2))) + \n labs(x = NULL,\n y = \"Total sleep time (hours)\") +\n theme(legend.position = \"none\") +\n remove_ticks()"
},
{
"objectID": "graphics-guide.html#area-plot",
"href": "graphics-guide.html#area-plot",
"title": "R@URBAN",
"section": "Area Plot",
"text": "Area Plot\n\n\nStacked Area\n\ntxhousing %>%\n filter(city %in% c(\"Austin\",\"Houston\",\"Dallas\",\"San Antonio\",\"Fort Worth\")) %>%\n group_by(city, year) %>%\n summarize(sales = sum(sales)) %>%\n ggplot(aes(x = year, y = sales, fill = city)) +\n geom_area(position = \"stack\") +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(2000, 2015),\n breaks = 2000 + 0:15) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2)), \n labels = scales::comma) +\n labs(x = \"Year\",\n y = \"Home sales\")\n\n\n\n\n\n\nFilled Area\n\ntxhousing %>%\n filter(city %in% c(\"Austin\",\"Houston\",\"Dallas\",\"San Antonio\",\"Fort Worth\")) %>%\n group_by(city, year) %>%\n summarize(sales = sum(sales)) %>%\n ggplot(aes(x = year, y = sales, fill = city)) +\n geom_area(position = \"fill\") +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(2000, 2015),\n breaks = 2000 + 0:15) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.02)),\n breaks = c(0, 0.25, 0.5, 0.75, 1),\n labels = scales::percent) +\n labs(x = \"Year\",\n y = \"Home sales\")"
},
{
"objectID": "graphics-guide.html#heat-map",
"href": "graphics-guide.html#heat-map",
"title": "R@URBAN",
"section": "Heat map",
"text": "Heat map\n\n\nlibrary(fivethirtyeight)\n\nbad_drivers %>%\n filter(state %in% c(\"Maine\", \"New Hampshire\", \"Vermont\", \"Massachusetts\", \"Connecticut\", \"New York\")) %>%\n mutate(`Number of\\nDrivers` = scale(num_drivers),\n `Percent\\nSpeeding` = scale(perc_speeding),\n `Percent\\nAlcohol` = scale(perc_alcohol),\n `Percent Not\\nDistracted` = scale(perc_not_distracted),\n `Percent No\\nPrevious` = scale(perc_no_previous),\n state = factor(state, levels = rev(state))\n ) %>%\n select(-insurance_premiums, -losses, -(num_drivers:losses)) %>%\n gather(`Number of\\nDrivers`:`Percent No\\nPrevious`, key = \"variable\", value = \"SD's from Mean\") %>%\n ggplot(aes(variable, state)) +\n geom_tile(aes(fill = `SD's from Mean`)) +\n labs(x = NULL,\n y = NULL) + \n scale_fill_gradientn() +\n theme(legend.position = \"right\",\n legend.direction = \"vertical\",\n axis.line.x = element_blank(),\n panel.grid.major.y = element_blank()) +\n remove_ticks()\n\n\n\n#https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/"
},
{
"objectID": "graphics-guide.html#faceting-and-small-multiples",
"href": "graphics-guide.html#faceting-and-small-multiples",
"title": "R@URBAN",
"section": "Faceting and Small Multiples",
"text": "Faceting and Small Multiples\n\n\nfacet_wrap()\nR’s faceting system is a powerful way to make “small multiples”.\nSome edits to the theme may be necessary depending upon how many rows and columns are in the plot.\n\ndiamonds %>%\n ggplot(mapping = aes(x = carat, y = price)) +\n geom_point(alpha = 0.05) +\n facet_wrap(~cut, ncol = 5) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(0, 6)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(0, 20000), \n labels = scales::dollar) +\n labs(x = \"Carat\",\n y = \"Price\") +\n scatter_grid()\n\n\n\n\n\n\nfacet_grid()\n\ndiamonds %>%\n filter(color %in% c(\"D\", \"E\", \"F\", \"G\")) %>%\n ggplot(mapping = aes(x = carat, y = price)) +\n geom_point(alpha = 0.05) +\n facet_grid(color ~ cut) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(0, 4)) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0)),\n limits = c(0, 20000), \n labels = scales::dollar) +\n labs(x = \"Carat\",\n y = \"Price\") +\n theme(panel.spacing = unit(20L, \"pt\")) +\n scatter_grid()"
},
{
"objectID": "graphics-guide.html#smoothers",
"href": "graphics-guide.html#smoothers",
"title": "R@URBAN",
"section": "Smoothers",
"text": "Smoothers\n\ngeom_smooth() fits and plots models to data with two or more dimensions.\nUnderstanding and manipulating defaults is more important for geom_smooth() than other geoms because it contains a number of assumptions. geom_smooth() automatically uses loess for datasets with fewer than 1,000 observations and a generalized additive model with formula = y ~ s(x, bs = \"cs\") for datasets with more than 1,000 observations. Both default to a 95% confidence interval with the confidence interval displayed.\nModels are chosen with method = and can be set to lm(), glm(), gam(), loess(), rlm(), and more. Formulas can be specified with formula = and y ~ x syntax. Plotting the standard error is toggled with se = TRUE and se = FALSE, and the confidence level is specified with level =. As always, more information can be seen in RStudio with ?geom_smooth.\ngeom_point() adds a scatterplot to geom_smooth(). The order of the function calls is important. The function called second will be layered on top of the function called first.\n\ndiamonds %>%\n ggplot(mapping = aes(x = carat, y = price)) +\n geom_point(alpha = 0.05) +\n geom_smooth(color = \"#ec008b\") +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 5),\n breaks = 0:5) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 20000), \n labels = scales::dollar) + \n labs(x = \"Carat\",\n y = \"Price\") +\n scatter_grid()\n\n\n\n\ngeom_smooth can be subset by categorical and factor variables. This requires subgroups to have a decent number of observations and a fair amount of variability across the x-axis. 
Confidence intervals often widen at the ends so special care is needed for the chart to be meaningful and readable.\nThis example uses Loess with MPG = displacement.\n\nggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = factor(cyl))) +\n geom_point(alpha = 0.2) +\n geom_smooth() +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 7),\n breaks = 0:7) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 60)) + \n labs(x = \"Engine displacement\",\n y = \"Highway MPG\") +\n scatter_grid()\n\n\n\n\nThis example uses linear models with MPG = displacement.\n\nggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = factor(cyl))) +\n geom_point(alpha = 0.2) +\n geom_smooth(method = \"lm\") +\n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)), \n limits = c(0, 7),\n breaks = 0:7) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 60)) + \n labs(x = \"Engine displacement\",\n y = \"Highway MPG\") +\n scatter_grid()"
},
{
"objectID": "graphics-guide.html#highlighting",
"href": "graphics-guide.html#highlighting",
"title": "R@URBAN",
"section": "Highlighting",
"text": "Highlighting\n\nlibrary(gghighlight) enables the intuitive highlighting of ggplot2 plots. gghighlight modifies existing ggplot2 objects, so no other code should change. All of the highlighting is handled by the function gghighlight(), which can handle all types of geoms.\nWarning: R will throw an error if too many colors are highlighted because of the design of urbnthemes. Simply decrease the number of highlighted geoms to solve this issue.\nThere are two main ways to highlight.\n\nThreshold\nThe first way to highlight is with a threshold. Add a logical test to gghighlight() to describe which lines should be highlighted. Here, lines with maximum change in per-capita Gross Domestic Product greater than $35,000 are highlighted by gghighlight(max(pcgpd_change) > 35000, use_direct_label = FALSE).\n\nlibrary(gghighlight)\nlibrary(gapminder)\n\ndata <- gapminder %>%\n filter(continent %in% c(\"Europe\")) %>%\n group_by(country) %>%\n mutate(pcgpd_change = ifelse(year == 1952, 0, gdpPercap - lag(gdpPercap))) %>%\n mutate(pcgpd_change = cumsum(pcgpd_change))\n \ndata %>%\n ggplot(aes(year, pcgpd_change, group = country, color = country)) +\n geom_line() +\n gghighlight(max(pcgpd_change) > 35000, use_direct_label = FALSE) + \n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)),\n breaks = c(seq(1950, 2010, 10)),\n limits = c(1950, 2010)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n breaks = 0:8 * 5000,\n labels = scales::dollar,\n limits = c(0, 40000)) +\n labs(x = \"Year\",\n y = \"Change in per-capita GDP (US dollars)\")\n\n\n\n\n\n\nRank\nThe second way to highlight is by rank. 
Here, the countries with the five highest values for change in per-capita Gross Domestic Product are highlighted with gghighlight(max(pcgpd_change), max_highlight = 5, use_direct_label = FALSE).\n\ndata %>%\n ggplot(aes(year, pcgpd_change, group = country, color = country)) +\n geom_line() +\n gghighlight(max(pcgpd_change), max_highlight = 5, use_direct_label = FALSE) + \n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)),\n breaks = c(seq(1950, 2010, 10)),\n limits = c(1950, 2010)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n breaks = 0:8 * 5000,\n labels = scales::dollar,\n limits = c(0, 40000)) +\n labs(x = \"Year\",\n y = \"Change in per-capita GDP (US dollars)\")\n\n\n\n\n\n\nFaceting\ngghighlight() works well with ggplot2’s faceting system.\n\ndata %>%\n ggplot(aes(year, pcgpd_change, group = country)) +\n geom_line() +\n gghighlight(max(pcgpd_change), max_highlight = 4, use_direct_label = FALSE) + \n scale_x_continuous(expand = expand_scale(mult = c(0.002, 0)),\n breaks = c(seq(1950, 2010, 10)),\n limits = c(1950, 2010)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n breaks = 0:8 * 5000,\n labels = scales::dollar,\n limits = c(0, 40000)) +\n labs(x = \"Year\",\n y = \"Change in per-capita GDP (US dollars)\") +\n facet_wrap(~ country) +\n theme(panel.spacing = unit(20L, \"pt\"))"
},
{
"objectID": "graphics-guide.html#text-and-annotation",
"href": "graphics-guide.html#text-and-annotation",
"title": "R@URBAN",
"section": "Text and Annotation",
"text": "Text and Annotation\n\nSeveral functions can be used to annotate, label, and highlight different parts of plots. geom_text() and geom_text_repel() both display variables from data frames. annotate(), which has several different uses, displays variables and values included in the function call.\n\ngeom_text()\ngeom_text() turns text variables in data sets into geometric objects. This is useful for labeling data in plots. Both functions need x values and y values to determine placement on the coordinate plane, and a text vector of labels.\nThis can be used to label geom_bar().\n\ndiamonds %>%\n group_by(cut) %>%\n summarize(price = mean(price)) %>%\n ggplot(aes(cut, price)) +\n geom_bar(stat = \"identity\") +\n geom_text(aes(label = scales::dollar(price)), vjust = -1) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2)),\n labels = scales::dollar) +\n labs(title = \"Average Diamond Price by Diamond Cut\",\n x = \"Cut\",\n y = \"Price\") +\n remove_ticks()\n\n\n\n\nIt can also be used to label points in a scatter plot.\nIt’s rarely useful to label every point in a scatter plot. Use filter() to create a second data set that is subsetted and pass it into the labelling function.\n\nlabels <- mtcars %>%\n rownames_to_column(\"model\") %>%\n filter(model %in% c(\"Toyota Corolla\", \"Merc 240D\", \"Datsun 710\"))\n\nmtcars %>%\n ggplot() +\n geom_point(mapping = aes(x = wt, y = mpg)) +\n geom_text(data = labels, mapping = aes(x = wt, y = mpg, label = model), nudge_x = 0.38) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 6)) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 40)) + \n labs(x = \"Weight (Tons)\",\n y = \"Miles per gallon (MPG)\") +\n scatter_grid()\n\n\n\n\nText too often overlaps with other text or geoms when using geom_text(). library(ggrepel) is a library(ggplot2) add-on that automatically positions text so it doesn’t overlap with geoms or other text. 
To add this functionality, install and load library(ggrepel) and then use geom_text_repel() with the same syntax as geom_text().\n\n\ngeom_text_repel()\n\nlibrary(ggrepel)\n\nlabels <- mtcars %>%\n rownames_to_column(\"model\") %>%\n top_n(5, mpg)\n\nmtcars %>%\n ggplot(mapping = aes(x = wt, y = mpg)) +\n geom_point() +\n geom_text_repel(data = labels, \n mapping = aes(label = model), \n nudge_x = 0.38) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 6)) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 40)) + \n labs(x = \"Weight (Tons)\",\n y = \"Miles per gallon (MPG)\") +\n scatter_grid()\n\n\n\n\n\n\nannotate()\nannotate() doesn’t use data frames. Instead, it takes values for x = and y =. It can add text, rectangles, segments, and pointrange.\n\nmsleep %>%\n filter(bodywt <= 1000) %>%\n ggplot(aes(bodywt, sleep_total)) +\n geom_point() +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(-10, 1000),\n labels = scales::comma) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 25)) + \n annotate(\"text\", x = 500, y = 12, label = \"These data suggest that heavy \\n animals sleep less than light animals\") +\n labs(x = \"Body weight (pounds)\",\n y = \"Sleep time (hours)\") +\n scatter_grid() \n\n\n\n\n\nlibrary(AmesHousing)\n\names <- make_ames()\n\names %>%\n mutate(square_footage = Total_Bsmt_SF - Bsmt_Unf_SF + First_Flr_SF + Second_Flr_SF) %>%\n mutate(Sale_Price = Sale_Price / 1000) %>% \n ggplot(aes(square_footage, Sale_Price)) +\n geom_point(alpha = 0.2) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(-10, 12000),\n labels = scales::comma) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.002)),\n limits = c(0, 800),\n labels = scales::dollar) + \n annotate(\"rect\", xmin = 6800, xmax = 11500, ymin = 145, ymax = 210, alpha = 0.1) +\n annotate(\"text\", x = 8750, y = 230, label = \"Unfinished 
homes\") +\n labs(x = \"Square footage\", \n y = \"Sale price (thousands)\") +\n scatter_grid()"
},
{
"objectID": "graphics-guide.html#layered-geoms",
"href": "graphics-guide.html#layered-geoms",
"title": "R@URBAN",
"section": "Layered Geoms",
"text": "Layered Geoms\n\nGeoms can be layered in ggplot2. This is useful for design and analysis.\nIt is often useful to add points to line plots with a small number of values across the x-axis. This example from R for Data Science shows how changing the line to grey can be appealing.\n\nDesign\n\nBefore\n\ntable1 %>%\n ggplot(aes(x = year, y = cases)) +\n geom_line(aes(color = country)) +\n geom_point(aes(color = country)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2)), \n labels = scales::comma) +\n scale_x_continuous(breaks = c(1999, 2000)) +\n labs(title = \"Changes in Tuberculosis Cases in Three Countries\")\n\n\n\n\n\n\nAfter\n\ntable1 %>%\n ggplot(aes(year, cases)) +\n geom_line(aes(group = country), color = \"grey50\") +\n geom_point(aes(color = country)) +\n scale_y_continuous(expand = expand_scale(mult = c(0, 0.2)), \n labels = scales::comma) +\n scale_x_continuous(breaks = c(1999, 2000)) +\n labs(title = \"Changes in Tuberculosis Cases in Three Countries\")\n\n\n\n\nLayering geoms is also useful for adding trend lines and centroids to scatter plots.\n\n# Simple line\n# Regression model\n# Centroids\n\n\n\n\nCentroids\n\nmpg_summary <- mpg %>%\n group_by(cyl) %>%\n summarize(displ = mean(displ), cty = mean(cty))\n\nmpg %>%\n ggplot() +\n geom_point(aes(x = displ, y = cty, color = factor(cyl)), alpha = 0.5) +\n geom_point(data = mpg_summary, aes(x = displ, y = cty), size = 5, color = \"#ec008b\") +\n geom_text(data = mpg_summary, aes(x = displ, y = cty, label = cyl)) +\n scale_x_continuous(expand = expand_scale(mult = c(0, 0.002)), \n limits = c(0, 8)) + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0)), \n limits = c(0, 40)) +\n labs(x = \"Displacement\",\n y = \"City MPG\") +\n scatter_grid()"
},
{
"objectID": "graphics-guide.html#saving-plots",
"href": "graphics-guide.html#saving-plots",
"title": "R@URBAN",
"section": "Saving Plots",
"text": "Saving Plots\n\nggsave() exports ggplot2 plots. The function can be used in two ways. If plot = isn’t specified in the function call, then ggsave() automatically saves the plot that was last displayed in the Viewer window. Second, if plot = is specified, then ggsave() saves the specified plot. ggsave() guesses the type of graphics device to use in export (.png, .pdf, .svg, etc.) from the file extension in the filename.\nmtcars %>%\n ggplot(aes(x = wt, y = mpg)) +\n geom_point()\n\nggsave(filename = \"cars.png\")\n\nplot2 <- mtcars %>%\n ggplot(aes(x = wt, y = mpg)) +\n geom_point()\n\nggsave(filename = \"cars.png\", plot = plot2)\nExported plots rarely look identical to the plots that show up in the Viewer window in RStudio because the overall size and aspect ratio of the Viewer is often different than the defaults for ggsave(). Specific sizes, aspect ratios, and resolutions can be controlled with arguments in ggsave(). RStudio has a useful cheatsheet called “How Big is Your Graph?” that should help with choosing the best size, aspect ratio, and resolution.\nFonts are not embedded in PDFs by default. To embed fonts in PDFs, include device = cairo_pdf in ggsave().\nplot <- mtcars %>%\n ggplot(aes(x = wt, y = mpg)) +\n geom_point()\n\nggsave(filename = \"cars.pdf\", plot = plot2, width = 6.5, height = 4, device = cairo_pdf)"
},
{
"objectID": "graphics-guide.html#urbnthemes",
"href": "graphics-guide.html#urbnthemes",
"title": "R@URBAN",
"section": "urbnthemes",
"text": "urbnthemes\n\nOverview\nurbnthemes is a set of tools for creating Urban Institute-themed plots and maps in R. The package extends ggplot2 with print and map themes as well as tools that make plotting easier at the Urban Institute. urbnthemes replaces the urban_R_theme.\nAlways load library(urbnthemes) after library(ggplot2) or library(tidyverse).\n\n\nUsage\nUse set_urbn_defaults(style = \"print\") to set the default styles. scatter_grid(), remove_ticks(), add_axis(), and remove_axis() can all be used to improve graphics.\n\nlibrary(ggplot2)\nlibrary(urbnthemes)\n\nset_urbn_defaults(style = \"print\")\n\nggplot(data = mtcars, mapping = aes(factor(cyl))) +\n geom_bar() + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Number of Cylinders\",\n y = \"Count\") +\n remove_ticks()\n\n\n\n\n\n\nCombining elements\nlibrary(urbnthemes) contains functions for combining plot elements into graphics. urbn_plot() brings all of the elements together.\n\nurbn_logo_text()\nremove_ticks()\nremove_axis()\nscatter_grid()\nadd_axis()\nurbn_geofacet\n\n\nlibrary(ggplot2)\nlibrary(urbnthemes)\n\nset_urbn_defaults(style = \"print\")\n\nplot <- ggplot(data = mtcars, mapping = aes(factor(cyl))) +\n geom_bar() + \n scale_y_continuous(expand = expand_scale(mult = c(0, 0.1))) +\n labs(x = \"Number of Cylinders\",\n y = \"Count\") +\n remove_ticks()\n\nurbn_plot(plot, urbn_logo_text(), ncol = 1, heights = c(30, 1))\n\n\n\n\nSometimes it’s important to horizontally add the y-axis title above the plot. urbn_y_title() can be sued for this task. 
The following example goes one step further and adds the title between the legend and the plot.\n\nlibrary(ggplot2)\nlibrary(urbnthemes)\n\nset_urbn_defaults()\n\nplot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +\n geom_point() + \n scale_x_continuous(expand = c(0, 0),\n limits = c(0, 8)) +\n scale_y_continuous(expand = c(0, 0),\n limits = c(0, 40)) +\n remove_ticks() +\n labs(\"\") +\n scatter_grid()\n\nurbn_plot(get_legend(plot),\n urbn_y_title(\"Miles per gallon\"),\n remove_legend(plot), \n urbn_logo_text(), \n ncol = 1, \n heights = c(3, 1, 30, 1))\n\n\n\n\n\n\nPalettes\nurbnthemes contains many quick-access color palettes from the Urban Institute Data Visualization Style Guide. These palettes can be used to quickly overwrite default color palettes from urbnthemes.\n\npalette_urbn_main is the eight color discrete palette of the Urban Institute with cyan, yellow, black, gray, magenta, green, space gray, and red.\npalette_urbn_diverging is an eight color diverging palette.\npalette_urbn_quintile is a five color blue palette that is good for quintiles.\npalette_urbn_politics is a two color palette with blue for Democrats and red for Republicans.\n\nThere are seven palettes that are continuous palettes of the seven unique colors in the discrete Urban Institute color palette:\n\npalette_urbn_cyan\npalette_urbn_gray\npalette_urbn_yellow\npalette_urbn_magenta\npalette_urbn_green\npalette_urbn_spacegray\npalette_urbn_red\n\nUse view_palette() to see the palette:\n\nview_palette(palette_urbn_magenta)\n\n[1] \"c(#351123, #761548, #af1f6b, #e90989, #e54096, #e46aa7, #eb99c2, #f5cbdf)\"\n\n\n\n\n\nThe vectors can be subset using base R syntax. 
This allows for the quick selection of specific colors from a palette.\n\npalette_urbn_main[1:4]\n\n cyan yellow black gray \n\"#1696d2\" \"#fdbf11\" \"#000000\" \"#d2d2d2\" \n\n\n\npalette_urbn_spacegray[1:5]\n\n[1] \"#d5d5d4\" \"#adabac\" \"#848081\" \"#5c5859\" \"#332d2f\"\n\n\n\n\nUtility functions\nlibrary(urbnthemes) contains four functions that are helpful with managing font installations:\n\nlato_test()\nlato_install()\nfontawesome_test()\nfontawesome_install()"
},
{
"objectID": "graphics-guide.html#bibliography-and-session-information",
"href": "graphics-guide.html#bibliography-and-session-information",
"title": "R@URBAN",
"section": "Bibliography and Session Information",
"text": "Bibliography and Session Information\n\nNote: Examples present in this document by Aaron Williams were created during personal time.\nBob Rudis and Dave Gandy (2017). waffle: Create Waffle Chart Visualizations in R. R package version 0.7.0. https://CRAN.R-project.org/package=waffle\nChester Ismay and Jennifer Chunn (2017). fivethirtyeight: Data and Code Behind the Stories and Interactives at ‘FiveThirtyEight’. R package version 0.3.0. https://CRAN.R-project.org/package=fivethirtyeight\nHadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.\nHadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse\nHadley Wickham (2017). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.2.0. https://CRAN.R-project.org/package=forcats\nJennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder\nKamil Slowikowski (2017). ggrepel: Repulsive Text and Label Geoms for ‘ggplot2’. R package version 0.7.0. https://CRAN.R-project.org/package=ggrepel\nMax Kuhn (2017). AmesHousing: The Ames Iowa Housing Data. R package version 0.0.3. https://CRAN.R-project.org/package=AmesHousing\nPeter Kampstra (2008). Beanplot: A Boxplot Alternative for Visual Comparison of Distributions, Journal of Statistical Software, 2008. https://www.jstatsoft.org/article/view/v028c01\nR Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.\nWinston Chang, (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont\nYihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. 
R package version 1.19.\n\nsessionInfo()\n\nR version 4.1.2 (2021-11-01)\nPlatform: x86_64-apple-darwin17.0 (64-bit)\nRunning under: macOS Big Sur 10.16\n\nMatrix products: default\nBLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib\nLAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib\n\nlocale:\n[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n\nattached base packages:\n[1] stats graphics grDevices utils datasets methods base \n\nother attached packages:\n [1] AmesHousing_0.0.4 gghighlight_0.3.2 fivethirtyeight_0.6.2\n [4] ggridges_0.5.3 ggrepel_0.9.1 gapminder_0.3.0 \n [7] urbnthemes_0.0.2 forcats_0.5.1 stringr_1.4.0 \n[10] dplyr_1.0.8 purrr_0.3.4 readr_2.1.1 \n[13] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 \n[16] tidyverse_1.3.1 knitr_1.36 \n\nloaded via a namespace (and not attached):\n [1] httr_1.4.2 splines_4.1.2 bit64_4.0.5 \n [4] vroom_1.5.7 jsonlite_1.7.2 modelr_0.1.8 \n [7] assertthat_0.2.1 cellranger_1.1.0 yaml_2.2.1 \n[10] lattice_0.20-45 urbnmapr_0.0.0.9002 Rttf2pt1_1.3.9 \n[13] pillar_1.7.0 backports_1.4.0 glue_1.6.2 \n[16] extrafontdb_1.0 digest_0.6.29 rvest_1.0.2 \n[19] colorspace_2.0-2 Matrix_1.3-4 htmltools_0.5.2 \n[22] plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.10 \n[25] haven_2.4.3 scales_1.1.1 tzdb_0.2.0 \n[28] mgcv_1.8-38 generics_0.1.2 farver_2.1.0 \n[31] ellipsis_0.3.2 withr_2.4.3 cli_3.2.0 \n[34] magrittr_2.0.3 crayon_1.5.1 readxl_1.4.0.9000 \n[37] evaluate_0.14 fs_1.5.1 fansi_1.0.3 \n[40] nlme_3.1-153 xml2_1.3.3 tools_4.1.2 \n[43] hms_1.1.1 lifecycle_1.0.1 munsell_0.5.0 \n[46] reprex_2.0.1 compiler_4.1.2 rlang_1.0.2 \n[49] grid_4.1.2 rstudioapi_0.13 htmlwidgets_1.5.4 \n[52] labeling_0.4.2 rmarkdown_2.11 gtable_0.3.0 \n[55] DBI_1.1.1 R6_2.5.1 gridExtra_2.3 \n[58] lubridate_1.8.0 fastmap_1.1.0 bit_4.0.4 \n[61] extrafont_0.17 utf8_1.2.2 stringi_1.7.6 \n[64] parallel_4.1.2 Rcpp_1.0.8 vctrs_0.4.1 \n[67] dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.28"
},
{
"objectID": "mapping.html#geospatial-workflow",
"href": "mapping.html#geospatial-workflow",
"title": "R@URBAN",
"section": "Geospatial Workflow",
"text": "Geospatial Workflow\nThis picture below outlines what we think are the main steps in a geospatial workflow. This guide will be split into sections describing each of the steps."
},
{
"objectID": "mapping.html#should-this-be-a-map",
"href": "mapping.html#should-this-be-a-map",
"title": "R@URBAN",
"section": "Should this be a map?",
"text": "Should this be a map?\nThe Urban Institute Data Visualization Style Guide offers some blunt but useful suggestions for maps:\n\nJust because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.\n\nSo we would encourage you to think critically before making a map."
},
{
"objectID": "mapping.html#why-map-with-r",
"href": "mapping.html#why-map-with-r",
"title": "R@URBAN",
"section": "Why map with R?",
"text": "Why map with R?\nR can have a steeper learning curve than point-and-click tools - like QGIS or ArcGIS - for geospatial analysis and mapping. But creating maps in R has many advantages including:\n\nReproducibility: By creating maps with R code, you can easily share the outputs and the code that generated the output with collaborators, allowing them to replicate your work and catch errors easily.\nIteration: With point and click software like ArcGIS, making 50 maps would be 50 times the work/time. But using R, we can easily make make many iterations of the same map with a few changes to the code.\nEasy Updates: Writing code provides a roadmap for others (and future you!) to quickly update parts of the map as needed. Say for example a collaborator wanted to change the legend colors of 50 state maps. With R, this is possible in just a few seconds!\nAn Expansive ecosystem: There are several R packages that make it very easy to get spatial data, create static and interactive maps, and perform spatial analyses. This feature rich package ecosystem which all play nice together is frankly unmatched by other programming languages and even point and click tools like QGIS and ArcGIS. Some of these R packages include:\n\nsf: For managing and analyzing spatial dataframes\ntigris: For downloading in Census geographies\nggplot2: For making publication ready static maps\nurbnmapr: For automatically adding Urban styling to static maps\nmapview: For making expxploratory interactive maps\n\nCost: Most point-and-click tools for geospatial analysis are proprietary and expensive. R is free open-source software. The software and most of its packages can be used for free by anyone for almost any use case."
},
{
"objectID": "mapping.html#helpful-learning-resources",
"href": "mapping.html#helpful-learning-resources",
"title": "R@URBAN",
"section": "Helpful Learning Resources",
"text": "Helpful Learning Resources\nIn addition to this guide, you may want to look at these other helpful resources:\n\nThe Urban Institute mapping training series (with video lectures and notes)\nChapters 5, 6, and 7 from Kyle Walker’s Analyzing US Census Data book.\nAndrew Heiss’ fantastic mapping guide\nAll of the vignettes for the sf package\nGeocomputation with R: A book by Robin Lovelace and others\nUChicago’s R Spatial Workshops: https://spatialanalysis.github.io/tutorials/"
},
{
"objectID": "mapping.html#librarysf",
"href": "mapping.html#librarysf",
"title": "R@URBAN",
"section": "library(sf)",
"text": "library(sf)\n\nThe short version\nlibrary(sf) stores geospatial data, which are points (a single longitude/latitude), lines (a pair of connected points), or polygons (a collection of points which make a polygon) in a geometry column within R dataframes\n\nThis is what sf dataframe looks like in the console:\n\ndc_parks <- st_read(\"mapping/data/dc_parks.geojson\", \n quiet = TRUE)\n\n# Print just the NAME and geometry column\ndc_parks %>%\n select(NAME) %>%\n head(2)\n\nSimple feature collection with 2 features and 1 field\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: -77.01063 ymin: 38.81718 xmax: -76.9625 ymax: 38.89723\nGeodetic CRS: WGS 84\n NAME geometry\n1 Kingman and Heritage Islands MULTIPOLYGON (((-76.96566 3...\n2 Bald Eagle Hill MULTIPOLYGON (((-77.01063 3...\n\n\n\n\nThe long version\nThe sf library is a key tool for reading in, managing, and working with spatial data in R. sf stands for simple features (not San Francisco you Bay Area folks) and denotes a way to describe the spatial attributes of real life objects. The R object you will be working with most frequently for mapping is an sf dataframe. An sf dataframe is essentially a regular R dataframe, with a couple of extra features for use in mapping. These extra features exclusive to sf dataframes include:\n\nsticky geometry columns\nattached coordinate reference systems\nsome other spatial metadata\n\nThe most important of the above list is the sticky geometry column, which is a magical column that contains all of the geographic information for each row of data. Say for example you had a sf dataframe of all DC census tracts. Then the geometry column would contain all of the geographic points used to define DC census tract polygons. The stickiness of this column means that no matter what data munging/filtering you do, you will not be able to drop or delete the geometry column. 
Below is a graphic to help you understand this:\n\ncredits: @allisonhorst\nThis is what an sf dataframe looks like in the console:\n\n# Read in spatial data about DC parks from DC Open Data Portal\ndc_parks <- st_read(\"https://opendata.arcgis.com/api/v3/datasets/287eaa2ecbff4d699762bbc6795ffdca_9/downloads/data?format=geojson&spatialRefId=4326\",\n quiet = TRUE)\n\n# dc_parks <- st_read(\"mapping/data/dc_parks.geojson\")\n\n# Select just a few columns for readability\ndc_parks <- dc_parks %>%\n select(NAME, geometry)\n\n# Print to the console\ndc_parks\n\nSimple feature collection with 256 features and 1 field\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: -77.11113 ymin: 38.81718 xmax: -76.91108 ymax: 38.98811\nGeodetic CRS: WGS 84\nFirst 10 features:\n NAME geometry\n1 Plymouth Circle MULTIPOLYGON (((-77.04677 3...\n2 Triangle Park RES 0566 MULTIPOLYGON (((-77.04481 3...\n3 Shepherd Field MULTIPOLYGON (((-77.03528 3...\n4 Marvin Caplan Memorial Park MULTIPOLYGON (((-77.03027 3...\n5 Pinehurst Circle MULTIPOLYGON (((-77.06643 3...\n6 Triangle Park 3278 0801 MULTIPOLYGON (((-77.01759 3...\n7 Fort Stevens MULTIPOLYGON (((-77.02988 3...\n8 Takoma Recreation Center MULTIPOLYGON (((-77.01794 3...\n9 Takoma Community Center MULTIPOLYGON (((-77.01716 3...\n10 Triangle Park RES 0648 MULTIPOLYGON (((-77.03362 3...\n\n\nNote that there is some spatial metadata such as the Geometry Type, Bounding Box, and CRS which shows up as a header before the actual contents of the dataframe.\nSince sf dataframes operate similarly to regular dataframes, we can use all our familiar tidyverse functions for data wrangling, including select, filter, rename, mutate, group_by and summarize. The sf package also has many functions that provide easy ways to replicate common tasks done in other GIS software like spatial joins, clipping, and buffering. Almost all of the mapping and geospatial analysis methods described in this guide rely on you having an sf dataframe. 
So let’s talk about how to get one!"
},
{
"objectID": "mapping.html#importing-spatial-data",
"href": "mapping.html#importing-spatial-data",
"title": "R@URBAN",
"section": "Importing spatial data",
"text": "Importing spatial data\nGetting an sf dataframe is always the first step in the geospatial workflow. Here’s how to import spatial data for…\n\nStates and counties\nWe highly recommend using the library(urbnmapr) package, which was created by folks here at Urban to easily create state and county level maps. The get_urbn_map() function in the package allows you to read in spatial data on states and counties, with options to include territories. Importantly, it will also display AL and HI as insets on the map in accordance with the Urban Institute Data Visualization Style Guide. For information on how to install urbnmapr, see the GitHub repository.\nBelow is an example of how you would use urbnmapr to get an sf dataframe of all the states or counties in the US.\n\nlibrary(urbnmapr)\n\n# Get state data\nstates <- get_urbn_map(\"states\", sf = TRUE)\n\n# Can also get county data\ncounties <- get_urbn_map(\"counties\", sf = TRUE)\n\n\n\nOther Census geographies\nUse the library(tigris) package, which allows you to easily download TIGER and other cartographic boundaries from the US Census Bureau. In order to automatically load in the boundaries as sf objects, run once per R session.\nlibrary(tigris) has all the standard census geographies, including census tracts, counties, CBSAs, ZCTAs, congressional districts, tribal areas, and more. It also includes other elements such as water, roads, and military bases.\nBy default, libraray(tigris) will download large very large and detailed TIGER line boundary files. For thematic mapping, the smaller cartographic boundary files are a better choice, as they are clipped to the shoreline, generalized, and therefore usually smaller in size without losing too much accuracy. To load cartographic boundaries, use the cb = TRUE argument. 
If you are doing detailed geospatial analysis and need the most detailed shapefiles, then you should use the detailed TIGER line boundary files and set cb = FALSE.\nBelow is an example of how you would use library(tigris) to get an sf dataframe of all Census tracts in DC for 2019.\n\nlibrary(tigris)\n\n# Only need to set once per script\noptions(tigris_class = \"sf\")\n\ndc_tracts <- tracts(\n state = \"DC\",\n cb = TRUE,\n year = 2019\n)\n\nUnlike library(urbnmapr), different functions are used to get geographic data for different geographic levels. For instance, the blocks() function will load census block data, and the tracts() function will load tract data. Other functions include block_groups(), zctas(), and core_based_statistical_areas(). For the full list of supported geographies and functions, see the package vignette.\nFor folks interested in pulling in Census demographic information along with Census geographies, we recommend checking out the sister package to library(tigris): library(tidycensus). That package allows you to download Census variables and Census geographic data simultaneously.\n\n\nCountries\nWe recommend using the library(rnaturalearth) package, which is similar to library(tigris) but allows you to download and use boundaries beyond the US. Instead of setting class to sf one time per session as we did with library(tigris), you must set the returnclass = \"sf\" argument each time you use a function from the package. Below is an example of downloading an sf dataframe of all the countries in the world.\n\nlibrary(rnaturalearth)\n\nworld <- ne_countries(returnclass = \"sf\")\n\nggplot() +\n geom_sf(data = world, mapping = aes())\n\n\n\nYour own files\n\nShapefiles/GeoJSONS\nShapefiles and GeoJSONs are 2 common spatial file formats you will find out in the wild. library(sf) has a function called st_read which allows you to easily read in these files as sf dataframes. The only required argument is dsn or data source name. 
This is the filepath of the .shp file or the .geojson file on your local computer. For geojsons, dsn can also be a URL.\nBelow is an example of reading in a shapefile of fire stations in DC which is stored in mapping/data/shapefiles/. Note that shapefiles are actually stored as 6+ different files inside a folder. You need to provide the filepath to the file ending in .shp.\n\nlibrary(sf)\n\n# Print out all files in the directory\nlist.files(\"mapping/data/shapefiles\")\n\n[1] \"Fire_Stations.cpg\" \"Fire_Stations.dbf\" \"Fire_Stations.prj\"\n[4] \"Fire_Stations.shp\" \"Fire_Stations.shx\" \"Fire_Stations.xml\"\n\n# Read in .shp file\ndc_firestations <- st_read(\n dsn = \"mapping/data/shapefiles/Fire_Stations.shp\",\n quiet = TRUE\n)\n\nAnd now dc_firestations is an sf dataframe you can use for all your mapping needs! st_read supports reading in a wide variety of other spatial file formats, including geodatabases, KML files, and over 200 others. For an incomplete list, please see this sf vignette.\n\n\nCSVs or dataframes with lat/lons\nIf you have a CSV with geographic information stored in columns, you will need to read in the CSV as a regular R dataframe and then convert to an sf dataframe. library(sf) contains the st_as_sf() function for converting regular R dataframes into an sf dataframe. The two arguments you must specify for this function are:\n\ncoords: A length 2 vector with the names of the columns corresponding to longitude and latitude (in that order!). For example, c(\"lon\", \"lat\").\ncrs: The CRS (coordinate reference system) for your longitude/latitude coordinates. Remember you need to specify both the authority and the SRID code, for example (“EPSG:4326”). 
For more information on finding and setting CRS codes, please see the CRS section.\n\nBelow is an example of reading in data from a CSV and converting it to an sf dataframe.\n\nlibrary(sf)\n\n# Read in dataset of state capitals which is stored as a csv\nstate_capitals <- read_csv(\"mapping/data/state-capitals.csv\")\n\nstate_capitals <- state_capitals %>%\n # Specify names of the lon/lat columns in the CSV to use to make geometry col\n st_as_sf(\n coords = c(\"longitude\", \"latitude\"),\n crs = 4326\n )\n\nA common mistake is forgetting to drop rows that have NA values for latitude or longitude before converting to an sf dataframe. If your data contains NA values in those columns, then the st_as_sf() function will throw an error."
},
{
"objectID": "mapping.html#appending-spatial-info-to-your-data",
"href": "mapping.html#appending-spatial-info-to-your-data",
"title": "R@URBAN",
"section": "Appending spatial info to your data",
"text": "Appending spatial info to your data\nOftentimes, the data you are working with will just have state or county identifiers - like FIPS codes or state abbreviations - but will not contain any geographic information. In this case, you must do the extra work of downloading in the geographic data as an sf dataframe and then joining your non-spatial data to the spatial data. Generally this involves 3 steps:\n\nReading in your own data as a data frame\nReading in the geographic data as an sf dataframe\nUsing left_join to merge the geographic data with your own non spatial data and create a new expanded sf dataframe\n\nLet’s say we had a dataframe on CHIP enrollment by state with state abbreviations.\n\n# read the state CHIP data\nchip_by_state <- read_csv(\"mapping/data/chip-enrollment.csv\") %>%\n # clean column names so there are no random spaces/uppercase letters\n janitor::clean_names()\n\n# print to the console\nchip_by_state %>% head()\n\n# A tibble: 6 × 3\n state chip_enrollment state_abbreviation\n <chr> <dbl> <chr> \n1 Alabama 150040 AL \n2 Alaska 15662 AK \n3 Arizona 88224 AZ \n4 Arkansas 120863 AR \n5 California 2022213 CA \n6 Colorado 167227 CO \n\n\nIn order to convert this to an sf dataframe, we need to read in the spatial boundaries for each state and append it to our dataframe. 
Here is how we do that with get_urbn_map() and left_join().\n\nlibrary(urbnmapr)\n\n# read in state geographic data from urbnmapr\nstates <- get_urbn_map(map = \"states\", sf = TRUE)\n\n# left join state geographies to chip data\nchip_with_geographies <- states %>%\n left_join(\n chip_by_state,\n # Specify the join columns, which are named slightly differently in states\n # and chip_by_state respectively\n by = c(\"state_abbv\" = \"state_abbreviation\")\n )\n\nchip_with_geographies %>%\n select(state_fips, state_abbv, chip_enrollment)\n\nSimple feature collection with 51 features and 3 fields\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: -2600000 ymin: -2363000 xmax: 2516374 ymax: 732352.2\nProjected CRS: NAD27 / US National Atlas Equal Area\nFirst 10 features:\n state_fips state_abbv chip_enrollment geometry\n1 01 AL 150040 MULTIPOLYGON (((1150023 -15...\n2 04 AZ 88224 MULTIPOLYGON (((-1386136 -1...\n3 08 CO 167227 MULTIPOLYGON (((-786661.9 -...\n4 09 CT 25551 MULTIPOLYGON (((2156197 -83...\n5 12 FL 374884 MULTIPOLYGON (((1953691 -20...\n6 13 GA 232050 MULTIPOLYGON (((1308636 -10...\n7 16 ID 35964 MULTIPOLYGON (((-1357097 78...\n8 18 IN 114927 MULTIPOLYGON (((1042064 -71...\n9 20 KS 79319 MULTIPOLYGON (((-174904.2 -...\n10 22 LA 161565 MULTIPOLYGON (((1075669 -15..."
},
{
"objectID": "mapping.html#crs",
"href": "mapping.html#crs",
"title": "R@URBAN",
"section": "Coordinate Reference Systems",
"text": "Coordinate Reference Systems\n\nThe short version\nJust watch this video and know the following:\n\nAll spatial data has a CRS, which specifies how to identify a location on earth.\nIt’s important that all spatial datasets you are working with be in the same CRS. You can find the CRS with st_crs() and change the CRS with st_transform().\nThe Urban Institute Style Guide requires the use of the Atlas Equal Earth Projection (\"ESRI:102003\") for national maps. For state and local maps, use this handy guide to find an appropriate State Plane projection.\n\n\n\nThe long version\nCoordinate reference systems (CRS) specify the 3d shape of the earth and optionally how we project that 3d shape onto a 2d surface. They are an important part of working with spatial data as you need to ensure that all the data you are working with are in the same CRS in order for spatial operations and maps to be accurate.\nCRS can be specified either by name (ie Maryland State Plane) or Spatial Reference System IDentifier (SRID). THe SRID is a numeric identifier that uniquely identifies a coordinate reference system. Generally when referring to an SRID, you need to refer to an authority (ie the data source) and a unique ID. An example is EPSG:26985 which refers to the Maryland State plane projection from the EPSG, or ESRI:102003 which refers to the Atlas Equal Area projection from ESRI. Most CRS codes will be from the EPSG, and some from ESRI and others. A good resource for finding/validating CRS codes is epsg.io.\nSidenote - EPSG stands for the now defunct European Petroleum Survey Group. And while oil companies have generally been terrible for the earth, the one nice thing they did for the earth was to set up common standards for coordinate reference systems.\nYou might be thinking well isn’t the earth just a sphere? Why do we need all this complicated stuff? 
And the answer is well the earth is kind of a sphere, but it’s really more of a misshapen ellipsoid which is pudgier at the equator than at the poles. To visualize how coordinate reference systems work, imagine that the earth is a (lumpy) orange. Now peel the skin off an orange and try to flatten it. There are many ways to do it, but all will create distortions of some kind. The CRS will give us the formula we’ve used to specify the shape of the orange (usually a sphere or ellipsoid of some kind) and optionally, specify how we flattened the orange into 2d.\nBroadly, there are two kinds of Coordinate Reference Systems:\n\nGeographic coordinate systems\n\n(sometimes called unprojected coordinate systems)\nSpecifies a 3d shape for the earth\nUses a spheroid/ellipsoid to approximate shape of the earth\nUsually use decimal degree units (ie latitude/longitude) to identify locations on earth\n\n\n\n\nProjected coordinate systems\n\nSpecifies a 3d shape for the earth + a 2d mapping\n\nIs a geographic coordinate system + a projection\n\ncredit: xkcd\nprojection: mathematical formula used to convert a 3d coordinate system to a 2d flat coordinate system\nMany different kinds of projections, including Equal Area, Equidistant, Conformal, etc\nAll projections distort the true shape of the earth in some way, either in terms of shape, area, or angle. Required xkcd comic\nUsually use linear units (ie feet, meters) and therefore useful for distance based spatial operations (ie creating buffers)"
},
{
"objectID": "mapping.html#finding-the-crs",
"href": "mapping.html#finding-the-crs",
"title": "R@URBAN",
"section": "Finding the CRS",
"text": "Finding the CRS\nIf you are lucky, your data will have embedded CRS data that will be automatically detected when the file is read in. This is usually the case for GeoJSONS (.geojson) and shapefiles (.shp). When you use st_read() on these files, you should see the CRS displayed in the metadata:\n\nYou can also the st_crs() function to find the CRS. The CRS code is located at the end in ID[authority, SRID].\n\nst_crs(dc_firestations)\n\nCoordinate Reference System:\n User input: WGS 84 \n wkt:\nGEOGCRS[\"WGS 84\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"latitude\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"longitude\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n ID[\"EPSG\",4326]]\n\n\nSometimes, the CRS will be blank or NA as the dataset did not specify the CRS. In that case you MUST find and set the CRS for your data before proceeding with analysis. Below are some good rules of thumb for finding out what the CRS for your data is:\n\nFor geojsons, the CRS should always be EPSG:4326 (or WGS 84). The official geojson specification states that this is the only valid CRS for geojsons, but in the wild, this may not be true 100% of the time.\nFor shapefiles, there should be a file that ends in .proj in the same directory as the .shp file. 
This file contains the projection information for that file and should be used automatically when reading in shapefiles.\nFor CSVs with latitude/longitude columns, the CRS is usually EPSG:4326 (or WGS 84).\nLook at the metadata and any accompanying documentation to see if the coordinate reference system for the data is specified.\n\nIf none of the above rules of thumb apply to you, check out the crsuggest R package.\nOnce you’ve identified the appropriate CRS, you can set the CRS for your data with st_crs():\n\n# If you are certain that your data contains coordinates in the ESRI Atlas Equal Earth projection\nst_crs(some_sf_dataframe) <- st_crs(\"ESRI:102003\")"
},
{
"objectID": "mapping.html#transforming-the-crs",
"href": "mapping.html#transforming-the-crs",
"title": "R@URBAN",
"section": "Transforming the CRS",
"text": "Transforming the CRS\nOften you will need to change the CRS for your sf dataframe so that all datasets you are using have the same CRS, or to use a projected CRS for performing more accurate spatial operations. You can do this with st_transform:\n\n# Transforming CRS from WGS 84 to Urban required Equal Earth Projection\nstate_capitals <- state_capitals %>% st_transform(\"ESRI:102003\")\n\nst_transform() also allows you to just use the CRS of another sf dataframe when transforming.\n\n# transform CRS of chip_with_geographies to be the same as CRS of dc_firestations\nchip_with_geographies <- chip_with_geographies %>%\n st_transform(crs = st_crs(state_capitals))\n\nIf you are working with local data, you should use an appropriate state plane projection instead of the Atlas Equal Earth projection which is meant for national maps. library(crsuggest) can simplify the process of picking an appropriate state plane CRS.\n\nlibrary(crsuggest)\n\nsuggest_crs(dc_firestations) %>%\n # Use the value in the \"crs_code\" column to transform CRS's\n head(4)\n\n# A tibble: 4 × 6\n crs_code crs_name crs_type crs_gcs crs_units crs_proj4\n <chr> <chr> <chr> <dbl> <chr> <chr> \n1 6488 NAD83(2011) / Maryland (ftUS) project… 6318 us-ft +proj=lc…\n2 6487 NAD83(2011) / Maryland project… 6318 m +proj=lc…\n3 3582 NAD83(NSRS2007) / Maryland (ftU… project… 4759 us-ft +proj=lc…\n4 3559 NAD83(NSRS2007) / Maryland project… 4759 m +proj=lc…"
},
{
"objectID": "mapping.html#the-basics",
"href": "mapping.html#the-basics",
"title": "R@URBAN",
"section": "The basics",
"text": "The basics\n\nlibrary(ggplot2)\nMost mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). To learn more about ggplot2, visit the Data Viz page or read the official ggplot book.\nThe key function for mapping is the special geom_sf() function which works with sf dataframes. This function magically detects whether you have point or polygon spatial data and displays the results on a map.\n\n\nA simple map\nTo make a simple map, add geom_sf() to a ggplot() and set data = an_sf_dataframe. Below is code for making a map of all 50 states using library(urbnmapr):\n\nlibrary(urbnmapr)\n\nstates <- get_urbn_map(\"states\", sf = TRUE)\n\nggplot() +\n geom_sf(\n data = states,\n mapping = aes()\n )"
},
{
"objectID": "mapping.html#styling",
"href": "mapping.html#styling",
"title": "R@URBAN",
"section": "Styling",
"text": "Styling\n\nlibrary(urbnthemes)\nlibrary(urbnthemes) automatically styles maps in accordance with the Urban Institute Data Visualization Style Guide. By using library(urbnthemes), you can create publication ready maps you can immediately drop in to Urban research briefs or blog posts.\nTo install urbnthemes, visit the package’s GitHub repository and follow the instructions. There are 2 ways to use the urbnthemes functions:\n\nlibrary(urbnthemes)\n\n# You can either run this once per script to automatically style all maps with\n# the Urban theme\nset_urbn_defaults(style = \"map\")\n\n# Or you can add `+ theme_urbn_map()` to the end of every map you make\nggplot() +\n geom_sf(states, mapping = aes()) +\n theme_urbn_map()\n\n\n\n\n\n\nLayering\nYou can layer multiple points/lines/polygons on top of each other using the + operator from library(ggplot2). The shapes will appear from bottom to top (ie the last mapped object will show up on top). It is important that all layers are in the same CRS (coordinate reference system).\n\nstate_capitals <- state_capitals %>%\n # This will change CRS to ESRI:102003 and shift the AK and HI state capitals\n # point locations to the appropriate locations on the inset maps.\n tigris::shift_geometry() %>%\n # For now filter out AL and HI as their state capitals will be slightly off.\n filter(!state %in% c(\"Alaska\", \"Hawaii\"))\n\nggplot() +\n geom_sf(\n data = states,\n mapping = aes()\n ) +\n # Note we change the data argument\n geom_sf(\n data = state_capitals,\n mapping = aes(),\n # urbnthemes library has urbn color palettes built in.\n color = palette_urbn_main[\"yellow\"],\n size = 2.0\n ) +\n theme_urbn_map()\n\n\n\n\n\n\nFill and Outline Colors\nThe same commands used to change colors, opacity, lines, size, etc. in charts can be used for maps too. To change the colors of the map , just use the fill = and color = parameters in geom_sf(). 
fill will change the fill color of polygons; color will change the color of polygon outlines, lines, and points.\nGenerally, maps that show the magnitude of a variable use the blue sequential ramp and maps that display positives and negatives use the diverging color ramp. library(urbnthemes) contains built-in helper variables (like palette_urbn_main) for accessing color palettes from the Urban Data Viz Style guide. If for example you want states to be Urban’s magenta color:\n\nggplot() +\n geom_sf(states,\n mapping = aes(),\n # Adjust polygon fill color\n fill = palette_urbn_main[\"magenta\"],\n # Adjust polygon outline color\n color = \"white\"\n ) +\n theme_urbn_map()\n\n\n\n\n\n\nAdding text\nYou can also add text, like state abbreviations, directly to your map using geom_sf_text and the helper function get_urbn_labels().\n\nlibrary(urbnmapr)\n\nggplot() +\n geom_sf(states,\n mapping = aes(),\n color = \"white\"\n ) +\n theme_urbn_map() +\n # Generates dataframe of state abbv and appropriate location to plot them\n geom_sf_text(\n data = get_urbn_labels(\n map = \"states\",\n sf = TRUE\n ),\n aes(label = state_abbv),\n size = 3\n )\n\n\n\n\nThere’s also geom_sf_label() if you want labels with a border."
},
{
"objectID": "mapping.html#choropleth-maps",
"href": "mapping.html#choropleth-maps",
"title": "R@URBAN",
"section": "Choropleth Maps",
"text": "Choropleth Maps\nChoropleth maps display geographic areas with shades, colors, or patterns in proportion to a variable or variables. Choropleth maps can represent massive geographies like the entire world and small geographies like Census Tracts. To make a choropleth map, you need to set geom_sf(aes(fill = some_variable_name)). Below are examples\n\nContinuous color scale\n\n# Map of CHIP enrollment percentage by state\nchip_with_geographies_map <- chip_with_geographies %>%\n ggplot() +\n geom_sf(aes(\n # Color in states by the chip_pct variable\n fill = chip_pct\n ))\n\n\n# Below add-ons to the map are optional, but make the map look prettier.\nchip_with_geographies_map +\n # scale_fill_gradientn adds colors with more interpolation and reverses color scale\n scale_fill_gradientn(\n # Convert legend from decimal to percentages\n labels = scales::percent_format(),\n # Make legend title more readable\n name = \"CHIP Enrollment %\",\n # Manually add 0 to lower limit to include it in legend. NA=use maximum value in data\n limits = c(0, NA),\n # Set number of breaks on legend = 3\n n.breaks = 3\n )\n\n\n\n\n\n\nDiscrete color scale\nThe quick and dirty way is with scale_fill_steps(), which creates discretized bins for continuous variables:\n\nchip_with_geographies %>%\n ggplot() +\n geom_sf(aes(\n # Color in states by the chip_pct variable\n fill = chip_pct\n )) +\n scale_fill_steps(\n # Convert legend from decimal to percentages\n labels = scales::percent_format(),\n # Make legend title more readable\n name = \"CHIP Enrollment %\",\n # Show top and bottom limits on legend\n show.limits = TRUE,\n # Roughly set number of bins. Won't be exact as R uses algorithms under the\n # hood for pretty looking breaks.\n n.breaks = 4\n )\n\n\n\n\nOften you will want to manually generate the bins yourself to give you more fine grained control over the exact legend text. (ie 1% - 1.8%, 1.8 - 2.5%, etc). 
Below is an example of discretizing the continuous chip_pct variable yourself using cut_interval() and a helper function to get nice looking interval labels:\n\n# Helper function to clean up R generated intervals into nice looking interval labels\nformat_interval <- function(interval_text) {\n text <- interval_text %>%\n # Remove open and close brackets which is R generated math notation\n str_remove_all(\"\\\\(\") %>%\n str_remove_all(\"\\\\)\") %>%\n str_remove_all(\"\\\\[\") %>%\n str_remove_all(\"\\\\]\") %>%\n str_replace_all(\",\", \" — \")\n\n # Convert decimal ranges to percent ranges\n text <- text %>%\n str_split(\" — \") %>%\n map(~ as.numeric(.x) %>%\n scales::percent() %>%\n paste0(collapse = \" — \")) %>%\n unlist() %>%\n # By default character vectors are plotted in alphabetical order. We want\n # factors in reverse alphabetical order to get correct colors in ggplot\n fct_rev()\n\n return(text)\n}\n\nchip_with_geographies <- chip_with_geographies %>%\n # cut_interval into n groups with equal range. Set boundary so 0 is included in the bins\n mutate(chip_pct_interval = cut_interval(chip_pct, n = 5)) %>%\n # Generate nice looking interval labels\n mutate(chip_pct_interval = format_interval(chip_pct_interval))\n\nAnd now we can map the discretized chip_pct_interval variable using geom_sf():\n\nchip_with_geographies %>%\n ggplot() +\n geom_sf(aes(\n # Color in states by the chip_pct variable\n fill = chip_pct_interval\n )) +\n # Default is to use main urban palette, which assumes unrelated groups. We\n # adjust colors manually to be on Urban cyan palette\n scale_fill_manual(\n values = palette_urbn_cyan[c(8, 7, 5, 3, 1)],\n name = \"CHIP Enrollment %\"\n )\n\n\n\n\nIn addition to cut_interval there are similar functions for creating intervals/bins with slightly different rules. When creating bins, be careful as changing the number of bins can drastically change how the map looks."
},
{
"objectID": "mapping.html#bubble-maps",
"href": "mapping.html#bubble-maps",
"title": "R@URBAN",
"section": "Bubble Maps",
"text": "Bubble Maps\nThis is just a layered map with one polygon layer and one point layer, where the points are sized in accordance with a variable in your data.\n\nset_urbn_defaults(style = \"map\")\n\n# Get sf dataframe of DC tracts\nlibrary(tigris)\ndc_tracts <- tracts(\n state = \"DC\",\n year = 2019,\n progress_bar = FALSE\n)\n\n# Add bubbles for firestations\nggplot() +\n geom_sf(data = dc_tracts, fill = palette_urbn_main[\"gray\"]) +\n geom_sf(\n data = dc_firestations,\n # Size bubbles by number of trucks at each station\n aes(size = TRUCK),\n color = palette_urbn_main[\"yellow\"],\n # Adjust transparency for readability\n alpha = 0.8\n )"
},
{
"objectID": "mapping.html#dot-density-maps",
"href": "mapping.html#dot-density-maps",
"title": "R@URBAN",
"section": "Dot-density Maps",
"text": "Dot-density Maps\nThese maps scatter dots within a geographic area. Typically each dot represents a unit (like 100 people, or 1000 houses). To create this kind of map, you need to start with an sf dataframe that is of geometry type POLYGON or MULTIPOLYGON and then sample points within the polygon.\nThe below code generates a dot-density map representing people of different races within Washington DC tracts The code may look a little complicated, but the key workhorse function is st_sample() which samples points within each polygon to use in the dot density map:\n\nlibrary(tidycensus)\n\n# Get counts by race of DC tracts\ndc_pop <- get_acs(\n geography = \"tract\",\n state = \"DC\",\n year = 2019,\n variables = c(\n Hispanic = \"DP05_0071\",\n White = \"DP05_0077\",\n Black = \"DP05_0078\",\n Asian = \"DP05_0080\"\n ),\n geometry = TRUE,\n progress_bar = FALSE\n)\n\n# Get unique groups (ie races)\ngroups <- unique(dc_pop$variable)\n\n# For each unique group (ie race), generate sampled points\ndc_race_dots <- map_dfr(groups, ~ {\n dc_pop %>%\n # .x = the group used in the loop\n filter(variable == .x) %>%\n # Use the projected MD state plane for accuracy\n st_transform(crs = \"EPSG:6487\") %>%\n # Have every dot represent 100 people\n mutate(est100 = as.integer(estimate / 100)) %>%\n st_sample(size = .$est100, exact = TRUE) %>%\n st_sf() %>%\n # Add group (ie race) as a column so we can use it when plotting\n mutate(group = .x)\n})\n\n\nggplot() +\n # Plot tracts, then dots on top of tracts\n geom_sf(\n data = dc_pop,\n # Make interior of tracts transparent and boundaries black\n fill = \"transparent\",\n color = \"black\"\n ) +\n geom_sf(\n data = dc_race_dots,\n # Color in dots by racial group\n aes(color = group),\n # Adjust transparency and size to be more readable\n alpha = 0.5,\n size = 1.1,\n stroke = FALSE\n )"
},
{
"objectID": "mapping.html#geofacets",
"href": "mapping.html#geofacets",
"title": "R@URBAN",
"section": "Geofacets",
"text": "Geofacets\nGeofaceting arranges sub-geography-specific plots into a grid that resembles a larger geography (usually the US). This can be a useful alternative to choropleth maps, which tend to overemphasize low-population density areas with large areas. To make geofacetted charts, use the facet_geo() function from the geofacet library, which can be thought of as equivalent to ggplot2’s facet_wrap(). For this example, we’ll use the built-in state_ranks data.\n\nlibrary(geofacet)\n\nhead(state_ranks %>% as_tibble())\n\n# A tibble: 6 × 4\n state name variable rank\n <chr> <chr> <chr> <dbl>\n1 AK Alaska education 28\n2 AK Alaska employment 50\n3 AK Alaska health 25\n4 AK Alaska wealth 5\n5 AK Alaska sleep 27\n6 AK Alaska insured 50\n\n\n\nset_urbn_defaults(style = \"print\")\n\nstate_ranks %>%\n filter(variable %in% c(\"education\", \"employment\")) %>%\n ggplot(aes(x = rank, y = variable)) +\n geom_col() +\n facet_geo(\n facets = \"state\",\n # Use custom urban geofacet grid which is built into urbnthemes\n # For now we need to rename a few columns as urbnthemes has to be\n # updated\n grid = urbnthemes::urbn_geofacet %>%\n rename(\n code = state_code,\n name = state_name\n )\n )\n\n\n\n\nInteractive geofacets of the United States have been used in Urban Features like A Matter of Time which included geofaceted line charts showing trends in incarceration by state. Static geofacets of the United States were included in Barriers to Accessing Homeownership Down Payment, Credit, and Affordability by the Housing Finance Policy Center."
},
{
"objectID": "mapping.html#cartograms",
"href": "mapping.html#cartograms",
"title": "R@URBAN",
"section": "Cartograms",
"text": "Cartograms\nCartograms are a modified form of a choropleth map with intentionally distorted sizes that map to a variable in your data. Below we create a cartogram with library(cartogram) where the state sizes are proportional to the population.\n\nlibrary(cartogram)\n\nset_urbn_defaults(style = \"map\")\n\nchip_with_geographies_weighted <- chip_with_geographies %>%\n # Note column name needs to be in quotes for this package\n cartogram_cont(weight = \"population\")\n\nggplot() +\n geom_sf(\n data = chip_with_geographies_weighted,\n # Color in states by chip percentages\n aes(fill = chip_pct)\n )"
},
{
"objectID": "mapping.html#interactive-maps",
"href": "mapping.html#interactive-maps",
"title": "R@URBAN",
"section": "Interactive Maps",
"text": "Interactive Maps\nInteractive maps can be a great exploratory tool to explore and understand your data. And luckily there are a lot of new R packages that make it really easy to create them. Interactive maps are powerful but we do not recommend them for official use in Urban publications as getting them in Urban styles and appropriate basemaps can be tricky (reach out to anarayanan@urban.org if you really want to include them).\n\nlibrary(mapview)\nlibrary(mapview) is probably the most user friendly of the interactive mapping R libraries. All you have to do to create an interactive map is:\n\nlibrary(mapview)\n\n\nchip_with_geographies_for_interactive_mapping <- chip_with_geographies %>%\n # Filter out AL and HI bc they would appear in Mexico. If you want AL, HI and\n # in the correct place in interactive maps, make sure to use tigris::states()\n filter(!state_abbv %in% c(\"AK\", \"HI\"))\n\nmapview(chip_with_geographies_for_interactive_mapping)\n\n\n\n\n\n\nWhen you click on an object, you get a popup table of all it’s attributes. And when you hover over an object, you get a popup with an object id.\nEach of the above behaviors can be changed if desired. As you’ll see in the below section, the syntax for library(mapview) is significantly different from library(ggplot2) so be careful!\n\nColoring in points/polygons\nIn order to create a choropleth map where we color in the points/polygons by a variable, we need to feed in a column name in quotes to thezcol argument inside the mapview() function:\n\n# Create interactive state map colored in by chip enrollment\nmapview(chip_with_geographies_for_interactive_mapping, zcol = \"chip_enrollment\")\n\n\n\n\n\n\nIf you want more granular control over the color palette for the legend can also feed in a vector of color hex codes to col.regions along with a column name to zcol. This will create a continuous color range along the provided colors. 
Be careful though as the color interpolation is not perfect.\n\n# library(RColorBrewer)\nmapview(chip_with_geographies_for_interactive_mapping,\n col.regions = c(\n palette_urbn_green[6],\n \"white\",\n palette_urbn_cyan[6]\n ),\n zcol = \"chip_enrollment\"\n)\n\n\n\n\n\n\nIf you want to color in all points/polygons as the same color, just feed in a single color hex code to the col.regions argument:\n\nmapview(chip_with_geographies_for_interactive_mapping,\n col.regions = palette_urbn_green[5]\n)\n\n\n\n\n\n\n\n\nAdding layers\nYou can add multiple sf objects on the same map by using the + operator. This is very useful when comparing 2 or more spatial datasets.\n\nmapview(chip_with_geographies_for_interactive_mapping, col.regions = palette_urbn_green[5]) +\n mapview(state_capitals, col.regions = palette_urbn_cyan[5])\n\n\n\n\n\n\nYou can even create slider maps by using the | operator!\n\nmapview(chip_with_geographies_for_interactive_mapping, col.regions = palette_urbn_green[5]) |\n mapview(state_capitals, col.regions = palette_urbn_cyan[5])\n\n\n\n\n\n\n\n\n\nMore details\nTo learn more about more advanced options with mapview maps, check out the documentation page and the reference manual.\nThere are also other interactive map making packages in R like leaflet (which mapview is a more user friendly wrapper of), tmap, and mapdeck. To learn about these other packages, this book chapter is a good starting point."
},
{
"objectID": "mapping.html#cropping",
"href": "mapping.html#cropping",
"title": "R@URBAN",
"section": "Cropping",
"text": "Cropping\nCropping (or clipping) is geographically filtering an sf dataframe to just the area we are interested in. Say we wanted to look at the roads around Fire Station 24 in DC.\n\nlibrary(tigris)\nlibrary(units)\n\ndc_firestations <- dc_firestations %>%\n st_transform(\"EPSG:6487\")\n\n\n# Draw 500 meter circle around one fire station\nfire_station_24_buffered <- dc_firestations %>%\n filter(NAME == \"Engine 24 Station\") %>%\n st_buffer(set_units(500, \"meter\"))\n\n# Get listing of all roads in DC\ndc_roads <- roads(\n state = \"DC\",\n county = \"District of Columbia\",\n class = \"sf\",\n progress_bar = FALSE\n) %>%\n st_transform(\"EPSG:6487\")\n\n# View roads on top of fire_station\nggplot() +\n # Order matters! We need to plot fire_stations first, and then roads on top\n # to see overlapping firestations\n geom_sf(\n data = fire_station_24_buffered,\n fill = palette_urbn_cyan[1],\n color = palette_urbn_cyan[7]\n ) +\n geom_sf(\n data = dc_roads,\n color = palette_urbn_gray[7]\n ) +\n theme_urbn_map()\n\n\n\n\nWe can clip the larger roads dataframe to just roads that overlap with the circle around the fire station with st_intersection().\n\n# Use st_intersection() to crop the roads data to just roads within the\n# fire_station radius\ndc_roads_around_fire_station_24_buffered <- fire_station_24_buffered %>%\n st_intersection(dc_roads)\n\nggplot() +\n geom_sf(\n data = fire_station_24_buffered,\n fill = palette_urbn_cyan[1],\n color = palette_urbn_cyan[7]\n ) +\n geom_sf(\n data = dc_roads_around_fire_station_24_buffered,\n color = palette_urbn_gray[7]\n ) +\n theme_urbn_map()\n\n\n\n\nMore Coming Soon!"
},
{
"objectID": "mapping.html#calculating-distance",
"href": "mapping.html#calculating-distance",
"title": "R@URBAN",
"section": "Calculating Distance",
"text": "Calculating Distance"
},
{
"objectID": "mapping.html#spatial-joins",
"href": "mapping.html#spatial-joins",
"title": "R@URBAN",
"section": "Spatial Joins",
"text": "Spatial Joins\n\nPoint to Polygon\n\n\nPolygon to Polygon"
},
{
"objectID": "mapping.html#aggregating",
"href": "mapping.html#aggregating",
"title": "R@URBAN",
"section": "Aggregating",
"text": "Aggregating"
},
{
"objectID": "mapping.html#drivetransit-times",
"href": "mapping.html#drivetransit-times",
"title": "R@URBAN",
"section": "Drive/Transit times",
"text": "Drive/Transit times"
},
{
"objectID": "mapping.html#geocoding",
"href": "mapping.html#geocoding",
"title": "R@URBAN",
"section": "Geocoding",
"text": "Geocoding\nGeocoding is the process of turning text (usually addresses) into geographic coordinates (usually latitudes/longitudes) for use in mapping. For Urban researchers, we highly recommend using the Urban geocoder as it is fast, accurate, designed to work with sensitive/confidential data and most importantly free to use for Urban researchers! To learn about how we set up and chose the geocoder for the Urban Institute, you can read our Data@Urban blog.\n\nCleaning Addresses\nThe single most important factor in getting accurate geocoded data is having cleaned, well structured address data. This can prove difficult as address data out in the wild is often messy and unstandardized. While the rules for cleaning addresses are very data specific, below are some examples of clean addresses you should aim for in your data cleaning process:\n\n\n\n\n\n\n \n \n \n f_address\n Type of address\n \n \n \n 123 Troy Drive, Pillowtown, CO, 92432\nresidnetial address\n 789 Abed Avenue, Apt 666, Blankesburg, CO, 92489\nresidential apartment address\n Shirley Boulevard and Britta Drive, Blanketsburg, CO, 92489\nstreet intersection\n Pillowtown, CO\ncity\n 92489, CO\nZip Code\n \n \n \n\n\n\n\nAll that being said, our geocoder is pretty tolerant of different address formats, typos/spelling errors and missing states, zip codes, etc. So don’t spend too much time cleaning every address in the data. Also note that while our geocoder is able to geocode cities and zip codes, it will return the lat/lon of the center of the city/zip code, which may not be what you want.\n\n\nInstructions\nTo use the Urban geocoder, you will need to:\n\nGenerate a CSV with a column named f_address which contains the addresses in single line format (ie 123 Abed Avenue, Blanketsburg, CO, 94328). This means that if you have the addresses split across multiple columns (ie Address, City, State, Zip columns), you will need to concatenate them into one column. 
Also see our Cleaning Addresses section above.\nGo to the Urban geocoder and answer the initial questions. This will tell you whether your data is non-confidential or confidential, and allow you to upload your CSV for geocoding.\nWait for an email telling you your results are ready. If your data is non-confidential, this email will contain a link to your geocoded results. This link expires in 24 hours, so make sure to download your data before then. If your data is confidential, the email will contain a link to the location on the Y Drive where your confidential geocoded data is stored. You can specify this output folder when submitting the CSV in step 2.\n\n\n\nGeocoder outputs\n\nThe geocoded file will be your original data, plus a few more columns (including latitude and longitude). The table below describes each of the new columns that have been appended to your original data. It’s very important that you take a look at the Addr_type column in the CSV before doing further analysis to check the accuracy of the geocoding process.\n\n\n\n\n\n\n\n\nColumn\nDescription\n\n\n\n\nMatch_addr\nThe actual address that the input address was matched to. This is the address that the geocoder used to get the latitude and longitude. If your data file may contain many typos or nonstandard address formats, take a close look at this column to confirm that the matched address correctly handled typos and badly formatted addresses.\n\n\nLongitude\nThe WGS 84 datum Longitude (EPSG code 4326)\n\n\nLatitude\nThe WGS 84 datum Latitude (EPSG code 4326)\n\n\nAddr_type\nThe match level for a geocode request. This should be used as an indicator of the precision of geocode results. Generally, Subaddress, PointAddress, StreetAddress, and StreetInt represent accurate matches. The list below contains all possible values for this field. Green values represent High accuracy matches, yellow represents Medium accuracy matches and red represents Low accuracy/inaccurate matches. 
If you have many yellow and red values in your data, you should manually check the results before proceeding with analysis. All possible values:\n\nSubaddress: A subset of a PointAddress that represents a house or building subaddress location, such as an apartment unit, floor, or individual building within a complex. The UnitName, UnitType, LevelName, LevelType, BldgName, and BldgType field values help to distinguish subaddresses that may be associated with the same PointAddress. Reference data consists of point features with associated house number, street name, and subaddress elements, along with administrative divisions and optional postal code; for example, 3836 Emerald Ave, Suite C, La Verne, CA, 91750.\n\nPointAddress: A street address based on points that represent house and building locations. Typically, this is the most spatially accurate match level. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X / Y (Longitude/Latitude) and geometry output values for a PointAddress match represent the street entry location for the address; this is the location used for routing operations. The DisplayX and DisplayY values represent the rooftop, or actual, location of the address. Example: 380 New York St, Redlands, CA, 92373.\n\nStreetAddress: A street address that differs from PointAddress because the house number is interpolated from a range of numbers. Reference data contains street centerlines with house number ranges, along with administrative divisions and optional postal code information; for example, 647 Haight St, San Francisco, CA, 94117.\n\nStreetInt: A street address consisting of a street intersection along with city and optional state and postal code information. This is derived from StreetAddress reference data; for example, Redlands Blvd & New York St, Redlands, CA, 92373.\n\nStreetName: Similar to a street address but without the house number. 
Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code; for example, W Olive Ave, Redlands, CA, 92373.\n\nStreetAddressExt: An interpolated street address match that is returned when parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment.\n\nDistanceMarker: A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. Example: Carr 682 KM 4, Barceloneta, 00617.\n\nPostalExt: A postal code with an additional extension, such as the United States Postal Service ZIP+4. Reference data is postal code points with extensions; for example, 90210-3841.\n\nPOI: Points of interest. Reference data consists of administrative division place-names, businesses, landmarks, and geographic features; for example, Golden Gate Bridge.\n\nLocality: A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. Example: Bogotá, COL.\n\nPostalLoc: A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries; for example, 7132 Frauenkirchen.\n\nPostal: Postal code. Reference data is postal code points; for example, 90210 USA.\n\n\nScore\nA number from 1–100 indicating the degree to which the input tokens in a geocoding request match the address components in a candidate record. A score of 100 represents a perfect match, while lower scores represent decreasing match accuracy.\n\n\nStatus\nIndicates whether a batch geocode request results in a match, a tie, or no match. Possible values include:\n\nM - Match. 
The returned address matches the input address and is the highest scoring candidate.\n\nT - Tied. The returned address matches the input address but has the same score as one or more additional candidates.\n\nU - Unmatched. No addresses match the input address.\n\n\ngeometry\nThe WKT (Well-known text) representation of the latitudes and longitudes. This column may be useful if you’re reading the CSV into R, Python, or ArcGIS.\n\n\nRegion\nThe state that Match_addr is located in\n\n\nRegionAbbr\nAbbreviated state name. For example, CA for California\n\n\nSubregion\nThe county that the input address is located in\n\n\nMetroArea\nThe name of the metropolitan area that Match_addr is located in. This field may be blank if the input address is not located within a metro area.\n\n\nCity\nThe city that Match_addr is located in\n\n\nNbrhd\nThe neighborhood that Match_addr is located in. Note that these are ESRI-defined neighborhoods, which may or may not align with other sources’ neighborhood definitions."
}
]