---
output:
html_document:
theme: readable
highlight: tango
self_contained: false
css: textbook.css
---
# The Grammar of Graphics
<br>
<br>
<div class="tip">
## Key Concepts
In this chapter, we'll explore the following key concepts:
* Exploratory Data Analysis (EDA)
* Exploratory Data Visualization (EDV)
* Exploratory vs. Explanatory Plots
* The Grammar of Graphics
* Data & Non-Data Ink
* Aesthetic Mappings
* Attributes
* Overplotting
* Faceting
## New Packages
This chapter uses the following packages:
* ggplot2
* GGally
* scales
* plotly
## Key Takeaways
Too long; didn't read? Here's what you need to know:
* We visualize data to:
- Submit, Publish, Teach ("Explanatory Viz")
- Explore, Learn, Share ("Exploratory Viz")
* Exploratory viz is a key part of any data analysis
* "Grammar of Graphics" framework inspired "ggplot2"
* "Layers" are like visual parts of speech
- There are 7 layers - 3 are essential
- Each layer has a family of functions
* Essential layers include:
1. Data: Intakes data frame/tibble
2. Aesthetic: Maps variables to axes, color, etc.
3. Geometry: Specifies the shape of your data
* Nonessential layers include:
4. Statistics: Modeling and computation
5. Coordinates: Zooming in and modifying scales
6. Facets: Small multiples, a.k.a. trellis plots
7. Themes: Overall style - grid lines, text, etc.
* Variety of ggplot2 extensions (see resources)
* Use package "plotly" to make ggplot2 interactive
<br>
<br>
<br>
</div>
<br>
<br>
```{r echo=F}
# ATTENTION : GLOBAL CHUNK DEFAULTS
knitr::opts_chunk$set(message = FALSE,
warning = FALSE) # Disable warnings, messages
```
```{r include=FALSE}
tutorial::go_interactive( greedy=FALSE ) # Enable interactive exercises
```
## Why Visualize Data?
We visualize data for two principal reasons. We want to:
1. Learn about our data, i.e. **exploratory data visualization**
2. Tell our data's stories to others, i.e. **explanatory data visualization**
<br>
<br>
### Explanatory Visualization
**Explanatory visualization** is polished, publication-quality, and interpretable:
* Meant to be consumed by broad, non-specialist audiences
* Takes significant time and iterations to perfect
* Conveys only one or two "big ideas" per plot
```{r message=F, warning=F, echo=F, cache=T}
library(readr)
url <- paste0("https://raw.githubusercontent",
".com/jamisoncrawford/ddp_app/",
"master/Data/onet_bls_merged.csv")
set.seed(1)
```
<br>
Just note the length of the code and its result. What idea does this plot convey?
```{r message=F, warning=F, cache=T}
library(ggplot2)
ggplot(read_csv(url), aes(x = val, y = myr)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_smooth(method = "lm", se = FALSE,
color = "grey50", lwd = 1, alpha = 0.3) +
facet_wrap(~ elm) +
labs(title = "Annual income vs. Holland personality scores",
x = "Personality Score",
y = "Income (K)",
caption = "Sources: US DOL O*NET & BLS") +
scale_y_continuous(labels = c("0", "50", "100", "150", "$200")) +
theme_minimal()
```
To experiment with Holland codes and predicted income, [check it out in Shiny](https://uruguayguy.shinyapps.io/shiny_app/).
<br>
<br>
### Exploratory Visualization
**Exploratory visualization** is "quick and dirty" and intends to discover.
* Meant to be consumed by yourself, colleagues, or other specialists
* Created quickly, with no polish, refinement, or audience in mind
* Can convey many ideas, or none - that's why you do it!
<br>
This code is manageable! Observe a simple call to `ggpairs()` and the `iris` dataset.
```{r message=F, warning=F, cache=T}
library(ggplot2)
library(GGally)
ggpairs(data = iris,
aes(color = Species))
```
<center>
*With just a little code, pairs plots visualize every variable against the other.*
</center>
<br>
Hopefully, you wouldn't publish this. But we can quickly find patterns in a pairs plot:
* We can see positive correlations between sepal and petal measurements for two species
* We observe that "Setosa" has shorter, wider sepals compared to others
* We also observe that "Versicolor" has the least variation in size
Such visual exploration can help refine hypotheses before analysis begins!
<br>
<br>
<div class="quiz">
<br>
YOUR TURN: EXPLORATORY VIZ WITH PAIRS PLOTS
<br>
Load the necessary packages with `library()`.
Then, call `ggpairs()` on the `economics` dataset from ggplot2.
<br>
</div>
<br>
```{r ex="example-01", type="sample-code", tut=TRUE}
# Load required packages
library(ggplot2)
library(GGally)
# Set first argument to "economics"
ggpairs(data = ...)
```
<br>
<br>
### Do I have to Visualize?
Exploratory viz is a key component in **exploratory data analysis**, or **EDA**.
Failing to visually explore your data can get you in hot water. Let's try it!
<br>
Observe the following data frame containing four data sets.
Variable `x1` corresponds to `y1`, `x2` to `y2`, and so forth:
```{r message=F, warning=F, echo=F, cache=T}
library(datasets)
set1 <- data.frame(x = anscombe[,1], y = anscombe[,5])
set2 <- data.frame(x = anscombe[,2], y = anscombe[,6])
set3 <- data.frame(x = anscombe[,3], y = anscombe[,7])
set4 <- data.frame(x = anscombe[,4], y = anscombe[,8])
```
```{r, cache=T}
library(datasets)
anscombe
```
At a glance, they look like they have some similarities!
<br>
**Check It:** Let's perform a few statistical EDA functions on each subset, 1-4.
What do you notice when we figure out and organize:
* Average of all X and Y values with `mean()`
* Variance of all X and Y values with `var()`
* Correlation between X and Y for all sets with `cor()`
* Linear regression coefficients between X and Y with `lm()`
```{r echo=F, warning=F, message=F, cache=T}
x_mean <- sapply(list(set1$x,
set2$x,
set3$x,
set4$x), mean)
y_mean <- sapply(list(set1$y,
set2$y,
set3$y,
set4$y), mean)
x_var <- sapply(list(set1$x,
set2$x,
set3$x,
set4$x), var)
y_var <- sapply(list(set1$y,
set2$y,
set3$y,
set4$y), var)
correl <- c(cor(set1$x, set1$y),
cor(set2$x, set2$y),
cor(set3$x, set3$y),
cor(set4$x, set4$y))
coeff <- c(lm(y ~ x, set1)$coef[2],
lm(y ~ x, set2)$coef[2],
lm(y ~ x, set3)$coef[2],
lm(y ~ x, set4)$coef[2])
data.frame(x_mean,
y_mean,
x_var,
y_var,
correl,
coeff,
row.names = c("Set 1", "Set 2", "Set 3", "Set 4"))
```
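The code behind the table above is hidden, so here is a minimal sketch of one way to compute the same summaries straight from `anscombe`. The helper `summarize_set()` and the object `quartet_stats` are names made up for this illustration; they are not part of any package, and this is not necessarily how the table above was built.

```{r eval=F}
library(datasets)

# Hypothetical helper: summarize the i-th x/y pair in "anscombe"
summarize_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(x_mean = mean(x),
    y_mean = mean(y),
    x_var  = var(x),
    y_var  = var(y),
    correl = cor(x, y),
    slope  = unname(coef(lm(y ~ x))[2]))   # Slope on x from the linear model
}

# One row per set, one column per statistic
quartet_stats <- t(sapply(1:4, summarize_set))
rownames(quartet_stats) <- paste("Set", 1:4)

round(quartet_stats, 2)
```

Looping over the set number keeps the four calculations identical, which makes any difference between sets easier to trust.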
<br>
**Heavens to Murgatroyd!** These are practically the same sets!
* The mean and variance of X are exactly the same across sets
* The mean and variance of Y are *almost* exactly the same across sets
* The correlation between X & Y is extremely close across sets
* The regression slope (the coefficient on X) is also extremely close
<br>
Once visualized, all four linear relationships appear to be exactly the same:
<center>
```{r warning=F, echo=F, message=F, cache=T}
library(tidyr)
library(dplyr)
library(ggplot2)
library(stringr)
library(datasets)
ansc <- anscombe %>%
gather(key = set_var, value = value, x1:y4) %>%
mutate(set = str_extract(set_var, pattern = "[1-4]{1}"),
var = str_extract(set_var, pattern = "[x-y]{1}")) %>%
select(-set_var)
xy <- bind_cols(ansc[1:44, 1:3] %>%
rename(x = value) %>%
select(-var),
ansc[45:88, 1:3] %>%
rename(y = value) %>%
select(-var)) %>%
select(set, x, y, -set1)
ggplot(xy, aes(x = x, y = y)) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~set) +
labs(title = "Linear relationship in Anscombe's Datasets",
x = "X Values",
y = "Y Values",
caption = "Source: Francis Anscombe (1973)") +
theme_classic()
```
</center>
<br>
**Well, that settles that.** Pack it up, folks - the data are the same.
<br>
*Hold up.*
<br>
*Wait a minute.*
<br>
*Something ain't right.*
<br>
**Let's try replotting** the actual datasets and not just their linear models.
<br>
<center>
```{r echo=F, warning=F, message=F, cache=T}
ggplot(xy, aes(x = x, y = y)) +
geom_point(size = 2, alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, alpha = 0.3) +
facet_wrap(~set) +
labs(title = "Actual relationships in Anscombe's Datasets",
subtitle = "You've been had!",
x = "X Values",
y = "Y Values",
caption = "Source: Francis Anscombe (1973)") +
theme_classic()
```
</center>
<br>
**Boom!** Not the same datasets *at all*.
<br>
Despite having the same mean, variance, correlations, and coefficients:
* Dataset 1 is a normally distributed linear relationship
* Dataset 2 is a parabolic curve
* Dataset 3 is a near-perfect linear relationship thrown off by a single outlier
* Dataset 4 shows absolutely no relationship: one high-leverage point alone produces the fitted line
<br>
**Conclusion:** Always conduct exploratory visualization as a staple of any analysis.
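To check this yourself, the sketch below reshapes `anscombe` with base R and redraws the faceted scatter plots. It is an illustration only, not the hidden code used for the figures above; `quartet` and `quartet_plot` are placeholder names chosen for this example.

```{r eval=F}
library(ggplot2)
library(datasets)

# Stack the four x columns and the four y columns into one long data frame
quartet <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, c("x1", "x2", "x3", "x4")], use.names = FALSE),
  y   = unlist(anscombe[, c("y1", "y2", "y3", "y4")], use.names = FALSE)
)

quartet_plot <- ggplot(quartet, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.5) +         # The actual observations
  geom_smooth(method = "lm", se = FALSE) +    # The nearly identical linear fits
  facet_wrap(~ set)                           # One panel per dataset

quartet_plot
```

Removing the `geom_point()` line should reproduce the misleading "lines only" view.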
<br>
<br>
<div class = "note">
**FUN FACT:**
**You just got Anscombe'd.** [Francis John "Frank" Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) was an English statistician who helped pioneer the field in the twentieth century. He wrote:
<br>
> "...a computer should make both calculations and graphs..."
<br>
To demonstrate, Anscombe invented these datasets: [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).
</div>
```{r warning=F, message=F, echo=F}
rm(set1, set2, set3, set4, xy, ansc, coeff,
correl, x_mean, x_var, y_mean, y_var)
```
<br>
<br>
## The Grammar of Graphics
We've compared human languages and programming languages in the past.
Let's take the analogy further. **Observe the following sentence:**
```{r, echo=F, fig.align="center", fig.cap = "The quick brown fox jumps over the lazy dog.", out.width="100%", cache=T}
knitr::include_graphics("https://mir-s3-cdn-cf.behance.net/project_modules/disp/7a5a8f10364877.560e3a9483d9b.png")
```
<center> Source: [Chionetti, A. (2013)](https://www.behance.net/gallery/10364877/The-quick-brown-fox-jumps-over-the-lazy-dog) </center>
<br>
Recall that language consists of nouns, verbs, adjectives, articles, prepositions, etc.
If we change any **part of speech**, we change the meaning of our sentence. Observe:
<br>
> The quick brown fox jumps **off** the lazy dog.
<br>
**Now the fox has escalated things, opting to use the poor dog as a springboard.**
<br>
> The quick brown fox **runs** over the lazy dog.
<br>
**Well this is just getting graphic.**
<br>
> The **decrepit** brown fox jumps over the lazy dog.
<br>
**He's a good dog.**
<br>
**Parts of Speech:** Every *part of speech* (like adjectives, e.g. "quick", "lazy") has a function.
Nouns describe things, verbs describe actions, adjectives describe qualities, etc.
<br>
**Parts of Viz:** Like language, there are *parts of visualization*. Each part has a function.
* A chart's *data* could be anything, like quarterly revenues or ELA scores
* A chart's *geometry* could represent the data as bars, points, lines, or shapes
* A chart's *theme* could use different fonts, gridlines, transparencies, etc.
<br>
These are just some *parts of viz* in a larger framework: **The Grammar of Graphics**.
<br>
<br>
### A Brief Overview
In 1999, statistician [Leland Wilkinson](https://en.wikipedia.org/wiki/Leland_Wilkinson) published [**The Grammar of Graphics**](https://books.google.com/books/about/The_Grammar_of_Graphics.html?id=ZiwLCAAAQBAJ).
This framework allows us to dissect and alter plots in the same way we would a sentence.
Let's begin with *parts of viz*, or **layers**.
<br>
<br>
### Layers: Parts of Viz
In the grammar of graphics framework, each visualization is comprised of **layers**.
* Each **layer** performs a unique function in a visualization
* Like *parts of speech*, a **layer** can perform one function in infinite ways
- For example, the **data layer** functions to input your data
- A noun can be "fox" or "dog"; a dataset can be "DOL" or "TSA"
<br>
<br>
### Essential Layers: Data, Mappings, & Shapes
Like human *sentences*, every complete visualization has 3 essential layers:
<br>
<br>
<br>
```{r, echo=F, fig.align="center", fig.cap = "The three essential layers for a complete visualization.", out.width="100%", cache=T}
knitr::include_graphics("figures/essential_layers.jpg")
```
```{r echo=F, message=F, warning=F, cache=T}
library(readr)
library(dplyr)
library(ggplot2)
url <- paste0("https://raw.githubusercontent.com/",
"jamisoncrawford/wealth/master/Tidy",
"%20Data/hancock_lakeview_tidy.csv")
lakeview <- read_csv(url) %>%
filter(project == "Lakeview",
!is.na(race))
url <- paste0("https://raw.githubusercontent.com/DS4PS/",
"dp4ss-textbook/master/tables/hancock.csv")
hancock <- read_csv(url) %>%
filter(!is.na(ethnicity))
```
<br>
<br>
<br>
#### The Data Layer
The **data layer** conveys the dataset informing the visualization.
* The data layer inputs a dataset, but doesn't specify the variables you show
* Visualizations can have more than one data layer, e.g. overlay plots
<br>
The **data layer** in everyday conversation:
> "Are you pulling occupations from O*NET or BLS? We only need SOC-level."
<br>
**Interpretation:** Your data layer is comprised of your data source.
<br>
<br>
Observe the same plot - the only difference is the **data layer**. What changes?
<br>
<br>
```{r echo=F, message=F, warning=F, cache=T}
library(gridExtra)
library(ggplot2)
library(scales)
set.seed(1)
p1 <- ggplot(lakeview, aes(x = factor(race,
levels = c("Asian", "Black", "Hispanic", "Indigenous", "White"),
labels = c("Asian", "Black", "Hispanic", "Indigenous", "White")),
y = net)) +
geom_jitter(alpha = 0.15,
color = "dodgerblue2",
width = 0.2) +
labs(title = "Racial disparities in public work",
subtitle = "Lakeview Amphitheater, 2015",
x = NULL,
y = "Weekly Net",
caption = "") +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(3)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
p2 <- ggplot(hancock, aes(x = factor(ethnicity,
levels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White"),
labels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White")),
y = net)) +
geom_jitter(alpha = 0.15,
color = "dodgerblue2",
width = 0.2) +
labs(title = "",
subtitle = "Hancock Airport, 2018",
x = NULL,
y = "Weekly Net",
caption = "Source: Syracuse Regional Airport Authority") +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(2)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
grid.arrange(p1, p2, ncol=2)
```
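As a rough sketch of the same idea using built-in data, the function below fixes the aesthetics and geometry and lets only the data layer vary. The helper `plot_petals()` and the two plot objects are hypothetical names for this example, and `iris` stands in for the construction data shown above.

```{r message=F, warning=F}
library(ggplot2)

# Hypothetical helper: identical aesthetics and geometry for any data frame
plot_petals <- function(df) {
  ggplot(data = df,                                  # Data layer (varies)
         aes(x = Petal.Length, y = Petal.Width)) +   # Aesthetics layer (fixed)
    geom_point(alpha = 0.5)                          # Geometry layer (fixed)
}

petals_setosa    <- plot_petals(subset(iris, Species == "setosa"))
petals_virginica <- plot_petals(subset(iris, Species == "virginica"))

petals_setosa
petals_virginica
```

Both plots share every instruction except the data frame passed in, so any visual difference comes from the data layer alone.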
<br>
<br>
#### The Aesthetics Layer
The **aesthetics layer** conveys which variables to visualize.
* **Aesthetics** refers to the visual ways to represent variables
* Aesthetics include size, shape, color, fill, line type, transparency, etc.
* The most common aesthetic mapping is using x- and y-axes to show quantities
* Aesthetics adjust dynamically with your data - *they are not static*
<br>
The **aesthetics layer** in everyday conversation:
> “Can you color-code the datapoints by gender?”
<br>
**Interpretation:** Aesthetics use visual elements like color to convey more data.
<br>
<br>
Observe the same plot - the only difference is the **aesthetics layer**. What changes?
<br>
<br>
```{r echo=F, message=F, warning=F, cache=T}
library(gridExtra)
library(extrafont)
library(ggplot2)
set.seed(1)
p1 <- ggplot(hancock, aes(x = factor(ethnicity,
levels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White"),
labels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White")),
y = net)) +
geom_jitter(alpha = 0.15,
color = "dodgerblue2",
width = 0.2) +
labs(title = "Racial disparities in public work",
subtitle = "Hancock Airport, 2018",
x = NULL,
y = "Weekly Net",
caption = "") +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(3)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
p2 <- ggplot(hancock, aes(x = gross,
y = net)) +
geom_jitter(alpha = 0.15,
color = "dodgerblue2",
width = 0.2) +
labs(title = "",
subtitle = "",
x = "Weekly Gross",
y = "Weekly Net",
caption = "Source: Syracuse Regional Airport Authority") +
scale_x_continuous(labels = dollar) +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(2)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
grid.arrange(p1, p2, ncol=2)
```
<br>
<br>
#### The Geometry Layer
The **geometry layer** conveys the shape your variables use to visualize data.
* **Geometries** include scatter plots, bar charts, line graphs, etc.
* Some geometries are only compatible with certain variables
* Geometries can also take **attributes**, like size, shape, color, etc.
* Geometry attributes do not adjust dynamically with data - *they are static*
<br>
The **geometry layer** in everyday conversation:
> “I’m trying to emphasize the increase in elevated blood lead levels over time.”
<br>
**Interpretation:** It sounds like they'd want a line graph or bar chart to show change.
<br>
<br>
Observe the same plot - the only difference is the **geometry layer**. What changes?
<br>
<br>
```{r echo=F, message=F, warning=F, cache=T}
library(gridExtra)
library(ggplot2)
library(scales)
set.seed(1)
p1 <- ggplot(hancock, aes(x = factor(ethnicity,
levels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White"),
labels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White")),
y = net)) +
geom_jitter(alpha = 0.15,
color = "dodgerblue2",
width = 0.2) +
labs(title = "Racial disparities in public work",
subtitle = "Hancock Airport, 2018",
x = NULL,
y = "Weekly Net",
caption = "") +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(3)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
p2 <- ggplot(hancock, aes(x = factor(ethnicity,
levels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White"),
labels = c("Hispanic", "Black", "Multiracial", "Indigenous", "White")),
y = net)) +
geom_boxplot(alpha = 0.15,
color = "dodgerblue2") +
labs(title = "",
subtitle = "",
x = NULL,
y = "Weekly Net",
caption = "Source: Syracuse Regional Airport Authority") +
scale_y_continuous(labels = dollar,
breaks = pretty_breaks(3)) +
coord_flip() +
theme_minimal() +
theme(text = element_text(family = "Quicksand"),
plot.title = element_text(vjust = 3,
face = "bold"),
plot.subtitle = element_text(vjust = 2),
axis.title.x = element_text(vjust = -0.666),
axis.title.y = element_text(vjust = 2),
plot.caption = element_text(vjust = -1.333))
grid.arrange(p1, p2, ncol=2)
```
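A minimal sketch of swapping only the geometry, assuming the built-in `iris` data in place of the construction records above. The object names `base_iris`, `iris_jitter`, and `iris_boxplot` are placeholders for this example.

```{r message=F, warning=F}
library(ggplot2)

# Shared data and aesthetics layers, with no geometry yet
base_iris <- ggplot(iris, aes(x = Species, y = Sepal.Length))

# Same data, same mappings, two different geometries
iris_jitter  <- base_iris + geom_jitter(width = 0.2, alpha = 0.5)
iris_boxplot <- base_iris + geom_boxplot()

iris_jitter
iris_boxplot
```

Storing the data and aesthetics in one object makes it easy to try several geometries on the same foundation.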
<br>
<br>
<div class = "tip">
PRO TIP
It seems like a small distinction, but the difference is critical:
* Both "Attributes" and "Aesthetics" use color, size, line width, fill, etc.
* "Attributes" are static and unchanging
* "Aesthetics" are dynamic and change with your data
<br>
All "aesthetics" are specified in the "aesthetics" layer.
All "attributes" are specified in the "geometries" layer.
</div>
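The distinction can be sketched in a few lines, assuming the built-in `iris` data. The names `color_mapped` and `color_fixed` are invented for this illustration.

```{r message=F, warning=F}
library(ggplot2)

# Aesthetic: color is mapped to a variable inside aes(), so it changes with the data
color_mapped <- ggplot(iris, aes(x = Sepal.Length,
                                 y = Petal.Length,
                                 color = Species)) +
  geom_point()

# Attribute: color is set to a constant inside the geometry, so every point matches
color_fixed <- ggplot(iris, aes(x = Sepal.Length,
                                y = Petal.Length)) +
  geom_point(color = "tomato")

color_mapped
color_fixed
```

Only the first plot gets a legend, because only a mapped aesthetic carries information about the data.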
```{r warning=F, message=F, echo=F}
rm(p1, p2, lakeview)
```
<br>
<br>
### Remaining Layers
The 4 remaining **layers** in the **grammar of graphics** include:
* The **coordinates layer** modifies plot zoom, truncation, and labeling
* The **statistics layer** performs statistical transformations like lines of best fit
* The **facets layer** creates multiple comparison plots, a.k.a. small multiples
* The **themes layer** modifies font, gridlines, and other "non-data ink" polish
<br>
We'll explore some key functions from these **layers** in the following sections.
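As a preview, here is a small sketch that stacks one function from each remaining layer on top of the three essential layers. It assumes the built-in `mtcars` data, and the object name `all_layers` is a placeholder; the specific functions are just one possible choice per layer.

```{r message=F, warning=F}
library(ggplot2)

all_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # Data and aesthetics
  geom_point() +                                       # Geometry
  geom_smooth(method = "lm", se = FALSE) +             # Statistics: line of best fit
  coord_cartesian(xlim = c(1, 6)) +                    # Coordinates: visible x range
  facet_wrap(~ cyl) +                                  # Facets: one panel per cylinder count
  theme_minimal()                                      # Themes: non-data ink

all_layers
```

Each line after the geometry can be removed without breaking the plot, which is what makes these layers nonessential.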
<br>
<br>
## Download the Practice Data
Our practice data are public construction project worker records in Syracuse, NY.
They were collected for a [**racial equity impact statement**](https://www.ujtf.org/reis) on hiring disparities.
<br>
**Download:** You can download the practice data [**here**](https://raw.githubusercontent.com/DS4PS/dp4ss-textbook/master/tables/hancock.csv).
<br>
**To Follow Along:** Run this code in your R console:
```{r eval=F}
library(readr)
url <- paste0("https://raw.githubusercontent.com/DS4PS/",
"dp4ss-textbook/master/tables/hancock.csv")
hancock <- read_csv(url)
rm(url)
```
<br>
```{r, echo=F, fig.align="center", fig.cap = 'The impact statement uses ggplot2 for the majority of its viz (p. 95).', out.width="90%", cache=T}
knitr::include_graphics("figures/reis_syracuse.jpg")
```
<br>
<br>
## Package "ggplot2"
Package "ggplot2" is a popular, powerful package for data visualization in R.
* Authored by Hadley Wickham; maintained by RStudio
* An implementation of the "Grammar of Graphics" framework (hence "gg")
* Has a series of function families, each corresponding to a different **layer**
Expressions in "ggplot2" use a particular syntax. Note that `+` connects the **layers**.
<br>
Let's look at a "complete" graph:
```{r message=F, warning=F, cache=T}
library(ggplot2)
ggplot(data = hancock) + # Data layer
aes(x = net) + # Aesthetics layer
geom_histogram() # Geometry layer
```
<br>
Note that the preferred "ggplot2" format is as follows (both do the same thing):
```{r eval=F}
ggplot(data = hancock,
aes(x = net)) + # "aes()" is nested in "ggplot()"
geom_histogram()
```
<br>
**Let's break it down.** Here are the three layers of a "ggplot2" graphic and how they work:
<br>
```{r, echo=F, fig.align="center", fig.cap = 'Breaking down a "complete" graphic.', out.width="90%", cache=T}
knitr::include_graphics("figures/ggplot2_recipe.jpg")
```
<br>
<br>
<div class="quiz">
<br>
YOUR TURN: A BASIC GGPLOT
<br>
1. Load the necessary packages with `library()`
2. Call `ggplot()` on the `economics` dataset from ggplot2
3. In `aes()`, map **x =** to `date` and **y =** to `unemploy`
4. Call `geom_line()`
<br>
A call to `theme_classic()` has been added for panache.
<br>
</div>
<br>
```{r ex="example-02", type="sample-code", tut=TRUE}
# Load required packages
library(ggplot2)
# Specify dataset and variable names