forked from jcoliver/learn-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
004-intro-ggplot.Rmd
377 lines (292 loc) · 13 KB
/
004-intro-ggplot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
---
title: "Introduction to data visualization with ggplot"
author: "Jeff Oliver"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
css: stylesheets/markdown-styles.css
pdf_document:
latex_engine: xelatex
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.height = 4)
```
An introduction to using the `ggplot` package in R to produce publication-quality graphics.
#### Learning objectives
1. Install and use third-party packages for R
2. Use layering to add elements to a plot
3. Format plots with faceting
## Setup
### Workspace organization
First we need to setup our development environment. We need to create two folders: 'data' will store the data we will be analyzing, and 'output' will store the results of our analyses.
```{r, eval = FALSE}
dir.create(path = "data")
dir.create(path = "output")
```
With this workspace organization, we can download the data, either manually or use R to automatically download it. For this lesson we'll do the latter, saving the file in the `data` directory we just created, and name the file `gapminder.csv`:
```{r, eval = FALSE}
download.file(url = "http://tinyurl.com/gapminder-five-year-csv",
destfile = "data/gapminder.csv")
```
From this point on, we want to keep track of what we have done, so we will restrict our use of the console and instead use script files. Start a new script with some brief header information at the very top. We want, at the very least, to include:
1. A short description of what the script does (no more than one line of text)
2. Your name
3. A means of contacting you (e.g. your e-mail address)
4. The date, preferably in ISO format YYYY-MM-DD
```{r}
# Plot gapminder data
# Jeff Oliver
# 2017-02-23
```
### Installing additional packages
There are two steps to using additional packages in R:
1. Install the package through `install.packages()`
2. Load the package into active memory with `library()`
For this exercise, we will install the `ggplot2` package:
```{r eval = FALSE}
install.packages("ggplot2")
library("ggplot2")
```
It is important to note that for each computer you work on, the `install.packages` command need only be issued once, while the call to `library` will need to be issued for each session of R. Because of this, it is standard convention in scripts to comment out the `install.packages` command once it has been run on the machine, so our script now looks like:
```{r}
# Plot gapminder data
# Jeff Oliver
# 2017-02-23
#install.packages("ggplot2")
library("ggplot2")
```
## Now plot!
### Scatterplot
For our first plot, we will create an X-Y scatterplot to investigate a potential relationship between a country's gross domestic product (GDP) and the average life expectancy. We do so by creating a `ggplot` object, then calling `print` on that object:
```{r}
# Load data
gapminder <- read.csv(file = "data/gapminder.csv")
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp))
# Draw plot
print(lifeExp.plot)
```
_What happened?_ There are no points!? Here is where functionality of ggplot is evident. The way it works is by effectively drawing layer upon layer of graphics. So we have established the plot, but we need to add one more bit of information to tell ggplot what to put in that plot area. For a scatter plot, we use `geom_point()`, literally adding this to the ggplot object with a plus sign (+):
```{r}
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point()
# Draw plot
print(lifeExp.plot)
```
It is a little difficult to see how the points are distributed, as most are clustered on the left-hand side of the graph. To spread this distribution out, we can change the scale of the x-axis so GDP is displayed on a log scale by adding `scale_x_log10`:
```{r}
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10()
# Draw plot
print(lifeExp.plot)
```
One thing of interest is to include additional information in the plot, such as which continent each point comes from. We can color points by another value in our data through the `aes` parameter in the initial call to `ggplot`. So in addition to telling R which data to use for the x and y axes, we indicate which data to use for point colors:
```{r}
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10()
# Draw plot
print(lifeExp.plot)
```
At this point, we want to change the colors of the points in two ways: (1) we want to make them slightly transparent because there is considerable overlap in some part of the graph; we do this by setting the `alpha` parameter in the `geom_point` call; (2) we want to use custom colors for each continent; this is done with the `scale_color_manual` function:
```{r}
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
scale_x_log10() +
scale_color_manual(values = c("red", "orange", "forestgreen", "darkblue", "violet"))
# Draw plot
print(lifeExp.plot)
```
Finally, we should make those axis labels a little nicer.
```{r}
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
scale_x_log10() +
scale_color_manual(values = c("red", "orange", "forestgreen", "darkblue", "violet")) +
xlab("GDP per capita") +
ylab("Life Expectancy")
# Draw plot
print(lifeExp.plot)
```
And if we want to save the plot to a file, we call `ggsave`, passing a filename and the plot object we want to save (the latter is optional, if we don't indicate which plot to save, `ggsave` will save the last plot that was displayed):
```{r, eval = FALSE}
# Save plot to png file
ggsave(filename = "output/gdp-lifeExp-plot.png", plot = lifeExp.plot)
```
Also note that `ggsave` will guess the type of file to save from the extension. For example, if we instead wanted to save a TIF instead of a PNG, we change the extension for the `filename` argument:
```{r, eval = FALSE}
# Save plot to tif file
ggsave(filename = "output/gdp-lifeExp-plot.tiff", plot = lifeExp.plot)
```
Our final script for this scatterplot is then:
```{r, eval = FALSE}
# Plot gapminder data
# Jeff Oliver
# 2017-02-23
#install.packages("ggplot2")
library("ggplot2")
# Load data
gapminder <- read.csv(file = "data/gapminder.csv")
# Create plot object
lifeExp.plot <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
scale_x_log10() +
scale_color_manual(values = c("red", "orange", "forestgreen", "darkblue", "violet")) +
xlab("GDP per capita") +
ylab("Life Expectancy")
# Draw plot
print(lifeExp.plot)
# Save plot to png file
ggsave(filename = "output/gdp-lifeExp-plot.png", plot = lifeExp.plot)
```
***
### Violin plot
```{r, eval = FALSE}
# Violin plot of GDP data
# Jeff Oliver
# 2017-02-23
rm(list = ls())
#install.packages("ggplot2")
library("ggplot2")
```
Now with this call to `rm`, we have cleared out all our variables from memory, so we have to load the `gapminder` data again. To create a violin plot, we use the same approach as for a boxplot, but instead of `geom_point`, we use `geom_violin`:
```{r fig.height = 3}
# Load data
gapminder <- read.csv(file = "data/gapminder.csv")
# Create violin plot object
gdp.plot <- ggplot(data = gapminder,
mapping = aes(x = continent, y = gdpPercap)) +
geom_violin()
# Print plot
print(gdp.plot)
```
A good start, but we can see we need to clean some things up. We'll start by changing the Y-axis to a log scale with `scale_y_log10` and adding a nicer looking axis title with `ylab`. Also, we will remove the gray background that comes with the default theme by using `theme_bw`:
```{r fig.height = 3}
# Create violin plot object
gdp.plot <- ggplot(data = gapminder,
mapping = aes(x = continent, y = gdpPercap)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw()
# Print plot
print(gdp.plot)
```
One of the issues we have in the plot is not obvious just by looking at it. If we take a look at the data using `summary`, note the values in the `year` vector:
```{r}
summary(gapminder)
```
There are actually multiple years of data (1952 - 2007, at five year increments). So rather than plotting GDP for all years together, we should separate out each year of data. The `ggplot2` package makes this _very_ easy with faceting. To break the data apart into separate graphs for each year, we use `facet_wrap`, passing `year` as the column in the `gapminder` data we want to use for each graph:
```{r}
# Create violin plot object
gdp.plot <- ggplot(data = gapminder,
mapping = aes(x = continent, y = gdpPercap)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw() +
facet_wrap(~ year)
# Print plot
print(gdp.plot)
```
One of the first things you might notice is that the X-axis is way to crowded now. Instead of using text to indicate each continent, we could instead use a separate color to indicate each continent, like we did in the scatterplot. We'll use `scale_fill_manual` for this purpose, and use the `theme` function to indicate that we do not want any x-axis text or an x-axis title.
```{r}
# Create violin plot object
gdp.plot <- ggplot(data = gapminder,
mapping = aes(x = continent, y = gdpPercap)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw() +
facet_wrap(~ year) +
scale_fill_manual(values = c("red", "orange", "forestgreen", "darkblue", "violet")) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank())
# Print plot
print(gdp.plot)
```
Okay, the x-axis is cleaned up, but where are the colors? If we look at our code, note that we said what colors we wanted, but not what they corresponded to. We need to explicitly tell ggplot that we want to fill the shapes by continent; this happens in the very first `ggplot` call, by passing an additional argument, `fill` to the `aes` function:
```{r}
# Create violin plot object
gdp.plot <- ggplot(data = gapminder,
mapping = aes(x = continent, y = gdpPercap, fill = continent)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw() +
facet_wrap(~ year) +
scale_fill_manual(values = c("red", "orange", "forestgreen", "darkblue", "violet")) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank())
# Print plot
print(gdp.plot)
```
Colors! But now you notice another problem with the plot - where's Oceania? We lost Australia and New Zealand! The problem is that there are now too few data points in the Oceania group in each map to create a violin plot (you need at least three points, and with only those two countries, there are only two points of data for each year). So for this plot, we are going to go ahead and explicitly omit the Oceania data from the plot. We do this by subsetting the `gapminder` data in the `ggplot` call:
```{r}
# Load data
gapminder <- read.csv(file = "data/gapminder.csv")
# Create violin plot object
gdp.plot <- ggplot(data = gapminder[gapminder$continent != "Oceania", ],
mapping = aes(x = continent, y = gdpPercap, fill = continent)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw() +
facet_wrap(~ year) +
scale_fill_manual(values = c("red", "orange", "forestgreen", "darkblue")) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank())
# Print plot
print(gdp.plot)
```
Success!
Our final script for this violin plot is then:
```{r eval = FALSE}
# Violin plot of GDP data
# Jeff Oliver
# 2017-02-23
rm(list = ls())
#install.packages("ggplot2")
library("ggplot2")
# Load data
gapminder <- read.csv(file = "data/gapminder.csv")
# Create violin plot object
gdp.plot <- ggplot(data = gapminder[gapminder$continent != "Oceania", ],
mapping = aes(x = continent, y = gdpPercap, fill = continent)) +
geom_violin() +
scale_y_log10() +
ylab("GDP per capita") +
theme_bw() +
facet_wrap(~ year) +
scale_fill_manual(values = c("red", "orange", "forestgreen", "darkblue")) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank())
# Print plot
print(gdp.plot)
```
***
## Additional resources
+ Official [ggplot documentation](http://docs.ggplot2.org/current/)
+ A handy [cheatsheet for ggplot](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)
+ A [PDF version](https://jcoliver.github.io/learn-r/004-intro-ggplot.pdf) of this lesson
***
<a href="index.html">Back to learn-r main page</a>
Questions? e-mail me at <a href="mailto:[email protected]">[email protected]</a>.