---
title: "Reshape"
---
## Module Learning Objectives
By the end of this module, you will be able to:
- <u>Contrast</u> "long" data with "wide" data
- <u>Use</u> `tidyr`'s `pivot_wider` and `pivot_longer` functions to reshape data
```{r libraries-reshape, echo = F, message = F}
library(tidyverse); library(palmerpenguins)
```
## Defining "Shape"
Before talking about *how* to reshape your data between wide and long format, let's talk about *what* "shape" means in reference to data. Broadly speaking, "long" data tend to have more rows than columns while "wide" data tend to have more columns than rows.
For example, in community ecology a "wide" dataframe could have one row per site that researchers visited and one column per species, where each cell holds the number of individuals of that species at that site. On the other hand, the `penguins` dataframe we've been working with so far is in "long" format because it has one row per penguin, with the penguins stacked on top of one another.
Both wide and long format data can be useful in certain contexts and it is sometimes most intuitive to reshape data from one form to the other (and sometimes back again to the original form!).
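
To make the contrast concrete, here is a minimal sketch of the community ecology example written both ways. The site names, species names, and counts are invented purely for illustration; they are not real survey data.

```{r shape-sketch, message = F}
# Load the package used to build the example tables by hand
library(tibble)

# "Wide" format: one row per site, one column per species
# (all names and counts here are hypothetical)
community_wide <- tibble::tribble(
  ~site,    ~species_a, ~species_b, ~species_c,
  "site_1", 3,          0,          5,
  "site_2", 1,          2,          0
)

# "Long" format: one row per site-by-species combination
community_long <- tibble::tribble(
  ~site,    ~species,    ~count,
  "site_1", "species_a", 3,
  "site_1", "species_b", 0,
  "site_1", "species_c", 5,
  "site_2", "species_a", 1,
  "site_2", "species_b", 2,
  "site_2", "species_c", 0
)

# Compare the number of rows and columns in each version
dim(community_wide)
dim(community_long)
```

Both objects hold the same six counts: the wide version spreads them across three species columns, while the long version stacks them into a single "count" column.
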
## Reshaping Data
The `tidyr` package contains the intuitively-named `pivot_wider` and `pivot_longer` for doing exactly this reshaping.
To help demonstrate these two functions, let's begin by summarizing our dataframe to make changing the shape of the dataframe more visible than it would be with the full dataframe. For example, let's calculate the average bill length of each penguin species on each island.
```{r data-prep, message = F}
# Begin by naming our new data and the data they come from
penguins_simp <- penguins %>%
  # Now group by species and island
  dplyr::group_by(species, island) %>%
  # Calculate average bill length
  dplyr::summarize(avg_bill_length_mm = mean(bill_length_mm, na.rm = TRUE)) %>%
  # And don't forget to ungroup!
  dplyr::ungroup()
# And this is what we're left with:
penguins_simp
```
Great! We can use this smaller data object to demonstrate reshaping more clearly. Let's begin with an example for `pivot_wider`.
### `pivot_wider` Example: Reshaping to Wide Format
:::callout-note
## Example
`pivot_wider` takes long format data and reshapes it into wide format.
<p align="center">
<img src="images/reshape-pivot-wide.png" alt="Graphic of a table with 'A' and 'B' columns being pivoted to a table with 'A', 'C', and 'D' columns and a 'B' row" width="50%" />
</p>
Let's say that we want to take that data object and reshape it into wide format so that each island is a column and each species of penguin is a row. The contents of each cell then are going to be the average bill length values that we just calculated.
```{r pivot_wider, message = F}
# Begin by naming the objects
penguins_wide <- penguins_simp %>%
  # And now we can pivot wider with `pivot_wider`!
  tidyr::pivot_wider(names_from = island,
                     values_from = avg_bill_length_mm)
# Take a look!
penguins_wide
```
Great! We now have each island as a column, each row is a penguin species, and each cell holds the average bill length we calculated. Note that in this specific case the column names no longer say what the numbers measure, so we might want to use `dplyr`'s `select` or the more specific `rename` to change the island names and make it clear that those values are bill lengths in millimeters.
:::
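
As one possible way to make the wide column names less ambiguous, the sketch below rebuilds the wide table from `penguins` (so the chunk runs on its own) and then renames the island columns with `dplyr::rename`. The new column names and the `penguins_wide_named` object name are illustrative choices, not names used elsewhere in this module.

```{r pivot-wider-rename-sketch, message = F}
# Load the needed packages
library(dplyr); library(tidyr); library(palmerpenguins)

penguins_wide_named <- penguins %>%
  # Recreate the summarized, wide-format table
  dplyr::group_by(species, island) %>%
  dplyr::summarize(avg_bill_length_mm = mean(bill_length_mm, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = island,
                     values_from = avg_bill_length_mm) %>%
  # Rename with `new_name = old_name` (the new names are hypothetical)
  dplyr::rename(Biscoe_bill_mm = Biscoe,
                Dream_bill_mm = Dream,
                Torgersen_bill_mm = Torgersen)

# Take a look!
penguins_wide_named
```

`pivot_wider` also has a `names_prefix` argument that adds the same text to the start of every new column name, which may be more convenient when there are many columns to relabel.
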
### `pivot_longer` Example: Reshaping to Long Format
:::callout-note
## Example
Now that we have a small wide format data object, we can feed it to `pivot_longer` and reshape our data into long format! `pivot_longer` has very similar syntax *except* that with `pivot_longer` you need to tell the function which columns should be reshaped.
`pivot_wider` on the other hand knows which columns to move around because you manually specify them in the "names_from" and "values_from" arguments.
<p align="center">
<img src="images/reshape-pivot-long.png" alt="Graphic of a table with 'A' and 'B' columns being pivoted to a table with 'C' and 'D' columns and 'A' and 'B' rows" width="50%" />
</p>
```{r pivot_longer, message = F}
# Begin with our wide data
penguins_wide %>%
  # And reshape back into long format
  tidyr::pivot_longer(cols = -species,
                      names_to = "island_name",
                      values_to = "mean_bill_length_mm")
```
Two quick things to note here:

- First, `pivot_longer` kept the cells that were NA in the wide version of the data.
    - This default behavior is helpful because no cells are silently lost (though you can always `filter` them out if you don't want them!).
- Second, in the "cols" argument I only told `pivot_longer` *not* to include the "species" column, using the same notation you could use for the `select` function in the `dplyr` package.
    - This is very handy because it keeps the "cols" argument concise: the columns to reshape become "everything *except* what was specified".
    - Note that we could have also said `cols = c(Biscoe, Dream, Torgersen)` and achieved the same reshaping of the data.
:::
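
The two points above can be checked with a short sketch. It rebuilds the wide table from `penguins` so the chunk is self-contained, compares the two ways of specifying "cols", and then uses `pivot_longer`'s `values_drop_na` argument as an alternative to filtering out the NA rows afterwards. The object names (`penguins_wide_sketch`, `long_a`, `long_b`, `long_no_na`) are placeholders chosen for illustration.

```{r pivot-longer-cols-sketch, message = F}
# Load the needed packages
library(dplyr); library(tidyr); library(palmerpenguins)

# Recreate the summarized, wide-format table
penguins_wide_sketch <- penguins %>%
  dplyr::group_by(species, island) %>%
  dplyr::summarize(avg_bill_length_mm = mean(bill_length_mm, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = island,
                     values_from = avg_bill_length_mm)

# Option A: reshape every column except "species"
long_a <- penguins_wide_sketch %>%
  tidyr::pivot_longer(cols = -species,
                      names_to = "island_name",
                      values_to = "mean_bill_length_mm")

# Option B: name the island columns explicitly (note the `c()`)
long_b <- penguins_wide_sketch %>%
  tidyr::pivot_longer(cols = c(Biscoe, Dream, Torgersen),
                      names_to = "island_name",
                      values_to = "mean_bill_length_mm")

# Check whether the two options give the same result
all.equal(long_a, long_b)

# Drop the NA cells during the reshape instead of filtering afterwards
long_no_na <- penguins_wide_sketch %>%
  tidyr::pivot_longer(cols = -species,
                      names_to = "island_name",
                      values_to = "mean_bill_length_mm",
                      values_drop_na = TRUE)

# Compare the number of rows with and without the NA cells
nrow(long_a)
nrow(long_no_na)
```

Dropping NAs during the reshape is convenient, but it does turn explicit missing values into implicit ones, so it is worth doing only when those island-by-species combinations are truly not needed.
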
### Challenge: Reshaping
:::callout-important
## Your Turn!
The code below calculates the average flipper length of Adelie penguins of known sex on each island; what code would you add to reshape the data so that each sex is a column with the average flipper lengths in the cells?
```{r reshape-challenge_data-prep, message = F}
penguins %>%
  # Keep only Adelie penguins of known sex
  dplyr::filter(species == "Adelie" & !is.na(sex)) %>%
  # Calculate the average flipper length by island and sex
  dplyr::group_by(island, sex) %>%
  dplyr::summarize(avg_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)) %>%
  # Ungroup (good practice to include this step!)
  dplyr::ungroup()
```
:::