-
Notifications
You must be signed in to change notification settings - Fork 0
/
8.groupby_summarize.qmd
177 lines (133 loc) · 7.25 KB
/
8.groupby_summarize.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
title: "8. Group By and Summarize"
author: "Patrick Spauster"
format:
html:
code-copy: true
toc: true
toc-location: right
editor: visual
---
## Video Tutorial
```{=html}
<div style="max-width:608px"><div style="position:relative;padding-bottom:66.118421052632%"><iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1674401/sp/167440100/embedIframeJs/uiconf_id/23435151/partner_id/1674401?iframeembed=true&playerId=kaltura_player&entry_id=1_0c9gmqct&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[LeadWithHLSOnFlash]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_9o78atzo" width="608" height="402" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="8. group by summarize" style="position:absolute;top:0;left:0;width:100%;height:100%"></iframe></div></div>
```
## Why summarize data?
Having clean data is great, but when working with large datasets we are often looking for summary statistics to let us compare different groups. `group_by` and `summarize`, often used together in a pipe, are a powerful combo for generating statistics at the group level.
Summarizing allows us to compare means, medians, or top values based on different categories. It can also be a helpful data cleaning tool, depending on the level of observations in the data.
## Summarize
Summarize takes a data frame and a `...` list of new variables to generate.
```{r include = F}
library(tidyverse)
library(janitor)
?summarize()
```
Let's grab our code to read in the clean dataframe again.
```{r message = F}
fhv_clean <- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>%
clean_names() %>%
rename(hybrid = veh)
```
*Note - `summarize` is different than `summary` - a helpful function that provides the basic stats of all variables* *- a good first step when looking at a new data set*
```{r}
summary(fhv_clean)
```
With a basic application of `summarize`, we take all the rows of the dataframe and turn them into one row of data. Here we use mean() to get the average
```{r}
fhv_clean %>%
summarize(average_year = mean(vehicle_year))
#careful - NAs are sticky so mean() will return NA if there are any missing values. Use na.rm = T to exclude missing in calculating the average
fhv_clean %>%
summarize(average_year = mean(vehicle_year, na.rm = TRUE))
```
You can summarize multiple variables (`...)` at once. Look at summarize documentation for the full list of operations you can summarize with.
```{r}
fhv_clean %>%
summarize(total_cars = n(), #n() just counts the number of rows in the group
average_year = mean(vehicle_year, na.rm = TRUE),
median_year = median(vehicle_year, na.rm = T))
```
When using summarize it defaults to one big group - all the data is summarized. To summarize by group we can add `group_by()`
## Group_by
Where summarize becomes really powerful is pairing it with group_by
```{r}
fhv_clean %>%
group_by(hybrid) %>%
summarize(number_cars = n(),
mean_year = mean(vehicle_year, na.rm = T))
```
Often we are grouping on variables of interest and summarizing across that variable. Take the variable we made last time, for example
```{r}
fhv_type_summary <- fhv_clean %>%
mutate(
ride_type = case_when(
base_name == "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR RIDESHARE",
base_name != "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR NON-RIDESHARE",
TRUE ~ base_type #if it doesn't meet either condition, return the base_type
)) %>%
group_by(ride_type) %>% #group by the variable we just created!
summarize(no_cars = n(),
average_year = mean(vehicle_year, na.rm = T))
fhv_type_summary
```
Here we can see some interesting trends start to emerge, like how the livery cars tend to be the oldest and the luxury cars tend to be newer.
We can also group_by multiple variables, or group by logical expressions.
```{r}
fhv_clean %>%
group_by(hybrid, base_type) %>%
summarize(total_cars = n())
fhv_clean %>%
group_by(base_type, vehicle_year >= 2000) %>%
summarize(total_cars = n())
```
## Group_by with mutate
Pair group_by with mutate to create helpful summary level variables without reducing the number of rows in the dataset.
```{r}
fhv_clean %>%
group_by(base_name) %>%
mutate(total_by_name = n()) #variable with the total # of cars for each company name
```
Use ungroup() to return to normal mutation operations
```{r}
fhv_clean %>%
group_by(base_type) %>%
mutate(mean_by_type = mean(vehicle_year, na.rm =T)) %>%
ungroup() %>%
mutate(above_below_mean = if_else(
condition = vehicle_year > mean_by_type,
true = "above mean",
false = "below mean"
)) %>%
count(above_below_mean)
#this creates a variable to show if this car is above or below the average year for the group
```
## Normalizing with Groups
Group by, summarize, and mutate are crucial when comparing geographic areas. You can use group_by to get the proportion within a certain group.
Here's a full example of how to normalize within a group, using the data on new york housing authority apartments by borough. We want to find the proportion of NYCHA apartments that are Section 8 (vouchers) in each borough.
```{r message=F}
nycha <- read_csv('https://data.cityofnewyork.us/resource/evjd-dqpz.csv') %>%
clean_names()
# Filter for developments that contain any Section 8 transition apartments
section8devs <- nycha %>%
filter(number_of_section_8_transition_apartments > 0)
# Get some stats on number of section 8 apts across all developments by borough
section8devs_by_boro <- section8devs %>%
group_by(borough) %>%
summarize(section8apts = sum(number_of_section_8_transition_apartments),
totalapts = sum(total_number_of_apartments),
median_section8apts = median(number_of_section_8_transition_apartments),
avg_section8apts = mean(number_of_section_8_transition_apartments))
# Is this all the info we need? What's missing? (Normalization)
# Section 8 units as a share of all units in mixed finance developments, by borough
section8devs_grouped <- section8devs %>%
group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments),
total_apts = sum(total_number_of_apartments))%>%
mutate(s8_share = total_s8_apts / total_apts)
# Section 8 units as a share of all NYCHA units, by borough
nycha_grouped <- nycha %>%
group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments, na.rm=TRUE),
total_apts = sum(total_number_of_apartments, na.rm=TRUE))%>%
mutate(s8_share = total_s8_apts / total_apts)
```