forked from lawanin/police
complaints.Rmd
---
title: "Police Complaints"
author: "David Kane"
date: "6/6/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
# It is somewhat sloppy to include a read_csv() command in the setup chunk.
# Normally, we would just load libraries here. However, we have not learned
# about the col_types argument to read_csv() yet, so we can't make the annoying
# message go away unless we stick the call in a code chunk with a code chunk
# option like message = FALSE or include = FALSE.
raw_data <- read_csv("https://raw.githubusercontent.com/Financial-Times/police-misconduct-complaints-analysis/main/output/philly_clean.csv") %>%
  select(officer_id)
```
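The setup comment mentions the `col_types` argument to `read_csv()`. As a minimal sketch, not the author's code, the following shows how declaring column types up front leaves `read_csv()` with nothing to guess, so the column-specification message does not appear. The inline toy CSV, the `district` column, and the choice of `col_character()` for `officer_id` are assumptions made for illustration; the real file's column types were not checked.

```r
library(readr)

# Hypothetical two-column CSV standing in for the real file.
# Wrapping a string in I() marks it as literal data (readr 2.0.0 or later).
toy_csv <- I("officer_id,district\nA1,north\nA2,south\n")

# cols_only() reads just the named columns. Because the type is declared
# explicitly, read_csv() does not print its column-specification message.
# Assumption: officer_id is treated as character here.
toy_raw <- read_csv(
  toy_csv,
  col_types = cols_only(officer_id = col_character())
)
```

With the types declared this way, the `read_csv()` call would no longer need to sit in a chunk with `include = FALSE` just to hide the message.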
```{r, clean_data}
# This code makes a table showing how many times each officer ID appears in
# the tibble, i.e., how many complaints each officer has.
clean_data <- raw_data %>%
  group_by(officer_id) %>%
  summarise(total = n()) %>%
  # We now want to know which decile (1 to 10) each officer falls in, based on
  # the number of complaints they have. We could use mutate and percentile, as
  # we did multiple times in Wrangling B. But ntile() accomplishes the same
  # with less code.
  mutate(compl_dec = ntile(total, 10)) %>%
  # As you build a pipe, you want to look at the result after each step to make
  # sure it does what you want. Only after it is working would you then assign
  # the result to an object which you can use later.
  # We want to know the total number of complaints in each decile of officers.
  group_by(compl_dec) %>%
  summarise(compl_total = sum(total)) %>%
  # The graph needs each decile's complaints as a share of all complaints,
  # which is easy to calculate. Then, we keep only the variables we need for
  # the plot.
  mutate(compl_perc = compl_total / sum(compl_total)) %>%
  select(compl_dec, compl_perc)
```
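To make the count, bucket, and share steps above concrete, here is a small sketch on made-up data. It uses two buckets instead of ten so the result can be checked by hand; the `toy` tibble and the `half` column are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: one row per complaint, identified by officer.
toy <- tibble(officer_id = c("a", "a", "a", "a", "b", "b", "c", "d"))

toy_shares <- toy %>%
  # Complaints per officer: a = 4, b = 2, c = 1, d = 1.
  count(officer_id, name = "total") %>%
  # Split officers into two equal-sized buckets by complaint count
  # (1 = fewer complaints, 2 = more complaints).
  mutate(half = ntile(total, 2)) %>%
  group_by(half) %>%
  summarise(compl_total = sum(total)) %>%
  # Each bucket's share of all complaints.
  mutate(compl_perc = compl_total / sum(compl_total))
```

Here the bottom half of officers (c and d) accounts for 2 of 8 complaints and the top half (a and b) for 6 of 8. One caveat worth keeping in mind: `ntile()` forces buckets of equal size, so officers with the same number of complaints can land in different buckets.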
```{r, plot_data}
# We could just have one giant pipe which goes directly into ggplot(), like we
# do in the tutorials. There is nothing wrong with that approach, but it is
# often easier to split your work up into separate parts, the better to make
# sure that each part is doing what you want.
clean_data %>%
  ggplot(aes(x = compl_dec, y = compl_perc)) +
    geom_col() +
    labs(title = "Distribution of Police Complaints in Philadelphia",
         subtitle = "A tenth of officers get a third of the complaints",
         x = "Complaint Decile",
         y = NULL,
         caption = "Data from Financial Times") +
    scale_x_continuous(breaks = 1:10) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1))
```
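The `scale_y_continuous()` call in the plot relies on `scales::percent_format()`, which returns a labelling function rather than a vector of labels. A short sketch of what that function does, using made-up proportions rather than values from the complaints data (the names `to_percent` and `labels` are illustrative):

```r
library(scales)

# percent_format() builds a function; accuracy = 1 rounds to whole percents.
to_percent <- percent_format(accuracy = 1)

# Made-up proportions, not values from the complaints data.
labels <- to_percent(c(0.05, 0.333))
```

Applied to these inputs, `labels` holds the strings "5%" and "33%", which is the form in which ggplot2 prints the y-axis ticks.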