---
title: "Embrace Your Fallibility"
abstract: "This section will motivate and outline the workflows and tools we will adopt to promote reproducible research."
format:
  html:
    code-line-numbers: true
    fig-align: "center"
execute:
  echo: false
editor_options:
  chunk_output_type: console
bibliography: references.bib
---
![The Duck-Rabbit Illusion](images/duck-rabbit.png){#fig-duck-rabbit}
```{r}
#| echo: false
# initialize the counter used to number exercises
exercise_number <- 1
```
```{r}
#| message: false
# attach packages for data manipulation (tidyverse) and tables (gt)
library(tidyverse)
library(gt)

# create the `motivation` data frame used in the tables below
source(here::here("src", "motivation.R"))
```
## Embrace Your Fallibility
### Why are we here?
The unifying interest of data scientists and statisticians is that **we want to learn about the world using data.**
Working with data has always been tough. It has always been difficult to create analyses that are
1. Accurate
2. Reproducible and auditable
3. Collaborative
Working with data has **gotten tougher with time**. Data sources, methods, and tools have become more sophisticated. This leaves a lot of us stressed out because errors and mistakes feel inevitable and are embarrassing.
### What are we going to do?
Errors and mistakes are **inevitable**. It's time to [embrace our fallibility](https://www.nickeubank.com/wp-content/uploads/2016/06/Eubank_EmbraceYourFallibility.pdf).
In *The Field Guide to Understanding Human Error* [@dekker2014], Sidney Dekker argues that there are two paradigms:
1. **Old-World View:** errors are the fault of individuals
2. **New-World View:** errors are the fault of flawed systems that fail individuals
::: {.callout-note}
Errors and mistakes are **inevitable**. This is our gestalt moment, like the famous duck-rabbit illusion in @fig-duck-rabbit.
When we look at the demands of research and data analysis, we can no longer see grit as sufficient. Instead, we need to see and use systems that don't fail us as statisticians and data scientists.
:::
### How are we going to do this?
Errors and mistakes are **inevitable**. We want to adopt evidence-based best practices that minimize the probability of making an error and maximize the probability of catching an inevitable error.
@parker describes a process called opinionated analysis development.
We're going to adopt the approaches outlined in *Opinionated Analysis Development* and then actually implement them using modern data science tools.
Through years of applied data analysis, I've found these tools to be essential for creating analyses that are
1. Accurate
2. Reproducible and auditable
3. Collaborative
## Why is Modern Data Analysis Difficult to do Well?
Working with data has **gotten tougher with time**.
- Data are larger on average. For example, the Billion Prices Project scraped prices from around the world to build inflation indices [@cavallo2016].
- Complex data collection efforts are more common. For example, @chetty2019 have gained access to massive administrative datasets and used formal privacy methods to understand intergenerational mobility.
- Open-source packages provide incredible functionality for free, but they change over time.
- Papers like "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time" [@gelman2013garden] and "Why Most Published Research Findings Are False" [@ioannidis2005] have motivated huge increases in transparency, including focuses on pre-registration and computational reproducibility.
> There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to include or exclude, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p \< .05 result. The short version is that it’s easy to find a p \< .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with. ~ Gelman and Loken
From this perspective, what we need is a Truman Show for every researcher, where we can watch every nuance of their decision-making. **That's impractical!** But from the standpoint of good science, pulling back the curtain with computational reproducibility is a way to mitigate these concerns.
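
To make this concrete, here is a minimal simulation (a sketch of mine, not code from Gelman and Loken): every measure is pure noise, yet checking 20 comparisons per study turns up at least one p \< .05 result in roughly 1 - 0.95^20, or about 64 percent, of studies.

```{r}
#| echo: true
set.seed(20)

# one simulated study: test 20 independent noise-only "measures" against zero
one_study <- function(n_comparisons = 20, n = 100) {
  p_values <- replicate(
    n_comparisons,
    t.test(rnorm(n))$p.value
  )

  any(p_values < 0.05)
}

# share of 1,000 simulated studies with at least one "significant" result
mean(replicate(1000, one_study()))
```
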
Even for a simple analysis, we can ask ourselves an entire set of questions at the end. @tbl-questions lists a few of these questions.
```{r}
#| label: tbl-questions
#| tbl-cap: "Opinionated Analysis Development"
motivation |>
  select(`Question Addressed`) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
## What are we going to do?
::: {.callout-tip}
## Replication
Replication is the recreation of findings across repeated studies. It is a cornerstone of science.
:::
::: {.callout-tip}
## Reproducibility
Reproducibility is the ability to access data, source code, tools, and documentation and recreate all calculations, visualizations, and artifacts of an analysis.
:::
Computational reproducibility *should* be the minimum standard for computational social sciences and statistical programming.
We are going to center reproducibility in the practices we maintain, the tools we use, and the culture we foster. By centering reproducibility, we will be able to create analyses that are
1. Accurate
2. Reproducible and auditable
3. Collaborative
@tbl-features groups these questions into analysis features and suggests an opinionated approach to each question.
```{r}
#| label: tbl-features
#| tbl-cap: "Opinionated Analysis Development"
motivation |>
  select(-Tool, -Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  cols_hide(`Analysis Feature`) |>
  tab_row_group(
    label = md("**Collaborative**"),
    rows = `Analysis Feature` == "Collaborative"
  ) |>
  tab_row_group(
    label = md("**Accurate Code**"),
    rows = `Analysis Feature` == "Accurate Code"
  ) |>
  tab_row_group(
    label = md("**Reproducible and Auditable**"),
    rows = `Analysis Feature` == "Reproducible and Auditable"
  ) |>
  tab_footnote(
    footnote = "This was originally 'Code Review'",
    locations = cells_body(columns = `Opinionated Approach`, rows = 7)
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
## How are we going to do this?
@tbl-tools lists specific tools we can use to adopt each opinionated approach.
```{r}
#| label: tbl-tools
#| tbl-cap: "Opinionated Analysis Development"
motivation |>
  select(-Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  cols_hide(`Analysis Feature`) |>
  tab_row_group(
    label = md("**Collaborative**"),
    rows = `Analysis Feature` == "Collaborative"
  ) |>
  tab_row_group(
    label = md("**Accurate Code**"),
    rows = `Analysis Feature` == "Accurate Code"
  ) |>
  tab_row_group(
    label = md("**Reproducible and Auditable**"),
    rows = `Analysis Feature` == "Reproducible and Auditable"
  ) |>
  tab_footnote(
    footnote = "This was originally 'Code Review'",
    locations = cells_body(columns = `Opinionated Approach`, rows = 7)
  ) |>
  tab_footnote(
    footnote = "We will not spend much time on these topics",
    locations = cells_body(
      columns = Tool,
      rows = Tool %in% c("library(targets)", "library(microbenchmark)")
    )
  ) |>
  tab_footnote(
    footnote = "Added by Aaron R. Williams",
    locations = cells_column_labels(columns = Tool)
  ) |>
  tab_style(
    style = list(cell_fill(color = "gray80")),
    locations = cells_body(
      columns = Tool,
      rows = Tool %in% c("library(targets)", "library(microbenchmark)")
    )
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
### Bonus stuff!
Adopting these opinionated approaches and tools promotes reproducible research. It also provides a bunch of great bonuses.
- Reproducible analyses are easy to scale. Using tools we will cover, we created almost [4,000 county- and city-level websites](https://upward-mobility.urban.org/measuring-upward-mobility-counties-and-cities-across-us).
- GitHub offers free web hosting for books and web pages like the notes we're viewing right now.
- Quarto makes it absurdly easy to build beautiful websites and PDFs, as the sketch below shows.
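
A minimal `_quarto.yml` along these lines (the title and theme here are placeholders, not this project's configuration) is enough to turn a folder of `.qmd` files into a website:

``` yaml
project:
  type: website

website:
  title: "Reproducible Research"

format:
  html:
    theme: cosmo
```

From there, `quarto render` builds the site and `quarto publish gh-pages` pushes it to free GitHub Pages hosting.
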
## Roadmap
The day roughly follows the process for setting up a reproducible data analysis.
- **Project organization** will cover how to organize all of the files of a data analysis so they are clear and so they work well with other tools.
- **Literate programming** will cover Quarto, which will allow us to combine narrative text, code, and the output of code into clear artifacts of our data analysis.
- **Version control** will cover Git and GitHub. These will allow us to organize the process of reviewing and merging code.
- In **programming**, we'll discuss best practices for writing code for data analysis, like writing modular, well-tested functions and assertive testing of data, assumptions, and results (see the first sketch after this list).
- **Environment management** will cover `library(renv)` and the process of managing package dependencies while using open-source code (see the second sketch after this list).
- If we have time, we can discuss how a positive climate and ethical practices can improve transparency and strengthen science.
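
As a preview of assertive testing, here is a minimal sketch; the `clean_prices()` function and its checks are hypothetical, just for illustration.

```{r}
#| echo: true
# assertive tests fail loudly instead of letting bad data flow downstream
clean_prices <- function(prices) {
  # check assumptions about the input
  stopifnot(
    is.numeric(prices),
    !anyNA(prices)
  )

  cleaned <- pmax(prices, 0)

  # check the result before returning it
  stopifnot(all(cleaned >= 0))

  cleaned
}
```
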
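And as a preview of environment management, the core `library(renv)` workflow is only three calls (shown with `eval: false` because they modify the project):

```{r}
#| echo: true
#| eval: false
# create a project-specific package library
renv::init()

# record the exact package versions in an renv.lock file
renv::snapshot()

# later, or on a collaborator's machine, reinstall those exact versions
renv::restore()
```
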
We sort @tbl-tools into @tbl-roadmap-prime, which outlines the structure of the rest of the day.
```{r}
#| label: tbl-roadmap-prime
#| tbl-cap: "Opinionated Analysis Development"
motivation |>
  filter(!is.na(Section)) |>
  select(-`Analysis Feature`) |>
  arrange(Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  tab_footnote(
    footnote = "This was originally 'Code Review'",
    locations = cells_body(columns = `Opinionated Approach`, rows = 7)
  ) |>
  tab_footnote(
    footnote = "Added by Aaron R. Williams",
    locations = cells_column_labels(columns = c(Tool, Section))
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```