-
Notifications
You must be signed in to change notification settings - Fork 2
/
bayes_notes.Rmd
435 lines (244 loc) · 17.2 KB
/
bayes_notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
# Bayesian approaches {#bayes}
```{block2, type='note'}
I take notes on several different resources/texts below. Ultimately I'll try to integrate these into a single set of notes.
```
## My (David Reinstein's) uses for Bayesian approaches (brainstorm)
### Meta-analysis of previous evidence
- Of prior work, especially on motivators of (effective) charitable giving and responses to effectiveness information
- Of my own series' of experiments (potentially joint with prior work)
### Inference, particularly about 'null effects'
When/what can we say about the 'absence of an effect'
How to integrate into inferences from diagnostic testing (e.g., common-trend assumption)?
### 'Policy' and business implications and recommendations
In particular, in a charitable giving social-media fundraising context, we might consider whether it is worth offering 'seed contributions' to encourage giving on existing pages. If so, 'which pages should we seed and how much?'
\
### Theory-driven inference about optimizing agents, esp. in strategic settings
- Especially in the context od 'predicted contributions to public goods... and 2nd order beliefs'
### Experimental design
- Optimal treatment assignment, with previous observables and a track record
- Sequential designs
- Bayesian Power calculation\
```{r}
#sessionInfo()
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Package loadings from Kurtz:
```{r, echo = TRUE, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
ggplot2::theme_set(ggplot2::theme_grey())
bayesplot::color_scheme_set("blue")
library(tidyverse) #adds in next chapter
```
## 'Statistical thinking' (McElreath) and [AJ Kurtz 'recoded' (bookdown)](https://bookdown.org/ajkurz/Statistical_Rethinking_recoded): highlights and notes
```{block2, type='note'}
McElreath's course and text looks great. I'm taking selective notes here; I'll try to incorporate content from both text and [youtube video lectures](https://www.youtube.com/watch?v=4WVelCswXo4&list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI).
[AJ Kurtz has re-written the code](https://bookdown.org/ajkurz/Statistical_Rethinking_recoded) using the `brms` package, which he finds superior. More crucially for me, he redoes the code using ggplot and tidyverse?
I'm planning to through this here, adding my own notes, questions, and considerations and (hopefully) incorporating some of my own work.
```
::: {.marginnote}
I've also forked Kurtz's repo [here](https://github.com/daaronr/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse), which I may play with.
:::
### 1. The Golem of Prague (map ant the territory)
Don't let your model or approach turn into a Golem you can't control. Don't 'believe the model'; continuously validate it. The map is not the territory.
'Statistical decision trees' lend a false sense of security... and almost never fit the actual case we are dealing with. (fig 1.1)\
Statistical models are non-unique maps to 'process models' which are non-unique maps to *hypotheses*. (He offers the example of neutral evolutionary selection' example.)
This makes strict falsification impossible: How can you falsify a hypothesis/theory if it corresponds to a wide set of process models and statistical models, many of which overlap other hypotheses?\
But this warning is at least as relevant for Bayesian analyses, which must be based on specifically defined (term) models of the DGP etc. Thus he recommends caution and continuous (?) interplay between the model and the data. (See next chapter ... 'small worlds and large worlds'.)\
He also suggests we refer not to 'Confidence intervals' or even 'Credible intervals', but to 'Consistent intervals' ... as in 'these intervals are consistent with the model and data'.
\
And...
> [so you should] '...Explicitly compare predictions of more than one model'
#### Rethinking: Is NHST falsificationist? {.unnumbered}
```{r failure_of_falsification.png, fig.cap='From McElreath video lecture 1', out.width='70%', fig.asp=.5, fig.align='center', echo = FALSE}
knitr::include_graphics("images/failure_of_falsification.png")
```
> Null hypothesis significance testing, NHST, is often identified with the falsificationist, or Popperian, philosophy of science. However, usually NHST is used to falsify a null hypothesis, not the actual research hypothesis. So the falsification is being done to something other than the explanatory model. This seems the reverse from Karl Popper's philosophy.
::: {.marginnote}
I.e., scientists have turned things upside down; originally the idea was that you had substitute of hypotheses that you would want to falsify and now we try to falsify silly null hypotheses that "nothing is going on". You should try to really build a hypothesis and test it not just reject that nothing is going on.
:::
#### Book's foci
1. Bayesian data analysis
2. Multilevel modeling
3. Model comparison using information criteria
\
### 2. Small Worlds and Large Worlds
> ... The way that Bayesian models learn from evidence is arguably optimal in the small world. When their assumptions approximate reality, they also perform well in the large world. But large world performance has to be demonstrated rather than logically deduced. (p. 20)
\
We imagine a bag filled with four marbles, each of which is blue or white.
"So, if we're willing to code the marbles as 0 = "white" 1 = "blue", we can arrange the possibility data in a tibble as follows." I.e., we can consider the five possible worlds, in each of which the bag has a different number of white and blue marbles, and represent each of these worlds as a column vector:
```{r, warning = F, message = F}
d <-
tibble(p_1 = 0,
p_2 = rep(1:0, times = c(1, 3)),
p_3 = rep(1:0, times = c(2, 2)),
p_4 = rep(1:0, times = c(3, 1)),
p_5 = 1)
d
```
\
We visualize this in the plot below, where each column is one 'world':
```{r, out.width='50%'}
d %>%
gather() %>% #make it long, with an ket variable for the possibility 'world'
mutate(x = rep(1:4, times = 5), #an index for 'which ball'
possibility = rep(1:5, each = 4)) %>% #distributing the 'which world' index
ggplot(aes(x = x, y = possibility,
fill = value %>% as.character())) +
geom_point(shape = 21, size = 5) +
scale_fill_manual(values = c("white", "navy")) +
scale_x_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(.75, 4.25),
ylim = c(.75, 5.25)) +
theme(legend.position = "none")
```
\
Simple combinatorics (permutations rule) tells us how many 'ways' we can draw 1, 2, and 3 marbles... Here we think about 'which' marble is drawn, and not just 'which color' it is. We can draw marble 1-4, the first time, then 1-4 the second time, and then 1-4 the third time... so `possibilities=marbles ^ draw`.
```{r}
tibble(draw = 1:3,
marbles = 4) %>%
mutate(possibilities = marbles ^ draw) %>%
knitr::kable()
```
```{block2, type='note'}
Next, there is a huge amount of code explaining how to make the 'garden of forking paths' diagrams. I'm basically going to skip all that code, and paste in a few images. You can find all the code [HERE](https://bookdown.org/content/3890/small-worlds-and-large-worlds.html#the-garden-of-forking-data)
```
\
Suppose there is only one blue ball and three white balls, possibility '2' above. For this world, we see the full 'garden of forking paths' --- the number of ways to select 1, 2, and 3 balls (with replacement) --- below.
Every path starting from the center is a possible (sequence of) draws.
[CUT A BUNCH HERE]
## Title: "Introduction to Bayesian analysis in R and Stata - Katz, Qstep"
*Content from notes from this lecture*
### Why and when use Bayesian (MCMC) methods?
#### Pros
1. No need for asymptotics ... good when sample sizes are small
2. Incorporate previous information
You can consider the 'robustness to other priors'
3. Fit complex nonstandard models ... e.g., with difficult functional forms or likelihood settings (more computation, less thinking)
4. Easy to make predictions (e.g., simulate scenarios) after estimation
5. Incorporate evidence, results, expert judgement
('restrictions' with some lee-way?)
(ISn't this the same as number 2?)
6. Cleaner treatment/imputation of missing values ... these are just parameters
#### Cons
1. Must specify prior distributions ... allows subjective judgement
2. Different way of thinking about stats and inference; probability distributions and simulations, not much about p-values, point estimates and standard errors ... path dependence
3. Computational cost
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
#### Why more popular today?
- Starting from around 2005 in Political Science and Sociology
Computational revolution comes from Markov chain Monte Carlo (MCMC) methods ... don't need analytical solutions
Software implementations -- many in R, specialised software like EWinBugs, JAGS, STAN; also increasingly in Stata
### Theory
Bayes theorem ... inverting conditional probability thing ... 'inversion' to make inferences about the parameters
- In Bayesian stats the *parameters* (and sometimes missing values) are random variables, we make probability statements about them
$$P(A|B)=P(B|A)P(A)/P(B)$$
Frequentist: Point estimates, unknown fixed parameters, data from a hyol repeataable random sample
Bayesian: Fixed data (from the experiment), parameters are random variables ... results based on probability distributions about rthese
Classical statistics: likelihood of data given parameter: $p(y|\theta)$
Bayes we want, $p(\theta|y) = p(y|\theta)p(\theta)/p(y)$
$p(y)$ is a 'constant' in our estimation ... the data is fixed.
So it's proportional to $p(\theta|y) = p(y|\theta)\times p(\theta)$
$p(y|\theta)$ is what we max when we do ML
\$ p(\theta)\$: prior distribution capturing beliefs about $\theta$
#### So how do we estimate it?
1. Specify a probability model, a distribution for Y (likelihood function) and the priors for $\theta$
2. Solve (find) the posterior distribution $p(\theta|Y)$ and summarise the parameters of interest
In practice, step 2 is usually done via MCMC simulation rather than analytically.
... via simulations, I approach the 'true' value on $\theta$
(Given 'regularity conditions')
#### Linear regression model example
$$Y = x'\beta+\epsilon$$ with n obs
only random term is epsilon ... natural candidate is a normal distribution, so $Y \sim N(x'\beta,\sigma^2_e)$
So we want to find $p(\beta, \sigma^2_\epsilon|Y,X)$. This depends on the choices of $p(\beta)$ and $p(\epsilon)$. Could choose conjugate priors, leading to a particular joint posterior, you can solve it analytically.
Can yield a joint posterior.
Instead, let's assume that the latter (variance) parameter is known, you can show that the posterior for $\beta$ is also normally distributed. (Conjugate)
Similarly, if we assume $\beta$ is known, if the variance term had an inverse gamma distribution (prior), so will the posterior.
In these conjugate priors, the posterior mean will be a weighted average of the priors and the data.
#### Gibbs
Needs closed form conditional posterior for every parameter.
What Gibbs sampler does is break the parameter space into sets of parameters
1. Choose starting values, $\theta^0_1,...\theta^0_k$
2. sample from the first parameter's distribution given the others ... the second one, ... the k'th one .
3. Repeat step 2 ... thousands of times (starting with the parameters from the previous iteration) Eventually 'we obtain samples of $p(\theta|y)$'
*But if we don't have a closed form, we cannot simply sample from known distributions in each step*
E.g., in case of Logit distribution.
#### Metropolis Hastings
1. Choose 'proposal distribution' to sample parameter values (a candidate like normal, uniform)
2. Start w a prelim guess for parameter values $\theta_0$
3. At iteration t sample a proposal $\theta_t$ from $p(\theta_t|\theta_{t-1})$ ?? what does this come from?
4. If $p(\theta_t|y)>p(\theta_{t-1}|y)$ accept it as the new value of $\theta$. ??? how is this computed if we don't have conjugate closed-form posteriors?
5. Otherwise flip a coin with probability r = (ratio of those probabilities)
- if coin tosses heads, accept as new theta, otherwise stay at previous theta
- allows algorithm to avoid getting stuck at local maxima
Commonly used proposal: random walk sample: $\theta_t=\theta_{t-1}+z_t$, $z_t \sim f$
?? I do this because there is no analytical way to derive this, unlike in the conjugate case, where we might use the Gibbs
- can combine Gibbs with Metropolis steps; relevant to some problems
#### Assessing convergence
- previous ... 'eyeballing'
- formal:
- single-chain tests (Geweke/Heidel) ... is the last part of the chain stable (stationary)... compare simulation at middle and end, is there much variation?
- multiple-chain test... (starting from different values), do they end similar ... Gelman-Rubin diagnosting $\hat{R}$
- typically either a very long chain and use GH convergence, or multiple shorter chains and use $\hat{R}$
Gabriel: Gelman-Rubin is probably preferred; more conservative
?? What am I iterating towards? Converging on what?
#### Assesing 'fit' in Bayesian
- No r-squared
- Typical measure is 'posterior predictive comparisons'
$p(y_{replicated}|y_{observed}= ...$
1. Simulate data from estimated parameters
2. Compare to observed data
3. Use an overall fit measure to assess model fit
E.g., percent correct predictions (binary), whether the true data is within the 95% CI of the replicates, deviance
For each replicate Choose statistic D, compare the replicated
$D(y^s_{replicated})$ against $D(y^s_{observed})$
Quantify the discrepancy ... percent of correct predictions, proportion of times replicated y is below true y ... compute 'bayesian p-value's'
Systematic differences between replicate and actual data indicate model limitations
(?? what are reasonable values here??)
### Comparing models ... Equivalent of 'likelihood'
'Deviance Information Criterion' (most used); specific for MCMC simulations: compares expected LL of the model (of the data given the estimated parameters; average here across much of the later points in the chain) against the llhd at the posterior parameter mean. Always select model with lowest DIC.
Bayes Factor (less used): Ratio of llhd of the models; higher BF means model is more supported; BF\>10 seen to provide strong evidence for model w higher value
### On choosing priors
Most social scientists use non-informative or vague priors; i.e., large variance... e.g., $\beta \sim N(0,1000)$
But its often useful to incorporate information into your priors
Small pilot to test, $\rightarrow$ data $Y_1$, another study gives data $Y_2$; repeated application of Bayes theorem gives the posterior.
Same result whether you obtained these together, or whether you did one and then updated (e.g., via an MCMC, starting with the first one as a prior)
Conjugate priors (mentioned before)
- Jeffrey's priors (??)
### Implementation
If you don't need to do fancy things, and don't want to (?) generate the full posterior distribution (or something)
Some Stata/R commands that make Bayesian look frequentist.
In Jags and Winbugs, we only have to specify the prior... rest is done for us
Jags is great ... you only need to do self-coding with lots of data and super complicated models as it can freeze up
We went through it the fancy way in Probit.R
Then the easy way with 'script probit Jags.R'
### Generate predictions from a WinBUGS model
You can just generate these outcomes ...
Prediction: generate a new observation \#note, he is doing one per iteration, but since these are convergent it would be basically the same if you just chose a random iteration and did all the draws from that one
### Missing data case
One solution -- multiple imputation
- choose imputation model to predict missings,
- generate many copies of orig data set, imputing missibg value for each
- 2 more steps here
Need a model for X\|alpha, because missing variables are random variables
### Stata
Has some rather simple implementations; e.g., just using commands like `bayes: regress y x`
### R mcmc pac
Also simple code; great for standard use
Speedup with parallelization; see "script for parallel probit.R" and "parallelprobit.R"
More advanced: C++; can integrate it with Rcpp, or even use Exeter's ISCA cluster
```{r cars}
summary(cars)
```
## Other resources and notes to integrate
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
Hey stats twitter: got a very sharp psych UG student wanting to dive into Bayes. Many resources are too technical (i.e., not good teaching texts for UG level, but useful references). Where should I point her?
</p>
--- Tom Carpenter (\@tcarpenter216) <a href="https://twitter.com/tcarpenter216/status/1223406990325534720?ref_src=twsrc%5Etfw">February 1, 2020</a>
</blockquote>
```{=html}
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
```