# What is a Model?
Before we dive into the analysis of the beer tasting experiment, we need to define some key components. First of all, the concept of a *statistical model*. A statistical model is a combination of a general statistical model (e.g., the binomial model) and a statement about a parameter value; together, these describe a certain phenomenon. For instance, we can model the flipping of a fair coin with the binomial model, where the probability parameter $\theta$ ("theta") is set to $0.5$. Or, we can model the height of Dutch men (in cm) with a normal model, where the location parameter $\mu = 183$ and the dispersion parameter $\sigma = 5$. A statistical model can therefore also be seen as a hypothesis: a specific statement about the value of the model parameters.
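To make this concrete, here is a minimal R sketch that evaluates the two example models mentioned above (the parameter values are the ones from the text):
```{r}
# A fair coin: a binomial model with theta = 0.5
dbinom(5, 10, 0.5)                # probability of 5 heads in 10 flips

# Height of Dutch men: a normal model with mu = 183 and sigma = 5
dnorm(190, mean = 183, sd = 5)    # density at a height of 190 cm
pnorm(190, mean = 183, sd = 5)    # probability of a height below 190 cm
```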
## Models Make Predictions {#sec-models-make-predictions}
An essential property of a statistical model is that it can make predictions about the real world. We can use the accuracy of these predictions to gauge the quality/plausibility of a model, relative to another model. For instance, Sarah thinks that the probability of heads in a coin flip is 50% (i.e., $H_S: \theta = 0.5$), while Paul claims that the coin has been tampered with, and that the probability of heads is 80% (i.e., $H_P: \theta = 0.8$). Here, Sarah and Paul postulate different models/hypotheses. Both are [binomial models](https://en.wikipedia.org/wiki/Binomial_distribution), the general statistical model for describing a series of chance-based events with a binary outcome (e.g., a coin flip, red/black in roulette, whether a random person from the population has a certain disease or not, or someone identifying the alcoholic beer). Where Sarah and Paul differ, however, is in their claim about the specific value of the $\theta$ parameter. In the remainder of this text, we will use *model* to mean such a combination of a general statistical model and a claim about the value of its parameter (i.e., a hypothesis).
```{r two-models-binomial, fig.cap='Two models for a coin toss. The arrows indicate what each model postulates: a single value for theta.', fig.align='center', out.width='90%', echo = FALSE, cache=FALSE}
#| label: fig-two-models-binomial
par(mfrow = c(1, 2))
cols <- viridis::viridis(6)
plot(1, 1, type ="n", xlim = c(0,1), ylim = c(0,4), bty = "n", main = "Sarah's Model",
las = 1, xlab = expression(theta), ylab = "Density")
arrows(x0 = 0.5, x1 = 0.5, y0 = 0, y1 = 4.0, lwd = 4, col = cols[1])
plot(1, 1, type ="n", xlim = c(0,1), ylim = c(0,4), bty = "n", main = "Paul's Model",
las = 1, xlab = expression(theta), ylab = "Density")
arrows(x0 = 0.8, x1 = 0.8, y0 = 0, y1 = 4.0, lwd = 4, col = cols[3])
```
The two models make different claims about $\theta$, and therefore also make different *predictions* about the outcome of a series of 10 coin flips. Specifically, we can use the binomial model to calculate how likely each possible outcome is under each model. For instance, we can calculate how likely it is to observe 8 heads out of 10 flips. The binomial formula is as follows: \begin{align}
\label{binomFormula}
P(\text{data} \mid \theta) = \frac{n!}{k! (n-k)!} \theta^k\left(1-\theta\right)^{n-k},
\end{align} which, if we fill in the outcome for which we want to know the likelihood (i.e., $k=8$ heads out of $n=10$ flips), becomes:
\begin{equation}
P(\text{8 heads out of 10} \mid \theta) = \frac{10!}{8! (10-8)!} \theta^8\left(1-\theta\right)^{10-8}.
\end{equation}
The last element to fill in is $\theta$. If we do so for Sarah, who postulates $\theta = 0.5$, we get `r round(dbinom(8, 10, 0.5), 4)`. For Paul, who postulates $\theta = 0.8$, we get `r round(dbinom(8, 10, 0.8), 4)`. If we do this for every possible outcome, and create a bar graph of each likelihood, we get the following two figures that illustrate what each model deems likely (the yellow bar indicates each model's likelihood of the example of 8 heads out of 10 flips):
```{r two-models-likelihoods-binomial, fig.cap="The likelihoods of all possible outcomes of 10 coin flips, under Sarah's model and under Paul's model. The yellow bar indicates the likelihood of the observed data (8 heads).", fig.align='center', out.width='90%', echo = FALSE}
#| label: fig-two-models-likelihoods-binomial
par(mfrow = c(1, 2), cex.main = 0.95)
cols <- viridis::viridis(6)
barplot(dbinom(0:10, 10, 0.5), names.arg = 0:10, xlab = "Number of heads", ylab = "Likelihood",
main = "Likely Outcomes under Sarah's Model", col = cols[1], ylim = c(0,0.32))
barplot(c(rep(0, 8), dbinom(8, 10, 0.5)), add = TRUE, col = cols[6])
barplot(dbinom(0:10, 10, 0.8), names.arg = 0:10, xlab = "Number of heads", ylab = "Likelihood",
main = "Likely Outcomes under Paul's Model", col = cols[3], ylim = c(0,0.32))
barplot(c(rep(0, 8), dbinom(8, 10, 0.8)), add = TRUE, col = cols[6])
```
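The likelihood values quoted above can be reproduced directly; the sketch below evaluates the binomial formula both by hand and with R's built-in `dbinom()`, for the observed outcome and for all possible outcomes:
```{r}
n <- 10  # number of flips
k <- 8   # number of heads

# The binomial formula, written out by hand
manual <- function(theta) {
  factorial(n) / (factorial(k) * factorial(n - k)) * theta^k * (1 - theta)^(n - k)
}
manual(0.5)        # Sarah's model: ~0.0439
dbinom(k, n, 0.5)  # identical value via R's built-in density

# Likelihoods of every possible outcome under both models
round(dbinom(0:10, n, 0.5), 4)  # Sarah: mass centered on 5 heads
round(dbinom(0:10, n, 0.8), 4)  # Paul: mass centered on 8 heads
```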
<!-- These two figures also reflect how Sarah and Paul would spread their money, if they would be handed 100€ and asked to bet on the outcome of 10 coin tosses. While they would bet most of their money on the value exactly equal to $\theta \times n$ (i.e., 5 for Sarah and 8 for Paul), there will always be some random noise in the sample, which causes the observed data to manifest (slightly) differently from the true population value. -->
These two figures reflect the likely outcomes of the experiment of flipping a coin 10 times. If Sarah is correct, and the probability of heads is in fact $0.5$, the most likely outcomes are 4, 5, and 6 heads. However, if Paul is correct, and $\theta = 0.8$, it is more likely that we see 7, 8, or 9 heads.
## Model Comparison {#sec-model-comparison}
```{r, echo = FALSE}
# Likelihood of the observed data (8 heads in 10 flips) under each model
s.ml <- round(dbinom(8, 10, 0.5), 4)  # Sarah: theta = 0.5
p.ml <- round(dbinom(8, 10, 0.8), 4)  # Paul: theta = 0.8
# Bayes factors in both directions
bf_sp <- round(s.ml / p.ml, 2)
bf_ps <- round(p.ml / s.ml, 2)
```
In the previous section we made concrete what each of the two models predicts. The models differ in their statement about $\theta$ (@fig-two-models-binomial), and therefore differ in what they deem likely outcomes (@fig-two-models-likelihoods-binomial). Now imagine that we actually gather some data by flipping a coin 10 times, and we observe 8 heads and 2 tails. @fig-two-models-likelihoods-binomial tells us that the probability of that happening under Sarah's model is `r round(dbinom(8, 10, 0.5), 4)`, while under Paul's model it is `r round(dbinom(8, 10, 0.8), 4)`. These two numbers tell us how well each model predicted the data, relative to each other. Specifically, the ratio of these two numbers is known as the **Bayes factor**. Here, the Bayes factor is equal to `r round(dbinom(8, 10, 0.5), 4)` / `r round(dbinom(8, 10, 0.8), 4)` = `r round(round(dbinom(8, 10, 0.5), 4) / round(dbinom(8, 10, 0.8), 4), 2)`, which means that the observed data are about `r bf_sp` times as likely under Sarah's model as under Paul's model. The Bayes factor carries a subscript that indicates which model is compared to which: $\text{BF}_{SP}$ quantifies how much more likely the data are under Sarah's model than under Paul's, while $\text{BF}_{PS}$ quantifies how much more likely the data are under Paul's model than under Sarah's. To go from one to the other, you simply take the reciprocal: $\text{BF}_{PS}$ = $\frac{1}{\text{BF}_{SP}}$ = `r bf_ps`. Saying that the data are `r bf_ps` times as likely under Paul's model as under Sarah's model is therefore exactly the same statement. Generally, it is a bit easier to communicate the version of the Bayes factor that is greater than 1, using the appropriate subscript.
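In R, these quantities are simple ratios of the two likelihoods; this sketch mirrors the computation reported above:
```{r}
like_sarah <- dbinom(8, 10, 0.5)  # likelihood of 8 heads under theta = 0.5
like_paul  <- dbinom(8, 10, 0.8)  # likelihood of 8 heads under theta = 0.8

like_sarah / like_paul  # BF_SP: ~0.15
like_paul / like_sarah  # BF_PS = 1 / BF_SP: ~6.9
```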
Lastly, it can happen that two models predicted the data equally well; in that case the Bayes factor equals 1. Generally, we want the Bayes factor to be as far away from 1 as possible, since greater deviation from 1 indicates more evidence in favor of one model over the other. Several categorizations have been proposed to translate a Bayes factor into words, to facilitate communication about degrees of evidence for or against one model relative to another. One such representation is given below in @fig-bayes-factor-classification.
```{r bayes-factor-classification, echo = FALSE, fig.cap = 'A graphical representation of a Bayes factor classification table. As the Bayes factor deviates from 1, which indicates equal support for $H_0$ and $H_1$, more support is gained for either $H_0$ or $H_1$. The probability wheels illustrate the continuous scale of evidence that Bayes factors represent. These classifications are heuristic and should not be misused as an absolute rule for binary all-or-nothing conclusions.', fig.align='center', out.width= '90%'}
#| label: fig-bayes-factor-classification
knitr::include_graphics("Figures/BF_TableInterpretation.png", dpi=120)
```
## More Models {#sec-more-models}
So far, we have considered two models ($H_S$ and $H_P$), both of which postulate a single value for the model parameter $\theta$. However, a model can also be more uncertain in its assertions. For instance, we can have a model that postulates that the probability of heads is greater than $0.5$ (i.e., $0.5 \leq \theta \leq 1$ [^02-models-1]). This corresponds to the belief that the coin has been tampered with, but without a strong statement about the degree of tampering. In addition to postulating a range for $\theta$, such a model also needs to specify how likely it deems every value in that range. Let's add two more people and their models to the mix to illustrate. Betty believes the coin has been tampered with, but is unsure about the degree of tampering: she believes that every value of $\theta$ between 0.5 and 1 is equally likely. Next is David, who is a bit more extreme in his beliefs: he believes that the coin has been tampered with heavily, and so assumes that values of $\theta$ close to 1 are more likely than values of $\theta$ closer to $0.5$. If we were to plot the models and corresponding hypotheses of Betty and David, they would look as follows (the difference in density reflecting their different beliefs):
[^02-models-1]: Because we are working with a continuous range of values for $\theta$, the difference between saying $0.5 < \theta$ and $0.5 \leq \theta$ is infinitesimally small and the two versions may be used interchangeably.
```{r two-models-binomial-onesided, fig.cap='Two more models for a coin toss. The colored regions indicate what each model believes. Even though both Betty and David believe the probability of heads to be greater than 0.5, they differ in how plausible they deem specific values in that range.', fig.align='center', out.width='90%', echo = FALSE}
#| label: fig-two-models-binomial-onesided
par(mfrow = c(1, 2))
# Density of a Beta(shape1, shape2) distribution truncated to [0.5, 1] when
# side = "pos", to [0, 0.5] when side = "neg", and untruncated otherwise
tBetFun <- function(x, shape1 = 1, shape2 = 1, side = "pos") {
ds <- dbeta(x, shape1, shape2)
if (side == "pos") {
ds[x<0.5] <- 0
ds <- ds / integrate(function(x) dbeta(x, shape1, shape2), lower = 0.5, upper = 1)[[1]]
} else if (side == "neg") {
ds[x>0.5] <- 0
ds <- ds / integrate(function(x) dbeta(x, shape1, shape2), lower = 0, upper = 0.5)[[1]]
}
return(ds)
}
cols <- viridis::viridis(6)
plot(1, 1, type ="n", xlim = c(0,1), ylim = c(0,4), bty = "n", main = "Betty's Model",
las = 1, xlab = expression(theta), ylab = "Density")
# curve(tBetFun(x, 1, 1), add = TRUE, col = cols[4], lwd = 4)
mySeq <- seq(0.5, 1, length.out = 1e3)
polygon(x = c(mySeq, rev(mySeq)), y = c(rep(0, 1e3), tBetFun(mySeq, 1, 1)), col = cols[2])
plot(1, 1, type ="n", xlim = c(0,1), ylim = c(0,4), bty = "n", main = "David's Model",
las = 1, xlab = expression(theta), ylab = "Density")
# curve(tBetFun(x, 3, 1), add = TRUE, col = cols[5], lwd = 4)
polygon(x = c(mySeq, rev(mySeq)), y = c(rep(0, 1e3), rev(tBetFun(mySeq, 3, 1))), col = cols[5])
```
Compared to the models in @fig-two-models-binomial, which only "bet" on a single value, the models above spread their bets more. Betty and David thus make safer bets, since they make wider predictions. Although both Betty and David consider only values of $\theta$ above 0.5, they differ in how plausible they deem specific values in that range. As before, these models also make predictions about how likely the various outcomes of a series of 10 coin flips are. Again, the binomial formula can be used. However, this time the models do not postulate a single value, but a whole range of values. In order to compute how likely Betty's model deems an outcome of 8 heads out of 10 flips, we have to consider every postulated value of $\theta$ between 0.5 and 1, compute the likelihood of the data for each value, and average across all of these likelihoods, weighted by the density at each point. The technical term for such weighted averaging is *marginalizing*, and we refer to the resulting averaged likelihood as the **marginal likelihood**. In the next section we will revisit this topic.
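For Betty's model, this weighted average can be computed by numerical integration: her density equals 2 on the interval $[0.5, 1]$ and 0 elsewhere. The sketch below does this for every possible outcome; it is an analytic alternative to the sampling approach used to draw the figures:
```{r}
# Marginal likelihood under Betty's model: average the binomial likelihood
# over theta in [0.5, 1], weighted by her density (a constant 2)
marg_betty <- function(k) {
  integrate(function(theta) dbinom(k, 10, theta) * 2, lower = 0.5, upper = 1)$value
}
round(sapply(0:10, marg_betty), 4)  # marginal likelihoods of 0, ..., 10 heads
marg_betty(8)                       # ~0.18 for the observed 8 heads
```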
In @fig-two-models-binomial-onesided-predictions below, you can see the marginal likelihoods of all outcomes under each of the two additional models. Note that even though neither Betty nor David postulates values of $\theta$ below 0.5 (i.e., the *parameter*), they do assign some plausibility to observed proportions below 0.5 (i.e., the *statistic*, or observed data). This reflects the random nature of a coin flip: even if the true probability of heads were, say, $0.6$, you might still observe 3 heads out of 10 flips.
```{r two-models-binomial-onesided-predictions, fig.cap='The marginal likelihoods of all possible outcomes of 10 coin flips, under the two additional models. The yellow bar indicates the marginal likelihood of the observed data (8 heads).', fig.align='center', out.width='90%', echo = FALSE}
#| label: fig-two-models-binomial-onesided-predictions
par(mfrow = c(1, 2), cex.main = 0.95)
cols <- viridis::viridis(6)
# Sample outcomes under Betty's model: theta ~ Uniform(0.5, 1), then 10 flips;
# 0:5 is prepended so that table() contains all 11 outcome categories
sampsU <- c(0:5, rbinom(1e5, 10, runif(1e5, 0.5, 1)))
barplot(table(sampsU)/1e5, names.arg = 0:10, xlab = "Number of heads", ylab = "Likelihood",
main = "Likely Outcomes under Betty's Model", col = cols[2], ylim = c(0,0.32))
barplot(c(rep(0, 8), (table(sampsU)/1e5)[9]), add = TRUE, col = cols[6])
# Sample outcomes under David's model: draw theta from Beta(3, 1) and fold
# values below 0.5 upward, so that all prior mass lies on [0.5, 1]
samps <- rbeta(1e5, 3, 1)
samps[samps < 0.5] <- 1 - samps[samps < 0.5]
# Prepend 0:5 so that table() contains all 11 outcome categories
samps <- c(0:5, rbinom(1e5, 10, samps))
barplot(table(samps)/1e5, names.arg = 0:10, xlab = "Number of heads", ylab = "Likelihood",
main = "Likely Outcomes under David's Model", col = cols[5], ylim = c(0,0.32))
barplot(c(rep(0, 8), (table(samps)/1e5)[9]), add = TRUE, col = cols[6])
```
### The Open-Minded Model {#sec-open-minded-model}
Lastly, and perhaps most importantly, we can also consider a model that spreads its bets as much as possible. Let's say that Alex wants to keep as open a mind about the value of $\theta$ as possible: they consider each possible value of $\theta$ to be equally plausible. In Bayesian inference, we also refer to this type of model as the *uninformed* model. The figure below illustrates what the uninformed model posits, and which outcomes it deems likely. We again have a model that postulates multiple values, so the figure on the right depicts marginal likelihoods. For instance, for the yellow bar, we look at how likely 8 heads out of 10 flips are, averaged over all values postulated by the model, weighted by the density in the left graph.
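Because Alex's density is constant over the whole range, the same integration as before shows that every one of the 11 possible outcomes has marginal likelihood $1/11 \approx 0.09$; a quick check:
```{r}
# Marginal likelihood under Alex's uniform model, for every possible outcome
marg_alex <- function(k) {
  integrate(function(theta) dbinom(k, 10, theta), lower = 0, upper = 1)$value
}
round(sapply(0:10, marg_alex), 4)  # each outcome: ~0.0909, i.e., 1/11
```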
```{r uninformed-model-binomial-prediction, fig.cap='The so-called "uninformed model". Alex wants to keep an open mind about the values of theta and considers each value equally plausible. Left: the colored region indicates what Alex believes. Right: what this specific model considers likely outcomes. The yellow bar indicates the marginal likelihood of the observed data (8 heads).', fig.align='center', out.width='90%', echo = FALSE}
#| label: fig-uninformed-model-binomial-prediction
par(mfrow = c(1, 2), cex.main = 0.95)
cols <- viridis::viridis(6)
plot(1, 1, type ="n", xlim = c(0,1), ylim = c(0,4), bty = "n", main = "Alex's Model",
las = 1, xlab = expression(theta), ylab = "Density")
# curve(tBetFun(x, 1, 1), add = TRUE, col = cols[4], lwd = 4)
mySeq <- seq(0, 1, length.out = 1e3)
polygon(x = c(mySeq, rev(mySeq)), y = c(rep(0, 1e3), tBetFun(mySeq, 1, 1, side = "neutral")), col = cols[4])
barplot(rep(1/11, 11), names.arg = 0:10, xlab = "Number of heads", ylab = "Likelihood",
main = "Likely Outcomes under Alex's Model", col = cols[4], ylim = c(0,0.32))
barplot(c(rep(0, 8), 1/11), add = TRUE, col = cols[6])
```
## More Model Comparisons {#sec-more-model-comparison}
We can apply the same principles from @sec-model-comparison to compare how well each of the additional models predicted the observed data of 8 heads out of 10 coin flips. To do so, we simply take the ratio of the yellow bars in the figures that depict how likely each model considers the various possible outcomes of 10 coin flips. For instance, Alex's model has a marginal likelihood of `r round((1/11), 4)` for 8 heads, whereas Betty's model has a marginal likelihood of `r round( (table(sampsU)/1e5)[9], 4)` for 8 heads. If we want to compare the predictions of Betty and Alex, we can look at the ratio of these values to obtain $\text{BF}_{AB} =$ `r round( round(1/11, 4)/ round((table(sampsU)/1e5)[9], 4), 1)`, which is equivalent to $\text{BF}_{BA} =$ `r round( round((table(sampsU)/1e5)[9], 4) / round(1/11, 4) , 1)`. This means that the data are about twice as likely under Betty's model as under Alex's model, which can be considered weak evidence in favor of Betty's model over Alex's model.
If we use the betting analogy again, we could say that while both Alex and Betty bet some money on the outcome of 8 heads, Betty bet more money on this particular outcome than Alex, and is therefore rewarded more. Because Betty holds a more specific belief (namely that the coin is biased towards heads), she had more money at her disposal for betting on the values she considered (i.e., values between 0.5 and 1). In contrast, Alex played it very safe: they win some money for any outcome, because they spread their betting money across all values. However, because of this, their reward for having correctly predicted the observed data is lower than that of someone who made a more specific bet on the observed data. The phenomenon of more specific models being rewarded more (when they predict well) than their less specific competitors is known as **parsimony**, and will be discussed in more depth in Chapter 4.
A last model comparison we can make is between Alex's model and Sarah's model. In a typical (two-sided) statistical test about a proportion, this is the most commonly used comparison: Sarah's model is considered the null model, and Alex's model the two-sided alternative model. As we saw, Alex's marginal likelihood is `r round((1/11), 4)`, while Sarah's marginal likelihood is `r round(dbinom(8, 10, 0.5), 4)`, so the Bayes factor comparing these two models, $\text{BF}_{AS}$, equals `r round(round((1/11), 4) /round(dbinom(8, 10, 0.5), 4), 2)`. This means the data are about twice as likely under Alex's model as under Sarah's model.
As a bonus, when we know $\text{BF}_{BA}$ and $\text{BF}_{AS}$, we automatically know $\text{BF}_{BS}$. Since we know how much more likely the data are under Betty's model than under Alex's model (about 2 times), and how much more likely the data are under Alex's model than under Sarah's model (about 2 times), we also know that the data are about $2 \times 2 = 4$ times more likely under Betty's model than under Sarah's model! This property is known as **transitivity**.
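This whole chain can be verified numerically; the sketch below computes the marginal likelihoods by integration (rather than by the simulations used for the figures) and confirms the transitivity:
```{r}
m_sarah <- dbinom(8, 10, 0.5)                                         # point hypothesis theta = 0.5
m_alex  <- integrate(function(t) dbinom(8, 10, t), 0, 1)$value        # uniform on [0, 1]
m_betty <- integrate(function(t) dbinom(8, 10, t) * 2, 0.5, 1)$value  # uniform on [0.5, 1]

bf_ba <- m_betty / m_alex  # ~1.9
bf_as <- m_alex / m_sarah  # ~2.1
bf_ba * bf_as              # ~4.0: BF_BS via transitivity
m_betty / m_sarah          # identical, computed directly
```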
## Concluding Thoughts
In this chapter, we have seen several key ideas:
- Models make concrete statements about model parameters; here, about the $\theta$ parameter of the binomial model
- These statements can be characterized by a probability distribution, where the probability mass reflects the specific statement
    - The model can postulate a single value (e.g., [the models of Sarah and Paul](#fig-two-models-binomial))
    - The model can postulate a range of values (e.g., [the models of Betty, David](#fig-two-models-binomial-onesided), and [Alex](#fig-uninformed-model-binomial-prediction))
- After we have observed some data, we can use the Bayes factor to compare the quality of the predictions made by each model
    - The Bayes factor is a relative metric, comparing two models at a time
    - The subscript of the Bayes factor indicates which model is compared to which
    - More specific predictions, when accurate, are rewarded (parsimony)
Instead of comparing models, however, we can also look at one model, and use it to **estimate** the value of $\theta$. We will see that each of the models presented above will yield different estimates, because they had different *a priori* (i.e., before seeing the data) beliefs about plausible values of $\theta$.