-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
293 lines (254 loc) · 12 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
---
title: A Naive 2022 Senate Forecast
author: Sentient Potato
output:
md_document:
variant: gfm
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE)
## Load required packages
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(usmap)
## Override verbosity defaults and default ggplot theme
options(readr.show_col_types = FALSE)
options(dplyr.summarise.inform = FALSE)
theme_set(theme_bw())
```
# A Naive 2022 Senate Forecast
I do not study elections.
I am not a forecaster.
So take what follows accordingly.
There are a number of people who have built sophisticated and probably quite fine election forecasting models.
I wanted to see something a little different.
How close can I get with a *very naive* forecasting model?
So first, what am I trying to forecast?
I'm gonna go for the Democratic two party voteshare, or
$$
\delta \triangleq \dfrac{\text{Number of Democratic Votes}}{\text{Number of Democratic Votes} + \text{Number of Republican Votes}}.
$$
## The Model
Now like I said, for kicks I'm going to try to forecast $\delta_i$ for each state $i$ where there's a Senate race in a very unsophisticated way:
Specifically, I'm only going to use information voters have given me about whether they prefer to vote for the Democrat or the Republican.
This information comes in two flavors: the information I had beforehand and what I've learned through the campaign.
This leads very naturally to a very simple Bayesian model, where we place a Beta prior on $\delta_i$, whose parameters are shaped by election returns from past Senate races in state $i$, and update our belief about $\delta_i$ using polls of the voters in state $i$ in the current election cycle.
(Specifics about how the prior parameters are calculated from past election returns are given for the interested reader after the forecasting results are discussed).
## Model "Validation"
Since we have historical polls and election returns for the past couple cycles of Senate elections, we can go ahead and see how this naive model would have fared in 2018 and 2020 before seeing what its predictions for tomorrow will be.
(I omitted some cases to avoid dealing with wrinkles for this simplified naive model; this is discussed at the end).
```{r historical-model}
## Read in historical Senate polls from FiveThirtyEight
## (projects.fivethirtyeight.com/polls-page/data/senate_polls_historical.csv)
historical_polls = read_csv("senate_polls_historical.csv")
## Read in historical Senate results from MIT's Election Data Lab
## (doi.org/10.7910/DVN/PEJ5QU)
historical_results = read_csv("1976-2020-senate.csv")
## Eliminate problematic cases
historical_polls = historical_polls %>%
filter(!(state %in% c("California", "Louisiana"))) %>%
filter(!(race_id %in% c(7780, 7781))) %>% ## Warnock
filter(!(race_id %in% c(6271, 130))) ## pre-runoff Ossoff & Espy
idx_to_fix = which(with(historical_results,
year == 2020 & state == "WYOMING" & candidate == "CYNTHIA M. LUMMIS"
))
historical_results$party_simplified[idx_to_fix] = "REPUBLICAN"
idx_to_fix = which(with(historical_results,
year == 2020 & state == "WYOMING" & candidate == "MERAV BEN DAVID"
))
historical_results$party_simplified[idx_to_fix] = "DEMOCRAT"
historical_results = historical_results %>%
filter(!writein) %>%
filter(candidate != "ERNEST J. PAGELS, JR.") %>%
select(year, state, stage, special, candidatevotes, party_simplified) %>%
filter(year >= 2010) %>%
filter(party_simplified %in% c("DEMOCRAT", "REPUBLICAN")) %>%
filter(!(state %in% c("CALIFORNIA", "LOUISIANA"))) %>%
filter(!(special & year == 2020 & state == "GEORGIA"))
## Calculate Democratic two party voteshare for each race
historical_results = historical_results %>%
pivot_wider(names_from = party_simplified, values_from = candidatevotes) %>%
mutate(twoparty = DEMOCRAT / (DEMOCRAT + REPUBLICAN))
## Calculate Democratic and Republican "votes" from each poll
aggpolls = function(x) round(mean(x, na.rm = TRUE))
historical_polls = historical_polls %>%
filter(party %in% c("DEM", "REP")) %>%
filter(!is.na(sample_size)) %>%
mutate(special = race_id %in% c(6268, 6209, 129)) %>%
group_by(poll_id, party) %>%
summarise(
year = first(as.numeric(paste0("20", gsub(".*/", "", election_date)))),
state = first(state),
special = first(special),
votes = (pct / 100) * sample_size
) %>%
pivot_wider(names_from = party, values_from = votes, values_fn = aggpolls)
dat = historical_polls %>%
na.omit() %>%
group_by(year, state, special) %>%
summarise(DEMOCRAT = sum(DEM), REPUBLICAN = sum(REP)) %>%
mutate(state = toupper(state))
dat$a = 0; dat$b = 0; dat$actual = 0
for ( i in 1:nrow(dat) ) {
s = dat$state[i]
y = dat$year[i]
p = historical_results %>% filter(state == s & year < y) %>% pull(twoparty)
mu = mean(p)
s2 = var(p)
dat$a[i] = ( ((1 - mu) / s2) - (1 / mu) ) * mu^2
dat$b[i] = dat$a[i] * ( (1 / mu) - 1 )
r = historical_results %>%
filter(state == s & year == y & special == dat$special[i]) %>%
pull(twoparty)
dat$actual[i] = r
}
dat = dat %>%
mutate(estimate = (a + DEMOCRAT) / (a + b + DEMOCRAT + REPUBLICAN))
```
Surprisingly, the results are not *that* bad??
The dashed line here represents if the estimated two party voteshare was *exactly* what we actually observed, so one thing to notice is the model does overestimate Democrats slightly since most of the points are just a bit on the upper-left side of that line, but the points do track the line pretty well actually.
```{r historical-comparison}
ggplot(data = dat, mapping = aes(x = actual, y = estimate)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
geom_point(size = 2, alpha = 1/2, na.rm = TRUE) +
xlab("Actual Two Party Voteshare") +
ylab("Estimated Two Party Voteshare")
```
Here are how the races would have been classified vs the actual results:
```{r historical-comparison-table}
tab = dat %>%
mutate(
Result = ifelse(actual > 0.5, "Dem won", "Dem lost"),
Prediction = ifelse(estimate > 0.5, "Dem won", "Dem lost")
)
table(Result = tab$Result, Prediction = tab$Prediction)
```
This super simple model predicted `r sprintf("%0.1f%%", 100 * mean(tab$Result == tab$Prediction, na.rm = TRUE))` of 2018 and 2020 US Senate elections correctly LMAO.
**BUT**, notice that all the errors were overestimating Democrats, so that's something to be wary of.
For reference, here's how far off the predictions were in those classification misses:
```{r check-out-historical-misses}
tab = tab %>%
filter(Result != Prediction) %>%
select(year, state, actual, estimate)
names(tab) = c("Year", "State", "Actual voteshare", "Estimated voteshare")
knitr::kable(tab)
```
## The 2022 Senate Forecast
Okay, so if we were to apply this method to the 2022 races, what would we get?
```{r current-model}
## Read in current Senate polls from FiveThirtyEight
## (projects.fivethirtyeight.com/polls-page/data/senate_polls.csv)
current_polls = read_csv("senate_polls.csv")
## Eliminate problematic cases
current_polls = current_polls %>%
filter(!(state %in% c("California", "Louisiana"))) %>%
filter(cycle == 2022)
## Calculate Democratic and Republican "votes" from each poll
aggpolls = function(x) round(mean(x, na.rm = TRUE))
current_polls = current_polls %>%
filter(stage != "jungle primary") %>%
filter(party %in% c("DEM", "REP")) %>%
filter(!is.na(sample_size)) %>%
mutate(special = race_id %in% c(9480, 9482)) %>%
group_by(poll_id, party) %>%
summarise(
year = first(as.numeric(paste0("20", gsub(".*/", "", election_date)))),
state = first(state),
special = first(special),
votes = (pct / 100) * sample_size
) %>%
pivot_wider(names_from = party, values_from = votes, values_fn = aggpolls)
dat = current_polls %>%
na.omit() %>%
group_by(year, state, special) %>%
summarise(DEMOCRAT = sum(DEM), REPUBLICAN = sum(REP)) %>%
mutate(state = toupper(state))
dat$a = 0; dat$b = 0
for ( i in 1:nrow(dat) ) {
s = dat$state[i]
y = dat$year[i]
p = historical_results %>% filter(state == s & year < y) %>% pull(twoparty)
mu = mean(p, na.rm = TRUE)
s2 = var(p, na.rm = TRUE)
dat$a[i] = ( ((1 - mu) / s2) - (1 / mu) ) * mu^2
dat$b[i] = dat$a[i] * ( (1 / mu) - 1 )
}
dat = dat %>%
mutate(estimate = (a + DEMOCRAT) / (a + b + DEMOCRAT + REPUBLICAN))
safe_d = 36 + 2 ## seats not up + CA + HI
safe_r = 29 + 3 ## seats not up + LA + ND + ID
```
```{r map}
plotdat = dat %>% ungroup() %>% select(state, estimate)
addon = data.frame(
state = c("CALIFORNIA", "LOUISIANA", "IDAHO", "HAWAII", "NORTH DAKOTA"),
estimate = c(1, 0, 0, 1, 0)
)
plotdat = rbind(plotdat, addon)
plot_usmap(data = plotdat, values = "estimate") +
scale_fill_gradient2(
low = "firebrick3",
mid = "white",
high = "dodgerblue4",
midpoint = 0.5
)
```
(The completely solid red or blue states were states that could not be estimated by this model due to data issues such as lack of polling, but there is also a very strong consensus on the outcome in those states so I treated them as a particular party winning them with probability 1).
We'd end up with `r sum(dat$estimate > 0.5) + safe_d` Democrats in the Senate and `r sum(dat$estimate <= 0.5) + safe_r` Republicans in the Senate... at least just looking at the point estimates alone and using 0.5 as the prediction cutoff.
However, just like with all the other models of this election, this naive model gives lots of close races; here are the 10 closest according to this model:
```{r tbl}
dat %>%
ungroup() %>%
mutate(state = tools::toTitleCase(tolower(state))) %>%
mutate(closeness = abs(0.5 - estimate)) %>%
arrange(closeness) %>%
select(state, estimate) %>%
rename(State = state) %>%
rename(`Estimated Dem Voteshare` = estimate) %>%
head(10) %>%
knitr::kable(digits = 2)
```
So, we can do the usual thing that these modellers do and simulate a whole bunch of election outcomes from the model and then summarize them for you to get a sense of the uncertainty and the likely range of outcomes according to the model:
```{r simulation}
dat = dat %>% mutate(A = a + DEMOCRAT, B = b + REPUBLICAN)
nsamples = 40000 ## I'll do 40,000 like everyone else I guess lol
seats = numeric(nsamples)
set.seed(138)
for ( s in 1:nsamples ) {
shares = sapply(1:nrow(dat), function(i) rbeta(1, dat$A[i], dat$B[i]))
seats[s] = sum(shares > 0.5) + safe_d
}
histdat = data.frame(seats = seats)
ggplot(data = histdat, mapping = aes(x = seats)) +
geom_bar(fill = "dodgerblue4") +
xlab("Number of Senate seats held by Democrats") +
ylab("Number of simulations (out of 40,000)")
```
```{r simulation-tab}
histdat %>%
rename(`Dem-held seats` = seats) %>%
group_by(`Dem-held seats`) %>%
summarise(`N Simulations` = n()) %>%
knitr::kable()
```
Unsurprisingly, there's less uncertainty in this model than some others given its almost total lack of sophistication.
The modal outcome of this model, by far, is the Dems keeping the Senate, and even gaining a seat, up to 51-49.
But Dems ending up with more than that is quite a rare outcome of this model.
...And the model overestimated Dems by just a little in the last two cycles... so it's gonna be a nail-biter 😬️
## Appendices
### Setting the Prior Parameters
A Beta distribution is defined by two parameters, $a$ and $b$.
The mean and variance of a Beta distribution are given by
$$
\mu = \dfrac{a}{a + b}
$$
and
$$
\sigma^2 = \dfrac{ab}{(a + b)^2 (a + b + 1)}
$$
respectively.
So we can use *the method of moments* to set $a$ and $b$ for our Beta prior over $\delta_i$ by taking the election returns from the last $N$ (say 3) Senate elections in state $i$ and setting $\mu$ equal to their mean and $\sigma^2$ equal to their variance, then solving for $a$ and $b$ using the equations above.
### Omitted cases
To avoid dealing with wrinkles in just what Democratic two party voteshare means in elections where there's multiple Democrats, I omit races in California, Louisiana, and the 2020 Georgia special.