-
Notifications
You must be signed in to change notification settings - Fork 9
/
bus-model.Rmd
executable file
·165 lines (130 loc) · 6.04 KB
/
bus-model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: "Example bus model"
output:
github_document:
toc: true
---
## Introduction
This document describes a simple probabilistic prediction model of bus arrival time used to generate realistic-looking predictions for the survey. While the model fits reasonably well to one bus from King County Metro, and served the purpose of giving predictive distributions for our survey, we do not recommend using this model unmodified in deployment. In particular, it is probably overfit to the couple of busses we tested it on.
Please cite:
Matthew Kay, Tara Kola, Jessica Hullman, Sean Munson. _When (ish) is My Bus?
User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems_.
CHI 2016. DOI: [10.1145/2858036.2858558](http://dx.doi.org/10.1145/2858036.2858558).
## Setup
### Required libraries
If you are missing any of the packages below, use `install.packages("packagename")` to install them.
The `import::` syntax requires the `import` package to be installed, and provides a simple way to
import specific functions from a package without polluting your entire namespace (unlike `library()`)
```{r setup, results="hide", message=FALSE}
library(ggplot2)
import::from(gamlss, gamlss, predictAll)
import::from(gamlss.dist, BCTo, dBCTo, pBCTo, qBCTo)
import::from(magrittr, `%>%`, `%<>%`, `%$%`)
import::from(dplyr,
transmute, group_by, mutate, filter, select, data_frame,
left_join, summarise, one_of, arrange, do, ungroup)
```
### Ggplot theme
```{r setup_ggplot}
theme_set(theme_light() + theme(
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.line=element_line(color="black"),
text=element_text(size=14),
axis.text=element_text(size=rel(15/16)),
axis.ticks.length=unit(8, "points"),
line=element_line(size=.75)
))
```
## Data
Let's load historical data from one bus:
```{r load_data}
df = read.csv("data/bus-history-44.csv")
```
We're particularly interested in just a few columns from this table:
```{r data_str}
df %>%
select(
timeToArrivalMin, timeToPredictedArrivalMin,
timeSinceLastUpdateMin, timeToScheduledArrivalMin,
predictedArrivalDelayMin
) %>%
head(10)
```
These give the actual time to arrival from when the reading was taken, the time to arrival predicted by OneBusAway at that moment, the time since the last update, the time until the bus is scheduled to arrive, and the predicted delay (predicted minus scheduled arrival time).
## Model
We want to predict the time to arrival based on the other four columns. Below is a Box-Cox t (BCT) regression that does so (strictly speaking, a _shifted_ Box-Cox t; the various `+ 15`s and `- 15`s used throughout this document shift the BCT distribution to be defined on `(-15, Inf)` instead of its usual `(0, Inf)`):
```{r fit_model, results="hide"}
m = gamlss(
I(timeToArrivalMin + 15) ~ log(timeToPredictedArrivalMin + 15) + timeSinceLastUpdateMin,
sigma.formula = ~ log(timeToPredictedArrivalMin + 15),
nu.formula = ~ I(log(predictedArrivalDelayMin + 15) - log(timeToPredictedArrivalMin + 15)),
tau.formula = ~ log(timeToPredictedArrivalMin + 15),
data=df,
n.cyc=50,
family=BCTo)
```
Here are some model diagnostics for this model:
```{r model_diagnostics}
plot(m)
```
It doesn't look *too* bad, at least as a description of this particular bus (though it is probably overfit and could do with regularization, but for the purposes of our survey this is not a concern).
The fitted model against the data looks like this:
```{r predict_versus_actual_plot, warning=FALSE}
predictions =
data_frame(
timeToPredictedArrivalMin = seq(-10, 35, length.out = 100),
timeSinceLastUpdateMin = mean(df$timeSinceLastUpdateMin),
predictedArrivalDelayMin = mean(df$predictedArrivalDelayMin),
timeToScheduledArrivalMin = timeToPredictedArrivalMin + predictedArrivalDelayMin
) %>%
cbind(predictAll(m, newdata = ., data = df)) %>%
mutate(
upper95 = qBCTo(.975, mu, sigma, nu, tau) - 15,
lower95 = qBCTo(.025, mu, sigma, nu, tau) - 15,
upper80 = qBCTo(.9, mu, sigma, nu, tau) - 15,
lower80 = qBCTo(.1, mu, sigma, nu, tau) - 15,
upper50 = qBCTo(.75, mu, sigma, nu, tau) - 15,
lower50 = qBCTo(.25, mu, sigma, nu, tau) - 15,
upper10 = qBCTo(.55, mu, sigma, nu, tau) - 15,
lower10 = qBCTo(.45, mu, sigma, nu, tau) - 15
)
predictions %>%
ggplot(aes(x = timeToPredictedArrivalMin)) +
geom_ribbon(aes(ymin = lower95, ymax = upper95), fill="gray50") +
geom_ribbon(aes(ymin = lower50, ymax = upper80), fill="gray40") +
geom_ribbon(aes(ymin = lower50, ymax = upper50), fill="gray30") +
geom_ribbon(aes(ymin = lower10, ymax = upper10), fill="gray20") +
geom_point(aes(y = timeToArrivalMin), data = df) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0)
```
Not completely terrible.
## Generating a predictive distribution for one bus
Given the model and the observations of the current point prediction from OneBusAway for a single bus, we can generate a predictive distribution based on that point:
```{r one_bus_prediction, warning=FALSE}
one_bus = data_frame(
#observations presumed to come from OneBusAway:
timeToPredictedArrivalMin = 20,
timeSinceLastUpdateMin = 1,
predictedArrivalDelayMin = 5,
timeToScheduledArrivalMin = timeToPredictedArrivalMin + predictedArrivalDelayMin
) %>%
cbind(predictAll(m, newdata = ., data = df))
```
The prediction yields parameters of a Box-Cox t distribution, `mu`, `sigma`, `nu`, and `tau`:
```{r one_bus_params}
t(one_bus)
```
These parameters define a probability distribution that we can plot, for example as a density plot:
```{r density_plot, fig.width=7, fig.height=3}
one_bus %$% curve(dBCTo(x + 15, mu, sigma, nu, tau), n = 1001, xlim = c(-10, 40), ylab="d")
```
Or as a quantile dotplot:
```{r quantile_dotplot, fig.width=7, fig.height=2}
one_bus %$%
data_frame(x = qBCTo(ppoints(20), mu, sigma, nu, tau) - 15) %>%
ggplot(aes(x = x)) +
geom_dotplot(binwidth = 1.5) +
xlim(-10, 40)
```