\input{common/header.tex}
\inbpdocument
\chapter{Gaussian Conditionals}
\label{ch:appendix-gaussians}
%\section{Formula for Gaussian Conditionals}
A standard result gives the distribution of a subset of dimensions $\vy_A$ of a multivariate Gaussian vector $\vy$, conditioned on the remaining dimensions $\vy_B$.
If
%
\begin{align}
\vy = \colvec{\vy_A}{\vy_B} \sim \Nt{\colvec{\vmu_A}{\vmu_B}}{\left[ \begin{array}{cc}\vSigma_{AA} & \vSigma_{AB} \\ \vSigma_{BA} & \vSigma_{BB} \end{array} \right]}
%\left[ \begin{array} \vx_A \sim \Nt{\vmu_A}{
%| \vx_B \sim \Nt{\vmu_A + \vSigma_{AB} \vSigma_{BB}\inv \left( \vx_B - \vmu_B \right) }
%{\vSigma_{AA} - \vSigma_{AB} \vSigma_{BB}\inv \vSigma_{BA} }
%\label{eq:gauss_conditional}
\end{align}
%
then
%
\begin{align}
\vy_A | \vy_B \sim \mathcal{N} \big(\vmu_A + \vSigma_{AB} \vSigma_{BB}\inv \left( \vy_B - \vmu_B \right),
\vSigma_{AA} - \vSigma_{AB} \vSigma_{BB}\inv \vSigma_{BA} \big).
\label{eq:gauss_conditional}
\end{align}
This result can be used in the context of Gaussian process regression, where $\vy_B = [f(\vx_1), f(\vx_2), \dots, f(\vx_N)]$ represents the function values observed at locations $\vX = [\vx_1, \vx_2, \dots, \vx_N]$, while $\vy_A = [f(\vx^\star_1), f(\vx^\star_2), \ldots]$ represents the function values at a set of test locations $\vX^\star$, whose predictive distribution we'd like to know.
In this case, the necessary covariance matrices are given by:
%
\begin{align}
\vSigma_{AA} & = k(\vX^\star, \vX^\star) \\
\vSigma_{AB} & = k(\vX^\star, \vX) \\
\vSigma_{BA} & = k(\vX, \vX^\star) \\
\vSigma_{BB} & = k(\vX, \vX)
\end{align}
and similarly for the mean vectors.
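For concreteness, the following Python sketch (not part of the thesis code; the kernel choice and all names are illustrative) applies \cref{eq:gauss_conditional} with these covariance blocks to compute the \gp{} predictive mean and covariance at a set of test locations:
%
\begin{verbatim}
# Sketch: GP posterior predictive via the Gaussian conditioning formula above.
# Assumes a zero mean function and a squared-exponential kernel.
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2)), evaluated pairwise.
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=1e-6):
    K_BB = se_kernel(X, X) + noise * np.eye(len(X))     # Sigma_BB = k(X, X)
    K_AB = se_kernel(X_star, X)                         # Sigma_AB = k(X*, X)
    K_AA = se_kernel(X_star, X_star)                    # Sigma_AA = k(X*, X*)
    alpha = np.linalg.solve(K_BB, y)                    # Sigma_BB^{-1} (y_B - mu_B), mu = 0
    mean = K_AB @ alpha
    cov = K_AA - K_AB @ np.linalg.solve(K_BB, K_AB.T)
    return mean, cov

X = np.array([-2.0, -1.0, 0.5, 1.5])
y = np.sin(X)
mean, cov = gp_predict(X, y, np.linspace(-3.0, 3.0, 5))
\end{verbatim}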
\chapter{Kernel Definitions}
\label{sec:kernel-definitions}
%\newcommand{\scalefactor}{\sigma_f^2}
\newcommand{\scalefactor}{}
This appendix gives the formulas for all one-dimensional base kernels used in the thesis.
Each of these formulas is multiplied by a scale factor $\sigma_f^2$, which we omit for clarity.
%
\begin{align}
\kC(\inputVar, \inputVar') & = \scalefactor 1 \\
\kSE(\inputVar, \inputVar') & = \scalefactor \exp\left(-\frac{(\inputVar - \inputVar')^2}{2\ell^2}\right) \label{eq:appendix-se}\\
\kPer(\inputVar, \inputVar') & = \scalefactor \exp\left(-\frac{2}{\ell^2} \sin^2 \left( \pi \frac{\inputVar - \inputVar'}{p} \right)\right) \label{eq:appendix-periodic}\\
\kLin(\inputVar, \inputVar') & = \scalefactor (\inputVar - c)(\inputVar' - c) \label{eq:appendix-lin} \\
\kRQ(\inputVar, \inputVar') & = \scalefactor \left( 1 + \frac{(\inputVar - \inputVar')^2}{2\alpha\ell^2}\right)^{-\alpha} \label{eq:appendix-rq}\\
%\kPer(\inputVar, \inputVar') & = \sigma_f^2 \frac{\exp\left(\frac{1}{\ell^2}\cos 2 \pi \frac{(\inputVar - \inputVar')}{p}\right) - I_0\left(\frac{1}{\ell^2}\right)}{\exp\left(\frac{1}{\ell^2}\right) - I_0\left(\frac{1}{\ell^2}\right)} \label{eq:generalized-periodic} \\
%\kPer(\inputVar, \inputVar') & = \scalefactor \exp\left(\frac{1}{\ell^2}\cos 2 \pi \frac{(\inputVar - \inputVar')}{p}\right) - I_0\left(\frac{1}{\ell^2}\right) \\
\cos(\inputVar, \inputVar') & = \scalefactor \cos\left(\frac{2 \pi (\inputVar - \inputVar')}{p}\right) \label{eq:appendix-cos} \\
\kWN(\inputVar, \inputVar') & = \scalefactor \delta(\inputVar - \inputVar') \label{eq:appendix-wn}\\
\kCP(\kernel_1, \kernel_2)(\inputVar, \inputVar') & = \scalefactor \sigma(\inputVar) \kernel_1(\inputVar,\inputVar')\sigma(\inputVar') + (1-\sigma(\inputVar)) \kernel_2(\inputVar,\inputVar')(1-\sigma(\inputVar')) \\
\boldsymbol\sigma(\inputVar, \inputVar') & = \scalefactor \sigma(\inputVar)\sigma(\inputVar') \\
\boldsymbol{\bar\sigma}(\inputVar, \inputVar') & = \scalefactor (1-\sigma(\inputVar))(1-\sigma(\inputVar'))
\end{align}
%
where $\delta(\inputVar - \inputVar')$ is the Kronecker delta function (equal to one when $\inputVar = \inputVar'$ and zero otherwise), $\{c,\ell,p,\alpha\}$ are kernel parameters, and ${\sigma(\inputVar) = \nicefrac{1}{1 + \exp(-\inputVar)}}$ is the logistic sigmoid.
\Cref{eq:appendix-se,eq:appendix-periodic,eq:appendix-lin} are plotted in \cref{fig:basic_kernels}, and \cref{eq:appendix-wn,eq:appendix-rq,eq:appendix-cos} are plotted in \cref{fig:basic_kernels_two}.
Draws from \gp{} priors with changepoint kernels are shown in \cref{fig:changepoint_examples}.
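For reference, a few of these base kernels can be transcribed directly from the formulas above; the Python sketch below (illustrative only, with the scale factor $\sigma_f^2$ omitted as in the text) implements \cref{eq:appendix-se,eq:appendix-periodic,eq:appendix-lin,eq:appendix-wn} for scalar inputs:
%
\begin{verbatim}
# Illustrative sketch of the SE, periodic, linear, and white-noise base kernels
# above (scale factor sigma_f^2 omitted, as in the text).
import numpy as np

def k_se(x, xp, ell=1.0):
    return np.exp(-(x - xp) ** 2 / (2.0 * ell ** 2))

def k_per(x, xp, ell=1.0, p=1.0):
    return np.exp(-2.0 / ell ** 2 * np.sin(np.pi * (x - xp) / p) ** 2)

def k_lin(x, xp, c=0.0):
    return (x - c) * (xp - c)

def k_wn(x, xp):
    # Kronecker delta: nonzero only when the inputs coincide.
    return np.asarray(x == xp, dtype=float)
\end{verbatim}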
\subsubsection{The zero-mean periodic kernel}
%\citet{lloyd-periodic}
James Lloyd (personal communication) showed that the standard periodic kernel due to \citet{mackay1998introduction} can be decomposed into a sum of a periodic and a constant component.
He derived the equivalent periodic kernel without any constant component:
%, shown in \cref{eq:generalized-periodic}.
%
\begin{align}
\kPerGen(\inputVar, \inputVar') & = \sigma_f^2 \frac{\exp\left(\frac{1}{\ell^2}\cos 2 \pi \frac{(\inputVar - \inputVar')}{p}\right) - I_0\left(\frac{1}{\ell^2}\right)}{\exp\left(\frac{1}{\ell^2}\right) - I_0\left(\frac{1}{\ell^2}\right)} \label{eq:generalized-periodic}
\end{align}
%
where $I_0$ is the modified Bessel function of the first kind of order zero.
He further showed that its limit as the lengthscale grows is the cosine kernel:
%
\begin{equation}
\lim_{\ell \to \infty} \kPerGen(\inputVar, \inputVar') = \sigma_f^2 \cos\left(\frac{2 \pi (\inputVar - \inputVar')}{p}\right).
\end{equation}
Separating out the constant component allows us to express negative prior covariance and makes the resulting models more interpretable.
This covariance function is included in the \GPML{} software package~\citep{GPML}, and its source can be viewed at \url{gaussianprocess.org/gpml/code/matlab/cov/covPeriodicNoDC.m}.
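As a rough numerical check (a Python sketch, not the \GPML{} implementation), one can evaluate \cref{eq:generalized-periodic} with a long lengthscale and confirm that it approaches the cosine kernel:
%
\begin{verbatim}
# Sketch: numerically check that the zero-mean periodic kernel approaches the
# cosine kernel as the lengthscale grows.  Not the GPML covPeriodicNoDC code;
# parameter names are illustrative.
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def k_per_gen(tau, ell, p, sf2=1.0):
    z = 1.0 / ell ** 2
    return sf2 * (np.exp(z * np.cos(2 * np.pi * tau / p)) - i0(z)) / (np.exp(z) - i0(z))

tau = np.linspace(-1.0, 1.0, 7)
print(np.max(np.abs(k_per_gen(tau, ell=100.0, p=1.0) - np.cos(2 * np.pi * tau))))
# -> small: the two kernels agree in the long-lengthscale limit
\end{verbatim}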
\chapter{Search Operators}
\label{ch:appendix-search}
\label{sec:search-operators}
%\subsection{Overview}
The model construction phase of \procedurename{} starts with the noise kernel, $\kWN$.
New kernel expressions are generated by applying search operators to the current kernel, which replace some part of the existing kernel expression with a new kernel expression.
%When new base kernels are proposed by the search operators, their parameters are randomly initialised with several restarts.
%Parameters are then optimized by conjugate gradients to maximise the likelihood of the data conditioned on the kernel parameters.
%The kernels are then scored by the Bayesian information criterion and the top scoring kernel is selected as the new kernel.
%The search then proceeds by applying the search operators to the new kernel \ie this is a greedy search algorithm.
%In all experiments, 10 random restarts were used for parameter initialisation and the search was run to a depth of 10.
%\subsection{Search operators}
The multidimensional regression experiments in \cref{sec:synthetic,sec:additive-experiments} used only the following search operators:
%
\begin{eqnarray}
\mathcal{S} &\to& \mathcal{S} + \mathcal{B} \\
\mathcal{S} &\to& \mathcal{S} \times \mathcal{B} \label{eq:search-multiply}\\
\mathcal{B} &\to& \mathcal{B'}
\end{eqnarray}
%
where $\mathcal{S}$ represents any kernel sub-expression and $\mathcal{B}$ is any base kernel within a kernel expression.
These search operators represent addition, multiplication and replacement.
When the multiplication operator is applied to a sub-expression which is itself a sum of sub-expressions, parentheses are introduced.
For instance, if rule \eqref{eq:search-multiply} is applied to the sub-expression $k_1 + k_2$, the resulting expression is $(k_1 \kernplus k_2) \kerntimes \mathcal{B}$.
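A minimal Python sketch of how these operators generate neighbouring kernel expressions is shown below (illustrative only; it applies the rules at the root of the expression, whereas the actual search applies them to every sub-expression):
%
\begin{verbatim}
# Illustrative sketch of the three basic search operators applied at the root of
# a kernel expression; the real search applies them to every sub-expression.
BASE_KERNELS = ["SE", "Lin", "Per", "WN", "C"]

def expand(expr):
    """Return neighbouring expressions: S -> S + B, S -> S x B, B -> B'."""
    neighbours = []
    for b in BASE_KERNELS:
        neighbours.append("(%s) + %s" % (expr, b))   # S -> S + B
        neighbours.append("(%s) x %s" % (expr, b))   # S -> S x B  (parenthesised)
    if expr in BASE_KERNELS:                         # B -> B'
        neighbours += [b for b in BASE_KERNELS if b != expr]
    return neighbours

# e.g. applying the multiplication rule to "SE + Per" yields "(SE + Per) x Lin", etc.
print(expand("SE + Per")[:4])
\end{verbatim}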
Afterwards, we added several more search operators in order to speed up the search.
This expanded set of operators was used in the experiments in \cref{sec:time_series,sec:Predictive accuracy on time series,ch:description}.
These new operators do not change the set of possible models.
To accommodate changepoints and changewindows, we introduced the following additional operators to our search:
%
\begin{eqnarray}
\mathcal{S} &\to& \kCP(\mathcal{S},\mathcal{S}) \\
\mathcal{S} &\to& \kCW(\mathcal{S},\mathcal{S}) \\
\mathcal{S} &\to& \kCW(\mathcal{S},\kC) \\
\mathcal{S} &\to& \kCW(\kC,\mathcal{S})
\end{eqnarray}
%
where $\kC$ is the constant kernel.
The last two operators result in a kernel only applying outside, or within, a certain region.
To allow the search to simplify existing expressions, we introduced the following operators:
%
\begin{eqnarray}
\mathcal{S} &\to& \mathcal{B}\\
\mathcal{S} + \mathcal{S'} &\to& \mathcal{S}\\
\mathcal{S} \times \mathcal{S'} &\to& \mathcal{S}
\end{eqnarray}
%
where $\mathcal{S'}$ represents any other kernel expression.
%Their introduction is currently not rigorously justified.
We also introduced the operator
%
\begin{eqnarray}
\mathcal{S} &\to& \mathcal{S} \times (\mathcal{B} + \kC)
\end{eqnarray}
%
which allows a new base kernel to be added along with the constant kernel, for cases where multiplying by a base kernel alone would be overly restrictive.
\chapter{Example Automatically Generated Report}
\label{ch:example-solar}
The following pages of this appendix contain an automatically generated report, run on a dataset of annual solar irradiance measurements from 1610 to 2011.
This dataset was previously analyzed by \citet{lean1995reconstruction}.
The structure search was run using the \procedurename-interpretable variant, with base kernels $\kSE$, $\kLin$, $\kC$, $\kPer$, $\vsigma$, and $\kWN$.
Other example reports can be found at \url{http://www.mlg.eng.cam.ac.uk/Lloyd/abcdoutput/}, including analyses of wheat prices, temperature records, call centre volumes, radio interference, gas production, unemployment, number of births, and wages over time.
\newcommand{\solarreportpage}[1]{\includegraphics[width=\columnwidth]{figures/solarpages/02-solar-seperate-pages-#1}}
\clearpage
%\solarreportpage{1}
\solarreportpage{2}
\\ \vspace{1cm} \\
\solarreportpage{3}
\\ \vspace{1cm} \\
\solarreportpage{4}
\\ \vspace{1cm} \\
\solarreportpage{5}
\\ \vspace{1cm} \\
\solarreportpage{6}
\\ \vspace{1cm} \\
\solarreportpage{7}
\\ \vspace{1cm} \\
\solarreportpage{8}
\\ \vspace{1cm} \\
\solarreportpage{9}
\\ \vspace{1cm} \\
\solarreportpage{10}
\\ \vspace{1cm} \\
\solarreportpage{11}
\\ \vspace{1cm} \\
\solarreportpage{12}
\\ \vspace{1cm} \\
\solarreportpage{13}
%\solarreportpage{14}
%\solarreportpage{15}
%\solarreportpage{16}
%\solarreportpage{17}
%\solarreportpage{18}
%\solarreportpage{19}
%\solarreportpage{20}
%\solarreportpage{21}
%\solarreportpage{22}
\chapter{Inference in the Warped Mixture Model}
\label{ch:warped-appendix}
\subsubsection{Detailed definition of model}
The \iwmm{} assumes that the latent density is an infinite mixture of Gaussians:
%
\begin{align}
p(\vx ) = \sum_{c=1}^{\infty} \lambda_c \, {\cal N}(\vx|\bm{\mu}_{c},\vR_{c}\inv)
\end{align}
%
where $\lambda_{c}$, $\bm{\mu}_{c}$ and $\vR_{c}$ are the mixture weight, mean, and precision matrix of the $c$\textsuperscript{th} mixture component.
We place a conjugate Gaussian-Wishart prior on the parameters $\{\bm{\mu}_{c},\vR_{c}\}$ of each Gaussian component:
%
\begin{align}
p(\bm{\mu}_{c},\vR_{c})
= {\cal N}(\bm{\mu}_{c}|\vu,(r\vR_{c})\inv)
{\cal W}(\vR_{c}|\vS\inv,\nu),
\label{eq:normal-wishar-prior}
\end{align}
%
where $\vu$ is the mean of $\vmu_c$, $r$ is the relative precision of $\vmu_c$, $\vS\inv$ is the scale matrix for $\vR_c$, and $\nu$ is the number of degrees of freedom for $\vR_c$.
The Wishart distribution is defined as:
%
\begin{align}
{\cal W}(\vR|\vS\inv,\nu)=\frac{1}{G}|\vR|^{\frac{\nu-Q-1}{2}}\exp\left(-\frac{1}{2}{\rm tr}(\vS\vR)\right),
\end{align}
%
where $G$ is the normalizing constant.
Because we use conjugate Gaussian-Wishart priors for the parameters of the Gaussian mixture components, we can analytically integrate out those parameters given the assignments of points to components.
Let $z_n$ be the assignment of the $n$\textsuperscript{th} point.
The prior probability of latent coordinates $\vX$ given latent cluster assignments $\CLAS=(\CLASi_1, \CLASi_2, \ldots, \CLASi_N)$ factorizes over clusters, and can be obtained in closed-form by integrating out the Gaussian parameters $\{\vmu_c, \vR_c\}$ to give:
%
\begin{align}
p(\vX | \CLAS, \vS, \nu, \vu, r) = \prod_{c=1}^{\infty}
\pi^{-\frac{N_c Q}{2}}\frac{r^{Q/2} \left|\vS \right|^{\nu/2}}{r_c^{Q/2} \left| \vS_c \right|^{\nu_c/2}}
\times \prod_{q=1}^Q \frac{\Gamma \left( \frac{\nu_c + 1 - q}{2} \right)}{\Gamma \left( \frac{\nu+1-q}{2} \right)},
\label{eq:px_z}
\end{align}
%
where $N_c$ is the number of data points assigned to the $c$\textsuperscript{th} component, $\Gamma(\cdot)$ is the Gamma function, and
%
\begin{align}
r_{c}=r+N_{c},
\hspace{2em}
\nu_{c}=\nu+N_{c},
\hspace{2em}
%\nonumber
%\end{align}
%\begin{align}
\vu_c=\frac{\displaystyle r\vu + \sum_{n:z_n = c} \vx_n}{r+N_c}, \\[1em]
%\end{align}
%
%\begin{align}
\textnormal{and} \hspace{2em} \vS_{c}=\vS+\sum_{n:z_{n}=c}\vx_{n}\vx_{n}\tra + r\vu\vu\tra
- r_{c}\vu_{c}\vu_{c}\tra,
\end{align}
%
are the posterior Gaussian-Wishart parameters of the $c$\textsuperscript{th} component~\citep{murphy2007conjugate}.
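The log of \cref{eq:px_z} follows directly from these posterior parameters; the Python sketch below (illustrative names, not the reference implementation) computes it for a given assignment vector:
%
\begin{verbatim}
# Sketch (not the reference implementation): log p(X | z) from the equation above,
# using the posterior Gaussian-Wishart parameters r_c, nu_c, u_c, S_c.
import numpy as np
from scipy.special import gammaln

def log_p_X_given_z(X, z, S, nu, u, r):
    N, Q = X.shape
    total = 0.0
    for c in np.unique(z):
        Xc = X[z == c]
        Nc = len(Xc)
        r_c, nu_c = r + Nc, nu + Nc
        u_c = (r * u + Xc.sum(axis=0)) / r_c
        S_c = S + Xc.T @ Xc + r * np.outer(u, u) - r_c * np.outer(u_c, u_c)
        logdet_S = np.linalg.slogdet(S)[1]
        logdet_Sc = np.linalg.slogdet(S_c)[1]
        total += (-0.5 * Nc * Q * np.log(np.pi)
                  + 0.5 * Q * (np.log(r) - np.log(r_c))
                  + 0.5 * (nu * logdet_S - nu_c * logdet_Sc)
                  + sum(gammaln(0.5 * (nu_c + 1 - q)) - gammaln(0.5 * (nu + 1 - q))
                        for q in range(1, Q + 1)))
    return total
\end{verbatim}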
To model the cluster assignments, we use a Dirichlet process~\citep{maceachern1998estimating} with concentration parameter $\eta$.
% for infinite mixture modeling in the latent space.
%The probability under a Dirichlet-multinomial prior of observing assignment $\CLAS$ is:
Under a Dirichlet process prior, the probability of observing a particular cluster assignment $\CLAS$ depends only on the partition induced, and is given by the Chinese restaurant process:
%
\begin{align}
p(\CLAS|\eta) =
\frac{\Gamma(\eta) \eta^C}{\Gamma(\eta + N)} \prod_{c=1}^C \Gamma( N_c )
%= \frac{\eta^{C}\prod_{c=1}^C (N_c-1)!}{\eta(\eta+1)\ldots(\eta+N-1)},
\label{eq:pz}
\end{align}
%
where $C$ is the number of components for which $N_c > 0$, and $N$ is the total number of datapoints.
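For reference, \cref{eq:pz} can be evaluated as follows (a Python sketch with illustrative names):
%
\begin{verbatim}
# Sketch: log probability of a cluster assignment under the Chinese restaurant
# process, as in the equation above.
import numpy as np
from scipy.special import gammaln

def log_crp(z, eta):
    counts = np.unique(z, return_counts=True)[1]   # N_c for occupied components
    N, C = len(z), len(counts)
    return (gammaln(eta) + C * np.log(eta) - gammaln(eta + N)
            + np.sum(gammaln(counts)))

print(log_crp(np.array([0, 0, 1, 0, 2]), eta=1.0))
\end{verbatim}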
The joint distribution of observed coordinates, latent coordinates, and cluster assignments is given by
%
\begin{align}
p(\vY,\vX,\CLAS|\bm{\theta},\bm{S},\nu,\vu,r,\eta)
= p(\vY|\vX,\bm{\theta})
p(\vX|\CLAS,\bm{S},\nu,\vu,r)p(\CLAS|\eta),
\label{eq:joint}
\end{align}
%
where the factors on the right-hand side are given by equations \eqref{eq:py_x}, \eqref{eq:px_z} and \eqref{eq:pz}, respectively.
\iffalse
\subsubsection{Generative description of model}
The infinite warped mixture model generates observations $\vY$ according to the following generative process:
%
\begin{enumerate}
\item Draw mixture weights $\bm{\lambda} \sim \GEM(\eta)$
\item For each component $c=1, 2, \dots, \infty$
\begin{enumerate}
\item Draw precision $\vR_c \sim {\mathcal W}(\vS\inv, \nu)$
\item Draw mean $\vmu_c \sim {\mathcal N}(\vu,(r\vR_c)\inv)$
\end{enumerate}
\item For each observed dimension $d=1, 2, \dots, D$
\begin{enumerate}
\item Draw function $f_{d}(\vx) \sim {\rm GP}(m(\vx),k(\vx,\vx'))$
\end{enumerate}
\item For each observation $n=1, 2, \dots,N$
\begin{enumerate}
\item Draw latent assignment $z_n \sim {\rm Mult}(\bm{\lambda})$
\item Draw latent coordinates $\vx_n \sim {\mathcal N}(\vmu_{z_n},\vR_{z_n}\inv)$
\item For each observed dimension $d=1, 2, \dots, D$
\begin{enumerate}
\item Draw feature $y_{nd} \sim {\cal N}(f_d(\vx_n), \sigma^2_n)$
\end{enumerate}
\end{enumerate}
\end{enumerate}
%
Here, $\GEM(\eta)$ is the stick-breaking process \citep{sethuraman94} that generates mixture weights for a Dirichlet process with parameter $\eta$, %${\rm Mult}(\cdot)$ represents a multinomial distribution,
${\rm Mult}(\bm{\lambda})$ represents a multinomial distribution with parameter $\bm{\lambda}$,
$m(\vx)$ is the mean function of the Gaussian process, $\vx,\vx'\in{\mathbb R}^{Q}$ and $\sigma^2_n$ is the noise variance of the \gp{} kernel.
\fi
\subsubsection{Details of inference}
\label{sec:iwmm-inference-details}
%By placing conjugate Gaussian-Wishart priors on the parameters of the Gaussian mixture components, we analytically integrate out those parameters given the assignments of points to clusters.
After analytically integrating out the parameters of the Gaussian mixture components, the only remaining variables to infer are the latent points $\vX$, the cluster assignments $\CLAS$, and the kernel parameters $\vtheta$.
We'll estimate the posterior over these parameters using Markov chain Monte Carlo.
In particular, we'll alternate between collapsed Gibbs sampling of each entry of $\CLAS$ and Hamiltonian Monte Carlo sampling of $\vX$ and $\vtheta$.
First, we explain collapsed Gibbs sampling for the cluster assignments $\CLAS$.
Given a sample of $\vX$, $p(\CLAS | \vX, \vS, \nu, \vu, r, \eta)$ does not depend on $\vY$.
This lets us resample cluster assignments, integrating out the \iGMM{} likelihood in closed form.
Given the current values of all cluster assignments other than $z_n$, a new value for $z_n$ is sampled with the following probability:
%
\begin{align}
p(z_{n}=c|\vX,\CLAS_{\setminus n},\bm{S},\nu,\vu,r,\eta)
& \propto\!
\left\{
\begin{array}{ll}
\!\!N_{c\setminus n}\cdot p(\vx_{n}|\vX_{c\setminus n},\bm{S},\nu,\vu,r) & \text{{\small existing components}}\\
\!\!\eta\cdot p(\vx_{n}|\bm{S},\nu,\vu,r) & \text{{\small a new component}}
\end{array}
\right.
\label{eq:gibbs}
\end{align}
%
where
$\vX_{c}=\{\vx_{n}|z_{n}=c\}$
is the set of latent coordinates assigned to the $c$\textsuperscript{th} component,
and $\setminus n$ represents the value or set when excluding the $n$\textsuperscript{th} data point.
We can analytically calculate $p(\vx_{n}|\vX_{c\setminus n},\bm{S},\nu,\vu,r)$
as follows:
%
\begin{align}
p(\vx_{n}|\vX_{c\setminus n},\bm{S},\nu,\vu,r)
=\pi^{-\frac{N_{c\setminus n}Q}{2}}
\frac{r_{c\setminus n} ^{Q/2} \left| \vS_{c\setminus n} \right|^{\nu_{c\setminus n}/2}}
{r_{c\setminus n}'^{Q/2} \left| \vS_{c\setminus n}' \right|^{\nu_{c\setminus n}'/2}}
\times \prod_{d=1}^{Q}
\frac{\Gamma \left( \frac{\nu_{c\setminus n}' + 1 - d}{2} \right)}
{\Gamma \left( \frac{\nu_{c\setminus n} + 1 - d}{2} \right)},
\label{eq:iwmm-cluster-gibbs}
\end{align}
%
where $r_{c}'$, $\nu_{c}'$, $\vu_{c}'$ and $\vS_{c}'$ denote the posterior Gaussian-Wishart parameters of the $c$\textsuperscript{th} component after the $n$\textsuperscript{th} data point has been assigned to it.
%the $c$\textsuperscript{th} component.
We can efficiently calculate the determinant $\left| \vS_{c\setminus n}' \right|$ using the rank-one Cholesky update.
A special case of \cref{eq:iwmm-cluster-gibbs} gives the likelihood for a new component, $p(\vx_{n}|\bm{S},\nu,\vu,r)$.
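A minimal sketch of this collapsed Gibbs step is given below (illustrative; it recomputes each marginal likelihood from scratch by reusing the log_p_X_given_z sketch above, whereas the reference implementation uses rank-one Cholesky updates):
%
\begin{verbatim}
# Sketch of the collapsed Gibbs update for z_n.  Computes each component-conditional
# density p(x_n | X_c, S, nu, u, r) as a ratio of marginal likelihoods, using the
# log_p_X_given_z sketch above.  Illustrative only.
import numpy as np

def component_log_pred(x_n, X_c, S, nu, u, r):
    X_with = np.vstack([X_c, x_n]) if len(X_c) else x_n[None, :]
    lp_with = log_p_X_given_z(X_with, np.zeros(len(X_with), dtype=int), S, nu, u, r)
    lp_without = (log_p_X_given_z(X_c, np.zeros(len(X_c), dtype=int), S, nu, u, r)
                  if len(X_c) else 0.0)
    return lp_with - lp_without

def gibbs_resample_zn(n, X, z, S, nu, u, r, eta, rng):
    keep = np.delete(np.arange(len(X)), n)
    z_rest, X_rest = z[keep], X[keep]
    components = np.unique(z_rest)
    logp = [np.log(np.sum(z_rest == c))                      # existing components
            + component_log_pred(X[n], X_rest[z_rest == c], S, nu, u, r)
            for c in components]
    logp.append(np.log(eta)                                  # a new component
                + component_log_pred(X[n], np.empty((0, X.shape[1])), S, nu, u, r))
    logp = np.array(logp)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    k = rng.choice(len(p), p=p)
    return components[k] if k < len(components) else z.max() + 1
\end{verbatim}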
\subsubsection{Gradients for Hamiltonian Monte Carlo}
Hamiltonian Monte Carlo (\HMC{}) sampling of $\vX$ from the posterior ${p(\vX|\CLAS,\vY,\bm{\theta},\bm{S},\nu,\vu,r)}$ requires computing the gradient of the log unnormalized posterior with respect to $\vX$:
%
\begin{align}
\pderiv{}{\vX} \big[ \log p(\vY|\vX,\bm{\theta}) + \log p(\vX|\CLAS,\bm{S},\nu,\vu,r) \big]
\label{eq:warped-hmc1}
\end{align}
%
The first term of the gradient \eqref{eq:warped-hmc1} can be calculated as
%
\begin{align}
\pderiv{\log p(\vY | \vX,\bm{\theta})}{\vX} = \pderiv{\log p(\vY | \vX,\bm{\theta})}{\vK} \pderiv{\vK}{\vX} = \left[ -\frac{1}{2}D\vK\inv+\frac{1}{2}\vK\inv \vY \vY\tra \vK\inv \right]\left[\pderiv{\vK}{\vX}\right],
\label{eq:warped-hmc2}
\end{align}
%
where for an $\kSE + \kWN$ kernel with the same lengthscale $\ell$ on all dimensions,
%
\begin{align}
\pderiv{k(\vx_{n},\vx_{m})}{\vx_n}
& = -\frac{\sigma^2_f}{\ell^2} \exp \left( - \frac{1}{2 \ell^2} (\vx_n - \vx_m)\tra (\vx_n - \vx_m) \right) (\vx_n - \vx_m).
\label{eq:warped-hmc3}
\end{align}
%
%using the chain rule.
The second term of \eqref{eq:warped-hmc1} is given by
\begin{align}
\frac{\partial \log p(\vX|\CLAS,\bm{S},\nu,\vu,r)}{\partial \vx_{n}}
= -\nu_{z_{n}}\bm{S}_{z_{n}}\inv(\vx_{n}-\vu_{z_{n}}).
\label{eq:warped-hmc4}
\end{align}
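As a sanity check on \cref{eq:warped-hmc3}, the analytic derivative of the \kSE{} kernel with respect to a latent point can be compared against finite differences (a Python sketch, not the thesis code):
%
\begin{verbatim}
# Sketch: finite-difference check of the SE-kernel gradient in the equation above.
import numpy as np

def k_se(xn, xm, sf2=1.0, ell=1.0):
    d = xn - xm
    return sf2 * np.exp(-0.5 * d @ d / ell ** 2)

def dk_dxn(xn, xm, sf2=1.0, ell=1.0):
    # -(sigma_f^2 / ell^2) exp(-||xn - xm||^2 / (2 ell^2)) (xn - xm)
    return -(1.0 / ell ** 2) * k_se(xn, xm, sf2, ell) * (xn - xm)

rng = np.random.default_rng(0)
xn, xm = rng.standard_normal(2), rng.standard_normal(2)
eps = 1e-6
numeric = np.array([(k_se(xn + eps * e, xm) - k_se(xn - eps * e, xm)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(numeric, dk_dxn(xn, xm)))   # -> True
\end{verbatim}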
We also infer the kernel parameters $\vtheta$ via \HMC{}, using the gradient of the log unnormalized posterior with respect to the kernel parameters and placing an improper uniform prior on them.
\subsubsection{Posterior predictive density}
\label{sec:iwmm-predictive-density}
In the \gplvm{}, the predictive density at a test point $\vy_\star$ is usually computed by finding the point $\vx_\star$ which has the highest probability of being mapped to $\vy_\star$, then using the density of $p(\vx_\star)$ and the Jacobian of the warping at that point to approximate the density at $\vy_\star$.
When inference is done this way, approximating the predictive density only requires solving a single optimization for each $\vy_\star$.
For our model, we use approximate integration to estimate $p(\vy_\star)$.
This is done for two reasons:
First, multiple latent points (possibly from different clusters) can map to the same observed point, meaning the standard method can underestimate $p(\vy_\star)$.
Second, because we do not optimize the latent coordinates of training points, but instead sample them, we would need to optimize each $p(\vx_\star)$ separately for each sample in the Markov chain.
One advantage of our method is that it gives estimates for all $p(\vy_\star)$ at once.
However, it may not be as accurate in very high observed dimensions, when the volume to sample over is relatively large.
The posterior density in the observed space given the training data is
%
\begin{align}
p(\vy_\star | \vY)
& = \int \!\!\! \int p(\vy_\star,\vx_\star, \vX | \vY)d\vx_\star d\vX \nonumber\\
& = \int\!\!\! \int p(\vy_\star | \vx_\star, \vX, \vY)p(\vx_\star|\vX,\vY)p(\vX|\vY)d\vx_\star d\vX.
\label{eq:density}
\end{align}
%
We first approximate $p(\vX | \vY)$ using samples from the Gibbs and Hamiltonian Monte Carlo chain.
We then approximate $p(\vx_\star | \vX, \vY)$ by sampling points from the posterior density in the latent space and warping them, using the following procedure:
%
\begin{enumerate}
\item Draw a latent cluster assignment $z_\star \sim {\rm Mult} \left( \frac{N_{1}}{N+\eta}, \frac{N_{2}}{N+\eta}, \cdots,\frac{N_{C}}{N+\eta},\frac{\eta}{N+\eta} \right)$
\item Draw a latent cluster precision matrix $\vR_\star \sim {\cal W}(\vS\inv_{z_\star},\nu_{z_\star})$
\item Draw a latent cluster mean $\vmu_\star \sim {\cal N}(\vu_{z_\star},(r_{z_\star}\vR_\star)\inv)$
\item Draw latent coordinates $\vx_\star \sim {\cal N}(\vmu_\star,\vR_\star\inv)$
\item For each observed dimension $d = 1, 2, \ldots, D$, \\ draw observed coordinates
$y^\star_d \sim {\cal N}(\vk_\star\tra \vK\inv \vY_{:,d}, k(\vx_\star,\vx_\star) - \vk_\star\tra \vK\inv \vk_\star)$
\end{enumerate}
%
If $z_\star$ is assigned to a new component in step 1, the prior Gaussian-Wishart distribution \eqref{eq:normal-wishar-prior} is used for sampling in steps 2 and 3.
The density sampled from in step 5 is the \gp{} predictive distribution, where
%
$\vk_\star=[ k(\vx_\star, \vx_1), k(\vx_\star, \vx_2), \cdots, k(\vx_\star, \vx_N)]\tra$ and $\vY_{:,d}$ denotes the $d$\textsuperscript{th} column of $\vY$.
Each step of this procedure samples from the exact conditional distribution, so the Monte Carlo estimate of the conditional predictive density $p(\vy_\star | \vX, \vY)$ converges to the true density as the number of samples increases.
Because $\vy_\star$ is conditionally Gaussian in step 5, each sample contributes a Gaussian term, rather than a point mass, so the resulting Monte Carlo estimate of the predictive density is smooth.
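The following Python sketch (illustrative; not the reference implementation linked below) mirrors steps 1--5 for a single predictive draw, given one posterior sample of $\vX$ and $\CLAS$ together with the per-component posterior Gaussian-Wishart parameters; it assumes kernel(A, B) returns the covariance matrix between two sets of latent points and K_inv is the inverse kernel matrix over the latent training points:
%
\begin{verbatim}
# Sketch of one draw from the posterior predictive density (steps 1-5 above),
# given a posterior sample of X, z, and per-component Gaussian-Wishart posterior
# parameters post[c] = (u_c, r_c, S_c, nu_c); prior = (u, r, S, nu).
# Illustrative only; assumes latent dimension Q >= 2.
import numpy as np
from scipy.stats import wishart, multivariate_normal as mvn

def sample_y_star(X, Y, z, K_inv, kernel, post, prior, eta, rng):
    counts = np.unique(z, return_counts=True)[1]
    probs = np.append(counts, eta) / (len(z) + eta)
    c = rng.choice(len(probs), p=probs)                               # step 1
    u_c, r_c, S_c, nu_c = post[c] if c < len(counts) else prior
    R = wishart.rvs(df=nu_c, scale=np.linalg.inv(S_c), random_state=rng)       # step 2
    mu = mvn.rvs(mean=u_c, cov=np.linalg.inv(r_c * R), random_state=rng)       # step 3
    x_star = np.atleast_1d(
        mvn.rvs(mean=mu, cov=np.linalg.inv(R), random_state=rng))              # step 4
    k_star = kernel(X, x_star[None, :]).ravel()                       # step 5: GP predictive
    mean = k_star @ K_inv @ Y
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] - k_star @ K_inv @ k_star
    return mean + np.sqrt(max(var, 0.0)) * rng.standard_normal(Y.shape[1])
\end{verbatim}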
\subsubsection{Source code}
A reference implementation of the above algorithms is available at\\ \url{http://www.github.com/duvenaud/warped-mixtures}.
\outbpdocument{
\bibliographystyle{plainnat}
\bibliography{references.bib}
}