\section{Probabilistic modeling}
Let us begin with a useful taxonomy of probabilistic models to provide context for how NNs are useful for inference and vice versa. Define by $\mathbf{y}$ the variables of interest to be predicted, and by $\mathbf{x}$ data that are (hopefully) informative about the outputs. For instance, $\mathbf{x}$ may be a vector of attributes relating to a real estate property, such as postcode, number of bedrooms, and number of bathrooms, and $\mathbf{y}$ the property's market value. Or, $\mathbf{x}$ may be the pixel values of an image containing a single object, and $\mathbf{y}$ a label describing the identity of the object in the image. Define by $\mathbf{z}$ a variable with a similar meaning to $\mathbf{y}$, with the convention that $\mathbf{z}$ is typically a latent (unobserved) variable, whereas $\mathbf{y}$ typically denotes one that is observed. Probabilistic models can be roughly divided into two branches depending on whether the goal is to match a joint or a conditional distribution.
\subsection{Discriminative models}
In the discriminative modeling paradigm, the statistician assumes a conditional distribution, $p_\phi(\mathbf{y}\mid\mathbf{x})$, known as the \emph{discriminative model}, which gives the likelihood of observing the output given the data and the parameters, $\phi$. In this paradigm, both $\mathbf{y}$ and $\mathbf{x}$ are observed, and the model is typically fit by the method of maximum likelihood: given a dataset $\mathcal{D}=\{\mathbf{y}_n,\mathbf{x}_n\}_{n=1}^N$, find
\begin{align}\label{eq:max-likelihood}
\phi^* &= \text{argmax}_{\phi\in\Phi}\mathbb{E}_{(\mathbf{y},\mathbf{x})\sim p_\mathcal{D}}\left[\ln\left(p_\phi\left(\mathbf{y}\mid\mathbf{x}\right)\right)\right],
\end{align}
where $p_\mathcal{D}$ denotes the empirical data distribution. This is equivalent to minimizing the KL-divergence between the empirical distribution and the model,
\begin{align*}
\phi^* &= \text{argmin}_{\phi\in\Phi}\infdiv{p_\mathcal{D}(\mathbf{y}, \mathbf{x})}{p_\phi(\mathbf{y}\mid\mathbf{x})p_\mathcal{D}(\mathbf{x})}.
\end{align*}
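To see the equivalence, note that the joint empirical distribution factorizes as $p_\mathcal{D}(\mathbf{y},\mathbf{x})=p_\mathcal{D}(\mathbf{y}\mid\mathbf{x})p_\mathcal{D}(\mathbf{x})$, so that
\begin{align*}
\infdiv{p_\mathcal{D}(\mathbf{y}, \mathbf{x})}{p_\phi(\mathbf{y}\mid\mathbf{x})p_\mathcal{D}(\mathbf{x})}
&= \mathbb{E}_{(\mathbf{y},\mathbf{x})\sim p_\mathcal{D}}\left[\ln\frac{p_\mathcal{D}(\mathbf{y}\mid\mathbf{x})}{p_\phi(\mathbf{y}\mid\mathbf{x})}\right]\\
&= \mathbb{E}_{(\mathbf{y},\mathbf{x})\sim p_\mathcal{D}}\left[\ln p_\mathcal{D}(\mathbf{y}\mid\mathbf{x})\right] - \mathbb{E}_{(\mathbf{y},\mathbf{x})\sim p_\mathcal{D}}\left[\ln p_\phi(\mathbf{y}\mid\mathbf{x})\right],
\end{align*}
where the first term does not depend on $\phi$; minimizing the divergence over $\phi$ is therefore the same optimization problem as \eqref{eq:max-likelihood}.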
This type of learning is known as \emph{supervised learning}, since both the inputs and outputs are observed. Discriminative models are typically understood through a frequentist statistical lens, viewing the parameters as fixed and the data as random.
Traditional deep learning models \citep{GoodfellowEtAl2016} assume that the parameters $\phi$ of the discriminative distribution are the output of a neural network (NN) that takes the features as input, $\phi=f_\theta(\mathbf{x})$, with its own parameters $\theta$ learnt by stochastic gradient ascent on \eqref{eq:max-likelihood}.
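As a minimal illustrative sketch of this setup (not taken from any later chapter), the following Python snippet fits such a model by stochastic gradient ascent on the average log-likelihood; for brevity the NN $f_\theta$ is reduced to a single linear layer producing the class logits $\phi$, and the data, sizes, and step size are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled data: N points with D features and K classes.  The data,
# sizes, and step size are placeholders chosen only for illustration.
N, D, K = 500, 2, 3
X = rng.normal(size=(N, D))
Y = rng.integers(K, size=N)

# For brevity, f_theta is a single linear layer whose outputs are the
# logits phi of the categorical likelihood p_phi(y | x).
W, b = 0.01 * rng.normal(size=(D, K)), np.zeros(K)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr, batch = 0.1, 50
for step in range(2000):
    idx = rng.choice(N, size=batch, replace=False)   # sample a minibatch
    x, y = X[idx], Y[idx]
    probs = softmax(x @ W + b)                       # p_phi(. | x) over classes
    # Gradient of the mean log-likelihood w.r.t. the logits is
    # (one-hot(y) - probs) / batch; push it through the linear layer.
    g = (np.eye(K)[y] - probs) / batch
    W += lr * (x.T @ g)                              # gradient *ascent*
    b += lr * g.sum(axis=0)

print("avg log-lik:", np.log(softmax(X @ W + b)[np.arange(N), Y]).mean())
\end{verbatim}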
\subsection{Generative models}
In the generative modeling paradigm, on the other hand, the statistician assumes a joint distribution, $p_\phi(\mathbf{x})$ or $p_\phi(\mathbf{x},\mathbf{z})$, known as the \emph{generative model}, that directly models the observed data, $\mathbf{x}$, and, optionally, features $\mathbf{z}$.
Generative models of the form $p_\phi(\mathbf{x})$ are often referred to as autoregressive generative models, since the distribution is commonly represented by the chain rule as,
\begin{align*}
p_\phi(\mathbf{x}) &= \prod^N_{n=1}p_\phi(x_n\mid\mathbf{x}_{\prec n}),
\end{align*}
where $\mathbf{x}_{\prec n}\triangleq\{x_1,\ldots,x_{n-1}\}$, although we note that any factorization suffices. For this reason, we will refer to them as \emph{fully-observed generative models}. A representative example is PixelRNN \citep{VanDenOordEtAl2016a}, which represents a joint distribution over the pixels of an image, parametrizing the $\phi$ of each factor by the output of a cleverly constructed RNN.
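The following sketch illustrates the chain-rule factorization with a toy linear-logistic autoregressive model over a handful of binary variables, standing in for the RNN-parametrized conditionals of PixelRNN; the (untrained) weights are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-observed autoregressive model over D binary variables: each
# conditional p(x_d | x_{<d}) is a Bernoulli whose logit depends linearly
# on the preceding variables.  The untrained weights are placeholders
# standing in for, e.g., the RNN-computed parameters of PixelRNN.
D = 8
W = 0.5 * rng.normal(size=(D, D)) * np.tri(D, k=-1)  # row d only sees x_{<d}
b = np.zeros(D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_prob(x):
    """log p(x) = sum_d log p(x_d | x_{<d}), i.e. the chain rule."""
    p = sigmoid(W @ x + b)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log1p(-p))

def sample():
    """Ancestral sampling: draw x_1, then x_2 given x_1, and so on."""
    x = np.zeros(D)
    for d in range(D):
        x[d] = rng.random() < sigmoid(W[d] @ x + b[d])
    return x

x = sample()
print(x, log_prob(x))
\end{verbatim}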
For fully-observed generative models, the objective is typically to learn $p_\phi(\mathbf{x})$ from the data by the method of maximum likelihood, which we again can interpret as minimizing a KL-divergence. This is a form of supervised learning.
Generative models of the form $p_\phi(\mathbf{x},\mathbf{z})$ are referred to as \emph{latent variable generative models}, and it is assumed that the latent variables, $\mathbf{z}$, in some way capture the dynamics of the process giving rise to $\mathbf{x}$. Latent variable generative models include both traditional statistical models like latent Dirichlet allocation \citep{blei2003latent} and discrete Bayesian networks \citep{KollerFriedman2009}, as well as modern ``deep'' generative models like the variational autoencoder \citep{KingmaWelling2013} and attend-infer-repeat \citep{eslami2016attend}.
Latent variable generative models are most conveniently understood through a Bayesian statistical lens, where the data $\mathbf{x}$ are fixed and the statistical parameters, $\mathbf{z}$, are random and must be inferred. The objective is typically to approximate the posterior, $p(\mathbf{z}\mid\mathbf{x})$, either through approximate samples or through a parametric approximation, $q_\psi(\mathbf{z})$ or $q_\psi(\mathbf{z}\mid\mathbf{x})$. The model can also be learnt by (indirectly) maximizing the marginal log-likelihood, $\ln p_\phi(\mathbf{x})$, when amortized VI is used. Typically $\mathbf{x}$ is observed and $\mathbf{z}$ latent, or unobserved, leading to \emph{unsupervised learning}, although it is also possible to make $\mathbf{z}$ partially observed for \emph{semi-supervised learning} \citep{KingmaEtAl2014, SiddharthEtAl2017}, or fully observed for \emph{supervised learning}.
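Concretely, writing the joint in the common prior-likelihood factorization $p_\phi(\mathbf{x},\mathbf{z})=p_\phi(\mathbf{x}\mid\mathbf{z})p(\mathbf{z})$, the two central quantities are
\begin{align*}
p_\phi(\mathbf{x}) &= \int p_\phi(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})\,\mathrm{d}\mathbf{z}, & p(\mathbf{z}\mid\mathbf{x}) &= \frac{p_\phi(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})}{p_\phi(\mathbf{x})},
\end{align*}
and both are generally intractable when the conditionals are flexible (e.g., parametrized by NNs), which is what motivates the approximations $q_\psi(\mathbf{z})$ and $q_\psi(\mathbf{z}\mid\mathbf{x})$ mentioned above.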
Latent variable models that condition on an observed context, $\mathbf{y}$, with conditional distribution $p(\mathbf{x},\mathbf{z}\mid\mathbf{y})$ are also considered to be generative models, provided that part of $\mathbf{z}$ is latent, and the distribution is defined in a way that $\mathbf{z}$ gives rise to $\mathbf{x}$. We refer to these models as \emph{conditional generative models}.
\subsection{Implicit models}\label{sec:gans}
We further divide generative models into those that are ``explicit'' (or ``prescribed'') versus those that are ``implicit'' \citep{mohamed2016learning}. The models of the previous section are said to be \emph{explicit}, since we directly model an exact functional form for the likelihood, or scoring process, $p_\phi(\mathbf{x})$. In an \emph{implicit} generative model, on the other hand, we directly model the sampling process for $\mathbf{x}$, and this only indirectly defines $p_\theta(\mathbf{x})$ (which is typically intractable).
One well-known implicit generative model is the generative adversarial network (GAN) \citep{GoodfellowEtAl2014}. In its simplest incarnation, it models the sampling process of an image datum, $\mathbf{x}$, as follows: first sample $\mathbf{z}\sim N(\mathbf{0},I_{D\times D})$ from a standard multivariate normal, for some $D\ll N$, then calculate $\mathbf{x}=f_\theta(\mathbf{z})$, where $f_\theta$ is a shallow, densely-connected feedforward NN.
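The sampling process just described is simple enough to write out directly; the sketch below uses a single ReLU hidden layer with sigmoid outputs and untrained placeholder weights, with all sizes chosen arbitrarily for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: a D-dimensional latent code mapped to an
# N_x-dimensional "image" by a small feedforward generator f_theta
# with untrained weights.
D, H, N_x = 16, 64, 784
theta = {
    "W1": 0.1 * rng.normal(size=(D, H)),   "b1": np.zeros(H),
    "W2": 0.1 * rng.normal(size=(H, N_x)), "b2": np.zeros(N_x),
}

def f_theta(z):
    """One ReLU hidden layer; sigmoid outputs keep pixels in [0, 1]."""
    h = np.maximum(0.0, z @ theta["W1"] + theta["b1"])
    return 1.0 / (1.0 + np.exp(-(h @ theta["W2"] + theta["b2"])))

# The implicit model prescribes only this sampling process; the density
# p_theta(x) it induces is never written down.
z = rng.normal(size=D)   # z ~ N(0, I_DxD)
x = f_theta(z)           # a draw from p_theta(x)
print(x.shape)
\end{verbatim}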
The learning objective of implicit generative modeling is to match the assumed sampling process of the model to that of the empirical data distribution. In the simplest case, this can be interpreted as minimizing the Jensen--Shannon divergence between the data distribution and the model \citep[see Theorem 1 of][]{GoodfellowEtAl2014},
\begin{align}\label{eq:gan-objective}
\theta^* &= \text{argmin}_{\theta\in\Theta}\text{JSD}\infdivx{p_\mathcal{D}(\mathbf{x})}{p_\theta(\mathbf{x})},
\end{align}
where
\begin{align*}
\text{JSD}\infdivx{p}{q} &\triangleq \frac{1}{2}\text{KL}\infdivx[\bigg]{p}{\frac{p+q}{2}} + \frac{1}{2}\text{KL}\infdivx[\bigg]{q}{\frac{p+q}{2}}.
\end{align*}
It is not possible to directly optimize \eqref{eq:gan-objective}, as it depends on the implicit and intractable density, $p_\theta(\mathbf{x})$. Instead, a discriminator function is introduced whose purpose is to estimate the intractable likelihood ratio terms required by \eqref{eq:gan-objective}.
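Concretely, writing $D_\omega(\mathbf{x})$ for a discriminator (with parameters $\omega$, a symbol we introduce here) that estimates the probability that $\mathbf{x}$ came from the data rather than the model, the original GAN objective is the two-player minimax problem \citep{GoodfellowEtAl2014},
\begin{align*}
\min_{\theta}\max_{\omega}\;\mathbb{E}_{\mathbf{x}\sim p_\mathcal{D}}\left[\ln D_\omega(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}\sim N(\mathbf{0},I)}\left[\ln\left(1-D_\omega\left(f_\theta(\mathbf{z})\right)\right)\right],
\end{align*}
whose inner maximum recovers the Jensen--Shannon divergence of \eqref{eq:gan-objective} up to additive and multiplicative constants.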
Implicit models may be necessary for situations where it is only feasible to simulate the sampling process of the data, such as in stochastic simulators for scientific modeling \citep{van2018introduction}. Nonetheless, there is theoretical and empirical evidence that implicit models such as GANs learn distributions with low-dimensional support \citep{arora2017generalization, arora2017gans}, and thus presumably are limited in their capability to generalize beyond the observed data. We do not consider implicit models further in this thesis.
Let us make clear the distinction between model, objective, algorithm, and framework, as we define them, which is also relevant to our later discussion of variational autoencoders (VAEs). We define the GAN model as, e.g., the sampling process described above. Alternatives exist, such as using a convolutional architecture \citep{radford2015unsupervised} or an invertible neural network \citep{grover2018flow}. By the objective, we mean the divergence metric or other measure used to match the sampling process to the data distribution, such as the Jensen--Shannon divergence described above. Again, there are other possibilities that can be used independently of the model, such as the Wasserstein distance \citep{ArjovskyEtAl2017}. By the algorithm, we mean the optimization problem and the method used to optimize the objective. For instance, for GANs this typically involves SGD with adaptive stepsizes \citep{KingmaBa2014} on two separate optimization problems involving a discriminator function that determines how well the sampling process is able to match the data distribution. By the framework, we mean the combination of these three elements. When we simply refer to a GAN or VAE, we will take that to mean the GAN or VAE model, and not the whole framework. It is common in the literature to conflate the term GAN with the complete framework. However, we believe it is important to at least distinguish the model from the learning or inferential procedure---many models can be used with many separate learning procedures, both of which are developed in parallel in the literature.
\subsection{Discussion}
Discriminative models typically have lower asymptotic error than generative models for regression and classification tasks, presumably because they do not require any assumptions about the structure of $p(\mathbf{x})$. Indeed, discriminative models parametrized by flexible, deep NN function approximators and learnt from big data have produced state-of-the-art results in classification tasks \citep{KrizhevskyEtAl2012, HintonEtAl2012}. Generative models, however, often approach their (higher) asymptotic error faster \citep{ng2002discriminative} and may be preferred in low-data regimes.
Moreover, generative models can solve a number of AI tasks requiring a deeper level of ``intelligence'' than the regression and classification afforded by discriminative models. Most obviously, generative models can sample new data points. This is more than a curiosity and has been used for practical applications such as image super-resolution \citep{ledig2017photo}. Latent variable generative models are particularly useful and can be applied to tasks beyond simply associating inputs with outputs \citep{mohamed2017tutorial}, such as:
\begin{itemize}
\item recognizing the identity and location of objects \citep{eslami2016attend}
\item predicting future states of the world \citep{kosiorek2018sequential}
\item disentangling the factors of variation giving rise to our observations \citep{SiddharthEtAl2017}
\item forming concepts in an unsupervised manner that are useful for reasoning and decision making \citep{lake2015human}
%\item recognition of anomalies and outliers \hl{?}
\item generating plans for the future \citep{igl2018deep}
\item causal modeling and discovery \citep{louizos2017causal}
\end{itemize}
%These applications are enabled by the inclusion of latent variables.
When the conditional distributions comprising a generative model are parametrized by NNs, the models are termed \emph{deep generative models}. This poses a challenge for model learning, as we cannot analytically marginalize out the latent variables to produce the marginal, $p_\phi(\mathbf{x})$. In the following section we will see how model learning can be performed using the amortized VI paradigm. Amortized VI also makes use of NNs for inference.
%The presence of latent variables necessitates inference. To ...
%Deep generative models have already found commercial application in text-to-speech synthesis \hl{cite!}, and scientific applications in predicting chemical reactions \hl{cite!} and modeling physics \hl{cite!}.
Discriminative models, as described above, do not provide any mechanism for expressing our uncertainty about the predictions. However, we can do so by reinterpreting the model weights as random variables. Suppose we have a discriminative classifier, $p(\mathbf{y}\mid\mathbf{x};\phi)$. By placing a prior on the parameters, $p(\phi)$, we transform the discriminative model into one that is conditionally generative, $p(\mathbf{y},\phi\mid\mathbf{x})=p(\mathbf{y}\mid\mathbf{x},\phi)p(\phi)$. Instead of learning the parameters by, e.g., stochastic gradient descent, the problem becomes one of inference. As will be explained, one can draw approximate samples, $\{\phi_m\}_{m=1}^M$, from the posterior, $p(\phi\mid\mathbf{x},\mathbf{y})$, by an MCMC method, and use those samples to produce $M$ different distributions, $p(\mathbf{y}\mid\mathbf{x},\phi_m)$, characterizing our uncertainty over the prediction distribution. Alternatively, by classical VI or EP, one can learn an approximation, $q_\psi(\phi)$, to the posterior, and use it to similarly characterize the uncertainty over prediction distributions. When the model is parametrized by a NN, this bridge from discriminative to generative models is known as a \emph{Bayesian NN}, and will be used in Ch 4 for distributed learning of NNs.
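One common way to combine the $M$ posterior samples into a single prediction, left implicit above, is Monte Carlo averaging of the posterior predictive distribution: writing $(\mathbf{x}^*,\mathbf{y}^*)$ for a test point (notation we introduce only for this display),
\begin{align*}
p(\mathbf{y}^*\mid\mathbf{x}^*,\mathbf{x},\mathbf{y}) &= \int p(\mathbf{y}^*\mid\mathbf{x}^*,\phi)\,p(\phi\mid\mathbf{x},\mathbf{y})\,\mathrm{d}\phi \approx \frac{1}{M}\sum_{m=1}^{M}p(\mathbf{y}^*\mid\mathbf{x}^*,\phi_m),
\end{align*}
with $\phi_m\sim p(\phi\mid\mathbf{x},\mathbf{y})$; the spread of the individual $p(\mathbf{y}^*\mid\mathbf{x}^*,\phi_m)$ about this average is one way to quantify the predictive uncertainty, and a variational or EP approximation, $q_\psi(\phi)$, can simply take the place of the posterior in the integral.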
In order to exploit the low asymptotic error of discriminative models, we may wish to train a model on more data than fits on a single machine. In these situations, we require learning systems that can distribute and coordinate the work of learning across several machines. In one such setup, the model is replicated across $W$ worker nodes, each of which receives a (possibly overlapping) portion of the data, coordinated by a master node. In the asynchronous SGD (A-SGD) algorithm \citep{DeanEtAl2012}, each worker node requests the most recent parameters from the master and, after receiving them, performs a gradient update on a minibatch from its portion of the data, sending the resulting parameter update back to the master node. While a worker is computing the gradient with respect to the current minibatch, other workers have potentially interacted with the master to update the current parameters, and thus the worker in question is working with out-of-date information. On average, each worker's update will be $W-1$ steps behind. This is known as the stale-gradient problem. Again, this is a problem that can be addressed by Bayesian inference. In Ch 4, we develop a distributed Bayesian learning framework that lessens the severity of the stale-gradient problem. At a high level, it does this by having the workers send distributions over the parameters to the master, rather than simple gradient updates.
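To make the staleness pattern concrete, the following toy, serial simulation has each worker push a gradient computed at the parameters it pulled one round earlier, so every applied gradient is roughly $W-1$ updates old; the round-robin schedule, least-squares problem, and constants are illustrative placeholders, not the actual A-SGD implementation \citep{DeanEtAl2012}.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy, serial simulation of asynchronous SGD with W workers on a 1-D
# least-squares problem.  The round-robin schedule, data, and constants
# are illustrative placeholders, not the actual A-SGD implementation.
W, lr, steps = 4, 0.05, 400
shards = rng.normal(loc=3.0, scale=1.0, size=(W, 100))  # each worker's data
theta = 0.0                                             # parameter on the master
pulled = np.zeros(W)                                    # each worker's stale copy

def gradient(theta_local, shard):
    x = rng.choice(shard, size=10)                      # worker minibatch
    return np.mean(theta_local - x)                     # grad of 0.5*(theta - x)^2

for t in range(steps):
    w = t % W                      # the worker pushing an update this step
    # The gradient is evaluated at parameters the worker pulled on its previous
    # turn, i.e. after the other W-1 workers have already updated the master:
    # a stale gradient.
    theta -= lr * gradient(pulled[w], shards[w])
    pulled[w] = theta              # worker pulls fresh parameters for next time

print("estimate:", theta, "(data mean ~ 3.0)")
\end{verbatim}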
Thus, NNs are important components of both discriminative and generative models. They are powerful function approximators that are able to learn hierarchical, distributed, and often sparse representations of their inputs \citep{Bengio2009}. Deep NN discriminative models are the state-of-the-art in many classification and regression tasks. NNs are useful not just for modeling, but also for inference, as we will describe in the next two sections. Latent variable generative models require inference, and a scalable, generic inference framework known as amortized VI makes use of NNs for this purpose. We will also describe how inference can be useful for analyzing discriminative NN models.