class: middle, center, title-slide
Lecture 2: Neural networks
Prof. Gilles Louppe
[email protected]
???
R: regenerate all svg files from draw.io
R: backprop -> check https://mathematical-tours.github.io/book-sources/optim-ml/OptimML.pdf
Explain and motivate the basic constructs of neural networks.
- From linear discriminant analysis to logistic regression
- Stochastic gradient descent
- From logistic regression to the multi-layer perceptron
- Vanishing gradients and rectified networks
- Universal approximation theorem (teaser)
class: middle
The Threshold Logic Unit (McCulloch and Pitts, 1943) was the first mathematical model for a neuron.
Assuming Boolean inputs and outputs, it is defined as
$$f(\mathbf{x}) = 1_{\left\{\sum_i w_i x_i + b \geq 0\right\}}.$$
This unit can implement:
$\text{or}(a,b) = 1_{\{a+b - 0.5 \geq 0\}}$ $\text{and}(a,b) = 1_{\{a+b - 1.5 \geq 0\}}$ $\text{not}(a) = 1_{\{-a + 0.5 \geq 0\}}$
Therefore, any Boolean function can be built with such units.
class: middle
.footnote[Credits: McCulloch and Pitts, A logical calculus of ideas immanent in nervous activity, 1943.]
The perceptron (Rosenblatt, 1957) is very similar, except that the inputs are real:
$$f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\ 0 &\text{otherwise.} \end{cases}$$
This model was originally motivated by biology, with the weights $w_i$ playing the role of synaptic strengths and the inputs $x_i$ of firing rates.
class: middle
.footnote[Credits: Frank Rosenblatt, Mark I Perceptron operators' manual, 1960.]
???
A perceptron is a signal transmission network consisting of sensory units (S units), association units (A units), and output or response units (R units). The ‘retina’ of the perceptron is an array of sensory elements (photocells). An S-unit produces a binary output depending on whether or not it is excited. A randomly selected set of retinal cells is connected to the next level of the network, the A units. As originally proposed there were extensive connections among the A units, the R units, and feedback between the R units and the A units.
In essence an association unit is also an MCP neuron which is 1 if a single specific pattern of inputs is received, and it is 0 for all other possible patterns of inputs. Each association unit will have a certain number of inputs which are selected from all the inputs to the perceptron. So the number of inputs to a particular association unit does not have to be the same as the total number of inputs to the perceptron, but clearly the number of inputs to an association unit must be less than or equal to the total number of inputs to the perceptron. Each association unit's output then becomes the input to a single MCP neuron, and the output from this single MCP neuron is the output of the perceptron. So a perceptron consists of a "layer" of MCP neurons, and all of these neurons send their output to a single MCP neuron.
class: middle, center, black-slide
.grid[
.kol-1-2[.width-100[]]
.kol-1-2[
.width-100[]]
]
The Mark I Perceptron (Frank Rosenblatt).
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/cNxadbrN_aI" frameborder="0" allowfullscreen></iframe>

The Perceptron
class: middle
Let us define the (non-linear) activation function:
$$\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\ 0 &\text{otherwise} \end{cases}$$
The perceptron classification rule can be rewritten as
$$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b).$$
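As a minimal numpy sketch of this rule (the function names and toy weights are ours, not from the lecture):

```python
import numpy as np

def sign(x):
    """Activation defined above: 1 if x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1, 0)

def perceptron(x, w, b):
    """Perceptron classification rule f(x) = sign(w^T x + b)."""
    return sign(np.dot(w, x) + b)

# Example: the OR gate from the threshold logic unit slide.
w, b = np.array([1.0, 1.0]), -0.5
print(perceptron(np.array([0.0, 1.0]), w, b))  # 1
```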
class: middle
.grid[
.kol-3-5[.width-90[]]
.kol-2-5[
The computation of $f(\mathbf{x})$ can be represented as a computational graph, where
- white nodes correspond to inputs and outputs;
- red nodes correspond to model parameters;
- blue nodes correspond to intermediate operations. ] ]
???
Draw the NN diagram.
class: middle
In terms of tensor operations, the same computation is expressed as the composition of a dot product, an addition, and the sign function, i.e., $f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$.
Consider training data $(\mathbf{x}, y)$, with
- $\mathbf{x} \in \mathbb{R}^p$,
- $y \in \{0,1\}$.
Assume the class populations are Gaussian with the same covariance matrix $\Sigma$ (homoscedasticity):
$$P(\mathbf{x}|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_y)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_y)\right)$$
Using Bayes' rule, we have:
$$P(Y=1|\mathbf{x}) = \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x}|Y=0) P(Y=0) + P(\mathbf{x}|Y=1) P(Y=1)}.$$
--
count: false
It follows that with
$$a = \log \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x}|Y=0) P(Y=0)},$$
we get
$$P(Y=1|\mathbf{x}) = \frac{1}{1 + \exp(-a)} = \sigma(a).$$
class: middle
Therefore, under the Gaussian and homoscedasticity assumptions, the posterior reduces to
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
for some $\mathbf{w}$ and $b$ that are functions of $\boldsymbol{\mu}_0$, $\boldsymbol{\mu}_1$ and $\Sigma$.
class: middle, center
class: middle
Note that the sigmoid function
$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
acts as a soft, differentiable version of the sign function.
Therefore, the overall model
$$f(\mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
is very similar to the perceptron, but with a smooth non-linearity.
class: middle, center
This unit is the main primitive of all neural networks!
Same model
$$f(\mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
as for linear discriminant analysis.
But,
- ignore model assumptions (Gaussian class populations, homoscedasticity);
- instead, find $\mathbf{w}, b$ that maximize the likelihood of the data.
class: middle
We have,
$$\begin{aligned}
&\arg \max_{\mathbf{w},b} P(\mathbf{d}|\mathbf{w},b) \\
&= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} P(Y=y_i|\mathbf{x}_i, \mathbf{w},b) \\
&= \arg \min_{\mathbf{w},b} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} -y_i \log \hat{y}(\mathbf{x}_i) - (1-y_i) \log(1 - \hat{y}(\mathbf{x}_i)),
\end{aligned}$$
where $\hat{y}(\mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$.
This loss is an instance of the cross-entropy
$$H(p, q) = \mathbb{E}_p[-\log q]$$
for $p = Y|\mathbf{x}_i$ and $q = \hat{Y}|\mathbf{x}_i$.
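As a minimal numpy sketch of this objective (the helper names and toy data are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy -[y log(y_hat) + (1 - y) log(1 - y_hat)],
    averaged over the dataset; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Logistic regression predictions on a toy batch.
X = np.array([[0.5, 1.0], [-1.0, 0.2]])
w, b = np.array([1.0, -0.5]), 0.1
y_hat = sigmoid(X @ w + b)
print(cross_entropy(np.array([1.0, 0.0]), y_hat))
```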
class: middle
When $y \in \{-1, 1\}$, the same objective can be rewritten in terms of the logistic loss
$$\ell(y, \mathbf{x}) = \log\left(1 + \exp\left(-y (\mathbf{w}^T \mathbf{x} + b)\right)\right).$$
class: middle
- In general, the cross-entropy and the logistic losses do not admit a minimizer that can be expressed analytically in closed form.
- However, a minimizer can be found numerically, using a general minimization technique such as gradient descent.
Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $\mathbf{w}$ and $b$).
To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimizer.
For $\theta_0 \in \mathbb{R}^d$, a first-order approximation around $\theta_0$ can be defined as
$$\hat{\mathcal{L}}(\epsilon; \theta_0) = \mathcal{L}(\theta_0) + \epsilon^T \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma} ||\epsilon||_2^2.$$
class: middle
A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta_0)$ is given for
$$\begin{aligned}
0 &= \nabla_\epsilon \hat{\mathcal{L}}(\epsilon; \theta_0) \\
&= \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma} \epsilon,
\end{aligned}$$
which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.
Therefore, model parameters can be updated iteratively using the update rule
$$\theta_{t+1} = \theta_t - \gamma \nabla_\theta \mathcal{L}(\theta_t),$$
where
- $\theta_0$ are the initial parameters of the model;
- $\gamma$ is the learning rate;
- both are critical for the convergence of the update rule.
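As a minimal sketch of this procedure (assuming a hypothetical helper `grad_fn` that returns $\nabla_\theta \mathcal{L}(\theta)$; the toy loss is ours):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, gamma=0.1, n_steps=100):
    """Iterate theta_{t+1} = theta_t - gamma * grad L(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - gamma * grad_fn(theta)
    return theta

# Example: minimize L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
print(gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0))  # ~3.0
```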
class: center, middle

Example 1: Convergence to a local minimum
class: center, middle

Example 2: Convergence to the global minimum
class: center, middle

Example 3: Divergence due to a too-large learning rate
In the empirical risk minimization setup, the loss and its gradient decompose as
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i; \theta)), \qquad \nabla \mathcal{L}(\theta) = \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta)).$$
Therefore, in batch gradient descent the complexity of one update grows linearly with the size $N$ of the dataset.
class: middle
Since the empirical risk is already an approximation of the expected risk, it should not be necessary to carry out the minimization with great accuracy.
Instead, stochastic gradient descent uses as update rule:
$$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))$$
- Iteration complexity is independent of $N$.
- The stochastic process $\{ \theta_t | t=1, ... \}$ depends on the examples $i(t)$ picked randomly at each iteration.
--
.grid.center.italic[ .kol-1-2[.width-100[]
Batch gradient descent] .kol-1-2[.width-100[]
Stochastic gradient descent ] ]
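As a minimal sketch of the stochastic update (assuming a hypothetical per-example gradient function `grad_fn(theta, x_i, y_i)`):

```python
import numpy as np

def sgd(grad_fn, theta0, X, y, gamma=0.01, n_steps=1000, seed=0):
    """Stochastic gradient descent: each update follows the gradient of
    the loss evaluated on a single randomly picked example."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(len(X))  # i(t+1), picked uniformly at random
        theta = theta - gamma * grad_fn(theta, X[i], y[i])
    return theta
```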
class: middle
Why is stochastic gradient descent still a good idea?
- Informally, averaging the update
$$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))$$
over all choices $i(t+1)$ restores batch gradient descent.
- Formally, if the gradient estimate is unbiased, i.e., if
$$\begin{aligned} \mathbb{E}_{i(t+1)}[\nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))] &= \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta_t)) \\ &= \nabla \mathcal{L}(\theta_t), \end{aligned}$$
then the formal convergence of SGD can be proved, under appropriate assumptions (see references).
- If training is limited to a single pass over the data, then SGD directly minimizes the expected risk.
class: middle
The excess error characterizes the expected risk discrepancy between the Bayes model and the approximate empirical risk minimizer. It can be decomposed as
$$\begin{aligned}
&\mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_B) \right] \\
&= \mathbb{E}\left[ R(f_*) - R(f_B) \right] + \mathbb{E}\left[ R(f_*^\mathbf{d}) - R(f_*) \right] + \mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_*^\mathbf{d}) \right] \\
&= \mathcal{E}_\text{app} + \mathcal{E}_\text{est} + \mathcal{E}_\text{opt}
\end{aligned}$$
where
- $\mathcal{E}_\text{app}$ is the approximation error due to the choice of a hypothesis space,
- $\mathcal{E}_\text{est}$ is the estimation error due to the empirical risk minimization principle,
- $\mathcal{E}_\text{opt}$ is the optimization error due to the approximate optimization algorithm.
class: middle
A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.
So far we considered the logistic unit $h = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$, where $h \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{w} \in \mathbb{R}^p$ and $b \in \mathbb{R}$.
These units can be composed in parallel to form a layer with $q$ outputs:
$$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})$$
where $\mathbf{W} \in \mathbb{R}^{p \times q}$, $\mathbf{b} \in \mathbb{R}^q$, and where $\sigma(\cdot)$ is applied element-wise.
.center.width-70[![](figures/lec2/graphs/layer.svg)]
???
Draw the NN diagram.
class: middle
Similarly, layers can be composed in series, such that:
$$\begin{aligned}
\mathbf{h}_0 &= \mathbf{x} \\
\mathbf{h}_1 &= \sigma(\mathbf{W}_1^T \mathbf{h}_0 + \mathbf{b}_1) \\
... \\
\mathbf{h}_L &= \sigma(\mathbf{W}_L^T \mathbf{h}_{L-1} + \mathbf{b}_L) \\
f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}_L
\end{aligned}$$
where $\theta$ denotes the model parameters $\{ \mathbf{W}_k, \mathbf{b}_k | k=1, ..., L \}$.
This model is the multi-layer perceptron, also known as the fully connected feedforward network.
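As a minimal numpy sketch of this composition (the helper names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Fully connected feedforward network:
    h_0 = x, then h_k = sigma(W_k^T h_{k-1} + b_k) for k = 1, ..., L."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W.T @ h + b)
    return h
```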
???
Draw the NN diagram.
class: middle, center
class: middle
.footnote[Credits: PyTorch Deep Learning Minicourse, Alfredo Canziani, 2020.]
class: middle
- For binary classification, the width $q$ of the last layer $L$ is set to $1$, which results in a single output $h_L \in [0,1]$ that models the probability $P(Y=1|\mathbf{x})$.
- For multi-class classification, the sigmoid activation $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}_L \in \bigtriangleup^C$ of probability estimates $P(Y=i|\mathbf{x})$. This activation is the $\text{Softmax}$ function, whose $i$-th output is defined as
$$\text{Softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^C \exp(z_j)},$$
for $i=1, ..., C$.
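A direct implementation follows the definition; subtracting $\max(\mathbf{z})$ first (a standard numerical trick, not from the slide) leaves the result unchanged but avoids overflow:

```python
import numpy as np

def softmax(z):
    """Softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1
```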
For regression problems, one usually starts with the assumption that the output is normally distributed around the network prediction, i.e., $Y|\mathbf{x} \sim \mathcal{N}(f(\mathbf{x}; \theta), 1)$.
class: middle
We have,
$$\begin{aligned}
&\arg \max_{\theta} P(\mathbf{d}|\theta) \\
&= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} P(Y=y_i|\mathbf{x}_i, \theta) \\
&= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log P(Y=y_i|\mathbf{x}_i, \theta) \\
&= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log\left( \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y_i - f(\mathbf{x}_i;\theta))^2\right) \right)\\
&= \arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} (y_i - f(\mathbf{x}_i;\theta))^2,
\end{aligned}$$
which recovers the common squared error loss $\ell(y, \hat{y}) = (y - \hat{y})^2$.
To minimize $\mathcal{L}(\theta)$ with stochastic gradient descent, we need the gradient $\nabla_\theta \ell(y_i, f(\mathbf{x}_i; \theta))$ of the per-example loss.
Therefore, we require the evaluation of the (total) derivatives
$$\frac{\text{d} \ell}{\text{d} \mathbf{W}_k}, \quad \frac{\text{d} \ell}{\text{d} \mathbf{b}_k}$$
of the loss $\ell$ with respect to all model parameters $\mathbf{W}_k$, $\mathbf{b}_k$, for $k=1, ..., L$.
These derivatives can be evaluated automatically from the computational graph of $\ell$ using automatic differentiation.
class: middle
Let us consider a 1-dimensional output composition $f \circ g$, such that
$$\begin{aligned}
y &= f(\mathbf{u}) \\
\mathbf{u} &= g(x) = (g_1(x), ..., g_m(x)).
\end{aligned}$$
class: middle
For a single intermediate variable ($m=1$), the chain rule states that
$$\frac{\text{d} y}{\text{d} x} = \frac{\text{d} y}{\text{d} u} \frac{\text{d} u}{\text{d} x}.$$
For the total derivative, the chain rule generalizes to $$ \begin{aligned} \frac{\text{d} y}{\text{d} x} &= \sum_{k=1}^m \frac{\partial y}{\partial u_k} \underbrace{\frac{\text{d} u_k}{\text{d} x}}_{\text{recursive case}} \end{aligned}$$
class: middle
- Since a neural network is a composition of differentiable functions, the total derivatives of the loss can be evaluated backward, by applying the chain rule recursively over its computational graph.
- The implementation of this procedure is called reverse automatic differentiation.
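For instance, reverse-mode automatic differentiation is what PyTorch's autograd implements (a minimal sketch; the function and values are ours):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sigmoid(3.0 * x)  # forward pass builds the computational graph
y.backward()                # backward pass applies the chain rule
print(x.grad)               # dy/dx = 3 * sigmoid(3x) * (1 - sigmoid(3x))
```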
class: middle
Let us consider a simplified 2-layer MLP and the following loss function:
$$\begin{aligned}
f(\mathbf{x}; \mathbf{W}_1, \mathbf{W}_2) &= \sigma\left( \mathbf{W}_2^T \sigma\left( \mathbf{W}_1^T \mathbf{x} \right)\right) \\
\ell(y, \hat{y}; \mathbf{W}_1, \mathbf{W}_2) &= \text{cross\_ent}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||_2 + ||\mathbf{W}_2||_2 \right)
\end{aligned}$$
for some fixed regularization coefficient $\lambda \in \mathbb{R}^+$.
class: middle
In the forward pass, intermediate values are all computed from inputs to outputs, which results in the annotated computational graph below:
class: middle
The total derivative can be computed through a backward pass, by walking through all paths from outputs to parameters in the computational graph and accumulating the terms. For example, for $\frac{\text{d} \ell}{\text{d} \mathbf{W}_1}$, every path from $\ell$ down to $\mathbf{W}_1$ contributes one product of partial derivatives to the sum.
class: middle
Let us zoom in on the computation of the network output $\hat{y}$ and of its derivative with respect to $\mathbf{W}_1$.
- Forward pass: values $u_1$, $u_2$, $u_3$ and $\hat{y}$ are computed by traversing the graph from inputs to outputs given $\mathbf{x}$, $\mathbf{W}_1$ and $\mathbf{W}_2$.
- Backward pass: by the chain rule we have
$$\begin{aligned}
\frac{\text{d} \hat{y}}{\text{d} \mathbf{W}_1} &= \frac{\partial \hat{y}}{\partial u_3} \frac{\partial u_3}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial \mathbf{W}_1} \\
&= \frac{\partial \sigma(u_3)}{\partial u_3} \frac{\partial \mathbf{W}_2^T u_2}{\partial u_2} \frac{\partial \sigma(u_1)}{\partial u_1} \frac{\partial \mathbf{W}_1^T \mathbf{x}}{\partial \mathbf{W}_1}
\end{aligned}$$
Note how evaluating the partial derivatives requires the intermediate values computed forward.
class: middle
- This algorithm is also known as backpropagation.
- An equivalent procedure can be defined to evaluate the derivatives in forward mode, from inputs to outputs.
- Since differentiation is a linear operator, automatic differentiation can be implemented efficiently in terms of tensor operations.
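To make the two passes concrete, here is a scalar sketch of the zoomed-in computation above (scalar weights stand in for the matrices $\mathbf{W}_1$, $\mathbf{W}_2$, and the input values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, w1, w2 = 1.5, 0.8, -0.4

# Forward pass: compute and store the intermediate values.
u1 = w1 * x
u2 = sigmoid(u1)
u3 = w2 * u2
y_hat = sigmoid(u3)

# Backward pass: accumulate the chain-rule factors from output to w1,
# reusing the values stored during the forward pass.
d_yhat_du3 = y_hat * (1.0 - y_hat)   # sigma'(u3)
d_u3_du2 = w2
d_u2_du1 = u2 * (1.0 - u2)           # sigma'(u1)
d_u1_dw1 = x
print(d_yhat_du3 * d_u3_du2 * d_u2_du1 * d_u1_dw1)  # d y_hat / d w1
```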
Training deep MLPs with many layers had long (pre-2011) been difficult due to the vanishing gradient problem.
- Small gradients slow down, and eventually block, stochastic gradient descent.
- This results in a limited capacity to learn.
.width-100[]
.caption[Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ]
class: middle
Let us consider a simplified 3-layer MLP, with $x, w_1, w_2, w_3 \in \mathbb{R}$, such that
$$f(x; w_1, w_2, w_3) = \sigma\left(w_3 \sigma\left(w_2 \sigma\left(w_1 x\right)\right)\right).$$
Under the hood, this would be evaluated as
$$\begin{aligned}
u_1 &= w_1 x \\
u_2 &= \sigma(u_1) \\
u_3 &= w_2 u_2 \\
u_4 &= \sigma(u_3) \\
u_5 &= w_3 u_4 \\
\hat{y} &= \sigma(u_5)
\end{aligned}$$
and its derivative as
$$\frac{\text{d}\hat{y}}{\text{d}w_1} = \frac{\partial \hat{y}}{\partial u_5} \frac{\partial u_5}{\partial u_4} \frac{\partial u_4}{\partial u_3} \frac{\partial u_3}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial w_1}.$$
class: middle
The derivative of the sigmoid activation function $\sigma$ is
$$\frac{\text{d}\sigma}{\text{d}x}(x) = \sigma(x)(1 - \sigma(x)).$$
Notice that $0 \leq \frac{\text{d}\sigma}{\text{d}x}(x) \leq \frac{1}{4}$ for all $x$.
class: middle
Assume that weights $w_1, w_2, w_3$ are initialized randomly from a Gaussian with zero mean and small variance, such that with high probability $-1 \leq w_i \leq 1$.
Then, in the chain-rule product above, the activation factors $\sigma'(u_1), \sigma'(u_3), \sigma'(u_5)$ are each at most $\frac{1}{4}$, and the weight factors $w_2, w_3$ are bounded by $1$ in absolute value.
This implies that the gradient $\frac{\text{d}\hat{y}}{\text{d}w_1}$ shrinks exponentially to zero as the number of layers increases.
Hence the vanishing gradient problem.
- In general, bounded activation functions (sigmoid, tanh, etc.) are prone to the vanishing gradient problem.
- Note the importance of a proper initialization scheme.
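A small numeric sketch of this effect (random scalar weights, depth 20; all names and values are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each sigmoid layer contributes a factor w * sigma'(u) to the gradient,
# with sigma'(u) <= 1/4; if |w| <= 1, the product shrinks at least as
# fast as (1/4)^L with the depth L.
rng = np.random.default_rng(0)
grad, u = 1.0, 1.0
for _ in range(20):
    w = rng.uniform(-1.0, 1.0)
    s = sigmoid(u)
    grad *= w * s * (1.0 - s)   # chain-rule factor for one layer
    u = w * s                   # pre-activation of the next layer
print(abs(grad))                # vanishingly small after 20 layers
```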
Instead of the sigmoid activation function, modern neural networks are mostly based on the rectified linear unit (ReLU) (Glorot et al, 2011):
$$\text{ReLU}(x) = \max(0, x)$$
class: middle
Note that the derivative of the ReLU function is
$$\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\ 1 &\text{otherwise} \end{cases}$$
For $x=0$, the derivative is undefined; in practice, it is set to $0$.
class: middle
Therefore, for units operating in their linear regime, the activation factors in the chain-rule product are equal to $1$ instead of being at most $\frac{1}{4}$:
$$\frac{\text{d}\hat{y}}{\text{d}w_1} = \underbrace{\frac{\partial \hat{y}}{\partial u_5}}_{= 1} \underbrace{\frac{\partial u_5}{\partial u_4}}_{= w_3} \underbrace{\frac{\partial u_4}{\partial u_3}}_{= 1} \underbrace{\frac{\partial u_3}{\partial u_2}}_{= w_2} \underbrace{\frac{\partial u_2}{\partial u_1}}_{= 1} \underbrace{\frac{\partial u_1}{\partial w_1}}_{= x}$$
This solves the vanishing gradient problem, even for deep networks! (provided proper initialization)
Note that:
- The ReLU unit dies when its input is negative, which might block gradient descent.
- This is actually a useful property to induce sparsity.
- This issue can also be solved using leaky ReLUs, defined as
$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$
for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$).
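Both activations are one-liners in numpy (a minimal sketch; the function names are ours):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x); derivative 0 for x < 0 and 1 for x > 0."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    """LeakyReLU(x) = max(alpha * x, x); the small slope alpha keeps a
    non-zero gradient for negative inputs, so the unit cannot die."""
    return np.maximum(alpha * x, x)
```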
Let us consider the 1-layer MLP
$$f(x) = \sum_i w_i \text{ReLU}(x + b_i).$$
This model can approximate any smooth 1D function, provided enough hidden units.
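A minimal numpy sketch of this claim, fitting $\sin$ on $[-3,3]$; to keep it short, the $b_i$ are fixed on a grid and the $w_i$ are obtained by least squares rather than gradient descent (an illustrative shortcut of ours):

```python
import numpy as np

# f(x) = sum_i w_i * ReLU(x + b_i) is piecewise linear, with one kink
# per hidden unit at x = -b_i; with enough units it approximates a
# smooth 1D target.
x = np.linspace(-3, 3, 200)
b = np.linspace(-4, 4, 64)                      # one kink per hidden unit
H = np.maximum(0.0, x[:, None] + b[None, :])    # hidden features ReLU(x + b_i)
w, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
print(np.max(np.abs(H @ w - np.sin(x))))        # small approximation error
```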
class: middle
class: middle, center
(demo)
class: middle
.italic[ People are now building a new kind of software by .bold[assembling networks of parameterized functional blocks] and by .bold[training them from examples using some form of gradient-based optimization]. ]
.pull-right[Yann LeCun, 2018.]
class: middle
class: middle