class: middle, center, title-slide
Lecture 2: Neural networks
Prof. Gilles Louppe
[email protected]
???
R: regenerate all svg files from draw.io
R: backprop -> check https://mathematical-tours.github.io/book-sources/optim-ml/OptimML.pdf
Explain and motivate the basic constructs of neural networks.
- From linear discriminant analysis to logistic regression
- Stochastic gradient descent
- From logistic regression to the multi-layer perceptron
- Vanishing gradients and rectified networks
- Universal approximation theorem (teaser)
class: middle
The Threshold Logic Unit (McCulloch and Pitts, 1943) was the first mathematical model for a neuron.
Assuming Boolean inputs and outputs, it is defined as
$$f(\mathbf{x}) = 1_{\left\{\sum_i w_i x_i + b \geq 0\right\}}.$$
This unit can implement:
$\text{or}(a,b) = 1_{\{a+b - 0.5 \geq 0\}}$ $\text{and}(a,b) = 1_{\{a+b - 1.5 \geq 0\}}$ $\text{not}(a) = 1_{\{-a + 0.5 \geq 0\}}$
Therefore, any Boolean function can be built with such units.
class: middle
.footnote[Credits: McCulloch and Pitts, A logical calculus of ideas immanent in nervous activity, 1943.]
The perceptron (Rosenblatt, 1957) is very similar, except that the inputs are real:
$$f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\ 0 &\text{otherwise.} \end{cases}$$
This model was originally motivated by biology, with the weights $w_i$ playing the role of synaptic strengths and the inputs $x_i$ of firing rates.
class: middle
.footnote[Credits: Frank Rosenblatt, Mark I Perceptron operators' manual, 1960.]
???
A perceptron is a signal transmission network consisting of sensory units (S units), association units (A units), and output or response units (R units). The ‘retina’ of the perceptron is an array of sensory elements (photocells). An S-unit produces a binary output depending on whether or not it is excited. A randomly selected set of retinal cells is connected to the next level of the network, the A units. As originally proposed there were extensive connections among the A units, the R units, and feedback between the R units and the A units.
In essence an association unit is also an MCP neuron which is 1 if a single specific pattern of inputs is received, and it is 0 for all other possible patterns of inputs. Each association unit will have a certain number of inputs which are selected from all the inputs to the perceptron. So the number of inputs to a particular association unit does not have to be the same as the total number of inputs to the perceptron, but clearly the number of inputs to an association unit must be less than or equal to the total number of inputs to the perceptron. Each association unit's output then becomes the input to a single MCP neuron, and the output from this single MCP neuron is the output of the perceptron. So a perceptron consists of a "layer" of MCP neurons, and all of these neurons send their output to a single MCP neuron.
class: middle, center, black-slide
.grid[
.kol-1-2[.width-100[]]
.kol-1-2[
.width-100[]]
]
The Mark I Perceptron (Frank Rosenblatt).
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/cNxadbrN_aI" frameborder="0" allowfullscreen></iframe>

The Perceptron
class: middle
Let us define the (non-linear) activation function:
$$\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\ 0 &\text{otherwise} \end{cases}$$
The perceptron classification rule can be rewritten as
$$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b).$$
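As a minimal numpy sketch of this rule (the function names and toy weights are ours, not from the lecture):

```python
import numpy as np

def sign(x):
    """Activation defined above: 1 if x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1, 0)

def perceptron(x, w, b):
    """Perceptron classification rule f(x) = sign(w^T x + b)."""
    return sign(np.dot(w, x) + b)

# Example: the OR gate from the threshold logic unit slide.
w, b = np.array([1.0, 1.0]), -0.5
print(perceptron(np.array([0.0, 1.0]), w, b))  # 1
```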
class: middle
.grid[
.kol-3-5[.width-90[]]
.kol-2-5[
The computation of $f(\mathbf{x})$ can be represented as a computational graph, where
- white nodes correspond to inputs and outputs;
- red nodes correspond to model parameters;
- blue nodes correspond to intermediate operations. ] ]
???
Draw the NN diagram.
class: middle
In terms of tensor operations, the same computation is expressed as the composition of a dot product, an addition, and the sign function, i.e., $f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$.
Consider training data $(\mathbf{x}, y)$, with
- $\mathbf{x} \in \mathbb{R}^p$,
- $y \in \{0,1\}$.
Assume the class populations are Gaussian with the same covariance matrix $\Sigma$ (homoscedasticity):
$$P(\mathbf{x}|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_y)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_y)\right)$$
Using Bayes' rule, we have:
$$P(Y=1|\mathbf{x}) = \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x}|Y=0) P(Y=0) + P(\mathbf{x}|Y=1) P(Y=1)}.$$
--
count: false
It follows that with
$$a = \log \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x}|Y=0) P(Y=0)},$$
we get
$$P(Y=1|\mathbf{x}) = \frac{1}{1 + \exp(-a)} = \sigma(a).$$
class: middle
Therefore, under the Gaussian and homoscedasticity assumptions, the posterior reduces to
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
for some $\mathbf{w}$ and $b$ that are functions of $\boldsymbol{\mu}_0$, $\boldsymbol{\mu}_1$ and $\Sigma$.
class: middle, center
class: middle
Note that the sigmoid function
$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
acts as a soft, differentiable version of the sign function.
Therefore, the overall model
$$f(\mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
is very similar to the perceptron, but with a smooth non-linearity.
class: middle, center
This unit is the main primitive of all neural networks!
Same model
$$f(\mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
as for linear discriminant analysis.
But,
- ignore model assumptions (Gaussian class populations, homoscedasticity);
- instead, find $\mathbf{w}, b$ that maximize the likelihood of the data.
class: middle
We have,
$$\begin{aligned}
&\arg \max_{\mathbf{w},b} P(\mathbf{d}|\mathbf{w},b) \\
&= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} P(Y=y_i|\mathbf{x}_i, \mathbf{w},b) \\
&= \arg \min_{\mathbf{w},b} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} -y_i \log \hat{y}(\mathbf{x}_i) - (1-y_i) \log(1 - \hat{y}(\mathbf{x}_i)),
\end{aligned}$$
where $\hat{y}(\mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$.
This loss is an instance of the cross-entropy
$$H(p, q) = \mathbb{E}_p[-\log q]$$
for $p = Y|\mathbf{x}_i$ and $q = \hat{Y}|\mathbf{x}_i$.
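As a minimal numpy sketch of this objective (the helper names and toy data are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy -[y log(y_hat) + (1 - y) log(1 - y_hat)],
    averaged over the dataset; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Logistic regression predictions on a toy batch.
X = np.array([[0.5, 1.0], [-1.0, 0.2]])
w, b = np.array([1.0, -0.5]), 0.1
y_hat = sigmoid(X @ w + b)
print(cross_entropy(np.array([1.0, 0.0]), y_hat))
```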
class: middle
When $y \in \{-1, 1\}$, the same objective can be rewritten in terms of the logistic loss
$$\ell(y, \mathbf{x}) = \log\left(1 + \exp\left(-y (\mathbf{w}^T \mathbf{x} + b)\right)\right).$$
class: middle
- In general, the cross-entropy and the logistic losses do not admit a minimizer that can be expressed analytically in closed form.
- However, a minimizer can be found numerically, using a general minimization technique such as gradient descent.
Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $\mathbf{w}$ and $b$).
To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimizer.
For $\theta_0 \in \mathbb{R}^d$, a first-order approximation around $\theta_0$ can be defined as
$$\hat{\mathcal{L}}(\epsilon; \theta_0) = \mathcal{L}(\theta_0) + \epsilon^T \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma} ||\epsilon||_2^2.$$
class: middle
A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta_0)$ is given for
$$\begin{aligned}
0 &= \nabla_\epsilon \hat{\mathcal{L}}(\epsilon; \theta_0) \\
&= \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma} \epsilon,
\end{aligned}$$
which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.
Therefore, model parameters can be updated iteratively using the update rule
$$\theta_{t+1} = \theta_t - \gamma \nabla_\theta \mathcal{L}(\theta_t),$$
where
- $\theta_0$ are the initial parameters of the model;
- $\gamma$ is the learning rate;
- both are critical for the convergence of the update rule.
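As a minimal sketch of this procedure (assuming a hypothetical helper `grad_fn` that returns $\nabla_\theta \mathcal{L}(\theta)$; the toy loss is ours):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, gamma=0.1, n_steps=100):
    """Iterate theta_{t+1} = theta_t - gamma * grad L(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - gamma * grad_fn(theta)
    return theta

# Example: minimize L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
print(gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0))  # ~3.0
```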
class: center, middle

Example 1: Convergence to a local minimum
class: center, middle

Example 2: Convergence to the global minimum
class: center, middle

Example 3: Divergence due to a too-large learning rate
In the empirical risk minimization setup, the loss and its gradient decompose as
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i; \theta)), \qquad \nabla \mathcal{L}(\theta) = \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta)).$$
Therefore, in batch gradient descent the complexity of one update grows linearly with the size $N$ of the dataset.
class: middle
Since the empirical risk is already an approximation of the expected risk, it should not be necessary to carry out the minimization with great accuracy.
Instead, stochastic gradient descent uses as update rule:
$$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))$$
- Iteration complexity is independent of $N$.
- The stochastic process $\{ \theta_t | t=1, ... \}$ depends on the examples $i(t)$ picked randomly at each iteration.
--
.grid.center.italic[ .kol-1-2[.width-100[]
Batch gradient descent] .kol-1-2[.width-100[]
Stochastic gradient descent ] ]
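As a minimal sketch of the stochastic update (assuming a hypothetical per-example gradient function `grad_fn(theta, x_i, y_i)`):

```python
import numpy as np

def sgd(grad_fn, theta0, X, y, gamma=0.01, n_steps=1000, seed=0):
    """Stochastic gradient descent: each update follows the gradient of
    the loss evaluated on a single randomly picked example."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(len(X))  # i(t+1), picked uniformly at random
        theta = theta - gamma * grad_fn(theta, X[i], y[i])
    return theta
```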
class: middle
Why is stochastic gradient descent still a good idea?
- Informally, averaging the update
$$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))$$
over all choices $i(t+1)$ restores batch gradient descent.
- Formally, if the gradient estimate is unbiased, i.e., if
$$\begin{aligned} \mathbb{E}_{i(t+1)}[\nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))] &= \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta_t)) \\ &= \nabla \mathcal{L}(\theta_t), \end{aligned}$$
then the formal convergence of SGD can be proved, under appropriate assumptions (see references).
- If training is limited to a single pass over the data, then SGD directly minimizes the expected risk.
class: middle
The excess error characterizes the expected risk discrepancy between the Bayes model and the approximate empirical risk minimizer. It can be decomposed as
$$\begin{aligned}
&\mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_B) \right] \\
&= \mathbb{E}\left[ R(f_*) - R(f_B) \right] + \mathbb{E}\left[ R(f_*^\mathbf{d}) - R(f_*) \right] + \mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_*^\mathbf{d}) \right] \\
&= \mathcal{E}_\text{app} + \mathcal{E}_\text{est} + \mathcal{E}_\text{opt}
\end{aligned}$$
where
- $\mathcal{E}_\text{app}$ is the approximation error due to the choice of a hypothesis space,
- $\mathcal{E}_\text{est}$ is the estimation error due to the empirical risk minimization principle,
- $\mathcal{E}_\text{opt}$ is the optimization error due to the approximate optimization algorithm.
class: middle
A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.
So far we considered the logistic unit $h = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$, where $h \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{w} \in \mathbb{R}^p$ and $b \in \mathbb{R}$.
These units can be composed in parallel to form a layer with $q$ outputs:
$$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})$$
where $\mathbf{W} \in \mathbb{R}^{p \times q}$, $\mathbf{b} \in \mathbb{R}^q$, and where $\sigma(\cdot)$ is applied element-wise.
.center.width-70[![](figures/lec2/graphs/layer.svg)]
???
Draw the NN diagram.
class: middle
Similarly, layers can be composed in series, such that:
$$\begin{aligned}
\mathbf{h}_0 &= \mathbf{x} \\
\mathbf{h}_1 &= \sigma(\mathbf{W}_1^T \mathbf{h}_0 + \mathbf{b}_1) \\
... \\
\mathbf{h}_L &= \sigma(\mathbf{W}_L^T \mathbf{h}_{L-1} + \mathbf{b}_L) \\
f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}_L
\end{aligned}$$
where $\theta$ denotes the model parameters $\{ \mathbf{W}_k, \mathbf{b}_k | k=1, ..., L \}$.
This model is the multi-layer perceptron, also known as the fully connected feedforward network.
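As a minimal numpy sketch of this composition (the helper names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Fully connected feedforward network:
    h_0 = x, then h_k = sigma(W_k^T h_{k-1} + b_k) for k = 1, ..., L."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W.T @ h + b)
    return h
```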
???
Draw the NN diagram.
class: middle, center
class: middle
.footnote[Credits: PyTorch Deep Learning Minicourse, Alfredo Canziani, 2020.]
class: middle
- For binary classification, the width $q$ of the last layer $L$ is set to $1$, which results in a single output $h_L \in [0,1]$ that models the probability $P(Y=1|\mathbf{x})$.
- For multi-class classification, the sigmoid activation $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}_L \in \bigtriangleup^C$ of probability estimates $P(Y=i|\mathbf{x})$. This activation is the $\text{Softmax}$ function, whose $i$-th output is defined as
$$\text{Softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^C \exp(z_j)},$$
for $i=1, ..., C$.
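A direct implementation follows the definition; subtracting $\max(\mathbf{z})$ first (a standard numerical trick, not from the slide) leaves the result unchanged but avoids overflow:

```python
import numpy as np

def softmax(z):
    """Softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1
```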
For regression problems, one usually starts with the assumption that the output is normally distributed around the network prediction, i.e., $Y|\mathbf{x} \sim \mathcal{N}(f(\mathbf{x}; \theta), 1)$.
class: middle
We have,
$$\begin{aligned}
&\arg \max_{\theta} P(\mathbf{d}|\theta) \\
&= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} P(Y=y_i|\mathbf{x}_i, \theta) \\
&= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log P(Y=y_i|\mathbf{x}_i, \theta) \\
&= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log\left( \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y_i - f(\mathbf{x}_i;\theta))^2\right) \right)\\
&= \arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} (y_i - f(\mathbf{x}_i;\theta))^2,
\end{aligned}$$
which recovers the common squared error loss $\ell(y, \hat{y}) = (y - \hat{y})^2$.
To minimize $\mathcal{L}(\theta)$ with stochastic gradient descent, we need the gradient $\nabla_\theta \ell(y_i, f(\mathbf{x}_i; \theta))$ of the per-example loss.
Therefore, we require the evaluation of the (total) derivatives
$$\frac{\text{d} \ell}{\text{d} \mathbf{W}_k}, \quad \frac{\text{d} \ell}{\text{d} \mathbf{b}_k}$$
of the loss $\ell$ with respect to all model parameters $\mathbf{W}_k$, $\mathbf{b}_k$, for $k=1, ..., L$.
These derivatives can be evaluated automatically from the computational graph of $\ell$ using automatic differentiation.
class: middle
Let us consider a 1-dimensional output composition $f \circ g$, such that
$$\begin{aligned}
y &= f(\mathbf{u}) \\
\mathbf{u} &= g(x) = (g_1(x), ..., g_m(x)).
\end{aligned}$$
class: middle
For a single intermediate variable ($m=1$), the chain rule states that
$$\frac{\text{d} y}{\text{d} x} = \frac{\text{d} y}{\text{d} u} \frac{\text{d} u}{\text{d} x}.$$
For the total derivative, the chain rule generalizes to $$ \begin{aligned} \frac{\text{d} y}{\text{d} x} &= \sum_{k=1}^m \frac{\partial y}{\partial u_k} \underbrace{\frac{\text{d} u_k}{\text{d} x}}_{\text{recursive case}} \end{aligned}$$
class: middle
- Since a neural network is a composition of differentiable functions, the total derivatives of the loss can be evaluated backward, by applying the chain rule recursively over its computational graph.
- The implementation of this procedure is called reverse automatic differentiation.
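For instance, reverse-mode automatic differentiation is what PyTorch's autograd implements (a minimal sketch; the function and values are ours):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sigmoid(3.0 * x)  # forward pass builds the computational graph
y.backward()                # backward pass applies the chain rule
print(x.grad)               # dy/dx = 3 * sigmoid(3x) * (1 - sigmoid(3x))
```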
class: middle
Let us consider a simplified 2-layer MLP and the following loss function:
$$\begin{aligned}
f(\mathbf{x}; \mathbf{W}_1, \mathbf{W}_2) &= \sigma\left( \mathbf{W}_2^T \sigma\left( \mathbf{W}_1^T \mathbf{x} \right)\right) \\
\ell(y, \hat{y}; \mathbf{W}_1, \mathbf{W}_2) &= \text{cross\_ent}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||_2 + ||\mathbf{W}_2||_2 \right)
\end{aligned}$$
for some fixed regularization coefficient $\lambda \in \mathbb{R}^+$.
class: middle
In the forward pass, intermediate values are all computed from inputs to outputs, which results in the annotated computational graph below:
class: middle
The total derivative can be computed through a backward pass, by walking through all paths from outputs to parameters in the computational graph and accumulating the terms. For example, for $\frac{\text{d} \ell}{\text{d} \mathbf{W}_1}$, every path from $\ell$ down to $\mathbf{W}_1$ contributes one product of partial derivatives to the sum.
class: middle
Let us zoom in on the computation of the network output $\hat{y}$ and of its derivative with respect to $\mathbf{W}_1$.
- Forward pass: values $u_1$, $u_2$, $u_3$ and $\hat{y}$ are computed by traversing the graph from inputs to outputs given $\mathbf{x}$, $\mathbf{W}_1$ and $\mathbf{W}_2$.
- Backward pass: by the chain rule we have
$$\begin{aligned}
\frac{\text{d} \hat{y}}{\text{d} \mathbf{W}_1} &= \frac{\partial \hat{y}}{\partial u_3} \frac{\partial u_3}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial \mathbf{W}_1} \\
&= \frac{\partial \sigma(u_3)}{\partial u_3} \frac{\partial \mathbf{W}_2^T u_2}{\partial u_2} \frac{\partial \sigma(u_1)}{\partial u_1} \frac{\partial \mathbf{W}_1^T \mathbf{x}}{\partial \mathbf{W}_1}
\end{aligned}$$
Note how evaluating the partial derivatives requires the intermediate values computed forward.
class: middle
- This algorithm is also known as backpropagation.
- An equivalent procedure can be defined to evaluate the derivatives in forward mode, from inputs to outputs.
- Since differentiation is a linear operator, automatic differentiation can be implemented efficiently in terms of tensor operations.
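To make the two passes concrete, here is a scalar sketch of the zoomed-in computation above (scalar weights stand in for the matrices $\mathbf{W}_1$, $\mathbf{W}_2$, and the input values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, w1, w2 = 1.5, 0.8, -0.4

# Forward pass: compute and store the intermediate values.
u1 = w1 * x
u2 = sigmoid(u1)
u3 = w2 * u2
y_hat = sigmoid(u3)

# Backward pass: accumulate the chain-rule factors from output to w1,
# reusing the values stored during the forward pass.
d_yhat_du3 = y_hat * (1.0 - y_hat)   # sigma'(u3)
d_u3_du2 = w2
d_u2_du1 = u2 * (1.0 - u2)           # sigma'(u1)
d_u1_dw1 = x
print(d_yhat_du3 * d_u3_du2 * d_u2_du1 * d_u1_dw1)  # d y_hat / d w1
```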
Training deep MLPs with many layers had long (pre-2011) been difficult due to the vanishing gradient problem.
- Small gradients slow down, and eventually block, stochastic gradient descent.
- This results in a limited capacity to learn.
.width-100[]
.caption[Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ]
class: middle
Let us consider a simplified 3-layer MLP, with $x, w_1, w_2, w_3 \in \mathbb{R}$, such that
$$f(x; w_1, w_2, w_3) = \sigma\left(w_3 \sigma\left(w_2 \sigma\left(w_1 x\right)\right)\right).$$
Under the hood, this would be evaluated as
$$\begin{aligned}
u_1 &= w_1 x \\
u_2 &= \sigma(u_1) \\
u_3 &= w_2 u_2 \\
u_4 &= \sigma(u_3) \\
u_5 &= w_3 u_4 \\
\hat{y} &= \sigma(u_5)
\end{aligned}$$
and its derivative as
$$\frac{\text{d}\hat{y}}{\text{d}w_1} = \frac{\partial \hat{y}}{\partial u_5} \frac{\partial u_5}{\partial u_4} \frac{\partial u_4}{\partial u_3} \frac{\partial u_3}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial w_1}.$$
class: middle
The derivative of the sigmoid activation function $\sigma$ is
$$\frac{\text{d}\sigma}{\text{d}x}(x) = \sigma(x)(1 - \sigma(x)).$$
Notice that $0 \leq \frac{\text{d}\sigma}{\text{d}x}(x) \leq \frac{1}{4}$ for all $x$.
class: middle
Assume that weights $w_1, w_2, w_3$ are initialized randomly from a Gaussian with zero mean and small variance, such that with high probability $-1 \leq w_i \leq 1$.
Then, in the chain-rule product above, the activation factors $\sigma'(u_1), \sigma'(u_3), \sigma'(u_5)$ are each at most $\frac{1}{4}$, and the weight factors $w_2, w_3$ are bounded by $1$ in absolute value.
This implies that the gradient $\frac{\text{d}\hat{y}}{\text{d}w_1}$ shrinks exponentially to zero as the number of layers increases.
Hence the vanishing gradient problem.
- In general, bounded activation functions (sigmoid, tanh, etc.) are prone to the vanishing gradient problem.
- Note the importance of a proper initialization scheme.
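A small numeric sketch of this effect (random scalar weights, depth 20; all names and values are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each sigmoid layer contributes a factor w * sigma'(u) to the gradient,
# with sigma'(u) <= 1/4; if |w| <= 1, the product shrinks at least as
# fast as (1/4)^L with the depth L.
rng = np.random.default_rng(0)
grad, u = 1.0, 1.0
for _ in range(20):
    w = rng.uniform(-1.0, 1.0)
    s = sigmoid(u)
    grad *= w * s * (1.0 - s)   # chain-rule factor for one layer
    u = w * s                   # pre-activation of the next layer
print(abs(grad))                # vanishingly small after 20 layers
```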
Instead of the sigmoid activation function, modern neural networks are mostly based on the rectified linear unit (ReLU) (Glorot et al, 2011):
$$\text{ReLU}(x) = \max(0, x)$$
class: middle
Note that the derivative of the ReLU function is
$$\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\ 1 &\text{otherwise} \end{cases}$$
For $x=0$, the derivative is undefined; in practice, it is set to $0$.
class: middle
Therefore, for units operating in their linear regime, the activation factors in the chain-rule product are equal to $1$ instead of being at most $\frac{1}{4}$:
$$\frac{\text{d}\hat{y}}{\text{d}w_1} = \underbrace{\frac{\partial \hat{y}}{\partial u_5}}_{= 1} \underbrace{\frac{\partial u_5}{\partial u_4}}_{= w_3} \underbrace{\frac{\partial u_4}{\partial u_3}}_{= 1} \underbrace{\frac{\partial u_3}{\partial u_2}}_{= w_2} \underbrace{\frac{\partial u_2}{\partial u_1}}_{= 1} \underbrace{\frac{\partial u_1}{\partial w_1}}_{= x}$$
This solves the vanishing gradient problem, even for deep networks! (provided proper initialization)
Note that:
- The ReLU unit dies when its input is negative, which might block gradient descent.
- This is actually a useful property to induce sparsity.
- This issue can also be solved using leaky ReLUs, defined as
$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$
for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$).
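Both activations are one-liners in numpy (a minimal sketch; the function names are ours):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x); derivative 0 for x < 0 and 1 for x > 0."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    """LeakyReLU(x) = max(alpha * x, x); the small slope alpha keeps a
    non-zero gradient for negative inputs, so the unit cannot die."""
    return np.maximum(alpha * x, x)
```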
Let us consider the 1-layer MLP
$$f(x) = \sum_i w_i \text{ReLU}(x + b_i).$$
This model can approximate any smooth 1D function, provided enough hidden units.
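A minimal numpy sketch of this claim, fitting $\sin$ on $[-3,3]$; to keep it short, the $b_i$ are fixed on a grid and the $w_i$ are obtained by least squares rather than gradient descent (an illustrative shortcut of ours):

```python
import numpy as np

# f(x) = sum_i w_i * ReLU(x + b_i) is piecewise linear, with one kink
# per hidden unit at x = -b_i; with enough units it approximates a
# smooth 1D target.
x = np.linspace(-3, 3, 200)
b = np.linspace(-4, 4, 64)                      # one kink per hidden unit
H = np.maximum(0.0, x[:, None] + b[None, :])    # hidden features ReLU(x + b_i)
w, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
print(np.max(np.abs(H @ w - np.sin(x))))        # small approximation error
```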
class: middle
class: middle, center
(demo)
class: middle
.italic[ People are now building a new kind of software by .bold[assembling networks of parameterized functional blocks] and by .bold[training them from examples using some form of gradient-based optimization]. ]
.pull-right[Yann LeCun, 2018.]
class: middle
class: middle