
class: middle, center, title-slide

Deep Learning

Lecture 5: Training neural networks



Prof. Gilles Louppe
[email protected]


Today

How to optimize parameters efficiently?

  • Optimizers
  • Initialization
  • Normalization

class: middle

A practical recommendation

Training a massive deep neural network is a long, complex, and sometimes confusing process.

A first step towards understanding, debugging and optimizing neural networks is to make use of visualization tools such as TensorBoard for

  • plotting losses and metrics,
  • visualizing computational graphs,
  • or showing additional data as the network is being trained.

.center.width-45[]


class: middle

Optimizers


Gradient descent

To minimize a loss $\mathcal{L}(\theta)$ of the form $$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(\mathbf{x}_n; \theta)),$$ standard batch gradient descent (GD) consists in applying the update rule $$\begin{aligned} g_t &= \frac{1}{N} \sum_{n=1}^N \nabla_\theta \ell(y_n, f(\mathbf{x}_n; \theta_t)) \\ \theta_{t+1} &= \theta_t - \gamma g_t, \end{aligned}$$ where $\gamma$ is the learning rate.
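
For concreteness, here is a minimal NumPy sketch of this update rule on a hypothetical least-squares problem (the data, model, and learning rate are illustrative only):

```python
import numpy as np

# Hypothetical data and a linear model f(x; theta) = x @ theta.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta = np.zeros(5)
gamma = 0.1  # learning rate

for t in range(1000):
    # Full-batch gradient of the mean squared error.
    g = 2 / len(X) * X.T @ (X @ theta - y)
    theta = theta - gamma * g
```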


class: middle

.center[

]

class: middle

While it makes sense to compute the gradient exactly,

  • it takes time to compute and becomes inefficient for large $N$,
  • it is an empirical estimate of a hidden quantity (the expected risk), and any partial sum is also an unbiased estimate, although of greater variance.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

To illustrate how partial sums are good estimates, consider an ideal case where the training set is the same set of $M \ll N$ samples replicated $K$ times. Then, $$\begin{aligned} \mathcal{L}(\theta) &= \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(\mathbf{x}_n; \theta)) \\ &=\frac{1}{N} \sum_{k=1}^K \sum_{m=1}^M \ell(y_m, f(\mathbf{x}_m; \theta)) \\ &=\frac{1}{N} K \sum_{m=1}^M \ell(y_m, f(\mathbf{x}_m; \theta)). \end{aligned}$$ Hence, instead of summing over all $N$ samples and moving by $\gamma$, we can visit only the $M=N/K$ distinct samples and move by $K\gamma$, which cuts the computation by a factor of $K$.

Although this is an ideal case, there is redundancy in practice that results in similar behaviors.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

Stochastic gradient descent

To reduce the computational complexity, stochastic gradient descent (SGD) consists in updating the parameters after every sample $$\begin{aligned} g_t &= \nabla_\theta \ell(y_{n(t)}, f(\mathbf{x}_{n(t)}; \theta_t)) \\ \theta_{t+1} &= \theta_t - \gamma g_t. \end{aligned}$$


class: middle

.center[

]

class: middle

The stochastic behavior of SGD helps escape local minima.

While being computationally faster than batch gradient descent,

  • gradient estimates used by SGD can be very noisy,
  • SGD does not benefit from the speed-up of batch-processing.

class: middle

Mini-batching

Instead, mini-batch SGD consists of visiting the samples in mini-batches and updating the parameters each time, $$ \begin{aligned} g_t &= \frac{1}{B} \sum_{b=1}^B \nabla_\theta \ell(y_{n(t,b)}, f(\mathbf{x}_{n(t,b)}; \theta_t)) \\ \theta_{t+1} &= \theta_t - \gamma g_t, \end{aligned} $$ where the order $n(t,b)$ in which the samples are visited can be either sequential or random (see the sketch below).

  • Increasing the batch size $B$ reduces the variance of the gradient estimates and enables the speed-up of batch processing.
  • The interplay between $B$ and $\gamma$ is still unclear.
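
As a rough sketch (reusing the hypothetical least-squares setup from above), mini-batch SGD only changes which samples enter each gradient estimate; setting $B=1$ recovers plain SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta = np.zeros(5)
gamma, B = 0.1, 16  # learning rate and batch size

for epoch in range(100):
    order = rng.permutation(len(X))  # random visiting order n(t, b)
    for start in range(0, len(X), B):
        idx = order[start:start + B]
        g = 2 / len(idx) * X[idx].T @ (X[idx] @ theta - y[idx])
        theta = theta - gamma * g  # B = 1 would recover plain SGD
```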

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

Limitations

The gradient descent method makes strong assumptions about

  • the magnitude of the local curvature to set the step size,
  • the isotropy of the curvature, so that the same step size $\gamma$ makes sense in all directions.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

.center[


$\gamma=0.01$ ]

class: middle

.center[


$\gamma=0.01$ ]

class: middle

.center[


$\gamma=0.1$ ]

class: middle

.center[


$\gamma=0.4$ ]

class: middle

Wolfe conditions

Let us consider a function $f$ to minimize along $x$, following a direction of descent $p$.

For $0 < c_1 < c_2 < 1$, the Wolfe conditions on the step size $\gamma$ are as follows:

  • Sufficient decrease condition: $$f(x + \gamma p) \leq f(x) + c_1 \gamma p^T \nabla f(x)$$
  • Curvature condition: $$c_2 p^T \nabla f(x) \leq p^T \nabla f(x + \gamma p)$$

Typical values are $c_1 = 10^{-4}$ and $c_2 = 0.9$.
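
As an illustration, here is a backtracking line search sketch that enforces the sufficient decrease condition only (a full Wolfe line search would also check the curvature condition); the quadratic test function is hypothetical:

```python
import numpy as np

def backtracking_step(f, grad_f, x, p, gamma0=1.0, c1=1e-4, shrink=0.5):
    # Shrink gamma until the sufficient decrease condition holds.
    gamma, fx, slope = gamma0, f(x), p @ grad_f(x)
    while f(x + gamma * p) > fx + c1 * gamma * slope:
        gamma *= shrink
    return gamma

# Hypothetical usage on a quadratic bowl.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, 2.0])
gamma = backtracking_step(f, grad_f, x, p=-grad_f(x))
```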

???

  • The sufficient decrease condition ensures that the function decreases sufficiently, as predicted by the slope of f in the direction p.
  • The curvature condition ensures that steps are not too short by ensuring that the slope has decreased (in magnitude) by some relative amount.

class: middle

.center[ .width-100[] The sufficient decrease condition ensures that $f$ decreases sufficiently.
($\alpha$ is the step size.) ]

.footnote[Credits: Wikipedia, Wolfe conditions.]


class: middle

.center[ .width-100[] The curvature condition ensures that the slope has been reduced sufficiently. ]

.footnote[Credits: Wikipedia, Wolfe conditions.]


class: middle

The Wolfe conditions can be used to design line search algorithms to automatically determine a step size $\gamma_t$, hence ensuring convergence towards a local minimum.

However, in deep learning,

  • these algorithms are impractical because of the size of the parameter space and the overhead it would induce,
  • they might lead to overfitting when the empirical risk is minimized too well.

class: middle

The tradeoffs of large-scale learning

A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.

That is, for a fixed computational budget, stochastic optimization algorithms reach a lower test error than more sophisticated algorithms (2nd-order methods, line search algorithms, etc.) that would fit the training error too well or consume too large a part of the computational budget at every step.


class: middle

.center.width-80[]

.footnote[Credits: Dive Into Deep Learning, 2020.]


Momentum

.center.width-80[]

In situations of small but consistent gradients, as along valley floors, gradient descent moves very slowly.


class: middle

An improvement to gradient descent is to use momentum to add inertia in the choice of the step direction, that is

$$\begin{aligned} u_t &= \alpha u_{t-1} - \gamma g_t \\\ \theta_{t+1} &= \theta_t + u_t. \end{aligned}$$

.grid[ .kol-2-3[

  • The new variable $u_t$ is the velocity. It corresponds to the direction and speed by which the parameters move as the learning dynamics progresses, modeled as an exponentially decaying moving average of negative gradients.
  • Gradient descent with momentum has three nice properties:
    • it can go through local barriers,
    • it accelerates if the gradient does not change much,
    • it dampens oscillations in narrow valleys. ] .kol-1-3[ .center.width-100[] ] ]

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

The hyper-parameter $\alpha$ controls how much past gradients affect the current update.

  • Usually, $\alpha=0.9$, with $\alpha > \gamma$.
  • If the same gradient $g$ were observed at every update, the step would eventually settle at $$u = -\frac{\gamma}{1-\alpha} g.$$
  • Therefore, for $\alpha=0.9$, it is like multiplying the maximum speed by $10$ relative to plain gradient descent (checked numerically below).
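
The terminal-velocity claim can be checked numerically; a small sketch with a constant gradient (purely illustrative):

```python
import numpy as np

gamma, alpha = 0.01, 0.9  # learning rate and momentum coefficient
u, g = np.zeros(1), np.ones(1)  # velocity; constant unit gradient

for t in range(200):
    u = alpha * u - gamma * g

# The velocity converges to -gamma / (1 - alpha) * g:
print(u)  # ~ [-0.1], i.e., 10x a plain gradient descent step
```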

class: middle

.center[

]

class: middle

Nesterov momentum

An alternative consists in simulating a step in the direction of the velocity, then calculating the gradient and making a correction.

$$ \begin{aligned} g_t &= \frac{1}{N} \sum_{n=1}^N \nabla_\theta \ell(y_n, f(\mathbf{x}_n; \theta_t + \alpha u_{t-1}))\\\ u_t &= \alpha u_{t-1} - \gamma g_t \\\ \theta_{t+1} &= \theta_t + u_t \end{aligned}$$

.center.width-30[]
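
A minimal sketch of this lookahead update, on the same hypothetical least-squares setup as before:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta, u = np.zeros(5), np.zeros(5)
gamma, alpha = 0.01, 0.9

grad = lambda theta: 2 / len(X) * X.T @ (X @ theta - y)

for t in range(500):
    g = grad(theta + alpha * u)  # gradient at the lookahead point
    u = alpha * u - gamma * g
    theta = theta + u
```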


class: middle

.center[

]

Adaptive learning rate

Vanilla gradient descent assumes the isotropy of the curvature, so that the same step size $\gamma$ applies to all parameters.

.center[ .width-45[] .width-45[]

Isotropic vs. Anisotropic ]


class: middle

AdaGrad

Each parameter's step is downscaled by the square root of the sum of squares of all its historical gradient values.

$$\begin{aligned} r_t &= r_{t-1} + g_t \odot g_t \\\ \theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t. \end{aligned}$$

  • AdaGrad eliminates the need to manually tune the learning rate. Most implementations use $\gamma=0.01$ as the default.
  • It is good when the objective is convex.
  • $r_t$ grows unboundedly during training, which may cause the step size to shrink and eventually become infinitesimally small.
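
A sketch of a single AdaGrad step (the function name and defaults are hypothetical):

```python
import numpy as np

def adagrad_step(theta, r, g, gamma=0.01, delta=1e-8):
    # r accumulates squared gradients and only ever grows.
    r = r + g * g
    theta = theta - gamma / (delta + np.sqrt(r)) * g
    return theta, r
```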

class: middle

RMSProp

Same as AdaGrad, but accumulates an exponentially decaying average of the squared gradients.

$$\begin{aligned} r_t &= \rho r_{t-1} + (1-\rho) g_t \odot g_t \\\ \theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t. \end{aligned}$$

  • Performs better in non-convex settings.
  • $r_t$ does not grow unboundedly.
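
The same sketch with the decaying average (again with hypothetical defaults):

```python
import numpy as np

def rmsprop_step(theta, r, g, gamma=0.001, rho=0.9, delta=1e-8):
    # r is an exponentially decaying average, so it stays bounded.
    r = rho * r + (1 - rho) * g * g
    theta = theta - gamma / (delta + np.sqrt(r)) * g
    return theta, r
```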

class: middle

Adam

Similar to RMSProp with momentum, but with bias correction terms for the first and second moments.

$$\begin{aligned} s_t &= \rho_1 s_{t-1} + (1-\rho_1) g_t \\\ \hat{s}_t &= \frac{s_t}{1-\rho_1^t} \\\ r_t &= \rho_2 r_{t-1} + (1-\rho_2) g_t \odot g_t \\\ \hat{r}_t &= \frac{r_t}{1-\rho_2^t} \\\ \theta_{t+1} &= \theta_t - \gamma \frac{\hat{s}_t}{\delta+\sqrt{\hat{r}_t}} \end{aligned}$$

  • Good defaults are $\rho_1=0.9$ and $\rho_2=0.999$.
  • Adam is one of the default optimizers in deep learning, along with SGD with momentum.
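
A sketch of one Adam step, following the update above (the step counter $t$ starts at 1 for the bias corrections; the function name is hypothetical):

```python
import numpy as np

def adam_step(theta, s, r, g, t,
              gamma=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    s = rho1 * s + (1 - rho1) * g           # first moment
    r = rho2 * r + (1 - rho2) * g * g       # second moment
    s_hat = s / (1 - rho1 ** t)             # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - gamma * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r
```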

.center[

]

class: middle

.center.width-60[]

.footnote[Credits: Kingma and Ba, Adam: A Method for Stochastic Optimization, 2014.]


Scheduling

Even with per-parameter adaptive learning rate methods, it is usually helpful to anneal the learning rate $\gamma$ over time.

  • Step decay: reduce the learning rate by some factor every few epochs (e.g., by half every 10 epochs).
  • Exponential decay: $\gamma_t = \gamma_0 \exp(-kt)$ where $\gamma_0$ and $k$ are hyper-parameters.
  • $1/t$ decay: $\gamma_t = \gamma_0 / (1+kt)$ where $\gamma_0$ and $k$ are hyper-parameters.
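
The three schedules, as small sketches (hyper-parameter values are illustrative):

```python
import numpy as np

gamma0, k = 0.1, 0.05  # illustrative hyper-parameters

def step_decay(t, every=10, factor=0.5):
    return gamma0 * factor ** (t // every)

def exponential_decay(t):
    return gamma0 * np.exp(-k * t)

def one_over_t_decay(t):
    return gamma0 / (1 + k * t)
```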

class: middle

.center[ .width-70[]
.caption[Step decay scheduling for training ResNets.] ]


class: middle

Initialization


class: middle

  • In convex problems, provided a good learning rate $\gamma$, convergence is guaranteed regardless of the initial parameter values.
  • In the non-convex regime, initialization is much more important!
  • Little is known about the mathematics of initialization strategies for neural networks.
    • What is known: initialization should break symmetry.
    • What is known: the scale of weights is important.

class: middle, center

See demo.


class: middle

Controlling for the variance in the forward pass

A first strategy is to initialize the network parameters such that activations preserve the same variance across layers.

Intuitively, this ensures that the information keeps flowing during the forward pass, without reducing or magnifying the magnitude of input signals exponentially.


class: middle

Let us assume that

  • we are in a linear regime at initialization (e.g., the positive part of a ReLU or the middle of a sigmoid),
  • weights $w_{ij}^l$ are initialized i.i.d.,
  • biases $b^l$ are initialized to $0$,
  • input features are i.i.d., with a common variance denoted $\mathbb{V}[x]$.

Then, the variance of the activation $h_i^l$ of unit $i$ in layer $l$ is $$ \begin{aligned} \mathbb{V}\left[h_i^l\right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l-1}-1} w_{ij}^l h_j^{l-1} \right] \\ &= \sum_{j=0}^{q_{l-1}-1} \mathbb{V}\left[ w_{ij}^l \right] \mathbb{V}\left[ h_j^{l-1} \right] \end{aligned} $$ where $q_l$ is the width of layer $l$, $h^0_j = x_j$ for all $j=0,..., p-1$, and the second equality holds because the terms are independent with zero means.

???

Use

  • V(AB) = V(A)V(B) + V(A)E(B)^2 + V(B)E(A)^2 for independent A and B
  • V(A+B) = V(A) + V(B) + 2 Cov(A,B)

class: middle

Since the weights $w_{ij}^l$ at layer $l$ share the same variance $\mathbb{V}\left[ w^l \right]$ and the variance of the activations in the previous layer are the same, we can drop the indices and write $$\mathbb{V}\left[h^l\right] = q_{l-1} \mathbb{V}\left[ w^l \right] \mathbb{V}\left[ h^{l-1} \right].$$

Therefore, the variance of the activations is preserved across layers when $$\mathbb{V}\left[ w^l \right] = \frac{1}{q_{l-1}} \quad \forall l.$$

This condition is enforced in LeCun's uniform initialization, which is defined as $$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{3}{q_{l-1}}}, \sqrt{\frac{3}{q_{l-1}}}\right].$$
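
A sketch of this initialization for a $q_l \times q_{l-1}$ weight matrix (using $\mathbb{V}[\mathcal{U}(-a,a)] = a^2/3$; the function name is hypothetical):

```python
import numpy as np

def lecun_uniform(q_in, q_out, rng=np.random.default_rng()):
    # Var[U(-a, a)] = a^2 / 3, so a = sqrt(3 / q_in) gives variance 1 / q_in.
    a = np.sqrt(3 / q_in)
    return rng.uniform(-a, a, size=(q_out, q_in))
```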

???

Var[ U(a,b) ] = 1/12 (b-a)^2


class: middle

Controlling for the variance in the backward pass

A similar idea can be applied to ensure that the gradients flow in the backward pass (without vanishing nor exploding), by maintaining the variance of the gradient with respect to the activations fixed across layers.

Under the same assumptions as before, $$\begin{aligned} \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h_i^l} \right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} \frac{\partial h_j^{l+1}}{\partial h_i^l} \right] \\ &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} w_{ji}^{l+1} \right] \\ &= \sum_{j=0}^{q_{l+1}-1} \mathbb{V}\left[\frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}}\right] \mathbb{V}\left[ w_{ji}^{l+1} \right] \end{aligned}$$


class: middle

If we further assume that

  • the gradients of the activations at layer $l$ share the same variance,
  • the weights at layer $l+1$ share the same variance $\mathbb{V}\left[ w^{l+1} \right]$,

then we can drop the indices and write $$ \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^l} \right] = q_{l+1} \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^{l+1}} \right] \mathbb{V}\left[ w^{l+1} \right]. $$

Therefore, the variance of the gradients with respect to the activations is preserved across layers when $$\mathbb{V}\left[ w^{l} \right] = \frac{1}{q_{l}} \quad \forall l.$$


class: middle

Xavier initialization

We have derived two different conditions on the variance of $w^l$,

  • $\mathbb{V}\left[w^l\right] = \frac{1}{q_{l-1}}$
  • $\mathbb{V}\left[w^l\right] = \frac{1}{q_{l}}$.

A compromise is the Xavier initialization, which initializes $w^l$ randomly from a distribution with variance $$\mathbb{V}\left[w^l\right] = \frac{1}{\frac{q_{l-1}+q_l}{2}} = \frac{2}{q_{l-1}+q_l}.$$

For example, normalized initialization is defined as $$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{6}{q_{l-1}+q_l}}, \sqrt{\frac{6}{q_{l-1}+q_l}}\right].$$
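
Correspondingly, a sketch of the normalized initialization (hypothetical function name):

```python
import numpy as np

def xavier_uniform(q_in, q_out, rng=np.random.default_rng()):
    # a = sqrt(6 / (q_in + q_out)) gives variance 2 / (q_in + q_out).
    a = np.sqrt(6 / (q_in + q_out))
    return rng.uniform(-a, a, size=(q_out, q_in))
```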


class: middle

.center.width-70[]

.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]


class: middle

.center.width-70[]

.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]


class: middle

Normalization


Data normalization

Previous weight initialization strategies rely on keeping the activation variance constant across layers, under the assumption that the input features all share the same variance.

That is, $$\mathbb{V}\left[x_i\right] = \mathbb{V}\left[x_j\right] \triangleq \mathbb{V}\left[x\right]$$ for all pairs of features $i,j$.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

In general, this constraint is not satisfied but can be enforced by standardizing the input data feature-wise, $$\mathbf{x}' = (\mathbf{x} - \hat{\mu}) \odot \frac{1}{\hat{\sigma}},$$ where $$ \begin{aligned} \hat{\mu} = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} \mathbf{x} \quad\quad\quad \hat{\sigma}^2 = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} (\mathbf{x} - \hat{\mu})^2. \end{aligned} $$
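
A feature-wise standardization sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 10))  # hypothetical data

mu_hat = X.mean(axis=0)             # per-feature mean
sigma_hat = X.std(axis=0)           # per-feature standard deviation
X_std = (X - mu_hat) / sigma_hat    # zero mean, unit variance per feature
```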

.center.width-100[]

.footnote[Credits: Scikit-Learn, Compare the effect of different scalers on data with outliers.]


Batch normalization

Maintaining proper statistics of the activations and derivatives is critical for training neural networks.

This constraint can be enforced explicitly during the forward pass by re-normalizing them. Batch normalization was the first method to introduce this idea.


.center.width-80[![](figures/lec5/bn.png)]

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL; Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]


class: middle

During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.

During test, it shifts and rescales according to the empirical moments estimated during training.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

.center.width-50[]

Let us consider a minibatch of samples at training, for which $\mathbf{u}_b \in \mathbb{R}^q$, $b=1, ..., B$, are intermediate values computed at some location in the computational graph.

In batch normalization following the node $\mathbf{u}$, the per-component mean and variance are first computed on the batch $$ \hat{\mu}_\text{batch} = \frac{1}{B} \sum_{b=1}^B \mathbf{u}_b \quad\quad\quad \hat{\sigma}^2_\text{batch} = \frac{1}{B} \sum_{b=1}^B (\mathbf{u}_b - \hat{\mu}_\text{batch})^2, $$ from which the standardized $\mathbf{u}'_b \in \mathbb{R}^q$ are computed such that $$ \begin{aligned} \mathbf{u}'_b &= \gamma\odot (\mathbf{u}_b - \hat{\mu}_\text{batch}) \odot \frac{1}{\hat{\sigma}_\text{batch} + \epsilon} + \beta \end{aligned} $$ where $\gamma, \beta \in \mathbb{R}^q$ are parameters to optimize.
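
A minimal sketch of this training-time computation, following the slide's $\hat{\sigma}_\text{batch} + \epsilon$ convention (frameworks typically use $\sqrt{\hat{\sigma}^2_\text{batch} + \epsilon}$ instead):

```python
import numpy as np

def batchnorm_train(U, gamma, beta, eps=1e-5):
    # U is a (B, q) batch; gamma and beta are learnable (q,) vectors.
    mu = U.mean(axis=0)
    sigma = U.std(axis=0)
    return gamma * (U - mu) / (sigma + eps) + beta
```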

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

.center[Exercise: How does batch normalization combine with backpropagation?]


class: middle

During inference, batch normalization shifts and rescales each component according to the empirical moments estimated during training: $$\mathbf{u}' = \gamma \odot (\mathbf{u} - \hat{\mu}) \odot \frac{1}{\hat{\sigma}} + \beta.$$

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

.center.width-100[]

.footnote[Credits: Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]


class: middle

The position of batch normalization relative to the non-linearity is not clear.

.center.width-50[]

.footnote[Credits: Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]


class: middle

Layer normalization

Given a single input sample $\mathbf{x}$, a similar approach can be applied to standardize the activations $\mathbf{u}$ across a layer instead of doing it over the batch.
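
A sketch of the corresponding computation for one sample, standardizing across the $q$ components instead of across the batch:

```python
import numpy as np

def layernorm(u, gamma, beta, eps=1e-5):
    # u is a single (q,) activation vector; statistics are per sample.
    mu = u.mean()
    sigma = u.std()
    return gamma * (u - mu) / (sigma + eps) + beta
```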


class: end-slide, center count: false

The end.