Vanilla gradient descent assumes the isotropy of the curvature, so that the same step size $\gamma$ applies to all parameters.
.center[Isotropic vs. anisotropic curvature.]
AdaGrad: scale down the step for each parameter by the square root of the sum of squares of all its past gradients.
RMSProp: same as AdaGrad, but accumulate an exponentially decaying average of past squared gradients instead of their full sum.
Adam: similar to RMSProp with momentum, but with bias correction terms for the first and second moments (see the sketch below).
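To make the update rule concrete, here is a minimal NumPy sketch of one Adam step; the function name, argument names, and default hyper-parameter values are illustrative choices, not part of the original slides.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # theta: parameters, g: gradient at theta, m/v: first/second moment estimates,
    # t: step counter starting at 1 (needed for the bias correction terms).
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g**2     # decaying average of squared gradients
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return theta, m, v
```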
class: middle
.center.width-60[ ]
.footnote[Credits: Kingma and Ba, Adam: A Method for Stochastic Optimization, 2014.]
Even with per-parameter adaptive learning rate methods, it is usually helpful to anneal the learning rate $\gamma$ over time.
Step decay: reduce the learning rate by some factor every few epochs (e.g., by half every 10 epochs).
Exponential decay: $\gamma_t = \gamma_0 \exp(-kt)$, where $\gamma_0$ and $k$ are hyper-parameters.
$1/t$ decay: $\gamma_t = \gamma_0 / (1+kt)$, where $\gamma_0$ and $k$ are hyper-parameters.
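As an illustration, the three schedules can be written as plain functions of the epoch $t$; the constants below are placeholder values, not recommendations.

```python
import numpy as np

gamma_0, k = 0.1, 0.05   # placeholder hyper-parameter values

def step_decay(t, factor=0.5, every=10):
    # reduce the learning rate by `factor` every `every` epochs
    return gamma_0 * factor ** (t // every)

def exponential_decay(t):
    # gamma_t = gamma_0 exp(-k t)
    return gamma_0 * np.exp(-k * t)

def inverse_decay(t):
    # gamma_t = gamma_0 / (1 + k t)
    return gamma_0 / (1 + k * t)
```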
class: middle
.center[
.width-70[ ]
.caption[Step decay scheduling for training ResNets.]
]
class: middle
In convex problems, provided a good learning rate $\gamma$, convergence is guaranteed regardless of the initial parameter values.
In the non-convex regime, initialization is much more important!
Little is known about the mathematics of initialization strategies for neural networks.
What is known: initialization should break symmetry.
What is known: the scale of the weights is important.
class: middle, center
See demo.
class: middle
Controlling for the variance in the forward pass
A first strategy is to initialize the network parameters such that activations preserve the same variance across layers.
Intuitively, this ensures that the information keeps flowing during the forward pass, without reducing or magnifying the magnitude of input signals exponentially.
class: middle
Let us assume that
we are in a linear regime at initialization (e.g., the positive part of a ReLU or the middle of a sigmoid),
weights $w_{ij}^l$ are initialized i.i.d. with zero mean,
biases $b^l$ are initialized to $0$,
input features are i.i.d., with a common variance denoted $\mathbb{V}[x]$.
Then, the variance of the activation $h_i^l$ of unit $i$ in layer $l$ is
$$
\begin{aligned}
\mathbb{V}\left[h_i^l\right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l-1}-1} w_{ij}^l h_j^{l-1} \right] \\
&= \sum_{j=0}^{q_{l-1}-1} \mathbb{V}\left[ w_{ij}^l h_j^{l-1} \right] \\
&= \sum_{j=0}^{q_{l-1}-1} \mathbb{V}\left[ w_{ij}^l \right] \mathbb{V}\left[ h_j^{l-1} \right]
\end{aligned}
$$
where $q_l$ is the width of layer $l$ and $h^0_j = x_j$ for all $j=0, \ldots, p-1$, with $p$ the number of input features.
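As a quick numerical sanity check (not part of the derivation), the variance identity above can be verified by Monte Carlo for zero-mean, independent weights and activations; the sizes and scales below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 20000, 300                        # Monte Carlo samples, width of layer l-1 (arbitrary)
w = rng.normal(0.0, 0.2, size=(n, q))    # i.i.d. zero-mean weights, V[w] = 0.04
h = rng.normal(0.0, 1.5, size=(n, q))    # i.i.d. zero-mean activations, V[h] = 2.25
a = (w * h).sum(axis=1)                  # n independent realizations of h_i^l
print(a.var(), q * 0.04 * 2.25)          # empirical vs. predicted variance (both close to 27)
```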
???
Use
V(AB) = V(A)V(B) + V(A)E(B)^2 + V(B)E(A)^2 (for independent A, B)
V(A+B) = V(A) + V(B) + 2 Cov(A,B)
class: middle
Since the weights $w_{ij}^l$ at layer $l$ share the same variance $\mathbb{V}\left[ w^l \right]$ and the activations of the previous layer all share the same variance, we can drop the indices and write
$$\mathbb{V}\left[h^l\right] = q_{l-1} \mathbb{V}\left[ w^l \right] \mathbb{V}\left[ h^{l-1} \right].$$
Therefore, the variance of the activations is preserved across layers when
$$\mathbb{V}\left[ w^l \right] = \frac{1}{q_{l-1}} \quad \forall l.$$
This condition is enforced in LeCun's uniform initialization, which is defined as
$$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{3}{q_{l-1}}}, \sqrt{\frac{3}{q_{l-1}}}\right].$$
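A minimal sketch of this initialization for a fully-connected layer with $q_{l-1}$ inputs and $q_l$ outputs (NumPy; the function and argument names are ours):

```python
import numpy as np

def lecun_uniform(q_in, q_out, seed=None):
    # U[-b, b] with b = sqrt(3/q_in) has variance b^2/3 = 1/q_in,
    # which preserves the activation variance in the forward pass.
    rng = np.random.default_rng(seed)
    bound = np.sqrt(3.0 / q_in)
    return rng.uniform(-bound, bound, size=(q_out, q_in))
```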
???
Var[ U(a,b) ] = 1/12 (b-a)^2
class: middle
Controlling for the variance in the backward pass
A similar idea can be applied to ensure that the gradients flow in the backward pass (without vanishing or exploding), by maintaining the variance of the gradient with respect to the activations fixed across layers.
Under the same assumptions as before,
$$\begin{aligned}
\mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h_i^l} \right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} \frac{\partial h_j^{l+1}}{\partial h_i^l} \right] \\
&= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} w_{ji}^{l+1} \right] \\
&= \sum_{j=0}^{q_{l+1}-1} \mathbb{V}\left[\frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}}\right] \mathbb{V}\left[ w_{ji}^{l+1} \right]
\end{aligned}$$
class: middle
If we further assume that
the gradients with respect to the activations at layer $l+1$ share the same variance,
the weights at layer $l+1$ share the same variance $\mathbb{V}\left[ w^{l+1} \right]$,
then we can drop the indices and write
$$
\mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^l} \right] = q_{l+1} \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^{l+1}} \right] \mathbb{V}\left[ w^{l+1} \right].
$$
Therefore, the variance of the gradients with respect to the activations is preserved across layers when
$$\mathbb{V}\left[ w^{l} \right] = \frac{1}{q_{l}} \quad \forall l.$$
class: middle
We have derived two different conditions on the variance of $w^l$,
$\mathbb{V}\left[w^l\right] = \frac{1}{q_{l-1}}$
$\mathbb{V}\left[w^l\right] = \frac{1}{q_{l}}$.
A compromise is the Xavier initialization, which initializes $w^l$ randomly from a distribution with variance
$$\mathbb{V}\left[w^l\right] = \frac{1}{\frac{q_{l-1}+q_l}{2}} = \frac{2}{q_{l-1}+q_l}.$$
For example, normalized initialization is defined as
$$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{6}{q_{l-1}+q_l}}, \sqrt{\frac{6}{q_{l-1}+q_l}}\right].$$
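And a matching sketch of the normalized (Xavier) initialization, under the same conventions as above:

```python
import numpy as np

def xavier_uniform(q_in, q_out, seed=None):
    # U[-b, b] with b = sqrt(6/(q_in + q_out)) has variance 2/(q_in + q_out),
    # the compromise between the forward and backward conditions.
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / (q_in + q_out))
    return rng.uniform(-bound, bound, size=(q_out, q_in))
```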
class: middle
.center.width-70[ ]
.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]
class: middle
.center.width-70[ ]
.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]
class: middle
Previous weight initialization strategies rely on keeping the activation variance constant across layers, under the assumption that the input features all have the same variance.
That is,
$$\mathbb{V}\left[x_i\right] = \mathbb{V}\left[x_j\right] \triangleq \mathbb{V}\left[x\right]$$
for all pairs of features $i, j$.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
In general, this constraint is not satisfied but can be enforced by standardizing the input data feature-wise,
$$\mathbf{x}' = (\mathbf{x} - \hat{\mu}) \odot \frac{1}{\hat{\sigma}},$$
where
$$
\begin{aligned}
\hat{\mu} = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} \mathbf{x} \quad\quad\quad \hat{\sigma}^2 = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} (\mathbf{x} - \hat{\mu})^2.
\end{aligned}
$$
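Feature-wise standardization as a short NumPy sketch; `X` is assumed to be an $N \times p$ data matrix, and the small `eps` (not in the formula above) only guards against constant features.

```python
import numpy as np

def standardize(X, eps=1e-8):
    mu = X.mean(axis=0)       # per-feature mean over the dataset
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / (sigma + eps)
```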
.center.width-100[ ]
.footnote[Credits: Scikit-Learn, Compare the effect of different scalers on data with outliers.]
Maintaining proper statistics of the activations and derivatives is critical for training neural networks.
This constraint can be enforced explicitly during the forward pass by re-normalizing them.
Batch normalization was the first method to introduce this idea.
.center.width-80[![](figures/lec5/bn.png)]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL; Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]
class: middle
During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.
At test time, it shifts and rescales according to the empirical moments estimated during training.
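A common way to obtain these test-time moments is to maintain exponential moving averages of the batch statistics during training, as sketched below; the `momentum` value is an illustrative choice.

```python
def update_running_moments(running_mu, running_var, mu_batch, var_batch, momentum=0.1):
    # blend the current batch statistics into the running estimates used at test time
    running_mu = (1 - momentum) * running_mu + momentum * mu_batch
    running_var = (1 - momentum) * running_var + momentum * var_batch
    return running_mu, running_var
```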
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center.width-50[ ]
Let us consider a minibatch of samples at training time, for which $\mathbf{u}_b \in \mathbb{R}^q$, $b=1, \ldots, B$, are intermediate values computed at some location in the computational graph.
In batch normalization following the node $\mathbf{u}$, the per-component mean and variance are first computed on the batch
$$
\hat{\mu}_\text{batch} = \frac{1}{B} \sum_{b=1}^B \mathbf{u}_b \quad\quad\quad \hat{\sigma}^2_\text{batch} = \frac{1}{B} \sum_{b=1}^B (\mathbf{u}_b - \hat{\mu}_\text{batch})^2,
$$
from which the normalized outputs $\mathbf{u}'_b \in \mathbb{R}^q$ are computed as
$$
\begin{aligned}
\mathbf{u}'_b &= \gamma\odot (\mathbf{u}_b - \hat{\mu}_\text{batch}) \odot \frac{1}{\hat{\sigma}_\text{batch} + \epsilon} + \beta
\end{aligned}
$$
where $\gamma, \beta \in \mathbb{R}^q$ are parameters to optimize.
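A minimal sketch of this training-time computation, following the slide's formulation (with $\hat{\sigma}_\text{batch} + \epsilon$ in the denominator; frameworks typically use $\sqrt{\hat{\sigma}^2_\text{batch} + \epsilon}$ instead):

```python
import numpy as np

def batchnorm_train(U, gamma, beta, eps=1e-5):
    # U: (B, q) minibatch of intermediate values; gamma, beta: (q,) learnable parameters
    mu_batch = U.mean(axis=0)        # per-component batch mean
    sigma_batch = U.std(axis=0)      # per-component batch standard deviation
    return gamma * (U - mu_batch) / (sigma_batch + eps) + beta
```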
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center[Exercise: How does batch normalization combine with backpropagation?]
class: middle
During inference, batch normalization shifts and rescales each component according to the empirical moments estimated during training:
$$\mathbf{u}' = \gamma \odot (\mathbf{u} - \hat{\mu}) \odot \frac{1}{\hat{\sigma}} + \beta.$$
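In practice, frameworks switch between batch statistics and running estimates automatically; for example, with PyTorch (assuming it is installed):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=64)   # tracks running mean/variance during training

bn.train()                             # training mode: uses batch statistics
y_train = bn(torch.randn(32, 64))      # and updates the running estimates

bn.eval()                              # inference mode: uses the running estimates
y_test = bn(torch.randn(32, 64))
```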
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center.width-100[ ]
.footnote[Credits: Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]
class: middle
Whether batch normalization should be placed before or after the non-linearity is not settled.
.center.width-50[ ]
.footnote[Credits: Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]
class: middle
Given a single input sample $\mathbf{x}$, a similar approach can be applied to standardize the activations $\mathbf{u}$ across a layer instead of doing it over the batch.
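A sketch of this per-sample variant (in the spirit of layer normalization), standardizing over the components of $\mathbf{u}$ for one sample:

```python
import numpy as np

def layernorm(u, gamma, beta, eps=1e-5):
    # u: (q,) activations of a single sample; gamma, beta: (q,) learnable parameters
    mu = u.mean()      # mean over the layer's components
    sigma = u.std()    # standard deviation over the layer's components
    return gamma * (u - mu) / (sigma + eps) + beta
```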
class: end-slide, center
count: false
The end.