f(w) = \tfrac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbf{R}^n.
- Assume A is symmetric and invertible, then the optimal solution w^{\star} occurs at
+ Assume A is symmetric and invertible; then the optimal solution w^{\star}, when one exists, occurs at
w^{\star} = A^{-1}b.
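For completeness, the stationarity condition behind this is a one-line check (using the symmetry of A, so that \nabla(\tfrac{1}{2}w^{T}Aw)=Aw):

\nabla f(w) = Aw - b = 0 \quad\Longrightarrow\quad Aw^{\star} = b \quad\Longrightarrow\quad w^{\star} = A^{-1}b.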
@@ -333,13 +333,13 @@
First Steps: Gradient Descent
- Every symmetric matrix A has an eigenvalue decomposition
+ Every real symmetric matrix A has an eigenvalue decomposition
A=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^{T},\qquad Q = [q_1,\ldots,q_n],
- and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to biggest \lambda_n. If we perform a change of basis, x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
+ and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to largest \lambda_n. If we perform a change of basis, x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
\begin{aligned}
@@ -351,7 +351,7 @@
First Steps: Gradient Descent
Moving back to our original space w, we can see that
- w^k - w^\star = Qx^k=\sum_i^n x^0_i(1-\alpha\lambda_i)^k q_i
+ w^k - w^\star = Qx^k=\sum_{i=1}^{n} x^0_i(1-\alpha\lambda_i)^k q_i
and there we have it -- gradient descent in closed form.
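As a sanity check on this closed form (not part of the original derivation), here is a small numpy sketch that runs gradient descent on a randomly generated positive definite quadratic and compares every iterate against Q x^k; the matrix, step size, and seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic: A symmetric positive definite, b arbitrary.
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
w_star = np.linalg.solve(A, b)            # w* = A^{-1} b

lam, Q = np.linalg.eigh(A)                # A = Q diag(lam) Q^T, lam sorted ascending
alpha = 1.0 / lam[-1]                     # any step size with 0 < alpha < 2 / lambda_n converges
w = np.zeros(n)
x0 = Q.T @ (w - w_star)                   # x^0 = Q^T (w^0 - w*)

for k in range(1, 51):
    w = w - alpha * (A @ w - b)                        # gradient descent: w^{k+1} = w^k - alpha (A w^k - b)
    closed_form = Q @ (x0 * (1 - alpha * lam) ** k)    # Q x^k = sum_i x^0_i (1 - alpha lam_i)^k q_i
    assert np.allclose(w - w_star, closed_form)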
@@ -364,12 +364,12 @@
Decomposing the Error
For most step-sizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors' struggles are revealed. By writing the contributions of each eigenspace's error to the loss
- f(w^{k})-f(w^{\star})=\sum(1-\alpha\lambda_{i})^{2k}\lambda_{i}[x_{i}^{0}]^2
+ f(w^{k})-f(w^{\star})=\tfrac{1}{2}\sum_{i=1}^{n}(1-\alpha\lambda_{i})^{2k}\lambda_{i}(x_{i}^{0})^2
we can visualize the contributions of each error component to the loss.
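Concretely, here is a short numpy sketch (same kind of hypothetical quadratic as the earlier snippet; the starting point w^0 = 0, seed, and k are arbitrary) that checks the loss gap equals the sum of these per-eigenvalue terms:

import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # hypothetical symmetric positive definite A
b = rng.standard_normal(n)
w_star = np.linalg.solve(A, b)
lam, Q = np.linalg.eigh(A)
alpha, k = 1.0 / lam[-1], 10
x0 = Q.T @ (np.zeros(n) - w_star)         # error coordinates of the starting point w^0 = 0

def f(w):
    return 0.5 * w @ A @ w - b @ w

w_k = w_star + Q @ (x0 * (1 - alpha * lam) ** k)              # iterate from the closed form
per_eig = 0.5 * (1 - alpha * lam) ** (2 * k) * lam * x0 ** 2  # one error term per eigenvalue
assert np.isclose(f(w_k) - f(w_star), per_eig.sum())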
- The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we do a change of basis, by rotating w into Qw, and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
+ The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we perform a change of basis, rotating w into Qw and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
\text{model}(\xi)~=~x_{1}\bar{p}_{1}(\xi)~+~\cdots~+~x_{n}\bar{p}_{n}(\xi)\qquad \bar{p}_{i}=\sum q_{ij}p_j.
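To make the eigenfeatures concrete, here is a small numpy sketch on hypothetical data (6 sample points, monomial features p_j(\xi)=\xi^j), reading q_{ij} as the j-th entry of the eigenvector q_i: rotating the weights while counter-rotating the features leaves the predictions unchanged, and the rotated regression decouples because \bar{Z}^T\bar{Z} is diagonal.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: 6 points, monomial features p_j(xi) = xi**j.
xi = np.linspace(-1, 1, 6)
Z = np.vander(xi, N=6, increasing=True)   # Z[k, j] = p_j(xi_k)

lam, Q = np.linalg.eigh(Z.T @ Z)          # columns q_i of Q are the eigenvectors of Z^T Z
Z_bar = Z @ Q                             # column i evaluates pbar_i = sum_j (q_i)_j * p_j

w = rng.standard_normal(6)                # arbitrary coefficients in the original basis
x = Q.T @ w                               # the same model expressed in the eigenbasis

assert np.allclose(Z @ w, Z_bar @ x)                  # identical predictions
assert np.allclose(Z_bar.T @ Z_bar, np.diag(lam))     # the rotated regression decouples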
@@ -690,7 +691,7 @@
Example: Polynomial Regression
- The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by its sensitivity to perturbations in d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).
+ The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. The eigenfeatures, the principal components of the data, give us exactly the decomposition we need to order the features by the model's sensitivity to perturbations in d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).
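One way to make this precise, under the added assumption (not stated above) that the observations are corrupted by i.i.d. noise \varepsilon with \text{Cov}(\varepsilon)=\sigma^{2}I: writing \Lambda=\text{diag}(\lambda_{1},\ldots,\lambda_{n}) and taking the eigenbasis coefficients of the least-squares fit to be x=Q^{T}(Z^{T}Z)^{-1}Z^{T}d=\Lambda^{-1}Q^{T}Z^{T}d, we get

\text{Cov}(x)=\Lambda^{-1}Q^{T}Z^{T}(\sigma^{2}I)ZQ\Lambda^{-1}=\sigma^{2}\Lambda^{-1},\qquad\text{i.e.}\qquad\text{Var}(x_{i})=\frac{\sigma^{2}}{\lambda_{i}},

so the coefficients attached to the smallest eigenvalues are exactly the ones most sensitive to perturbations in the d_i's.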
@@ -789,11 +790,11 @@
Example: Polynomial Regression
- This effect is harnessed with the heuristic of early stopping : by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regression. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay.In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
+ This effect is harnessed with the heuristic of early stopping: by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regularization. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay. In Tikhonov Regularization we add a quadratic penalty to the loss function, minimizing
\text{minimize}\qquad\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}=\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Z^{T}d)^{T}w
-Recall that Z^{T}Z=Q\ \text{diag}(\Lambda_{1},\ldots,\Lambda_{n})\ Q^T. The solution to Tikhonov Regression is therefore
+Recall that Z^{T}Z=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^T. The solution to Tikhonov Regularization is therefore
(Z^{T}Z+\eta I)^{-1}(Z^{T}d)=Q\ \text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^{T}(Z^{T}d)
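A small numpy sketch on hypothetical data (a 6-point polynomial fit; the noise level, step size, k, and \eta are arbitrary) illustrating the parallel: k steps of gradient descent from w = 0 and Tikhonov Regularization both act diagonally on the spectrum of Z^T Z, with per-eigenvalue factors (1-(1-\alpha\lambda_{i})^{k})/\lambda_{i} and 1/(\lambda_{i}+\eta) respectively.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: 6 points, degree-5 polynomial features.
xi = np.linspace(-1, 1, 6)
Z = np.vander(xi, N=6, increasing=True)
d = np.sin(3 * xi) + 0.1 * rng.standard_normal(6)

lam, Q = np.linalg.eigh(Z.T @ Z)
alpha = 1.0 / lam[-1]

# Early stopping: k gradient descent steps from w = 0 apply the factor
# (1 - (1 - alpha*lam)**k) / lam to each eigencomponent of Z^T d.
k = 25
w = np.zeros(6)
for _ in range(k):
    w = w - alpha * (Z.T @ (Z @ w - d))
gd_factor = (1 - (1 - alpha * lam) ** k) / lam
assert np.allclose(w, Q @ (gd_factor * (Q.T @ (Z.T @ d))))

# Tikhonov Regularization applies the factor 1 / (lam + eta) instead.
eta = 0.1
w_tik = np.linalg.solve(Z.T @ Z + eta * np.eye(6), Z.T @ d)
tik_factor = 1.0 / (lam + eta)
assert np.allclose(w_tik, Q @ (tik_factor * (Q.T @ (Z.T @ d))))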