From 6a770bdcf3020eba2600d5faaf4bd7b726a71291 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:14 -0500
Subject: [PATCH 1/7] State condition (might have a saddle)

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index e047563..96db438 100644
--- a/public/index.html
+++ b/public/index.html
@@ -263,7 +263,7 @@

First Steps: Gradient Descent

 f(w) = \tfrac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbf{R}^n.
- Assume A is symmetric and invertible, then the optimal solution w^{\star} occurs at
+ Assume A is symmetric and invertible. Then the optimal solution w^{\star}, if one exists, occurs at
 w^{\star} = A^{-1}b.
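A quick numerical check of the formula above, sketched in JavaScript (the 2x2 matrix A and the vector b are made-up values, not from the article or the patch): solving Aw = b should give a point where the gradient Aw - b vanishes.

// Hand-picked symmetric, invertible A and b (illustrative values only).
const A = [[2, 1], [1, 2]];
const b = [1, 0];
const det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
const wStar = [                          // w* = A^{-1} b via the 2x2 inverse formula
  ( A[1][1] * b[0] - A[0][1] * b[1]) / det,
  (-A[1][0] * b[0] + A[0][0] * b[1]) / det
];
const grad = [                           // grad f(w*) = A w* - b
  A[0][0] * wStar[0] + A[0][1] * wStar[1] - b[0],
  A[1][0] * wStar[0] + A[1][1] * wStar[1] - b[1]
];
console.log(wStar, grad);                // gradient should be ~[0, 0]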
From f5c0f554abdbcf69c6ae086e0ce06f41a69f4497 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:15 -0500
Subject: [PATCH 2/7] Fix change of basis

---
 public/index.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/public/index.html b/public/index.html
index 96db438..79940c7 100644
--- a/public/index.html
+++ b/public/index.html
@@ -333,13 +333,13 @@

First Steps: Gradient Descent

- Every symmetric matrix A has an eigenvalue decomposition
+ Every real symmetric matrix A has an eigenvalue decomposition
 A=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^{T},\qquad Q = [q_1,\ldots,q_n],
- and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to biggest \lambda_n. If we perform a change of basis, x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
+ and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to biggest \lambda_n. If we express the error in the eigenbasis with the change of basis x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
 \begin{aligned}
@@ -351,7 +351,7 @@

First Steps: Gradient Descent

 Moving back to our original space w, we can see that
- w^k - w^\star = Qx^k=\sum_i^n x^0_i(1-\alpha\lambda_i)^k q_i
+ w^k - w^\star = Qx^k=\sum_{i=1}^{n} x^0_i(1-\alpha\lambda_i)^k q_i
 and there we have it -- gradient descent in closed form.
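A minimal JavaScript sketch of the closed form above, under a hand-picked 2x2 quadratic (A, b, the step size, and the iteration count are illustrative values, not from the article): plain gradient descent and the eigenbasis formula should produce the same iterate.

// A = [[3,1],[1,3]] has eigenvalues 2 and 4 with orthonormal eigenvectors
// q1 = [1,-1]/sqrt(2) and q2 = [1,1]/sqrt(2); b and alpha are made up.
const A = [[3, 1], [1, 3]], b = [2, 0], alpha = 0.2, steps = 25;
const lambda = [2, 4];
const s = Math.SQRT1_2;
const Q = [[s, s], [-s, s]];                 // columns are q1, q2
const wStar = [0.75, -0.25];                 // A^{-1} b for this A, b

// Plain gradient descent: w <- w - alpha * (A w - b), starting from w = 0.
let w = [0, 0];
for (let k = 0; k < steps; k++) {
  const grad = [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
                A[1][0] * w[0] + A[1][1] * w[1] - b[1]];
  w = [w[0] - alpha * grad[0], w[1] - alpha * grad[1]];
}

// Closed form: w^k = w* + sum_i x0_i (1 - alpha*lambda_i)^k q_i, with x0 = Q^T (w^0 - w*).
const w0err = [0 - wStar[0], 0 - wStar[1]];
const x0 = [Q[0][0] * w0err[0] + Q[1][0] * w0err[1],   // q1 . (w0 - w*)
            Q[0][1] * w0err[0] + Q[1][1] * w0err[1]];  // q2 . (w0 - w*)
const closed = [0, 1].map(j =>
  wStar[j]
  + x0[0] * Math.pow(1 - alpha * lambda[0], steps) * Q[j][0]
  + x0[1] * Math.pow(1 - alpha * lambda[1], steps) * Q[j][1]);

console.log(w, closed);                      // the two should agree to numerical precision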
@@ -599,7 +599,7 @@

Example: Polynomial Regression

- The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we do a change of basis, by rotating w into Qw, and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
+ The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we do a change of basis, by rotating the weights w into the coefficient vector x = Qw, and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
 \text{model}(\xi)~=~x_{1}\bar{p}_{1}(\xi)~+~\cdots~+~x_{n}\bar{p}_{n}(\xi)\qquad \bar{p}_{i}=\sum q_{ij}p_j.
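A small JavaScript sketch of the eigenfeature construction \bar{p}_{i}=\sum q_{ij}p_j above, assuming a made-up 2x2 orthogonal Q standing in for the eigenvectors of Z^T Z (in the article Q would come from that eigendecomposition); each row i of Q mixes the original features into \bar{p}_i.

const p = [xi => 1, xi => xi];              // original features p_1(ξ)=1, p_2(ξ)=ξ
const s = Math.SQRT1_2;
const Q = [[s, s], [-s, s]];                // hypothetical eigenvector matrix
const pbar = Q.map(row => (xi => row.reduce((acc, qij, j) => acc + qij * p[j](xi), 0)));
// The regression model over the new basis: model(ξ) = x_1 pbar_1(ξ) + ... + x_n pbar_n(ξ)
const model = (x, xi) => x.reduce((acc, xc, i) => acc + xc * pbar[i](xi), 0);
console.log(pbar[0](2), pbar[1](2), model([1, 0.5], 2));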
From 6a0e609e0ce142b4bae0bd65cadb05b154fd2bb7 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:15 -0500
Subject: [PATCH 3/7] Use parentheses not square brackets

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 79940c7..9574bbf 100644
--- a/public/index.html
+++ b/public/index.html
@@ -364,7 +364,7 @@

Decomposing the Error

 For most step-sizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors' struggles are revealed. By writing the contributions of each eigenspace's error to the loss
- f(w^{k})-f(w^{\star})=\sum(1-\alpha\lambda_{i})^{2k}\lambda_{i}[x_{i}^{0}]^2
+ f(w^{k})-f(w^{\star})=\sum(1-\alpha\lambda_{i})^{2k}\lambda_{i}(x_{i}^{0})^2
 we can visualize the contributions of each error component to the loss.

From 605353dbdc12af1e365dedbffffbd1e4ae09d786 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 4/7] Rephrase

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 9574bbf..fafa97e 100644
--- a/public/index.html
+++ b/public/index.html
@@ -369,7 +369,7 @@

Decomposing the Error

we can visualize the contributions of each error component to the loss.

- Optimization can be seen as combination of several component problems, shown here as 1 2 3 with eigenvalues \lambda_1=0.01, \lambda_2=0.1, and \lambda_3=1 respectively.
+ The loss can be seen as a combination of several component losses, shown here as 1 2 3 with eigenvalues \lambda_1=0.01, \lambda_2=0.1, and \lambda_3=1 respectively.
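A short JavaScript sketch of the component losses named in the caption above (the step size and the initial errors x_i^0 are made-up values): each component contributes (1-\alpha\lambda_i)^{2k}\lambda_i(x_i^0)^2 to the loss.

const lambdas = [0.01, 0.1, 1];
const alpha = 1.0, x0 = [1, 1, 1];          // illustrative step size and initial errors
for (let k = 0; k <= 20; k += 5) {
  const parts = lambdas.map((lam, i) =>
    Math.pow(1 - alpha * lam, 2 * k) * lam * x0[i] * x0[i]);
  const total = parts.reduce((a, v) => a + v, 0);
  console.log(k, parts.map(v => v.toFixed(4)), total.toFixed(4));
}
// At alpha = 1 the lambda_3 = 1 component vanishes after one step, while the
// lambda_1 = 0.01 component lingers: the explosion of progress followed by a slowdown.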
From f56187e4785ff8df59f9689a1a19ea43ae5f1523 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 5/7] Init slider on load and use more precision

---
 public/index.html | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/public/index.html b/public/index.html
index fafa97e..4ad7d0a 100644
--- a/public/index.html
+++ b/public/index.html
@@ -417,21 +417,22 @@

Decomposing the Error

         var alphaHTML = MathCache("alpha-equals");
+        var optimal_pt = 1.98005;
         var slidera = sliderGen([250, 80])
-            .ticks([0,1,200/(101),2])
+            .ticks([0, 1, optimal_pt, 2])
             .change( function (i) {
-                var html = alphaHTML + '' + i.toPrecision(4) + "";
+                var html = alphaHTML + '' + i.toPrecision(5) + "";
                 d3.select("#stepSizeMilestones")
                   .html("Stepsize " + html )
                 updateSliderGD(i,0.000)
               } )
             .ticktitles(function(d,i) { return [0,1,"",2][i] })
-            .startxval(200/(101))
+            .startxval(optimal_pt)
             .cRadius(7)
             .shifty(-12)
             .shifty(10)
             .margins(20,20)(d3.select("#sliderStep"))
-
+        slidera.init()
         // renderDraggable(svg, [133.5, 23], [114.5, 90], 2, " ").attr("opacity", 0.1)
         // renderDraggable(svg, [133.5, 88], [115.5, 95], 2, " ").attr("opacity", 0.1)

From 17c2700c033ce86d74622c7115a2c032c84aa8fc Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 6/7] Rephrase

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 4ad7d0a..47d3739 100644
--- a/public/index.html
+++ b/public/index.html
@@ -474,7 +474,7 @@

Decomposing the Error

             .attr("dx", -295)
             .attr("text-anchor", "start")
             .attr("fill", "gray")
-            .text("At the optimum, the rates of convergence of the largest and smallest eigenvalues equalize.")
+            .text("At the optimal step size, the rates of convergence of the largest and smallest eigenvalue components equalize.")
           callback(null);
         });
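A small JavaScript sketch of the equalization claim in the annotation above, assuming the illustrative eigenvalues 0.01 and 1 from the earlier caption (the demo's optimal_pt constant presumably reflects its own eigenvalues): the step size that makes |1 - alpha*lambda_min| equal |1 - alpha*lambda_max| is 2 / (lambda_min + lambda_max).

const lambdaMin = 0.01, lambdaMax = 1;       // illustrative eigenvalues
const alphaOpt = 2 / (lambdaMin + lambdaMax);
const rateSlow = Math.abs(1 - alphaOpt * lambdaMin);
const rateFast = Math.abs(1 - alphaOpt * lambdaMax);
console.log(alphaOpt.toFixed(5), rateSlow.toFixed(5), rateFast.toFixed(5));
// alphaOpt ~ 1.98020, and both rates equal (lambdaMax - lambdaMin)/(lambdaMax + lambdaMin) ~ 0.98020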
From f036fcf5b1c946dd1f762065146d970ebeb07b91 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:50:25 -0500
Subject: [PATCH 7/7] Tikhonov Regularization

---
 public/index.html | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/public/index.html b/public/index.html
index 47d3739..7a2d7d9 100644
--- a/public/index.html
+++ b/public/index.html
@@ -691,7 +691,7 @@

Example: Polynomial Regression

- The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by its sensitivity to perturbations in d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).
+ The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. The eigenfeatures, the principal components of the data, give us exactly the decomposition we need to order the features by the model's sensitivity to perturbations in the d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).

@@ -790,11 +790,11 @@

Example: Polynomial Regression

- This effect is harnessed with the heuristic of early stopping : by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regression. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay.In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
+ This effect is harnessed with the heuristic of early stopping: by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regularization. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay. In Tikhonov Regularization we add a quadratic penalty to the loss function, minimizing
 \text{minimize}\qquad\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}=\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Zd)^{T}w
- Recall that Z^{T}Z=Q\ \text{diag}(\Lambda_{1},\ldots,\Lambda_{n})\ Q^T. The solution to Tikhonov Regression is therefore
+ Recall that Z^{T}Z=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^T. The solution to Tikhonov Regularization is therefore
 (Z^{T}Z+\eta I)^{-1}(Zd)=Q\ \text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^T(Zd)
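A short JavaScript sketch comparing the two spectral-decay filters discussed above, under made-up values of the eigenvalues, \eta, \alpha, and k: Tikhonov scales the i-th component of the least-squares solution by \lambda_i/(\lambda_i+\eta), while gradient descent started from w = 0 and stopped after k steps scales it by 1-(1-\alpha\lambda_i)^k.

const lambdas = [0.01, 0.1, 1];              // illustrative eigenvalues of Z^T Z
const eta = 0.1, alpha = 1.0, k = 20;        // illustrative penalty, step size, stopping time
const tikhonovFilter = lambdas.map(lam => lam / (lam + eta));
const earlyStopFilter = lambdas.map(lam => 1 - Math.pow(1 - alpha * lam, k));
console.log(tikhonovFilter.map(v => v.toFixed(3)));   // [ '0.091', '0.500', '0.909' ]
console.log(earlyStopFilter.map(v => v.toFixed(3)));  // [ '0.182', '0.878', '1.000' ]
// Both filters pass the large-eigenvalue (robust) components nearly untouched and
// shrink the small-eigenvalue (noise-sensitive) components toward zero.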