From 6a770bdcf3020eba2600d5faaf4bd7b726a71291 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:14 -0500
Subject: [PATCH 1/7] State condition (might have a saddle)

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index e047563..96db438 100644
--- a/public/index.html
+++ b/public/index.html
@@ -263,7 +263,7 @@

First Steps: Gradient Descent

 f(w) = \tfrac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbf{R}^n.
- Assume A is symmetric and invertible, then the optimal solution w^{\star} occurs at
+ Assume A is symmetric and invertible. Then the optimal solution w^{\star}, if one exists, occurs at
 w^{\star} = A^{-1}b.
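A quick numerical check of the formula above, sketched in JavaScript (the 2x2 matrix A and the vector b are made-up values, not from the article or the patch): solving Aw = b should give a point where the gradient Aw - b vanishes.

// Hand-picked symmetric, invertible A and b (illustrative values only).
const A = [[2, 1], [1, 2]];
const b = [1, 0];
const det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
const wStar = [                          // w* = A^{-1} b via the 2x2 inverse formula
  ( A[1][1] * b[0] - A[0][1] * b[1]) / det,
  (-A[1][0] * b[0] + A[0][0] * b[1]) / det
];
const grad = [                           // grad f(w*) = A w* - b
  A[0][0] * wStar[0] + A[0][1] * wStar[1] - b[0],
  A[1][0] * wStar[0] + A[1][1] * wStar[1] - b[1]
];
console.log(wStar, grad);                // gradient should be ~[0, 0]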
From f5c0f554abdbcf69c6ae086e0ce06f41a69f4497 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:15 -0500
Subject: [PATCH 2/7] Fix change of basis

---
 public/index.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/public/index.html b/public/index.html
index 96db438..79940c7 100644
--- a/public/index.html
+++ b/public/index.html
@@ -333,13 +333,13 @@

First Steps: Gradient Descent

- Every symmetric matrix A has an eigenvalue decomposition
+ Every real symmetric matrix A has an eigenvalue decomposition
 A=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^{T},\qquad Q = [q_1,\ldots,q_n],
- and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to biggest \lambda_n. If we perform a change of basis, x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
+ and, as per convention, we will assume that the \lambda_i's are sorted, from smallest \lambda_1 to biggest \lambda_n. If we express the error in the eigenbasis with the change of basis x^{k} = Q^T(w^{k} - w^\star), the iterations break apart, becoming:
 \begin{aligned}
@@ -351,7 +351,7 @@

First Steps: Gradient Descent

 Moving back to our original space w, we can see that
- w^k - w^\star = Qx^k=\sum_i^n x^0_i(1-\alpha\lambda_i)^k q_i
+ w^k - w^\star = Qx^k=\sum_{i=1}^{n} x^0_i(1-\alpha\lambda_i)^k q_i
 and there we have it -- gradient descent in closed form.
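A minimal JavaScript sketch of the closed form above, under a hand-picked 2x2 quadratic (A, b, the step size, and the iteration count are illustrative values, not from the article): plain gradient descent and the eigenbasis formula should produce the same iterate.

// A = [[3,1],[1,3]] has eigenvalues 2 and 4 with orthonormal eigenvectors
// q1 = [1,-1]/sqrt(2) and q2 = [1,1]/sqrt(2); b and alpha are made up.
const A = [[3, 1], [1, 3]], b = [2, 0], alpha = 0.2, steps = 25;
const lambda = [2, 4];
const s = Math.SQRT1_2;
const Q = [[s, s], [-s, s]];                 // columns are q1, q2
const wStar = [0.75, -0.25];                 // A^{-1} b for this A, b

// Plain gradient descent: w <- w - alpha * (A w - b), starting from w = 0.
let w = [0, 0];
for (let k = 0; k < steps; k++) {
  const grad = [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
                A[1][0] * w[0] + A[1][1] * w[1] - b[1]];
  w = [w[0] - alpha * grad[0], w[1] - alpha * grad[1]];
}

// Closed form: w^k = w* + sum_i x0_i (1 - alpha*lambda_i)^k q_i, with x0 = Q^T (w^0 - w*).
const w0err = [0 - wStar[0], 0 - wStar[1]];
const x0 = [Q[0][0] * w0err[0] + Q[1][0] * w0err[1],   // q1 . (w0 - w*)
            Q[0][1] * w0err[0] + Q[1][1] * w0err[1]];  // q2 . (w0 - w*)
const closed = [0, 1].map(j =>
  wStar[j]
  + x0[0] * Math.pow(1 - alpha * lambda[0], steps) * Q[j][0]
  + x0[1] * Math.pow(1 - alpha * lambda[1], steps) * Q[j][1]);

console.log(w, closed);                      // the two should agree to numerical precision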
@@ -599,7 +599,7 @@

Example: Polynomial Regression

- The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we do a change of basis, by rotating w into Qw, and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
+ The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z^T Z). So let's recast our regression problem in the basis of Q. First, we do a change of basis, by rotating the weights w into the coefficient vector x = Qw, and counter-rotating our feature maps p into eigenspace, \bar{p}. We can now conceptualize the same regression as one over a different polynomial basis, with the model
 \text{model}(\xi)~=~x_{1}\bar{p}_{1}(\xi)~+~\cdots~+~x_{n}\bar{p}_{n}(\xi)\qquad \bar{p}_{i}=\sum q_{ij}p_j.
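A small JavaScript sketch of the eigenfeature construction \bar{p}_{i}=\sum q_{ij}p_j above, assuming a made-up 2x2 orthogonal Q standing in for the eigenvectors of Z^T Z (in the article Q would come from that eigendecomposition); each row i of Q mixes the original features into \bar{p}_i.

const p = [xi => 1, xi => xi];              // original features p_1(ξ)=1, p_2(ξ)=ξ
const s = Math.SQRT1_2;
const Q = [[s, s], [-s, s]];                // hypothetical eigenvector matrix
const pbar = Q.map(row => (xi => row.reduce((acc, qij, j) => acc + qij * p[j](xi), 0)));
// The regression model over the new basis: model(ξ) = x_1 pbar_1(ξ) + ... + x_n pbar_n(ξ)
const model = (x, xi) => x.reduce((acc, xc, i) => acc + xc * pbar[i](xi), 0);
console.log(pbar[0](2), pbar[1](2), model([1, 0.5], 2));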
From 6a0e609e0ce142b4bae0bd65cadb05b154fd2bb7 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:15 -0500
Subject: [PATCH 3/7] Use parentheses not square brackets

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 79940c7..9574bbf 100644
--- a/public/index.html
+++ b/public/index.html
@@ -364,7 +364,7 @@

Decomposing the Error

 For most step-sizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors' struggles are revealed. By writing the contributions of each eigenspace's error to the loss
- f(w^{k})-f(w^{\star})=\sum(1-\alpha\lambda_{i})^{2k}\lambda_{i}[x_{i}^{0}]^2
+ f(w^{k})-f(w^{\star})=\sum(1-\alpha\lambda_{i})^{2k}\lambda_{i}(x_{i}^{0})^2
 we can visualize the contributions of each error component to the loss.

From 605353dbdc12af1e365dedbffffbd1e4ae09d786 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 4/7] Rephrase

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 9574bbf..fafa97e 100644
--- a/public/index.html
+++ b/public/index.html
@@ -369,7 +369,7 @@

Decomposing the Error

we can visualize the contributions of each error component to the loss.

- Optimization can be seen as combination of several component problems, shown here as 1 2 3 with eigenvalues \lambda_1=0.01, \lambda_2=0.1, and \lambda_3=1 respectively.
+ The loss can be seen as a combination of several component losses, shown here as 1 2 3 with eigenvalues \lambda_1=0.01, \lambda_2=0.1, and \lambda_3=1 respectively.
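A short JavaScript sketch of the component losses named in the caption above (the step size and the initial errors x_i^0 are made-up values): each component contributes (1-\alpha\lambda_i)^{2k}\lambda_i(x_i^0)^2 to the loss.

const lambdas = [0.01, 0.1, 1];
const alpha = 1.0, x0 = [1, 1, 1];          // illustrative step size and initial errors
for (let k = 0; k <= 20; k += 5) {
  const parts = lambdas.map((lam, i) =>
    Math.pow(1 - alpha * lam, 2 * k) * lam * x0[i] * x0[i]);
  const total = parts.reduce((a, v) => a + v, 0);
  console.log(k, parts.map(v => v.toFixed(4)), total.toFixed(4));
}
// At alpha = 1 the lambda_3 = 1 component vanishes after one step, while the
// lambda_1 = 0.01 component lingers: the explosion of progress followed by a slowdown.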
From f56187e4785ff8df59f9689a1a19ea43ae5f1523 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 5/7] Init slider on load and use more precision

---
 public/index.html | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/public/index.html b/public/index.html
index fafa97e..4ad7d0a 100644
--- a/public/index.html
+++ b/public/index.html
@@ -417,21 +417,22 @@

Decomposing the Error

         var alphaHTML = MathCache("alpha-equals");
+        var optimal_pt = 1.98005;
         var slidera = sliderGen([250, 80])
-            .ticks([0,1,200/(101),2])
+            .ticks([0, 1, optimal_pt, 2])
             .change( function (i) {
-                var html = alphaHTML + '' + i.toPrecision(4) + "";
+                var html = alphaHTML + '' + i.toPrecision(5) + "";
                 d3.select("#stepSizeMilestones")
                   .html("Stepsize " + html )
                 updateSliderGD(i,0.000)
               } )
             .ticktitles(function(d,i) { return [0,1,"",2][i] })
-            .startxval(200/(101))
+            .startxval(optimal_pt)
             .cRadius(7)
             .shifty(-12)
             .shifty(10)
             .margins(20,20)(d3.select("#sliderStep"))
-
+        slidera.init()
         // renderDraggable(svg, [133.5, 23], [114.5, 90], 2, " ").attr("opacity", 0.1)
         // renderDraggable(svg, [133.5, 88], [115.5, 95], 2, " ").attr("opacity", 0.1)

From 17c2700c033ce86d74622c7115a2c032c84aa8fc Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:47:16 -0500
Subject: [PATCH 6/7] Rephrase

---
 public/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/public/index.html b/public/index.html
index 4ad7d0a..47d3739 100644
--- a/public/index.html
+++ b/public/index.html
@@ -474,7 +474,7 @@

Decomposing the Error

             .attr("dx", -295)
             .attr("text-anchor", "start")
             .attr("fill", "gray")
-            .text("At the optimum, the rates of convergence of the largest and smallest eigenvalues equalize.")
+            .text("At the optimal step size, the rates of convergence of the largest and smallest eigenvalue components equalize.")
           callback(null);
         });
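A small JavaScript sketch of the equalization claim in the annotation above, assuming the illustrative eigenvalues 0.01 and 1 from the earlier caption (the demo's optimal_pt constant presumably reflects its own eigenvalues): the step size that makes |1 - alpha*lambda_min| equal |1 - alpha*lambda_max| is 2 / (lambda_min + lambda_max).

const lambdaMin = 0.01, lambdaMax = 1;       // illustrative eigenvalues
const alphaOpt = 2 / (lambdaMin + lambdaMax);
const rateSlow = Math.abs(1 - alphaOpt * lambdaMin);
const rateFast = Math.abs(1 - alphaOpt * lambdaMax);
console.log(alphaOpt.toFixed(5), rateSlow.toFixed(5), rateFast.toFixed(5));
// alphaOpt ~ 1.98020, and both rates equal (lambdaMax - lambdaMin)/(lambdaMax + lambdaMin) ~ 0.98020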
From f036fcf5b1c946dd1f762065146d970ebeb07b91 Mon Sep 17 00:00:00 2001
From: George Fu
Date: Sun, 11 Aug 2019 13:50:25 -0500
Subject: [PATCH 7/7] Tikhonov Regularization

---
 public/index.html | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/public/index.html b/public/index.html
index 47d3739..7a2d7d9 100644
--- a/public/index.html
+++ b/public/index.html
@@ -691,7 +691,7 @@

Example: Polynomial Regression

- The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by its sensitivity to perturbations in d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).
+ The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. The eigenfeatures, the principal components of the data, give us exactly the decomposition we need to order the features by the model's sensitivity to perturbations in the d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).

@@ -790,11 +790,11 @@

Example: Polynomial Regression

- This effect is harnessed with the heuristic of early stopping : by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regression. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay.In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
+ This effect is harnessed with the heuristic of early stopping: by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov Regularization. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay. In Tikhonov Regularization we add a quadratic penalty to the loss function, minimizing
 \text{minimize}\qquad\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}=\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Zd)^{T}w
- Recall that Z^{T}Z=Q\ \text{diag}(\Lambda_{1},\ldots,\Lambda_{n})\ Q^T. The solution to Tikhonov Regression is therefore
+ Recall that Z^{T}Z=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^T. The solution to Tikhonov Regularization is therefore
 (Z^{T}Z+\eta I)^{-1}(Zd)=Q\ \text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^T(Zd)
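A short JavaScript sketch comparing the two spectral-decay filters discussed above, under made-up values of the eigenvalues, \eta, \alpha, and k: Tikhonov scales the i-th component of the least-squares solution by \lambda_i/(\lambda_i+\eta), while gradient descent started from w = 0 and stopped after k steps scales it by 1-(1-\alpha\lambda_i)^k.

const lambdas = [0.01, 0.1, 1];              // illustrative eigenvalues of Z^T Z
const eta = 0.1, alpha = 1.0, k = 20;        // illustrative penalty, step size, stopping time
const tikhonovFilter = lambdas.map(lam => lam / (lam + eta));
const earlyStopFilter = lambdas.map(lam => 1 - Math.pow(1 - alpha * lam, k));
console.log(tikhonovFilter.map(v => v.toFixed(3)));   // [ '0.091', '0.500', '0.909' ]
console.log(earlyStopFilter.map(v => v.toFixed(3)));  // [ '0.182', '0.878', '1.000' ]
// Both filters pass the large-eigenvalue (robust) components nearly untouched and
// shrink the small-eigenvalue (noise-sensitive) components toward zero.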