
Commit

note 16 fix
ishani07 committed Apr 29, 2024
1 parent 1b35b8f commit 2785d77
Showing 2 changed files with 8 additions and 8 deletions.
cv_regularization/cv_reg.qmd (10 changes: 5 additions & 5 deletions)
@@ -279,8 +279,8 @@ $$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q$$
Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.

$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$


The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p |\theta_i|$ as its **L1 norm** equivalent form, $|| \theta ||_1$.
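For concreteness, here is a minimal numpy sketch of this augmented L1 objective. The helper name `lasso_objective`, the synthetic data, and the choice to leave the intercept $\theta_0$ unpenalized (matching the sum starting at $i = 1$) are illustrative assumptions, not code from the notes.

```python
import numpy as np

def lasso_objective(theta, X, Y, lam):
    """(1/n) * ||Y - X @ theta||_2^2 + lam * sum_{i>=1} |theta_i|.
    theta[0] is the intercept and is not penalized, matching the
    summation that starts at i = 1."""
    n = len(Y)
    mse = np.sum((Y - X @ theta) ** 2) / n
    l1_penalty = lam * np.sum(np.abs(theta[1:]))
    return mse + l1_penalty

# Hypothetical data: 100 rows, an intercept column plus 3 features.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
Y = X @ np.array([1.0, 2.0, 0.0, -3.0]) + rng.normal(scale=0.5, size=100)
print(lasso_objective(np.zeros(4), X, Y, lam=0.1))
```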
@@ -290,7 +290,7 @@ Notice that we've replaced the constraint with a second term in our objective function. We're now minimizing a function with an additional regularization term that *penalizes large coefficients*. In order to minimize this new objective function, we'll end up balancing two components:
1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$

-The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
+The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.

- Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. In order to neutralize the $\infty$ and minimize this term, we set $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. -->
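A quick way to see this extreme (and the opposite one, $\lambda \rightarrow 0$, where the fit approaches ordinary least squares) is with scikit-learn's `Lasso`. Note that sklearn scales the squared-error term by $\frac{1}{2n}$ rather than $\frac{1}{n}$, so its `alpha` is not numerically identical to the $\lambda$ above, but the limiting behavior is the same; the data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Huge regularization: every non-intercept coefficient is driven to 0,
# leaving only the intercept -- effectively the constant model.
print(Lasso(alpha=1e6).fit(X, y).coef_)        # ~ [0. 0. 0. 0.]

# Tiny regularization: the fit is essentially unregularized OLS.
print(Lasso(alpha=1e-6, max_iter=100000).fit(X, y).coef_)
print(LinearRegression().fit(X, y).coef_)      # nearly identical
```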

@@ -360,8 +360,8 @@ Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.

Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as:
$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$


The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p \theta_i^2$ as its **L2 norm** equivalent form, $|| \theta ||_2^2$.
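Because the L2 penalty is differentiable, this objective has a closed-form minimizer (the rendered page below notes this as well). Here is a hedged numpy sketch, assuming the $\frac{1}{n}$-scaled objective above and that every component of $\theta$ is penalized as in the $|| \theta ||_2^2$ form (the summation starting at $i = 1$ would exclude the intercept and needs a small modification): setting the gradient to zero gives $(\mathbb{X}^{\top}\mathbb{X} + n\lambda I)\theta = \mathbb{X}^{\top}\mathbb{Y}$.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Minimizer of (1/n)||Y - X theta||_2^2 + lam * ||theta||_2^2:
    solve (X^T X + n * lam * I) theta = X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.3, size=100)
print(ridge_closed_form(X, Y, lam=0.1))
print(ridge_closed_form(X, Y, lam=100.0))   # coefficients shrink toward 0
```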
docs/cv_regularization/cv_reg.html (6 changes: 3 additions & 3 deletions)
@@ -613,14 +613,14 @@ <h3 data-number="16.2.2" class="anchored" data-anchor-id="l1-lasso-regularizatio
<p>To apply our constraint, we need to rephrase our minimization goal as:</p>
<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q\]</span></p>
<p>Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the <a href="https://en.wikipedia.org/wiki/Duality_(optimization)">Lagrangian Duality</a>. The details of this are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful. It turns out that minimizing the following <em>augmented</em> objective function is <em>equivalent</em> to our minimization goal above.</p>
<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\]</span> <span class="math display">\[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]</span> <span class="math display">\[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]</span></p>
<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\]</span> <span class="math display">\[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]</span> <span class="math display">\[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]</span></p>
<p>The last two expressions include the MSE expressed using vector notation, and the last expression writes <span class="math inline">\(\sum_{i=1}^p |\theta_i|\)</span> as its <strong>L1 norm</strong> equivalent form, <span class="math inline">\(|| \theta ||_1\)</span>.</p>
<p>Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that <em>penalizes large coefficients</em>. In order to minimize this new objective function, we’ll end up balancing two components:</p>
<ol type="1">
<li>Keeping the model’s error on the training data low, represented by the term <span class="math inline">\(\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2\)</span></li>
<li>Keeping the magnitudes of model parameters low, represented by the term <span class="math inline">\(\lambda \sum_{i=1}^p |\theta_i|\)</span></li>
</ol>
<p>The <span class="math inline">\(\lambda\)</span> factor controls the degree of regularization. Roughly speaking, <span class="math inline">\(\lambda\)</span> is related to our <span class="math inline">\(Q\)</span> constraint from before by the rule <span class="math inline">\(\lambda \approx \frac{1}{Q}\)</span>. To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: <span class="math inline">\(||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\)</span>.</p>
<p>The <span class="math inline">\(\lambda\)</span> factor controls the degree of regularization. Roughly speaking, <span class="math inline">\(\lambda\)</span> is related to our <span class="math inline">\(Q\)</span> constraint from before by the rule <span class="math inline">\(\lambda \approx \frac{1}{Q}\)</span>. To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: <span class="math inline">\(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\)</span>.</p>
<ul>
<li><p>Assume <span class="math inline">\(\lambda \rightarrow \infty\)</span>. Then, <span class="math inline">\(\lambda || \theta ||_1\)</span> dominates the cost function. In order to neutralize the <span class="math inline">\(\infty\)</span> and minimize this term, we set <span class="math inline">\(\theta_j = 0\)</span> for all <span class="math inline">\(j \ge 1\)</span>. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. --></p></li>
<li><p>Assume <span class="math inline">\(\lambda \rightarrow 0\)</span>. Then, <span class="math inline">\(\lambda || \theta ||_1=0\)</span>. Minimizing the cost function is equivalent to minimizing <span class="math inline">\(\frac{1}{n} || Y - X\theta ||_2^2\)</span>, our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum <span class="math inline">\(\hat{\theta} = \hat\theta_{No Reg.}\)</span>. <!-- We showed that the global optimum is achieved when the L2 norm ball radius $Q \rightarrow \infty$. --></p></li>
@@ -766,7 +766,7 @@ <h3 data-number="16.2.4" class="anchored" data-anchor-id="l2-ridge-regularizatio
</center>
<p>If we modify our objective function like before, we find that our new goal is to minimize the function: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p \theta_i^2 \leq Q\]</span></p>
<p>Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.</p>
-<p>Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]</span></p>
+<p>Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]</span></p>
<p>The last two expressions include the MSE expressed using vector notation, and the last expression writes <span class="math inline">\(\sum_{i=1}^p \theta_i^2\)</span> as its <strong>L2 norm</strong> equivalent form, <span class="math inline">\(|| \theta ||_2^2\)</span>.</p>
<p>When applying L2 regularization, our goal is to minimize this updated objective function.</p>
<p>Unlike L1 regularization, L2 regularization <em>does</em> have a closed-form solution for the best parameter vector when regularization is applied:</p>
