From 2785d775121aa41f673cfbfcb870f7dc9f7376d0 Mon Sep 17 00:00:00 2001
From: ishani07
Date: Mon, 29 Apr 2024 15:12:31 -0700
Subject: [PATCH] note 16 fix
---
cv_regularization/cv_reg.qmd | 10 +++++-----
docs/cv_regularization/cv_reg.html | 6 +++---
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/cv_regularization/cv_reg.qmd b/cv_regularization/cv_reg.qmd
index eb52cc3c..c7e03a40 100644
--- a/cv_regularization/cv_reg.qmd
+++ b/cv_regularization/cv_reg.qmd
@@ -279,8 +279,8 @@ $$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \p
Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.
$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$
The last two expressions rewrite the MSE in vector notation, and the final expression replaces $\sum_{i=1}^p |\theta_i|$ with its equivalent **L1 norm** form, $|| \theta ||_1$.
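To make the augmented objective concrete, here is a minimal sketch (an illustration, not part of this patch) that fits scikit-learn's `Lasso` on synthetic data; `X`, `Y`, and the choice `alpha=0.1` are assumptions for demonstration. Note that scikit-learn minimizes $\frac{1}{2n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \alpha||\theta||_1$, which matches the objective above up to a factor of 2 on the error term.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative synthetic data: n observations, p features.
rng = np.random.default_rng(100)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_theta = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
Y = X @ true_theta + rng.normal(size=n)

# sklearn's Lasso minimizes (1/(2n))||Y - X theta||_2^2 + alpha * ||theta||_1;
# the intercept theta_0 is fit but not penalized, matching the sums above
# that start at i = 1.
lasso = Lasso(alpha=0.1).fit(X, Y)
print(lasso.coef_)  # L1 regularization drives some coefficients exactly to 0
```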
@@ -290,7 +290,7 @@ Notice that we've replaced the constraint with a second term in our objective fu
1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2$
2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$
-The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
+The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
- Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. In order to neutralize the $\infty$ and minimize this term, we set $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model (a numeric sketch of both extremes follows below).
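As a quick numeric check of these two extremes (again a hedged sketch on illustrative data, not part of this patch), a very large `alpha` should recover the constant model and a near-zero `alpha` should recover ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Illustrative synthetic data, as in the earlier sketch.
rng = np.random.default_rng(100)
X = rng.normal(size=(200, 5))
Y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# lambda -> infinity: the penalty dominates; every theta_j (j >= 1) is driven
# to 0 and only the unpenalized intercept survives -- the constant model,
# whose optimal prediction is the mean of Y.
big = Lasso(alpha=1e6).fit(X, Y)
print(big.coef_)                             # all zeros
print(np.isclose(big.intercept_, Y.mean()))  # True

# lambda -> 0: the penalty vanishes and we recover (approximately) OLS.
small = Lasso(alpha=1e-6, max_iter=100_000).fit(X, Y)
ols = LinearRegression().fit(X, Y)
print(np.allclose(small.coef_, ols.coef_, atol=1e-3))  # True
```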
@@ -360,8 +360,8 @@ Notice that all we have done is change the constraint on the model parameters. T
Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as:
$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$
The last two expressions rewrite the MSE in vector notation, and the final expression replaces $\sum_{i=1}^p \theta_i^2$ with its equivalent **L2 norm** form, $|| \theta ||_2^2$.
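As with LASSO, a short sketch (again, an illustration rather than part of this patch) fits scikit-learn's `Ridge` to hypothetical data. Be aware that scikit-learn minimizes $||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \alpha||\theta||_2^2$ with no $\frac{1}{n}$ on the error term, so its `alpha` plays the role of $n\lambda$ in the formulation above.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative synthetic data, as in the LASSO sketch.
rng = np.random.default_rng(100)
X = rng.normal(size=(200, 5))
Y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# sklearn's Ridge minimizes ||Y - X theta||_2^2 + alpha * ||theta||_2^2,
# so alpha corresponds to n * lambda under the note's 1/n convention.
ridge = Ridge(alpha=0.1).fit(X, Y)
print(ridge.coef_)  # shrunk toward 0, but (unlike L1) rarely exactly 0
```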
diff --git a/docs/cv_regularization/cv_reg.html b/docs/cv_regularization/cv_reg.html
index fd8c8de7..dbd6a44b 100644
--- a/docs/cv_regularization/cv_reg.html
+++ b/docs/cv_regularization/cv_reg.html
@@ -613,14 +613,14 @@ \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q\]
Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of Lagrangian Duality. The details of this are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful. It turns out that minimizing the following augmented objective function is equivalent to our minimization goal above.
-\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\] \[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\] \[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]
+\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\] \[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\] \[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]
The last two expressions rewrite the MSE in vector notation, and the final expression replaces \(\sum_{i=1}^p |\theta_i|\) with its equivalent L1 norm form, \(|| \theta ||_1\).
Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that penalizes large coefficients. In order to minimize this new objective function, we’ll end up balancing two components:
- Keeping the model’s error on the training data low, represented by the term \(\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\)
- Keeping the magnitudes of model parameters low, represented by the term \(\lambda \sum_{i=1}^p |\theta_i|\)
-The \(\lambda\) factor controls the degree of regularization. Roughly speaking, \(\lambda\) is related to our \(Q\) constraint from before by the rule \(\lambda \approx \frac{1}{Q}\). To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: \(||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\).
+The \(\lambda\) factor controls the degree of regularization. Roughly speaking, \(\lambda\) is related to our \(Q\) constraint from before by the rule \(\lambda \approx \frac{1}{Q}\). To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: \(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\).
Assume \(\lambda \rightarrow \infty\). Then, \(\lambda || \theta ||_1\) dominates the cost function. In order to neutralize the \(\infty\) and minimize this term, we set \(\theta_j = 0\) for all \(j \ge 1\). This is a very constrained model that is mathematically equivalent to the constant model.
Assume \(\lambda \rightarrow 0\). Then, \(\lambda || \theta ||_1=0\). Minimizing the cost function is equivalent to minimizing \(\frac{1}{n} || \mathbb{Y} - \mathbb{X}\theta ||_2^2\), our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum \(\hat{\theta} = \hat{\theta}_{\text{No Reg.}}\).
@@ -766,7 +766,7 @@ \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p \theta_i^2 \leq Q\]
Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.
-Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]
+Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]
The last two expressions rewrite the MSE in vector notation, and the final expression replaces \(\sum_{i=1}^p \theta_i^2\) with its equivalent L2 norm form, \(|| \theta ||_2^2\).
When applying L2 regularization, our goal is to minimize this updated objective function.
Unlike L1 regularization, L2 regularization does have a closed-form solution for the best parameter vector when regularization is applied:
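Setting the gradient of \(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda||\theta||_2^2\) to zero gives the standard form \(\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\). The sketch below (an illustration, not from this patch) implements it with NumPy, under the simplifying assumption that every entry of \(\theta\) is penalized, including any intercept column, whereas the sums in the note start at \(i = 1\).

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Closed-form minimizer of (1/n)||Y - X theta||^2 + lam * ||theta||^2.

    Simplifying assumption: every coefficient is penalized (the note leaves
    the intercept theta_0 unpenalized, so treat this as an approximation).
    """
    n, p = X.shape
    # Gradient = 0  =>  (X^T X + n * lam * I) theta = X^T Y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Tiny worked check on illustrative data: lam = 0 recovers OLS.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
print(ridge_closed_form(X, Y, lam=0.0))  # matches np.linalg.lstsq(X, Y, rcond=None)[0]
print(ridge_closed_form(X, Y, lam=0.1))  # shrunk toward 0
```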