From 0f8638baf06f638e4265f6977253a05848886a29 Mon Sep 17 00:00:00 2001 From: Nikhil Reddy Date: Tue, 19 Nov 2024 21:05:34 -0800 Subject: [PATCH] edit and publish note 24 --- _quarto.yml | 2 +- docs/case_study_HCE/case_study_HCE.html | 6 + .../loss_transformations.html | 34 +- .../figure-pdf/cell-13-output-1.pdf | Bin 9193 -> 9193 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 15000 -> 15000 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 8394 -> 8394 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 11041 -> 11041 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 103470 -> 103470 bytes .../figure-pdf/cell-7-output-2.pdf | Bin 11239 -> 11239 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 9752 -> 9752 bytes docs/cv_regularization/cv_reg.html | 20 +- docs/eda/eda.html | 162 +- .../eda_files/figure-pdf/cell-62-output-1.pdf | Bin 16671 -> 16671 bytes .../eda_files/figure-pdf/cell-67-output-1.pdf | Bin 10991 -> 10991 bytes .../eda_files/figure-pdf/cell-68-output-1.pdf | Bin 12638 -> 12638 bytes .../eda_files/figure-pdf/cell-69-output-1.pdf | Bin 9239 -> 9239 bytes .../eda_files/figure-pdf/cell-71-output-1.pdf | Bin 19825 -> 19825 bytes .../eda_files/figure-pdf/cell-75-output-1.pdf | Bin 16799 -> 16799 bytes .../eda_files/figure-pdf/cell-76-output-1.pdf | Bin 21577 -> 21577 bytes .../eda_files/figure-pdf/cell-77-output-1.pdf | Bin 11851 -> 11851 bytes .../feature_engineering.html | 30 +- .../figure-pdf/cell-8-output-2.pdf | Bin 9247 -> 9247 bytes .../figure-pdf/cell-9-output-2.pdf | Bin 9545 -> 9545 bytes docs/gradient_descent/gradient_descent.html | 54 +- .../figure-pdf/cell-21-output-2.pdf | Bin 11767 -> 11767 bytes docs/index.html | 6 + .../inference_causality.html | 52 +- .../figure-pdf/cell-14-output-2.pdf | Bin 20716 -> 20716 bytes .../figure-pdf/cell-16-output-2.pdf | Bin 17984 -> 17984 bytes docs/intro_lec/introduction.html | 6 + docs/intro_to_modeling/intro_to_modeling.html | 22 +- .../figure-html/cell-2-output-1.png | Bin 86625 -> 86938 bytes .../figure-pdf/cell-2-output-1.pdf | Bin 9964 -> 9967 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 15408 -> 15408 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 14938 -> 14938 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 16000 -> 16000 bytes .../logistic_regression_1/logistic_reg_1.html | 30 +- .../figure-html/cell-3-output-1.png | Bin 117822 -> 117494 bytes .../figure-html/cell-4-output-1.png | Bin 134505 -> 134923 bytes .../figure-html/cell-5-output-1.png | Bin 175912 -> 174899 bytes .../figure-html/cell-8-output-1.png | Bin 180892 -> 181927 bytes .../figure-pdf/cell-10-output-1.pdf | Bin 13791 -> 13791 bytes .../figure-pdf/cell-11-output-1.pdf | Bin 13937 -> 13937 bytes .../figure-pdf/cell-13-output-1.pdf | Bin 10478 -> 10478 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 19592 -> 19583 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 19608 -> 19633 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 19970 -> 19986 bytes .../figure-pdf/cell-6-output-1.pdf | Bin 11733 -> 11733 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 12423 -> 12423 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 25413 -> 25436 bytes .../logistic_regression_2/logistic_reg_2.html | 10 + docs/ols/ols.html | 12 +- docs/pandas_1/pandas_1.html | 100 +- docs/pandas_2/pandas_2.html | 148 +- docs/pandas_3/pandas_3.html | 122 +- docs/pca_1/images/PCA_1.png | Bin 0 -> 65913 bytes docs/pca_1/images/dataset3.png | Bin 0 -> 31437 bytes docs/pca_1/images/dataset3_outlier.png | Bin 0 -> 28258 bytes docs/pca_1/images/dataset4.png | Bin 0 -> 15220 bytes docs/pca_1/images/dataset_dims.png | Bin 0 -> 
25444 bytes docs/pca_1/images/diff_reductions.png | Bin 0 -> 76852 bytes docs/pca_1/images/factorization.png | Bin 0 -> 77351 bytes .../images/factorization_constraints.png | Bin 0 -> 709501 bytes docs/pca_1/images/matmul.png | Bin 0 -> 56485 bytes docs/pca_1/images/matmul2.png | Bin 0 -> 50897 bytes docs/pca_1/images/matmul3.png | Bin 0 -> 64871 bytes docs/pca_1/images/matrix_decomp.png | Bin 0 -> 51052 bytes docs/pca_1/images/optimization_takeaways.png | Bin 0 -> 51814 bytes docs/pca_1/images/pc_rotation.gif | Bin 0 -> 500005 bytes docs/pca_1/images/pca_example.png | Bin 0 -> 102002 bytes docs/pca_1/images/reconstruction_loss.png | Bin 0 -> 49076 bytes docs/pca_1/images/total_variance_1.png | Bin 0 -> 38805 bytes docs/pca_1/images/total_variance_2.png | Bin 0 -> 57774 bytes docs/pca_1/pca_1.html | 1298 +++++++++++++ docs/probability_1/probability_1.html | 6 + docs/probability_2/probability_2.html | 6 + docs/regex/regex.html | 54 +- docs/sampling/sampling.html | 40 +- .../figure-html/cell-13-output-2.png | Bin 31066 -> 33164 bytes .../figure-html/cell-15-output-2.png | Bin 56665 -> 57769 bytes docs/search.json | 62 +- docs/sql_I/sql_I.html | 42 +- docs/sql_II/sql_II.html | 162 +- docs/visualization_1/visualization_1.html | 50 +- .../figure-pdf/cell-10-output-2.pdf | Bin 14751 -> 14751 bytes .../figure-pdf/cell-11-output-1.pdf | Bin 11421 -> 11421 bytes .../figure-pdf/cell-12-output-1.pdf | Bin 12962 -> 12962 bytes .../figure-pdf/cell-13-output-1.pdf | Bin 15653 -> 15653 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 13198 -> 13198 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 13903 -> 13903 bytes .../figure-pdf/cell-17-output-2.pdf | Bin 16169 -> 16169 bytes .../figure-pdf/cell-18-output-2.pdf | Bin 11504 -> 11504 bytes .../figure-pdf/cell-19-output-2.pdf | Bin 13869 -> 13869 bytes .../figure-pdf/cell-20-output-2.pdf | Bin 14660 -> 14660 bytes .../figure-pdf/cell-21-output-1.pdf | Bin 11648 -> 11648 bytes .../figure-pdf/cell-22-output-1.pdf | Bin 11461 -> 11461 bytes .../figure-pdf/cell-23-output-1.pdf | Bin 12128 -> 12128 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 11274 -> 11274 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 11328 -> 11328 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 11395 -> 11395 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 23251 -> 23251 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 11931 -> 11931 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 13379 -> 13379 bytes docs/visualization_2/visualization_2.html | 56 +- .../figure-html/cell-18-output-1.png | Bin 98907 -> 98716 bytes .../figure-pdf/cell-10-output-1.pdf | Bin 10169 -> 10169 bytes .../figure-pdf/cell-11-output-1.pdf | Bin 5887 -> 5887 bytes .../figure-pdf/cell-12-output-1.pdf | Bin 11927 -> 11927 bytes .../figure-pdf/cell-13-output-1.pdf | Bin 14012 -> 14012 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 13643 -> 13643 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 13905 -> 13905 bytes .../figure-pdf/cell-16-output-1.pdf | Bin 17703 -> 17703 bytes .../figure-pdf/cell-17-output-1.pdf | Bin 15914 -> 15914 bytes .../figure-pdf/cell-18-output-1.pdf | Bin 17732 -> 17750 bytes .../figure-pdf/cell-19-output-1.pdf | Bin 15715 -> 15715 bytes .../figure-pdf/cell-20-output-1.pdf | Bin 14911 -> 14911 bytes .../figure-pdf/cell-21-output-1.pdf | Bin 40952 -> 40952 bytes .../figure-pdf/cell-22-output-1.pdf | Bin 13919 -> 13919 bytes .../figure-pdf/cell-23-output-1.pdf | Bin 14978 -> 14978 bytes .../figure-pdf/cell-24-output-1.pdf | Bin 16210 -> 16210 bytes .../figure-pdf/cell-25-output-2.pdf | Bin 16563 -> 16563 
bytes .../figure-pdf/cell-26-output-1.pdf | Bin 14791 -> 14791 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 12068 -> 12068 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 9274 -> 9274 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 10244 -> 10244 bytes .../figure-pdf/cell-6-output-1.pdf | Bin 10243 -> 10243 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 10130 -> 10130 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 12591 -> 12591 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 11286 -> 11286 bytes index.tex | 694 ++++++- pca_1/images/PCA_1.png | Bin 23537 -> 65913 bytes pca_1/images/factorization.png | Bin 446024 -> 77351 bytes pca_1/images/pc_rotation.gif | Bin 0 -> 500005 bytes pca_1/images/total_variance_1.png | Bin 0 -> 38805 bytes pca_1/images/total_variance_2.png | Bin 0 -> 57774 bytes pca_1/pca_1.qmd | 109 +- .../libs/bootstrap/bootstrap-icons.css | 1704 ----------------- .../libs/bootstrap/bootstrap-icons.woff | Bin 137124 -> 0 bytes .../libs/bootstrap/bootstrap.min.css | 10 - .../libs/bootstrap/bootstrap.min.js | 7 - .../libs/clipboard/clipboard.min.js | 7 - .../libs/quarto-html/anchor.min.js | 9 - .../libs/quarto-html/popper.min.js | 6 - .../quarto-syntax-highlighting.css | 171 -- pca_1/pca_1_files/libs/quarto-html/quarto.js | 770 -------- pca_1/pca_1_files/libs/quarto-html/tippy.css | 1 - .../libs/quarto-html/tippy.umd.min.js | 2 - 147 files changed, 2696 insertions(+), 3386 deletions(-) create mode 100644 docs/pca_1/images/PCA_1.png create mode 100644 docs/pca_1/images/dataset3.png create mode 100644 docs/pca_1/images/dataset3_outlier.png create mode 100644 docs/pca_1/images/dataset4.png create mode 100644 docs/pca_1/images/dataset_dims.png create mode 100644 docs/pca_1/images/diff_reductions.png create mode 100644 docs/pca_1/images/factorization.png create mode 100644 docs/pca_1/images/factorization_constraints.png create mode 100644 docs/pca_1/images/matmul.png create mode 100644 docs/pca_1/images/matmul2.png create mode 100644 docs/pca_1/images/matmul3.png create mode 100644 docs/pca_1/images/matrix_decomp.png create mode 100644 docs/pca_1/images/optimization_takeaways.png create mode 100644 docs/pca_1/images/pc_rotation.gif create mode 100644 docs/pca_1/images/pca_example.png create mode 100644 docs/pca_1/images/reconstruction_loss.png create mode 100644 docs/pca_1/images/total_variance_1.png create mode 100644 docs/pca_1/images/total_variance_2.png create mode 100644 docs/pca_1/pca_1.html create mode 100644 pca_1/images/pc_rotation.gif create mode 100644 pca_1/images/total_variance_1.png create mode 100644 pca_1/images/total_variance_2.png delete mode 100644 pca_1/pca_1_files/libs/bootstrap/bootstrap-icons.css delete mode 100644 pca_1/pca_1_files/libs/bootstrap/bootstrap-icons.woff delete mode 100644 pca_1/pca_1_files/libs/bootstrap/bootstrap.min.css delete mode 100644 pca_1/pca_1_files/libs/bootstrap/bootstrap.min.js delete mode 100644 pca_1/pca_1_files/libs/clipboard/clipboard.min.js delete mode 100644 pca_1/pca_1_files/libs/quarto-html/anchor.min.js delete mode 100644 pca_1/pca_1_files/libs/quarto-html/popper.min.js delete mode 100644 pca_1/pca_1_files/libs/quarto-html/quarto-syntax-highlighting.css delete mode 100644 pca_1/pca_1_files/libs/quarto-html/quarto.js delete mode 100644 pca_1/pca_1_files/libs/quarto-html/tippy.css delete mode 100644 pca_1/pca_1_files/libs/quarto-html/tippy.umd.min.js diff --git a/_quarto.yml b/_quarto.yml index d8d91f52..5e0ddcf8 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -40,7 +40,7 @@ book: - sql_II/sql_II.qmd - 
logistic_regression_1/logistic_reg_1.qmd - logistic_regression_2/logistic_reg_2.qmd - # - pca_1/pca_1.qmd + - pca_1/pca_1.qmd # - pca_2/pca_2.qmd # - clustering/clustering.qmd diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html index 5d8434bb..9d702071 100644 --- a/docs/case_study_HCE/case_study_HCE.html +++ b/docs/case_study_HCE/case_study_HCE.html @@ -289,6 +289,12 @@ 23  Logistic Regression II + + diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html index 373f8491..d3963284 100644 --- a/docs/constant_model_loss_transformations/loss_transformations.html +++ b/docs/constant_model_loss_transformations/loss_transformations.html @@ -318,6 +318,12 @@ 23  Logistic Regression II + + @@ -525,7 +531,7 @@

+
Code
import numpy as np
@@ -540,7 +546,7 @@ 

data_linear = dugongs[["Length", "Age"]]

-
+
Code
# Big font helper
@@ -562,7 +568,7 @@ 

plt.style.use("default") # Revert style to default mpl

-
+
Code
# Constant Model + MSE
@@ -595,7 +601,7 @@ 

+
Code
# SLR + MSE
@@ -658,7 +664,7 @@ 

+
Code
# Predictions
@@ -670,7 +676,7 @@ 

yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]

-
+
Code
# Constant Model Rug Plot
@@ -700,7 +706,7 @@ 

+
Code
# SLR model scatter plot 
@@ -814,7 +820,7 @@ 

11.4 Comparing Loss Functions

We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.

-
+
drinks = np.array([20, 21, 22, 29, 33])
 drinks
@@ -822,7 +828,7 @@

+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -832,7 +838,7 @@

Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
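To see this difference in smoothness concretely, here is a minimal sketch (assuming the drinks data from the cell above) that evaluates the constant model's average squared and absolute loss over a grid of candidate \(\theta\) values; the MSE curve is smooth everywhere, while the MAE curve is piecewise linear with kinks at the data points.

import numpy as np
import matplotlib.pyplot as plt

drinks = np.array([20, 21, 22, 29, 33])              # same data as the cell above
thetas = np.linspace(15, 40, 500)                    # candidate constant predictions
mse = [np.mean((drinks - t) ** 2) for t in thetas]   # smooth, differentiable everywhere
mae = [np.mean(np.abs(drinks - t)) for t in thetas]  # piecewise linear, kinks at the data points

plt.plot(thetas, mse, label="MSE")
plt.plot(thetas, mae, label="MAE")
plt.xlabel(r"$\theta$")
plt.ylabel("Average loss")
plt.legend()
plt.show()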

How do outliers affect each cost function? Imagine we replace the largest value in the dataset with 1000. The mean of the data increases substantially, while the median is nearly unaffected.

-
+
drinks_with_outlier = np.append(drinks, 1033)
 display(drinks_with_outlier)
 np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
@@ -846,7 +852,7 @@

This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.

Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.

-
+
drinks_with_additional_observation = np.append(drinks, 35)
 drinks_with_additional_observation
@@ -918,7 +924,7 @@

+
Code
# `corrcoef` computes the correlation coefficient between two variables
@@ -950,7 +956,7 @@ 

and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age" \()\) ) would achieve just that.

An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).
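As a quick check of this convention in code (a tiny sketch, not part of the original cells): numpy's np.log is the natural logarithm, while np.log10 is the base-10 logarithm.

import numpy as np

np.log(np.e), np.log10(100)   # (1.0, 2.0): np.log is base e, np.log10 is base 10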

-
+
Code
z = np.log(y)
@@ -985,7 +991,7 @@ 

\[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y = C e^{k x}\]

for some constants \(C\) and \(k\).

\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.

-
+
Code
plt.figure(dpi=120, figsize=(4, 3))
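The exponential fit plotted above can be reproduced with a short sketch: fit a line to \((x, \log y)\), then exponentiate the intercept to recover \(C\) and read off the slope as \(k\). This assumes x and y are the dugongs' "Length" and "Age" columns defined in the earlier cells.

import numpy as np

# Assumes x = dugongs["Length"] and y = dugongs["Age"], as in the earlier cells
k, log_C = np.polyfit(x, np.log(y), deg=1)  # fit log(y) = log(C) + k * x
C = np.exp(log_C)
y_exp_fit = C * np.exp(k * x)               # exponential fit, y ≈ C e^(kx), on the original scale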
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index deb32752f8d27a10bbdbaf2f8465755159af2510..d732b5acf79d7222bf4576ca4f49eb1d8481d36c 100644
GIT binary patch
delta 20
ccmaFq{?dKJ8wGYtBLh<-Q}fNA6}~Y60A30RPyhe`

delta 20
ccmaFq{?dKJ8wGX~BLhPVW3$bl6}~Y60A0oiN&o-=

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index c5e38b7dbe48b814cd8235166f19da7a4a65dba4..1fa11b9867349e719da9a32fbcc11c549792e917 100644
GIT binary patch
delta 20
bcmbPHI-_($o*BEPk%6g^srlwoGi4S4PV@$8

delta 20
bcmbPHI-_($o*BD|k%6IwvDxNQGi4S4POt`J

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index 45c9ade46031d343d9ad46d14b3f9cc3f5c0c50d..9fcfd40a7952b8da72e70f32a57c260196250300 100644
GIT binary patch
delta 20
ccmX@*c*=3ZLs@o9BLh<-Q;W?nWFIjB09SGc*8l(j

delta 20
ccmX@*c*=3ZLs@ncBLhPVW3$aKWFIjB09Pmn&;S4c

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index 6178405c560ba8d20012b9463250ef7f24570c5d..32d34a00f1be4c90452750659d0dfca4bad4e521 100644
GIT binary patch
delta 20
bcmZ1&wlHkNZ*_J{BLh<-Q}fMC8uH8lQD6o&

delta 20
bcmZ1&wlHkNZ*_JPBLhPVW7Ew{8uH8lQ5FU+

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index 1f01fad68340518d233f4d9d7aa6dc3afbbcaea9..ec4a89e88f5a334d7ecb90f423c5e4d95a004743 100644
GIT binary patch
delta 25
hcmZ3tf^FRjwuUW?^ZVH?jSNhUOwG41>u2m>0RVue2?_uJ

delta 25
hcmZ3tf^FRjwuUW?^ZVINj0_AdjLo($>u2m>0RVtp2?PKD

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 0edcda59d9d242db99b8c888ebfe859d625367c8..7ccf66658ee0b9c33eee3cf15b613914b7e39aac 100644
GIT binary patch
delta 20
bcmaDJ{ycocObvERBLh<-Q}fLWG~Ag1SpWwp

delta 20
bcmaDJ{ycocObvDuBLhPVW3$Z*G~Ag1SiA=!

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index 01a2c228b8c984b9357b8816d61f5610e3af0c0d..e4f43561322943d46397d278279cb99d786b6bc1 100644
GIT binary patch
delta 20
bcmbQ?Gs9=YXC-z^BLh<-Q}fNgl;oHJQLqN{

delta 20
bcmbQ?Gs9=YXC-zMBLhPVW3$b_l;oHJQEUe7

diff --git a/docs/cv_regularization/cv_reg.html b/docs/cv_regularization/cv_reg.html
index 3c98fe9a..deca9cd2 100644
--- a/docs/cv_regularization/cv_reg.html
+++ b/docs/cv_regularization/cv_reg.html
@@ -321,6 +321,12 @@
   
  23  Logistic Regression II
   
+ +
@@ -424,7 +430,7 @@


In sklearn, the train_test_split function (documentation) of the model_selection module allows us to automatically generate train-test splits.

We will work with the vehicles dataset from previous lectures. As before, we will attempt to predict the mpg of a vehicle from transformations of its hp. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training.

-
+
Code
import pandas as pd
@@ -443,7 +449,7 @@ 

Y = vehicles["mpg"]

-
+
from sklearn.model_selection import train_test_split
 
 # `test_size` specifies the proportion of the full dataset that should be allocated to testing
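The call itself is truncated in the hunk above; a minimal sketch of how the split might look, assuming X holds the hp-derived features and Y = vehicles["mpg"] from the earlier cell (the random_state value here is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=100)
print(f"Training set size: {len(X_train)}; test set size: {len(X_test)}")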
@@ -465,7 +471,7 @@ 

After performing our train-test split, we fit a model to the training set and assess its performance on the test set.

-
+
import sklearn.linear_model as lm
 from sklearn.metrics import mean_squared_error
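A sketch of the fit-then-evaluate step described above, assuming the train-test split from the previous cell:

import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error

model = lm.LinearRegression()
model.fit(X_train, Y_train)                                   # fit on the training set only
train_mse = mean_squared_error(Y_train, model.predict(X_train))
test_mse = mean_squared_error(Y_test, model.predict(X_test))  # performance on held-out data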
 
@@ -645,7 +651,7 @@ 

\(\lambda\) is the regularization penalty hyperparameter; it needs to be determined prior to training the model, so we must find the best value via cross-validation.

The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”

Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso model class of sklearn.

-
+
import sklearn.linear_model as lm
 
 # The alpha parameter represents our lambda term
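The cell is truncated above; a minimal sketch of fitting the lasso follows (the specific alpha value is illustrative, not necessarily the notes' choice):

lasso_model = lm.Lasso(alpha=2)   # alpha plays the role of lambda
lasso_model.fit(X_train, Y_train)
lasso_model.coef_                 # L1 regularization can drive some coefficients exactly to zero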
@@ -663,7 +669,7 @@ 

16.2.3 Scaling Features for Regularization

The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model.

-
+
Code
X_train.head()
@@ -726,7 +732,7 @@

\(\hat{y}\) because it is so much greater than the values of the other features. For hp to have much of an impact at all on the prediction, it must be scaled by a large model parameter.

By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp is much larger in magnitude than the parameter for hp^4.

-
+
pd.DataFrame({"Feature":X_train.columns, "Parameter":lasso_model.coef_})
@@ -790,7 +796,7 @@

\[\hat\theta_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]

This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is collinearity in the features. We will discuss the concept of collinearity in a future lecture, but we will not derive this result in Data 100, as it involves a fair bit of matrix calculus.

In sklearn, we perform L2 regularization using the Ridge class, which solves for the parameters that minimize the L2 objective function. Notice that we scale the data before regularizing.

-
+
ridge_model = lm.Ridge(alpha=1) # alpha represents the hyperparameter lambda
 ridge_model.fit(X_train, Y_train)
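To connect this back to the closed-form expression above, here is a sketch that computes \((\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\) directly with numpy, assuming X_train and Y_train from the cells above. Note that sklearn's Ridge uses a slightly different scaling of the penalty and fits an unpenalized intercept, so the numbers will not match the fitted model exactly.

import numpy as np

X_mat = X_train.to_numpy()
Y_vec = Y_train.to_numpy()
n, p = X_mat.shape
lam = 1.0   # mirrors alpha=1 above, up to sklearn's penalty scaling
theta_ridge = np.linalg.inv(X_mat.T @ X_mat + n * lam * np.eye(p)) @ X_mat.T @ Y_vec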
 
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index 20a201c5..b3fe4de8 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -321,6 +321,12 @@
   
  23  Logistic Regression II
   
+ +
@@ -409,7 +415,7 @@

Data Cleaning and EDA

-
+
Code
import numpy as np
@@ -474,7 +480,7 @@ 

5.1.1.1 CSV

CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections and babynames datasets were stored and loaded as CSVs:

-
+
pd.read_csv("data/elections.csv").head(5)
@@ -545,7 +551,7 @@