diff --git a/_quarto.yml b/_quarto.yml index d8d91f52..5e0ddcf8 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -40,7 +40,7 @@ book: - sql_II/sql_II.qmd - logistic_regression_1/logistic_reg_1.qmd - logistic_regression_2/logistic_reg_2.qmd - # - pca_1/pca_1.qmd + - pca_1/pca_1.qmd # - pca_2/pca_2.qmd # - clustering/clustering.qmd diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html index 5d8434bb..9d702071 100644 --- a/docs/case_study_HCE/case_study_HCE.html +++ b/docs/case_study_HCE/case_study_HCE.html @@ -289,6 +289,12 @@ 23  Logistic Regression II + + diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html index 373f8491..d3963284 100644 --- a/docs/constant_model_loss_transformations/loss_transformations.html +++ b/docs/constant_model_loss_transformations/loss_transformations.html @@ -318,6 +318,12 @@ 23  Logistic Regression II + + @@ -525,7 +531,7 @@

+
Code
import numpy as np
@@ -540,7 +546,7 @@ 

data_linear = dugongs[["Length", "Age"]]

-
+
Code
# Big font helper
@@ -562,7 +568,7 @@ 

plt.style.use("default") # Revert style to default mpl

-
+
Code
# Constant Model + MSE
@@ -595,7 +601,7 @@ 

+
Code
# SLR + MSE
@@ -658,7 +664,7 @@ 

+
Code
# Predictions
@@ -670,7 +676,7 @@ 

yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]

-
+
Code
# Constant Model Rug Plot
@@ -700,7 +706,7 @@ 

+
Code
# SLR model scatter plot 
@@ -814,7 +820,7 @@ 

11.4 Comparing Loss Functions

We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.

-
+
drinks = np.array([20, 21, 22, 29, 33])
 drinks
@@ -822,7 +828,7 @@

+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -832,7 +838,7 @@

Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.

How do outliers affect each cost function? Imagine we replace the largest value in the dataset with 1000. The mean of the data increases substantially, while the median is nearly unaffected.

-
+
drinks_with_outlier = np.append(drinks, 1033)
 display(drinks_with_outlier)
 np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
@@ -846,7 +852,7 @@

This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.

Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.

-
+
drinks_with_additional_observation = np.append(drinks, 35)
 drinks_with_additional_observation
@@ -918,7 +924,7 @@

+
Code
# `corrcoef` computes the correlation coefficient between two variables
@@ -950,7 +956,7 @@ 

and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age" \()\) ) would achieve just that.

An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).

-
+
Code
z = np.log(y)
@@ -985,7 +991,7 @@ 

\[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y_i = C e^{k x}\]

For some constants \(C\) and \(k\).

\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.

-
+
Code
plt.figure(dpi=120, figsize=(4, 3))
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index deb32752..d732b5ac 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index c5e38b7d..1fa11b98 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index 45c9ade4..9fcfd40a 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index 6178405c..32d34a00 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index 1f01fad6..ec4a89e8 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 0edcda59..7ccf6665 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index 01a2c228..e4f43561 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/cv_regularization/cv_reg.html b/docs/cv_regularization/cv_reg.html
index 3c98fe9a..deca9cd2 100644
--- a/docs/cv_regularization/cv_reg.html
+++ b/docs/cv_regularization/cv_reg.html
@@ -321,6 +321,12 @@
   
  23  Logistic Regression II
   
+ +
@@ -424,7 +430,7 @@


In sklearn, the train_test_split function (documentation) of the model_selection module allows us to automatically generate train-test splits.

We will work with the vehicles dataset from previous lectures. As before, we will attempt to predict the mpg of a vehicle from transformations of its hp. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training.

-
+
Code
import pandas as pd
@@ -443,7 +449,7 @@ 

Y = vehicles["mpg"]

-
+
from sklearn.model_selection import train_test_split
 
 # `test_size` specifies the proportion of the full dataset that should be allocated to testing
@@ -465,7 +471,7 @@ 

After performing our train-test split, we fit a model to the training set and assess its performance on the test set.

-
+
import sklearn.linear_model as lm
 from sklearn.metrics import mean_squared_error
 
@@ -645,7 +651,7 @@ 

\(\lambda\) the regularization penalty hyperparameter; it needs to be determined prior to training the model, so we must find the best value via cross-validation.

The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”

Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso model class of sklearn.

-
+
import sklearn.linear_model as lm
 
 # The alpha parameter represents our lambda term
@@ -663,7 +669,7 @@ 

16.2.3 Scaling Features for Regularization

The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model.

-
+
Code
X_train.head()
@@ -726,7 +732,7 @@

\(\hat{y}\) because it is so much greater than the values of the other features. For hp to have much of an impact at all on the prediction, it must be scaled by a large model parameter.

By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp is much larger in magnitude than the parameter for hp^4.

-
+
pd.DataFrame({"Feature":X_train.columns, "Parameter":lasso_model.coef_})
@@ -790,7 +796,7 @@

\[\hat\theta_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]

This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is colinearity in the features. We will discuss the concept of colinearity in a future lecture, but we will not derive this result in Data 100, as it involves a fair bit of matrix calculus.

In sklearn, we perform L2 regularization using the Ridge class. It runs gradient descent to minimize the L2 objective function. Notice that we scale the data before regularizing.

-
+
ridge_model = lm.Ridge(alpha=1) # alpha represents the hyperparameter lambda
 ridge_model.fit(X_train, Y_train)
 
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index 20a201c5..b3fe4de8 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -321,6 +321,12 @@
   
  23  Logistic Regression II
   
+ +
@@ -409,7 +415,7 @@

Data Cleaning and EDA

-
+
Code
import numpy as np
@@ -474,7 +480,7 @@ 

5.1.1.1 CSV

CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections and babynames datasets were stored and loaded as CSVs:

-
+
pd.read_csv("data/elections.csv").head(5)
@@ -545,7 +551,7 @@