diff --git a/_quarto.yml b/_quarto.yml
index 840b29ec..f42d47ce 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -31,7 +31,7 @@ book:
- gradient_descent/gradient_descent.qmd
- feature_engineering/feature_engineering.qmd
- case_study_HCE/case_study_HCE.qmd
- # - cv_regularization/cv_reg.qmd
+ - cv_regularization/cv_reg.qmd
# - probability_1/probability_1.qmd
# - probability_2/probability_2.qmd
# - inference_causality/inference_causality.qmd
diff --git a/cv_regularization/cv_reg.qmd b/cv_regularization/cv_reg.qmd
index c7e03a40..0d1614f8 100644
--- a/cv_regularization/cv_reg.qmd
+++ b/cv_regularization/cv_reg.qmd
@@ -18,7 +18,7 @@ jupyter:
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
- display_name: Python 3 (ipykernel)
+ display_name: ds100env
language: python
name: python3
---
@@ -39,7 +39,7 @@ To answer this question, we will need to address two things: first, we need to u
-From the last lecture, we learned that *increasing* model complexity *decreased* our model's training error but *increased* its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn't been seen before. For this reason, a low training error is not always representative of our model's underlying performance -- we need to also assess how well it performs on unseen data to ensure that it is not overfitting.
+From lecture 14, we learned that *increasing* model complexity *decreased* our model's training error but *increased* its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn't been seen before. For this reason, a low training error is not always representative of our model's underlying performance -- we need to also assess how well it performs on unseen data to ensure that it is not overfitting.
Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.
@@ -143,7 +143,7 @@ Our goal is to train a model with complexity near the orange dotted line – thi
### K-Fold Cross-Validation
-Introducing a validation set gave us an "extra" chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.
+Introducing a validation set gave us one "extra" chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this *one* set of validation data.
But what if, by random chance, our validation set just happened to contain many outliers? It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model's performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.
@@ -160,7 +160,7 @@ The common term for one of these chunks is a **fold**. In the example above, we
In **cross-validation**, we perform validation splits for each fold in the training set. For a dataset with $K$ folds, we:
1. Pick one fold to be the validation fold
-2. Fit the model to training data from every fold *other* than the validation fold
+2. Train the model on data from every fold *other* than the validation fold
3. Compute the model's error on the validation fold and record it
4. Repeat for all $K$ folds
@@ -183,7 +183,7 @@ Some examples of hyperparameters in Data 100 are:
To select a hyperparameter value via cross-validation, we first list out several "guesses" for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error.
-For example, we may wish to use cross-validation to decide what value we should use for $\alpha$, which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best $\alpha$, like 0.1, 1, and 10. For each possible value, we perform cross-validation to see what error the model has when we use that value of $\alpha$ to train it.
+For example, we may wish to use cross-validation to decide what value we should use for $\alpha$, which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best $\alpha$, like 0.1, 1, and 10. For each possible value, we apply 3-fold cross-validation to see what error the model has when we use that value of $\alpha$ to train it.
@@ -213,7 +213,7 @@ What if, instead of fully removing particular features, we kept all features and
What do we mean by a "little bit"? Consider the case where some parameter $\theta_i$ is close to or equal to 0. Then, feature $\phi_i$ barely impacts the prediction – the feature is weighted by such a small value that its presence doesn't significantly change the value of $\hat{\mathbb{Y}}$. If we restrict how large each parameter $\theta_i$ can be, we restrict how much feature $\phi_i$ contributes to the model. This has the effect of *reducing* model complexity.
-In **regularization**, we restrict model complexity by putting a limit on the *magnitudes* of the model parameters $\theta_i$.
+In **regularization**, we restrict model complexity by *putting a limit* on the magnitudes of the model parameters $\theta_i$.
What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number $Q$. In other words:
@@ -258,7 +258,7 @@ Consider the extreme case of when $Q$ is extremely large. In this situation, our
Now what if $Q$ was extremely small? Most parameters are then set to (essentially) 0.
* If the model has no intercept term: $\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = 0$.
-* If the model has an intercept term: $\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0$. Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.
+* If the model has an intercept term: $\hat{\mathbb{Y}} = \theta_0 + (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0$. Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.
Let's summarize what we have seen.
@@ -290,7 +290,7 @@ Notice that we've replaced the constraint with a second term in our objective fu
1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$
-The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
+The hyperparameter $\lambda$ controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
- Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. In order to neutralize the $\infty$ and minimize this term, we set $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model
@@ -337,7 +337,7 @@ Recall that by applying regularization, we give our a model a "budget" for how i
We can avoid this issue by **scaling** the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform **standardization** such that all features have mean 0 and standard deviation 1; essentially, we replace everything with its Z-score.
-$$z_i = \frac{x_i - \mu}{\sigma}$$
+$$z_k = \frac{x_k - \mu_k}{\sigma_k}$$
### L2 (Ridge) Regularization
@@ -389,6 +389,7 @@ Our regression models are summarized below. Note the objective function is what
| Type | Model | Loss | Regularization | Objective Function | Solution |
|-----------------|----------------------------------------|---------------|----------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
-| OLS | $\hat{\mathbb{Y}} = \mathbb{X}\theta$ | MSE | None | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X} \theta\|^2_2$ | $\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ if $\mathbb{X}$ is full column rank |
+| OLS | $\hat{\mathbb{Y}} = \mathbb{X}\theta$ | MSE | None | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X} \theta\|^2_2$ | $\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ if $\mathbb{X}$ is full-column rank |
| Ridge | $\hat{\mathbb{Y}} = \mathbb{X} \theta$ | MSE | L2 | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \theta_i^2$ | $\hat{\theta}_{ridge} = (\mathbb{X}^{\top}\mathbb{X} + n \lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}$ |
| LASSO | $\hat{\mathbb{Y}} = \mathbb{X} \theta$ | MSE | L1 | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \vert \theta_i \vert$ | No closed form solution | |
+
diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html
index 6db0a4ed..a1105df1 100644
--- a/docs/case_study_HCE/case_study_HCE.html
+++ b/docs/case_study_HCE/case_study_HCE.html
@@ -64,6 +64,7 @@
+
@@ -240,6 +241,12 @@
15 Case Study in Human Contexts and Ethics
+
+
---
diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html
index 0eca156f..ff98d103 100644
--- a/docs/constant_model_loss_transformations/loss_transformations.html
+++ b/docs/constant_model_loss_transformations/loss_transformations.html
@@ -270,6 +270,12 @@
15 Case Study in Human Contexts and Ethics
plt.style.use("default") # Revert style to default mpl
-
+
Code
# Constant Model + MSE
@@ -547,7 +553,7 @@
+
Code
# SLR + MSE
@@ -610,7 +616,7 @@
+
Code
# Predictions
@@ -622,7 +628,7 @@
yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]
-
+
Code
# Constant Model Rug Plot
@@ -652,7 +658,7 @@
+
Code
# SLR model scatter plot
@@ -766,7 +772,7 @@
11.4 Comparing Loss Functions
We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?
Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.
-
+
drinks = np.array([20, 21, 22, 29, 33])
drinks
@@ -774,7 +780,7 @@
+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -784,7 +790,7 @@
Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
How do outliers affect each cost function? Imagine we replace the largest value in the dataset with 1000. The mean of the data increases substantially, while the median is nearly unaffected.
This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.
Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.
# `corrcoef` computes the correlation coefficient between two variables
@@ -902,7 +908,7 @@
and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\)"Age"\()\) ) would achieve just that.
An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).
Recognize the need for validation and test sets to preview model performance on unseen data
+
Apply cross-validation to select model hyperparameters
+
Understand the conceptual basis for L1 and L2 regularization
+
+
+
+
+
At the end of the Feature Engineering lecture (Lecture 14), we arrived at the issue of fine-tuning model complexity. We identified that a model that’s too complex can lead to overfitting while a model that’s too simple can lead to underfitting. This brings us to a natural question: how do we control model complexity to avoid under- and overfitting?
+
To answer this question, we will need to address two things: first, we need to understand when our model begins to overfit by assessing its performance on unseen data. We can achieve this through cross-validation. Secondly, we need to introduce a technique to adjust the complexity of our models ourselves – to do so, we will apply regularization.
+
+
16.1 Cross-validation
+
+
16.1.1 Training, Test, and Validation Sets
+
+
+
+
+
From lecture 14, we learned that increasing model complexity decreased our model’s training error but increased its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn’t been seen before. For this reason, a low training error is not always representative of our model’s underlying performance – we need to also assess how well it performs on unseen data to ensure that it is not overfitting.
+
Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.
+
How should we proceed? In this section, we will build up a viable solution to this problem.
+
+
16.1.1.1 Test Sets
+
The simplest approach to avoid overfitting is to keep some of our data “secret” from ourselves. We can set aside a random portion of our full dataset to use only for testing purposes. The datapoints in this test set will not be used to fit the model. Instead, we will:
+
+
Use the remaining portion of our dataset – now called the training set – to run ordinary least squares, gradient descent, or some other technique to train our model,
+
Take the fitted model and use it to make predictions on datapoints in the test set. The model’s performance on the test set (expressed as the MSE, RMSE, etc.) is now indicative of how well it can make predictions on unseen data
+
+
Importantly, the optimal model parameters are found by considering only the data in the training set. After the model has been fitted to the training data, we do not change any parameters before making predictions on the test set. The test set is touched exactly once, after all model design and fine-tuning have been completely finalized, and its performance (expressed as the MSE, RMSE, etc.) is treated as the final measure of how well the model does on unseen data.
+
The process of sub-dividing our dataset into training and test sets is known as a train-test split. Typically, between 10% and 20% of the data is allocated to the test set.
+
+
+
+
+
In sklearn, the train_test_split function (documentation) of the model_selection module allows us to automatically generate train-test splits.
+
We will work with the vehicles dataset from previous lectures. As before, we will attempt to predict the mpg of a vehicle from transformations of its hp. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training.
+
+
+Code
+
import pandas as pd
+import numpy as np
+import seaborn as sns
+import warnings
+warnings.filterwarnings('ignore')
+
+# Load the dataset and construct the design matrix
+vehicles = sns.load_dataset("mpg").rename(columns={"horsepower":"hp"}).dropna()
+X = vehicles[["hp"]]
+X["hp^2"] = vehicles["hp"]**2
+X["hp^3"] = vehicles["hp"]**3
+X["hp^4"] = vehicles["hp"]**4
+
+Y = vehicles["mpg"]
+
+
+
+
from sklearn.model_selection import train_test_split
+
+# `test_size` specifies the proportion of the full dataset that should be allocated to testing
+# `random_state` makes our results reproducible for educational purposes
+X_train, X_test, Y_train, Y_test = train_test_split(
+ X,
+ Y,
+ test_size=0.2,
+ random_state=220
+ )
+
+print(f"Size of full dataset: {X.shape[0]} points")
+print(f"Size of training set: {X_train.shape[0]} points")
+print(f"Size of test set: {X_test.shape[0]} points")
+
+
Size of full dataset: 392 points
+Size of training set: 313 points
+Size of test set: 79 points
+
+
+
After performing our train-test split, we fit a model to the training set and assess its performance on the test set.
+
+
import sklearn.linear_model as lm
+from sklearn.metrics import mean_squared_error
+
+model = lm.LinearRegression()
+
+# Fit to the training set
+model.fit(X_train, Y_train)
+
+# Calculate errors
+train_error = mean_squared_error(Y_train, model.predict(X_train))
+test_error = mean_squared_error(Y_test, model.predict(X_test))
+
+print(f"Training error: {train_error}")
+print(f"Test error: {test_error}")
+
+
Training error: 17.858516841012097
+Test error: 23.19240562932651
+
+
+
+
+
16.1.1.2 Validation Sets
+
Now, what if we were dissatisfied with our test set performance? With our current framework, we’d be stuck. As outlined previously, assessing model performance on the test set is the final stage of the model design process; we can’t go back and adjust our model based on the new discovery that it is overfitting. If we did, then we would be factoring in information from the test set to design our model. The test error would no longer be a true representation of the model’s performance on unseen data!
+
Our solution is to introduce a validation set. A validation set is a random portion of the training set that is set aside for assessing model performance while the model is still being developed. The process for using a validation set is:
+
+
Perform a train-test split.
+
Set the test set aside; we will not touch it until the very end of the model design process.
+
Set aside a portion of the training set to be used for validation.
+
Fit the model parameters to the datapoints contained in the remaining portion of the training set.
+
Assess the model’s performance on the validation set. Adjust the model as needed, re-fit it to the remaining portion of the training set, then re-evaluate it on the validation set. Repeat as necessary until you are satisfied.
+
After all model development is complete, assess the model’s performance on the test set. This is the final test of how well the model performs on unseen data. No further modifications should be made to the model.
+
+
The process of creating a validation set is called a validation split.
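As a minimal sketch of this process, we might reuse sklearn's train_test_split together with the X, Y data and random_state from earlier; the validation proportion below is illustrative.

from sklearn.model_selection import train_test_split
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error

# Steps 1-2: split off the test set and set it aside until the very end
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=220)

# Step 3: carve a validation set out of the training set
X_train_sub, X_val, Y_train_sub, Y_val = train_test_split(
    X_train, Y_train, test_size=0.25, random_state=220
)

# Steps 4-5: fit on the remaining training data, then assess on the validation set
model = lm.LinearRegression()
model.fit(X_train_sub, Y_train_sub)
val_error = mean_squared_error(Y_val, model.predict(X_val))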
+
+
+
+
+
Note that the validation error behaves quite differently from the training error explored previously. As the model becomes more complex, it makes better predictions on the training data; the variance of the model typically increases as model complexity increases. Validation error, on the other hand, decreases then increases as we increase model complexity. This reflects the transition from under- to overfitting: at low model complexity, the model underfits because it is not complex enough to capture the main trends in the data; at high model complexity, the model overfits because it “memorizes” the training data too closely.
+
We can update our understanding of the relationships between error, complexity, and model variance:
+
+
+
+
+
Our goal is to train a model with complexity near the orange dotted line – this is where our model minimizes the validation error. Note that this relationship is a simplification of the real world, but it’s a good enough approximation for the purposes of Data 100.
+
+
+
+
16.1.2 K-Fold Cross-Validation
+
Introducing a validation set gave us one “extra” chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.
+
But what if, by random chance, our validation set just happened to contain many outliers? It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model’s performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.
+
Let’s think back to our validation framework. Earlier, we set aside \(x\)% of our training data (say, 20%) to use for validation.
+
+
+
+
In the example above, we set aside the first 20% of training datapoints for the validation set. This was an arbitrary choice. We could have set aside any 20% portion of the training data for validation. In fact, there are 5 non-overlapping “chunks” of training points that we could have designated as the validation set.
+
+
+
+
The common term for one of these chunks is a fold. In the example above, we had 5 folds, each containing 20% of the training data. This gives us a new perspective: we really have 5 validation sets “hidden” in our training set.
+
In cross-validation, we perform validation splits for each fold in the training set. For a dataset with \(K\) folds, we:
+
+
Pick one fold to be the validation fold
+
Train the model on data from every fold other than the validation fold
+
Compute the model’s error on the validation fold and record it
+
Repeat for all \(K\) folds
+
+The cross-validation error is then the average error across all \(K\) validation folds. In the example below, the cross-validation error is the mean of validation errors #1 to #5.
+
+
+
+
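One way to carry out this procedure by hand is with sklearn's KFold, which generates the fold indices for us. The sketch below assumes the X_train and Y_train split from earlier; the number of folds and random_state are illustrative.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=100)
fold_errors = []
for train_idx, val_idx in kf.split(X_train):
    # Fit the model to every fold other than the validation fold
    model = LinearRegression()
    model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx])
    # Compute and record the error on the validation fold
    val_error = mean_squared_error(Y_train.iloc[val_idx], model.predict(X_train.iloc[val_idx]))
    fold_errors.append(val_error)

# The cross-validation error is the average across all K folds
cv_error = np.mean(fold_errors)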
+
+
16.1.3 Model Selection Workflow
+
At this stage, we have refined our model selection workflow. We begin by performing a train-test split to set aside a test set for the final evaluation of model performance. Then, we alternate between adjusting our design matrix and computing the cross-validation error to finetune the model’s design. In the example below, we illustrate the use of 4-fold cross-validation to help inform model design.
+
+
+
+
+
+
16.1.4 Hyperparameters
+
An important use of cross-validation is for hyperparameter selection. A hyperparameter is some value in a model that is chosen before the model is fit to any data. This means that it is distinct from the model parameters, \(\theta_i\), because its value is selected before the training process begins. We cannot use our usual techniques – calculus, ordinary least squares, or gradient descent – to choose its value. Instead, we must decide it ourselves.
+
Some examples of hyperparameters in Data 100 are:
+
+
The degree of our polynomial model (recall that we selected the degree before creating our design matrix and calling .fit)
+
The learning rate, \(\alpha\), in gradient descent
+
The regularization penalty, \(\lambda\) (to be introduced later this lecture)
+
+
To select a hyperparameter value via cross-validation, we first list out several “guesses” for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error.
+
For example, we may wish to use cross-validation to decide what value we should use for \(\alpha\), which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best \(\alpha\), like 0.1, 1, and 10. For each possible value, we apply 3-fold cross-validation to see what error the model has when we use that value of \(\alpha\) to train it.
+
+
+
+
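The gradient descent step size isn't a parameter we can hand to sklearn's LinearRegression, so the sketch below illustrates the same workflow with a different hyperparameter from the list above, the polynomial degree, using cross_val_score on the training set (the candidate degrees are illustrative).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidate_degrees = [1, 2, 3, 4]  # our "guesses" for the best hyperparameter value
cv_errors = {}
for d in candidate_degrees:
    pipeline = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    # cross_val_score reports negated MSE, so flip the sign to get the CV error
    scores = cross_val_score(pipeline, X_train[["hp"]], Y_train,
                             cv=3, scoring="neg_mean_squared_error")
    cv_errors[d] = -np.mean(scores)

# Select the hyperparameter value with the lowest cross-validation error
best_degree = min(cv_errors, key=cv_errors.get)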
+
+
+
16.2 Regularization
+
We’ve now addressed the first of our two goals for today: creating a framework to assess model performance on unseen data. Now, we’ll discuss our second objective: developing a technique to adjust model complexity. This will allow us to directly tackle the issues of under- and overfitting.
+
Earlier, we adjusted the complexity of our polynomial model by tuning a hyperparameter – the degree of the polynomial. We tested out several different polynomial degrees, computed the validation error for each, and selected the value that minimized the validation error. Tweaking the “complexity” was simple; it was only a matter of adjusting the polynomial degree.
+
In most machine learning problems, complexity is defined differently from what we have seen so far. Today, we’ll explore two different definitions of complexity: the squared and absolute magnitude of \(\theta_i\) coefficients.
+
+
16.2.1 Constraining Model Parameters
+
Think back to our work using gradient descent to descend a loss surface. You may find it helpful to refer back to the Gradient Descent note to refresh your memory. Our aim was to find the combination of model parameters that achieves the minimum loss. We visualized this using a contour map by plotting possible parameter values on the horizontal and vertical axes, which allows us to take a bird’s eye view above the loss surface. Notice that the contour map has \(p=2\) parameters for ease of visualization. We want to find the model parameters corresponding to the lowest point on the loss surface.
Recall that we represent our features with \(\phi_i\) to reflect the fact that we have performed feature engineering.
+
Previously, we restricted model complexity by limiting the total number of features present in the model. We only included a limited number of polynomial features at a time; all other polynomials were excluded from the model.
+
What if, instead of fully removing particular features, we kept all features and used each one only a “little bit”? If we put a limit on how much each feature can contribute to the predictions, we can still control the model’s complexity without the need to manually determine how many features should be removed.
+
What do we mean by a “little bit”? Consider the case where some parameter \(\theta_i\) is close to or equal to 0. Then, feature \(\phi_i\) barely impacts the prediction – the feature is weighted by such a small value that its presence doesn’t significantly change the value of \(\hat{\mathbb{Y}}\). If we restrict how large each parameter \(\theta_i\) can be, we restrict how much feature \(\phi_i\) contributes to the model. This has the effect of reducing model complexity.
+
In regularization, we restrict model complexity by putting a limit on the magnitudes of the model parameters \(\theta_i\).
+
What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number \(Q\). In other words:
+
\[\sum_{i=1}^p |\theta_i| \leq Q\]
+
where \(p\) is the total number of parameters in the model. You can think of this as us giving our model a “budget” for how it distributes the magnitudes of each parameter. If the model assigns a large value to some \(\theta_i\), it may have to assign a small value to some other \(\theta_j\). This has the effect of increasing feature \(\phi_i\)’s influence on the predictions while decreasing the influence of feature \(\phi_j\). The model will need to be strategic about how the parameter weights are distributed – ideally, more “important” features will receive greater weighting.
+
Notice that the intercept term, \(\theta_0\), is excluded from this constraint. We typically do not regularize the intercept term.
+
Now, let’s think back to gradient descent and visualize the loss surface as a contour map. As a refresher, each point on the loss surface represents the model’s loss for a particular combination of \(\theta_1\), \(\theta_2\). Let’s say our goal is to find the combination of parameters that gives us the lowest loss.
+
+
+
+
With no constraint, the optimal \(\hat{\theta}\) is in the center. We denote this as \(\hat{\theta}_\text{No Reg}\).
+
Applying this constraint limits what combinations of model parameters are valid. We can now only consider parameter combinations with a total absolute sum less than or equal to our number \(Q\). For our 2D example, the constraint \(\sum_{i=1}^p |\theta_i| \leq Q\) can be rewritten as \(|\theta_1| + |\theta_2| \leq Q\). This means that we can only assign our regularized parameter vector \(\hat{\theta}_{\text{Reg}}\) to positions in the green diamond below.
+
+
+
+
We can no longer select the parameter vector that truly minimizes the loss surface, \(\hat{\theta}_{\text{No Reg}}\), because this combination of parameters does not lie within our allowed region. Instead, we select whatever allowable combination brings us closest to the true minimum loss, which is depicted by the red point below.
+
+
+
+
Notice that, under regularization, our optimized \(\theta_1\) and \(\theta_2\) values are much smaller than they were without regularization (indeed, \(\theta_1\) has decreased to 0). The model has decreased in complexity because we have limited how much our features contribute to the model. In fact, by setting its parameter to 0, we have effectively removed the influence of feature \(\phi_1\) from the model altogether.
+
If we change the value of \(Q\), we change the region of allowed parameter combinations. The model will still choose the combination of parameters that produces the lowest loss – the closest point in the constrained region to the true minimizer, \(\hat{\theta}_{\text{No Reg}}\).
+
When \(Q\) is small, we severely restrict the size of our parameters. \(\theta_i\)s are small in value, and features \(\phi_i\) only contribute a little to the model. The allowed region of model parameters contracts, and the model becomes much simpler:
+
+
+
+
+
When \(Q\) is large, we do not restrict our parameter sizes by much. \(\theta_i\)s are large in value, and features \(\phi_i\) contribute more to the model. The allowed region of model parameters expands, and the model becomes more complex:
+
+
+
+
+
Consider the extreme case of when \(Q\) is extremely large. In this situation, our restriction has essentially no effect, and the allowed region includes the OLS solution!
+
+
+
+
+
Now what if \(Q\) was extremely small? Most parameters are then set to (essentially) 0.
+
+
If the model has no intercept term: \(\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = 0\).
+
If the model has an intercept term: \(\hat{\mathbb{Y}} = \theta_0 + (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0\). Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.
+
+
Let’s summarize what we have seen.
+
+
+
+
+
+
16.2.2 L1 (LASSO) Regularization
+
How do we actually apply our constraint \(\sum_{i=1}^p |\theta_i| \leq Q\)? We will do so by modifying the objective function that we seek to minimize when fitting a model.
+
Recall our ordinary least squares objective function: our goal was to find parameters that minimize the model’s mean squared error:

\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\]

To incorporate our constraint, the minimization goal becomes: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q\]

Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the Lagrangian Duality. The details of this are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful. It turns out that minimizing the following augmented objective function is equivalent to our minimization goal above.

\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p |\theta_i|\]\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]

The last two expressions include the MSE expressed using vector notation, and the last expression writes \(\sum_{i=1}^p |\theta_i|\) as its L1 norm equivalent form, \(|| \theta ||_1\).
+
Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that penalizes large coefficients. In order to minimize this new objective function, we’ll end up balancing two components:
+
+
Keeping the model’s error on the training data low, represented by the term \(\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2\)
+
Keeping the magnitudes of model parameters low, represented by the term \(\lambda \sum_{i=1}^p |\theta_i|\)
+
+
The hyperparameter \(\lambda\) controls the degree of regularization. Roughly speaking, \(\lambda\) is related to our \(Q\) constraint from before by the rule \(\lambda \approx \frac{1}{Q}\). To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: \(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\).
+
+
Assume \(\lambda \rightarrow \infty\). Then, \(\lambda || \theta ||_1\) dominates the cost function. In order to neutralize the \(\infty\) and minimize this term, we set \(\theta_j = 0\) for all \(j \ge 1\). This is a very constrained model that is mathematically equivalent to the constant model
+
Assume \(\lambda \rightarrow 0\). Then, \(\lambda || \theta ||_1=0\). Minimizing the cost function is equivalent to minimizing \(\frac{1}{n} || Y - X\theta ||_2^2\), our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum \(\hat{\theta} = \hat\theta_{No Reg.}\).
+
+
We call \(\lambda\) the regularization penalty hyperparameter; it needs to be determined prior to training the model, so we must find the best value via cross-validation.
+
The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”
+
Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso model class of sklearn.
+
+
import sklearn.linear_model as lm
+
+# The alpha parameter represents our lambda term
+lasso_model = lm.Lasso(alpha=2)
+lasso_model.fit(X_train, Y_train)
+
+lasso_model.coef_
Notice that all model coefficients are very small in magnitude. In fact, some of them are so small that they are essentially 0. An important characteristic of L1 regularization is that many model parameters are set to 0. In other words, LASSO effectively selects only a subset of the features. The reason for this comes back to our loss surface and allowed “diamond” regions from earlier – we can often get closer to the lowest loss contour at a corner of the diamond than along an edge.
+
When a model parameter is set to 0 or close to 0, its corresponding feature is essentially removed from the model. We say that L1 regularization performs feature selection because, by setting the parameters of unimportant features to 0, LASSO “selects” which features are more useful for modeling. L1 regularization indicates that the features with non-zero parameters are more informative for modeling than those with parameters set to zero.
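Tying the two halves of this lecture together, here is a sketch of how we might choose \(\lambda\) by cross-validation; in sklearn, the alpha argument plays the role of \(\lambda\), and the candidate values below are illustrative.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

candidate_lambdas = [0.01, 0.1, 1, 10]  # guesses for the regularization penalty
cv_errors = {}
for lam in candidate_lambdas:
    scores = cross_val_score(Lasso(alpha=lam), X_train, Y_train,
                             cv=5, scoring="neg_mean_squared_error")
    cv_errors[lam] = -np.mean(scores)

best_lambda = min(cv_errors, key=cv_errors.get)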
+
+
+
16.2.3 Scaling Features for Regularization
+
The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model.
+
+
+Code
+
X_train.head()
+
+
         hp     hp^2       hp^3         hp^4
259    85.0   7225.0   614125.0   52200625.0
129    67.0   4489.0   300763.0   20151121.0
207   102.0  10404.0  1061208.0  108243216.0
302    70.0   4900.0   343000.0   24010000.0
71     97.0   9409.0   912673.0   88529281.0
+
Our features – hp, hp^2, hp^3, and hp^4 – are on drastically different numeric scales! The values contained in hp^4 are orders of magnitude larger than those contained in hp. This can be a problem: the value of hp^4 will naturally contribute more to each predicted \(\hat{y}\) simply because it is so much greater than the values of the other features. For hp to have much of an impact at all on the prediction, it must be scaled by a large model parameter.
+
By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp is much larger in magnitude than the parameter for hp^4.
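One way to inspect them is to line the fitted coefficients up with their feature names, as in this small sketch (reusing the lasso_model fit above):

import pandas as pd

pd.DataFrame({"feature": X_train.columns, "theta_hat": lasso_model.coef_})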
Recall that by applying regularization, we give our model a “budget” for how it can allocate the values of model parameters. For hp to have much of an impact on each prediction, LASSO is forced to “spend” more of this budget on the parameter for hp.
+
We can avoid this issue by scaling the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform standardization such that all features have mean 0 and standard deviation 1; essentially, we replace everything with its Z-score.
+
\[z_k = \frac{x_k - \mu_k}{\sigma_k}\]
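A sketch of standardization with sklearn's StandardScaler, assuming the X_train/X_test split from earlier; note that the means and standard deviations are learned from the training set only and then reused on the test set.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training set
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to the test set
X_test_scaled = scaler.transform(X_test)

# Refit LASSO on the standardized features
lasso_scaled = Lasso(alpha=2)
lasso_scaled.fit(X_train_scaled, Y_train)
lasso_scaled.coef_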
+
+
+
16.2.4 L2 (Ridge) Regularization
+
In all of our work above, we considered the constraint \(\sum_{i=1}^p |\theta_i| \leq Q\) to limit the complexity of the model. What if we had applied a different constraint?
+
In L2 regularization, also known as ridge regression, we constrain the model such that the sum of the squared parameters must be less than some number \(Q\). This constraint takes the form:
+
\[\sum_{i=1}^p \theta_i^2 \leq Q\]
+
As before, we typically do not regularize the intercept term.
+
In our 2D example, the constraint becomes \(\theta_1^2 + \theta_2^2 \leq Q\). Can you see how this is similar to the equation for a circle, \(x^2 + y^2 = r^2\)? The allowed region of parameters for a given value of \(Q\) is now shaped like a ball.
+
+
+
+
If we modify our objective function like before, we find that our new goal is to minimize the function: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p \theta_i^2 \leq Q\]
+
Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.
+
Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\]\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]
+
The last two expressions include the MSE expressed using vector notation, and the last expression writes \(\sum_{i=1}^p \theta_i^2\) as its L2 norm equivalent form, \(|| \theta ||_2^2\).
+
When applying L2 regularization, our goal is to minimize this updated objective function.
+
Unlike L1 regularization, L2 regularization does have a closed-form solution for the best parameter vector when regularization is applied:

\[\hat{\theta}_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]

This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is collinearity in the features. We will discuss the concept of collinearity in a future lecture, but we will not derive this result in Data 100, as it involves a fair bit of matrix calculus.
+
In sklearn, we perform L2 regularization using the Ridge class, which minimizes the L2 objective function above. Notice that we scale the data before regularizing.
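A sketch of what such a fit might look like (the alpha argument again plays the role of \(\lambda\), and its value here is illustrative):

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Scale the data before regularizing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

ridge_model = Ridge(alpha=1)
ridge_model.fit(X_train_scaled, Y_train)
ridge_model.coef_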
CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections and babynames datasets were stored and loaded as CSVs:
-
+
pd.read_csv("data/elections.csv").head(5)
@@ -497,7 +503,7 @@
+
withopen("data/elections.csv", "r") as table: i =0for row in table:
@@ -518,7 +524,7 @@
5.1.1.2 TSV
Another common file type is TSV (Tab-Separated Values). In a TSV, records are still delimited by a newline \n, while fields are delimited by the tab character \t.
Let’s check out the first few rows of the raw TSV file. Again, we’ll use the repr() function so that print shows the special characters.
-
+
withopen("data/elections.txt", "r") as table: i =0for row in table:
@@ -534,7 +540,7 @@
JSON (JavaScript Object Notation) files behave similarly to Python dictionaries. A raw JSON is shown below.
-
+
withopen("data/elections.json", "r") as table: i =0for row in table:
@@ -621,7 +627,7 @@
+
pd.read_json('data/elections.json').head(3)
@@ -676,7 +682,7 @@
5.1.1.3.1 EDA with JSON: Berkeley COVID-19 Data
The City of Berkeley Open Data website has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let’s download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the ds100_utils.py file so that we can reuse them in many different notebooks.
-
+
from ds100_utils import fetch_and_cache

covid_file = fetch_and_cache(
@@ -695,7 +701,7 @@
5.1.1.3.1.1 File Size
Let’s start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use Python tools to probe the file.
Since these seem to be text files, let’s investigate the number of lines, which often corresponds to the number of records.
As part of the EDA workflow, Unix commands can come in very handy. In fact, there’s an entire book called “Data Science at the Command Line” that explores this idea in depth! In Jupyter/IPython, you can prefix lines with ! to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax {expr}.
Here, we use the ls command to list files, using the -lh flags, which request “long format with information in human-readable form.” We also use the wc command for “word count,” but with the -l flag, which asks for line counts instead of words.
These two give us the same information as the code above, albeit in a slightly different form:
-
+
!ls -lh {covid_file}
!wc -l {covid_file}
@@ -725,7 +731,7 @@
5.1.1.3.1.3 File Contents
Let’s explore the data format using Python.
-
+
withopen(covid_file, "r") as f:for i, row inenumerate(f):print(repr(row)) # print raw strings
@@ -739,7 +745,7 @@
We can use the head Unix command (which is where pandas’ head method comes from!) to see the first few lines of the file:
-
+
!head -5 {covid_file}
{
@@ -750,21 +756,21 @@
In order to load the JSON file into pandas, let’s first do some EDA with Python’s json package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into pandas. Python has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell, we import the entire JSON datafile into a Python dictionary using the json package.
-
+
import json

with open(covid_file, "rb") as f:
    covid_json = json.load(f)
The covid_json variable is now a dictionary encoding the data in the file:
-
+
type(covid_json)
dict
We can examine what keys are in the top level JSON object by listing out the keys.
-
+
covid_json.keys()
dict_keys(['meta', 'data'])
@@ -772,14 +778,14 @@
Observation: The JSON dictionary contains a meta key which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
We can investigate the metadata further by examining the keys associated with the metadata.
-
+
covid_json['meta'].keys()
dict_keys(['view'])
The meta key contains another dictionary called view. This likely refers to metadata about a particular “view” of some underlying database. We will learn more about views when we study SQL later in the class.
There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
+
print(covid_json['meta']['view']['description'])
Counts of confirmed COVID-19 cases among Berkeley residents by date.
@@ -809,7 +815,7 @@
5.1.1.3.1.4 Examining the Data Field for Records
We can look at a few entries in the data field. This is what we’ll load into pandas.
-
+
for i in range(3):
    print(f"{i:03} | {covid_json['data'][i]}")
@@ -820,7 +826,7 @@
+
type(covid_json['meta']['view']['columns'])
list
@@ -847,7 +853,7 @@
+
# Load the data from JSON and assign column titles
covid = pd.DataFrame(
    covid_json['data'],
@@ -960,7 +966,7 @@
5.1.2 Primary and Foreign Keys
Last time, we introduced .merge as the pandas method for joining multiple DataFrames together. In our discussion of joins, we touched on the idea of using a “key” to determine what rows should be merged from each table. Let’s take a moment to examine this idea more closely.
The primary key is the column or set of columns in a table that uniquely determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student’s Cal ID as the primary key.
-
+
@@ -1006,7 +1012,7 @@
is a foreign key referencing the previous table.
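As a toy sketch of these ideas (the tables and values below are hypothetical), a column that is a primary key in one DataFrame can appear as a foreign key in another and serve as the join key for .merge:

import pandas as pd

# "Cal ID" is the primary key of the students table...
students = pd.DataFrame({
    "Cal ID": [3034619471, 3035619472, 3025619473],
    "Name": ["Oski", "Ollie", "Orla"],
})

# ...and a foreign key in the assignments table
assignments = pd.DataFrame({
    "Cal ID": [3034619471, 3034619471, 3025619473],
    "Assignment": ["Lab 1", "Lab 2", "Lab 1"],
    "Score": [10, 9, 10],
})

students.merge(assignments, on="Cal ID")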
-
+
@@ -1099,7 +1105,7 @@
5.2.3.1 Temporality with pandas’ dt accessors
Let’s briefly look at how we can use pandas’ dt accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.
/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_71284/874729699.py:1: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
@@ -1315,7 +1321,7 @@
+
calls["EVENTDT"].dt.month.head()
0 4
@@ -1327,7 +1333,7 @@
+
calls["EVENTDT"].dt.dayofweek.head()
0 3
@@ -1339,7 +1345,7 @@
+
calls.sort_values("EVENTDT").head()
@@ -1486,7 +1492,7 @@
We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways: 1. Using a text editor like emacs, vim, VSCode, etc. 2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc. 3. The Python file object 4. pandas, using pd.read_csv()
To try out options 1 and 2, you can view or download the Tuberculosis dataset from the lecture demo notebook under the data folder in the left-hand menu. Notice how the CSV file is a type of rectangular data (i.e., tabular data) stored as comma-separated values.
Next, let’s try out option 3 using the Python file object. We’ll look at the first four lines:
-
+
Code
withopen("data/cdc_tuberculosis.csv", "r") as f:
@@ -1511,7 +1517,7 @@
Whoa, why are there blank lines interspaced between the lines of the CSV?
You may recall that all line breaks in text files are encoded as the special newline character \n. Python’s print() prints each string (including the newline), and an additional newline on top of that.
If you’re curious, we can use the repr() function to return the raw string with all special characters:
-
+
Code
withopen("data/cdc_tuberculosis.csv", "r") as f:
@@ -1530,7 +1536,7 @@
Finally, let’s try option 4 and use the tried-and-true Data 100 approach: pandas.
You may notice some strange things about this table: what’s up with the “Unnamed” column names and the first row?
Congratulations — you’re ready to wrangle your data! Because of how things are stored, we’ll need to clean the data a bit to name our columns better.
A reasonable first step is to identify the row with the right header. The pd.read_csv() function (documentation) has the convenient header parameter that we can set to use the elements in row 1 as the appropriate columns:
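A sketch of that call, reusing the file path from above:

# Use row 1 of the file as the column labels
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1)
tb_df.head()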
Wait…but now we can’t differentiate between the “Number of TB cases” and “TB incidence” year columns. pandas has tried to make our lives easier by automatically adding “.1” to the latter columns, but this doesn’t help us, as humans, understand the data.
We can do this manually with df.rename() (documentation):
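A sketch of such a rename; the exact column labels depend on the file, so the mapping below is illustrative only:

# Hypothetical mapping from the auto-generated labels to descriptive names
rename_dict = {
    "2019": "TB cases 2019",
    "2020": "TB cases 2020",
    "2021": "TB cases 2021",
    "2019.1": "TB incidence 2019",
    "2020.1": "TB incidence 2020",
    "2021.1": "TB incidence 2021",
}
tb_df = tb_df.rename(columns=rename_dict)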
Row 0 is what we call a rollup record, or summary record. It’s often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.
Okay, EDA step two. How was the rollup record aggregated?
Let’s check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why do you think this is?).
-
+
Code
tb_df.sum(axis=0)
@@ -1796,7 +1802,7 @@
Whoa, what’s going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
+
Code
tb_df.dtypes
@@ -1814,7 +1820,7 @@
Since there are commas in the values for TB cases, the numbers are read as the object datatype, or storage type (close to the Python string datatype), so pandas is concatenating strings instead of adding integers (recall that Python can “sum”, or concatenate, strings together: "data" + "100" evaluates to "data100").
Fortunately read_csv also has a thousands parameter (documentation):
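A sketch of reloading the file with that parameter so the case counts are parsed as integers rather than strings (reusing the path and header choice from above):

tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=",")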
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
@@ -1910,7 +1916,7 @@
The total TB cases look right. Phew!
Let’s just look at the records with state-level granularity:
-
+
Code
state_tb_df = tb_df[1:]
@@ -1995,7 +2001,7 @@
5.4.3 Gather Census Data
U.S. Census population estimates source (2019), source (2020-2021).
Running the cells below cleans the data. There are a few new methods here: * df.convert_dtypes() (documentation) conveniently converts all float dtypes into ints and is out of scope for the class. * df.dropna() (documentation) will be explained in more detail next time.
-
+
Code
# 2010s census data
@@ -2119,7 +2125,7 @@
or use IPython magic, which will intelligently re-import code when files change:
%load_ext autoreload
%autoreload 2
-
+
Code
# census 2020s data
@@ -2196,7 +2202,7 @@
5.4.4 Joining Data (Merging DataFrames)
Time to merge! Here we use the DataFrame method df1.merge(right=df2, ...) on DataFrame df1 (documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...) (documentation). Feel free to use either.
-
+
# merge TB DataFrame with two US census DataFrames
tb_census_df = (
    tb_df
@@ -2371,7 +2377,7 @@
+
# try merging again, but cleaner this time
tb_census_df = (
    tb_df
@@ -2484,7 +2490,7 @@
\[\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} \]
\[= \frac{\text{TB cases in population}}{\text{population}} \times 100000\]
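Translated directly into code, the formula is a one-line helper (not tied to any particular column names):

def tb_incidence(cases, population):
    # TB incidence is reported per 100,000 people
    return cases / population * 100_000

Applied to the case and population columns of the merged DataFrame, this should approximately reproduce the reported incidence values.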