From 2882853a582dff06242b3b17e9cba86318620644 Mon Sep 17 00:00:00 2001
From: Nikhil Reddy
Date: Tue, 22 Oct 2024 21:48:21 -0700
Subject: [PATCH] publish note 16

---
 _quarto.yml                                   |    2 +-
 cv_regularization/cv_reg.qmd                  |   21 +-
 docs/case_study_HCE/case_study_HCE.html       |   10 +
 .../loss_transformations.html                 |   34 +-
 .../figure-pdf/cell-13-output-1.pdf           |  Bin 9193 -> 9193 bytes
 .../figure-pdf/cell-14-output-1.pdf           |  Bin 15000 -> 15000 bytes
 .../figure-pdf/cell-15-output-1.pdf           |  Bin 8394 -> 8394 bytes
 .../figure-pdf/cell-4-output-1.pdf            |  Bin 11041 -> 11041 bytes
 .../figure-pdf/cell-5-output-1.pdf            |  Bin 103470 -> 103470 bytes
 .../figure-pdf/cell-7-output-2.pdf            |  Bin 11239 -> 11239 bytes
 .../figure-pdf/cell-8-output-1.pdf            |  Bin 9752 -> 9752 bytes
 docs/cv_regularization/cv_reg.html            | 1216 +++++++++++++++++
 .../images/constrained_gd.png                 |  Bin 0 -> 467880 bytes
 .../images/cross_validation.png               |  Bin 0 -> 34529 bytes
 docs/cv_regularization/images/diamond.png     |  Bin 0 -> 82332 bytes
 .../cv_regularization/images/diamondpoint.png |  Bin 0 -> 82903 bytes
 docs/cv_regularization/images/diamondreg.png  |  Bin 0 -> 106368 bytes
 .../images/green_constrained_gd_sol.png       |  Bin 0 -> 450404 bytes
 .../images/hyperparameter_tuning.png          |  Bin 0 -> 47848 bytes
 docs/cv_regularization/images/largerq.png     |  Bin 0 -> 83611 bytes
 .../images/model_selection.png                |  Bin 0 -> 67450 bytes
 .../images/possible_validation_sets.png       |  Bin 0 -> 13222 bytes
 .../images/simple_under_overfit.png           |  Bin 0 -> 58740 bytes
 docs/cv_regularization/images/summary.png     |  Bin 0 -> 127869 bytes
 .../images/train-test-split.png               |  Bin 0 -> 25769 bytes
 .../images/training_validation_curve.png      |  Bin 0 -> 92721 bytes
 .../images/unconstrained.png                  |  Bin 0 -> 157127 bytes
 .../images/validation-split.png               |  Bin 0 -> 39007 bytes
 .../images/validation_set.png                 |  Bin 0 -> 9306 bytes
 docs/cv_regularization/images/verylarge.png   |  Bin 0 -> 81823 bytes
 docs/eda/eda.html                             |  162 +--
 .../eda_files/figure-pdf/cell-62-output-1.pdf |  Bin 16671 -> 16671 bytes
 .../eda_files/figure-pdf/cell-67-output-1.pdf |  Bin 10991 -> 10991 bytes
 .../eda_files/figure-pdf/cell-68-output-1.pdf |  Bin 12638 -> 12638 bytes
 .../eda_files/figure-pdf/cell-69-output-1.pdf |  Bin 9239 -> 9239 bytes
 .../eda_files/figure-pdf/cell-71-output-1.pdf |  Bin 19825 -> 19825 bytes
 .../eda_files/figure-pdf/cell-75-output-1.pdf |  Bin 16799 -> 16799 bytes
 .../eda_files/figure-pdf/cell-76-output-1.pdf |  Bin 21577 -> 21577 bytes
 .../eda_files/figure-pdf/cell-77-output-1.pdf |  Bin 11851 -> 11851 bytes
 .../feature_engineering.html                  |   30 +-
 .../figure-pdf/cell-8-output-2.pdf            |  Bin 9247 -> 9247 bytes
 .../figure-pdf/cell-9-output-2.pdf            |  Bin 9545 -> 9545 bytes
 docs/gradient_descent/gradient_descent.html   |   54 +-
 .../figure-pdf/cell-21-output-2.pdf           |  Bin 11767 -> 11767 bytes
 docs/index.html                               |    6 +
 docs/intro_lec/introduction.html              |    6 +
 docs/intro_to_modeling/intro_to_modeling.html |   22 +-
 .../figure-html/cell-2-output-1.png           |  Bin 86618 -> 86442 bytes
 .../figure-pdf/cell-2-output-1.pdf            |  Bin 9964 -> 9962 bytes
 .../figure-pdf/cell-3-output-1.pdf            |  Bin 15408 -> 15408 bytes
 .../figure-pdf/cell-7-output-1.pdf            |  Bin 14938 -> 14938 bytes
 .../figure-pdf/cell-9-output-1.pdf            |  Bin 16000 -> 16000 bytes
 docs/ols/ols.html                             |   12 +-
 docs/pandas_1/pandas_1.html                   |  100 +-
 docs/pandas_2/pandas_2.html                   |  144 +-
 docs/pandas_3/pandas_3.html                   |  122 +-
 docs/regex/regex.html                         |   54 +-
 docs/sampling/sampling.html                   |   40 +-
 .../figure-html/cell-13-output-2.png          |  Bin 33006 -> 33117 bytes
 .../figure-html/cell-15-output-2.png          |  Bin 56833 -> 58359 bytes
 docs/search.json                              |  134 +-
 docs/visualization_1/visualization_1.html     |   50 +-
 .../figure-pdf/cell-10-output-2.pdf           |  Bin 14751 -> 14751 bytes
 .../figure-pdf/cell-11-output-1.pdf           |  Bin 11421 -> 11421 bytes
 .../figure-pdf/cell-12-output-1.pdf           |  Bin 12962 -> 12962 bytes
 .../figure-pdf/cell-13-output-1.pdf           |  Bin 15653 -> 15653 bytes
 .../figure-pdf/cell-14-output-1.pdf           |  Bin 13198 -> 13198 bytes
 .../figure-pdf/cell-15-output-1.pdf           |  Bin 13903 -> 13903 bytes
 .../figure-pdf/cell-17-output-2.pdf           |  Bin 16169 -> 16169 bytes
 .../figure-pdf/cell-18-output-2.pdf           |  Bin 11504 -> 11504 bytes
 .../figure-pdf/cell-19-output-2.pdf           |  Bin 13869 -> 13869 bytes
 .../figure-pdf/cell-20-output-2.pdf           |  Bin 14660 -> 14660 bytes
 .../figure-pdf/cell-21-output-1.pdf           |  Bin 11648 -> 11648 bytes
 .../figure-pdf/cell-22-output-1.pdf           |  Bin 11461 -> 11461 bytes
 .../figure-pdf/cell-23-output-1.pdf           |  Bin 12128 -> 12128 bytes
 .../figure-pdf/cell-3-output-1.pdf            |  Bin 11274 -> 11274 bytes
 .../figure-pdf/cell-4-output-1.pdf            |  Bin 11328 -> 11328 bytes
 .../figure-pdf/cell-5-output-1.pdf            |  Bin 11395 -> 11395 bytes
 .../figure-pdf/cell-7-output-1.pdf            |  Bin 23251 -> 23251 bytes
 .../figure-pdf/cell-8-output-1.pdf            |  Bin 11931 -> 11931 bytes
 .../figure-pdf/cell-9-output-1.pdf            |  Bin 13379 -> 13379 bytes
 docs/visualization_2/visualization_2.html     |   56 +-
 .../figure-html/cell-18-output-1.png          |  Bin 98344 -> 98206 bytes
 .../figure-pdf/cell-10-output-1.pdf           |  Bin 10169 -> 10169 bytes
 .../figure-pdf/cell-11-output-1.pdf           |  Bin 5887 -> 5887 bytes
 .../figure-pdf/cell-12-output-1.pdf           |  Bin 11927 -> 11927 bytes
 .../figure-pdf/cell-13-output-1.pdf           |  Bin 14012 -> 14012 bytes
 .../figure-pdf/cell-14-output-1.pdf           |  Bin 13643 -> 13643 bytes
 .../figure-pdf/cell-15-output-1.pdf           |  Bin 13905 -> 13905 bytes
 .../figure-pdf/cell-16-output-1.pdf           |  Bin 17703 -> 17703 bytes
 .../figure-pdf/cell-17-output-1.pdf           |  Bin 15914 -> 15914 bytes
 .../figure-pdf/cell-18-output-1.pdf           |  Bin 17750 -> 17730 bytes
 .../figure-pdf/cell-19-output-1.pdf           |  Bin 15715 -> 15715 bytes
 .../figure-pdf/cell-20-output-1.pdf           |  Bin 14911 -> 14911 bytes
 .../figure-pdf/cell-21-output-1.pdf           |  Bin 40952 -> 40952 bytes
 .../figure-pdf/cell-22-output-1.pdf           |  Bin 13919 -> 13919 bytes
 .../figure-pdf/cell-23-output-1.pdf           |  Bin 14978 -> 14978 bytes
 .../figure-pdf/cell-24-output-1.pdf           |  Bin 16210 -> 16210 bytes
 .../figure-pdf/cell-25-output-2.pdf           |  Bin 16563 -> 16563 bytes
 .../figure-pdf/cell-26-output-1.pdf           |  Bin 14791 -> 14791 bytes
 .../figure-pdf/cell-3-output-1.pdf            |  Bin 12068 -> 12068 bytes
 .../figure-pdf/cell-4-output-1.pdf            |  Bin 9274 -> 9274 bytes
 .../figure-pdf/cell-5-output-1.pdf            |  Bin 10244 -> 10244 bytes
 .../figure-pdf/cell-6-output-1.pdf            |  Bin 10243 -> 10243 bytes
 .../figure-pdf/cell-7-output-1.pdf            |  Bin 10130 -> 10130 bytes
 .../figure-pdf/cell-8-output-1.pdf            |  Bin 12591 -> 12591 bytes
 .../figure-pdf/cell-9-output-1.pdf            |  Bin 11286 -> 11286 bytes
 index.tex                                     |  902 +++++++++++-
 108 files changed, 2701 insertions(+), 476 deletions(-)
 create mode 100644 docs/cv_regularization/cv_reg.html
 create mode 100644 docs/cv_regularization/images/constrained_gd.png
 create mode 100644 docs/cv_regularization/images/cross_validation.png
 create mode 100644 docs/cv_regularization/images/diamond.png
 create mode 100644 docs/cv_regularization/images/diamondpoint.png
 create mode 100644 docs/cv_regularization/images/diamondreg.png
 create mode 100644 docs/cv_regularization/images/green_constrained_gd_sol.png
 create mode 100644 docs/cv_regularization/images/hyperparameter_tuning.png
 create mode 100644 docs/cv_regularization/images/largerq.png
 create mode 100644 docs/cv_regularization/images/model_selection.png
 create mode 100644 docs/cv_regularization/images/possible_validation_sets.png
 create mode 100644 docs/cv_regularization/images/simple_under_overfit.png
 create mode 100644 docs/cv_regularization/images/summary.png
 create mode 100644 docs/cv_regularization/images/train-test-split.png
 create mode 100644 docs/cv_regularization/images/training_validation_curve.png
 create mode 100644 docs/cv_regularization/images/unconstrained.png
 create mode 100644 docs/cv_regularization/images/validation-split.png
 create mode 100644 docs/cv_regularization/images/validation_set.png
 create mode 100644 docs/cv_regularization/images/verylarge.png

diff --git a/_quarto.yml b/_quarto.yml
index 840b29ec..f42d47ce 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -31,7 +31,7 @@ book:
     - gradient_descent/gradient_descent.qmd
     - feature_engineering/feature_engineering.qmd
     - case_study_HCE/case_study_HCE.qmd
-    # - cv_regularization/cv_reg.qmd
+    - cv_regularization/cv_reg.qmd
    # - probability_1/probability_1.qmd
    # - probability_2/probability_2.qmd
    # - inference_causality/inference_causality.qmd
diff --git a/cv_regularization/cv_reg.qmd b/cv_regularization/cv_reg.qmd
index c7e03a40..0d1614f8 100644
--- a/cv_regularization/cv_reg.qmd
+++ b/cv_regularization/cv_reg.qmd
@@ -18,7 +18,7 @@ jupyter:
     format_version: '1.0'
     jupytext_version: 1.16.1
   kernelspec:
-    display_name: Python 3 (ipykernel)
+    display_name: ds100env
     language: python
     name: python3
 ---
@@ -39,7 +39,7 @@ To answer this question, we will need to address two things: first, we need to u
train-test-split
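The split pictured above can be sketched in a few lines of plain Python. This is a minimal stand-in for `sklearn.model_selection.train_test_split`, and the data below is made up for illustration:

```python
import random

def train_val_split(X, y, val_frac=0.2, seed=42):
    """Shuffle the indices, then hold out the first val_frac of them for validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # shuffle so the split is random
    n_val = int(len(X) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return X_train, X_val, y_train, y_val

# Hypothetical toy dataset: 10 points, one feature each.
X = [[i] for i in range(10)]
y = list(range(10))
X_train, X_val, y_train, y_val = train_val_split(X, y)  # 8 training, 2 validation points
```

Shuffling before splitting matters: without it, any ordering in the data (say, by collection date) would leak into the split.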

-From the last lecture, we learned that *increasing* model complexity *decreased* our model's training error but *increased* its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn't been seen before. For this reason, a low training error is not always representative of our model's underlying performance -- we need to also assess how well it performs on unseen data to ensure that it is not overfitting.
+From Lecture 14, we learned that *increasing* model complexity *decreased* our model's training error but *increased* its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn't been seen before. For this reason, a low training error is not always representative of our model's underlying performance -- we also need to assess how well it performs on unseen data to ensure that it is not overfitting.
 
 Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.
 
@@ -143,7 +143,7 @@ Our goal is to train a model with complexity near the orange dotted line – thi
 
 ### K-Fold Cross-Validation
 
-Introducing a validation set gave us an "extra" chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.
+Introducing a validation set gave us one "extra" chance to assess model performance on another set of unseen data. We are able to fine-tune the model design based on its performance on this *one* set of validation data.
 
 But what if, by random chance, our validation set just happened to contain many outliers?
 It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model's performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.
 
@@ -160,7 +160,7 @@ The common term for one of these chunks is a **fold**. In the example above, we
 
 In **cross-validation**, we perform validation splits for each fold in the training set. For a dataset with $K$ folds, we:
 
 1. Pick one fold to be the validation fold
-2. Fit the model to training data from every fold *other* than the validation fold
+2. Train the model on data from every fold *other* than the validation fold
 3. Compute the model's error on the validation fold and record it
 4. Repeat for all $K$ folds
 
@@ -183,7 +183,7 @@ Some examples of hyperparameters in Data 100 are:
 
 To select a hyperparameter value via cross-validation, we first list out several "guesses" for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error.
 
-For example, we may wish to use cross-validation to decide what value we should use for $\alpha$, which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best $\alpha$, like 0.1, 1, and 10. For each possible value, we perform cross-validation to see what error the model has when we use that value of $\alpha$ to train it.
+For example, we may wish to use cross-validation to decide what value we should use for $\alpha$, which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best $\alpha$, like 0.1, 1, and 10. For each possible value, we apply 3-fold cross-validation to see what error the model has when we use that value of $\alpha$ to train it.
hyperparameter_tuning
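The fold-by-fold procedure above can be sketched in plain Python. This is a toy sketch rather than the notes' actual code: the "model" here is simply the training-set mean instead of a gradient-descent learner, and the folds are contiguous rather than shuffled. To tune a hyperparameter such as $\alpha$, you would call `cv_error` once per candidate value and keep the value with the smallest result.

```python
import statistics

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k equal-sized contiguous folds."""
    fold_size = n // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def cv_error(data, k, fit, loss):
    """Average validation error over k folds, holding each fold out once."""
    errors = []
    for val_idx in k_fold_indices(len(data), k):
        held_out = set(val_idx)
        train = [x for i, x in enumerate(data) if i not in held_out]
        val = [data[i] for i in val_idx]
        model = fit(train)                                           # step 2: train on the other folds
        errors.append(statistics.mean(loss(model, v) for v in val))  # step 3: record validation error
    return statistics.mean(errors)                                   # average across all k folds

# Toy run: the "model" is the training mean, the loss is squared error.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
err = cv_error(data, k=3, fit=statistics.mean, loss=lambda m, v: (m - v) ** 2)
# err == 6.25 for this data
```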
@@ -213,7 +213,7 @@ What if, instead of fully removing particular features, we kept all features and
 
 What do we mean by a "little bit"? Consider the case where some parameter $\theta_i$ is close to or equal to 0. Then, feature $\phi_i$ barely impacts the prediction – the feature is weighted by such a small value that its presence doesn't significantly change the value of $\hat{\mathbb{Y}}$. If we restrict how large each parameter $\theta_i$ can be, we restrict how much feature $\phi_i$ contributes to the model. This has the effect of *reducing* model complexity.
 
-In **regularization**, we restrict model complexity by putting a limit on the *magnitudes* of the model parameters $\theta_i$.
+In **regularization**, we restrict model complexity by *putting a limit* on the magnitudes of the model parameters $\theta_i$.
 
 What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number $Q$. In other words:
 
@@ -258,7 +258,7 @@ Consider the extreme case of when $Q$ is extremely large. In this situation, our
 
 Now what if $Q$ was extremely small? Most parameters are then set to (essentially) 0.
 
 * If the model has no intercept term: $\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = 0$.
-* If the model has an intercept term: $\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0$. Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.
+* If the model has an intercept term: $\hat{\mathbb{Y}} = \theta_0 + (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0$. Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.
 
 Let's summarize what we have seen.
 
@@ -290,7 +290,7 @@ Notice that we've replaced the constraint with a second term in our objective fu
 
 1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
 2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$
 
-The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
+The $\lambda$ hyperparameter controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
 
 - Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. In order to neutralize the $\infty$ and minimize this term, we set $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model
 
@@ -337,7 +337,7 @@ Recall that by applying regularization, we give our model a "budget" for how i
 
 We can avoid this issue by **scaling** the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform **standardization** such that all features have mean 0 and standard deviation 1; essentially, we replace everything with its Z-score.
 
-$$z_i = \frac{x_i - \mu}{\sigma}$$
+$$z_k = \frac{x_k - \mu_k}{\sigma_k}$$
 
 ### L2 (Ridge) Regularization
 
@@ -389,6 +389,7 @@ Our regression models are summarized below.
 Note the objective function is what
 
 | Type | Model | Loss | Regularization | Objective Function | Solution |
 |-----------------|----------------------------------------|---------------|----------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
-| OLS | $\hat{\mathbb{Y}} = \mathbb{X}\theta$ | MSE | None | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X} \theta\|^2_2$ | $\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ if $\mathbb{X}$ is full column rank |
+| OLS | $\hat{\mathbb{Y}} = \mathbb{X}\theta$ | MSE | None | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X} \theta\|^2_2$ | $\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ if $\mathbb{X}$ is full-column rank |
 | Ridge | $\hat{\mathbb{Y}} = \mathbb{X} \theta$ | MSE | L2 | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \theta_i^2$ | $\hat{\theta}_{ridge} = (\mathbb{X}^{\top}\mathbb{X} + n \lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}$ |
 | LASSO | $\hat{\mathbb{Y}} = \mathbb{X} \theta$ | MSE | L1 | $\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \vert \theta_i \vert$ | No closed form solution |
 
+
diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html
index 6db0a4ed..a1105df1 100644
--- a/docs/case_study_HCE/case_study_HCE.html
+++ b/docs/case_study_HCE/case_study_HCE.html
@@ -64,6 +64,7 @@
 
+
 
@@ -240,6 +241,12 @@
 15  Case Study in Human Contexts and Ethics
+
+
@@ -1084,6 +1091,9 @@
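The ridge row of the table above is the only regularized model with a closed-form solution, $\hat{\theta}_{ridge} = (\mathbb{X}^{\top}\mathbb{X} + n \lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}$. A quick numeric sanity check with `numpy`; the synthetic design matrix and true parameters below are made up for illustration:

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """theta_ridge = (X^T X + n * lambda * I)^(-1) X^T Y, as in the summary table."""
    n, p = X.shape
    # Solve the linear system rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # synthetic design matrix, n=100, p=3
theta_true = np.array([1.0, -2.0, 0.5])             # made-up "true" parameters
Y = X @ theta_true + rng.normal(scale=0.1, size=100)

theta_unreg = ridge_closed_form(X, Y, lam=0.0)      # lambda = 0 reduces to the OLS solution
theta_shrunk = ridge_closed_form(X, Y, lam=10.0)    # larger lambda shrinks the parameters toward 0
```

With $\lambda = 0$ the estimate should land close to the true parameters, and increasing $\lambda$ should shrink the parameter vector's norm, matching the $\lambda \approx \frac{1}{Q}$ intuition above.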