From 7bd5855f450efa63cd68aae6ba475ce63ea62bf3 Mon Sep 17 00:00:00 2001 From: Nikhil Reddy Date: Mon, 21 Oct 2024 13:57:55 -0700 Subject: [PATCH] add ols recap to note 13 --- .../loss_transformations.html | 28 +- .../figure-pdf/cell-13-output-1.pdf | Bin 9193 -> 9193 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 15000 -> 15000 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 8394 -> 8394 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 11041 -> 11041 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 103470 -> 103470 bytes .../figure-pdf/cell-7-output-2.pdf | Bin 11239 -> 11239 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 9752 -> 9752 bytes docs/eda/eda.html | 156 +- .../eda_files/figure-pdf/cell-62-output-1.pdf | Bin 16671 -> 16671 bytes .../eda_files/figure-pdf/cell-67-output-1.pdf | Bin 10991 -> 10991 bytes .../eda_files/figure-pdf/cell-68-output-1.pdf | Bin 12638 -> 12638 bytes .../eda_files/figure-pdf/cell-69-output-1.pdf | Bin 9239 -> 9239 bytes .../eda_files/figure-pdf/cell-71-output-1.pdf | Bin 19825 -> 19825 bytes .../eda_files/figure-pdf/cell-75-output-1.pdf | Bin 16799 -> 16799 bytes .../eda_files/figure-pdf/cell-76-output-1.pdf | Bin 21577 -> 21577 bytes .../eda_files/figure-pdf/cell-77-output-1.pdf | Bin 11851 -> 11851 bytes .../feature_engineering.html | 24 +- .../figure-pdf/cell-8-output-2.pdf | Bin 9247 -> 9247 bytes .../figure-pdf/cell-9-output-2.pdf | Bin 9545 -> 9545 bytes docs/gradient_descent/gradient_descent.html | 1308 ++++++++++------- .../figure-pdf/cell-21-output-2.pdf | Bin 11767 -> 11767 bytes .../images/ols_matrices_new.png | Bin 0 -> 61834 bytes .../images/ols_matrices_old.png | Bin 0 -> 62462 bytes .../images/ols_solution_matrices.png | Bin 0 -> 81283 bytes docs/intro_to_modeling/intro_to_modeling.html | 16 +- .../figure-html/cell-2-output-1.png | Bin 86900 -> 86618 bytes .../figure-pdf/cell-2-output-1.pdf | Bin 9949 -> 9964 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 15408 -> 15408 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 14938 
-> 14938 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 16000 -> 16000 bytes docs/ols/ols.html | 6 +- docs/pandas_1/pandas_1.html | 94 +- docs/pandas_2/pandas_2.html | 142 +- docs/pandas_3/pandas_3.html | 116 +- docs/regex/regex.html | 48 +- docs/sampling/sampling.html | 34 +- .../figure-html/cell-13-output-2.png | Bin 33189 -> 33006 bytes .../figure-html/cell-15-output-2.png | Bin 57953 -> 56833 bytes docs/search.json | 114 +- docs/visualization_1/visualization_1.html | 44 +- .../figure-pdf/cell-10-output-2.pdf | Bin 14751 -> 14751 bytes .../figure-pdf/cell-11-output-1.pdf | Bin 11421 -> 11421 bytes .../figure-pdf/cell-12-output-1.pdf | Bin 12962 -> 12962 bytes .../figure-pdf/cell-13-output-1.pdf | Bin 15653 -> 15653 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 13198 -> 13198 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 13903 -> 13903 bytes .../figure-pdf/cell-17-output-2.pdf | Bin 16169 -> 16169 bytes .../figure-pdf/cell-18-output-2.pdf | Bin 11504 -> 11504 bytes .../figure-pdf/cell-19-output-2.pdf | Bin 13869 -> 13869 bytes .../figure-pdf/cell-20-output-2.pdf | Bin 14660 -> 14660 bytes .../figure-pdf/cell-21-output-1.pdf | Bin 11648 -> 11648 bytes .../figure-pdf/cell-22-output-1.pdf | Bin 11461 -> 11461 bytes .../figure-pdf/cell-23-output-1.pdf | Bin 12128 -> 12128 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 11274 -> 11274 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 11328 -> 11328 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 11395 -> 11395 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 23251 -> 23251 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 11931 -> 11931 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 13379 -> 13379 bytes docs/visualization_2/visualization_2.html | 50 +- .../figure-html/cell-18-output-1.png | Bin 98480 -> 98344 bytes .../figure-pdf/cell-10-output-1.pdf | Bin 10169 -> 10169 bytes .../figure-pdf/cell-11-output-1.pdf | Bin 5887 -> 5887 bytes .../figure-pdf/cell-12-output-1.pdf | Bin 11927 -> 11927 bytes 
.../figure-pdf/cell-13-output-1.pdf | Bin 14012 -> 14012 bytes .../figure-pdf/cell-14-output-1.pdf | Bin 13643 -> 13643 bytes .../figure-pdf/cell-15-output-1.pdf | Bin 13905 -> 13905 bytes .../figure-pdf/cell-16-output-1.pdf | Bin 17703 -> 17703 bytes .../figure-pdf/cell-17-output-1.pdf | Bin 15914 -> 15914 bytes .../figure-pdf/cell-18-output-1.pdf | Bin 17735 -> 17750 bytes .../figure-pdf/cell-19-output-1.pdf | Bin 15715 -> 15715 bytes .../figure-pdf/cell-20-output-1.pdf | Bin 14911 -> 14911 bytes .../figure-pdf/cell-21-output-1.pdf | Bin 40952 -> 40952 bytes .../figure-pdf/cell-22-output-1.pdf | Bin 13919 -> 13919 bytes .../figure-pdf/cell-23-output-1.pdf | Bin 14978 -> 14978 bytes .../figure-pdf/cell-24-output-1.pdf | Bin 16210 -> 16210 bytes .../figure-pdf/cell-25-output-2.pdf | Bin 16563 -> 16563 bytes .../figure-pdf/cell-26-output-1.pdf | Bin 14791 -> 14791 bytes .../figure-pdf/cell-3-output-1.pdf | Bin 12068 -> 12068 bytes .../figure-pdf/cell-4-output-1.pdf | Bin 9274 -> 9274 bytes .../figure-pdf/cell-5-output-1.pdf | Bin 10244 -> 10244 bytes .../figure-pdf/cell-6-output-1.pdf | Bin 10243 -> 10243 bytes .../figure-pdf/cell-7-output-1.pdf | Bin 10130 -> 10130 bytes .../figure-pdf/cell-8-output-1.pdf | Bin 12591 -> 12591 bytes .../figure-pdf/cell-9-output-1.pdf | Bin 11286 -> 11286 bytes gradient_descent/gradient_descent.qmd | 91 ++ gradient_descent/images/ols_matrices_new.png | Bin 0 -> 61834 bytes gradient_descent/images/ols_matrices_old.png | Bin 0 -> 62462 bytes .../images/ols_solution_matrices.png | Bin 0 -> 81283 bytes index.tex | 285 +++- 91 files changed, 1492 insertions(+), 1064 deletions(-) create mode 100644 docs/gradient_descent/images/ols_matrices_new.png create mode 100644 docs/gradient_descent/images/ols_matrices_old.png create mode 100644 docs/gradient_descent/images/ols_solution_matrices.png create mode 100644 gradient_descent/images/ols_matrices_new.png create mode 100644 gradient_descent/images/ols_matrices_old.png create mode 100644 
gradient_descent/images/ols_solution_matrices.png
diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html
index 9e3783d7..0eca156f 100644
--- a/docs/constant_model_loss_transformations/loss_transformations.html
+++ b/docs/constant_model_loss_transformations/loss_transformations.html
@@ -477,7 +477,7 @@

+
Code
import numpy as np
@@ -492,7 +492,7 @@ 

data_linear = dugongs[["Length", "Age"]]

-
+
Code
# Big font helper
@@ -514,7 +514,7 @@ 

plt.style.use("default") # Revert style to default mpl

-
+
Code
# Constant Model + MSE
@@ -547,7 +547,7 @@ 

+
Code
# SLR + MSE
@@ -610,7 +610,7 @@ 

+
Code
# Predictions
@@ -622,7 +622,7 @@ 

yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]

-
+
Code
# Constant Model Rug Plot
@@ -652,7 +652,7 @@ 

+
Code
# SLR model scatter plot 
@@ -766,7 +766,7 @@ 

11.4 Comparing Loss Functions

We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.

-
+
drinks = np.array([20, 21, 22, 29, 33])
 drinks
@@ -774,7 +774,7 @@

+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -784,7 +784,7 @@

Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
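
As a rough numerical check of the comparison above, the sketch below evaluates both cost functions for the constant model over a grid of candidate values of \(\theta\) and picks the minimizer of each. The helper names `mse_constant` and `mae_constant` and the grid bounds are illustrative assumptions, not part of the original notes.

```python
import numpy as np

drinks = np.array([20, 21, 22, 29, 33])

def mse_constant(theta, y):
    """Mean squared error of the constant prediction theta (hypothetical helper)."""
    return np.mean((y - theta) ** 2)

def mae_constant(theta, y):
    """Mean absolute error of the constant prediction theta (hypothetical helper)."""
    return np.mean(np.abs(y - theta))

# Candidate thetas on a grid with spacing 0.01 (grid bounds chosen for illustration)
thetas = np.linspace(15, 40, 2501)
mse_values = np.array([mse_constant(t, drinks) for t in thetas])
mae_values = np.array([mae_constant(t, drinks) for t in thetas])

# The grid minimizers should land near the mean (MSE) and the median (MAE)
theta_mse = thetas[np.argmin(mse_values)]
theta_mae = thetas[np.argmin(mae_values)]
print(theta_mse, theta_mae)
```

A grid search is used here only because it makes no assumption about differentiability, so the same code works for both the smooth MSE and the kinked MAE.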

How do outliers affect each cost function? Imagine we add an outlying value of 1033 to the dataset. The mean of the data increases substantially, while the median is nearly unaffected.

-
+
drinks_with_outlier = np.append(drinks, 1033)
 display(drinks_with_outlier)
 np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
@@ -798,7 +798,7 @@

This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.

Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.

-
+
drinks_with_additional_observation = np.append(drinks, 35)
 drinks_with_additional_observation
@@ -870,7 +870,7 @@

+
Code
# `corrcoef` computes the correlation coefficient between two variables
@@ -902,7 +902,7 @@ 

and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age" \()\) ) would achieve just that.

An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).

-
+
Code
z = np.log(y)
@@ -937,7 +937,7 @@ 

\[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y = C e^{k x}\]

For some constants \(C\) and \(k\).

\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.
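
The relationship between the linear fit on \(\log(y)\) and the constants \(C\) and \(k\) can be sketched in code. The example below is a minimal illustration on synthetic data (not the dugong dataset); the values `C_true` and `k_true` are made up, and `np.polyfit` stands in for whichever fitting routine the notes use.

```python
import numpy as np

# Synthetic, illustrative data (an assumption, not the dugong data):
# y follows C * exp(k * x) exactly
C_true, k_true = 2.0, 1.5
x = np.linspace(1.0, 3.0, 20)
y = C_true * np.exp(k_true * x)

# Fit a line to the log-transformed response: log(y) = theta_0 + theta_1 * x
z = np.log(y)
theta_1_hat, theta_0_hat = np.polyfit(x, z, 1)  # coefficients come highest degree first

# Undo the transformation: C = e^{theta_0}, k = theta_1
C_hat = np.exp(theta_0_hat)
k_hat = theta_1_hat
print(C_hat, k_hat)
```

With noiseless data the recovered constants match the ones used to generate it; on real data the fit minimizes squared error in \(\log(y)\), not in \(y\) itself.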

-
+
Code
plt.figure(dpi=120, figsize=(4, 3))
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index 378390164680e6548cb15e0c09f67b57a610a937..c171c477fbe243cd28ad98f5c74cb174e6a87e35 100644
GIT binary patch
delta 18
acmaFq{?dKJI|Wu_Q)2_O&7T#%F#!Nj@dx7o

delta 18
acmaFq{?dKJI|WuF15*>z&7T#%F#!Nj;0NLW

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index b8a03a0aa37c9c71683836d7beb7bb81bc5c9ad0..bd75b662c83dc6975204d2f9f52ed8ebc96cdf38 100644
GIT binary patch
delta 18
ZcmbPHI-_($ff=i@sj-3i=2A0d763y%1^oa3

delta 18
ZcmbPHI-_($ff=ijfvJhv=2A0d763ym1^fU2

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index adbd1819813932b960da0d24c9489a48a2393c9c..6ea932d483876fa9f33f855fe4e1e8f3a7a1cce1 100644
GIT binary patch
delta 18
acmX@*c*=3ZV_8;XQ)2`3%`aphF#!Ne2?uHb

delta 18
acmX@*c*=3ZV_8-s15*>T%`aphF#!Nd_y=hK

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index dcbde775b80414a4a8d175c73264d61ba53e1dbc..07d3fdee89819dbce4a0b2cbb6c187209228b050 100644
GIT binary patch
delta 18
ZcmZ1&wlHkNUv*YvQ)2_O%}g5d%m76H1;_vZ

delta 18
ZcmZ1&wlHkNUv*X^15*>z%}g5d%m7601;+pY

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index ea9d21db4261ea5b6019ea8541243823601e933a..1c8ae2f13c4f66669fd22174040364578ec5e5ad 100644
GIT binary patch
delta 23
fcmZ3tf^FRjwuUW?3;S7(O^prAwlC{v>|g-^aUuwv

delta 23
fcmZ3tf^FRjwuUW?3;S7(3`|W-w=e5w>|g-^aS#Zc

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 0355aceacd90657aac4b1b3382f7acd762a3e2d2..ef4487009425d620dbca8684b50944102d8a85d8 100644
GIT binary patch
delta 18
ZcmaDJ{ycocYzz%?mW#nE^~R2D$(M

diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index 94ed9624cf16c0f1a4f7cac821d755193ac0cde9..04dc09f51420705974849b6b6ec2f90390bbb49b 100644
GIT binary patch
delta 18
ZcmbQ?Gs9=YS0z?sQ)2_O&A*i7m;pt#28I9t

delta 18
ZcmbQ?Gs9=YS0z>>15*>z&A*i7m;ptk2893s

diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index ff41ca75..dd8a7edb 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -361,7 +361,7 @@ 

Data Cleaning and EDA

-
+
Code
import numpy as np
@@ -426,7 +426,7 @@ 

5.1.1.1 CSV

CSV, which stands for Comma-Separated Values, is a common tabular data format. In the past two pandas lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections and babynames datasets were stored and loaded as CSVs:

-
+
pd.read_csv("data/elections.csv").head(5)
@@ -497,7 +497,7 @@