diff --git a/_quarto.yml b/_quarto.yml index bcc0d7a11..e9340bc24 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -28,7 +28,7 @@ book: - intro_to_modeling/intro_to_modeling.qmd - constant_model_loss_transformations/loss_transformations.qmd - ols/ols.qmd - # - gradient_descent/gradient_descent.qmd + - gradient_descent/gradient_descent.qmd # - feature_engineering/feature_engineering.qmd # - case_study_HCE/case_study_HCE.qmd # - cv_regularization/cv_reg.qmd diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html index 07cc2058c..1fe95e7ab 100644 --- a/docs/constant_model_loss_transformations/loss_transformations.html +++ b/docs/constant_model_loss_transformations/loss_transformations.html @@ -252,6 +252,12 @@ 12  Ordinary Least Squares + + @@ -459,7 +465,7 @@

+
Code
import numpy as np
@@ -474,7 +480,7 @@ 

data_linear = dugongs[["Length", "Age"]]

-
+
Code
# Big font helper
@@ -496,7 +502,7 @@ 

plt.style.use("default") # Revert style to default mpl

-
+
Code
# Constant Model + MSE
@@ -529,7 +535,7 @@ 

+
Code
# SLR + MSE
@@ -592,7 +598,7 @@ 

+
Code
# Predictions
@@ -604,7 +610,7 @@ 

yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]

-
+
Code
# Constant Model Rug Plot
@@ -634,7 +640,7 @@ 

+
Code
# SLR model scatter plot 
@@ -748,7 +754,7 @@ 

11.4 Comparing Loss Functions

We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.

-
+
drinks = np.array([20, 21, 22, 29, 33])
 drinks
@@ -756,7 +762,7 @@

+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -766,7 +772,7 @@

Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at its “kinks,” the sharp corners where its slope changes abruptly. We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
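To see these optima concretely, we can evaluate both cost functions on a grid of candidate \(\hat{\theta}\) values: the grid minimizer of the MSE lands at the mean, while that of the MAE lands at the median. A minimal sketch, assuming only numpy and the `drinks` data above:

```python
import numpy as np

drinks = np.array([20, 21, 22, 29, 33])

# Evaluate both cost functions over a grid of candidate theta values
thetas = np.linspace(15, 40, 501)
mse = np.array([np.mean((drinks - t) ** 2) for t in thetas])
mae = np.array([np.mean(np.abs(drinks - t)) for t in thetas])

# The MSE is minimized at the mean; the MAE at the median
print(thetas[np.argmin(mse)])  # close to np.mean(drinks) == 25.0
print(thetas[np.argmin(mae)])  # close to np.median(drinks) == 22.0
```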

How do outliers affect each cost function? Imagine we append a large outlying value, 1033, to the dataset. The mean of the data increases substantially, while the median is nearly unaffected.

-
+
drinks_with_outlier = np.append(drinks, 1033)
 display(drinks_with_outlier)
 np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
@@ -780,7 +786,7 @@

This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.

Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.

-
+
drinks_with_additional_observation = np.append(drinks, 35)
 drinks_with_additional_observation
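A side note on this experiment: with an even number of data points, the MAE no longer has a unique minimizer – its cost curve is flat between the two middle values, so every \(\hat{\theta}\) in that interval achieves the same minimal cost. A small sketch (assuming numpy) with the six-point dataset above:

```python
import numpy as np

drinks_with_additional_observation = np.array([20, 21, 22, 29, 33, 35])

# With an even number of points, the MAE is flat between the two middle
# values -- every theta in [22, 29] achieves the same (minimal) cost
mae = lambda t: np.mean(np.abs(drinks_with_additional_observation - t))
print(mae(22), mae(25.5), mae(29))  # all three are equal
```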
@@ -852,7 +858,7 @@

+
Code
# `corrcoef` computes the correlation coefficient between two variables
@@ -884,7 +890,7 @@ 

and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age" \()\) ) would achieve just that.

An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).
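This convention matches numpy: `np.log` computes the natural (base-\(e\)) logarithm, and base 10 requires `np.log10` explicitly.

```python
import numpy as np

# np.log is the natural logarithm: ln(e) = 1
print(np.log(np.e))
# np.log10 is the base-10 logarithm: log10(100) = 2
print(np.log10(100))
```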

-
+
Code
z = np.log(y)
@@ -919,7 +925,7 @@ 

\[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y = C e^{kx}\]

for some constants \(C = e^{\theta_0}\) and \(k = \theta_1\).

\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.
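As a sketch of how such a fit can be computed – shown here on synthetic data standing in for the dugongs, since the actual values aren't reproduced in this diff – we fit a line to \((x, \log y)\), then exponentiate the intercept to recover \(C\) while the slope gives \(k\):

```python
import numpy as np

# Synthetic (x, y) pairs following y = 0.1 * e^(2x) with small noise,
# standing in for the dugongs' "Length" and "Age" columns
rng = np.random.default_rng(42)
x = np.linspace(1.5, 2.7, 27)
y = 0.1 * np.exp(2.0 * x) * rng.lognormal(0, 0.05, x.size)

# Fit SLR to (x, log y): log(y) = theta_0 + theta_1 * x
theta_1, theta_0 = np.polyfit(x, np.log(y), deg=1)

C, k = np.exp(theta_0), theta_1  # y ~= C * e^(k x)
print(C, k)  # close to the true values 0.1 and 2.0
```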

-
+
Code
plt.figure(dpi=120, figsize=(4, 3))
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index bae7e593c..bc735d3e7 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index fcf504066..dec61cece 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index b2ec7eb82..5c08b56ec 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index aaf5adffd..586ab8a97 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index dfdad302a..a083da24b 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 1c5bfb789..0d07f4125 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index de3d473cf..e054b4431 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index abfd86a39..2ce797035 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -255,6 +255,12 @@
   
  12  Ordinary Least Squares
   
+ +
@@ -343,7 +349,7 @@

Data Cleaning and EDA

-
+
Code
import numpy as np
@@ -408,7 +414,7 @@ 

5.1.1.1 CSV

CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections and babynames datasets were stored and loaded as CSVs:

-
+
pd.read_csv("data/elections.csv").head(5)
@@ -479,7 +485,7 @@