From 466a72e9742287268d2f5a29e260e6ea5a65308d Mon Sep 17 00:00:00 2001 From: Ian Dong Date: Wed, 28 Feb 2024 22:51:08 -0800 Subject: [PATCH] Updated feature_engineering notes --- _quarto.yml | 2 +- feature_engineering/feature_engineering.html | 1279 ------------------ gradient_descent/gradient_descent.ipynb | 965 +++++++++++++ gradient_descent/gradient_descent.qmd | 1 - 4 files changed, 966 insertions(+), 1281 deletions(-) delete mode 100644 feature_engineering/feature_engineering.html create mode 100644 gradient_descent/gradient_descent.ipynb diff --git a/_quarto.yml b/_quarto.yml index 1e6c90e3..f8a4b473 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -29,7 +29,7 @@ book: - constant_model_loss_transformations/loss_transformations.qmd - ols/ols.qmd - gradient_descent/gradient_descent.qmd - # - feature_engineering/feature_engineering.qmd + - feature_engineering/feature_engineering.qmd # - case_study_HCE/case_study_HCE.qmd # - cv_regularization/cv_reg.qmd # - probability_1/probability_1.qmd diff --git a/feature_engineering/feature_engineering.html b/feature_engineering/feature_engineering.html deleted file mode 100644 index 6d91abb2..00000000 --- a/feature_engineering/feature_engineering.html +++ /dev/null @@ -1,1279 +0,0 @@ - - - - - - - - - -Feature Engineering - - - - - - - - - - - - - - - - - - - - - - - - - - -
- -
- -
-
- -
- - - -
- - - - -
- - - -
- - -
-
-
- -
-
-Learning Outcomes -
-
-
-
-
-
    -
  • Recognize the value of feature engineering as a tool to improve model performance
  • -
  • Implement polynomial feature generation and one hot encoding
  • -
  • Understand the interactions between model complexity, model variance, and training error
  • -
-
-
-
-

At this point, we’ve grown quite familiar with the modeling process. We’ve introduced the concept of loss, used it to fit several types of models, and, most recently, extended our analysis to multiple regression. Along the way, we’ve forged our way through the mathematics of deriving the optimal model parameters in all its gory detail. It’s time to make our lives a little easier – let’s implement the modeling process in code!

-

In this lecture, we’ll explore two techniques for model fitting:

-
    -
  1. Translating our derived formulas for regression to python
  2. -
  3. Using python’s sklearn package
  4. -
-

With our new programming frameworks in hand, we will also add sophistication to our models by introducing more complex features to enhance model performance.

-
-

Feature Engineering

-

At this point in the course, we’ve equipped ourselves with some powerful techniques to build and optimize models. We’ve explored how to develop models of multiple variables, as well as how to transform variables to help linearize a dataset and fit these models to maximize their performance.

-

All of this was done with one major caveat: the regression models we’ve worked with so far are all linear in the input variables. We’ve assumed that our predictions should be some combination of linear variables. While this works well in some cases, the real world isn’t always so straightforward. We’ll learn an important method to address this issue – feature engineering – and consider some new problems that can arise when we do so.

-

Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks and improve model performance.

-

Feature engineering allows you to:

-
    -
  • Capture domain knowledge
  • -
  • Express non-linear relationships using linear models
  • -
  • Use non-numeric (qualitative) features in models
  • -
-
-
-

Feature Functions

-

A feature function describes the transformations we apply to raw features in a dataset to create a design matrix of transformed features. We typically denote the feature function as \(\Phi\) (think to yourself: “phi”-true function). When we apply the feature function to our original dataset \(\mathbb{X}\), the result, \(\Phi(\mathbb{X})\), is a transformed design matrix ready to be used in modeling.

-

For example, we might design a feature function that computes the square of an existing feature and adds it to the design matrix. In this case, our existing matrix \([x]\) is transformed to \([x, x^2]\). Its dimension increases from 1 to 2. Often, the dimension of the featurized dataset increases as seen here.

-
-phi -
-

The new features introduced by the feature function can then be used in modeling. Often, we use the symbol \(\phi_i\) to represent transformed features after feature engineering.

-

\[\hat{y} = \theta_1 x + \theta_2 x^2\] \[\hat{y}= \theta_1 \phi_1 + \theta_2 \phi_2\]

-

In matrix notation, the symbol \(\Phi\) is sometimes used to denote the design matrix after feature engineering has been performed. Note that in the usage below, \(\Phi\) is now a feature-engineered matrix, rather than a function.

-

\[\hat{\mathbb{Y}} = \Phi \theta\]

-

More formally, we describe a feature function as transforming the original \(\mathbb{R}^{n \times p}\) dataset \(\mathbb{X}\) to a featurized \(\mathbb{R}^{n \times p'}\) dataset \(\mathbb{\Phi}\), where \(p'\) is typically greater than \(p\).

-

\[\mathbb{X} \in \mathbb{R}^{n \times p} \longrightarrow \Phi \in \mathbb{R}^{n \times p'}\]

-
-
-

One Hot Encoding

-

Feature engineering opens up a whole new set of possibilities for designing better-performing models. As you will see in lab and homework, feature engineering is one of the most important parts of the entire modeling process.

-

A particularly powerful use of feature engineering is to allow us to perform regression on non-numeric features. One hot encoding is a feature engineering technique that generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data.

-

To illustrate how this works, we’ll refer back to the tips dataset from previous lectures. Consider the "day" column of the dataset:

-
-
-Code -
import numpy as np
-import seaborn as sns
-import pandas as pd
-import sklearn.linear_model as lm
-tips = sns.load_dataset("tips")
-tips.head()
-
-
-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
- -
-
-
-

At first glance, it doesn’t seem possible to fit a regression model to this data – we can’t directly perform any mathematical operations on the entry “Sun”.

-

To resolve this, we instead create a new table with a feature for each unique value in the original "day" column. We then iterate through the "day" column. For each entry in "day" we fill the corresponding feature in the new table with 1. All other features are set to 0.

-
-ohe -
-


-In short, each category of a categorical variable gets its own feature -
    -
  • -Value = 1 if a row belongs to the category -
  • -
  • -Value = 0 otherwise -
  • -
-

The OneHotEncoder class of sklearn (documentation) offers a quick way to perform this one-hot encoding. You will explore its use in detail in the lab. For now, recognize that we follow a very similar workflow to when we were working with the LinearRegression class: we initialize a OneHotEncoder object, fit it to our data, and finally use .transform to apply the fitted encoder.

-
-
from sklearn.preprocessing import OneHotEncoder
-
-# Initialize a OneHotEncoder object
-ohe = OneHotEncoder()
-
-# Fit the encoder
-ohe.fit(tips[["day"]])
-
-# Use the encoder to transform the raw "day" feature
-encoded_day = ohe.transform(tips[["day"]]).toarray()
-encoded_day_df = pd.DataFrame(encoded_day, columns=ohe.get_feature_names_out())
-
-encoded_day_df.head()
-
-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
day_Friday_Satday_Sunday_Thur
00.00.01.00.0
10.00.01.00.0
20.00.01.00.0
30.00.01.00.0
40.00.01.00.0
- -
-
-
-

The one-hot encoded features can then be used in the design matrix to train a model:

-
-ohemodel -
-

\[\hat{y} = \theta_1 (\text{total}\_\text{bill}) + \theta_2 (\text{size}) + \theta_3 (\text{day}\_\text{Fri}) + \theta_4 (\text{day}\_\text{Sat}) + \theta_5 (\text{day}\_\text{Sun}) + \theta_6 (\text{day}\_\text{Thur})\]

-

Or in shorthand:

-

\[\hat{y} = \theta_{1}\phi_{1} + \theta_{2}\phi_{2} + \theta_{3}\phi_{3} + \theta_{4}\phi_{4} + \theta_{5}\phi_{5} + \theta_{6}\phi_{6}\]

-

Now, the day feature (or rather, the four new boolean features that represent day) can be used to fit a model.

-

Using sklearn to fit the new model, we can determine the model coefficients, allowing us to understand how each feature impacts the predicted tip.

-
-
from sklearn.linear_model import LinearRegression
-data_w_ohe = tips[["total_bill", "size", "day"]].join(encoded_day_df).drop(columns = "day")
-ohe_model = lm.LinearRegression(fit_intercept=False) #Tell sklearn to not add an additional bias column. Why?
-ohe_model.fit(data_w_ohe, tips["tip"])
-
-pd.DataFrame({"Feature":data_w_ohe.columns, "Model Coefficient":ohe_model.coef_})
-
-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FeatureModel Coefficient
0total_bill0.092994
1size0.187132
2day_Fri0.745787
3day_Sat0.621129
4day_Sun0.732289
5day_Thur0.668294
- -
-
-
-

For example, when looking at the coefficient for day_Fri, we can understand how much the fact that it is Friday impacts the predicted tip.

-

When one-hot encoding, keep in mind that any set of one-hot encoded columns will always sum to a column of all ones, representing the bias column. More formally, the bias column is a linear combination of the OHE columns.

-
-bias -
-

We must be careful not to include this bias column in our design matrix. Otherwise, there will be linear dependence in the model, meaning \(\mathbb{X}^{\top}\mathbb{X}\) would no longer be invertible, and our OLS estimate \(\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\) fails.

-

To resolve this issue, we simply omit one of the one-hot encoded columns or do not include an intercept term. The adjusted design matrices are shown below.

-
-remove -
-

Either approach works — we still retain the same information as the omitted column being a linear combination of the remaining columns.

-
-
-

Polynomial Features

-

We have encountered a few cases now where models with linear features have performed poorly on datasets that show clear non-linear curvature.

-

As an example, consider the vehicles dataset, which contains information about cars. Suppose we want to use the hp (horsepower) of a car to predict its "mpg" (gas mileage in miles per gallon). If we visualize the relationship between these two variables, we see a non-linear curvature. Fitting a linear model to these variables results in a high (poor) value of RMSE.

-

\[\hat{y} = \theta_0 + \theta_1 (\text{hp})\]

-
-
-Code -
pd.options.mode.chained_assignment = None 
-vehicles = sns.load_dataset("mpg").dropna().rename(columns = {"horsepower": "hp"}).sort_values("hp")
-
-X = vehicles[["hp"]]
-Y = vehicles["mpg"]
-
-hp_model = lm.LinearRegression()
-hp_model.fit(X, Y)
-hp_model_predictions = hp_model.predict(X)
-
-import matplotlib.pyplot as plt
-
-sns.scatterplot(data=vehicles, x="hp", y="mpg")
-plt.plot(vehicles["hp"], hp_model_predictions, c="tab:red");
-
-print(f"MSE of model with (hp) feature: {np.mean((Y-hp_model_predictions)**2)}")
-
-
-
MSE of model with (hp) feature: 23.943662938603104
-
-
-
-
-

-
-
-
-
-

To capture non-linearity in a dataset, it makes sense to incorporate non-linear features. Let’s introduce a polynomial term, \(\text{hp}^2\), into our regression model. The model now takes the form:

-

\[\hat{y} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)\] \[\hat{y} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2\]

-

How can we fit a model with non-linear features? We can use the exact same techniques as before: ordinary least squares, gradient descent, or sklearn. This is because our new model is still a linear model. Although it contains non-linear features, it is linear with respect to the model parameters. All of our previous work on fitting models was done under the assumption that we were working with linear models. Because our new model is still linear, we can apply our existing methods to determine the optimal parameters.

-
-
# Add a hp^2 feature to the design matrix
-X = vehicles[["hp"]]
-X["hp^2"] = vehicles["hp"]**2
-
-# Use sklearn to fit the model
-hp2_model = lm.LinearRegression()
-hp2_model.fit(X, Y)
-hp2_model_predictions = hp2_model.predict(X)
-
-sns.scatterplot(data=vehicles, x="hp", y="mpg")
-plt.plot(vehicles["hp"], hp2_model_predictions, c="tab:red");
-
-print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-
-
MSE of model with (hp^2) feature: 18.98476890761722
-
-
-
-
-

-
-
-
-
-

Looking a lot better! By incorporating a squared feature, we are able to capture the curvature of the dataset. Our model is now a parabola centered on our data. Notice that our new model’s error has decreased relative to the original model with linear features.

-
-
-

Complexity and Overfitting

-

We’ve seen now that feature engineering allows us to build all sorts of features to improve the performance of the model. In particular, we saw that designing a more complex feature (squaring hp in the vehicles data previously) substantially improved the model’s ability to capture non-linear relationships. To take full advantage of this, we might be inclined to design increasingly complex features. Consider the following three models, each of different order (the maximum exponent power of each model):

-
    -
  • Model with order 2: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)\)
  • -
  • Model with order 3: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2) + \theta_3 (\text{hp}^3)\)
  • -
  • Model with order 4: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2) + \theta_3 (\text{hp}^3) + \theta_4 (\text{hp}^4)\)
  • -
-


-
-degree_comparison -
-

As we can see in the plots above, MSE continues to decrease with each additional polynomial term. To visualize it further, let’s plot models as the complexity increases from 0 to 6:

-
-degree_comparison -
-

When we use our model to make predictions on the same data that was used to fit the model, we find that the MSE decreases with each additional polynomial term (as our model gets more complex). The training error is the model’s error when generating predictions from the same data that was used for training purposes. We can conclude that the training error goes down as the complexity of the model increases.

-
-train_error -
-

This seems like good news – when working on the training data, we can improve model performance by designing increasingly complex models.

-
-

Math Fact: given \(N\) overlapping data points, we can always find a polynomial of degree \(N-1\) that goes through all those points.

-For example: there always exists a degree-4 polynomial curve that can perfectly model a dataset of 5 datapoints -
-train_error -
-
-

However, high model complexity comes with its own set of issues. When building the vehicles models above, we trained the models on the entire dataset and then evaluated their performance on this same dataset. In reality, we are likely to instead train the model on a sample from the population, then use it to make predictions on data it didn’t encounter during training.

-

Let’s walk through a more realistic example. Say we are given a training dataset of just 6 datapoints and want to train a model to then make predictions on a different set of points. We may be tempted to make a highly complex model (e.g., degree 5), especially given it makes perfect predictions on the training data as clear on the left. However, as shown in the graph on the right, this model would perform horribly on the rest of the population!

-
-complex -
-

The phenomenon above is called overfitting. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to generalize well to data it didn’t encounter during training. This is a problem: we want models that are generalizable to “unseen” data.

-

Additionally, since complex models are sensitive to the specific dataset used to train them, they have high variance. A model with high variance tends to vary more dramatically when trained on different datasets. Going back to our example above, we can see our degree-5 model varies erratically when we fit it to different samples of 6 points from vehicles.

-
-resamples -
-

We now face a dilemma: we know that we can decrease training error by increasing model complexity, but models that are too complex start to overfit and can’t be reapplied to new datasets due to high variance.

-
-bvt -
-

We can see that there is a clear trade-off that comes from the complexity of our model. As model complexity increases, the model’s error on the training data decreases. At the same time, the model’s variance tends to increase.

-

The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to “unseen” data. A model that is too simple won’t be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.

-

This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 17 on Cross-Validation and Regularization!

- - -
- -
- - -
- - - - - \ No newline at end of file diff --git a/gradient_descent/gradient_descent.ipynb b/gradient_descent/gradient_descent.ipynb new file mode 100644 index 00000000..64106e4e --- /dev/null +++ b/gradient_descent/gradient_descent.ipynb @@ -0,0 +1,965 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: sklearn and Gradient Descent\n", + "execute:\n", + " echo: true\n", + " warning: false\n", + "format:\n", + " html:\n", + " code-fold: false\n", + " code-tools: true\n", + " toc: true\n", + " toc-title: sklearn and Gradient Descent\n", + " page-layout: full\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.callout-note collapse=\"false\"}\n", + "## Learning Outcomes\n", + "* Apply the `sklearn` library for model creation and training\n", + "* Optimizing complex models \n", + "* Identifying cases where straight calculus or geometric arguments won't help solve the loss function\n", + "* Applying gradient descent for numerical optimization\n", + ":::" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import plotly.express as px\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "from sklearn.linear_model import LinearRegression\n", + "pd.options.mode.chained_assignment = None # default='warn'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `sklearn`\n", + "### Implementing Derived Formulas in Code\n", + "\n", + "Throughout this lecture, we'll refer to the `penguins` dataset. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import numpy as np\n", + "\n", + "penguins = sns.load_dataset(\"penguins\")\n", + "penguins = penguins[penguins[\"species\"] == \"Adelie\"].dropna()\n", + "penguins.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our goal will be to predict the value of the `\"bill_depth_mm\"` for a particular penguin given its `\"flipper_length_mm\"` and `\"body_mass_g\"`. We'll also add a bias column of all ones to represent the intercept term of our models." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Add a bias column of all ones to `penguins`\n", + "penguins[\"bias\"] = np.ones(len(penguins), dtype=int) \n", + "\n", + "# Define the design matrix, X...\n", + "# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form\n", + "X = penguins[[\"bias\", \"flipper_length_mm\", \"body_mass_g\"]].to_numpy()\n", + "\n", + "# ...as well as the target variable, Y\n", + "# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form\n", + "Y = penguins[[\"bill_depth_mm\"]].to_numpy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the lecture on ordinary least squares, we expressed multiple linear regression using matrix notation.\n", + "\n", + "$$\\hat{\\mathbb{Y}} = \\mathbb{X}\\theta$$\n", + "\n", + "We used a geometric approach to derive the following expression for the optimal model parameters:\n", + "\n", + "$$\\hat{\\theta} = (\\mathbb{X}^T \\mathbb{X})^{-1}\\mathbb{X}^T \\mathbb{Y}$$\n", + "\n", + "That's a whole lot of matrix manipulation. How do we implement it in `python`?\n", + "\n", + "There are three operations we need to perform here: multiplying matrices, taking transposes, and finding inverses. \n", + "\n", + "* To perform matrix multiplication, use the `@` operator\n", + "* To take a transpose, call the `.T` attribute of an `NumPy` array or `DataFrame`\n", + "* To compute an inverse, use `NumPy`'s in-built method `np.linalg.inv`\n", + "\n", + "Putting this all together, we can compute the OLS estimate for the optimal model parameters, stored in the array `theta_hat`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: false\n", + "theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y\n", + "theta_hat" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make predictions using our optimized parameter values, we matrix-multiply the design matrix with the parameter vector:\n", + "\n", + "$$\\hat{\\mathbb{Y}} = \\mathbb{X}\\theta$$" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: false\n", + "Y_hat = X @ theta_hat\n", + "pd.DataFrame(Y_hat).head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `sklearn` Workflow\n", + "We've already saved a lot of time (and avoided tedious calculations) by translating our derived formulas into code. However, we still had to go through the process of writing out the linear algebra ourselves. \n", + "\n", + "To make life *even easier*, we can turn to the `sklearn` [`python` library](https://scikit-learn.org/stable/). `sklearn` is a robust library of machine learning tools used extensively in research and industry. It is the standard for simple machine learning tasks and gives us a wide variety of in-built modeling frameworks and methods, so we'll keep returning to `sklearn` techniques as we progress through Data 100. \n", + "\n", + "Regardless of the specific type of model being implemented, `sklearn` follows a standard set of steps for creating a model: \n", + "\n", + "1. Import the `LinearRegression` model from `sklearn`\n", + "\n", + " ```\n", + " from sklearn.linear_model import LinearRegression\n", + " ```\n", + "\n", + "2. Create a model object. This generates a new instance of the model class. You can think of it as making a new \"copy\" of a standard \"template\" for a model. In code, this looks like:\n", + "\n", + " ```\n", + " my_model = LinearRegression()\n", + " ```\n", + "\n", + " \n", + "3. Fit the model to the `X` design matrix and `Y` target vector. This calculates the optimal model parameters \"behind the scenes\" without us explicitly working through the calculations ourselves. The fitted parameters are then stored within the model for use in future predictions:\n", + "\n", + " ```\n", + " my_model.fit(X, Y)\n", + " ```\n", + "\n", + " \n", + "4. Use the fitted model to make predictions on the `X` input data using `.predict`. \n", + "\n", + " ```\n", + " my_model.predict(X)\n", + " ```\n", + "\n", + "To extract the fitted parameters, we can use:\n", + "\n", + " ```\n", + " my_model.coef_\n", + "\n", + " my_model.intercept_\n", + " ```\n", + "\n", + "\n", + "Let's put this into action with our multiple regression task!\n", + "\n", + "**1. Initialize an instance of the model class**\n", + "\n", + "`sklearn` stores \"templates\" of useful models for machine learning. We begin the modeling process by making a \"copy\" of one of these templates for our own use. Model initialization looks like `ModelClass()`, where `ModelClass` is the type of model we wish to create.\n", + "\n", + "For now, let's create a linear regression model using `LinearRegression`. \n", + "\n", + "`my_model` is now an instance of the `LinearRegression` class. You can think of it as the \"idea\" of a linear regression model. We haven't trained it yet, so it doesn't know any model parameters and cannot be used to make predictions. In fact, we haven't even told it what data to use for modeling! It simply waits for further instructions." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "my_model = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**2. Train the model using `.fit`**\n", + "\n", + "Before the model can make predictions, we will need to fit it to our training data. When we fit the model, `sklearn` will run gradient descent behind the scenes to determine the optimal model parameters. It will then save these model parameters to our model instance for future use. \n", + "\n", + "All `sklearn` model classes include a `.fit` method, which is used to fit the model. It takes in two inputs: the design matrix, `X`, and the target variable, `Y`. \n", + "\n", + "Let's start by fitting a model with just one feature: the flipper length. We create a design matrix `X` by pulling out the `\"flipper_length_mm\"` column from the `DataFrame`. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# .fit expects a 2D data design matrix, so we use double brackets to extract a DataFrame\n", + "X = penguins[[\"flipper_length_mm\"]]\n", + "Y = penguins[\"bill_depth_mm\"]\n", + "\n", + "my_model.fit(X, Y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that we use **double brackets** to extract this column. Why double brackets instead of just single brackets? The `.fit` method, by default, expects to receive **2-dimensional** data – some kind of data that includes both rows and columns. Writing `penguins[\"flipper_length_mm\"]` would return a 1D `Series`, causing `sklearn` to error. We avoid this by writing `penguins[[\"flipper_length_mm\"]]` to produce a 2D `DataFrame`. \n", + "\n", + "And in just three lines of code, our model has run gradient descent to determine the optimal model parameters! Our single-feature model takes the form:\n", + "\n", + "$$\\text{bill depth} = \\theta_0 + \\theta_1 \\text{flipper length}$$\n", + "\n", + "Note that `LinearRegression` will automatically include an intercept term. \n", + "\n", + "The fitted model parameters are stored as attributes of the model instance. `my_model.intercept_` will return the value of $\\hat{\\theta}_0$ as a scalar. `my_model.coef_` will return all values $\\hat{\\theta}_1, \n", + "\\hat{\\theta}_1, ...$ in an array. Because our model only contains one feature, we see just the value of $\\hat{\\theta}_1$ in the cell below." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# The intercept term, theta_0\n", + "my_model.intercept_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# All parameters theta_1, ..., theta_p\n", + "my_model.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**3. Use the fitted model to make predictions**\n", + "\n", + "Now that the model has been trained, we can use it to make predictions! To do so, we use the `.predict` method. `.predict` takes in one argument: the design matrix that should be used to generate predictions. To understand how the model performs on the training set, we would pass in the training data. Alternatively, to make predictions on unseen data, we would pass in a new dataset that wasn't used to train the model.\n", + "\n", + "Below, we call `.predict` to generate model predictions on the original training data. As before, we use double brackets to ensure that we extract 2-dimensional data." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "Y_hat_one_feature = my_model.predict(penguins[[\"flipper_length_mm\"]])\n", + "\n", + "print(f\"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_one_feature)**2))}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What if we wanted a model with two features? \n", + "\n", + "$$\\text{bill depth} = \\theta_0 + \\theta_1 \\text{flipper length} + \\theta_2 \\text{body mass}$$\n", + "\n", + "We repeat this three-step process by intializing a new model object, then calling `.fit` and `.predict` as before." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Step 1: initialize LinearRegression model\n", + "two_feature_model = LinearRegression()\n", + "\n", + "# Step 2: fit the model\n", + "X_two_features = penguins[[\"flipper_length_mm\", \"body_mass_g\"]]\n", + "Y = penguins[\"bill_depth_mm\"]\n", + "\n", + "two_feature_model.fit(X_two_features, Y)\n", + "\n", + "# Step 3: make predictions\n", + "Y_hat_two_features = two_feature_model.predict(X_two_features)\n", + "\n", + "print(f\"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also see that we obtain the same predictions using `sklearn` as we did when applying the ordinary least squares formula before! " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "pd.DataFrame({\"Y_hat from OLS\":np.squeeze(Y_hat), \"Y_hat from sklearn\":Y_hat_two_features}).head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gradient Descent \n", + "\n", + "At this point, we've grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of $\\theta$ that minimize the loss function. So far, we've optimized $\\theta$ by\n", + "\n", + "1. Using calculus to take the derivative of the loss function with respect to $\\theta$, setting it equal to 0, and solving for $\\theta$.\n", + "2. Using the geometric argument of orthogonality to derive the OLS solution $\\hat{\\theta} = (\\mathbb{X}^T \\mathbb{X})^{-1}\\mathbb{X}^T \\mathbb{Y}$.\n", + "\n", + "One thing to note, however, is that the techniques we used above can only be applied if we make some big assumptions. For the calculus approach, we assumed that the loss function was differentiable at all points and that we could algebraically solve for the zero points of the derivative; for the geometric approach, OLS *only* applies when using a linear model with MSE loss. What happens when we have more complex models with different, more complex loss functions? The techniques we've learned so far will not work, so we need a new optimization technique: **gradient descent**. \n", + "\n", + "> **BIG IDEA**: use an iterative algorithm to numerically compute the minimum of the loss.\n", + "\n", + "### Minimizing an Arbitrary 1D Function\n", + "\n", + "Let's consider an arbitrary function. Our goal is to find the value of $x$ that minimizes this function." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "def arbitrary(x):\n", + " return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "arbitrary\n", + "\n", + "#### The Naive Approach: Guess and Check\n", + "\n", + "Above, we saw that the minimum is somewhere around 5.3. Let's see if we can figure out how to find the exact minimum algorithmically from scratch. One very slow (and terrible) way would be manual guess-and-check." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "arbitrary(6)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A somewhat better (but still slow) approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "def simple_minimize(f, xs):\n", + " # Takes in a function f and a set of values xs. \n", + " # Calculates the value of the function f at all values x in xs\n", + " # Takes the minimum value of f(x) and returns the corresponding value x \n", + " y = [f(x) for x in xs] \n", + " return xs[np.argmin(y)]\n", + "\n", + "guesses = [5.3, 5.31, 5.32, 5.33, 5.34, 5.35]\n", + "simple_minimize(arbitrary, guesses)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This process is essentially the same as before where we made a graphical plot, it's just that we're only looking at 20 selected points." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "xs = np.linspace(1, 7, 200)\n", + "sparse_xs = np.linspace(1, 7, 5)\n", + "\n", + "ys = arbitrary(xs)\n", + "sparse_ys = arbitrary(sparse_xs)\n", + "\n", + "fig = px.line(x = xs, y = arbitrary(xs))\n", + "fig.add_scatter(x = sparse_xs, y = arbitrary(sparse_xs), mode = \"markers\")\n", + "fig.update_layout(showlegend= False)\n", + "fig.update_layout(autosize=False, width=800, height=600)\n", + "fig.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This basic approach suffers from three major flaws:\n", + "\n", + "1. If the minimum is outside our range of guesses, the answer will be completely wrong.\n", + "2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.\n", + "3. It is *very* computationally inefficient, considering potentially vast numbers of guesses that are useless.\n", + "\n", + "#### Scipy.optimize.minimize\n", + "\n", + "One way to minimize this mathematical function is to use the `scipy.optimize.minimize` function. It takes a function and a starting guess and tries to find the minimum." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "from scipy.optimize import minimize\n", + "\n", + "# takes a function f and a starting point x0 and returns a readout \n", + "# with the optimal input value of x which minimizes f\n", + "minimize(arbitrary, x0 = 3.5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`scipy.optimize.minimize` is great. It may also seem a bit magical. How could you write a function that can find the minimum of any mathematical function? There are a number of ways to do this, which we'll explore in today's lecture, eventually arriving at the important idea of **gradient descent**, which is the principle that `scipy.optimize.minimize` uses.\n", + "\n", + "It turns out that under the hood, the `fit` method for `LinearRegression` models uses gradient descent. Gradient descent is also how much of machine learning works, including even advanced neural network models. \n", + "\n", + "In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the underlying principles that optimization functions harness to find optimal parameters.\n", + "\n", + "\n", + "#### Digging into Gradient Descent\n", + "Looking at the function across this domain, it is clear that the function's minimum value occurs around $\\theta = 5.3$. Let's pretend for a moment that we *couldn't* see the full view of the cost function. How would we guess the value of $\\theta$ that minimizes the function? \n", + "\n", + "It turns out that the first derivative of the function can give us a clue. In the plots below, the line indicates the value of the derivative of each value of $\\theta$. The derivative is negative where it is red and positive where it is green.\n", + "\n", + "\n", + "Say we make a guess for the minimizing value of $\\theta$. Remember that we read plots from left to right, and assume that our starting $\\theta$ value is to the left of the optimal $\\hat{\\theta}$. If the guess \"undershoots\" the true minimizing value – our guess for $\\theta$ is lower than the value of the $\\hat{\\theta}$ that minimizes the function – the derivative will be **negative**. This means that if we increase $\\theta$ (move further to the right), then we **can decrease** our loss function further. If this guess \"overshoots\" the true minimizing value, the derivative will be positive, implying the converse.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
step\n", + "
\n", + "
\n", + "\n", + "We can use this pattern to help formulate our next guess for the optimal $\\hat{\\theta}$. Consider the case where we've undershot $\\theta$ by guessing too low of a value. We'll want our next guess to be greater in value than our previous guess – that is, we want to shift our guess to the right. You can think of this as following the slope \"downhill\" to the function's minimum value.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
neg_step\n", + "
\n", + "
\n", + "\n", + "If we've overshot $\\hat{\\theta}$ by guessing too high of a value, we'll want our next guess to be lower in value – we want to shift our guess for $\\hat{\\theta}$ to the left. \n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
pos_step\n", + "
\n", + "
\n", + "\n", + "In other words, the derivative of the function at each point tells us the direction of our next guess.\n", + "\n", + "* A negative slope means we want to step to the right, or move in the *positive* direction. \n", + "* A positive slope means we want to step to the left, or move in the *negative* direction.\n", + "\n", + "#### Algorithm Attempt 1\n", + "Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.\n", + "\n", + "We start by making some guess for the minimizing value of $x$. Then, we look at the derivative of the function at this value of $x$, and step downhill in the *opposite* direction. We can express our new rule as a recurrence relation:\n", + "\n", + "$$x^{(t+1)} = x^{(t)} - \\frac{d}{dx} f(x^{(t)})$$\n", + "\n", + "Translating this statement into English: we obtain **our next guess** for the minimizing value of $x$ at timestep $t+1$ ($x^{(t+1)}$) by taking **our last guess** ($x^{(t)}$) and subtracting the **derivative of the function** at that point ($\\frac{d}{dx} f(x^{(t)})$).\n", + "\n", + "A few steps are shown below, where the old step is shown as a transparent point, and the next step taken is the green-filled dot.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
grad_descent_2\n", + "
\n", + "
\n", + "\n", + "Looking pretty good! We do have a problem though – once we arrive close to the minimum value of the function, our guesses \"bounce\" back and forth past the minimum without ever reaching it.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
grad_descent_2\n", + "
\n", + "
\n", + "\n", + "In other words, each step we take when updating our guess moves us too far. We can address this by decreasing the size of each step. \n", + "\n", + "#### Algorithm Attempt 2\n", + "Let's update our algorithm to use a **learning rate** (also sometimes called the step size), which controls how far we move with each update. We represent the learning rate with $\\alpha$. \n", + "\n", + "$$x^{(t+1)} = x^{(t)} - \\alpha \\frac{d}{dx} f(x^{(t)})$$\n", + "\n", + "A small $\\alpha$ means that we will take small steps; a large $\\alpha$ means we will take large steps. When do we stop updating? We stop updating either after a fixed number of updates or after a subsequent update doesn't change much.\n", + "\n", + "Updating our function to use $\\alpha=0.3$, our algorithm successfully **converges** (settles on a solution and stops updating significantly, or at all) on the minimum value.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
grad_descent_3\n", + "
\n", + "
\n", + "\n", + "### Convexity\n", + "In our analysis above, we focused our attention on the global minimum of the loss function. You may be wondering: what about the local minimum that's just to the left? \n", + "\n", + "If we had chosen a different starting guess for $\\theta$, or a different value for the learning rate $\\alpha$, our algorithm may have gotten \"stuck\" and converged on the local minimum, rather than on the true optimum value of loss. \n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
local\n", + "
\n", + "
\n", + "\n", + "If the loss function is **convex**, gradient descent is guaranteed to converge and find the global minimum of the objective function. Formally, a function $f$ is convex if:\n", + "$$tf(a) + (1-t)f(b) \\geq f(ta + (1-t)b)$$\n", + "for all $a, b$ in the domain of $f$ and $t \\in [0, 1]$.\n", + "\n", + "To put this into words: if you drew a line between any two points on the curve, all values on the curve must be *on or below* the line. Importantly, any local minimum of a convex function is also its global minimum so we avoid the situation where the algorithm converges on some critical point that is not the minimum of the function.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
convex\n", + "
\n", + "
\n", + "\n", + "In summary, non-convex loss functions can cause problems with optimization. This means that our choice of loss function is a key factor in our modeling process. It turns out that MSE *is* convex, which is a major reason why it is such a popular choice of loss function. Gradient descent is only guaranteed to converge (given enough iterations and an appropriate step size) for convex functions.\n", + "\n", + "### Gradient Descent in 1 Dimension\n", + "\n", + "> **Terminology clarification**: In past lectures, we have used “loss” to refer to the error incurred on a *single* datapoint. In applications, we usually care more about the average error across *all* datapoints. Going forward, we will take the “model’s loss” to mean the model’s average error across the dataset. This is sometimes also known as the empirical risk, cost function, or objective function. $$L(\\theta) = R(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} L(y, \\hat{y})$$\n", + "\n", + "In our discussion above, we worked with some arbitrary function $f$. As data scientists, we will almost always work with gradient descent in the context of optimizing *models* – specifically, we want to apply gradient descent to find the minimum of a *loss function*. In a modeling context, our goal is to minimize a loss function by choosing the minimizing model *parameters*.\n", + "\n", + "Recall our modeling workflow from the past few lectures: \n", + "\n", + "1. Define a model with some parameters $\\theta_i$\n", + "2. Choose a loss function \n", + "3. Select the values of $\\theta_i$ that minimize the loss function on the data\n", + "\n", + "Gradient descent is a powerful technique for completing this last task. By applying the gradient descent algorithm, we can select values for our parameters $\\theta_i$ that will lead to the model having minimal loss on the training data.\n", + "\n", + "When using gradient descent in a modeling context, we:\n", + "\n", + "1. Make guesses for the minimizing $\\theta_i$\n", + "2. Compute the derivative of the loss function $L$\n", + "\n", + "We can \"translate\" our gradient descent rule from before by replacing $x$ with $\\theta$ and $f$ with $L$:\n", + "\n", + "$$\\theta^{(t+1)} = \\theta^{(t)} - \\alpha \\frac{d}{d\\theta} L(\\theta^{(t)})$$\n", + "\n", + "#### Gradient Descent on the `tips` Dataset \n", + "To see this in action, let's consider a case where we have a linear model with no offset. We want to predict the tip (y) given the price of a meal (x). To do this, we\n", + "\n", + "* Choose a model: $\\hat{y} = \\theta_1 x$,\n", + "* Choose a loss function: $L(\\theta) = MSE(\\theta) = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\theta_1x_i)^2$.\n", + "\n", + "Let's apply our `gradient_descent` function from before to optimize our model on the `tips` dataset. We will try to select the best parameter $\\theta_i$ to predict the `tip` $y$ from the `total_bill` $x$." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "df = sns.load_dataset(\"tips\")\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can visualize the value of the MSE on our dataset for different possible choices of $\\theta_1$. To optimize our model, we want to select the value of $\\theta_1$ that leads to the lowest MSE." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "import plotly.graph_objects as go\n", + "\n", + "def derivative_arbitrary(x):\n", + " return (4*x**3 - 45*x**2 + 160*x - 180)/10\n", + "\n", + "fig = go.Figure()\n", + "roots = np.array([2.3927, 3.5309, 5.3263])\n", + "\n", + "fig.add_trace(go.Scatter(x = xs, y = arbitrary(xs), \n", + " mode = \"lines\", name = \"f\"))\n", + "fig.add_trace(go.Scatter(x = xs, y = derivative_arbitrary(xs), \n", + " mode = \"lines\", name = \"df\", line = {\"dash\": \"dash\"}))\n", + "fig.add_trace(go.Scatter(x = np.array(roots), y = 0*roots, \n", + " mode = \"markers\", name = \"df = zero\", marker_size = 12))\n", + "fig.update_layout(font_size = 20, yaxis_range=[-1, 3])\n", + "fig.update_layout(autosize=False, width=800, height=600)\n", + "fig.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To apply gradient descent, we need to compute the derivative of the loss function with respect to our parameter $\\theta_1$.\n", + "\n", + "* Given our loss function, $$L(\\theta) = MSE(\\theta) = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\theta_1x_i)^2$$\n", + "* We take the derivative with respect to $\\theta_1$ $$\\frac{\\partial}{\\partial \\theta_{1}} L(\\theta_1^{(t)}) = \\frac{-2}{n} \\sum_{i=1}^n (y_i - \\theta_1^{(t)} x_i) x_i$$\n", + "* Which results in the gradient descent update rule\n", + "$$\\theta_1^{(t+1)} = \\theta_1^{(t)} - \\alpha \\frac{d}{d\\theta}L(\\theta_1^{(t)})$$\n", + "\n", + "for some learning rate $\\alpha$.\n", + "\n", + "Implementing this in code, we can visualize the MSE loss on the `tips` data. **MSE is convex**, so there is one global minimum." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "def gradient_descent(df, initial_guess, alpha, n):\n", + " \"\"\"Performs n steps of gradient descent on df using learning rate alpha starting\n", + " from initial_guess. Returns a numpy array of all guesses over time.\"\"\"\n", + " guesses = [initial_guess]\n", + " current_guess = initial_guess\n", + " while len(guesses) < n:\n", + " current_guess = current_guess - alpha * df(current_guess)\n", + " guesses.append(current_guess)\n", + " \n", + " return np.array(guesses)\n", + "\n", + "def mse_single_arg(theta_1):\n", + " \"\"\"Returns the MSE on our data for the given theta1\"\"\"\n", + " x = df[\"total_bill\"]\n", + " y_obs = df[\"tip\"]\n", + " y_hat = theta_1 * x\n", + " return np.mean((y_hat - y_obs) ** 2)\n", + "\n", + "def mse_loss_derivative_single_arg(theta_1):\n", + " \"\"\"Returns the derivative of the MSE on our data for the given theta1\"\"\"\n", + " x = df[\"total_bill\"]\n", + " y_obs = df[\"tip\"]\n", + " y_hat = theta_1 * x\n", + " \n", + " return np.mean(2 * (y_hat - y_obs) * x)\n", + "\n", + "loss_df = pd.DataFrame({\"theta_1\":np.linspace(-1.5, 1), \"MSE\":[mse_single_arg(theta_1) for theta_1 in np.linspace(-1.5, 1)]})\n", + "\n", + "trajectory = gradient_descent(mse_loss_derivative_single_arg, -0.5, 0.0001, 100)\n", + "\n", + "plt.plot(loss_df[\"theta_1\"], loss_df[\"MSE\"])\n", + "plt.scatter(trajectory, [mse_single_arg(guess) for guess in trajectory], c=\"white\", edgecolor=\"firebrick\")\n", + "plt.scatter(trajectory[-1], mse_single_arg(trajectory[-1]), c=\"firebrick\")\n", + "plt.xlabel(r\"$\\theta_1$\")\n", + "plt.ylabel(r\"$L(\\theta_1)$\");\n", + "\n", + "print(f\"Final guess for theta_1: {trajectory[-1]}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Gradient Descent on Multi-Dimensional Models\n", + "The function we worked with above was one-dimensional – we were only minimizing the function with respect to a single parameter, $\\theta$. However, models usually have a cost function with multiple parameters that need to be optimized. For example, simple linear regression has 2 parameters: \n", + "$$\\hat{y} + \\theta_0 + \\theta_1x$$ \n", + "and multiple linear regression has $p+1$ parameters: \n", + "$$\\mathbb{Y} = \\theta_0 + \\theta_1 \\Bbb{X}_{:,1} + \\theta_2 \\Bbb{X}_{:,2} + \\cdots + \\theta_p \\Bbb{X}_{:,p}$$\n", + "\n", + "We'll need to expand gradient descent so we can update our guesses for all model parameters all in one go.\n", + "\n", + "With multiple parameters to optimize, we consider a **loss surface**, or the model's loss for a particular *combination* of possible parameter values." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "import plotly.graph_objects as go\n", + "\n", + "\n", + "def mse_loss(theta, X, y_obs):\n", + " y_hat = X @ theta\n", + " return np.mean((y_hat - y_obs) ** 2) \n", + "\n", + "tips_with_bias = df.copy()\n", + "tips_with_bias[\"bias\"] = 1\n", + "tips_with_bias = tips_with_bias[[\"bias\", \"total_bill\"]]\n", + "\n", + "uvalues = np.linspace(0, 2, 10)\n", + "vvalues = np.linspace(-0.1, 0.35, 10)\n", + "(u,v) = np.meshgrid(uvalues, vvalues)\n", + "thetas = np.vstack((u.flatten(),v.flatten()))\n", + "\n", + "def mse_loss_single_arg(theta):\n", + " return mse_loss(theta, tips_with_bias, df[\"tip\"])\n", + "\n", + "MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])\n", + "\n", + "loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))\n", + "\n", + "ind = np.argmin(MSE)\n", + "optimal_point = go.Scatter3d(name = \"Optimal Point\",\n", + " x = [thetas.T[ind,0]], y = [thetas.T[ind,1]], \n", + " z = [MSE[ind]],\n", + " marker=dict(size=10, color=\"red\"))\n", + "\n", + "fig = go.Figure(data=[loss_surface, optimal_point])\n", + "fig.update_layout(scene = dict(\n", + " xaxis_title = \"theta0\",\n", + " yaxis_title = \"theta1\",\n", + " zaxis_title = \"MSE\"), autosize=False, width=800, height=600)\n", + "\n", + "fig.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also visualize a bird's-eye view of the loss surface from above using a contour plot:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "contour = go.Contour(x=u[0], y=v[:, 0], z=np.reshape(MSE, u.shape))\n", + "fig = go.Figure(contour)\n", + "fig.update_layout(\n", + " xaxis_title = \"theta0\",\n", + " yaxis_title = \"theta1\", autosize=False, width=800, height=600)\n", + "\n", + "fig.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### The Gradient Vector\n", + "As before, the derivative of the loss function tells us the best way towards the minimum value.\n", + "\n", + "On a 2D (or higher) surface, the best way to go down (gradient) is described by a *vector*.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
loss_surface\n", + "
\n", + "
\n", + "\n", + "> Math Aside: Partial Derivatives \n", + "\n", + "> - For an equation with multiple variables, we take a **partial derivative** by differentiating with respect to just one variable at a time. The partial derivative is denoted with a $\\partial$. Intuitively, we want to see how the function changes if we only vary one variable while holding other variables constant. \n", + "> - Using $f(x, y) = 3x^2 + y$ as an example,\n", + "> - taking the partial derivative with respect to x and treating y as a constant gives us $\\frac{\\partial f}{\\partial x} = 6x$\n", + "> - taking the partial derivative with respect to y and treating x as a constant gives us $\\frac{\\partial f}{\\partial y} = 1$\n", + "\n", + "For the *vector* of parameter values $\\vec{\\theta} = \\begin{bmatrix}\n", + " \\theta_{0} \\\\\n", + " \\theta_{1} \\\\\n", + " \\end{bmatrix}$, we take the *partial derivative* of loss with respect to each parameter: $\\frac{\\partial L}{\\partial \\theta_0}$ and $\\frac{\\partial L}{\\partial \\theta_1}$.\n", + "\n", + "> For example, consider the 2D function: $$f(\\theta_0, \\theta_1) = 8 \\theta_0^2 + 3\\theta_0\\theta_1$$\n", + "> For a function of 2 variables $f(\\theta_0, \\theta_1)$, we define the gradient \n", + "$$\n", + "\\begin{align}\n", + "\\frac{\\partial f}{\\partial \\theta_{0}} &= 16\\theta_0 + 3\\theta_1 \\\\\n", + "\\frac{\\partial f}{\\partial \\theta_{1}} &= 3\\theta_0 \\\\\n", + "\\nabla_{\\vec{\\theta}} f(\\vec{\\theta}) &= \\begin{bmatrix} 16\\theta_0 + 3\\theta_1 \\\\ 3\\theta_0 \\\\ \\end{bmatrix}\n", + "\\end{align}\n", + "$$\n", + "\n", + "\n", + "The **gradient vector** of a generic function of $p+1$ variables is therefore \n", + "$$\\nabla_{\\vec{\\theta}} L = \\begin{bmatrix} \\frac{\\partial L}{\\partial \\theta_0} \\\\ \\frac{\\partial L}{\\partial \\theta_1} \\\\ \\vdots \\end{bmatrix}$$\n", + "where $\\nabla_\\theta L$ always points in the downhill direction of the surface. We can interpret each gradient as: \"If I nudge the $i$th model weight, what happens to loss?\"\n", + "\n", + "We can use this to update our 1D gradient rule for models with multiple parameters. \n", + "\n", + "* Recall our 1D update rule: $$\\theta^{(t+1)} = \\theta^{(t)} - \\alpha \\frac{d}{d\\theta}L(\\theta^{(t)})$$ \n", + "* For models with multiple parameters, we work in terms of vectors:\n", + "$$\\begin{bmatrix}\n", + " \\theta_{0}^{(t+1)} \\\\\n", + " \\theta_{1}^{(t+1)} \\\\\n", + " \\vdots\n", + " \\end{bmatrix} = \\begin{bmatrix}\n", + " \\theta_{0}^{(t)} \\\\\n", + " \\theta_{1}^{(t)} \\\\\n", + " \\vdots\n", + " \\end{bmatrix} - \\alpha \\begin{bmatrix}\n", + " \\frac{\\partial L}{\\partial \\theta_{0}} \\\\\n", + " \\frac{\\partial L}{\\partial \\theta_{1}} \\\\\n", + " \\vdots \\\\\n", + " \\end{bmatrix}$$\n", + " \n", + "* Written in a more compact form, $$\\vec{\\theta}^{(t+1)} = \\vec{\\theta}^{(t)} - \\alpha \\nabla_{\\vec{\\theta}} L(\\theta^{(t)}) $$\n", + "\n", + " * $\\theta$ is a vector with our model weights\n", + " * $L$ is the loss function\n", + " * $\\alpha$ is the learning rate (ours is constant, but other techniques use an $\\alpha$ that decreases over time)\n", + " * $\\vec{\\theta}^{(t)}$ is the current value of $\\theta$\n", + " * $\\vec{\\theta}^{(t+1)}$ is the next value of $\\theta$\n", + " * $\\nabla_{\\vec{\\theta}} L(\\theta^{(t)})$ is the gradient of the loss function evaluated at the current $\\vec{\\theta}^{(t)}$\n", + "\n", + "\n", + "### Batch Gradient Descent and Stochastic Gradient Descent\n", + "\n", + "Formally, the algorithm we derived above is called **batch gradient descent.** For each iteration of the algorithm, the derivative of loss is computed across the *entire* batch of all $n$ datapoints. While this update rule works well in theory, it is not practical in most circumstances. For large datasets (with perhaps billions of datapoints), finding the gradient across all the data is incredibly computationally taxing; gradient descent will converge slowly because each individual update is slow.\n", + "\n", + "**Stochastic (mini-batch) gradient descent** tries to address this issue. In stochastic descent, only a *sample* of the full dataset is used at each update. We estimate the true gradient of the loss surface using just that sample of data. The **batch size** is the number of data points used in each sample. The sampling strategy is generally without replacement (data is shuffled and batch size examples are selected one at a time.)\n", + "\n", + "Each complete \"pass\" through the data is known as a **training epoch**. After shuffling the data, in a single **training epoch** of stochastic gradient descent, we\n", + "* Compute the gradient on the first x% of the data. Update the parameter guesses.\n", + "* Compute the gradient on the next x% of the data. Update the parameter guesses.\n", + "* $\\dots$\n", + "* Compute the gradient on the last x% of the data. Update the parameter guesses.\n", + "\n", + "Every data point appears once in a single training epoch. We then perform several training epochs until we're satisfied.\n", + "\n", + "Batch gradient descent is a deterministic technique – because the entire dataset is used at each update iteration, the algorithm will always advance towards the minimum of the loss surface. In contrast, stochastic gradient descent involve an element of randomness. Since only a subset of the full data is used to update the guess for $\\vec{\\theta}$ at each iteration, there's a chance the algorithm will not progress towards the true minimum of loss with each update. Over the longer term, these stochastic techniques should still converge towards the optimal solution. \n", + "\n", + "The diagrams below represent a \"bird's eye view\" of a loss surface from above. Notice that batch gradient descent takes a direct path towards the optimal $\\hat{\\theta}$. Stochastic gradient descent, in contrast, \"hops around\" on its path to the minimum point on the loss surface. This reflects the randomness of the sampling process at each update step.\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + "
stochastic\n", + "
\n", + "
\n", + "\n", + "To summarize the tradeoffs of batch size: \n", + "\n", + "| - | Smaller Batch Size | Larger Batch Size | \n", + "| -- | -- | -- | \n", + "| Pros | More frequent gradient updates | Leverage hardware acceleration to improve overall system performance and higher quality gradient updates | \n", + "| Cons | More variability in the gradient estimates | Less frequent gradient updates |\n", + "\n", + "The typical solution is to set batch size to ensure sufficient hardware utilization.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python", + "display_name": "Python 3 (ipykernel)" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/gradient_descent/gradient_descent.qmd b/gradient_descent/gradient_descent.qmd index a5e3cda2..253752f2 100644 --- a/gradient_descent/gradient_descent.qmd +++ b/gradient_descent/gradient_descent.qmd @@ -726,4 +726,3 @@ To summarize the tradeoffs of batch size: The typical solution is to set batch size to ensure sufficient hardware utilization. -