diff --git a/feature_engineering/feature_engineering.html b/feature_engineering/feature_engineering.html
new file mode 100644
index 00000000..6d91abb2
--- /dev/null
+++ b/feature_engineering/feature_engineering.html
@@ -0,0 +1,1279 @@

Feature Engineering
Learning Outcomes

  • Recognize the value of feature engineering as a tool to improve model performance
  • Implement polynomial feature generation and one hot encoding
  • Understand the interactions between model complexity, model variance, and training error
+
+
+
+

At this point, we’ve grown quite familiar with the modeling process. We’ve introduced the concept of loss, used it to fit several types of models, and, most recently, extended our analysis to multiple regression. Along the way, we’ve forged our way through the mathematics of deriving the optimal model parameters in all its gory detail. It’s time to make our lives a little easier – let’s implement the modeling process in code!

+

In this lecture, we’ll explore two techniques for model fitting:

+
  1. Translating our derived formulas for regression to Python
  2. Using Python’s sklearn package
+

With our new programming frameworks in hand, we will also add sophistication to our models by introducing more complex features to enhance model performance.

+
+

Feature Engineering

+

At this point in the course, we’ve equipped ourselves with some powerful techniques to build and optimize models. We’ve explored how to develop models of multiple variables, as well as how to transform variables to help linearize a dataset and fit these models to maximize their performance.

+

All of this was done with one major caveat: the regression models we’ve worked with so far are all linear in the input variables. We’ve assumed that our predictions should be some combination of linear variables. While this works well in some cases, the real world isn’t always so straightforward. We’ll learn an important method to address this issue – feature engineering – and consider some new problems that can arise when we do so.

+

Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks and improve model performance.

+

Feature engineering allows you to:

+
  • Capture domain knowledge
  • Express non-linear relationships using linear models
  • Use non-numeric (qualitative) features in models
+
+
+

Feature Functions

+

A feature function describes the transformations we apply to raw features in a dataset to create a design matrix of transformed features. We typically denote the feature function as \(\Phi\) (think to yourself: “phi”-true function). When we apply the feature function to our original dataset \(\mathbb{X}\), the result, \(\Phi(\mathbb{X})\), is a transformed design matrix ready to be used in modeling.

+

For example, we might design a feature function that computes the square of an existing feature and adds it to the design matrix. In this case, our existing matrix \([x]\) is transformed to \([x, x^2]\). Its dimension increases from 1 to 2. Often, the dimension of the featurized dataset increases as seen here.

+
[figure: phi]
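To make this concrete, here is a minimal sketch of such a feature function in NumPy; the function name `phi` and the toy data below are ours, not from the original text:

```python
import numpy as np

def phi(X):
    """Feature function: augment each row [x] with its square, giving [x, x^2]."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X ** 2])

# A one-feature dataset with n = 4 rows
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Phi = phi(X)
print(Phi)        # each row is [x, x^2]
print(Phi.shape)  # (4, 2): the dimension grows from p = 1 to p' = 2
```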
+

The new features introduced by the feature function can then be used in modeling. Often, we use the symbol \(\phi_i\) to represent transformed features after feature engineering.

+

\[\hat{y} = \theta_1 x + \theta_2 x^2\] \[\hat{y}= \theta_1 \phi_1 + \theta_2 \phi_2\]

+

In matrix notation, the symbol \(\Phi\) is sometimes used to denote the design matrix after feature engineering has been performed. Note that in the usage below, \(\Phi\) is now a feature-engineered matrix, rather than a function.

+

\[\hat{\mathbb{Y}} = \Phi \theta\]

+

More formally, we describe a feature function as transforming the original \(\mathbb{R}^{n \times p}\) dataset \(\mathbb{X}\) to a featurized \(\mathbb{R}^{n \times p'}\) dataset \(\mathbb{\Phi}\), where \(p'\) is typically greater than \(p\).

+

\[\mathbb{X} \in \mathbb{R}^{n \times p} \longrightarrow \Phi \in \mathbb{R}^{n \times p'}\]

+
+
+

One Hot Encoding

+

Feature engineering opens up a whole new set of possibilities for designing better-performing models. As you will see in lab and homework, feature engineering is one of the most important parts of the entire modeling process.

+

A particularly powerful use of feature engineering is to allow us to perform regression on non-numeric features. One hot encoding is a feature engineering technique that generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data.

+

To illustrate how this works, we’ll refer back to the tips dataset from previous lectures. Consider the "day" column of the dataset:

+
+
Code
import numpy as np
import seaborn as sns
import pandas as pd
import sklearn.linear_model as lm
tips = sns.load_dataset("tips")
tips.head()
+
+
+
|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
+
+
+

At first glance, it doesn’t seem possible to fit a regression model to this data – we can’t directly perform any mathematical operations on the entry “Sun”.

+

To resolve this, we instead create a new table with a feature for each unique value in the original "day" column. We then iterate through the "day" column. For each entry in "day" we fill the corresponding feature in the new table with 1. All other features are set to 0.

+
[figure: ohe]
+


In short, each category of a categorical variable gets its own feature:
  • Value = 1 if a row belongs to the category
  • Value = 0 otherwise
+
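As a rough sketch of this manual construction (the text itself uses sklearn’s `OneHotEncoder` below), pandas’ `get_dummies` builds the same 0/1 table in one call:

```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# One column per unique value of "day"; a row gets 1 in the column for its day, 0 elsewhere
day_indicators = pd.get_dummies(tips["day"], prefix="day", dtype=float)
print(day_indicators.head())
```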

The OneHotEncoder class of sklearn (documentation) offers a quick way to perform this one-hot encoding. You will explore its use in detail in the lab. For now, recognize that we follow a very similar workflow to when we were working with the LinearRegression class: we initialize a OneHotEncoder object, fit it to our data, and finally use .transform to apply the fitted encoder.

+
+
from sklearn.preprocessing import OneHotEncoder

# Initialize a OneHotEncoder object
ohe = OneHotEncoder()

# Fit the encoder
ohe.fit(tips[["day"]])

# Use the encoder to transform the raw "day" feature
encoded_day = ohe.transform(tips[["day"]]).toarray()
encoded_day_df = pd.DataFrame(encoded_day, columns=ohe.get_feature_names_out())

encoded_day_df.head()
+
+
|   | day_Fri | day_Sat | day_Sun | day_Thur |
|---|---------|---------|---------|----------|
| 0 | 0.0     | 0.0     | 1.0     | 0.0      |
| 1 | 0.0     | 0.0     | 1.0     | 0.0      |
| 2 | 0.0     | 0.0     | 1.0     | 0.0      |
| 3 | 0.0     | 0.0     | 1.0     | 0.0      |
| 4 | 0.0     | 0.0     | 1.0     | 0.0      |
+
+
+

The one-hot encoded features can then be used in the design matrix to train a model:

+
[figure: ohemodel]
+

\[\hat{y} = \theta_1 (\text{total}\_\text{bill}) + \theta_2 (\text{size}) + \theta_3 (\text{day}\_\text{Fri}) + \theta_4 (\text{day}\_\text{Sat}) + \theta_5 (\text{day}\_\text{Sun}) + \theta_6 (\text{day}\_\text{Thur})\]

+

Or in shorthand:

+

\[\hat{y} = \theta_{1}\phi_{1} + \theta_{2}\phi_{2} + \theta_{3}\phi_{3} + \theta_{4}\phi_{4} + \theta_{5}\phi_{5} + \theta_{6}\phi_{6}\]

+

Now, the day feature (or rather, the four new boolean features that represent day) can be used to fit a model.

+

Using sklearn to fit the new model, we can determine the model coefficients, allowing us to understand how each feature impacts the predicted tip.

+
+
from sklearn.linear_model import LinearRegression

data_w_ohe = tips[["total_bill", "size", "day"]].join(encoded_day_df).drop(columns="day")
ohe_model = lm.LinearRegression(fit_intercept=False)  # Tell sklearn to not add an additional bias column. Why?
ohe_model.fit(data_w_ohe, tips["tip"])

pd.DataFrame({"Feature": data_w_ohe.columns, "Model Coefficient": ohe_model.coef_})
+
+
|   | Feature    | Model Coefficient |
|---|------------|-------------------|
| 0 | total_bill | 0.092994          |
| 1 | size       | 0.187132          |
| 2 | day_Fri    | 0.745787          |
| 3 | day_Sat    | 0.621129          |
| 4 | day_Sun    | 0.732289          |
| 5 | day_Thur   | 0.668294          |
+
+
+

For example, because the model has no separate intercept, the coefficient on day_Fri acts as a Friday-specific intercept: it is the amount added to the predicted tip for any meal served on a Friday.
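As a quick usage sketch of our own, we can reuse the fitted `ohe_model` and the column order of `data_w_ohe` from the code above to predict the tip for a hypothetical meal; the bill amount and party size are made up:

```python
# Hypothetical meal: total_bill = 30, size = 2, served on a Friday
new_meal = pd.DataFrame([[30.0, 2, 1.0, 0.0, 0.0, 0.0]], columns=data_w_ohe.columns)
predicted_tip = ohe_model.predict(new_meal)[0]
print(f"Predicted tip: ${predicted_tip:.2f}")
```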

+

When one-hot encoding, keep in mind that any set of one-hot encoded columns will always sum to a column of all ones, representing the bias column. More formally, the bias column is a linear combination of the OHE columns.

+
[figure: bias]
+

We must be careful not to include this bias column in our design matrix. Otherwise, there will be linear dependence in the model, meaning \(\mathbb{X}^{\top}\mathbb{X}\) would no longer be invertible, and our OLS estimate \(\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\) fails.
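A quick numerical check of this linear dependence, using a toy matrix of our own rather than the actual tips design matrix:

```python
import numpy as np

# A toy design matrix: an explicit all-ones intercept column plus four one-hot columns
ohe_block = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
], dtype=float)
X_with_bias = np.hstack([np.ones((5, 1)), ohe_block])

print(np.linalg.matrix_rank(X_with_bias))           # 4, not 5: the columns are linearly dependent
print(np.linalg.det(X_with_bias.T @ X_with_bias))   # ~0, so X^T X is not invertible
```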

+

To resolve this issue, we simply omit one of the one-hot encoded columns or do not include an intercept term. The adjusted design matrices are shown below.

+
[figure: remove]
+

Either approach works: no information is lost, since the omitted column can always be recovered as a linear combination of the columns we keep.
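For reference, sklearn’s `OneHotEncoder` can perform the omission for us through its `drop` parameter; a sketch assuming the same `tips` data as above:

```python
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns

tips = sns.load_dataset("tips")

# drop="first" omits the first category, so the remaining one-hot columns
# are no longer collinear with an intercept column
ohe_drop = OneHotEncoder(drop="first")
encoded = ohe_drop.fit_transform(tips[["day"]]).toarray()
print(ohe_drop.get_feature_names_out())  # one fewer column than the full encoding
```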

+
+
+

Polynomial Features

+

We have encountered a few cases now where models with linear features have performed poorly on datasets that show clear non-linear curvature.

+

As an example, consider the vehicles dataset, which contains information about cars. Suppose we want to use the hp (horsepower) of a car to predict its mpg (gas mileage in miles per gallon). If we visualize the relationship between these two variables, we see a non-linear curvature. Fitting a linear model to these variables results in a high (poor) value of MSE.

+

\[\hat{y} = \theta_0 + \theta_1 (\text{hp})\]

+
+
Code
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None
vehicles = sns.load_dataset("mpg").dropna().rename(columns={"horsepower": "hp"}).sort_values("hp")

X = vehicles[["hp"]]
Y = vehicles["mpg"]

hp_model = lm.LinearRegression()
hp_model.fit(X, Y)
hp_model_predictions = hp_model.predict(X)

sns.scatterplot(data=vehicles, x="hp", y="mpg")
plt.plot(vehicles["hp"], hp_model_predictions, c="tab:red");

print(f"MSE of model with (hp) feature: {np.mean((Y-hp_model_predictions)**2)}")
+
+
+
MSE of model with (hp) feature: 23.943662938603104
[figure: scatter plot of mpg vs. hp with the fitted linear model]

To capture non-linearity in a dataset, it makes sense to incorporate non-linear features. Let’s introduce a polynomial term, \(\text{hp}^2\), into our regression model. The model now takes the form:

+

\[\hat{y} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)\] \[\hat{y} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2\]

+

How can we fit a model with non-linear features? We can use the exact same techniques as before: ordinary least squares, gradient descent, or sklearn. This is because our new model is still a linear model. Although it contains non-linear features, it is linear with respect to the model parameters. All of our previous work on fitting models was done under the assumption that we were working with linear models. Because our new model is still linear, we can apply our existing methods to determine the optimal parameters.

+
+
# Add a hp^2 feature to the design matrix
X = vehicles[["hp"]]
X["hp^2"] = vehicles["hp"]**2

# Use sklearn to fit the model
hp2_model = lm.LinearRegression()
hp2_model.fit(X, Y)
hp2_model_predictions = hp2_model.predict(X)

sns.scatterplot(data=vehicles, x="hp", y="mpg")
plt.plot(vehicles["hp"], hp2_model_predictions, c="tab:red");

print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
+
+
MSE of model with (hp^2) feature: 18.98476890761722
[figure: scatter plot of mpg vs. hp with the degree-2 polynomial fit]

Looking a lot better! By incorporating a squared feature, we are able to capture the curvature of the dataset. Our model is now a parabola centered on our data. Notice that our new model’s error has decreased relative to the original model with linear features.
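As an aside, sklearn can also generate polynomial features automatically. Here is a sketch using the `PolynomialFeatures` transformer (our choice; the text above builds the hp^2 column by hand), which should reproduce the same degree-2 fit:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import seaborn as sns

vehicles = sns.load_dataset("mpg").dropna().rename(columns={"horsepower": "hp"})
X, Y = vehicles[["hp"]], vehicles["mpg"]

# include_bias=False because LinearRegression already fits an intercept
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_model.fit(X, Y)

print(np.mean((Y - poly_model.predict(X)) ** 2))  # should match the hp^2 model's MSE above
```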

+
+
+

Complexity and Overfitting

+

We’ve seen now that feature engineering allows us to build all sorts of features to improve the performance of the model. In particular, we saw that designing a more complex feature (squaring hp in the vehicles data previously) substantially improved the model’s ability to capture non-linear relationships. To take full advantage of this, we might be inclined to design increasingly complex features. Consider the following three models, each of different order (the highest power of the input feature that appears in the model):

+
  • Model with order 2: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)\)
  • Model with order 3: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2) + \theta_3 (\text{hp}^3)\)
  • Model with order 4: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2) + \theta_3 (\text{hp}^3) + \theta_4 (\text{hp}^4)\)
+


+
[figure: degree_comparison]
+

As we can see in the plots above, MSE continues to decrease with each additional polynomial term. To visualize it further, let’s plot models as the complexity increases from 0 to 6:

+
[figure: degree_comparison]
+

When we use our model to make predictions on the same data that was used to fit the model, we find that the MSE decreases with each additional polynomial term (as our model gets more complex). The training error is the model’s error when generating predictions from the same data that was used for training purposes. We can conclude that the training error goes down as the complexity of the model increases.

+
[figure: train_error]
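A sketch of this experiment in code (ours, not from the original): sweep the polynomial order and report the training MSE at each step. The exact numbers depend on the data, but the training error should only decrease as the order grows.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import seaborn as sns

vehicles = sns.load_dataset("mpg").dropna().rename(columns={"horsepower": "hp"})
X, Y = vehicles[["hp"]], vehicles["mpg"]

for order in range(1, 7):
    # Note: rescaling hp first would improve numerical stability at higher orders
    model = make_pipeline(PolynomialFeatures(order, include_bias=False), LinearRegression())
    model.fit(X, Y)
    training_mse = np.mean((Y - model.predict(X)) ** 2)
    print(f"order {order}: training MSE = {training_mse:.2f}")
```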
+

This seems like good news – when working on the training data, we can improve model performance by designing increasingly complex models.

+
+

Math Fact: given \(N\) non-overlapping data points (i.e., points with distinct \(x\) values), we can always find a polynomial of degree \(N-1\) that goes through all those points.

For example: there always exists a degree-4 polynomial curve that can perfectly model a dataset of 5 datapoints.
[figure: train_error]
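A small numerical illustration of this fact, on toy data of our own, using `numpy.polyfit`:

```python
import numpy as np

# Five points with distinct x values
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, -1.0, 3.0, 0.0, 5.0])

# A degree-4 polynomial (N - 1 = 4 for N = 5 points) passes through every point
coeffs = np.polyfit(x, y, deg=4)
print(np.allclose(np.polyval(coeffs, x), y))  # True: zero training error on these points
```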
+
+

However, high model complexity comes with its own set of issues. When building the vehicles models above, we trained the models on the entire dataset and then evaluated their performance on this same dataset. In reality, we are likely to instead train the model on a sample from the population, then use it to make predictions on data it didn’t encounter during training.

+

Let’s walk through a more realistic example. Say we are given a training dataset of just 6 datapoints and want to train a model to then make predictions on a different set of points. We may be tempted to make a highly complex model (e.g., degree 5), especially given that it makes perfect predictions on the training data, as is clear in the plot on the left. However, as shown in the plot on the right, this model would perform horribly on the rest of the population!

+
[figure: complex]
+

The phenomenon above is called overfitting. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to generalize well to data it didn’t encounter during training. This is a problem: we want models that are generalizable to “unseen” data.

+

Additionally, since complex models are sensitive to the specific dataset used to train them, they have high variance. A model with high variance tends to vary more dramatically when trained on different datasets. Going back to our example above, we can see our degree-5 model varies erratically when we fit it to different samples of 6 points from vehicles.

+
[figure: resamples]
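To see this variance numerically, here is a sketch of our own that fits a degree-5 model to a few random 6-point samples and compares their predictions at a single horsepower value; the seeds and the choice of hp = 100 are arbitrary:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
import seaborn as sns

vehicles = sns.load_dataset("mpg").dropna().rename(columns={"horsepower": "hp"})
test_point = pd.DataFrame({"hp": [100.0]})

for seed in [0, 1, 2]:
    sample = vehicles.sample(6, random_state=seed)  # a different 6-point training set each time
    model = make_pipeline(PolynomialFeatures(5, include_bias=False), LinearRegression())
    model.fit(sample[["hp"]], sample["mpg"])
    print(f"seed {seed}: predicted mpg at hp=100 is {model.predict(test_point)[0]:.1f}")
```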
+

We now face a dilemma: we know that we can decrease training error by increasing model complexity, but models that are too complex start to overfit and, because of their high variance, generalize poorly to new data.

+
[figure: bvt]
+

We can see that there is a clear trade-off that comes from the complexity of our model. As model complexity increases, the model’s error on the training data decreases. At the same time, the model’s variance tends to increase.

+

The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to “unseen” data. A model that is too simple won’t be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.

+

This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 17 on Cross-Validation and Regularization!

+ + + + + \ No newline at end of file diff --git a/feature_engineering/feature_engineering.qmd b/feature_engineering/feature_engineering.qmd index 25da6805..0f1d4628 100644 --- a/feature_engineering/feature_engineering.qmd +++ b/feature_engineering/feature_engineering.qmd @@ -1,5 +1,5 @@ --- -title: Sklearn and Feature Engineering +title: Feature Engineering execute: echo: true warning: false @@ -8,7 +8,7 @@ format: code-fold: false code-tools: true toc: true - toc-title: Sklearn and Feature Engineering + toc-title: Feature Engineering page-layout: full theme: - cosmo @@ -45,11 +45,11 @@ Feature engineering allows you to: * Capture domain knowledge * Express non-linear relationships using linear models -* Use non-numeric features in models +* Use non-numeric (qualitative) features in models ## Feature Functions -A **feature function** describes the transformations we apply to raw features in a dataset to create a design matrix of transformed features. We typically denote the feature function as $\Phi$ (think to yourself: "phi"-ture function). When we apply the feature function to our original dataset $\mathbb{X}$, the result, $\Phi(\mathbb{X})$, is a transformed design matrix ready to be used in modeling. +A **feature function** describes the transformations we apply to raw features in a dataset to create a design matrix of transformed features. We typically denote the feature function as $\Phi$ (think to yourself: "phi"-true function). When we apply the feature function to our original dataset $\mathbb{X}$, the result, $\Phi(\mathbb{X})$, is a transformed design matrix ready to be used in modeling. For example, we might design a feature function that computes the square of an existing feature and adds it to the design matrix. In this case, our existing matrix $[x]$ is transformed to $[x, x^2]$. Its *dimension* increases from 1 to 2. Often, the dimension of the *featurized* dataset increases as seen here. @@ -77,7 +77,11 @@ To illustrate how this works, we'll refer back to the `tips` dataset from previo ```{python} #| code-fold: true +#| vscode: {languageId: python} import numpy as np +import seaborn as sns +import pandas as pd +import sklearn.linear_model as lm tips = sns.load_dataset("tips") tips.head() ``` @@ -90,10 +94,21 @@ To resolve this, we instead create a new table with a feature for each unique va
-The `OneHotEncoder` class of `sklearn` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.get_feature_names_out)) offers a quick way to perform this one-hot encoding. You will explore its use in detail in the lab. For now, recognize that we follow a very similar workflow to when we were working with the `LinearRegression` class: we initialize a `OneHotEncoder` object, fit it to our data, then use `.transform` to apply the fitted encoder. +In short, each category of a categorical variable gets its own feature + + +The `OneHotEncoder` class of `sklearn` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.get_feature_names_out)) offers a quick way to perform this one-hot encoding. You will explore its use in detail in the lab. For now, recognize that we follow a very similar workflow to when we were working with the `LinearRegression` class: we initialize a `OneHotEncoder` object, fit it to our data, and finally use `.transform` to apply the fitted encoder. ```{python} #| code-fold: false +#| vscode: {languageId: python} from sklearn.preprocessing import OneHotEncoder # Initialize a OneHotEncoder object @@ -113,17 +128,18 @@ The one-hot encoded features can then be used in the design matrix to train a mo
ohemodel
-$$\hat{y} = \theta_1 (\text{total}\textunderscore\text{bill}) + \theta_2 (\text{size}) + \theta_3 (\text{day}\textunderscore\text{Fri}) + \theta_4 (\text{day}\textunderscore\text{Sat}) + \theta_5 (\text{day}\textunderscore\text{Sun}) + \theta_6 (\text{day}\textunderscore\text{Thur})$$ +$$\hat{y} = \theta_1 (\text{total}\_\text{bill}) + \theta_2 (\text{size}) + \theta_3 (\text{day}\_\text{Fri}) + \theta_4 (\text{day}\_\text{Sat}) + \theta_5 (\text{day}\_\text{Sun}) + \theta_6 (\text{day}\_\text{Thur})$$ Or in shorthand: -$$\hat{y} = \theta_1\phi_1 + \theta_2\phi_2 + \theta_3\phi_3 + \theta_4\phi_4 + \theta_5\phi_5 + \theta_6\phi_6$$ +$$\hat{y} = \theta_{1}\phi_{1} + \theta_{2}\phi_{2} + \theta_{3}\phi_{3} + \theta_{4}\phi_{4} + \theta_{5}\phi_{5} + \theta_{6}\phi_{6}$$ Now, the `day` feature (or rather, the four new boolean features that represent day) can be used to fit a model. Using `sklearn` to fit the new model, we can determine the model coefficients, allowing us to understand how each feature impacts the predicted tip. ```{python} +#| vscode: {languageId: python} from sklearn.linear_model import LinearRegression data_w_ohe = tips[["total_bill", "size", "day"]].join(encoded_day_df).drop(columns = "day") ohe_model = lm.LinearRegression(fit_intercept=False) #Tell sklearn to not add an additional bias column. Why? @@ -138,9 +154,9 @@ When one-hot encoding, keep in mind that any set of one-hot encoded columns will
bias
-We must be careful not to include this bias column in our design matrix. Otherwise, there will be linear dependence in the model, meaning $\mathbb{X}^T\mathbb{X}$ would no longer be invertible, and our OLS estimate $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$ fails. +We must be careful not to include this bias column in our design matrix. Otherwise, there will be linear dependence in the model, meaning $\mathbb{X}^{\top}\mathbb{X}$ would no longer be invertible, and our OLS estimate $\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}$ fails. -To resolve this issue, we simply omit one of the one-hot encoded columns *or* do not include an intercept term. +To resolve this issue, we simply omit one of the one-hot encoded columns *or* do not include an intercept term. The adjusted design matrices are shown below.
remove
@@ -156,6 +172,7 @@ $$\hat{y} = \theta_0 + \theta_1 (\text{hp})$$ ```{python} #| code-fold: true +#| vscode: {languageId: python} pd.options.mode.chained_assignment = None vehicles = sns.load_dataset("mpg").dropna().rename(columns = {"horsepower": "hp"}).sort_values("hp") @@ -182,6 +199,7 @@ $$\hat{y} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2$$ How can we fit a model with non-linear features? We can use the exact same techniques as before: ordinary least squares, gradient descent, or `sklearn`. This is because our new model is still a **linear model**. Although it contains non-linear *features*, it is linear with respect to the model *parameters*. All of our previous work on fitting models was done under the assumption that we were working with linear models. Because our new model is still linear, we can apply our existing methods to determine the optimal parameters. ```{python} +#| vscode: {languageId: python} # Add a hp^2 feature to the design matrix X = vehicles[["hp"]] X["hp^2"] = vehicles["hp"]**2 @@ -197,7 +215,7 @@ plt.plot(vehicles["hp"], hp2_model_predictions, c="tab:red"); print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}") ``` -Looking a lot better! By incorporating a squared feature, we are able to capture the curvature of the dataset. Our model is now a parabola centered on our data. Notice that our new model's error has decreased relative to the original model with linear features. . +Looking a lot better! By incorporating a squared feature, we are able to capture the curvature of the dataset. Our model is now a parabola centered on our data. Notice that our new model's error has decreased relative to the original model with linear features. ## Complexity and Overfitting @@ -233,7 +251,8 @@ Let's walk through a more realistic example. Say we are given a training dataset
complex
-The phenomenon above is called **overfitting**. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to **generalize** well to data it didn't encounter during training. +The phenomenon above is called **overfitting**. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to **generalize** well to data it didn't encounter during training. This is a problem: we want models that are generalizable to “unseen” data. + Additionally, since complex models are sensitive to the specific dataset used to train them, they have high **variance**. A model with high variance tends to *vary* more dramatically when trained on different datasets. Going back to our example above, we can see our degree-5 model varies erratically when we fit it to different samples of 6 points from `vehicles`. @@ -247,5 +266,5 @@ We can see that there is a clear trade-off that comes from the complexity of our The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to "unseen" data. A model that is too simple won't be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting. -This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 16 on Cross-Validation and Regularization! +This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 17 on Cross-Validation and Regularization!