Updated logistic regression 2 course notes

DS-100 · Apr 11, 2024 · 6e1faf9
1 parent d2899f9
commit 6e1faf9

Showing 7 changed files with 919 additions and 38 deletions.
diff --git a/logistic_regression_2/images/mean_cross_entropy_loss_plot.png b/logistic_regression_2/images/mean_cross_entropy_loss_plot.png
diff --git a/logistic_regression_2/images/reg_loss_finite_argmin.png b/logistic_regression_2/images/reg_loss_finite_argmin.png
diff --git a/logistic_regression_2/images/toy_linear_separable_dataset.png b/logistic_regression_2/images/toy_linear_separable_dataset.png
diff --git a/logistic_regression_2/images/toy_linear_separable_dataset_2.png b/logistic_regression_2/images/toy_linear_separable_dataset_2.png
diff --git a/logistic_regression_2/images/unreg_loss_infinite_argmin.png b/logistic_regression_2/images/unreg_loss_infinite_argmin.png
diff --git a/logistic_regression_2/logistic_reg_2.html b/logistic_regression_2/logistic_reg_2.html
diff --git a/logistic_regression_2/logistic_reg_2.qmd b/logistic_regression_2/logistic_reg_2.qmd
@@ -18,18 +18,20 @@ jupyter: python3
 * Apply decision rules to make a classification
 * Learn when logistic regression works well and when it does not
 * Introduce new metrics for model performance
-::: 
+...
 
-Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discuss metrics that allow us to determine our model's performance in different scenarios. 
+Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction and learn about linear separability. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discuss metrics that allow us to determine our model's performance in different scenarios. 
 
 This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.
 
+## Decision Boundaries
+In logistic regression, we model the *probability* that a datapoint belongs to Class 1. 
+
 <center><img src="images/log_reg_summary.png" alt='tpr_fpr' width='800'></center>
 
-## Decision Boundaries
-In logistic regression, we model the *probability* that a datapoint belongs to Class 1. Last week, we developed the logistic regression model to predict that probability, but we never actually made any *classifications* for whether our prediction $y$ belongs in Class 0 or Class 1. 
+Last week, we developed the logistic regression model to predict that probability, but we never actually made any *classifications* for whether our prediction $y$ belongs in Class 0 or Class 1. 
 
-$$ p = P(Y=1 | x) = \frac{1}{1 + e^{-x^T\theta}}$$
+$$ p = P(Y=1 | x) = \frac{1}{1 + e^{-x^{\top}\theta}}$$
 
 A **decision rule** tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a **threshold**, $T$. If the predicted probability is greater than or equal to $T$, predict Class 1. Otherwise, predict Class 0. 
 
@@ -40,7 +42,7 @@ $$\hat y = \text{classify}(x) = \begin{cases}
 
 The threshold is often set to $T = 0.5$, but *not always*. We'll discuss why we might want to use other thresholds  $T \neq 0.5$ later in this lecture.
 
-Using our decision rule, we can define a **decision boundary** as the “line” that splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\theta = [\theta_0, \theta_1, \theta_2]$ including the intercept term, and we can solve for the decision boundary like so: 
+Using our decision rule, we can define a **decision boundary** as the “line” that splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in $p$-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\theta = [\theta_0, \theta_1, \theta_2]$ including the intercept term, and we can solve for the decision boundary like so: 
 
 $$
 \begin{align}
@@ -51,7 +53,7 @@ e^{-(\theta_0 + \theta_1  \cdot  \text{feature1} +  \theta_2  \cdot  \text{featu
 \end{align} 
 $$
 
-For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we've included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes. 
+For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we've included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes. Here the color is the *true* class, rather than the model's predictions.
 
 <center><img src="images/decision_boundary.png" alt='varying_threshold' width='800'></center>
 
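To make the decision rule and the boundary concrete, here is a minimal sketch in plain Python. The values in `theta` are made up for illustration (they are not fitted to any dataset in these notes), and the helper names `predict_proba`, `classify`, and `boundary_feature2` are our own. For a general threshold $T$, the boundary satisfies $\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2} = \log\frac{T}{1-T}$, which reduces to $0$ when $T = 0.5$.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, features):
    """P(Y = 1 | x) for theta = [theta_0, theta_1, theta_2] and two features."""
    z = theta[0] + theta[1] * features[0] + theta[2] * features[1]
    return sigmoid(z)

def classify(theta, features, T=0.5):
    """Decision rule: predict Class 1 when the predicted probability is at least T."""
    return 1 if predict_proba(theta, features) >= T else 0

def boundary_feature2(theta, feature1, T=0.5):
    """Solve theta_0 + theta_1*f1 + theta_2*f2 = log(T / (1 - T)) for f2.

    Assumes theta[2] != 0 so that the boundary is not vertical.
    """
    logit_T = math.log(T / (1.0 - T))
    return (logit_T - theta[0] - theta[1] * feature1) / theta[2]

theta = [-1.0, 2.0, 1.0]  # hypothetical parameters, for illustration only

# A point on the boundary has predicted probability exactly T (here 0.5).
f2_on_boundary = boundary_feature2(theta, feature1=0.5)
print(predict_proba(theta, (0.5, f2_on_boundary)))

# Points on either side of the boundary get different classes.
print(classify(theta, (0.5, f2_on_boundary + 1.0)))
print(classify(theta, (0.5, f2_on_boundary - 1.0)))
```

Raising `T` shifts the boundary toward the Class 1 side without changing its slope, since only the right-hand side $\log\frac{T}{1-T}$ changes.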
@@ -63,47 +65,49 @@ As you can see, the decision boundary predicted by our logistic regression does
 
 ## Linear Separability and Regularization
 
-A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. 
+A classification dataset is said to be **linearly separable** if there exists a hyperplane **among input features $x$** that separates the two classes $y$. 
 
-Linear separability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly separable along the vertical line $x=0$. However, no such line perfectly separates the two classes on the bottom right.
+Linear separability in 1D can be checked with a rugplot of a single feature: the data is linearly separable if some point perfectly separates the classes. For example, notice how the plot on the bottom left is linearly separable along the vertical line $x=0$. However, no such line perfectly separates the two classes on the bottom right.
 
 <center><img src="images/linear_separability_1D.png" alt='linear_separability_1D' width='800'></center>
 
 This same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.
 
 <center><img src="images/linear_separability_2D.png" alt='linear_separability_1D' width='800'></center>
 
-This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise. Consider the `toy` dataset with 2 points and only a single feature $x$:
+This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. Can it achieve 0 cross-entropy loss?
 
-<center><img src="images/toy_2_point.png" alt='toy_linear_separability' width='500'></center>
+$$-(y \log(p) + (1 - y) \log(1 - p))$$
+
+The loss is 0 if:
+
+- $p = 1$ when $y = 1$
+- $p = 0$ when $y = 0$
 
-The optimal $\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.
 
-- $P(Y = 1|x = -1) = \frac{1}{1 + e^\theta} \rightarrow 1$
-- $P(Y = 1|x = 1) = \frac{1}{1 + e^{-\theta}} \rightarrow 0$
+Consider a simple model with one feature and no intercept. 
 
-This happens when $\theta = -\infty$. When $\theta = -\infty$, we observe the following behavior for any input $x$.
+$$P_{\theta}(Y = 1|x) = \sigma(\theta x) = \frac{1}{1 + e^{-\theta x}}$$
 
-$$P(Y=1|x) = \sigma(\theta x) \rightarrow \begin{cases}
-        1, \text{if }  x < 0\\
-        0, \text{if }  x \ge 0
-    \end{cases}$$
+What $\theta$ will achieve 0 loss if we train on the single datapoint $x = 1, y = 1$? We would want $p = 1$, which happens only in the limit $\theta \rightarrow \infty$.
 
-The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and thus, $\hat y = 0$.
+However, (unexpected) complications may arise. When data is linearly separable, the optimal model parameters **diverge** to $\pm \infty$. The sigmoid can never output exactly 0 or 1, so no finite optimal $\theta$ exists. This can be a problem when using gradient descent to fit the model. Consider a simple, linearly separable "toy" dataset with two datapoints.
 
-<center><img src="images/toy_3_point.png" alt='toy_linear_separability' width='500'></center>
+<center><img src="images/toy_linear_separable_dataset.png" alt='toy_linear_separability' width='500'></center>
 
-The loss incurred by this misclassified point is infinite.
+Let's also visualize the mean cross entropy loss along with the direction of the gradient.
 
-$$-(y\text{ log}(p) + (1-y)\text{ log}(1-p))=1\text{log}(0)$$
+<center><img src="images/mean_cross_entropy_loss_plot.png" alt='mean_cross_entropy_loss_plot' width='500'></center>
 
-Thus, diverging weights ($|\theta| \rightarrow \infty$) occur with **lineary separable** data. "Overconfidence" is a particularly dangerous version of overfitting.
+Because gradient descent follows the tilted loss surface downwards, it never converges.
+
+The diverging weights cause the model to be **overconfident**. For example, consider the new point $(x, y) = (-0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and thus, $\hat y = 0$.
 
-Consider the loss function with respect to the parameter $\theta$.
+<center><img src="images/toy_linear_separable_dataset_2.png" alt='toy_linear_separability' width='500'></center>
 
-<center><img src="images/unreg_loss.png" alt='unreg_loss' width='500'></center>
+The loss incurred by this misclassified point is infinite.
+
+$$-(y \log(p) + (1-y) \log(1-p)) = -1 \cdot \log(0) = \infty$$
 
-Though it's very difficult to see, the plateau for negative values of $\theta$ is slightly tilted downwards, meaning the loss approaches $0$ as $\theta$ decreases and approaches $-\infty$.
+Thus, diverging weights ($|\theta| \rightarrow \infty$) occur with **linearly separable** data. "Overconfidence" is a particularly dangerous version of overfitting.
 
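The divergence described above can be sketched numerically. The two-point dataset below is an assumption (the actual toy data lives in the figure, so the values $x = -1, y = 0$ and $x = 1, y = 1$ are hypothetical), and the model is the one-feature, no-intercept model $\sigma(\theta x)$. The sketch runs gradient descent for two different step counts and also evaluates the loss on the new point $(x, y) = (-0.5, 1)$.

```python
import math

def softplus(z):
    """Numerically stable log(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cross_entropy(theta, x, y):
    """-(y log(p) + (1 - y) log(1 - p)) with p = sigmoid(theta * x).

    Written as softplus(z) - y * z to avoid evaluating log(0).
    """
    z = theta * x
    return softplus(z) - y * z

# Assumed toy dataset (hypothetical values): linearly separable at x = 0.
xs = [-1.0, 1.0]
ys = [0, 1]

def mean_loss(theta):
    return sum(cross_entropy(theta, x, y) for x, y in zip(xs, ys)) / len(xs)

def mean_gradient(theta):
    return sum((sigmoid(theta * x) - y) * x for x, y in zip(xs, ys)) / len(xs)

def run_gradient_descent(steps, lr=1.0):
    theta = 0.0
    for _ in range(steps):
        theta -= lr * mean_gradient(theta)
    return theta

# More steps always give a larger theta and a smaller (but still positive) loss.
theta_100 = run_gradient_descent(100)
theta_10000 = run_gradient_descent(10_000)
print(theta_100, mean_loss(theta_100))
print(theta_10000, mean_loss(theta_10000))

# The loss on the new point (x, y) = (-0.5, 1) grows as theta grows.
for theta in (theta_100, theta_10000):
    print(theta, cross_entropy(theta, -0.5, 1))
```

However long we run it, $\theta$ keeps increasing and the training loss keeps shrinking toward $0$ without reaching it, while the penalty for the misclassified new point keeps growing.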
 ### Regularized Logistic Regression
 
@@ -115,16 +119,16 @@ $$\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} (y_i \text{log}(\sigma(x_i^T\theta))
 
 Now, let us compare the loss functions of un-regularized and regularized logistic regression.
 
-<center><img src="images/unreg_loss.png" alt='unreg_loss' width='500'></center>
+<center><img src="images/unreg_loss_infinite_argmin.png" alt='unreg_loss' width='500'></center>
 
-<center><img src="images/reg_loss.png" alt='reg_loss' width='500'></center>
+<center><img src="images/reg_loss_finite_argmin.png" alt='reg_loss' width='500'></center>
 
 As we can see, $L2$ regularization helps us prevent diverging weights and deters against "overconfidence."
 
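As a rough numerical check of this claim, the sketch below runs gradient descent on a regularized and an unregularized objective. It assumes a hypothetical separable dataset ($x = -1, y = 0$ and $x = 1, y = 1$), a one-feature model with no intercept, a penalty of the form $\lambda \theta^2$, and an arbitrary choice of $\lambda = 0.1$.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Assumed toy dataset (hypothetical values): linearly separable at x = 0.
xs = [-1.0, 1.0]
ys = [0, 1]

def regularized_gradient(theta, lam):
    """Gradient of mean cross-entropy loss plus the penalty lam * theta**2."""
    data_term = sum((sigmoid(theta * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    return data_term + 2.0 * lam * theta

def fit(lam, steps=5000, lr=0.5):
    theta = 0.0
    for _ in range(steps):
        theta -= lr * regularized_gradient(theta, lam)
    return theta

theta_reg = fit(lam=0.1)    # settles at a finite value
theta_unreg = fit(lam=0.0)  # still growing when the loop stops
print(theta_reg, regularized_gradient(theta_reg, 0.1))
print(theta_unreg, regularized_gradient(theta_unreg, 0.0))
```

With the penalty, the gradient reaches zero at a finite $\theta$; without it, the gradient stays nonzero and $\theta$ would keep growing with more steps.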
-`sklearn`'s logistic regression defaults to L2 regularization and `C=1.0`; `C` is the inverse of $\lambda$: $C = \frac{1}{\lambda}$. Setting `C` to a large value, for example, `C=300.0`, results in minimal regularization.
-
+`sklearn`'s logistic regression defaults to $L2$ regularization and `C=1.0`; `C` is the inverse of $\lambda$: $C = \frac{1}{\lambda}$. Setting `C` to a large value, for example, `C=300.0`, results in minimal regularization.
+    
     # sklearn defaults
-    model = LogisticRegression(penalty='l2', C=1.0, …)
+    model = LogisticRegression(penalty = 'l2', C = 1.0, ...)
     model.fit(X, y) # X: feature matrix, y: 0/1 labels
 
 Note that in Data 100, we only use `sklearn` to fit logistic regression models. There is no closed-form solution to the optimal theta vector, and the gradient is a little messy (see the bonus section below for details).
@@ -150,6 +154,8 @@ Translated to code:
         
     model.score(X, y) # built-in accuracy function
 
+You can find the `sklearn` documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score).
+
 However, accuracy is not always a great metric for classification. To understand why, let's consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We'll investigate two models where accuracy is a poor metric. 
 
 - **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.
@@ -265,13 +271,13 @@ The **True Positive Rate (TPR)** is defined as
 
 $$\text{true positive rate} = \frac{\text{TP}}{\text{TP + FN}}$$
 
-You'll notice this is equivalent to *recall*. In the context of our spam email classifier, it answers the question: "What proportion of spam did I mark correctly?". We'd like this to be close to $1$
+You'll notice this is equivalent to *recall*. In the context of our spam email classifier, it answers the question: "What proportion of spam did I mark correctly?" We'd like this to be close to $1$.
 
 The **False Positive Rate (FPR)** is defined as
 
 $$\text{false positive rate} = \frac{\text{FP}}{\text{FP + TN}}$$
 
-Another word for FPR is *specificity*. This answers the question: "What proportion of regular email did I mark as spam?". We'd like this to be close to $0$
+The FPR is equal to $1 - \text{specificity}$, where *specificity* is the true negative rate $\frac{\text{TN}}{\text{FP + TN}}$. The FPR answers the question: "What proportion of regular email did I mark as spam?" We'd like this to be close to $0$.
 
 As we increase threshold $T$, both TPR and FPR decrease. We've plotted this relationship below for some model on a `toy` dataset.
 
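The two definitions above can be sketched in a few lines of plain Python. The labels and predicted probabilities here are invented for illustration, and `confusion_counts` and `tpr_fpr` are hypothetical helper names.

```python
def confusion_counts(y_true, probs, T):
    """Count TP, FP, TN, FN when predicting Class 1 for probabilities >= T."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        y_hat = 1 if p >= T else 0
        if y_hat == 1 and y == 1:
            tp += 1
        elif y_hat == 1 and y == 0:
            fp += 1
        elif y_hat == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def tpr_fpr(y_true, probs, T):
    """Return (TPR, FPR) = (TP / (TP + FN), FP / (FP + TN))."""
    tp, fp, tn, fn = confusion_counts(y_true, probs, T)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up labels (1 = spam) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for T in (0.25, 0.5, 0.75):
    print(T, tpr_fpr(y_true, probs, T))
```

On this example, both rates fall as $T$ rises: fewer points are predicted positive, so both true positives and false positives can only stay the same or decrease.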
@@ -308,7 +314,8 @@ In fact, we can choose a threshold $T$ based on our desired number, or proportio
 
 ### Precision-Recall Curves
 
-A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way as with the ROC curve.
+A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. In this curve, we test out many different possible thresholds and for each one we compute the precision and recall of the classifier.
+
 
 Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease.
 
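Here is a small sketch of how the points on a PR curve could be computed, using invented labels and probabilities. The function name `precision_recall` is hypothetical, and returning `None` for precision when nothing is predicted positive is our own convention.

```python
def precision_recall(y_true, probs, T):
    """Return (precision, recall) at threshold T.

    Precision is None when no point is predicted positive (TP + FP = 0).
    """
    tp = sum(1 for y, p in zip(y_true, probs) if p >= T and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= T and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < T and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn)
    return precision, recall

# Made-up labels (1 = spam) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

# One (precision, recall) pair per threshold traces out the PR curve.
pr_points = [precision_recall(y_true, probs, T) for T in (0.25, 0.5, 0.75)]
print(pr_points)
```

In this example precision rises and recall falls as $T$ increases. Recall can never increase with $T$, but precision is only "generally" increasing: on other data it can dip between neighboring thresholds.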
@@ -326,7 +333,7 @@ We want our PR curve to be as close to the “top right” of this graph as poss
 
 ### The ROC Curve
 
-The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold $T$ values.
+The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold $T$ values. At lower thresholds, the FPR and TPR are both high because there are many positive predictions, while at higher thresholds the FPR and TPR are both low because there are fewer positive predictions.
 
 <center><img src="images/roc_curve.png" alt='roc_curve' width='700'></center>
 
@@ -384,11 +391,11 @@ Setting the derivative equal to 0 and solving for $\hat{\theta}$, we find that t
 $$\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} $$
 
 For $\tau$ from 0 to convergence: 
-$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} + \rho(\tau)\left( \frac{1}{n} \sum_{i=1}^n \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) $$
+$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} - \rho(\tau)\left( \frac{1}{n} \sum_{i=1}^n \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) $$
 
 ### Stochastic Gradient Descent Update Rule
 $$\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} $$
 
 For $\tau$ from 0 to convergence, let $B$ ~ $\text{Random subset of indices}$. 
-$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} + \rho(\tau)\left( \frac{1}{|B|} \sum_{i \in B} \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) $$
+$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} - \rho(\tau)\left( \frac{1}{|B|} \sum_{i \in B} \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) $$
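
The two update rules above can be sketched in plain Python for a one-feature logistic regression model with no intercept. Everything specific here is an assumption made for illustration: the small non-separable dataset, the constant learning rate for gradient descent, the decaying schedule $\rho(\tau)$ for the stochastic version, the batch size, and the step counts.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_i(theta, x, y):
    """Gradient of the cross-entropy loss L_i for one datapoint."""
    return (sigmoid(theta * x) - y) * x

def gradient_descent(xs, ys, rho, steps, theta0=0.0):
    """theta <- theta - rho(tau) * (average gradient over all n points)."""
    theta = theta0
    n = len(xs)
    for tau in range(steps):
        avg_grad = sum(grad_i(theta, x, y) for x, y in zip(xs, ys)) / n
        theta = theta - rho(tau) * avg_grad
    return theta

def stochastic_gradient_descent(xs, ys, rho, steps, batch_size, seed=0, theta0=0.0):
    """Same update, but averaging over a random subset B of the indices."""
    rng = random.Random(seed)
    theta = theta0
    n = len(xs)
    for tau in range(steps):
        B = rng.sample(range(n), batch_size)
        avg_grad = sum(grad_i(theta, xs[i], ys[i]) for i in B) / len(B)
        theta = theta - rho(tau) * avg_grad
    return theta

# Made-up dataset that is NOT linearly separable, so a finite optimum exists.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

theta_gd = gradient_descent(xs, ys, rho=lambda tau: 0.5, steps=500)
theta_sgd = stochastic_gradient_descent(
    xs, ys, rho=lambda tau: 0.5 / (1.0 + 0.01 * tau), steps=5000, batch_size=2
)
print(theta_gd, theta_sgd)
```

Note the minus sign in both updates: we step against the gradient because we are minimizing the loss. The stochastic estimate is noisy, so a decaying $\rho(\tau)$ is one common way to let it settle near the same solution.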