
Commit

log reg 2 fixes
lillianw101 committed Apr 11, 2024
1 parent bf38b0a commit 29ed85d
Showing 6 changed files with 42 additions and 1,879 deletions.
Binary file modified logistic_regression_2/images/log_reg_summary.png
Binary file modified logistic_regression_2/images/varying_threshold.png
874 changes: 0 additions & 874 deletions logistic_regression_2/logistic_reg_2.html

This file was deleted.

433 changes: 0 additions & 433 deletions logistic_regression_2/logistic_reg_2.ipynb

This file was deleted.

88 changes: 42 additions & 46 deletions logistic_regression_2/logistic_reg_2.qmd
@@ -13,22 +13,22 @@ format:
jupyter: python3
---

::: {.callout-note collapse="false"}
## Learning Outcomes
* Apply decision rules to make a classification
* Learn when logistic regression works well and when it does not
* Introduce new metrics for model performance
...
:::

Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction and learn about linear separability. Picking up from last lecture's discussion of cross-entropy loss, we'll study a few of its pitfalls and learn potential remedies. We will also demonstrate how to fit a logistic regression model using `sklearn`. Lastly, we'll return to decision rules and discuss metrics that allow us to determine our model's performance in different scenarios.

This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model, learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.

## Decision Boundaries
In logistic regression, we model the *probability* that a datapoint belongs to Class 1.

<center><img src="images/log_reg_summary.png" alt='tpr_fpr' width='800'></center>

<br>
Last week, we developed the logistic regression model to predict that probability, but we never actually made any *classifications* for whether our prediction $y$ belongs in Class 0 or Class 1.

$$ p = P(Y=1 | x) = \frac{1}{1 + e^{-x^{\top}\theta}}$$
@@ -53,7 +53,7 @@ e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{featu
\end{align}
$$

For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we've included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes. Here the color is the *predicted* class, rather than the true class.

<center><img src="images/decision_boundary.png" alt='varying_threshold' width='800'></center>

@@ -78,34 +78,30 @@ This same definition holds in higher dimensions. If there are two features, the
This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. Can it achieve 0 cross-entropy loss?

$$-(y \log(p) + (1 - y) \log(1 - p))$$
Cross-entropy loss is 0 if $p = 1$ when $y = 1$, and $p = 0$ when $y = 0$. Consider a simple model with one feature and no intercept.

$$P_{\theta}(Y = 1|x) = \sigma(\theta x) = \frac{1}{1 + e^{-\theta x}}$$

What $\theta$ will achieve 0 loss if we train on the datapoint $x = 1, y = 1$? We would want $p = 1$ which occurs when $\theta \rightarrow \infty$.
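
To see this numerically, here is a minimal sketch (our own illustration, not from the original notes) that evaluates the cross-entropy loss on the single point $(x, y) = (1, 1)$ for increasingly large $\theta$: the loss keeps shrinking but stays positive for every finite $\theta$.

```python
# A minimal sketch: on the single training point (x = 1, y = 1), the cross-entropy
# loss -log(sigma(theta * x)) shrinks as theta grows but is positive for every
# finite theta. (Floating point eventually rounds sigma to exactly 1, so we stop at 30.)
import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

for theta in [1, 5, 10, 20, 30]:
    p = sigma(theta * 1.0)       # predicted P(Y = 1 | x = 1)
    loss = -np.log(p)            # cross-entropy loss with y = 1
    print(f"theta = {theta:>2}: p = {p:.16f}, loss = {loss:.2e}")
```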

However, (unexpected) complications may arise. When data is linearly separable, the optimal model parameters **diverge** to $\pm \infty$. *The sigmoid can never output exactly 0 or 1*, so no finite optimal $\theta$ exists. This can be a problem when using gradient descent to fit the model. Consider a simple, linearly separable "toy" dataset with two datapoints.

<center><img src="images/toy_linear_separable_dataset.png" alt='toy_linear_separability' width='500'></center>

Let's also visualize the mean cross-entropy loss along with the direction of the gradient (how this loss surface is calculated is out of scope).

<center><img src="images/mean_cross_entropy_loss_plot.png" alt='mean_cross_entropy_loss_plot' width='500'></center>
<center><img src="images/mean_cross_entropy_loss_plot.png" alt='mean_cross_entropy_loss_plot' width='450'></center>

It's nearly impossible to see, but the plateau to the right is slightly tilted. Because gradient descent follows the tilted loss surface downwards, it never converges.

The diverging weights cause the model to be **overconfident**. Say we add a new point $(x, y) = (-0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and thus, $\hat y = 0$.

<center><img src="images/toy_linear_separable_dataset_2.png" alt='toy_linear_separability' width='450'></center>
<br>
The loss incurred by this misclassified point is infinite.

$$-(y \log(p) + (1 - y) \log(1 - p)) = -(1 \cdot \log(0) + 0 \cdot \log(1)) = \infty$$

Thus, diverging weights ($|\theta| \rightarrow \infty$) occur with **linearly separable** data. "Overconfidence", as shown here, is a particularly dangerous version of overfitting.

@@ -115,17 +111,19 @@ To avoid large weights and infinite loss (particularly on linearly separable dat

For example, $L2$ (Ridge) Logistic Regression takes on the form:

$$\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} (y_i \text{log}(\sigma(X_i^T\theta)) + (1-y_i)\text{log}(1-\sigma(X_i^T\theta))) + \lambda \sum_{j=1}^{d} \theta_j^2$$
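
For reference, here is a minimal numpy sketch of this objective (our own, not from the original notes); for simplicity, it penalizes every entry of $\theta$.

```python
# A minimal sketch: the L2-regularized mean cross-entropy objective from above,
# written in numpy. For simplicity, every entry of theta is penalized here.
import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

def regularized_cross_entropy(theta, X, y, lam):
    p = sigma(X @ theta)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + lam * np.sum(theta ** 2)
```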

Now, let us compare the loss functions of un-regularized and regularized logistic regression.

<center><img src="images/unreg_loss_infinite_argmin.png" alt='unreg_loss' width='500'></center>
<center><img src="images/unreg_loss_infinite_argmin.png" alt='unreg_loss' width='450'></center>

<center><img src="images/reg_loss_finite_argmin.png" alt='reg_loss' width='500'></center>
<center><img src="images/reg_loss_finite_argmin.png" alt='reg_loss' width='450'></center>

As we can see, $L2$ regularization helps us prevent diverging weights and deters against "overconfidence."

`sklearn`'s logistic regression defaults to $L2$ regularization and `C=1.0`; `C` is the inverse of $\lambda$:
$$C = \frac{1}{\lambda}$$
Setting `C` to a large value, for example, `C=300.0`, results in minimal regularization.

# sklearn defaults
model = LogisticRegression(penalty = 'l2', C = 1.0, ...)
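
As an illustration (toy data and `C` values of our own choosing, not from the original notes), a minimal sketch of how the fitted coefficient grows as `C` increases, i.e., as the $L2$ penalty weakens, on a linearly separable dataset:

```python
# A minimal sketch (toy data): on a linearly separable dataset, the fitted
# coefficient grows as C increases, i.e. as the L2 penalty weakens.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-1.0], [1.0]])    # hypothetical toy feature
y = np.array([0, 1])             # linearly separable labels

for C in [1.0, 300.0, 10_000.0]:
    model = LogisticRegression(penalty="l2", C=C, max_iter=10_000)
    model.fit(X, y)
    print(f"C = {C:>8}: theta_1 = {model.coef_[0][0]:.2f}")
```
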
@@ -173,7 +171,7 @@ There are 4 different classifications that our model might make:

These classifications can be concisely summarized in a **confusion matrix**.

<center><img src="images/confusion_matrix.png" alt='confusion_matrix' width='500'></center>
<center><img src="images/confusion_matrix.png" alt='confusion_matrix' width='450'></center>

An easy way to remember this terminology is as follows:

@@ -183,7 +181,7 @@
We can now write the accuracy calculation as
$$\text{accuracy} = \frac{TP + TN}{n}$$

In `sklearn`, we use the following syntax to compute a confusion matrix:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_true, Y_pred)
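
Continuing the snippet above (and assuming `Y_true` and `Y_pred` are already defined), a minimal sketch that unpacks the four counts from `cm` and recomputes accuracy by hand:

```python
# A minimal sketch (continuing the snippet above): unpack the four counts from
# the confusion matrix and recompute accuracy by hand.
tn, fp, fn, tp = cm.ravel()   # sklearn lays the matrix out as [[TN, FP], [FN, TP]]
accuracy = (tp + tn) / (tp + tn + fp + fn)
```
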
@@ -229,8 +227,6 @@ First, let's begin by creating the confusion matrix.
| 1 | False Negative: 5 | True Positive: 0 |
+-------------------+-------------------+---------------------------+

$$\text{accuracy} = \frac{95}{100} = 0.95$$
$$\text{precision} = \frac{0}{0 + 0} = \text{undefined}$$
$$\text{recall} = \frac{0}{0 + 5} = 0$$
@@ -239,7 +235,7 @@ Notice how our precision is undefined because we never predicted class $1$. Our

#### Model 2

The confusion matrix for Model 2 is:

+-------------------+-------------------+---------------------------+
| | 0 | 1 |
@@ -257,9 +253,7 @@ Our precision is low because we have many false positives, and our recall is per

### Precision vs. Recall

Precision ($\frac{\text{TP}}{\text{TP} + \textbf{ FP}}$) penalizes false positives, while recall ($\frac{\text{TP}}{\text{TP} + \textbf{ FN}}$) penalizes false negatives. In fact, precision and recall are *inversely related*. This is evident in our second model -- we observed a high recall and low precision. Usually, there is a tradeoff between the two (most models can minimize either the number of FP or FN; in rare cases, both).

The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further studies to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.

@@ -295,7 +289,7 @@ $$\hat y = \begin{cases}

The default threshold in `sklearn` is $T = 0.5$. As we increase the threshold $T$, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).

<center><img src="images/varying_threshold.png" alt='varying_threshold' width='800'></center>
<center><img src="images/varying_threshold.png" alt='varying_threshold' width='700'></center>

As you may notice, the choice of threshold $T$ impacts our classifier's performance.
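
For a concrete illustration, here is a minimal sketch (toy data and a hypothetical threshold `T`, not from the original notes) of classifying with a custom threshold instead of `.predict()`, which uses the 0.5 default:

```python
# A minimal sketch (toy data): classify with a custom threshold T instead of
# the 0.5 default used by .predict().
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                   # hypothetical features
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(int)  # hypothetical labels

model = LogisticRegression().fit(X, y)
T = 0.75                                           # stricter than the default
p = model.predict_proba(X)[:, 1]                   # P(Y = 1 | x) for each row
y_hat = (p >= T).astype(int)                       # predict 1 only when p >= T
```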

@@ -319,27 +313,27 @@ A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that

Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease.

<center><img src="images/precision-recall-thresh.png" alt='precision-recall-thresh' width='750'></center>
<center><img src="images/precision-recall-thresh.png" alt='precision-recall-thresh' width='650'></center>

Displayed below is the PR Curve for the same `toy` dataset. Notice how threshold values increase as we move to the left.

<center><img src="images/pr_curve_thresholds.png" alt='pr_curve_thresholds' width='685'></center>
<center><img src="images/pr_curve_thresholds.png" alt='pr_curve_thresholds' width='600'></center>

Once again, the perfect classifier will resemble the orange curve, this time facing the opposite direction.

<center><img src="images/pr_curve_perfect.png" alt='pr_curve_perfect' width='675'></center>
<center><img src="images/pr_curve_perfect.png" alt='pr_curve_perfect' width='600'></center>

We want our PR curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine "closeness", with the perfect classifier exhibiting an AUC = 1 (and the worst with an AUC = 0.5).
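
A minimal sketch (reusing the illustrative `model`, `X`, and `y` from the thresholding sketch above) that traces the PR curve with `sklearn` and summarizes it with average precision, one common summary of the area under a PR curve:

```python
# A minimal sketch (reusing model, X, y from the thresholding sketch above):
# trace the PR curve and summarize it with average precision.
from sklearn.metrics import precision_recall_curve, average_precision_score

p = model.predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, p)
ap = average_precision_score(y, p)   # closer to 1 means closer to the top-right corner
```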

### The ROC Curve

The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. Notice how the far left of the curve corresponds to higher threshold $T$ values. At lower thresholds, the FPR and TPR are both high because there are many positive predictions, while at higher thresholds, the FPR and TPR are both low because there are few positive predictions.

<center><img src="images/roc_curve.png" alt='roc_curve' width='700'></center>
<center><img src="images/roc_curve.png" alt='roc_curve' width='600'></center>

The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0. This is achieved at the top-left of the plot below. More generally, its ROC curve resembles the curve in orange.

<center><img src="images/roc_curve_perfect.png" alt='roc_curve_perfect' width='700'></center>
<center><img src="images/roc_curve_perfect.png" alt='roc_curve_perfect' width='600'></center>

We want our model to be as close to this orange curve as possible. How do we quantify "closeness"?

@@ -349,21 +343,23 @@ We can compute the **area under curve (AUC)** of the ROC curve. Notice how the p
#### (Extra) What is the “worst” AUC, and why is it 0.5?
On the other hand, a terrible model will have an AUC closer to 0.5. A random predictor predicts $P(Y = 1 | x)$ uniformly between 0 and 1; this indicates the classifier is not able to distinguish between the positive and negative classes, and thus randomly predicts one of the two.

<center><img src="images/roc_curve_worst_predictor.png" alt='roc_curve_worst_predictor' width='900'></center>
<center><img src="images/roc_curve_worst_predictor.png" alt='roc_curve_worst_predictor' width='700'></center>

We can also illustrate this by comparing different thresholds and seeing their points on the ROC curve.

<center><img src="images/roc_curve_worse_predictor_differing_T.png" alt = "roc_curve_worse_predictor_differing_T" width="900"> </center>
<center><img src="images/roc_curve_worse_predictor_differing_T.png" alt = "roc_curve_worse_predictor_differing_T" width="700"> </center>


## (Bonus) Gradient Descent for Logistic Regression
Let's define the following terms:
$$
\begin{align}
t_i &= \phi(x_i)^T \theta \\
p_i &= \sigma(t_i) \\
t_i &= \log(\frac{p_i}{1 - p_i}) \\
1 - \sigma(t_i) &= \sigma(-t_i) \\
\frac{d}{dt} \sigma(t) &= \sigma(t) \sigma(-t)
\end{align}
$$

Now, we can simplify the cross-entropy loss
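
As a companion to this derivation, here is a minimal sketch (our own, not the notes' code) of batch gradient descent for logistic regression, assuming the standard gradient $\frac{1}{n} \sum_{i=1}^{n} (\sigma(x_i^{\top}\theta) - y_i) x_i$:

```python
# A minimal sketch (not the notes' full derivation): batch gradient descent for
# logistic regression, using the gradient (1/n) * X^T (sigma(X @ theta) - y).
import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigma(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

# Hypothetical toy data (noisy, so not linearly separable and the weights stay finite).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 0).astype(int)
theta_hat = gradient_descent(X, y)
```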

