fix cross-entropy loss equation
lillianw101 committed Apr 16, 2024
1 parent 6ec9844 commit 1050c65
Showing 4 changed files with 29 additions and 20 deletions.
25 changes: 15 additions & 10 deletions docs/logistic_regression_1/logistic_reg_1.html
@@ -428,7 +428,7 @@ <h2 data-number="22.2" class="anchored" data-anchor-id="deriving-the-logistic-re
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>games <span class="op">=</span> pd.read_csv(<span class="st">"data/games"</span>).dropna()</span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>games.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="1">
<div class="cell-output cell-output-display" data-execution_count="37">
<div>


@@ -760,7 +760,7 @@ <h3 data-number="22.4.1" class="anchored" data-anchor-id="why-not-mse"><span cla
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"y"</span>: [<span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">1</span>]})</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>toy_df.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="8">
<div class="cell-output cell-output-display" data-execution_count="44">
<div>


@@ -906,31 +906,36 @@ <h3 data-number="22.4.2" class="anchored" data-anchor-id="motivating-cross-entro
:::: -->
<p>All good – we are seeing the behavior we want for our logistic regression model.</p>
<p>The piecewise function we outlined above is difficult to optimize: we don’t want to constantly “check” which form of the loss function we should be using at each step of choosing the optimal model parameters. We can re-express cross-entropy loss in a more convenient way:</p>
<p><span class="math display">\[\text{Cross-Entropy Loss} = -\left(y\log{(p)}-(1-y)\log{(1-p)}\right)\]</span></p>
<p><span class="math display">\[\text{Cross-Entropy Loss} = -\left(y\log{(p)}+(1-y)\log{(1-p)}\right)\]</span></p>
<p>By setting <span class="math inline">\(y\)</span> to 0 or 1, we see that this new form of cross-entropy loss gives us the same behavior as the original formulation. Another way to think about this is that in either scenario (y being equal to 0 or 1), only one of the cross-entropy loss terms is activated, which gives us a convenient way to combine the two independent loss functions.</p>
<div class="columns">
<div class="column" style="width:35%;">
<p>When <span class="math inline">\(y=1\)</span>:</p>
<p><span class="math display">\[\begin{align}
- \text{CE} &amp;= -\left((1)\log{(p)}-(1-1)\log{(1-p)}\right)\\
+ \text{CE} &amp;= -\left((1)\log{(p)}+(1-1)\log{(1-p)}\right)\\
&amp;= -\log{(p)}
\end{align}\]</span></p>
</div><div class="column" style="width:20%;">

</div><div class="column" style="width:35%;">
<p>When <span class="math inline">\(y=0\)</span>:</p>
<p><span class="math display">\[\begin{align}
- \text{CE} &amp;= -\left((0)\log{(p)}-(1-0)\log{(1-p)}\right)\\
+ \text{CE} &amp;= -\left((0)\log{(p)}+(1-0)\log{(1-p)}\right)\\
&amp;= -\log{(1-p)}
\end{align}\]</span></p>
</div>
</div>
<p>The empirical risk of the logistic regression model is then the mean cross-entropy loss across all datapoints in the dataset. When fitting the model, we want to determine the model parameter <span class="math inline">\(\theta\)</span> that leads to the lowest mean cross-entropy loss possible.</p>
<p><span class="math display">\[R(\theta) = - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(p_i)}-(1-y_i)\log{(1-p_i)}\right)\]</span> <span class="math display">\[R(\theta) = - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{\sigma(X_i^{\top}\theta)}-(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right)\]</span></p>
<p><span class="math display">\[
\begin{align}
R(\theta) &amp;= - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(p_i)}+(1-y_i)\log{(1-p_i)}\right) \\
&amp;= - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{\sigma(X_i^{\top}\theta)}+(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right)
\end{align}
\]</span></p>
<p>The optimization problem is therefore to find the estimate <span class="math inline">\(\hat{\theta}\)</span> that minimizes <span class="math inline">\(R(\theta)\)</span>:</p>
<p><span class="math display">\[\begin{align}
\hat{\theta} = \underset{\theta}{\arg\min} (- \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(\sigma(X_i^{\top}\theta))}-(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right))
\end{align}\]</span></p>
<p><span class="math display">\[
\hat{\theta} = \underset{\theta}{\arg\min} - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(\sigma(X_i^{\top}\theta))}+(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right)
\]</span></p>
<p>Plotting the cross-entropy loss surface for our <code>toy</code> dataset gives us a more encouraging result – our loss function is now convex. This means we can optimize it using gradient descent. Computing the gradient of the logistic model is fairly challenging, so we’ll let <code>sklearn</code> take care of this for us. You won’t need to compute the gradient of the logistic model in Data 100.</p>
<div class="cell" data-execution_count="10">
<details>
@@ -962,7 +967,7 @@ <h3 data-number="22.5.1" class="anchored" data-anchor-id="building-intuition-the
<div class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>flips <span class="op">=</span> [<span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">1</span>, <span class="dv">1</span>, <span class="dv">1</span>, <span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">0</span>]</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>flips</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<div class="cell-output cell-output-display" data-execution_count="47">
<pre><code>[0, 0, 1, 1, 1, 1, 0, 0, 0, 0]</code></pre>
</div>
</div>
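The sign fix in the hunks above can be checked numerically. Below is a minimal Python sketch (illustrative only, not part of the commit): toy_df and sigmoid follow the definitions shown in the diff, while the helper names ce_old and ce_fixed and the choice theta_1 = 0.5 are arbitrary. Only the corrected form, with a plus on the (1 - y) term, reproduces the piecewise definition: -log(p) when y = 1 and -log(1 - p) when y = 0.

# Sanity check for the sign fix (sketch only, not part of this commit).
import numpy as np
import pandas as pd

# toy_df and sigmoid follow the definitions shown in the diff above.
toy_df = pd.DataFrame({"x": [-4, -2, -0.5, 1, 3, 5],
                       "y": [0, 0, 1, 0, 1, 1]})

def sigmoid(z):
    return 1 / (1 + np.e ** (-z))

def ce_old(y, p_hat):
    # pre-fix form: -(y log p - (1 - y) log(1 - p)) -- wrong sign on the second term
    return -(y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat))

def ce_fixed(y, p_hat):
    # corrected form: -(y log p + (1 - y) log(1 - p))
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

p_hat = sigmoid(toy_df["x"] * 0.5)   # predictions for an arbitrary theta_1 = 0.5
piecewise = np.where(toy_df["y"] == 1, -np.log(p_hat), -np.log(1 - p_hat))

print(np.allclose(ce_fixed(toy_df["y"], p_hat), piecewise))   # True
print(np.allclose(ce_old(toy_df["y"], p_hat), piecewise))     # False: y = 0 rows come out with a flipped sign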
2 changes: 1 addition & 1 deletion docs/search.json
@@ -109,7 +109,7 @@
"href": "logistic_regression_1/logistic_reg_1.html#cross-entropy-loss",
"title": "22  Logistic Regression I",
"section": "22.4 Cross-Entropy Loss",
"text": "22.4 Cross-Entropy Loss\nTo quantify the error of our logistic regression model, we’ll need to define a new loss function.\n\n22.4.1 Why Not MSE?\nYou may wonder: why not use our familiar mean squared error? It turns out that the MSE is not well suited for logistic regression. To see why, let’s consider a simple, artificially generated toy dataset with just one feature (this will be easier to work with than the more complicated games data).\n\n\nCode\ntoy_df = pd.DataFrame({\n \"x\": [-4, -2, -0.5, 1, 3, 5],\n \"y\": [0, 0, 1, 0, 1, 1]})\ntoy_df.head()\n\n\n\n\n\n\n\n\n\nx\ny\n\n\n\n\n0\n-4.0\n0\n\n\n1\n-2.0\n0\n\n\n2\n-0.5\n1\n\n\n3\n1.0\n0\n\n\n4\n3.0\n1\n\n\n\n\n\n\n\nWe’ll construct a basic logistic regression model with only one feature and no intercept term. Our predicted probabilities take the form:\n\\[p=P(Y=1|x)=\\frac{1}{1+e^{-\\theta_1 x}}\\]\nIn the cell below, we plot the MSE for our model on the data.\n\n\nCode\ndef sigmoid(z):\n return 1/(1+np.e**(-z))\n \ndef mse_on_toy_data(theta):\n p_hat = sigmoid(toy_df['x'] * theta)\n return np.mean((toy_df['y'] - p_hat)**2)\n\nthetas = np.linspace(-15, 5, 100)\nplt.plot(thetas, [mse_on_toy_data(theta) for theta in thetas])\nplt.title(\"MSE on toy classification data\")\nplt.xlabel(r'$\\theta_1$')\nplt.ylabel('MSE');\n\n\n\n\n\nThis looks nothing like the parabola we found when plotting the MSE of a linear regression model! In particular, we can identify two flaws with using the MSE for logistic regression:\n\nThe MSE loss surface is non-convex. There is both a global minimum and a (barely perceptible) local minimum in the loss surface above. This means that there is the risk of gradient descent converging on the local minimum of the loss surface, missing the true optimum parameter \\(\\theta_1\\).\n\n\n\nSquared loss is bounded for a classification task. Recall that each true \\(y\\) has a value of either 0 or 1. This means that even if our model makes the worst possible prediction (e.g. predicting \\(p=0\\) for \\(y=1\\)), the squared loss for an observation will be no greater than 1: \\[(y-p)^2=(1-0)^2=1\\] The MSE does not strongly penalize poor predictions.\n\n\n\n\n\n\n22.4.2 Motivating Cross-Entropy Loss\nSuffice to say, we don’t want to use the MSE when working with logistic regression. Instead, we’ll consider what kind of behavior we would like to see in a loss function.\nLet \\(y\\) be the binary label (it can either be 0 or 1), and \\(p\\) be the model’s predicted probability of the label \\(y\\) being 1.\n\nWhen the true \\(y\\) is 1, we should incur low loss when the model predicts large \\(p\\)\nWhen the true \\(y\\) is 0, we should incur high loss when the model predicts large \\(p\\)\n\nIn other words, our loss function should behave differently depending on the value of the true class, \\(y\\).\nThe cross-entropy loss incorporates this changing behavior. We will use it throughout our work on logistic regression. Below, we write out the cross-entropy loss for a single datapoint (no averages just yet).\n\\[\\text{Cross-Entropy Loss} = \\begin{cases}\n -\\log{(p)} & \\text{if } y=1 \\\\\n -\\log{(1-p)} & \\text{if } y=0\n\\end{cases}\\]\nWhy does this (seemingly convoluted) loss function “work”? 
Let’s break it down.\n\n\n\n\n\n\n\nWhen \\(y=1\\)\nWhen \\(y=0\\)\n\n\n\n\n\n\n\n\nAs \\(p \\rightarrow 0\\), loss approches \\(\\infty\\)\nAs \\(p \\rightarrow 0\\), loss approches 0\n\n\nAs \\(p \\rightarrow 1\\), loss approaches 0\nAs \\(p \\rightarrow 1\\), loss approaches \\(\\infty\\)\n\n\n\n\nAll good – we are seeing the behavior we want for our logistic regression model.\nThe piecewise function we outlined above is difficult to optimize: we don’t want to constantly “check” which form of the loss function we should be using at each step of choosing the optimal model parameters. We can re-express cross-entropy loss in a more convenient way:\n\\[\\text{Cross-Entropy Loss} = -\\left(y\\log{(p)}-(1-y)\\log{(1-p)}\\right)\\]\nBy setting \\(y\\) to 0 or 1, we see that this new form of cross-entropy loss gives us the same behavior as the original formulation. Another way to think about this is that in either scenario (y being equal to 0 or 1), only one of the cross-entropy loss terms is activated, which gives us a convenient way to combine the two independent loss functions.\n\n\nWhen \\(y=1\\):\n\\[\\begin{align}\n\\text{CE} &= -\\left((1)\\log{(p)}-(1-1)\\log{(1-p)}\\right)\\\\\n&= -\\log{(p)}\n\\end{align}\\]\n\n\n\nWhen \\(y=0\\):\n\\[\\begin{align}\n\\text{CE} &= -\\left((0)\\log{(p)}-(1-0)\\log{(1-p)}\\right)\\\\\n&= -\\log{(1-p)}\n\\end{align}\\]\n\n\nThe empirical risk of the logistic regression model is then the mean cross-entropy loss across all datapoints in the dataset. When fitting the model, we want to determine the model parameter \\(\\theta\\) that leads to the lowest mean cross-entropy loss possible.\n\\[R(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(p_i)}-(1-y_i)\\log{(1-p_i)}\\right)\\] \\[R(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{\\sigma(X_i^{\\top}\\theta)}-(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta))}\\right)\\]\nThe optimization problem is therefore to find the estimate \\(\\hat{\\theta}\\) that minimizes \\(R(\\theta)\\):\n\\[\\begin{align}\n\\hat{\\theta} = \\underset{\\theta}{\\arg\\min} (- \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(\\sigma(X_i^{\\top}\\theta))}-(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta))}\\right))\n\\end{align}\\]\nPlotting the cross-entropy loss surface for our toy dataset gives us a more encouraging result – our loss function is now convex. This means we can optimize it using gradient descent. Computing the gradient of the logistic model is fairly challenging, so we’ll let sklearn take care of this for us. You won’t need to compute the gradient of the logistic model in Data 100.\n\n\nCode\ndef cross_entropy(y, p_hat):\n return - y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)\n\ndef mean_cross_entropy_on_toy_data(theta):\n p_hat = sigmoid(toy_df['x'] * theta)\n return np.mean(cross_entropy(toy_df['y'], p_hat))\n\nplt.plot(thetas, [mean_cross_entropy_on_toy_data(theta) for theta in thetas], color = 'green')\nplt.ylabel(r'Mean Cross-Entropy Loss($\\theta$)')\nplt.xlabel(r'$\\theta$');"
"text": "22.4 Cross-Entropy Loss\nTo quantify the error of our logistic regression model, we’ll need to define a new loss function.\n\n22.4.1 Why Not MSE?\nYou may wonder: why not use our familiar mean squared error? It turns out that the MSE is not well suited for logistic regression. To see why, let’s consider a simple, artificially generated toy dataset with just one feature (this will be easier to work with than the more complicated games data).\n\n\nCode\ntoy_df = pd.DataFrame({\n \"x\": [-4, -2, -0.5, 1, 3, 5],\n \"y\": [0, 0, 1, 0, 1, 1]})\ntoy_df.head()\n\n\n\n\n\n\n\n\n\nx\ny\n\n\n\n\n0\n-4.0\n0\n\n\n1\n-2.0\n0\n\n\n2\n-0.5\n1\n\n\n3\n1.0\n0\n\n\n4\n3.0\n1\n\n\n\n\n\n\n\nWe’ll construct a basic logistic regression model with only one feature and no intercept term. Our predicted probabilities take the form:\n\\[p=P(Y=1|x)=\\frac{1}{1+e^{-\\theta_1 x}}\\]\nIn the cell below, we plot the MSE for our model on the data.\n\n\nCode\ndef sigmoid(z):\n return 1/(1+np.e**(-z))\n \ndef mse_on_toy_data(theta):\n p_hat = sigmoid(toy_df['x'] * theta)\n return np.mean((toy_df['y'] - p_hat)**2)\n\nthetas = np.linspace(-15, 5, 100)\nplt.plot(thetas, [mse_on_toy_data(theta) for theta in thetas])\nplt.title(\"MSE on toy classification data\")\nplt.xlabel(r'$\\theta_1$')\nplt.ylabel('MSE');\n\n\n\n\n\nThis looks nothing like the parabola we found when plotting the MSE of a linear regression model! In particular, we can identify two flaws with using the MSE for logistic regression:\n\nThe MSE loss surface is non-convex. There is both a global minimum and a (barely perceptible) local minimum in the loss surface above. This means that there is the risk of gradient descent converging on the local minimum of the loss surface, missing the true optimum parameter \\(\\theta_1\\).\n\n\n\nSquared loss is bounded for a classification task. Recall that each true \\(y\\) has a value of either 0 or 1. This means that even if our model makes the worst possible prediction (e.g. predicting \\(p=0\\) for \\(y=1\\)), the squared loss for an observation will be no greater than 1: \\[(y-p)^2=(1-0)^2=1\\] The MSE does not strongly penalize poor predictions.\n\n\n\n\n\n\n22.4.2 Motivating Cross-Entropy Loss\nSuffice to say, we don’t want to use the MSE when working with logistic regression. Instead, we’ll consider what kind of behavior we would like to see in a loss function.\nLet \\(y\\) be the binary label (it can either be 0 or 1), and \\(p\\) be the model’s predicted probability of the label \\(y\\) being 1.\n\nWhen the true \\(y\\) is 1, we should incur low loss when the model predicts large \\(p\\)\nWhen the true \\(y\\) is 0, we should incur high loss when the model predicts large \\(p\\)\n\nIn other words, our loss function should behave differently depending on the value of the true class, \\(y\\).\nThe cross-entropy loss incorporates this changing behavior. We will use it throughout our work on logistic regression. Below, we write out the cross-entropy loss for a single datapoint (no averages just yet).\n\\[\\text{Cross-Entropy Loss} = \\begin{cases}\n -\\log{(p)} & \\text{if } y=1 \\\\\n -\\log{(1-p)} & \\text{if } y=0\n\\end{cases}\\]\nWhy does this (seemingly convoluted) loss function “work”? 
Let’s break it down.\n\n\n\n\n\n\n\nWhen \\(y=1\\)\nWhen \\(y=0\\)\n\n\n\n\n\n\n\n\nAs \\(p \\rightarrow 0\\), loss approches \\(\\infty\\)\nAs \\(p \\rightarrow 0\\), loss approches 0\n\n\nAs \\(p \\rightarrow 1\\), loss approaches 0\nAs \\(p \\rightarrow 1\\), loss approaches \\(\\infty\\)\n\n\n\n\nAll good – we are seeing the behavior we want for our logistic regression model.\nThe piecewise function we outlined above is difficult to optimize: we don’t want to constantly “check” which form of the loss function we should be using at each step of choosing the optimal model parameters. We can re-express cross-entropy loss in a more convenient way:\n\\[\\text{Cross-Entropy Loss} = -\\left(y\\log{(p)}+(1-y)\\log{(1-p)}\\right)\\]\nBy setting \\(y\\) to 0 or 1, we see that this new form of cross-entropy loss gives us the same behavior as the original formulation. Another way to think about this is that in either scenario (y being equal to 0 or 1), only one of the cross-entropy loss terms is activated, which gives us a convenient way to combine the two independent loss functions.\n\n\nWhen \\(y=1\\):\n\\[\\begin{align}\n\\text{CE} &= -\\left((1)\\log{(p)}+(1-1)\\log{(1-p)}\\right)\\\\\n&= -\\log{(p)}\n\\end{align}\\]\n\n\n\nWhen \\(y=0\\):\n\\[\\begin{align}\n\\text{CE} &= -\\left((0)\\log{(p)}+(1-0)\\log{(1-p)}\\right)\\\\\n&= -\\log{(1-p)}\n\\end{align}\\]\n\n\nThe empirical risk of the logistic regression model is then the mean cross-entropy loss across all datapoints in the dataset. When fitting the model, we want to determine the model parameter \\(\\theta\\) that leads to the lowest mean cross-entropy loss possible.\n\\[\n\\begin{align}\nR(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(p_i)}+(1-y_i)\\log{(1-p_i)}\\right) \\\\\n&= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{\\sigma(X_i^{\\top}\\theta)}+(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta))}\\right)\n\\end{align}\n\\]\nThe optimization problem is therefore to find the estimate \\(\\hat{\\theta}\\) that minimizes \\(R(\\theta)\\):\n\\[\n\\hat{\\theta} = \\underset{\\theta}{\\arg\\min} - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(\\sigma(X_i^{\\top}\\theta))}+(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta))}\\right)\n\\]\nPlotting the cross-entropy loss surface for our toy dataset gives us a more encouraging result – our loss function is now convex. This means we can optimize it using gradient descent. Computing the gradient of the logistic model is fairly challenging, so we’ll let sklearn take care of this for us. You won’t need to compute the gradient of the logistic model in Data 100.\n\n\nCode\ndef cross_entropy(y, p_hat):\n return - y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)\n\ndef mean_cross_entropy_on_toy_data(theta):\n p_hat = sigmoid(toy_df['x'] * theta)\n return np.mean(cross_entropy(toy_df['y'], p_hat))\n\nplt.plot(thetas, [mean_cross_entropy_on_toy_data(theta) for theta in thetas], color = 'green')\nplt.ylabel(r'Mean Cross-Entropy Loss($\\theta$)')\nplt.xlabel(r'$\\theta$');"
},
{
"objectID": "logistic_regression_1/logistic_reg_1.html#bonus-maximum-likelihood-estimation",
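With the corrected formula, the mean cross-entropy R(theta) for the one-parameter toy model is convex, as the updated text notes, so its minimizer can be found directly. The sketch below (again not part of the commit) approximates arg min R(theta) on a grid and cross-checks it against scikit-learn's LogisticRegression, which the notes defer to for fitting; fit_intercept=False and the large C are assumptions chosen here to match the no-intercept, unregularized model p = sigma(theta_1 * x).

# Rough sketch (not part of the commit): approximate the minimizer of the corrected
# mean cross-entropy R(theta) on the toy data, and cross-check with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

toy_df = pd.DataFrame({"x": [-4, -2, -0.5, 1, 3, 5],
                       "y": [0, 0, 1, 0, 1, 1]})

def sigmoid(z):
    return 1 / (1 + np.e ** (-z))

def cross_entropy(y, p_hat):
    # corrected form: -(y log p + (1 - y) log(1 - p))
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

def mean_cross_entropy(theta):
    p_hat = sigmoid(toy_df["x"] * theta)
    return np.mean(cross_entropy(toy_df["y"], p_hat))

thetas = np.linspace(-15, 5, 1000)   # same range as the notes, denser grid
theta_grid = thetas[np.argmin([mean_cross_entropy(t) for t in thetas])]

# fit_intercept=False matches p = sigma(theta_1 * x); large C approximates no regularization.
clf = LogisticRegression(fit_intercept=False, C=1e6).fit(toy_df[["x"]], toy_df["y"])

print(theta_grid, clf.coef_[0, 0])   # the two estimates should roughly agree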
2 changes: 1 addition & 1 deletion docs/site_libs/bootstrap/bootstrap.min.css

Large diffs are not rendered by default.
