Few final changes to gradient descent
yashdave003 committed Oct 9, 2023
1 parent 3e4fef4 commit 6404300
Showing 3 changed files with 605 additions and 78 deletions.
185 changes: 111 additions & 74 deletions gradient_descent/gradient_descent.ipynb

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions gradient_descent/gradient_descent.qmd
@@ -24,8 +24,8 @@ jupyter: python3
* Applying gradient descent for numerical optimization
:::

- At this point, we've grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of $\theta$ that minimize the loss function. So far, we've optimized $\theta$ theta by
- 1. Using calculus to take the derivative of the loss function with respect to $\theta$, set it equal to 0, and solve.
+ At this point, we've grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of $\theta$ that minimize the loss function. So far, we've optimized $\theta$ by
+ 1. Using calculus to take the derivative of the loss function with respect to $\theta$, setting it equal to 0, and solving.
2. Using the geometric argument of orthogonality to derive the OLS solution $\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}$.
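As a quick illustration, the closed-form OLS solution above can be computed directly with NumPy. This is a minimal sketch on a hypothetical toy design matrix (the data here is illustrative, not from the notes); `np.linalg.solve` is used rather than an explicit matrix inverse, which is the numerically preferred way to evaluate $(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}$:

```python
import numpy as np

# Toy design matrix: an intercept column plus one feature (illustrative data).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([2.0, 3.0, 4.0])

# Normal equations: (X^T X) theta_hat = X^T Y, solved without forming an inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# The points lie exactly on y = 1 + x, so theta_hat is [1.0, 1.0].
```

The same result could be obtained with `np.linalg.lstsq(X, Y, rcond=None)`, which is more robust when $\mathbb{X}^T \mathbb{X}$ is ill-conditioned.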

One thing to note, however, is that the techniques we used above can only be applied if we make some big assumptions. For the calculus approach, we assumed that the loss function was differentiable at all points and that the algebra was manageable; for the geometric approach, OLS *only* applies when using a linear model with MSE loss. What happens when we have more complex models with different, more complex loss functions? The techniques we've learned so far will not work, so we need a new optimization technique: **gradient descent**.
@@ -113,7 +113,7 @@ minimize(arbitrary, x0 = 3.5)

It turns out that under the hood, the `fit` method for `LinearRegression` models uses gradient descent. Gradient descent is also the workhorse behind much of machine learning, including advanced neural network models.
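The `minimize(arbitrary, x0 = 3.5)` call in the hunk above comes from `scipy.optimize`. A minimal sketch of how it is used follows; the body of `arbitrary` here is a stand-in quadratic, since the function's actual definition is not shown in this excerpt:

```python
from scipy.optimize import minimize

# Stand-in for the notebook's `arbitrary` function (its real definition is
# not shown in this diff); any smooth scalar function works the same way.
def arbitrary(x):
    return (x - 2) ** 2 + 1

# Start the search at x0 = 3.5, as in the notes.
result = minimize(arbitrary, x0=3.5)
# result.x holds the minimizing input (near 2.0 for this stand-in),
# and result.fun holds the loss value there.
```

`minimize` returns an `OptimizeResult`; by default it uses a quasi-Newton method with a numerically approximated gradient, so no derivative needs to be supplied.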

- In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the basic principles beyond the optimization functions that harness to find optimal parameters.
+ In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the underlying principles that optimization functions harness to find optimal parameters.
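Those underlying principles fit in a few lines of plain Python. This is a minimal sketch for a one-dimensional loss, assuming a fixed learning rate `alpha` and a known derivative (both names are illustrative; the notes develop the update rule in detail):

```python
def gradient_descent(derivative, theta0, alpha=0.1, n_steps=100):
    """Repeatedly step opposite the gradient: theta <- theta - alpha * dL/dtheta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * derivative(theta)
    return theta

# Example: L(theta) = (theta - 3)^2 has derivative 2 * (theta - 3),
# so the iterates converge toward the minimizer theta = 3.
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Each step shrinks the error by a constant factor here (|1 - 2 * alpha| = 0.8), so after 100 steps `theta_hat` is within about 1e-9 of 3; a learning rate that is too large would instead make the iterates diverge.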


## Digging into Gradient Descent
@@ -504,4 +504,3 @@ The diagrams below represent a "bird's eye view" of a loss surface from above. N
</table>
</div>


