Few final changes to gradient descent
yashdave003 committed Oct 9, 2023
1 parent 3e4fef4 commit 6404300
Showing 3 changed files with 605 additions and 78 deletions.
185 changes: 111 additions & 74 deletions gradient_descent/gradient_descent.ipynb

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions gradient_descent/gradient_descent.qmd
@@ -24,8 +24,8 @@ jupyter: python3
* Applying gradient descent for numerical optimization
:::

- At this point, we've grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of $\theta$ that minimize the loss function. So far, we've optimized $\theta$ theta by
- 1. Using calculus to take the derivative of the loss function with respect to $\theta$, set it equal to 0, and solve.
+ At this point, we've grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of $\theta$ that minimize the loss function. So far, we've optimized $\theta$ by
+ 1. Using calculus to take the derivative of the loss function with respect to $\theta$, setting it equal to 0, and solving.
2. Using the geometric argument of orthogonality to derive the OLS solution $\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}$.
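As a quick illustration, the closed-form OLS solution above can be computed directly with NumPy. This is a minimal sketch on a hypothetical toy design matrix (the data here is illustrative, not from the notes); `np.linalg.solve` is used rather than an explicit matrix inverse, which is the numerically preferred way to evaluate $(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}$:

```python
import numpy as np

# Toy design matrix: an intercept column plus one feature (illustrative data).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([2.0, 3.0, 4.0])

# Normal equations: (X^T X) theta_hat = X^T Y, solved without forming an inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# The points lie exactly on y = 1 + x, so theta_hat is [1.0, 1.0].
```

The same result could be obtained with `np.linalg.lstsq(X, Y, rcond=None)`, which is more robust when $\mathbb{X}^T \mathbb{X}$ is ill-conditioned.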

One thing to note, however, is that the techniques we used above can only be applied if we make some big assumptions. For the calculus approach, we assumed that the loss function was differentiable at all points and that the algebra was manageable; for the geometric approach, OLS *only* applies when using a linear model with MSE loss. What happens when we have more complex models with different, more complex loss functions? The techniques we've learned so far will not work, so we need a new optimization technique: **gradient descent**.
@@ -113,7 +113,7 @@ minimize(arbitrary, x0 = 3.5)

It turns out that under the hood, the `fit` method for `LinearRegression` models uses gradient descent. Gradient descent is also the workhorse behind much of machine learning, including advanced neural network models.
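The `minimize(arbitrary, x0 = 3.5)` call in the hunk above comes from `scipy.optimize`. A minimal sketch of how it is used follows; the body of `arbitrary` here is a stand-in quadratic, since the function's actual definition is not shown in this excerpt:

```python
from scipy.optimize import minimize

# Stand-in for the notebook's `arbitrary` function (its real definition is
# not shown in this diff); any smooth scalar function works the same way.
def arbitrary(x):
    return (x - 2) ** 2 + 1

# Start the search at x0 = 3.5, as in the notes.
result = minimize(arbitrary, x0=3.5)
# result.x holds the minimizing input (near 2.0 for this stand-in),
# and result.fun holds the loss value there.
```

`minimize` returns an `OptimizeResult`; by default it uses a quasi-Newton method with a numerically approximated gradient, so no derivative needs to be supplied.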

- In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the basic principles beyond the optimization functions that harness to find optimal parameters.
+ In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the underlying principles that optimization functions harness to find optimal parameters.
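Those underlying principles fit in a few lines of plain Python. This is a minimal sketch for a one-dimensional loss, assuming a fixed learning rate `alpha` and a known derivative (both names are illustrative; the notes develop the update rule in detail):

```python
def gradient_descent(derivative, theta0, alpha=0.1, n_steps=100):
    """Repeatedly step opposite the gradient: theta <- theta - alpha * dL/dtheta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * derivative(theta)
    return theta

# Example: L(theta) = (theta - 3)^2 has derivative 2 * (theta - 3),
# so the iterates converge toward the minimizer theta = 3.
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Each step shrinks the error by a constant factor here (|1 - 2 * alpha| = 0.8), so after 100 steps `theta_hat` is within about 1e-9 of 3; a learning rate that is too large would instead make the iterates diverge.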


## Digging into Gradient Descent
@@ -504,4 +504,3 @@ The diagrams below represent a "bird's eye view" of a loss surface from above. N
</table>
</div>


