
Commit

updating note 13
ishani07 committed Feb 29, 2024
1 parent e02c7e4 commit 06147af
Showing 18 changed files with 866 additions and 1,419 deletions.
1,471 changes: 702 additions & 769 deletions docs/gradient_descent/gradient_descent.html

Large diffs are not rendered by default.

74 changes: 37 additions & 37 deletions docs/pandas_2/pandas_2.html
@@ -1608,12 +1608,12 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">323128</td>
<td data-quarto-table-cell-role="th">61105</td>
<td>CA</td>
<td>M</td>
<td>1992</td>
<td>Lucio</td>
<td>29</td>
<td>F</td>
<td>1970</td>
<td>Adriana</td>
<td>168</td>
</tr>
</tbody>
</table>
@@ -1640,34 +1640,34 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">358226</td>
<td>2006</td>
<td>Jackson</td>
<td>772</td>
<td data-quarto-table-cell-role="th">261184</td>
<td>1949</td>
<td>Roman</td>
<td>10</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">267506</td>
<td>1956</td>
<td>Augustine</td>
<td>19</td>
<td data-quarto-table-cell-role="th">84347</td>
<td>1980</td>
<td>Chanelle</td>
<td>15</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">22853</td>
<td>1946</td>
<td>Judi</td>
<td>83</td>
<td data-quarto-table-cell-role="th">386460</td>
<td>2015</td>
<td>Dany</td>
<td>7</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">231031</td>
<td>2020</td>
<td>Raeya</td>
<td>8</td>
<td data-quarto-table-cell-role="th">187378</td>
<td>2009</td>
<td>Zion</td>
<td>12</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">196872</td>
<td>2011</td>
<td>Mazie</td>
<td>6</td>
<td data-quarto-table-cell-role="th">28878</td>
<td>1951</td>
<td>Candace</td>
<td>201</td>
</tr>
</tbody>
</table>
@@ -1693,28 +1693,28 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">343127</td>
<td data-quarto-table-cell-role="th">343396</td>
<td>2000</td>
<td>Long</td>
<td>30</td>
<td>Xander</td>
<td>18</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">343265</td>
<td data-quarto-table-cell-role="th">150903</td>
<td>2000</td>
<td>Geronimo</td>
<td>22</td>
<td>Mila</td>
<td>12</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">151163</td>
<td data-quarto-table-cell-role="th">149167</td>
<td>2000</td>
<td>Keiry</td>
<td>10</td>
<td>Nancy</td>
<td>427</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">343019</td>
<td data-quarto-table-cell-role="th">150298</td>
<td>2000</td>
<td>Pierce</td>
<td>40</td>
<td>Xitlali</td>
<td>22</td>
</tr>
</tbody>
</table>
14 changes: 7 additions & 7 deletions docs/pandas_3/pandas_3.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/regex/regex.html
@@ -644,11 +644,11 @@ <h4 data-number="6.2.1.2" class="anchored" data-anchor-id="canonicalization-with
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>county_and_state[<span class="st">'clean_county_pandas'</span>] <span class="op">=</span> canonicalize_county_series(county_and_state[<span class="st">'County'</span>])</span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>display(county_and_pop), display(county_and_state)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_98183/2523629438.py:3: FutureWarning:
<pre><code>/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_323/2523629438.py:3: FutureWarning:

The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.

/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_98183/2523629438.py:3: FutureWarning:
/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_323/2523629438.py:3: FutureWarning:

The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
</code></pre>
6 changes: 3 additions & 3 deletions docs/sampling/sampling.html
@@ -662,7 +662,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>random_sample <span class="op">=</span> movie.sample(n, replace <span class="op">=</span> <span class="va">False</span>) <span class="co">## By default, replace = False</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>np.mean(random_sample[<span class="st">"barbie"</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>0.5306208193747287</code></pre>
<pre><code>0.530954712907211</code></pre>
</div>
</div>
<p>This is very close to the actual vote of 0.5302792307692308!</p>
@@ -680,7 +680,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a>Markdown(<span class="ss">f"**Actual** = </span><span class="sc">{</span>actual_barbie<span class="sc">:.4f}</span><span class="ss">, **Sample** = </span><span class="sc">{</span>sample_barbie<span class="sc">:.4f}</span><span class="ss">, "</span></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a> <span class="ss">f"**Err** = </span><span class="sc">{</span><span class="dv">100</span><span class="op">*</span>err<span class="sc">:.2f}</span><span class="ss">%."</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="10">
<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5275, <strong>Err</strong> = 0.52%.</p>
<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5100, <strong>Err</strong> = 3.82%.</p>
</div>
</div>
<p>We’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.</p>
@@ -713,7 +713,7 @@ <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-er
<div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>poll_result <span class="op">=</span> pd.Series(poll_result)</span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>np.<span class="bu">sum</span>(poll_result <span class="op">&gt;</span> <span class="fl">0.5</span>)<span class="op">/</span><span class="dv">1000</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<pre><code>0.944</code></pre>
<pre><code>0.956</code></pre>
</div>
</div>
<p>You can see the curve looks roughly Gaussian/normal. Using KDE:</p>
Binary file modified docs/sampling/sampling_files/figure-html/cell-13-output-1.png
Binary file modified docs/sampling/sampling_files/figure-html/cell-15-output-1.png
478 changes: 0 additions & 478 deletions docs/search.json

This file was deleted.

54 changes: 52 additions & 2 deletions feature_engineering/feature_engineering.qmd
@@ -14,7 +14,17 @@ format:
- cosmo
- cerulean
callout-icon: false
jupyter: python3
jupyter:
jupytext:
text_representation:
extension: .qmd
format_name: quarto
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

::: {.callout-note collapse="false"}
@@ -33,6 +43,47 @@ In this lecture, we'll explore two techniques for model fitting:

With our new programming frameworks in hand, we will also add sophistication to our models by introducing more complex features to enhance model performance.


For example, let's look at the computational complexity for the solution to the normal equations.
<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_normal_solution.png" alt='complexity_normal_solution' width='600'>
</td>
</tr>
</table>
</div>

Computing $(\mathbb{X}^{\top}\mathbb{X})^{-1}$ takes $O(nd^2) + O(d^3)$ time, while computing $\mathbb{X}^{\top}\mathbb{Y}$ takes $O(nd)$ time, where $n$ is the number of samples and $d$ is the number of features. The first term dominates the overall complexity and can be problematic for high-dimensional models.
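
To make these costs concrete, here is a minimal NumPy sketch of the closed-form solution. The shapes, seed, and names (`X`, `Y`, `theta_hat`) are made up for illustration and are not taken from the lecture code.

```python
import numpy as np

n, d = 100_000, 50                       # n samples, d features (illustrative sizes)
rng = np.random.default_rng(42)
X = rng.normal(size=(n, d))              # design matrix
Y = rng.normal(size=n)                   # target vector

XtX = X.T @ X                            # forming the (d, d) matrix costs O(n d^2)
XtY = X.T @ Y                            # forming the (d,) vector costs O(n d)
theta_hat = np.linalg.inv(XtX) @ XtY     # inverting the (d, d) matrix costs O(d^3)

# In practice, solving the linear system is preferred over an explicit inverse.
theta_hat = np.linalg.solve(XtX, XtY)
```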

<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_grad_descent.png" alt='complexity_grad_descent' width='600'>
</td>
</tr>
</table>
</div>

Looking back at the complexity of gradient descent, suppose we run $T$ iterations; then the total complexity is $O(Tnd)$. Typically, $n$ is much larger than $T$ or $d$. How can we reduce the cost of this algorithm using a technique from DATA 100?


Let's now compare the time complexity between batch gradient descent and stochastic gradient descent.


<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/time-complexity-compare.png" alt='time-complexity-compare' width='600'>
</td>
</tr>
</table>
</div>

As shown above, the time complexity scales with the number of data points selected in the sample. Stochastic gradient descent approximates the gradient while reducing the computational cost, which creates a tradeoff: as the batch size increases, the sampled gradient becomes a better estimate of the true gradient.

A few notes on the **batch size**, $b$: it is typically small, and the original stochastic gradient descent algorithm proposed a batch size of 1. Today, the choice of $b$ depends on several factors. A larger batch size gives a better gradient estimate and can be parallelized, but with diminishing returns; a smaller batch size means more frequent updates.
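
As a rough sketch (not the course's implementation), a mini-batch SGD loop for linear regression under average squared loss might look like the following; the function name, constant learning rate, and default batch size are assumptions made for illustration.

```python
import numpy as np

def mini_batch_sgd(X, Y, batch_size=32, learning_rate=0.01, epochs=10, seed=0):
    """Schematic mini-batch SGD for linear regression with average squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        shuffled = rng.permutation(n)                 # sample without replacement
        for start in range(0, n, batch_size):
            batch = shuffled[start:start + batch_size]
            X_b, Y_b = X[batch], Y[batch]
            # Gradient of average squared loss on the batch: O(b * d) work per update
            grad = 2 / len(batch) * X_b.T @ (X_b @ theta - Y_b)
            theta = theta - learning_rate * grad
    return theta
```

Setting `batch_size=1` recovers the original stochastic algorithm, while `batch_size=n` recovers batch gradient descent.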


## Feature Engineering

At this point in the course, we've equipped ourselves with some powerful techniques to build and optimize models. We've explored how to develop models of multiple variables, as well as how to transform variables to help **linearize** a dataset and fit these models to maximize their performance.
@@ -267,4 +318,3 @@ We can see that there is a clear trade-off that comes from the complexity of our
The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to "unseen" data. A model that is too simple won't be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.

This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 17 on Cross-Validation and Regularization!

56 changes: 14 additions & 42 deletions gradient_descent/gradient_descent.qmd
@@ -14,7 +14,17 @@ format:
- cosmo
- cerulean
callout-icon: false
jupyter: python3
jupyter:
jupytext:
text_representation:
extension: .qmd
format_name: quarto
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

::: {.callout-note collapse="false"}
@@ -62,11 +72,11 @@ Our goal will be to predict the value of the `"bill_depth_mm"` for a particular
penguins["bias"] = np.ones(len(penguins), dtype=int)
# Define the design matrix, X...
# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form
# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it is in Matrix form
X = penguins[["bias", "flipper_length_mm", "body_mass_g"]].to_numpy()
# ...as well as the target variable, Y
# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form
# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it is in Matrix form
Y = penguins[["bill_depth_mm"]].to_numpy()
```

@@ -318,7 +328,7 @@ This basic approach suffers from three major flaws:
2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.
3. It is *very* computationally inefficient, considering potentially vast numbers of guesses that are useless.

#### Scipy.optimize.minimize
#### `Scipy.optimize.minimize`

One way to minimize this mathematical function is to use the `scipy.optimize.minimize` function. It takes a function and a starting guess and tries to find the minimum.
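
For instance, a small sketch of this call; the toy objective and starting guess below are made up for illustration and are not the arbitrary function used in lecture.

```python
from scipy.optimize import minimize

# A toy convex objective whose minimum sits at theta = 3
def objective(theta):
    return (theta[0] - 3) ** 2 + 1

result = minimize(objective, x0=[4.0])   # start the search from an initial guess of 4
print(result.x, result.fun)              # roughly [3.], 1.0
```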

@@ -716,7 +726,6 @@ $$\begin{bmatrix}

Formally, the algorithm we derived above is called **batch gradient descent.** For each iteration of the algorithm, the derivative of loss is computed across the *entire* batch of all $n$ datapoints. While this update rule works well in theory, it is not practical in most circumstances. For large datasets (with perhaps billions of datapoints), finding the gradient across all the data is incredibly computationally taxing; gradient descent will converge slowly because each individual update is slow.
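
To see where that cost comes from, here is a schematic full-batch update loop for linear regression under average squared loss; the names and constant learning rate are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def batch_gradient_descent(X, Y, learning_rate=0.01, iterations=1000):
    """Schematic batch gradient descent for linear regression with average squared loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iterations):
        # Each update touches all n rows: O(n * d) work per iteration,
        # so T iterations cost O(T * n * d) overall.
        grad = 2 / n * X.T @ (X @ theta - Y)
        theta = theta - learning_rate * grad
    return theta
```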

For example, let's look at the computational complexity for the solution to the normal equations.
<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_normal_solution.png" alt='complexity_normal_solution' width='600'>
</td>
</tr>
</table>
</div>

Computing $(\mathbb{X}^{\top}\mathbb{X})^{-1}$ takes $O(nd^2) + O(d^3)$ while $\mathbb{X}^{\top}\mathbb{Y}$ takes $O(nd)$ time complexity where $n$ is the number of samples and $d$ is the number of features. The first term dominates the complexity and can be problematic for high dimensional models.

<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_grad_descent.png" alt='complexity_grad_descent' width='600'>
</td>
</tr>
</table>
</div>

Looking back at the complexity for gradient descent, suppose we run $T$ iterations then the final complexity is $O(Tnd)$. Typically $n$ is much larger than $T$ or $d$. How should we reduce the cost of this algorithm using a technique in DATA 100?

**Stochastic (mini-batch) gradient descent** tries to address this issue. In stochastic descent, only a *sample* of the full dataset is used at each update. We estimate the true gradient of the loss surface using just that sample of data. The **batch size** is the number of data points used in each sample. The sampling strategy is generally without replacement (the data is shuffled, and batch-size examples are selected one at a time).

Each complete "pass" through the data is known as a **training epoch**. After shuffling the data, in a single **training epoch** of stochastic gradient descent, we
@@ -763,21 +750,6 @@ The diagrams below represent a "bird's eye view" of a loss surface from above. N
</table>
</div>

Let's now compare the time complexity between batch gradient descent and stochastic gradient descent.


<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/time-complexity-compare.png" alt='time-complexity-compare' width='600'>
</td>
</tr>
</table>
</div>
As shown above, the time complexity scales with the number of data points selected in the sample. Stochastic gradient descent helps approximate the gradient while also reducing the computation cost, leading to this tradeoff. As the batch-size increases, it will be a better estimate of the true gradient.

A few notes on the **batch size, b.** It is typically small and the original stochastic gradient descent algorithm proposed a batch size of 1. Today the choice of b depends on several factors. A larger batch size can lead to better gradient estimate and can be parallelized. However, there is diminishing returns. A smaller batch size means that there will be more frequent updates.

To summarize the tradeoffs of batch size:

| - | Smaller Batch Size | Larger Batch Size |
2 changes: 1 addition & 1 deletion index.log
@@ -1,4 +1,4 @@
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2024.2.22) 29 FEB 2024 00:14
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2024.2.22) 29 FEB 2024 10:44
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
Binary file modified index.pdf
Binary file not shown.
