
Commit

updating note 13
ishani07 committed Feb 29, 2024
1 parent e02c7e4 commit 06147af
Showing 18 changed files with 866 additions and 1,419 deletions.
1,471 changes: 702 additions & 769 deletions docs/gradient_descent/gradient_descent.html

Large diffs are not rendered by default.

74 changes: 37 additions & 37 deletions docs/pandas_2/pandas_2.html
@@ -1608,12 +1608,12 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">323128</td>
<td data-quarto-table-cell-role="th">61105</td>
<td>CA</td>
<td>M</td>
<td>1992</td>
<td>Lucio</td>
<td>29</td>
<td>F</td>
<td>1970</td>
<td>Adriana</td>
<td>168</td>
</tr>
</tbody>
</table>
@@ -1640,34 +1640,34 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">358226</td>
<td>2006</td>
<td>Jackson</td>
<td>772</td>
<td data-quarto-table-cell-role="th">261184</td>
<td>1949</td>
<td>Roman</td>
<td>10</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">267506</td>
<td>1956</td>
<td>Augustine</td>
<td>19</td>
<td data-quarto-table-cell-role="th">84347</td>
<td>1980</td>
<td>Chanelle</td>
<td>15</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">22853</td>
<td>1946</td>
<td>Judi</td>
<td>83</td>
<td data-quarto-table-cell-role="th">386460</td>
<td>2015</td>
<td>Dany</td>
<td>7</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">231031</td>
<td>2020</td>
<td>Raeya</td>
<td>8</td>
<td data-quarto-table-cell-role="th">187378</td>
<td>2009</td>
<td>Zion</td>
<td>12</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">196872</td>
<td>2011</td>
<td>Mazie</td>
<td>6</td>
<td data-quarto-table-cell-role="th">28878</td>
<td>1951</td>
<td>Candace</td>
<td>201</td>
</tr>
</tbody>
</table>
@@ -1693,28 +1693,28 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">343127</td>
<td data-quarto-table-cell-role="th">343396</td>
<td>2000</td>
<td>Long</td>
<td>30</td>
<td>Xander</td>
<td>18</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">343265</td>
<td data-quarto-table-cell-role="th">150903</td>
<td>2000</td>
<td>Geronimo</td>
<td>22</td>
<td>Mila</td>
<td>12</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">151163</td>
<td data-quarto-table-cell-role="th">149167</td>
<td>2000</td>
<td>Keiry</td>
<td>10</td>
<td>Nancy</td>
<td>427</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">343019</td>
<td data-quarto-table-cell-role="th">150298</td>
<td>2000</td>
<td>Pierce</td>
<td>40</td>
<td>Xitlali</td>
<td>22</td>
</tr>
</tbody>
</table>
14 changes: 7 additions & 7 deletions docs/pandas_3/pandas_3.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/regex/regex.html
@@ -644,11 +644,11 @@ <h4 data-number="6.2.1.2" class="anchored" data-anchor-id="canonicalization-with
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>county_and_state[<span class="st">'clean_county_pandas'</span>] <span class="op">=</span> canonicalize_county_series(county_and_state[<span class="st">'County'</span>])</span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>display(county_and_pop), display(county_and_state)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_98183/2523629438.py:3: FutureWarning:
<pre><code>/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_323/2523629438.py:3: FutureWarning:

The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.

/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_98183/2523629438.py:3: FutureWarning:
/var/folders/7t/zbwy02ts2m7cn64fvwjqb8xw0000gp/T/ipykernel_323/2523629438.py:3: FutureWarning:

The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
</code></pre>
6 changes: 3 additions & 3 deletions docs/sampling/sampling.html
@@ -662,7 +662,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>random_sample <span class="op">=</span> movie.sample(n, replace <span class="op">=</span> <span class="va">False</span>) <span class="co">## By default, replace = False</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>np.mean(random_sample[<span class="st">"barbie"</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>0.5306208193747287</code></pre>
<pre><code>0.530954712907211</code></pre>
</div>
</div>
<p>This is very close to the actual vote of 0.5302792307692308!</p>
@@ -680,7 +680,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a>Markdown(<span class="ss">f"**Actual** = </span><span class="sc">{</span>actual_barbie<span class="sc">:.4f}</span><span class="ss">, **Sample** = </span><span class="sc">{</span>sample_barbie<span class="sc">:.4f}</span><span class="ss">, "</span></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a> <span class="ss">f"**Err** = </span><span class="sc">{</span><span class="dv">100</span><span class="op">*</span>err<span class="sc">:.2f}</span><span class="ss">%."</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="10">
<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5275, <strong>Err</strong> = 0.52%.</p>
<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5100, <strong>Err</strong> = 3.82%.</p>
</div>
</div>
<p>We’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.</p>
@@ -713,7 +713,7 @@ <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-er
<div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>poll_result <span class="op">=</span> pd.Series(poll_result)</span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>np.<span class="bu">sum</span>(poll_result <span class="op">&gt;</span> <span class="fl">0.5</span>)<span class="op">/</span><span class="dv">1000</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<pre><code>0.944</code></pre>
<pre><code>0.956</code></pre>
</div>
</div>
<p>You can see the curve looks roughly Gaussian/normal. Using KDE:</p>
Binary file modified docs/sampling/sampling_files/figure-html/cell-13-output-1.png
Binary file modified docs/sampling/sampling_files/figure-html/cell-15-output-1.png
478 changes: 0 additions & 478 deletions docs/search.json

This file was deleted.

54 changes: 52 additions & 2 deletions feature_engineering/feature_engineering.qmd
@@ -14,7 +14,17 @@ format:
- cosmo
- cerulean
callout-icon: false
jupyter: python3
jupyter:
jupytext:
text_representation:
extension: .qmd
format_name: quarto
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

::: {.callout-note collapse="false"}
@@ -33,6 +43,47 @@ In this lecture, we'll explore two techniques for model fitting:

With our new programming frameworks in hand, we will also add sophistication to our models by introducing more complex features to enhance model performance.


For example, let's look at the computational complexity for the solution to the normal equations.
<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_normal_solution.png" alt='complexity_normal_solution' width='600'>
</td>
</tr>
</table>
</div>

Computing $(\mathbb{X}^{\top}\mathbb{X})^{-1}$ takes $O(nd^2) + O(d^3)$ time, while computing $\mathbb{X}^{\top}\mathbb{Y}$ takes $O(nd)$ time, where $n$ is the number of samples and $d$ is the number of features. The first term dominates the overall complexity and can be problematic for high-dimensional models.
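
To make these costs concrete, here is a minimal NumPy sketch of the closed-form solution. The shapes, seed, and names (`X`, `Y`, `theta_hat`) are made up for illustration and are not taken from the lecture code.

```python
import numpy as np

n, d = 100_000, 50                       # n samples, d features (illustrative sizes)
rng = np.random.default_rng(42)
X = rng.normal(size=(n, d))              # design matrix
Y = rng.normal(size=n)                   # target vector

XtX = X.T @ X                            # forming the (d, d) matrix costs O(n d^2)
XtY = X.T @ Y                            # forming the (d,) vector costs O(n d)
theta_hat = np.linalg.inv(XtX) @ XtY     # inverting the (d, d) matrix costs O(d^3)

# In practice, solving the linear system is preferred over an explicit inverse.
theta_hat = np.linalg.solve(XtX, XtY)
```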

<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_grad_descent.png" alt='complexity_grad_descent' width='600'>
</td>
</tr>
</table>
</div>

Looking back at the complexity of gradient descent, suppose we run $T$ iterations; then the total complexity is $O(Tnd)$. Typically, $n$ is much larger than $T$ or $d$. How can we reduce the cost of this algorithm using a technique from DATA 100?


Let's now compare the time complexity between batch gradient descent and stochastic gradient descent.


<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/time-complexity-compare.png" alt='time-complexity-compare' width='600'>
</td>
</tr>
</table>
</div>

As shown above, the time complexity scales with the number of data points selected in the sample. Stochastic gradient descent approximates the gradient while reducing the computational cost, which creates a tradeoff: as the batch size increases, the sampled gradient becomes a better estimate of the true gradient.

A few notes on the **batch size**, $b$: it is typically small, and the original stochastic gradient descent algorithm proposed a batch size of 1. Today, the choice of $b$ depends on several factors. A larger batch size gives a better gradient estimate and can be parallelized, but with diminishing returns; a smaller batch size means more frequent updates.
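
As a rough sketch (not the course's implementation), a mini-batch SGD loop for linear regression under average squared loss might look like the following; the function name, constant learning rate, and default batch size are assumptions made for illustration.

```python
import numpy as np

def mini_batch_sgd(X, Y, batch_size=32, learning_rate=0.01, epochs=10, seed=0):
    """Schematic mini-batch SGD for linear regression with average squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        shuffled = rng.permutation(n)                 # sample without replacement
        for start in range(0, n, batch_size):
            batch = shuffled[start:start + batch_size]
            X_b, Y_b = X[batch], Y[batch]
            # Gradient of average squared loss on the batch: O(b * d) work per update
            grad = 2 / len(batch) * X_b.T @ (X_b @ theta - Y_b)
            theta = theta - learning_rate * grad
    return theta
```

Setting `batch_size=1` recovers the original stochastic algorithm, while `batch_size=n` recovers batch gradient descent.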


## Feature Engineering

At this point in the course, we've equipped ourselves with some powerful techniques to build and optimize models. We've explored how to develop models of multiple variables, as well as how to transform variables to help **linearize** a dataset and fit these models to maximize their performance.
@@ -267,4 +318,3 @@ We can see that there is a clear trade-off that comes from the complexity of our
The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to "unseen" data. A model that is too simple won't be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.

This begs the question: how do we control the complexity of a model? Stay tuned for our Lecture 17 on Cross-Validation and Regularization!

56 changes: 14 additions & 42 deletions gradient_descent/gradient_descent.qmd
@@ -14,7 +14,17 @@ format:
- cosmo
- cerulean
callout-icon: false
jupyter: python3
jupyter:
jupytext:
text_representation:
extension: .qmd
format_name: quarto
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

::: {.callout-note collapse="false"}
@@ -62,11 +72,11 @@ Our goal will be to predict the value of the `"bill_depth_mm"` for a particular
penguins["bias"] = np.ones(len(penguins), dtype=int)
# Define the design matrix, X...
# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form
# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it is in Matrix form
X = penguins[["bias", "flipper_length_mm", "body_mass_g"]].to_numpy()
# ...as well as the target variable, Y
# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it's in Matrix form
# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it is in Matrix form
Y = penguins[["bill_depth_mm"]].to_numpy()
```

@@ -318,7 +328,7 @@ This basic approach suffers from three major flaws:
2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.
3. It is *very* computationally inefficient, considering potentially vast numbers of guesses that are useless.

#### Scipy.optimize.minimize
#### `Scipy.optimize.minimize`

One way to minimize this mathematical function is to use the `scipy.optimize.minimize` function. It takes a function and a starting guess and tries to find the minimum.
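
For instance, a small sketch of this call; the toy objective and starting guess below are made up for illustration and are not the arbitrary function used in lecture.

```python
from scipy.optimize import minimize

# A toy convex objective whose minimum sits at theta = 3
def objective(theta):
    return (theta[0] - 3) ** 2 + 1

result = minimize(objective, x0=[4.0])   # start the search from an initial guess of 4
print(result.x, result.fun)              # roughly [3.], 1.0
```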

@@ -716,7 +726,6 @@ $$\begin{bmatrix}

Formally, the algorithm we derived above is called **batch gradient descent.** For each iteration of the algorithm, the derivative of loss is computed across the *entire* batch of all $n$ datapoints. While this update rule works well in theory, it is not practical in most circumstances. For large datasets (with perhaps billions of datapoints), finding the gradient across all the data is incredibly computationally taxing; gradient descent will converge slowly because each individual update is slow.
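
To see where that cost comes from, here is a schematic full-batch update loop for linear regression under average squared loss; the names and constant learning rate are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def batch_gradient_descent(X, Y, learning_rate=0.01, iterations=1000):
    """Schematic batch gradient descent for linear regression with average squared loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iterations):
        # Each update touches all n rows: O(n * d) work per iteration,
        # so T iterations cost O(T * n * d) overall.
        grad = 2 / n * X.T @ (X @ theta - Y)
        theta = theta - learning_rate * grad
    return theta
```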

For example, let's look at the computational complexity for the solution to the normal equations.
<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_normal_solution.png" alt='complexity_normal_solution' width='600'>
</td>
</tr>
</table>
</div>

Computing $(\mathbb{X}^{\top}\mathbb{X})^{-1}$ takes $O(nd^2) + O(d^3)$ while $\mathbb{X}^{\top}\mathbb{Y}$ takes $O(nd)$ time complexity where $n$ is the number of samples and $d$ is the number of features. The first term dominates the complexity and can be problematic for high dimensional models.

<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/complexity_grad_descent.png" alt='complexity_grad_descent' width='600'>
</td>
</tr>
</table>
</div>

Looking back at the complexity for gradient descent, suppose we run $T$ iterations then the final complexity is $O(Tnd)$. Typically $n$ is much larger than $T$ or $d$. How should we reduce the cost of this algorithm using a technique in DATA 100?

**Stochastic (mini-batch) gradient descent** tries to address this issue. In stochastic descent, only a *sample* of the full dataset is used at each update. We estimate the true gradient of the loss surface using just that sample of data. The **batch size** is the number of data points used in each sample. The sampling strategy is generally without replacement (the data is shuffled, and batch-size examples are selected one at a time).

Each complete "pass" through the data is known as a **training epoch**. After shuffling the data, in a single **training epoch** of stochastic gradient descent, we
@@ -763,21 +750,6 @@ The diagrams below represent a "bird's eye view" of a loss surface from above. N
</table>
</div>

Let's now compare the time complexity between batch gradient descent and stochastic gradient descent.


<div align="middle">
<table style="width:100%">
<tr align="center">
<td><img src="images/time-complexity-compare.png" alt='time-complexity-compare' width='600'>
</td>
</tr>
</table>
</div>
As shown above, the time complexity scales with the number of data points selected in the sample. Stochastic gradient descent helps approximate the gradient while also reducing the computation cost, leading to this tradeoff. As the batch-size increases, it will be a better estimate of the true gradient.

A few notes on the **batch size, b.** It is typically small and the original stochastic gradient descent algorithm proposed a batch size of 1. Today the choice of b depends on several factors. A larger batch size can lead to better gradient estimate and can be parallelized. However, there is diminishing returns. A smaller batch size means that there will be more frequent updates.

To summarize the tradeoffs of batch size:

| - | Smaller Batch Size | Larger Batch Size |
2 changes: 1 addition & 1 deletion index.log
@@ -1,4 +1,4 @@
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2024.2.22) 29 FEB 2024 00:14
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2024.2.22) 29 FEB 2024 10:44
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
Binary file modified index.pdf
Binary file not shown.
