diff --git a/_quarto.yml b/_quarto.yml
index 6deded89..a88ebe7c 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -25,7 +25,7 @@ book:
         - visualization_1/visualization_1.qmd
         - visualization_2/visualization_2.qmd
         - sampling/sampling.qmd
-        # - intro_to_modeling/intro_to_modeling.qmd
+        - intro_to_modeling/intro_to_modeling.qmd
         # - constant_model_loss_transformations/loss_transformations.qmd
         # - ols/ols.qmd
         # - gradient_descent/gradient_descent.qmd
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index bf485fe7..246b651a 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -237,6 +237,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -325,7 +331,7 @@ <h2 id="toc-title">Data Cleaning and EDA</h2>
 </header>
 
 
-<div id="584fe3b6" class="cell" data-execution_count="1">
+<div id="4a3a59dc" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -390,7 +396,7 @@ <h3 data-number="5.1.1" class="anchored" data-anchor-id="file-formats"><span cla
 <section id="csv" class="level4" data-number="5.1.1.1">
 <h4 data-number="5.1.1.1" class="anchored" data-anchor-id="csv"><span class="header-section-number">5.1.1.1</span> CSV</h4>
 <p>CSVs, which stand for <strong>Comma-Separated Values</strong>, are a common tabular data format. In the past two <code>pandas</code> lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our <code>elections</code> and <code>babynames</code> datasets were stored and loaded as CSVs:</p>
-<div id="3b37dfaf" class="cell" data-execution_count="2">
+<div id="668aa7ec" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>pd.read_csv(<span class="st">"data/elections.csv"</span>).head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="2">
 <div>
@@ -461,7 +467,7 @@ <h4 data-number="5.1.1.1" class="anchored" data-anchor-id="csv"><span class="hea
 </div>
 </div>
 <p>To better understand the properties of a CSV, let’s take a look at the first few rows of the raw data file to see what it looks like before being loaded into a <code>DataFrame</code>. We’ll use the <code>repr()</code> function to return the raw string with its special characters:</p>
-<div id="e5063b44" class="cell" data-execution_count="3">
+<div id="8061633e" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">"data/elections.csv"</span>, <span class="st">"r"</span>) <span class="im">as</span> table:</span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>    i <span class="op">=</span> <span class="dv">0</span></span>
 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> row <span class="kw">in</span> table:</span>
@@ -482,7 +488,7 @@ <h4 data-number="5.1.1.1" class="anchored" data-anchor-id="csv"><span class="hea
 <h4 data-number="5.1.1.2" class="anchored" data-anchor-id="tsv"><span class="header-section-number">5.1.1.2</span> TSV</h4>
 <p>Another common file type is <strong>TSV (Tab-Separated Values)</strong>. In a TSV, records are still delimited by a newline <code>\n</code>, while fields are delimited by the <code>\t</code> tab character.</p>
 <p>Let’s check out the first few rows of the raw TSV file. Again, we’ll use the <code>repr()</code> function so that <code>print</code> shows the special characters.</p>
-<div id="b55bf138" class="cell" data-execution_count="4">
+<div id="85e89687" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">"data/elections.txt"</span>, <span class="st">"r"</span>) <span class="im">as</span> table:</span>
 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>    i <span class="op">=</span> <span class="dv">0</span></span>
 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> row <span class="kw">in</span> table:</span>
@@ -498,7 +504,7 @@ <h4 data-number="5.1.1.2" class="anchored" data-anchor-id="tsv"><span class="hea
 </div>
 </div>
 <p>TSVs can be loaded into <code>pandas</code> using <code>pd.read_csv</code>. We’ll need to specify the <strong>delimiter</strong> with the parameter <code>sep='\t'</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">(documentation)</a>.</p>
-<div id="2e75db0c" class="cell" data-execution_count="5">
+<div id="1bd8c0d2" class="cell" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>pd.read_csv(<span class="st">"data/elections.txt"</span>, sep<span class="op">=</span><span class="st">'</span><span class="ch">\t</span><span class="st">'</span>).head(<span class="dv">3</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="5">
 <div>
@@ -555,7 +561,7 @@ <h4 data-number="5.1.1.2" class="anchored" data-anchor-id="tsv"><span class="hea
 <section id="json" class="level4" data-number="5.1.1.3">
 <h4 data-number="5.1.1.3" class="anchored" data-anchor-id="json"><span class="header-section-number">5.1.1.3</span> JSON</h4>
 <p><strong>JSON (JavaScript Object Notation)</strong> files behave similarly to Python dictionaries. A raw JSON is shown below.</p>
-<div id="02097134" class="cell" data-execution_count="6">
+<div id="dac175be" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">"data/elections.json"</span>, <span class="st">"r"</span>) <span class="im">as</span> table:</span>
 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>    i <span class="op">=</span> <span class="dv">0</span></span>
 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> row <span class="kw">in</span> table:</span>
@@ -585,7 +591,7 @@ <h4 data-number="5.1.1.3" class="anchored" data-anchor-id="json"><span class="he
 </div>
 </div>
 <p>JSON files can be loaded into <code>pandas</code> using <code>pd.read_json</code>.</p>
-<div id="33ac9232" class="cell" data-execution_count="7">
+<div id="601c0201" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>pd.read_json(<span class="st">'data/elections.json'</span>).head(<span class="dv">3</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="7">
 <div>
@@ -640,7 +646,7 @@ <h4 data-number="5.1.1.3" class="anchored" data-anchor-id="json"><span class="he
 <section id="eda-with-json-berkeley-covid-19-data" class="level5" data-number="5.1.1.3.1">
 <h5 data-number="5.1.1.3.1" class="anchored" data-anchor-id="eda-with-json-berkeley-covid-19-data"><span class="header-section-number">5.1.1.3.1</span> EDA with JSON: Berkeley COVID-19 Data</h5>
 <p>The City of Berkeley Open Data <a href="https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766">website</a> has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let’s download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the <a href="https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html"><code>ds100_utils.py</code></a> file so that we can reuse them in many different notebooks.</p>
-<div id="e7a36b8a" class="cell" data-execution_count="8">
+<div id="2e810b18" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> ds100_utils <span class="im">import</span> fetch_and_cache</span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>covid_file <span class="op">=</span> fetch_and_cache(</span>
@@ -659,7 +665,7 @@ <h5 data-number="5.1.1.3.1" class="anchored" data-anchor-id="eda-with-json-berke
 <h6 data-number="5.1.1.3.1.1" class="anchored" data-anchor-id="file-size"><span class="header-section-number">5.1.1.3.1.1</span> File Size</h6>
 <p>Let’s start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use <code>Python</code> tools to probe the file.</p>
 <p>Since this seems to be a text file, let’s investigate the number of lines, which often corresponds to the number of records.</p>
-<div id="3f1a5bc4" class="cell" data-execution_count="9">
+<div id="2f96fc29" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> os</span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(covid_file, <span class="st">"is"</span>, os.path.getsize(covid_file) <span class="op">/</span> <span class="fl">1e6</span>, <span class="st">"MB"</span>)</span>
@@ -677,7 +683,7 @@ <h6 data-number="5.1.1.3.1.2" class="anchored" data-anchor-id="unix-commands"><s
 <p>As part of the EDA workflow, Unix commands can come in very handy. In fact, there’s an entire book called <a href="https://datascienceatthecommandline.com/">“Data Science at the Command Line”</a> that explores this idea in depth! In Jupyter/IPython, you can prefix lines with <code>!</code> to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax <code>{expr}</code>.</p>
 <p>Here, we use the <code>ls</code> command to list files, using the <code>-lh</code> flags, which request “long format with information in human-readable form.” We also use the <code>wc</code> command for “word count,” but with the <code>-l</code> flag, which asks for line counts instead of words.</p>
 <p>These two give us the same information as the code above, albeit in a slightly different form:</p>
-<div id="5a4a0b91" class="cell" data-execution_count="10">
+<div id="42a4da0a" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="op">!</span>ls <span class="op">-</span>lh {covid_file}</span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="op">!</span>wc <span class="op">-</span>l {covid_file}</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
@@ -689,7 +695,7 @@ <h6 data-number="5.1.1.3.1.2" class="anchored" data-anchor-id="unix-commands"><s
 <section id="file-contents" class="level6" data-number="5.1.1.3.1.3">
 <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><span class="header-section-number">5.1.1.3.1.3</span> File Contents</h6>
 <p>Let’s explore the data format using <code>Python</code>.</p>
-<div id="962a4d97" class="cell" data-execution_count="11">
+<div id="9f1780cb" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(covid_file, <span class="st">"r"</span>) <span class="im">as</span> f:</span>
 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> i, row <span class="kw">in</span> <span class="bu">enumerate</span>(f):</span>
 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>        <span class="bu">print</span>(<span class="bu">repr</span>(row)) <span class="co"># print raw strings</span></span>
@@ -703,7 +709,7 @@ <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><s
 </div>
 </div>
 <p>We can use the <code>head</code> Unix command (which is where <code>pandas</code>’ <code>head</code> method comes from!) to see the first few lines of the file:</p>
-<div id="f161a9a2" class="cell" data-execution_count="12">
+<div id="0f99292c" class="cell" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="op">!</span>head <span class="op">-</span><span class="dv">5</span> {covid_file}</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code>{
@@ -714,21 +720,21 @@ <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><s
 </div>
 </div>
 <p>Before loading the JSON file into <code>pandas</code>, let’s first do some EDA with Python’s <code>json</code> package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into <code>pandas</code>. Python has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell, we import the entire JSON data file into a Python dictionary using the <code>json</code> package.</p>
-<div id="81449488" class="cell" data-execution_count="13">
+<div id="27fb5d7f" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> json</span>
 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(covid_file, <span class="st">"rb"</span>) <span class="im">as</span> f:</span>
 <span id="cb22-4"><a href="#cb22-4" aria-hidden="true" tabindex="-1"></a>    covid_json <span class="op">=</span> json.load(f)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>The <code>covid_json</code> variable is now a dictionary encoding the data in the file:</p>
-<div id="f45d02e2" class="cell" data-execution_count="14">
+<div id="4d3ce182" class="cell" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span>(covid_json)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="14">
 <pre><code>dict</code></pre>
 </div>
 </div>
 <p>We can examine what keys are in the top level JSON object by listing out the keys.</p>
-<div id="a8f1e4c5" class="cell" data-execution_count="15">
+<div id="a20ef170" class="cell" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb25"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a>covid_json.keys()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="15">
 <pre><code>dict_keys(['meta', 'data'])</code></pre>
@@ -736,14 +742,14 @@ <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><s
 </div>
 <p><strong>Observation</strong>: The JSON dictionary contains a <code>meta</code> key which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.</p>
 <p>We can investigate the metadata further by examining the keys associated with the metadata.</p>
-<div id="891f4f9f" class="cell" data-execution_count="16">
+<div id="5744b4e4" class="cell" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a>covid_json[<span class="st">'meta'</span>].keys()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="16">
 <pre><code>dict_keys(['view'])</code></pre>
 </div>
 </div>
 <p>The <code>meta</code> key contains another dictionary called <code>view</code>. This likely refers to metadata about a particular “view” of some underlying database. We will learn more about views when we study SQL later in the class.</p>
-<div id="15de9ac0" class="cell" data-execution_count="17">
+<div id="7fed737e" class="cell" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a>covid_json[<span class="st">'meta'</span>][<span class="st">'view'</span>].keys()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="17">
 <pre><code>dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])</code></pre>
@@ -763,7 +769,7 @@ <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><s
     | -&gt; columns
     ...</code></pre>
 <p>There is a key called <code>description</code> in the <code>view</code> sub-dictionary. This likely contains a description of the data:</p>
-<div id="e6b83090" class="cell" data-execution_count="18">
+<div id="4530ab54" class="cell" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb32"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(covid_json[<span class="st">'meta'</span>][<span class="st">'view'</span>][<span class="st">'description'</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code>Counts of confirmed COVID-19 cases among Berkeley residents by date.</code></pre>
@@ -773,7 +779,7 @@ <h6 data-number="5.1.1.3.1.3" class="anchored" data-anchor-id="file-contents"><s
 <section id="examining-the-data-field-for-records" class="level6" data-number="5.1.1.3.1.4">
 <h6 data-number="5.1.1.3.1.4" class="anchored" data-anchor-id="examining-the-data-field-for-records"><span class="header-section-number">5.1.1.3.1.4</span> Examining the Data Field for Records</h6>
 <p>We can look at a few entries in the <code>data</code> field. This is what we’ll load into <code>pandas</code>.</p>
-<div id="8cf1d832" class="cell" data-execution_count="19">
+<div id="baece834" class="cell" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb34"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">3</span>):</span>
 <span id="cb34-2"><a href="#cb34-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>i<span class="sc">:03}</span><span class="ss"> | </span><span class="sc">{</span>covid_json[<span class="st">'data'</span>][i]<span class="sc">}</span><span class="ss">"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
@@ -784,7 +790,7 @@ <h6 data-number="5.1.1.3.1.4" class="anchored" data-anchor-id="examining-the-dat
 </div>
 <p>Observations:</p><ul><li>These look like equal-length records, so maybe <code>data</code> is a table!</li><li>But what does each of the values in the record mean? Where can we find column headers?</li></ul>
 <p>For that, we’ll need the <code>columns</code> key in the metadata dictionary. This returns a list:</p>
-<div id="b3e0493f" class="cell" data-execution_count="20">
+<div id="56dac3b6" class="cell" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb36"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span>(covid_json[<span class="st">'meta'</span>][<span class="st">'view'</span>][<span class="st">'columns'</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="20">
 <pre><code>list</code></pre>
@@ -811,7 +817,7 @@ <h6 data-number="5.1.1.3.1.6" class="anchored" data-anchor-id="loading-covid-dat
 <li><p>Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.</p></li>
 <li><p>Examine the <code>tail</code> of the table.</p></li>
 </ol>
-<div id="bd5de5a6" class="cell" data-execution_count="21">
+<div id="77283c86" class="cell" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb38"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load the data from JSON and assign column titles</span></span>
 <span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>covid <span class="op">=</span> pd.DataFrame(</span>
 <span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a>    covid_json[<span class="st">'data'</span>],</span>
@@ -924,7 +930,7 @@ <h6 data-number="5.1.1.3.1.6" class="anchored" data-anchor-id="loading-covid-dat
 <h3 data-number="5.1.2" class="anchored" data-anchor-id="primary-and-foreign-keys"><span class="header-section-number">5.1.2</span> Primary and Foreign Keys</h3>
 <p>Last time, we introduced <code>.merge</code> as the <code>pandas</code> method for joining multiple <code>DataFrame</code>s together. In our discussion of joins, we touched on the idea of using a “key” to determine what rows should be merged from each table. Let’s take a moment to examine this idea more closely.</p>
 <p>The <strong>primary key</strong> is the column or set of columns in a table that <em>uniquely</em> determines the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student’s Cal ID as the primary key.</p>
-<div id="c3548287" class="cell" data-execution_count="22">
+<div id="3fcafd3c" class="cell" data-execution_count="22">
 <div class="cell-output cell-output-display" data-execution_count="22">
 <div>
 
@@ -970,7 +976,7 @@ <h3 data-number="5.1.2" class="anchored" data-anchor-id="primary-and-foreign-key
 </div>
 </div>
 <p>The <strong>foreign key</strong> is the column or set of columns in a table that references primary keys in other tables. Knowing a dataset’s foreign keys can be useful when assigning the <code>left_on</code> and <code>right_on</code> parameters of <code>.merge</code>. In the table of office hour tickets below, <code>"Cal ID"</code> is a foreign key referencing the previous table.</p>
-<div id="d5a05331" class="cell" data-execution_count="23">
+<div id="a313df36" class="cell" data-execution_count="23">
 <div class="cell-output cell-output-display" data-execution_count="23">
 <div>
 
@@ -1063,7 +1069,7 @@ <h3 data-number="5.2.3" class="anchored" data-anchor-id="temporality"><span clas
 <section id="temporality-with-pandas-dt-accessors" class="level4" data-number="5.2.3.1">
 <h4 data-number="5.2.3.1" class="anchored" data-anchor-id="temporality-with-pandas-dt-accessors"><span class="header-section-number">5.2.3.1</span> Temporality with <code>pandas</code>’ <code>dt</code> accessors</h4>
 <p>Let’s briefly look at how we can use <code>pandas</code>’ <code>dt</code> accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.</p>
-<div id="69c3b3b0" class="cell" data-execution_count="24">
+<div id="8adc7786" class="cell" data-execution_count="24">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb39"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a>calls <span class="op">=</span> pd.read_csv(<span class="st">"data/Berkeley_PD_-_Calls_for_Service.csv"</span>)</span>
@@ -1170,11 +1176,11 @@ <h4 data-number="5.2.3.1" class="anchored" data-anchor-id="temporality-with-pand
 <p>Looks like there are three columns with dates/times: <code>EVENTDT</code>, <code>EVENTTM</code>, and <code>InDbDate</code>.</p>
 <p>Most likely, <code>EVENTDT</code> stands for the date when the event took place, <code>EVENTTM</code> stands for the time of day the event took place (in 24-hr format), and <code>InDbDate</code> is the date this call was recorded in the database.</p>
 <p>If we check the data type of these columns, we will see they are stored as strings. We can convert them to <code>datetime</code> objects using the <code>pandas</code> <code>to_datetime</code> function.</p>
-<div id="7fc86f36" class="cell" data-execution_count="25">
+<div id="311049c7" class="cell" data-execution_count="25">
 <div class="sourceCode cell-code" id="cb40"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><a href="#cb40-1" aria-hidden="true" tabindex="-1"></a>calls[<span class="st">"EVENTDT"</span>] <span class="op">=</span> pd.to_datetime(calls[<span class="st">"EVENTDT"</span>])</span>
 <span id="cb40-2"><a href="#cb40-2" aria-hidden="true" tabindex="-1"></a>calls.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stderr">
-<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73700/874729699.py:1: UserWarning:
+<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83785/874729699.py:1: UserWarning:
 
 Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
 </code></pre>
@@ -1279,7 +1285,7 @@ <h4 data-number="5.2.3.1" class="anchored" data-anchor-id="temporality-with-pand
 </div>
 <p>Now, we can use the <code>dt</code> accessor on this column.</p>
 <p>We can get the month:</p>
-<div id="621aeab0" class="cell" data-execution_count="26">
+<div id="7d737f55" class="cell" data-execution_count="26">
 <div class="sourceCode cell-code" id="cb42"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a>calls[<span class="st">"EVENTDT"</span>].dt.month.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="26">
 <pre><code>0    4
@@ -1291,7 +1297,7 @@ <h4 data-number="5.2.3.1" class="anchored" data-anchor-id="temporality-with-pand
 </div>
 </div>
 <p>Which day of the week the date is on:</p>
-<div id="f1390c37" class="cell" data-execution_count="27">
+<div id="9649191e" class="cell" data-execution_count="27">
 <div class="sourceCode cell-code" id="cb44"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><a href="#cb44-1" aria-hidden="true" tabindex="-1"></a>calls[<span class="st">"EVENTDT"</span>].dt.dayofweek.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="27">
 <pre><code>0    3
@@ -1303,7 +1309,7 @@ <h4 data-number="5.2.3.1" class="anchored" data-anchor-id="temporality-with-pand
 </div>
 </div>
 <p>Check the minimum values to see if there are any suspicious-looking 1970s dates:</p>
-<div id="8141b147" class="cell" data-execution_count="28">
+<div id="c5b82a43" class="cell" data-execution_count="28">
 <div class="sourceCode cell-code" id="cb46"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><a href="#cb46-1" aria-hidden="true" tabindex="-1"></a>calls.sort_values(<span class="st">"EVENTDT"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="28">
 <div>
@@ -1450,7 +1456,7 @@ <h3 data-number="5.4.1" class="anchored" data-anchor-id="csvs-and-field-names"><
 <p>We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:</p><ol><li>Using a text editor like emacs, vim, VSCode, etc.</li><li>Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.</li><li>The <code>Python</code> file object</li><li><code>pandas</code>, using <code>pd.read_csv()</code></li></ol>
 <p>To try out options 1 and 2, you can view or download the Tuberculosis data from the <a href="https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&amp;urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&amp;branch=main">lecture demo notebook</a> under the <code>data</code> folder in the left-hand menu. Notice how the CSV file is a type of <strong>rectangular data (i.e., tabular data) stored as comma-separated values</strong>.</p>
 <p>Next, let’s try out option 3 using the <code>Python</code> file object. We’ll look at the first four lines:</p>
-<div id="199c3c4a" class="cell" data-execution_count="29">
+<div id="23ff1c8e" class="cell" data-execution_count="29">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb47"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb47-1"><a href="#cb47-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">"data/cdc_tuberculosis.csv"</span>, <span class="st">"r"</span>) <span class="im">as</span> f:</span>
@@ -1475,7 +1481,7 @@ <h3 data-number="5.4.1" class="anchored" data-anchor-id="csvs-and-field-names"><
 <p>Whoa, why are there blank lines interspersed between the lines of the CSV?</p>
 <p>You may recall that all line breaks in text files are encoded as the special newline character <code>\n</code>. Python’s <code>print()</code> prints each string (including the newline), and an additional newline on top of that.</p>
 <p>If you’re curious, we can use the <code>repr()</code> function to return the raw string with all special characters:</p>
-<div id="f9edded7" class="cell" data-execution_count="30">
+<div id="b7a79b0e" class="cell" data-execution_count="30">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb49"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1"><a href="#cb49-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">"data/cdc_tuberculosis.csv"</span>, <span class="st">"r"</span>) <span class="im">as</span> f:</span>
@@ -1494,7 +1500,7 @@ <h3 data-number="5.4.1" class="anchored" data-anchor-id="csvs-and-field-names"><
 </div>
 </div>
 <p>Finally, let’s try option 4 and use the tried-and-true Data 100 approach: <code>pandas</code>.</p>
-<div id="48958883" class="cell" data-execution_count="31">
+<div id="095f7286" class="cell" data-execution_count="31">
 <div class="sourceCode cell-code" id="cb51"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1"><a href="#cb51-1" aria-hidden="true" tabindex="-1"></a>tb_df <span class="op">=</span> pd.read_csv(<span class="st">"data/cdc_tuberculosis.csv"</span>)</span>
 <span id="cb51-2"><a href="#cb51-2" aria-hidden="true" tabindex="-1"></a>tb_df.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="31">
@@ -1574,7 +1580,7 @@ <h3 data-number="5.4.1" class="anchored" data-anchor-id="csvs-and-field-names"><
 <p>You may notice some strange things about this table: what’s up with the “Unnamed” column names and the first row?</p>
 <p>Congratulations — you’re ready to wrangle your data! Because of how things are stored, we’ll need to clean the data a bit to name our columns better.</p>
 <p>A reasonable first step is to identify the row with the right header. The <code>pd.read_csv()</code> function (<a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">documentation</a>) has the convenient <code>header</code> parameter that we can set to use the elements in row 1 as the appropriate columns:</p>
-<div id="52edc58a" class="cell" data-execution_count="32">
+<div id="b8273d3a" class="cell" data-execution_count="32">
 <div class="sourceCode cell-code" id="cb52"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><a href="#cb52-1" aria-hidden="true" tabindex="-1"></a>tb_df <span class="op">=</span> pd.read_csv(<span class="st">"data/cdc_tuberculosis.csv"</span>, header<span class="op">=</span><span class="dv">1</span>) <span class="co"># row index</span></span>
 <span id="cb52-2"><a href="#cb52-2" aria-hidden="true" tabindex="-1"></a>tb_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="32">
@@ -1653,7 +1659,7 @@ <h3 data-number="5.4.1" class="anchored" data-anchor-id="csvs-and-field-names"><
 </div>
 <p>Wait…but now we can’t differentiate between the “Number of TB cases” and “TB incidence” year columns. <code>pandas</code> has tried to make our lives easier by automatically adding “.1” to the latter columns, but this doesn’t help us, as humans, understand the data.</p>
 <p>We can do this manually with <code>df.rename()</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename">documentation</a>):</p>
-<div id="d132d207" class="cell" data-execution_count="33">
+<div id="0c6712ba" class="cell" data-execution_count="33">
 <div class="sourceCode cell-code" id="cb53"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb53-1"><a href="#cb53-1" aria-hidden="true" tabindex="-1"></a>rename_dict <span class="op">=</span> {<span class="st">'2019'</span>: <span class="st">'TB cases 2019'</span>,</span>
 <span id="cb53-2"><a href="#cb53-2" aria-hidden="true" tabindex="-1"></a>               <span class="st">'2020'</span>: <span class="st">'TB cases 2020'</span>,</span>
 <span id="cb53-3"><a href="#cb53-3" aria-hidden="true" tabindex="-1"></a>               <span class="st">'2021'</span>: <span class="st">'TB cases 2021'</span>,</span>
@@ -1743,7 +1749,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 <p>Row 0 is what we call a <strong>rollup record</strong>, or summary record. It’s often useful when displaying tables to humans. The <strong>granularity</strong> of record 0 (Totals) vs the rest of the records (States) is different.</p>
 <p>Okay, EDA step two. How was the rollup record aggregated?</p>
 <p>Let’s check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get <strong>2x</strong> the total cases in each of our TB cases by year (why do you think this is?).</p>
-<div id="30b25efd" class="cell" data-execution_count="34">
+<div id="5ea2548a" class="cell" data-execution_count="34">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb54"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1"><a href="#cb54-1" aria-hidden="true" tabindex="-1"></a>tb_df.<span class="bu">sum</span>(axis<span class="op">=</span><span class="dv">0</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1760,7 +1766,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 </div>
 </div>
 <p>Whoa, what’s going on with the TB cases in 2019, 2020, and 2021? Check out the column types:</p>
-<div id="c0911509" class="cell" data-execution_count="35">
+<div id="9432ac68" class="cell" data-execution_count="35">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb56"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb56-1"><a href="#cb56-1" aria-hidden="true" tabindex="-1"></a>tb_df.dtypes</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1778,7 +1784,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 </div>
 <p>Since there are commas in the values for TB cases, the numbers are read as the <code>object</code> datatype, or <strong>storage type</strong> (close to the <code>Python</code> string datatype), so <code>pandas</code> is concatenating strings instead of adding integers (recall that Python can “sum”, or concatenate, strings together: <code>"data" + "100"</code> evaluates to <code>"data100"</code>).</p>
 <p>Fortunately <code>read_csv</code> also has a <code>thousands</code> parameter (<a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">documentation</a>):</p>
-<div id="ece3e975" class="cell" data-execution_count="36">
+<div id="f83979b7" class="cell" data-execution_count="36">
 <div class="sourceCode cell-code" id="cb58"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb58-1"><a href="#cb58-1" aria-hidden="true" tabindex="-1"></a><span class="co"># improve readability: chaining method calls with outer parentheses/line breaks</span></span>
 <span id="cb58-2"><a href="#cb58-2" aria-hidden="true" tabindex="-1"></a>tb_df <span class="op">=</span> (</span>
 <span id="cb58-3"><a href="#cb58-3" aria-hidden="true" tabindex="-1"></a>    pd.read_csv(<span class="st">"data/cdc_tuberculosis.csv"</span>, header<span class="op">=</span><span class="dv">1</span>, thousands<span class="op">=</span><span class="st">','</span>)</span>
@@ -1859,7 +1865,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 </div>
 </div>
 </div>
-<div id="b98e7ccf" class="cell" data-execution_count="37">
+<div id="72785fcb" class="cell" data-execution_count="37">
 <div class="sourceCode cell-code" id="cb59"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb59-1"><a href="#cb59-1" aria-hidden="true" tabindex="-1"></a>tb_df.<span class="bu">sum</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="37">
 <pre><code>U.S. jurisdiction    TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
@@ -1874,7 +1880,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 </div>
 <p>The total TB cases look right. Phew!</p>
 <p>Let’s just look at the records with <strong>state-level granularity</strong>:</p>
-<div id="3f3b1598" class="cell" data-execution_count="38">
+<div id="34009774" class="cell" data-execution_count="38">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb61"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb61-1"><a href="#cb61-1" aria-hidden="true" tabindex="-1"></a>state_tb_df <span class="op">=</span> tb_df[<span class="dv">1</span>:]</span>
@@ -1959,7 +1965,7 @@ <h3 data-number="5.4.2" class="anchored" data-anchor-id="record-granularity"><sp
 <h3 data-number="5.4.3" class="anchored" data-anchor-id="gather-census-data"><span class="header-section-number">5.4.3</span> Gather Census Data</h3>
 <p>U.S. Census population estimates <a href="https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html">source</a> (2019), <a href="https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html">source</a> (2020-2021).</p>
 <p>Running the cells below cleans the data. There are a few new methods here:</p><ul><li><code>df.convert_dtypes()</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html">documentation</a>) conveniently converts all float dtypes into ints and is out of scope for the class.</li><li><code>df.dropna()</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html">documentation</a>) will be explained in more detail next time.</li></ul>
-<div id="f64008f5" class="cell" data-execution_count="39">
+<div id="059fc7b9" class="cell" data-execution_count="39">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb62"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb62-1"><a href="#cb62-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 2010s census data</span></span>
@@ -2083,7 +2089,7 @@ <h3 data-number="5.4.3" class="anchored" data-anchor-id="gather-census-data"><sp
 <p>or use <code>IPython</code> magic, which will automatically reload imported code when files change:</p>
 <div class="sourceCode" id="cb64"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb64-1"><a href="#cb64-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>load_ext autoreload</span>
 <span id="cb64-2"><a href="#cb64-2" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>autoreload <span class="dv">2</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
-<div id="5fa3533a" class="cell" data-execution_count="40">
+<div id="ca768c06" class="cell" data-execution_count="40">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb65"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb65-1"><a href="#cb65-1" aria-hidden="true" tabindex="-1"></a><span class="co"># census 2020s data</span></span>
@@ -2160,7 +2166,7 @@ <h3 data-number="5.4.3" class="anchored" data-anchor-id="gather-census-data"><sp
 <section id="joining-data-merging-dataframes" class="level3" data-number="5.4.4">
 <h3 data-number="5.4.4" class="anchored" data-anchor-id="joining-data-merging-dataframes"><span class="header-section-number">5.4.4</span> Joining Data (Merging <code>DataFrame</code>s)</h3>
 <p>Time to <code>merge</code>! Here we use the <code>DataFrame</code> method <code>df1.merge(right=df2, ...)</code> on <code>DataFrame</code> <code>df1</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html">documentation</a>). Contrast this with the function <code>pd.merge(left=df1, right=df2, ...)</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge">documentation</a>). Feel free to use either.</p>
-<div id="5b746a4c" class="cell" data-execution_count="41">
+<div id="a1d462ea" class="cell" data-execution_count="41">
 <div class="sourceCode cell-code" id="cb66"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb66-1"><a href="#cb66-1" aria-hidden="true" tabindex="-1"></a><span class="co"># merge TB DataFrame with two US census DataFrames</span></span>
 <span id="cb66-2"><a href="#cb66-2" aria-hidden="true" tabindex="-1"></a>tb_census_df <span class="op">=</span> (</span>
 <span id="cb66-3"><a href="#cb66-3" aria-hidden="true" tabindex="-1"></a>    tb_df</span>
@@ -2335,7 +2341,7 @@ <h3 data-number="5.4.4" class="anchored" data-anchor-id="joining-data-merging-da
 </div>
 </div>
 <p>Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census <code>DataFrame</code>s. Let’s do the latter.</p>
-<div id="d998cc25" class="cell" data-execution_count="42">
+<div id="581c2a0d" class="cell" data-execution_count="42">
 <div class="sourceCode cell-code" id="cb67"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb67-1"><a href="#cb67-1" aria-hidden="true" tabindex="-1"></a><span class="co"># try merging again, but cleaner this time</span></span>
 <span id="cb67-2"><a href="#cb67-2" aria-hidden="true" tabindex="-1"></a>tb_census_df <span class="op">=</span> (</span>
 <span id="cb67-3"><a href="#cb67-3" aria-hidden="true" tabindex="-1"></a>    tb_df</span>
@@ -2448,7 +2454,7 @@ <h3 data-number="5.4.5" class="anchored" data-anchor-id="reproducing-data-comput
 <p><span class="math display">\[\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} \]</span></p>
 <p><span class="math display">\[= \frac{\text{TB cases in population}}{\text{population}} \times 100000\]</span></p>
 <p>Let’s try this for 2019:</p>
-<div id="a8759d98" class="cell" data-execution_count="43">
+<div id="c63eb040" class="cell" data-execution_count="43">
 <div class="sourceCode cell-code" id="cb68"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb68-1"><a href="#cb68-1" aria-hidden="true" tabindex="-1"></a>tb_census_df[<span class="st">"recompute incidence 2019"</span>] <span class="op">=</span> tb_census_df[<span class="st">"TB cases 2019"</span>]<span class="op">/</span>tb_census_df[<span class="st">"2019"</span>]<span class="op">*</span><span class="dv">100000</span></span>
 <span id="cb68-2"><a href="#cb68-2" aria-hidden="true" tabindex="-1"></a>tb_census_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="43">
@@ -2551,7 +2557,7 @@ <h3 data-number="5.4.5" class="anchored" data-anchor-id="reproducing-data-comput
 </div>
 <p>Awesome!!!</p>
 <p>Let’s use a for-loop and Python format strings to compute TB incidence for all years. Python f-strings are just used for the purposes of this demo, but they’re handy to know when you explore data beyond this course (<a href="https://docs.python.org/3/tutorial/inputoutput.html">documentation</a>).</p>
-<div id="cac78845" class="cell" data-execution_count="44">
+<div id="34a9e1d6" class="cell" data-execution_count="44">
 <div class="sourceCode cell-code" id="cb69"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb69-1"><a href="#cb69-1" aria-hidden="true" tabindex="-1"></a><span class="co"># recompute incidence for all years</span></span>
 <span id="cb69-2"><a href="#cb69-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> year <span class="kw">in</span> [<span class="dv">2019</span>, <span class="dv">2020</span>, <span class="dv">2021</span>]:</span>
 <span id="cb69-3"><a href="#cb69-3" aria-hidden="true" tabindex="-1"></a>    tb_census_df[<span class="ss">f"recompute incidence </span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>] <span class="op">=</span> tb_census_df[<span class="ss">f"TB cases </span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>]<span class="op">/</span>tb_census_df[<span class="ss">f"</span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>]<span class="op">*</span><span class="dv">100000</span></span>
@@ -2667,7 +2673,7 @@ <h3 data-number="5.4.5" class="anchored" data-anchor-id="reproducing-data-comput
 </div>
 </div>
 <p>These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.</p>
-<div id="956cf302" class="cell" data-execution_count="45">
+<div id="2ee6ba22" class="cell" data-execution_count="45">
 <div class="sourceCode cell-code" id="cb70"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb70-1"><a href="#cb70-1" aria-hidden="true" tabindex="-1"></a>tb_census_df.describe()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="45">
 <div>
@@ -2828,7 +2834,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 <p>This is TB incidence computed across the entire U.S. population! How do we reproduce this?</p><ul><li>We need to reproduce the “Total” TB incidences in our rolled record.</li><li>But our current <code>tb_census_df</code> only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.</li><li>What happened…?</li></ul>
 <p>Let’s get exploring!</p>
 <p>Before we keep exploring, we’ll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.</p>
-<div id="a673fc3d" class="cell" data-execution_count="46">
+<div id="b9c6e249" class="cell" data-execution_count="46">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb71"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb71-1"><a href="#cb71-1" aria-hidden="true" tabindex="-1"></a>tb_df <span class="op">=</span> tb_df.set_index(<span class="st">"U.S. jurisdiction"</span>)</span>
@@ -2911,7 +2917,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 </div>
 </div>
-<div id="d5eb4486" class="cell" data-execution_count="47">
+<div id="0e9a0513" class="cell" data-execution_count="47">
 <div class="sourceCode cell-code" id="cb72"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb72-1"><a href="#cb72-1" aria-hidden="true" tabindex="-1"></a>census_2010s_df <span class="op">=</span> census_2010s_df.set_index(<span class="st">"Geographic Area"</span>)</span>
 <span id="cb72-2"><a href="#cb72-2" aria-hidden="true" tabindex="-1"></a>census_2010s_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="47">
@@ -3019,7 +3025,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 </div>
 </div>
-<div id="3fa3e1ab" class="cell" data-execution_count="48">
+<div id="820cd4b7" class="cell" data-execution_count="48">
 <div class="sourceCode cell-code" id="cb73"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb73-1"><a href="#cb73-1" aria-hidden="true" tabindex="-1"></a>census_2020s_df <span class="op">=</span> census_2020s_df.set_index(<span class="st">"Geographic Area"</span>)</span>
 <span id="cb73-2"><a href="#cb73-2" aria-hidden="true" tabindex="-1"></a>census_2020s_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="48">
@@ -3079,7 +3085,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 </div>
 <p>It turns out that our merge above only kept state records, even though our original <code>tb_df</code> had the “Total” rolled record:</p>
-<div id="09f3dc8e" class="cell" data-execution_count="49">
+<div id="be2f3f38" class="cell" data-execution_count="49">
 <div class="sourceCode cell-code" id="cb74"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb74-1"><a href="#cb74-1" aria-hidden="true" tabindex="-1"></a>tb_df.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="49">
 <div>
@@ -3160,7 +3166,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 <p>Recall that <code>merge</code> does an <strong>inner</strong> merge by default, meaning that it only preserves keys that are present in <strong>both</strong> <code>DataFrame</code>s.</p>
 <p>The rolled records in our census <code>DataFrame</code>s have different values in the <code>Geographic Area</code> field, which was the key we merged on:</p>
-<div id="80545f63" class="cell" data-execution_count="50">
+<div id="5609c106" class="cell" data-execution_count="50">
 <div class="sourceCode cell-code" id="cb75"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb75-1"><a href="#cb75-1" aria-hidden="true" tabindex="-1"></a>census_2010s_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="50">
 <div>
@@ -3269,7 +3275,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 <p>The Census <code>DataFrame</code> has several rolled records. The aggregate record we are looking for actually has the Geographic Area named “United States”.</p>
 <p>One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we’ll use <code>df.rename()</code> (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html">documentation</a>):</p>
-<div id="f96db225" class="cell" data-execution_count="51">
+<div id="91a45f73" class="cell" data-execution_count="51">
 <div class="sourceCode cell-code" id="cb76"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb76-1"><a href="#cb76-1" aria-hidden="true" tabindex="-1"></a><span class="co"># rename rolled record for 2010s</span></span>
 <span id="cb76-2"><a href="#cb76-2" aria-hidden="true" tabindex="-1"></a>census_2010s_df.rename(index<span class="op">=</span>{<span class="st">'United States'</span>:<span class="st">'Total'</span>}, inplace<span class="op">=</span><span class="va">True</span>)</span>
 <span id="cb76-3"><a href="#cb76-3" aria-hidden="true" tabindex="-1"></a>census_2010s_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -3378,7 +3384,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 </div>
 </div>
-<div id="80c4a073" class="cell" data-execution_count="52">
+<div id="89d56db7" class="cell" data-execution_count="52">
 <div class="sourceCode cell-code" id="cb77"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb77-1"><a href="#cb77-1" aria-hidden="true" tabindex="-1"></a><span class="co"># same, but for 2020s rename rolled record</span></span>
 <span id="cb77-2"><a href="#cb77-2" aria-hidden="true" tabindex="-1"></a>census_2020s_df.rename(index<span class="op">=</span>{<span class="st">'United States'</span>:<span class="st">'Total'</span>}, inplace<span class="op">=</span><span class="va">True</span>)</span>
 <span id="cb77-3"><a href="#cb77-3" aria-hidden="true" tabindex="-1"></a>census_2020s_df.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -3440,7 +3446,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 <p><br></p>
 <p>Next let’s rerun our merge. Note the different chaining, because we are now merging on indexes (<code>df.merge()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html">documentation</a>).</p>
-<div id="77390ced" class="cell" data-execution_count="53">
+<div id="e5f82bef" class="cell" data-execution_count="53">
 <div class="sourceCode cell-code" id="cb78"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb78-1"><a href="#cb78-1" aria-hidden="true" tabindex="-1"></a>tb_census_df <span class="op">=</span> (</span>
 <span id="cb78-2"><a href="#cb78-2" aria-hidden="true" tabindex="-1"></a>    tb_df</span>
 <span id="cb78-3"><a href="#cb78-3" aria-hidden="true" tabindex="-1"></a>    .merge(right<span class="op">=</span>census_2010s_df[[<span class="st">"2019"</span>]],</span>
@@ -3537,7 +3543,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 </div>
 <p><br></p>
 <p>Finally, let’s recompute our incidences:</p>
-<div id="e28d80a8" class="cell" data-execution_count="54">
+<div id="745edaa1" class="cell" data-execution_count="54">
 <div class="sourceCode cell-code" id="cb79"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb79-1"><a href="#cb79-1" aria-hidden="true" tabindex="-1"></a><span class="co"># recompute incidence for all years</span></span>
 <span id="cb79-2"><a href="#cb79-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> year <span class="kw">in</span> [<span class="dv">2019</span>, <span class="dv">2020</span>, <span class="dv">2021</span>]:</span>
 <span id="cb79-3"><a href="#cb79-3" aria-hidden="true" tabindex="-1"></a>    tb_census_df[<span class="ss">f"recompute incidence </span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>] <span class="op">=</span> tb_census_df[<span class="ss">f"TB cases </span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>]<span class="op">/</span>tb_census_df[<span class="ss">f"</span><span class="sc">{</span>year<span class="sc">}</span><span class="ss">"</span>]<span class="op">*</span><span class="dv">100000</span></span>
@@ -3652,21 +3658,21 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 <p>Reported TB incidence (cases per 100,000 persons) increased <strong>9.4%</strong>, from <strong>2.2</strong> during 2020 to <strong>2.4</strong> during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.</p>
 </blockquote>
 <p>Recall that percent change from <span class="math inline">\(A\)</span> to <span class="math inline">\(B\)</span> is computed as <span class="math inline">\(\text{percent change} = \frac{B - A}{A} \times 100\)</span>.</p>
-<div id="46ecaa8f" class="cell" data-execution_count="55">
+<div id="68f2417c" class="cell" data-execution_count="55">
 <div class="sourceCode cell-code" id="cb80"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb80-1"><a href="#cb80-1" aria-hidden="true" tabindex="-1"></a>incidence_2020 <span class="op">=</span> tb_census_df.loc[<span class="st">'Total'</span>, <span class="st">'recompute incidence 2020'</span>]</span>
 <span id="cb80-2"><a href="#cb80-2" aria-hidden="true" tabindex="-1"></a>incidence_2020</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="55">
 <pre><code>np.float64(2.1637257652759883)</code></pre>
 </div>
 </div>
-<div id="fe564ecb" class="cell" data-execution_count="56">
+<div id="56f4576e" class="cell" data-execution_count="56">
 <div class="sourceCode cell-code" id="cb82"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb82-1"><a href="#cb82-1" aria-hidden="true" tabindex="-1"></a>incidence_2021 <span class="op">=</span> tb_census_df.loc[<span class="st">'Total'</span>, <span class="st">'recompute incidence 2021'</span>]</span>
 <span id="cb82-2"><a href="#cb82-2" aria-hidden="true" tabindex="-1"></a>incidence_2021</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="56">
 <pre><code>np.float64(2.3672448914298068)</code></pre>
 </div>
 </div>
-<div id="459fde6f" class="cell" data-execution_count="57">
+<div id="b040fe55" class="cell" data-execution_count="57">
 <div class="sourceCode cell-code" id="cb84"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb84-1"><a href="#cb84-1" aria-hidden="true" tabindex="-1"></a>difference <span class="op">=</span> (incidence_2021 <span class="op">-</span> incidence_2020)<span class="op">/</span>incidence_2020 <span class="op">*</span> <span class="dv">100</span></span>
 <span id="cb84-2"><a href="#cb84-2" aria-hidden="true" tabindex="-1"></a>difference</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="57">
@@ -3678,7 +3684,7 @@ <h3 data-number="5.4.6" class="anchored" data-anchor-id="bonus-eda-reproducing-t
 <section id="eda-demo-2-mauna-loa-co2-data-a-lesson-in-data-faithfulness" class="level2" data-number="5.5">
 <h2 data-number="5.5" class="anchored" data-anchor-id="eda-demo-2-mauna-loa-co2-data-a-lesson-in-data-faithfulness"><span class="header-section-number">5.5</span> EDA Demo 2: Mauna Loa CO<sub>2</sub> Data – A Lesson in Data Faithfulness</h2>
 <p><a href="https://gml.noaa.gov/ccgg/trends/data.html">Mauna Loa Observatory</a> has been monitoring CO<sub>2</sub> concentrations since 1958.</p>
-<div id="b6d85a58" class="cell" data-execution_count="58">
+<div id="f6b25eb9" class="cell" data-execution_count="58">
 <div class="sourceCode cell-code" id="cb86"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb86-1"><a href="#cb86-1" aria-hidden="true" tabindex="-1"></a>co2_file <span class="op">=</span> <span class="st">"data/co2_mm_mlo.txt"</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>Let’s do some <strong>EDA</strong>!!</p>
@@ -3703,7 +3709,7 @@ <h3 data-number="5.5.1" class="anchored" data-anchor-id="reading-this-file-into-
 <li>The 71st and 72nd lines in the file contain column headings split over two lines.</li>
 </ul>
 <p>We can use&nbsp;<code>read_csv</code>&nbsp;to read the data into a <code>pandas</code> <code>DataFrame</code>, and we provide several arguments to specify that the separators are white space, there is no header (<strong>we will set our own column names</strong>), and to skip the first 72 rows of the file.</p>
-<div id="092beb9b" class="cell" data-execution_count="59">
+<div id="bc75886f" class="cell" data-execution_count="59">
 <div class="sourceCode cell-code" id="cb88"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb88-1"><a href="#cb88-1" aria-hidden="true" tabindex="-1"></a>co2 <span class="op">=</span> pd.read_csv(</span>
 <span id="cb88-2"><a href="#cb88-2" aria-hidden="true" tabindex="-1"></a>    co2_file, header <span class="op">=</span> <span class="va">None</span>, skiprows <span class="op">=</span> <span class="dv">72</span>,</span>
 <span id="cb88-3"><a href="#cb88-3" aria-hidden="true" tabindex="-1"></a>    sep <span class="op">=</span> <span class="vs">r'\s+'</span>       <span class="co"># delimiter for continuous whitespace (stay tuned for regex next lecture)</span></span>
@@ -3791,7 +3797,7 @@ <h3 data-number="5.5.1" class="anchored" data-anchor-id="reading-this-file-into-
 <h3 data-number="5.5.2" class="anchored" data-anchor-id="exploring-variable-feature-types"><span class="header-section-number">5.5.2</span> Exploring Variable Feature Types</h3>
 <p>The NOAA <a href="https://gml.noaa.gov/ccgg/trends/">webpage</a> might have some useful tidbits (in this case it doesn’t).</p>
 <p>Using this information, we’ll rerun <code>pd.read_csv</code>, but this time with some <strong>custom column names.</strong></p>
-<div id="35aff177" class="cell" data-execution_count="60">
+<div id="0720c446" class="cell" data-execution_count="60">
 <div class="sourceCode cell-code" id="cb89"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb89-1"><a href="#cb89-1" aria-hidden="true" tabindex="-1"></a>co2 <span class="op">=</span> pd.read_csv(</span>
 <span id="cb89-2"><a href="#cb89-2" aria-hidden="true" tabindex="-1"></a>    co2_file, header <span class="op">=</span> <span class="va">None</span>, skiprows <span class="op">=</span> <span class="dv">72</span>,</span>
 <span id="cb89-3"><a href="#cb89-3" aria-hidden="true" tabindex="-1"></a>    sep <span class="op">=</span> <span class="st">'\s+'</span>, <span class="co">#regex for continuous whitespace (next lecture)</span></span>
@@ -3807,7 +3813,7 @@ <h3 data-number="5.5.2" class="anchored" data-anchor-id="exploring-variable-feat
 
 invalid escape sequence '\s'
 
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73700/150137587.py:3: SyntaxWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83785/150137587.py:3: SyntaxWarning:
 
 invalid escape sequence '\s'
 </code></pre>
@@ -3890,7 +3896,7 @@ <h3 data-number="5.5.2" class="anchored" data-anchor-id="exploring-variable-feat
 <section id="visualizing-co2" class="level3" data-number="5.5.3">
 <h3 data-number="5.5.3" class="anchored" data-anchor-id="visualizing-co2"><span class="header-section-number">5.5.3</span> Visualizing CO<sub>2</sub></h3>
 <p>Scientific studies tend to have very clean data, right…? Let’s jump right in and make a time series plot of CO<sub>2</sub> monthly averages.</p>
-<div id="07cd8c10" class="cell" data-execution_count="61">
+<div id="e6a86ae6" class="cell" data-execution_count="61">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb91"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb91-1"><a href="#cb91-1" aria-hidden="true" tabindex="-1"></a>sns.lineplot(x<span class="op">=</span><span class="st">'DecDate'</span>, y<span class="op">=</span><span class="st">'Avg'</span>, data<span class="op">=</span>co2)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -3905,7 +3911,7 @@ <h3 data-number="5.5.3" class="anchored" data-anchor-id="visualizing-co2"><span
 </div>
 <p>The code above uses the <code>seaborn</code> plotting library (abbreviated <code>sns</code>). We will cover this in the Visualization lecture, but for now you don’t need to worry about how it works!</p>
 <p>Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some <strong>missing values</strong>. What happened here?</p>
-<div id="3e8d073d" class="cell" data-execution_count="62">
+<div id="2bc5cb60" class="cell" data-execution_count="62">
 <div class="sourceCode cell-code" id="cb92"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb92-1"><a href="#cb92-1" aria-hidden="true" tabindex="-1"></a>co2.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="62">
 <div>
@@ -3981,7 +3987,7 @@ <h3 data-number="5.5.3" class="anchored" data-anchor-id="visualizing-co2"><span
 </div>
 </div>
 </div>
-<div id="9695ae7f" class="cell" data-execution_count="63">
+<div id="8d22bc23" class="cell" data-execution_count="63">
 <div class="sourceCode cell-code" id="cb93"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb93-1"><a href="#cb93-1" aria-hidden="true" tabindex="-1"></a>co2.tail()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="63">
 <div>
@@ -4074,7 +4080,7 @@ <h3 data-number="5.5.4" class="anchored" data-anchor-id="sanity-checks-reasoning
 <li>Data from March 1958 to August 2019.</li>
 <li>We should have <span class="math inline">\(12 \times (2019-1957) - 2 - 4 = 738\)</span> records.</li>
 </ul>
-<div id="c89a5c34" class="cell" data-execution_count="64">
+<div id="c89354af" class="cell" data-execution_count="64">
 <div class="sourceCode cell-code" id="cb94"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb94-1"><a href="#cb94-1" aria-hidden="true" tabindex="-1"></a>co2.shape</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="64">
 <pre><code>(738, 7)</code></pre>
@@ -4088,7 +4094,7 @@ <h3 data-number="5.5.5" class="anchored" data-anchor-id="understanding-missing-v
 <p><code>Days</code> is a time field, so let’s analyze other time fields to see if there is an explanation for missing values of days of operation.</p>
 <p>Let’s start with <strong>months</strong>, <code>Mo</code>.</p>
 <p>Are we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).</p>
-<div id="cb56088d" class="cell" data-execution_count="65">
+<div id="9bef2dc0" class="cell" data-execution_count="65">
 <div class="sourceCode cell-code" id="cb96"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb96-1"><a href="#cb96-1" aria-hidden="true" tabindex="-1"></a>co2[<span class="st">"Mo"</span>].value_counts().sort_index()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="65">
 <pre><code>Mo
@@ -4110,7 +4116,7 @@ <h3 data-number="5.5.5" class="anchored" data-anchor-id="understanding-missing-v
 <p>As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest have 62.</p>
 <p><br></p>
 <p>Next let’s explore <strong>days</strong> <code>Days</code> itself, which is the number of days that the measurement equipment worked.</p>
-<div id="fd0ea048" class="cell" data-execution_count="66">
+<div id="94ba94f0" class="cell" data-execution_count="66">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb98"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb98-1"><a href="#cb98-1" aria-hidden="true" tabindex="-1"></a>sns.displot(co2[<span class="st">'Days'</span>])<span class="op">;</span></span>
@@ -4128,7 +4134,7 @@ <h3 data-number="5.5.5" class="anchored" data-anchor-id="understanding-missing-v
 <p><br></p>
 <p>Finally, let’s check the last time feature, <strong>year</strong> <code>Yr</code>.</p>
 <p>Let’s check to see if there is any connection between missingness and the year of the recording.</p>
-<div id="6cb28eb4" class="cell" data-execution_count="67">
+<div id="343b52af" class="cell" data-execution_count="67">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb99"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb99-1"><a href="#cb99-1" aria-hidden="true" tabindex="-1"></a>sns.scatterplot(x<span class="op">=</span><span class="st">"Yr"</span>, y<span class="op">=</span><span class="st">"Days"</span>, data<span class="op">=</span>co2)<span class="op">;</span></span>
@@ -4157,7 +4163,7 @@ <h3 data-number="5.5.5" class="anchored" data-anchor-id="understanding-missing-v
 <section id="understanding-missing-value-2-avg" class="level3" data-number="5.5.6">
 <h3 data-number="5.5.6" class="anchored" data-anchor-id="understanding-missing-value-2-avg"><span class="header-section-number">5.5.6</span> Understanding Missing Value 2: <code>Avg</code></h3>
 <p>Next, let’s return to the -99.99 values in <code>Avg</code> to analyze the overall quality of the CO<sub>2</sub> measurements. We’ll plot a histogram of the average CO<sub>2</sub> measurements.</p>
-<div id="4af48929" class="cell" data-execution_count="68">
+<div id="5d49cb8d" class="cell" data-execution_count="68">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb100"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb100-1"><a href="#cb100-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Histograms of average CO2 measurements</span></span>
@@ -4173,7 +4179,7 @@ <h3 data-number="5.5.6" class="anchored" data-anchor-id="understanding-missing-v
 </div>
 <p>The non-missing values are in the 300-400 range (a regular range of CO<sub>2</sub> levels).</p>
 <p>We also see that there are only a few missing <code>Avg</code> values (<strong>&lt;1% of values</strong>). Let’s examine all of them:</p>
-<div id="b3018378" class="cell" data-execution_count="69">
+<div id="d95582f8" class="cell" data-execution_count="69">
 <div class="sourceCode cell-code" id="cb101"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb101-1"><a href="#cb101-1" aria-hidden="true" tabindex="-1"></a>co2[co2[<span class="st">"Avg"</span>] <span class="op">&lt;</span> <span class="dv">0</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="69">
 <div>
@@ -4280,7 +4286,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 <li>Impute using some strategy</li>
 </ol>
 <p>Remember we want to fix the following plot:</p>
-<div id="4b528779" class="cell" data-execution_count="70">
+<div id="56a1877d" class="cell" data-execution_count="70">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb102"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb102-1"><a href="#cb102-1" aria-hidden="true" tabindex="-1"></a>sns.lineplot(x<span class="op">=</span><span class="st">'DecDate'</span>, y<span class="op">=</span><span class="st">'Avg'</span>, data<span class="op">=</span>co2)</span>
@@ -4298,7 +4304,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 <p>Let’s consider a few options: (1) drop those records, (2) replace -99.99 with NaN, or (3) substitute a likely value for the average CO<sub>2</sub>.</p>
 <p>What do you think are the pros and cons of each possible action?</p>
 <p>Let’s examine each of these three options.</p>
-<div id="06fb5172" class="cell" data-execution_count="71">
+<div id="ae1da78d" class="cell" data-execution_count="71">
 <div class="sourceCode cell-code" id="cb103"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb103-1"><a href="#cb103-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Drop missing values</span></span>
 <span id="cb103-2"><a href="#cb103-2" aria-hidden="true" tabindex="-1"></a>co2_drop <span class="op">=</span> co2[co2[<span class="st">'Avg'</span>] <span class="op">&gt;</span> <span class="dv">0</span>]</span>
 <span id="cb103-3"><a href="#cb103-3" aria-hidden="true" tabindex="-1"></a>co2_drop.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -4376,7 +4382,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 </div>
 </div>
 </div>
-<div id="0ad70858" class="cell" data-execution_count="72">
+<div id="b05178e0" class="cell" data-execution_count="72">
 <div class="sourceCode cell-code" id="cb104"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb104-1"><a href="#cb104-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 2. Replace -99.99 with NaN</span></span>
 <span id="cb104-2"><a href="#cb104-2" aria-hidden="true" tabindex="-1"></a>co2_NA <span class="op">=</span> co2.replace(<span class="op">-</span><span class="fl">99.99</span>, np.nan)</span>
 <span id="cb104-3"><a href="#cb104-3" aria-hidden="true" tabindex="-1"></a>co2_NA.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -4462,7 +4468,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 </blockquote>
 <p>The <code>Int</code> feature has values that exactly match those in <code>Avg</code>, except when <code>Avg</code> is -99.99, and then a <strong>reasonable</strong> estimate is used instead.</p>
 <p>So, the third version of our data will use the <code>Int</code> feature instead of <code>Avg</code>.</p>
-<div id="f7e02ab2" class="cell" data-execution_count="73">
+<div id="c6b2d3b4" class="cell" data-execution_count="73">
 <div class="sourceCode cell-code" id="cb105"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb105-1"><a href="#cb105-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 3. Use interpolated column which estimates missing Avg values</span></span>
 <span id="cb105-2"><a href="#cb105-2" aria-hidden="true" tabindex="-1"></a>co2_impute <span class="op">=</span> co2.copy()</span>
 <span id="cb105-3"><a href="#cb105-3" aria-hidden="true" tabindex="-1"></a>co2_impute[<span class="st">'Avg'</span>] <span class="op">=</span> co2[<span class="st">'Int'</span>]</span>
@@ -4543,7 +4549,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 </div>
 <p>What’s a <strong>reasonable</strong> estimate?</p>
 <p>To answer this question, let’s zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).</p>
-<div id="57170063" class="cell" data-execution_count="74">
+<div id="20c1b687" class="cell" data-execution_count="74">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb106"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb106-1"><a href="#cb106-1" aria-hidden="true" tabindex="-1"></a><span class="co"># results of plotting data in 1958</span></span>
@@ -4586,7 +4592,7 @@ <h3 data-number="5.5.7" class="anchored" data-anchor-id="drop-nan-or-impute-miss
 <li>We are plotting all months in our data as a line plot</li>
 </ul>
 <p>Let’s replot our original figure with option 3:</p>
-<div id="9595210a" class="cell" data-execution_count="75">
+<div id="579380dd" class="cell" data-execution_count="75">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb107"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb107-1"><a href="#cb107-1" aria-hidden="true" tabindex="-1"></a>sns.lineplot(x<span class="op">=</span><span class="st">'DecDate'</span>, y<span class="op">=</span><span class="st">'Avg'</span>, data<span class="op">=</span>co2_impute)</span>
@@ -4618,7 +4624,7 @@ <h3 data-number="5.5.8" class="anchored" data-anchor-id="presenting-the-data-a-d
 <ul>
 <li>You might be happier with the <strong>coarser granularity</strong> of yearly averages!</li>
 </ul>
-<div id="92adaeda" class="cell" data-execution_count="76">
+<div id="bea4d773" class="cell" data-execution_count="76">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb108"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb108-1"><a href="#cb108-1" aria-hidden="true" tabindex="-1"></a>co2_year <span class="op">=</span> co2_impute.groupby(<span class="st">'Yr'</span>).mean()</span>
diff --git a/docs/eda/eda_files/figure-pdf/cell-62-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-62-output-1.pdf
index e0ce43d0..6a675d93 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-62-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-62-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-67-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-67-output-1.pdf
index b5515027..9f96ab5c 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-67-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-67-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-68-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-68-output-1.pdf
index 8afa9f87..85966af7 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-68-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-68-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-69-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-69-output-1.pdf
index 6d5baddc..5ee25009 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-69-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-69-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-71-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-71-output-1.pdf
index 1e6c365c..d3df3aad 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-71-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-71-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-75-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-75-output-1.pdf
index 4ea59d3c..be04c6b6 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-75-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-75-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-76-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-76-output-1.pdf
index c402a7a9..8e759ef2 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-76-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-76-output-1.pdf differ
diff --git a/docs/eda/eda_files/figure-pdf/cell-77-output-1.pdf b/docs/eda/eda_files/figure-pdf/cell-77-output-1.pdf
index ec1f63e9..cf7defed 100644
Binary files a/docs/eda/eda_files/figure-pdf/cell-77-output-1.pdf and b/docs/eda/eda_files/figure-pdf/cell-77-output-1.pdf differ
diff --git a/docs/index.html b/docs/index.html
index 9affd6b2..779f0448 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -182,6 +182,12 @@
   <a href="./sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="./intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
diff --git a/docs/intro_lec/introduction.html b/docs/intro_lec/introduction.html
index a2a7843d..52073a5d 100644
--- a/docs/intro_lec/introduction.html
+++ b/docs/intro_lec/introduction.html
@@ -171,6 +171,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
diff --git a/docs/intro_to_modeling/images/reg_line_1.png b/docs/intro_to_modeling/images/reg_line_1.png
new file mode 100644
index 00000000..f85fd063
Binary files /dev/null and b/docs/intro_to_modeling/images/reg_line_1.png differ
diff --git a/docs/intro_to_modeling/images/reg_line_2.png b/docs/intro_to_modeling/images/reg_line_2.png
new file mode 100644
index 00000000..10f5246c
Binary files /dev/null and b/docs/intro_to_modeling/images/reg_line_2.png differ
diff --git a/docs/intro_to_modeling/intro_to_modeling.html b/docs/intro_to_modeling/intro_to_modeling.html
new file mode 100644
index 00000000..2d20da31
--- /dev/null
+++ b/docs/intro_to_modeling/intro_to_modeling.html
@@ -0,0 +1,1942 @@
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>
+
+<meta charset="utf-8">
+<meta name="generator" content="quarto-1.4.549">
+
+<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
+
+
+<title>Principles and Techniques of Data Science - 10&nbsp; Introduction to Modeling</title>
+<style>
+code{white-space: pre-wrap;}
+span.smallcaps{font-variant: small-caps;}
+div.columns{display: flex; gap: min(4vw, 1.5em);}
+div.column{flex: auto; overflow-x: auto;}
+div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
+ul.task-list{list-style: none;}
+ul.task-list li input[type="checkbox"] {
+  width: 0.8em;
+  margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ 
+  vertical-align: middle;
+}
+/* CSS for syntax highlighting */
+pre > code.sourceCode { white-space: pre; position: relative; }
+pre > code.sourceCode > span { line-height: 1.25; }
+pre > code.sourceCode > span:empty { height: 1.2em; }
+.sourceCode { overflow: visible; }
+code.sourceCode > span { color: inherit; text-decoration: inherit; }
+div.sourceCode { margin: 1em 0; }
+pre.sourceCode { margin: 0; }
+@media screen {
+div.sourceCode { overflow: auto; }
+}
+@media print {
+pre > code.sourceCode { white-space: pre-wrap; }
+pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
+}
+pre.numberSource code
+  { counter-reset: source-line 0; }
+pre.numberSource code > span
+  { position: relative; left: -4em; counter-increment: source-line; }
+pre.numberSource code > span > a:first-child::before
+  { content: counter(source-line);
+    position: relative; left: -1em; text-align: right; vertical-align: baseline;
+    border: none; display: inline-block;
+    -webkit-touch-callout: none; -webkit-user-select: none;
+    -khtml-user-select: none; -moz-user-select: none;
+    -ms-user-select: none; user-select: none;
+    padding: 0 4px; width: 4em;
+  }
+pre.numberSource { margin-left: 3em;  padding-left: 4px; }
+div.sourceCode
+  {   }
+@media screen {
+pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
+}
+</style>
+
+
+<script src="../site_libs/quarto-nav/quarto-nav.js"></script>
+<script src="../site_libs/quarto-nav/headroom.min.js"></script>
+<script src="../site_libs/clipboard/clipboard.min.js"></script>
+<script src="../site_libs/quarto-search/autocomplete.umd.js"></script>
+<script src="../site_libs/quarto-search/fuse.min.js"></script>
+<script src="../site_libs/quarto-search/quarto-search.js"></script>
+<meta name="quarto:offset" content="../">
+<link href="../sampling/sampling.html" rel="prev">
+<link href="../data100_logo.png" rel="icon" type="image/png">
+<script src="../site_libs/quarto-html/quarto.js"></script>
+<script src="../site_libs/quarto-html/popper.min.js"></script>
+<script src="../site_libs/quarto-html/tippy.umd.min.js"></script>
+<script src="../site_libs/quarto-html/anchor.min.js"></script>
+<link href="../site_libs/quarto-html/tippy.css" rel="stylesheet">
+<link href="../site_libs/quarto-html/quarto-syntax-highlighting.css" rel="stylesheet" id="quarto-text-highlighting-styles">
+<script src="../site_libs/bootstrap/bootstrap.min.js"></script>
+<link href="../site_libs/bootstrap/bootstrap-icons.css" rel="stylesheet">
+<link href="../site_libs/bootstrap/bootstrap.min.css" rel="stylesheet" id="quarto-bootstrap" data-mode="light">
+<script id="quarto-search-options" type="application/json">{
+  "location": "sidebar",
+  "copy-button": false,
+  "collapse-after": 3,
+  "panel-placement": "start",
+  "type": "textbox",
+  "limit": 50,
+  "keyboard-shortcut": [
+    "f",
+    "/",
+    "s"
+  ],
+  "language": {
+    "search-no-results-text": "No results",
+    "search-matching-documents-text": "matching documents",
+    "search-copy-link-title": "Copy link to search",
+    "search-hide-matches-text": "Hide additional matches",
+    "search-more-match-text": "more match in this document",
+    "search-more-matches-text": "more matches in this document",
+    "search-clear-button-title": "Clear",
+    "search-text-placeholder": "",
+    "search-detached-cancel-button-title": "Cancel",
+    "search-submit-button-title": "Submit",
+    "search-label": "Search"
+  }
+}</script>
+
+  <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+  <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script>
+
+<script type="text/javascript">
+const typesetMath = (el) => {
+  if (window.MathJax) {
+    // MathJax Typeset
+    window.MathJax.typeset([el]);
+  } else if (window.katex) {
+    // KaTeX Render
+    var mathElements = el.getElementsByClassName("math");
+    var macros = [];
+    for (var i = 0; i < mathElements.length; i++) {
+      var texText = mathElements[i].firstChild;
+      if (mathElements[i].tagName == "SPAN") {
+        window.katex.render(texText.data, mathElements[i], {
+          displayMode: mathElements[i].classList.contains('display'),
+          throwOnError: false,
+          macros: macros,
+          fleqn: false
+        });
+      }
+    }
+  }
+}
+window.Quarto = {
+  typesetMath
+};
+</script>
+
+</head>
+
+<body class="nav-sidebar floating">
+
+<div id="quarto-search-results"></div>
+  <header id="quarto-header" class="headroom fixed-top">
+  <nav class="quarto-secondary-nav">
+    <div class="container-fluid d-flex">
+      <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">
+        <i class="bi bi-layout-text-sidebar-reverse"></i>
+      </button>
+        <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../intro_to_modeling/intro_to_modeling.html"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></a></li></ol></nav>
+        <a class="flex-grow-1" role="button" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">      
+        </a>
+      <button type="button" class="btn quarto-search-button" aria-label="" onclick="window.quartoOpenSearch();">
+        <i class="bi bi-search"></i>
+      </button>
+    </div>
+  </nav>
+</header>
+<!-- content -->
+<div id="quarto-content" class="quarto-container page-columns page-rows-contents page-layout-full">
+<!-- sidebar -->
+  <nav id="quarto-sidebar" class="sidebar collapse collapse-horizontal quarto-sidebar-collapse-item sidebar-navigation floating overflow-auto">
+    <div class="pt-lg-2 mt-2 text-left sidebar-header sidebar-header-stacked">
+      <a href="../index.html" class="sidebar-logo-link">
+      <img src="../data100_logo.png" alt="" class="sidebar-logo py-0 d-lg-inline d-none">
+      </a>
+    <div class="sidebar-title mb-0 py-0">
+      <a href="../">Principles and Techniques of Data Science</a> 
+        <div class="sidebar-tools-main">
+    <a href="https://github.com/DS-100/course-notes" title="Source Code" class="quarto-navigation-tool px-1" aria-label="Source Code"><i class="bi bi-github"></i></a>
+</div>
+    </div>
+      </div>
+        <div class="mt-2 flex-shrink-0 align-items-center">
+        <div class="sidebar-search">
+        <div id="quarto-search" class="" title="Search"></div>
+        </div>
+        </div>
+    <div class="sidebar-menu-container"> 
+    <ul class="list-unstyled mt-1">
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../index.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text">Welcome</span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_lec/introduction.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">1</span>&nbsp; <span class="chapter-title">Introduction</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../pandas_1/pandas_1.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">Pandas I</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../pandas_2/pandas_2.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">3</span>&nbsp; <span class="chapter-title">Pandas II</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../pandas_3/pandas_3.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">4</span>&nbsp; <span class="chapter-title">Pandas III</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../eda/eda.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">5</span>&nbsp; <span class="chapter-title">Data Cleaning and EDA</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../regex/regex.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">6</span>&nbsp; <span class="chapter-title">Regular Expressions</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../visualization_1/visualization_1.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">7</span>&nbsp; <span class="chapter-title">Visualization I</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../visualization_2/visualization_2.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">8</span>&nbsp; <span class="chapter-title">Visualization II</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
+  </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link active">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
+</li>
+    </ul>
+    </div>
+</nav>
+<div id="quarto-sidebar-glass" class="quarto-sidebar-collapse-item" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item"></div>
+<!-- margin-sidebar -->
+    <div id="quarto-margin-sidebar" class="sidebar margin-sidebar">
+        <nav id="TOC" role="doc-toc" class="toc-active">
+    <h2 id="toc-title">Introduction to Modeling</h2>
+   
+  <ul>
+  <li><a href="#what-is-a-model" id="toc-what-is-a-model" class="nav-link active" data-scroll-target="#what-is-a-model"><span class="header-section-number">10.1</span> What is a Model?</a>
+  <ul>
+  <li><a href="#reasons-for-building-models" id="toc-reasons-for-building-models" class="nav-link" data-scroll-target="#reasons-for-building-models"><span class="header-section-number">10.1.1</span> Reasons for Building Models</a></li>
+  <li><a href="#common-types-of-models" id="toc-common-types-of-models" class="nav-link" data-scroll-target="#common-types-of-models"><span class="header-section-number">10.1.2</span> Common Types of Models</a></li>
+  </ul></li>
+  <li><a href="#simple-linear-regression" id="toc-simple-linear-regression" class="nav-link" data-scroll-target="#simple-linear-regression"><span class="header-section-number">10.2</span> Simple Linear Regression</a>
+  <ul>
+  <li><a href="#notations-and-definitions" id="toc-notations-and-definitions" class="nav-link" data-scroll-target="#notations-and-definitions"><span class="header-section-number">10.2.1</span> Notations and Definitions</a>
+  <ul>
+  <li><a href="#standard-units" id="toc-standard-units" class="nav-link" data-scroll-target="#standard-units"><span class="header-section-number">10.2.1.1</span> Standard Units</a></li>
+  <li><a href="#correlation" id="toc-correlation" class="nav-link" data-scroll-target="#correlation"><span class="header-section-number">10.2.1.2</span> Correlation</a></li>
+  </ul></li>
+  <li><a href="#alternate-form" id="toc-alternate-form" class="nav-link" data-scroll-target="#alternate-form"><span class="header-section-number">10.2.2</span> Alternate Form</a></li>
+  <li><a href="#derivation" id="toc-derivation" class="nav-link" data-scroll-target="#derivation"><span class="header-section-number">10.2.3</span> Derivation</a></li>
+  </ul></li>
+  <li><a href="#the-modeling-process" id="toc-the-modeling-process" class="nav-link" data-scroll-target="#the-modeling-process"><span class="header-section-number">10.3</span> The Modeling Process</a></li>
+  <li><a href="#choosing-a-model" id="toc-choosing-a-model" class="nav-link" data-scroll-target="#choosing-a-model"><span class="header-section-number">10.4</span> Choosing a Model</a></li>
+  <li><a href="#choosing-a-loss-function" id="toc-choosing-a-loss-function" class="nav-link" data-scroll-target="#choosing-a-loss-function"><span class="header-section-number">10.5</span> Choosing a Loss Function</a></li>
+  <li><a href="#fitting-the-model" id="toc-fitting-the-model" class="nav-link" data-scroll-target="#fitting-the-model"><span class="header-section-number">10.6</span> Fitting the Model</a></li>
+  <li><a href="#evaluating-the-slr-model" id="toc-evaluating-the-slr-model" class="nav-link" data-scroll-target="#evaluating-the-slr-model"><span class="header-section-number">10.7</span> Evaluating the SLR Model</a>
+  <ul>
+  <li><a href="#four-mysterious-datasets-anscombes-quartet" id="toc-four-mysterious-datasets-anscombes-quartet" class="nav-link" data-scroll-target="#four-mysterious-datasets-anscombes-quartet"><span class="header-section-number">10.7.1</span> Four Mysterious Datasets (Anscombe’s quartet)</a></li>
+  </ul></li>
+  </ul>
+</nav>
+    </div>
+<!-- main -->
+<main class="content column-body" id="quarto-document-content">
+
+<header id="title-block-header" class="quarto-title-block default">
+<div class="quarto-title">
+<div class="quarto-title-block"><div><h1 class="title"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></h1><button type="button" class="btn code-tools-button dropdown-toggle" id="quarto-code-tools-menu" data-bs-toggle="dropdown" aria-expanded="false"><i class="bi"></i> Code</button><ul class="dropdown-menu dropdown-menu-end" aria-labelelledby="quarto-code-tools-menu"><li><a id="quarto-show-all-code" class="dropdown-item" href="javascript:void(0)" role="button">Show All Code</a></li><li><a id="quarto-hide-all-code" class="dropdown-item" href="javascript:void(0)" role="button">Hide All Code</a></li><li><hr class="dropdown-divider"></li><li><a id="quarto-view-source" class="dropdown-item" href="javascript:void(0)" role="button">View Source</a></li></ul></div></div>
+</div>
+
+
+
+<div class="quarto-title-meta column-body">
+
+    
+  
+    
+  </div>
+  
+
+
+</header>
+
+
+<div class="callout callout-style-default callout-note no-icon callout-titled">
+<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="true" aria-label="Toggle callout">
+<div class="callout-icon-container">
+<i class="callout-icon no-icon"></i>
+</div>
+<div class="callout-title-container flex-fill">
+Learning Outcomes
+</div>
+<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
+</div>
+<div id="callout-1" class="callout-1-contents callout-collapse collapse show">
+<div class="callout-body-container callout-body">
+<ul>
+<li>Understand what models are and how to carry out the four-step modeling process.</li>
+<li>Define the concept of loss and gain familiarity with <span class="math inline">\(L_1\)</span> and <span class="math inline">\(L_2\)</span> loss.</li>
+<li>Fit the Simple Linear Regression model using minimization techniques.</li>
+</ul>
+</div>
+</div>
+</div>
+<p>Up until this point in the semester, we’ve focused on analyzing datasets. We’ve looked into the early stages of the data science lifecycle, focusing on the programming tools, visualization techniques, and data cleaning methods needed for data analysis.</p>
+<p>This lecture marks a shift in focus. We will move away from examining datasets to actually <em>using</em> our data to better understand the world. Specifically, the next sequence of lectures will explore predictive modeling: generating models to make some predictions about the world around us. In this lecture, we’ll introduce the conceptual framework for setting up a modeling task. In the next few lectures, we’ll put this framework into practice by implementing various kinds of models.</p>
+<section id="what-is-a-model" class="level2" data-number="10.1">
+<h2 data-number="10.1" class="anchored" data-anchor-id="what-is-a-model"><span class="header-section-number">10.1</span> What is a Model?</h2>
+<p>A model is an <strong>idealized representation</strong> of a system. A system is a set of principles or procedures according to which something functions. We live in a world full of systems: the procedure of turning on a light happens according to a specific set of rules dictating the flow of electricity. The truth behind how any event occurs is usually complex, and many times the specifics are unknown. The workings of the world can be viewed as its own giant procedure. Models seek to simplify the world and distill it into workable pieces.</p>
+<p>Example: We model the fall of an object on Earth as subject to a constant acceleration of <span class="math inline">\(9.81 \text{ m/s}^2\)</span> due to gravity.</p>
+<ul>
+<li>While this describes the behavior of our system, it is merely an approximation.</li>
+<li>It doesn’t account for the effects of air resistance, local variations in gravity, etc.</li>
+<li>In practice, it’s accurate enough to be useful!</li>
+</ul>
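+<p>As a minimal sketch of this idealized model, the constant-acceleration assumption can be written as a one-line function. The name <code>fall_distance</code> and the choice to start the object from rest are assumptions made for illustration only; air resistance and local variations in gravity are deliberately ignored, exactly as in the model above.</p>

```python
G = 9.81  # m/s^2: the constant acceleration assumed by the model

def fall_distance(t, g=G):
    """Distance in meters fallen after t seconds, starting from rest.

    Hypothetical helper: ignores air resistance and local gravity variation.
    """
    return 0.5 * g * t ** 2

# Model prediction for a 2-second fall
print(fall_distance(2.0))
```

+<p>The prediction will differ from any real measurement, but for many everyday purposes the simplification is close enough to be useful.</p>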
+<section id="reasons-for-building-models" class="level3" data-number="10.1.1">
+<h3 data-number="10.1.1" class="anchored" data-anchor-id="reasons-for-building-models"><span class="header-section-number">10.1.1</span> Reasons for Building Models</h3>
+<p>Why do we want to build models? As far as data scientists and statisticians are concerned, there are three reasons, and each implies a different focus on modeling.</p>
+<ol type="1">
+<li><p>To explain complex phenomena occurring in the world we live in. Examples of this might be:</p>
+<ul>
+<li>How is the parents’ average height related to their children’s average height?</li>
+<li>How do an object’s velocity and acceleration affect how far it travels? (Physics: <span class="math inline">\(d = d_0 + vt + \frac{1}{2}at^2\)</span>)</li>
+</ul>
+<p>In these cases, we care about creating models that are <em>simple and interpretable</em>, allowing us to understand what the relationships between our variables are.</p></li>
+<li><p>To make accurate predictions about unseen data. Some examples include:</p>
+<ul>
+<li>Can we predict if an email is spam or not?</li>
+<li>Can we generate a one-sentence summary of this 10-page long article?</li>
+</ul>
+<p>When making predictions, we care more about accuracy, even at the cost of having an uninterpretable model. Such models are sometimes called black-box models and are common in fields like deep learning.</p></li>
+<li><p>To measure the causal effects of one event on some other event. For example,</p>
+<ul>
+<li>Does smoking <em>cause</em> lung cancer?</li>
+<li>Does a job training program <em>cause</em> increases in employment and wages?</li>
+</ul>
+<p>This is a much harder question because most statistical tools are designed to infer association, not causation. We will not focus on this task in Data 100, but you can take other advanced classes on causal inference (e.g., Stat 156, Data 102) if you are intrigued!</p></li>
+</ol>
+<p>Most of the time, we aim to strike a balance between building <strong>interpretable</strong> models and building <strong>accurate</strong> models.</p>
+</section>
+<section id="common-types-of-models" class="level3" data-number="10.1.2">
+<h3 data-number="10.1.2" class="anchored" data-anchor-id="common-types-of-models"><span class="header-section-number">10.1.2</span> Common Types of Models</h3>
+<p>In general, models can be split into two categories:</p>
+<ol type="1">
+<li><p>Deterministic physical (mechanistic) models: Laws that govern how the world works.</p>
+<ul>
+<li><a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion#Third_law">Kepler’s Third Law of Planetary Motion (1619)</a>: The ratio of the square of an object’s orbital period to the cube of the semi-major axis of its orbit is the same for all objects orbiting the same primary.
+<ul>
+<li><span class="math inline">\(T^2 \propto R^3\)</span></li>
+</ul></li>
+<li><a href="https://en.wikipedia.org/wiki/Newton%27s_laws_of_motion">Newton’s Laws: motion and gravitation (1687)</a>: Newton’s second law of motion models the relationship between the mass of an object and the force required to accelerate it.
+<ul>
+<li><span class="math inline">\(F = ma\)</span></li>
+<li><span class="math inline">\(F_g = G \frac{m_1 m_2}{r^2}\)</span> <br></li>
+</ul></li>
+</ul></li>
+<li><p>Probabilistic models: Models that attempt to understand how random processes evolve. These are more general and can be used to describe many phenomena in the real world. These models commonly make simplifying assumptions about the nature of the world.</p>
+<ul>
+<li><a href="https://en.wikipedia.org/wiki/Poisson_point_process">Poisson Process models</a>: Used to model random events that can happen at any point in time, with a running count that never decreases, such as the arrival of customers at a store.</li>
+</ul></li>
+</ol>
+<p>Note: These specific models are not in the scope of Data 100 and exist to serve as motivation.</p>
+</section>
+</section>
+<section id="simple-linear-regression" class="level2" data-number="10.2">
+<h2 data-number="10.2" class="anchored" data-anchor-id="simple-linear-regression"><span class="header-section-number">10.2</span> Simple Linear Regression</h2>
+<p>The <strong>regression line</strong> is the unique straight line that minimizes the <strong>mean squared error</strong> of estimation among all straight lines. As with any straight line, it can be defined by a slope and a y-intercept:</p>
+<ul>
+<li><span class="math inline">\(\text{slope} = r \cdot \frac{\text{Standard Deviation of } y}{\text{Standard Deviation of }x}\)</span></li>
+<li><span class="math inline">\(y\text{-intercept} = \text{average of }y - \text{slope}\cdot\text{average of }x\)</span></li>
+<li><span class="math inline">\(\text{regression estimate} = y\text{-intercept} + \text{slope}\cdot x\)</span></li>
+<li><span class="math inline">\(\text{residual} =\text{observed }y - \text{regression estimate}\)</span></li>
+</ul>
+<div id="2342e5b1" class="cell" data-execution_count="1">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
+<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
+<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
+<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
+<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="co"># Set random seed for consistency </span></span>
+<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>np.random.seed(<span class="dv">43</span>)</span>
+<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">'default'</span>) </span>
+<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="co"># Generate noisy linear data for plotting</span></span>
+<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a>x <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a>y <span class="op">=</span> x <span class="op">*</span> <span class="fl">0.5</span> <span class="op">-</span> <span class="dv">1</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot regression line</span></span>
+<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>sns.regplot(x<span class="op">=</span>x,y<span class="op">=</span>y)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+<div class="cell-output cell-output-display">
+<div>
+<figure class="figure">
+<p><img src="intro_to_modeling_files/figure-html/cell-2-output-1.png" class="img-fluid figure-img"></p>
+</figure>
+</div>
+</div>
+</div>
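+<p>The four formulas above can be computed directly with NumPy. The following is a sketch under stated assumptions: the synthetic data mirrors the plotting cell (a line with slope 0.5 and intercept -1 plus noise), and the seed and variable names are arbitrary choices made for illustration, not part of the course code.</p>

```python
import numpy as np

# Synthetic data (assumption): a noisy line with slope 0.5 and intercept -1
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = np.linspace(-3, 3, 100)
y = 0.5 * x - 1 + rng.normal(scale=0.3, size=100)

# Correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# slope = r * (SD of y) / (SD of x)
slope = r * np.std(y) / np.std(x)

# y-intercept = average of y - slope * average of x
intercept = np.mean(y) - slope * np.mean(x)

# regression estimate and residual for every observation
y_hat = intercept + slope * x
residuals = y - y_hat

print(slope, intercept)
```

+<p>With this much data and little noise, the computed slope and intercept should land close to the values used to generate the data.</p>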
+<section id="notations-and-definitions" class="level3" data-number="10.2.1">
+<h3 data-number="10.2.1" class="anchored" data-anchor-id="notations-and-definitions"><span class="header-section-number">10.2.1</span> Notations and Definitions</h3>
+<p>For a pair of variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> representing our data <span class="math inline">\(\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\)</span>, we denote their means/averages as <span class="math inline">\(\bar x\)</span> and <span class="math inline">\(\bar y\)</span> and standard deviations as <span class="math inline">\(\sigma_x\)</span> and <span class="math inline">\(\sigma_y\)</span>.</p>
+<section id="standard-units" class="level4" data-number="10.2.1.1">
+<h4 data-number="10.2.1.1" class="anchored" data-anchor-id="standard-units"><span class="header-section-number">10.2.1.1</span> Standard Units</h4>
+<p>A variable is represented in standard units if the following are true:</p>
+<ol type="1">
+<li>0 in standard units is equal to the mean (<span class="math inline">\(\bar{x}\)</span>) in the original variable’s units.</li>
+<li>An increase of 1 standard unit is an increase of 1 standard deviation (<span class="math inline">\(\sigma_x\)</span>) in the original variable’s units.</li>
+</ol>
+<p>To convert a variable <span class="math inline">\(x_i\)</span> into standard units, we subtract its mean from it and divide it by its standard deviation. For example, <span class="math inline">\(x_i\)</span> in standard units is <span class="math inline">\(\frac{x_i - \bar x}{\sigma_x}\)</span>.</p>
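+<p>A short sketch of this conversion follows. The helper name <code>standard_units</code> and the height values are made up for illustration; <code>np.std</code> is used with its default population standard deviation, which is an assumption about which definition of <span class="math inline">\(\sigma_x\)</span> is intended.</p>

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: subtract the mean, divide by the SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up example data: heights in centimeters
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
heights_su = standard_units(heights)

print(heights_su)
print(heights_su.mean(), heights_su.std())
```

+<p>After the conversion, the mean height maps to 0 and the converted values have a standard deviation of 1, matching the two properties listed above.</p>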
+</section>
+<section id="correlation" class="level4" data-number="10.2.1.2">
+<h4 data-number="10.2.1.2" class="anchored" data-anchor-id="correlation"><span class="header-section-number">10.2.1.2</span> Correlation</h4>
+<p>The correlation (<span class="math inline">\(r\)</span>) is the average of the product of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, both measured in <em>standard units</em>.</p>
+<p><span class="math display">\[r = \frac{1}{n} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)\]</span></p>
+<ol type="1">
+<li>Correlation measures the strength of a <strong>linear association</strong> between two variables.</li>
+<li>Correlations range between -1 and 1: <span class="math inline">\(|r| \leq 1\)</span>, with <span class="math inline">\(r=1\)</span> indicating perfect positive linear association, and <span class="math inline">\(r=-1\)</span> indicating perfect negative linear association. The closer <span class="math inline">\(r\)</span> is to <span class="math inline">\(0\)</span>, the weaker the linear association is.</li>
+<li>Correlation says nothing about causation or non-linear association. Correlation does <strong>not</strong> imply causation. When <span class="math inline">\(r = 0\)</span>, the two variables are uncorrelated. However, they could still be related through some non-linear relationship.</li>
+</ol>
+<div id="f07efd37" class="cell" data-execution_count="2">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> plot_and_get_corr(ax, x, y, title):</span>
+<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>    ax.set_xlim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>)</span>
+<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>    ax.set_ylim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>)</span>
+<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>    ax.set_xticks([])</span>
+<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>    ax.set_yticks([])</span>
+<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>    ax.scatter(x, y, alpha <span class="op">=</span> <span class="fl">0.73</span>)</span>
+<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>    r <span class="op">=</span> np.corrcoef(x, y)[<span class="dv">0</span>, <span class="dv">1</span>]</span>
+<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>    ax.set_title(title <span class="op">+</span> <span class="st">" (corr: </span><span class="sc">{}</span><span class="st">)"</span>.<span class="bu">format</span>(r.<span class="bu">round</span>(<span class="dv">2</span>)))</span>
+<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> r</span>
+<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize <span class="op">=</span> (<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Just noise</span></span>
+<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>x1, y1 <span class="op">=</span> np.random.randn(<span class="dv">2</span>, <span class="dv">100</span>)</span>
+<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a>corr1 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">0</span>, <span class="dv">0</span>], x1, y1, title <span class="op">=</span> <span class="st">"noise"</span>)</span>
+<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a><span class="co"># Strong linear</span></span>
+<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a>x2 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a>y2 <span class="op">=</span> x2 <span class="op">*</span> <span class="fl">0.5</span> <span class="op">-</span> <span class="dv">1</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb2-20"><a href="#cb2-20" aria-hidden="true" tabindex="-1"></a>corr2 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">0</span>, <span class="dv">1</span>], x2, y2, title <span class="op">=</span> <span class="st">"strong linear"</span>)</span>
+<span id="cb2-21"><a href="#cb2-21" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-22"><a href="#cb2-22" aria-hidden="true" tabindex="-1"></a><span class="co"># Unequal spread</span></span>
+<span id="cb2-23"><a href="#cb2-23" aria-hidden="true" tabindex="-1"></a>x3 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb2-24"><a href="#cb2-24" aria-hidden="true" tabindex="-1"></a>y3 <span class="op">=</span> <span class="op">-</span> x3<span class="op">/</span><span class="dv">3</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>)<span class="op">*</span>(x3)<span class="op">/</span><span class="fl">2.5</span></span>
+<span id="cb2-25"><a href="#cb2-25" aria-hidden="true" tabindex="-1"></a>corr3 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">1</span>, <span class="dv">0</span>], x3, y3, title <span class="op">=</span> <span class="st">"unequal spread"</span>)</span>
+<span id="cb2-26"><a href="#cb2-26" aria-hidden="true" tabindex="-1"></a>extent <span class="op">=</span> axs[<span class="dv">1</span>, <span class="dv">0</span>].get_window_extent().transformed(fig.dpi_scale_trans.inverted())</span>
+<span id="cb2-27"><a href="#cb2-27" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-28"><a href="#cb2-28" aria-hidden="true" tabindex="-1"></a><span class="co"># Strong non-linear</span></span>
+<span id="cb2-29"><a href="#cb2-29" aria-hidden="true" tabindex="-1"></a>x4 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb2-30"><a href="#cb2-30" aria-hidden="true" tabindex="-1"></a>y4 <span class="op">=</span> <span class="dv">2</span><span class="op">*</span>np.sin(x4 <span class="op">-</span> <span class="fl">1.5</span>) <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb2-31"><a href="#cb2-31" aria-hidden="true" tabindex="-1"></a>corr4 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">1</span>, <span class="dv">1</span>], x4, y4, title <span class="op">=</span> <span class="st">"strong non-linear"</span>)</span>
+<span id="cb2-32"><a href="#cb2-32" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb2-33"><a href="#cb2-33" aria-hidden="true" tabindex="-1"></a>plt.show()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+<div class="cell-output cell-output-display">
+<div>
+<figure class="figure">
+<p><img src="intro_to_modeling_files/figure-html/cell-3-output-1.png" class="img-fluid figure-img"></p>
+</figure>
+</div>
+</div>
+</div>
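+<p>The definition of <span class="math inline">\(r\)</span> as the average product in standard units can be checked numerically. This is a minimal sketch: the function name <code>correlation</code> and the two example relationships (an exact line and a parabola on a symmetric grid) are assumptions chosen to illustrate the first and third properties above.</p>

```python
import numpy as np

def correlation(x, y):
    """Average of the product of x and y, both measured in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

x = np.linspace(-3, 3, 101)

# An exact linear relationship
r_line = correlation(x, 2 * x + 1)

# A deterministic but non-linear relationship, symmetric around x = 0
r_parabola = correlation(x, x ** 2)

print(r_line, r_parabola)
```

+<p>The line should give a correlation of 1 and the parabola a correlation of 0, up to floating-point error, even though <span class="math inline">\(y\)</span> is completely determined by <span class="math inline">\(x\)</span> in both cases.</p>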
+</section>
+</section>
+<section id="alternate-form" class="level3" data-number="10.2.2">
+<h3 data-number="10.2.2" class="anchored" data-anchor-id="alternate-form"><span class="header-section-number">10.2.2</span> Alternate Form</h3>
+<p>When the variables <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span> are measured in <em>standard units</em>, the regression line for predicting <span class="math inline">\(y\)</span> based on <span class="math inline">\(x\)</span> has slope <span class="math inline">\(r\)</span> and passes through the origin.</p>
+<p><span class="math display">\[\hat{y}_{su} = r \cdot x_{su}\]</span></p>
+<p><img src="images/reg_line_1.png" class="img-fluid"></p>
+<ul>
+<li>In the original units, this becomes</li>
+</ul>
+<p><span class="math display">\[\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}\]</span></p>
+<p><img src="images/reg_line_2.png" class="img-fluid"></p>
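+<p>One way to see the standard-units form in action is to fit a least-squares line to standardized data and compare it with <span class="math inline">\(r\)</span>. The sketch below uses <code>np.polyfit</code> as the least-squares fitter; the synthetic data, seed, and variable names are assumptions made for illustration.</p>

```python
import numpy as np

# Synthetic data (assumption): a noisy linear relationship
rng = np.random.default_rng(1)  # arbitrary seed
x = rng.normal(loc=10, scale=2, size=200)
y = 3 * x + rng.normal(scale=4, size=200)

# Convert both variables to standard units
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()

r = np.corrcoef(x, y)[0, 1]

# Least-squares line on the standardized data; polyfit returns [slope, intercept]
slope_su, intercept_su = np.polyfit(x_su, y_su, deg=1)

print(r, slope_su, intercept_su)
```

+<p>The fitted slope should agree with <span class="math inline">\(r\)</span>, and the fitted intercept should be 0 up to floating-point error, so the line passes through the origin.</p>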
+</section>
+<section id="derivation" class="level3" data-number="10.2.3">
+<h3 data-number="10.2.3" class="anchored" data-anchor-id="derivation"><span class="header-section-number">10.2.3</span> Derivation</h3>
+<p>We start from our claimed form of the regression line in standard units and show that it is equivalent to the optimal linear regression line: <span class="math inline">\(\hat{y} = \hat{a} + \hat{b}x\)</span>.</p>
+<p>Recall:</p>
+<ul>
+<li><span class="math inline">\(\hat{b} = r \cdot \frac{\text{Standard Deviation of }y}{\text{Standard Deviation of }x}\)</span></li>
+<li><span class="math inline">\(\hat{a} = \text{average of }y - \hat{b}\cdot\text{average of }x\)</span></li>
+</ul>
+<div class="callout callout-style-simple callout-none no-icon">
+<div class="callout-body d-flex">
+<div class="callout-icon-container">
+<i class="callout-icon no-icon"></i>
+</div>
+<div class="callout-body-container">
+<p>Proof:</p>
+<p><span class="math display">\[\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}\]</span></p>
+<p>Multiply by <span class="math inline">\(\sigma_y\)</span>, and add <span class="math inline">\(\bar{y}\)</span> on both sides.</p>
+<p><span class="math display">\[\hat{y} = \sigma_y \cdot r \cdot \frac{x - \bar{x}}{\sigma_x} + \bar{y}\]</span></p>
+<p>Distribute the coefficient <span class="math inline">\(\sigma_{y}\cdot r\)</span> over the <span class="math inline">\(\frac{x - \bar{x}}{\sigma_x}\)</span> term, then group the terms that do not depend on <span class="math inline">\(x\)</span>.</p>
+<p><span class="math display">\[\hat{y} = \left(\frac{r\sigma_y}{\sigma_x}\right) \cdot x + \left(\bar{y} - \left(\frac{r\sigma_y}{\sigma_x}\right) \bar{x}\right)\]</span></p>
+<p>We now see that we have a line that matches our claim:</p>
+<ul>
+<li>slope: <span class="math inline">\(r\cdot\frac{\text{SD of y}}{\text{SD of x}} = r\cdot\frac{\sigma_y}{\sigma_x}\)</span></li>
+<li>intercept: <span class="math inline">\(\bar{y} - \text{slope}\cdot \bar{x}\)</span></li>
+</ul>
+<p>Note that the error for the i-th datapoint is: <span class="math inline">\(e_i = y_i - \hat{y}_i\)</span>.</p>
+</div>
+</div>
+</div>
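+<p>The algebra in the proof can be sanity-checked numerically: predictions from the standard-units form and from the slope-intercept form should coincide. This is only a sketch; the synthetic data, seed, and variable names (<code>a_hat</code>, <code>b_hat</code>, <code>errors</code>) are assumptions chosen to mirror the notation above.</p>

```python
import numpy as np

# Synthetic data (assumption)
rng = np.random.default_rng(2)  # arbitrary seed
x = rng.uniform(0, 10, size=50)
y = 1.5 * x + 4 + rng.normal(scale=2, size=50)

r = np.corrcoef(x, y)[0, 1]

# Form 1: start in standard units, then multiply by sigma_y and add y-bar
y_hat_su_form = y.mean() + y.std() * r * (x - x.mean()) / x.std()

# Form 2: slope-intercept form with b_hat and a_hat from the derivation
b_hat = r * y.std() / x.std()
a_hat = y.mean() - b_hat * x.mean()
y_hat_line = a_hat + b_hat * x

# Errors e_i = y_i - y_hat_i for each datapoint
errors = y - y_hat_line

print(np.allclose(y_hat_su_form, y_hat_line))
print(errors.mean())
```

+<p>Because the line passes through <span class="math inline">\((\bar{x}, \bar{y})\)</span>, the errors should average to 0 up to floating-point error.</p>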
+</section>
+</section>
+<section id="the-modeling-process" class="level2" data-number="10.3">
+<h2 data-number="10.3" class="anchored" data-anchor-id="the-modeling-process"><span class="header-section-number">10.3</span> The Modeling Process</h2>
+<p>At a high level, a model is a way of representing a system. In Data 100, we’ll treat a model as some mathematical rule we use to describe the relationship between variables.</p>
+<p>What variables are we modeling? Typically, we use a subset of the variables in our sample of collected data to model another variable in this data. To put this more formally, say we have the following dataset <span class="math inline">\(\mathcal{D}\)</span>:</p>
+<p><span class="math display">\[\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\]</span></p>
+<p>Each pair of values <span class="math inline">\((x_i, y_i)\)</span> represents a datapoint. In a modeling setting, we call these <strong>observations</strong>. <span class="math inline">\(y_i\)</span> is the dependent variable we are trying to model, also called an <strong>output</strong> or <strong>response</strong>. <span class="math inline">\(x_i\)</span> is the independent variable inputted into the model to make predictions, also known as a <strong>feature</strong>.</p>
+<p>Our goal in modeling is to use the observed data <span class="math inline">\(\mathcal{D}\)</span> to predict the output variable <span class="math inline">\(y_i\)</span>. We denote each prediction as <span class="math inline">\(\hat{y}_i\)</span> (read: “y hat sub i”).</p>
+<p>How do we generate these predictions? Some examples of models we’ll encounter in the next few lectures are given below:</p>
+<p><span class="math display">\[\hat{y}_i = \theta\]</span> <span class="math display">\[\hat{y}_i = \theta_0 + \theta_1 x_i\]</span></p>
+<p>The examples above are known as <strong>parametric models</strong>. They relate the collected data, <span class="math inline">\(x_i\)</span>, to the prediction we make, <span class="math inline">\(\hat{y}_i\)</span>. A few parameters (<span class="math inline">\(\theta\)</span>, <span class="math inline">\(\theta_0\)</span>, <span class="math inline">\(\theta_1\)</span>) are used to describe the relationship between <span class="math inline">\(x_i\)</span> and <span class="math inline">\(\hat{y}_i\)</span>.</p>
+<p>Notice that we don’t immediately know the values of these parameters. While the features, <span class="math inline">\(x_i\)</span>, are taken from our observed data, we need to decide what values to give <span class="math inline">\(\theta\)</span>, <span class="math inline">\(\theta_0\)</span>, and <span class="math inline">\(\theta_1\)</span> ourselves. This is the heart of parametric modeling: <em>what parameter values should we choose so our model makes the best possible predictions?</em></p>
+<p>To choose our model parameters, we’ll work through the <strong>modeling process</strong>.</p>
+<ol type="1">
+<li>Choose a model: how should we represent the world?</li>
+<li>Choose a loss function: how do we quantify prediction error?</li>
+<li>Fit the model: how do we choose the best parameters of our model given our data?</li>
+<li>Evaluate model performance: how do we evaluate whether this process gave rise to a good model?</li>
+</ol>
+</section>
+<section id="choosing-a-model" class="level2" data-number="10.4">
+<h2 data-number="10.4" class="anchored" data-anchor-id="choosing-a-model"><span class="header-section-number">10.4</span> Choosing a Model</h2>
+<p>Our first step is choosing a model: defining the mathematical rule that describes the relationship between the features, <span class="math inline">\(x_i\)</span>, and predictions <span class="math inline">\(\hat{y}_i\)</span>.</p>
+<p>In <a href="https://inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html">Data 8</a>, you learned about the <strong>Simple Linear Regression (SLR) model</strong>. You learned that the model takes the form: <span class="math display">\[\hat{y}_i = a + bx_i\]</span></p>
+<p>In Data 100, we’ll use slightly different notation: we will replace <span class="math inline">\(a\)</span> with <span class="math inline">\(\theta_0\)</span> and <span class="math inline">\(b\)</span> with <span class="math inline">\(\theta_1\)</span>. This will allow us to use the same notation when we explore more complex models later on in the course.</p>
+<p><span class="math display">\[\hat{y}_i = \theta_0 + \theta_1 x_i\]</span></p>
+<p>The parameters of the SLR model are <span class="math inline">\(\theta_0\)</span>, also called the intercept term, and <span class="math inline">\(\theta_1\)</span>, also called the slope term. To create an effective model, we want to choose values for <span class="math inline">\(\theta_0\)</span> and <span class="math inline">\(\theta_1\)</span> that most accurately predict the output variable. The “best” fitting model parameters are given the special names: <span class="math inline">\(\hat{\theta}_0\)</span> and <span class="math inline">\(\hat{\theta}_1\)</span>; they are the specific parameter values that allow our model to generate the best possible predictions.</p>
+<p>In Data 8, you learned that the best SLR model parameters are: <span class="math display">\[\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x} \qquad \qquad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]</span></p>
+<p>A quick reminder on notation:</p>
+<ul>
+<li><span class="math inline">\(\bar{y}\)</span> and <span class="math inline">\(\bar{x}\)</span> indicate the mean value of <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span>, respectively</li>
+<li><span class="math inline">\(\sigma_y\)</span> and <span class="math inline">\(\sigma_x\)</span> indicate the standard deviations of <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span></li>
+<li><span class="math inline">\(r\)</span> is the <a href="https://inferentialthinking.com/chapters/15/1/Correlation.html#the-correlation-coefficient">correlation coefficient</a>, defined as the average of the product of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> measured in standard units: <span class="math inline">\(\frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y})\)</span></li>
+</ul>
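+<p>As a rough numerical check of these formulas, the sketch below computes <span class="math inline">\(r\)</span>, <span class="math inline">\(\hat{\theta}_1\)</span>, and <span class="math inline">\(\hat{\theta}_0\)</span> on a small made-up dataset. The arrays <code>x</code> and <code>y</code> and the helper <code>standard_units</code> are illustrative assumptions, not course data. Note that <code>np.std</code> divides by <span class="math inline">\(n\)</span> by default, which matches the definition of <span class="math inline">\(\sigma\)</span> used here.</p>

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def standard_units(v):
    # Convert values to standard units using the population SD (divide by n)
    return (v - np.mean(v)) / np.std(v)


# r: average of the product of x and y in standard units
r = np.mean(standard_units(x) * standard_units(y))

# Best SLR parameters from the formulas above
theta_1_hat = r * np.std(y) / np.std(x)              # slope, 0.6 for this data
theta_0_hat = np.mean(y) - theta_1_hat * np.mean(x)  # intercept, 2.2 for this data

print(f"r = {r:.3f}, theta_0_hat = {theta_0_hat:.2f}, theta_1_hat = {theta_1_hat:.2f}")
```

+<p>The value of <code>r</code> computed this way should agree with <code>np.corrcoef(x, y)[0, 1]</code>.</p>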
+<p>In Data 100, we want to understand <em>how</em> to derive these best model coefficients. To do so, we’ll introduce the concept of a loss function.</p>
+</section>
+<section id="choosing-a-loss-function" class="level2" data-number="10.5">
+<h2 data-number="10.5" class="anchored" data-anchor-id="choosing-a-loss-function"><span class="header-section-number">10.5</span> Choosing a Loss Function</h2>
+<p>We’ve talked about the idea of creating the “best” possible predictions. This raises the question: how do we decide how “good” or “bad” our model’s predictions are?</p>
+<p>A <strong>loss function</strong> characterizes the cost, error, or fit resulting from a particular choice of model or model parameters. This function, <span class="math inline">\(L(y, \hat{y})\)</span>, quantifies how “bad” or “far off” a single prediction by our model is from a true, observed value in our collected data.</p>
+<p>The choice of loss function for a particular model will affect the accuracy and computational cost of estimation, and it’ll also depend on the estimation task at hand. For example,</p>
+<ul>
+<li>Are outputs quantitative or qualitative?</li>
+<li>Do outliers matter?</li>
+<li>Are all errors equally costly? (e.g., a false negative on a cancer test is arguably more dangerous than a false positive)</li>
+</ul>
+<p>Regardless of the specific function used, a loss function should follow two basic principles:</p>
+<ul>
+<li>If the prediction <span class="math inline">\(\hat{y}_i\)</span> is <em>close</em> to the actual value <span class="math inline">\(y_i\)</span>, loss should be low.</li>
+<li>If the prediction <span class="math inline">\(\hat{y}_i\)</span> is <em>far</em> from the actual value <span class="math inline">\(y_i\)</span>, loss should be high.</li>
+</ul>
+<p>Two common choices of loss function are squared loss and absolute loss.</p>
+<p><strong>Squared loss</strong>, also known as <strong>L2 loss</strong>, computes loss as the square of the difference between the observed <span class="math inline">\(y_i\)</span> and predicted <span class="math inline">\(\hat{y}_i\)</span>: <span class="math display">\[L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\]</span></p>
+<p><strong>Absolute loss</strong>, also known as <strong>L1 loss</strong>, computes loss as the absolute difference between the observed <span class="math inline">\(y_i\)</span> and predicted <span class="math inline">\(\hat{y}_i\)</span>: <span class="math display">\[L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|\]</span></p>
+<p>L1 and L2 loss give us a tool for quantifying our model’s performance on a single data point. This is a good start, but ideally, we want to understand how our model performs across our <em>entire</em> dataset. A natural way to do this is to compute the average loss across all data points in the dataset. This is known as the <strong>cost function</strong>, <span class="math inline">\(\hat{R}(\theta)\)</span>: <span class="math display">\[\hat{R}(\theta) = \frac{1}{n} \sum^n_{i=1} L(y_i, \hat{y}_i)\]</span></p>
+<p>The cost function has many names in the statistics literature. You may also encounter the terms:</p>
+<ul>
+<li>Empirical risk (this is why we give the cost function the name <span class="math inline">\(R\)</span>)</li>
+<li>Error function</li>
+<li>Average loss</li>
+</ul>
+<p>We can substitute our L1 and L2 loss into the cost function definition. The <strong>Mean Squared Error (MSE)</strong> is the average squared loss across a dataset: <span class="math display">\[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]</span></p>
+<p>The <strong>Mean Absolute Error (MAE)</strong> is the average absolute loss across a dataset: <span class="math display">\[\text{MAE}= \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\]</span></p>
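+<p>To make the difference between the two losses concrete, here is a minimal sketch on made-up numbers (the arrays <code>y</code> and <code>y_hat</code> are assumptions for illustration only). One prediction is deliberately far off, which shows how squaring makes a single large error dominate the MSE much more than it dominates the MAE.</p>

```python
import numpy as np

# Hypothetical observed values and predictions; the last prediction is far off
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 19.0])

# Per-datapoint losses
squared_loss = (y - y_hat) ** 2     # L2 loss: [1, 0, 1, 100]
absolute_loss = np.abs(y - y_hat)   # L1 loss: [1, 0, 1, 10]

# Cost functions: average loss across the dataset
mse = np.mean(squared_loss)    # 25.5
mae = np.mean(absolute_loss)   # 3.0

print(f"MSE: {mse}, MAE: {mae}")
```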
+</section>
+<section id="fitting-the-model" class="level2" data-number="10.6">
+<h2 data-number="10.6" class="anchored" data-anchor-id="fitting-the-model"><span class="header-section-number">10.6</span> Fitting the Model</h2>
+<p>Now that we’ve established the concept of a loss function, we can return to our original goal of choosing model parameters. Specifically, we want to choose the best set of model parameters that will minimize the model’s cost on our dataset. This process is called fitting the model.</p>
+<p>We know from calculus that a differentiable function attains a local minimum at a point where (1) its first derivative is equal to zero and (2) its second derivative is positive. We often call the function being minimized the <strong>objective function</strong> (our objective is to find its minimum).</p>
+<p>To find the optimal model parameter, we:</p>
+<ol type="1">
+<li>Take the derivative of the cost function with respect to that parameter</li>
+<li>Set the derivative equal to 0</li>
+<li>Solve for the parameter</li>
+</ol>
+<p>We repeat this process for each parameter present in the model. For now, we’ll disregard the second derivative condition.</p>
+<p>To help us make sense of this process, let’s put it into action by deriving the optimal model parameters for simple linear regression using the mean squared error as our cost function. Remember: although the notation may look tricky, all we are doing is following the three steps above!</p>
+<p>Step 1: take the derivative of the cost function with respect to each model parameter. We substitute the SLR model, <span class="math inline">\(\hat{y}_i = \theta_0+\theta_1 x_i\)</span>, into the definition of MSE above and differentiate with respect to <span class="math inline">\(\theta_0\)</span> and <span class="math inline">\(\theta_1\)</span>. <span class="math display">\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2\]</span></p>
+<p><span class="math display">\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)\]</span></p>
+<p><span class="math display">\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)x_i\]</span></p>
+<p>Let’s walk through these derivations in more depth, starting with the derivative of MSE with respect to <span class="math inline">\(\theta_0\)</span>.</p>
+<p>Given our MSE above, we know that: <span class="math display">\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]</span></p>
+<p>Noting that the derivative of a sum is the sum of the derivatives, this then becomes: <span class="math display">\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_0} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]</span></p>
+<p>We can then apply the chain rule.</p>
+<p><span class="math display">\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot{(y_i - \theta_0 - \theta_1 x_i)}\cdot(-1)\]</span></p>
+<p>Finally, we can simplify the constants, leaving us with our answer.</p>
+<p><span class="math display">\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n}{(y_i - \theta_0 - \theta_1 x_i)}\]</span></p>
+<p>Following the same procedure, we can take the derivative of MSE with respect to <span class="math inline">\(\theta_1\)</span>.</p>
+<p><span class="math display">\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]</span></p>
+<p><span class="math display">\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_1} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]</span></p>
+<p><span class="math display">\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot{(y_i - \theta_0 - \theta_1 x_i)}\cdot(-x_i)\]</span></p>
+<p><span class="math display">\[= \frac{-2}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}x_i\]</span></p>
+<p>Step 2: set the derivatives equal to 0. After simplifying terms, this produces two <strong>estimating equations</strong>. The best set of model parameters <span class="math inline">\((\hat{\theta}_0, \hat{\theta}_1)\)</span> <em>must</em> satisfy these two optimality conditions. <span class="math display">\[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0\]</span> <span class="math display">\[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)x_i \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)x_i = 0\]</span></p>
+<p>Step 3: solve the estimating equations to compute estimates for <span class="math inline">\(\hat{\theta}_0\)</span> and <span class="math inline">\(\hat{\theta}_1\)</span>.</p>
+<p>Taking the first equation gives the estimate of <span class="math inline">\(\hat{\theta}_0\)</span>: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) = 0 \]</span></p>
+<p><span class="math display">\[\left(\frac{1}{n} \sum_{i=1}^n y_i \right) - \hat{\theta}_0 - \hat{\theta}_1\left(\frac{1}{n} \sum_{i=1}^n x_i \right) = 0\]</span></p>
+<p><span class="math display">\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}\]</span></p>
+<p>With a bit more maneuvering, the second equation gives the estimate of <span class="math inline">\(\hat{\theta}_1\)</span>. Start by multiplying the first estimating equation by <span class="math inline">\(\bar{x}\)</span>, then subtracting the result from the second estimating equation.</p>
+<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)x_i - \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\bar{x} = 0 \]</span></p>
+<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)(x_i - \bar{x}) = 0 \]</span></p>
+<p>Next, plug in <span class="math inline">\(\hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_i = \bar{y} + \hat{\theta}_1(x_i - \bar{x})\)</span>:</p>
+<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y} - \hat{\theta}_1(x_i - \bar{x}))(x_i - \bar{x}) = 0 \]</span></p>
+<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\theta}_1 \times \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
+\]</span></p>
+<p>By using the definition of correlation <span class="math inline">\(\left(r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y}) \right)\)</span> and standard deviation <span class="math inline">\(\left(\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \right)\)</span>, we can conclude: <span class="math display">\[r \sigma_x \sigma_y = \hat{\theta}_1 \times \sigma_x^2\]</span> <span class="math display">\[\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]</span></p>
+<p>This is exactly the formula given in Data 8!</p>
+<p>Remember, this derivation found the optimal model parameters for SLR when using the MSE cost function. If we had used a different model or different loss function, we likely would have found different values for the best model parameters. However, regardless of the model and loss used, we can <em>always</em> follow these three steps to fit the model.</p>
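+<p>One way to sanity-check the derivation is numerically. The sketch below, on a hypothetical toy dataset, computes <span class="math inline">\(\hat{\theta}_0\)</span> and <span class="math inline">\(\hat{\theta}_1\)</span> from the last equation before substituting <span class="math inline">\(r\)</span>, then checks that both estimating equations hold and that nudging a parameter away from the optimum increases the MSE. The comparison against <code>np.polyfit</code> is an added cross-check, not part of the derivation itself.</p>

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form solution from Step 3
theta_1_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
theta_0_hat = y.mean() - theta_1_hat * x.mean()


def mse(theta_0, theta_1):
    # Average squared loss of the SLR model on (x, y)
    return np.mean((y - (theta_0 + theta_1 * x)) ** 2)


# Both estimating equations should be (numerically) zero at the optimum
residuals = y - (theta_0_hat + theta_1_hat * x)
eq1 = np.mean(residuals)       # (1/n) * sum of (y_i - y_hat_i)
eq2 = np.mean(residuals * x)   # (1/n) * sum of (y_i - y_hat_i) * x_i

# Cross-check against NumPy's least squares line fit (returns slope first)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"theta_0_hat = {theta_0_hat:.2f}, theta_1_hat = {theta_1_hat:.2f}")
print(f"MSE at optimum: {mse(theta_0_hat, theta_1_hat):.3f}")
print(f"MSE after nudging theta_0: {mse(theta_0_hat + 0.1, theta_1_hat):.3f}")
```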
+</section>
+<section id="evaluating-the-slr-model" class="level2" data-number="10.7">
+<h2 data-number="10.7" class="anchored" data-anchor-id="evaluating-the-slr-model"><span class="header-section-number">10.7</span> Evaluating the SLR Model</h2>
+<p>Now that we’ve explored the mathematics behind (1) choosing a model, (2) choosing a loss function, and (3) fitting the model, we’re left with one final question – how “good” are the predictions made by this “best” fitted model? To determine this, we can:</p>
+<ol type="1">
+<li><p>Visualize data and compute statistics:</p>
+<ul>
+<li>Plot the original data.</li>
+<li>Compute each column’s mean and standard deviation. If the mean and standard deviation of our predictions are close to those of the original observed <span class="math inline">\(y_i\)</span>’s, we might be inclined to say that our model has done well.</li>
+<li>(If we’re fitting a linear model) Compute the correlation <span class="math inline">\(r\)</span>. A large magnitude for the correlation coefficient between the feature and response variables could also indicate that our model has done well.</li>
+</ul></li>
+<li><p>Performance metrics:</p>
+<ul>
+<li>We can take the <strong>Root Mean Squared Error (RMSE)</strong>.
+<ul>
+<li>It’s the square root of the mean squared error (MSE), which is the average loss that we’ve been minimizing to determine optimal model parameters.</li>
+<li>RMSE is in the same units as <span class="math inline">\(y\)</span>.</li>
+<li>A lower RMSE indicates more “accurate” predictions, as we have a lower “average loss” across the data.</li>
+</ul></li>
+</ul>
+<p><span class="math display">\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}\]</span></p></li>
+<li><p>Visualization:</p>
+<ul>
+<li>Look at the residual plot of <span class="math inline">\(e_i = y_i - \hat{y}_i\)</span> to visualize the difference between actual and predicted values. A good residual plot should not show any pattern between the input features <span class="math inline">\(x_i\)</span> and the residual values <span class="math inline">\(e_i\)</span>.</li>
+</ul></li>
+</ol>
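+<p>As a small illustration of the performance metric above, the sketch below computes residuals and RMSE for made-up values. The function name <code>rmse</code> and both arrays are hypothetical and used only for this example.</p>

```python
import numpy as np


def rmse(y, y_hat):
    # Square root of the mean squared error; same units as y
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))


# Hypothetical observed values and model predictions
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y - y_hat          # e_i = y_i - y_hat_i: [1, 0, -1, -2]
model_rmse = rmse(y, y_hat)    # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)

print(f"residuals: {residuals}")
print(f"RMSE: {model_rmse:.3f}")
```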
+<p>To illustrate this process, let’s take a look at <strong>Anscombe’s quartet</strong>.</p>
+<section id="four-mysterious-datasets-anscombes-quartet" class="level3" data-number="10.7.1">
+<h3 data-number="10.7.1" class="anchored" data-anchor-id="four-mysterious-datasets-anscombes-quartet"><span class="header-section-number">10.7.1</span> Four Mysterious Datasets (Anscombe’s quartet)</h3>
+<p>Let’s take a look at four different datasets.</p>
+<div id="dc45719f" class="cell" data-execution_count="3">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
+<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
+<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
+<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>matplotlib inline</span>
+<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
+<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> itertools</span>
+<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> mpl_toolkits.mplot3d <span class="im">import</span> Axes3D</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+</div>
+<div id="a29af7cd" class="cell" data-execution_count="4">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Big font helper</span></span>
+<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> adjust_fontsize(size<span class="op">=</span><span class="va">None</span>):</span>
+<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>    SMALL_SIZE <span class="op">=</span> <span class="dv">8</span></span>
+<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>    MEDIUM_SIZE <span class="op">=</span> <span class="dv">10</span></span>
+<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>    BIGGER_SIZE <span class="op">=</span> <span class="dv">12</span></span>
+<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> size <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:</span>
+<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a>        SMALL_SIZE <span class="op">=</span> MEDIUM_SIZE <span class="op">=</span> BIGGER_SIZE <span class="op">=</span> size</span>
+<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"font"</span>, size<span class="op">=</span>SMALL_SIZE)  <span class="co"># controls default text sizes</span></span>
+<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"axes"</span>, titlesize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the axes title</span></span>
+<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"axes"</span>, labelsize<span class="op">=</span>MEDIUM_SIZE)  <span class="co"># fontsize of the x and y labels</span></span>
+<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"xtick"</span>, labelsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the tick labels</span></span>
+<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"ytick"</span>, labelsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the tick labels</span></span>
+<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"legend"</span>, fontsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># legend fontsize</span></span>
+<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"figure"</span>, titlesize<span class="op">=</span>BIGGER_SIZE)  <span class="co"># fontsize of the figure title</span></span>
+<span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-18"><a href="#cb4-18" aria-hidden="true" tabindex="-1"></a><span class="co"># Helper functions</span></span>
+<span id="cb4-19"><a href="#cb4-19" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> standard_units(x):</span>
+<span id="cb4-20"><a href="#cb4-20" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> (x <span class="op">-</span> np.mean(x)) <span class="op">/</span> np.std(x)</span>
+<span id="cb4-21"><a href="#cb4-21" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-22"><a href="#cb4-22" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-23"><a href="#cb4-23" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> correlation(x, y):</span>
+<span id="cb4-24"><a href="#cb4-24" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean(standard_units(x) <span class="op">*</span> standard_units(y))</span>
+<span id="cb4-25"><a href="#cb4-25" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-26"><a href="#cb4-26" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-27"><a href="#cb4-27" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> slope(x, y):</span>
+<span id="cb4-28"><a href="#cb4-28" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> correlation(x, y) <span class="op">*</span> np.std(y) <span class="op">/</span> np.std(x)</span>
+<span id="cb4-29"><a href="#cb4-29" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-30"><a href="#cb4-30" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-31"><a href="#cb4-31" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> intercept(x, y):</span>
+<span id="cb4-32"><a href="#cb4-32" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean(y) <span class="op">-</span> slope(x, y) <span class="op">*</span> np.mean(x)</span>
+<span id="cb4-33"><a href="#cb4-33" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-34"><a href="#cb4-34" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-35"><a href="#cb4-35" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> fit_least_squares(x, y):</span>
+<span id="cb4-36"><a href="#cb4-36" aria-hidden="true" tabindex="-1"></a>    theta_0 <span class="op">=</span> intercept(x, y)</span>
+<span id="cb4-37"><a href="#cb4-37" aria-hidden="true" tabindex="-1"></a>    theta_1 <span class="op">=</span> slope(x, y)</span>
+<span id="cb4-38"><a href="#cb4-38" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> theta_0, theta_1</span>
+<span id="cb4-39"><a href="#cb4-39" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-40"><a href="#cb4-40" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-41"><a href="#cb4-41" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> predict(x, theta_0, theta_1):</span>
+<span id="cb4-42"><a href="#cb4-42" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> theta_0 <span class="op">+</span> theta_1 <span class="op">*</span> x</span>
+<span id="cb4-43"><a href="#cb4-43" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-44"><a href="#cb4-44" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-45"><a href="#cb4-45" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> compute_mse(y, yhat):</span>
+<span id="cb4-46"><a href="#cb4-46" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean((y <span class="op">-</span> yhat) <span class="op">**</span> <span class="dv">2</span>)</span>
+<span id="cb4-47"><a href="#cb4-47" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-48"><a href="#cb4-48" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb4-49"><a href="#cb4-49" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">"default"</span>)  <span class="co"># Revert style to default mpl</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+</div>
+<div id="40b40b77" class="cell" data-execution_count="5">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">"default"</span>)  <span class="co"># Revert style to default mpl</span></span>
+<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>NO_VIZ, RESID, RESID_SCATTER <span class="op">=</span> <span class="bu">range</span>(<span class="dv">3</span>)</span>
+<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> least_squares_evaluation(x, y, visualize<span class="op">=</span>NO_VIZ):</span>
+<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>    <span class="co"># statistics</span></span>
+<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"x_mean : </span><span class="sc">{</span>np<span class="sc">.</span>mean(x)<span class="sc">:.2f}</span><span class="ss">, y_mean : </span><span class="sc">{</span>np<span class="sc">.</span>mean(y)<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"x_stdev: </span><span class="sc">{</span>np<span class="sc">.</span>std(x)<span class="sc">:.2f}</span><span class="ss">, y_stdev: </span><span class="sc">{</span>np<span class="sc">.</span>std(y)<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"r = Correlation(x, y): </span><span class="sc">{</span>correlation(x, y)<span class="sc">:.3f}</span><span class="ss">"</span>)</span>
+<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Performance metrics</span></span>
+<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"theta_0: </span><span class="sc">{</span>ahat<span class="sc">:.2f}</span><span class="ss">, theta_1: </span><span class="sc">{</span>bhat<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"RMSE: </span><span class="sc">{</span>np<span class="sc">.</span>sqrt(compute_mse(y, yhat))<span class="sc">:.3f}</span><span class="ss">"</span>)</span>
+<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a>    <span class="co"># visualization</span></span>
+<span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a>    fig, ax_resid <span class="op">=</span> <span class="va">None</span>, <span class="va">None</span></span>
+<span id="cb5-19"><a href="#cb5-19" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> visualize <span class="op">==</span> RESID_SCATTER:</span>
+<span id="cb5-20"><a href="#cb5-20" aria-hidden="true" tabindex="-1"></a>        fig, axs <span class="op">=</span> plt.subplots(<span class="dv">1</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">8</span>, <span class="dv">3</span>))</span>
+<span id="cb5-21"><a href="#cb5-21" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].scatter(x, y)</span>
+<span id="cb5-22"><a href="#cb5-22" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].plot(x, yhat)</span>
+<span id="cb5-23"><a href="#cb5-23" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].set_title(<span class="st">"LS fit"</span>)</span>
+<span id="cb5-24"><a href="#cb5-24" aria-hidden="true" tabindex="-1"></a>        ax_resid <span class="op">=</span> axs[<span class="dv">1</span>]</span>
+<span id="cb5-25"><a href="#cb5-25" aria-hidden="true" tabindex="-1"></a>    <span class="cf">elif</span> visualize <span class="op">==</span> RESID:</span>
+<span id="cb5-26"><a href="#cb5-26" aria-hidden="true" tabindex="-1"></a>        fig <span class="op">=</span> plt.figure(figsize<span class="op">=</span>(<span class="dv">4</span>, <span class="dv">3</span>))</span>
+<span id="cb5-27"><a href="#cb5-27" aria-hidden="true" tabindex="-1"></a>        ax_resid <span class="op">=</span> plt.gca()</span>
+<span id="cb5-28"><a href="#cb5-28" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-29"><a href="#cb5-29" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> ax_resid <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:</span>
+<span id="cb5-30"><a href="#cb5-30" aria-hidden="true" tabindex="-1"></a>        ax_resid.scatter(x, y <span class="op">-</span> yhat, color<span class="op">=</span><span class="st">"red"</span>)</span>
+<span id="cb5-31"><a href="#cb5-31" aria-hidden="true" tabindex="-1"></a>        ax_resid.plot([<span class="bu">min</span>(x), <span class="bu">max</span>(x)], [<span class="dv">0</span>, <span class="dv">0</span>], color<span class="op">=</span><span class="st">"black"</span>)</span>
+<span id="cb5-32"><a href="#cb5-32" aria-hidden="true" tabindex="-1"></a>        ax_resid.set_title(<span class="st">"Residuals"</span>)</span>
+<span id="cb5-33"><a href="#cb5-33" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb5-34"><a href="#cb5-34" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> fig</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+</div>
+<div id="eeae5630" class="cell" data-execution_count="6">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load in four different datasets: I, II, III, IV</span></span>
+<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>x <span class="op">=</span> [<span class="dv">10</span>, <span class="dv">8</span>, <span class="dv">13</span>, <span class="dv">9</span>, <span class="dv">11</span>, <span class="dv">14</span>, <span class="dv">6</span>, <span class="dv">4</span>, <span class="dv">12</span>, <span class="dv">7</span>, <span class="dv">5</span>]</span>
+<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>y1 <span class="op">=</span> [<span class="fl">8.04</span>, <span class="fl">6.95</span>, <span class="fl">7.58</span>, <span class="fl">8.81</span>, <span class="fl">8.33</span>, <span class="fl">9.96</span>, <span class="fl">7.24</span>, <span class="fl">4.26</span>, <span class="fl">10.84</span>, <span class="fl">4.82</span>, <span class="fl">5.68</span>]</span>
+<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>y2 <span class="op">=</span> [<span class="fl">9.14</span>, <span class="fl">8.14</span>, <span class="fl">8.74</span>, <span class="fl">8.77</span>, <span class="fl">9.26</span>, <span class="fl">8.10</span>, <span class="fl">6.13</span>, <span class="fl">3.10</span>, <span class="fl">9.13</span>, <span class="fl">7.26</span>, <span class="fl">4.74</span>]</span>
+<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>y3 <span class="op">=</span> [<span class="fl">7.46</span>, <span class="fl">6.77</span>, <span class="fl">12.74</span>, <span class="fl">7.11</span>, <span class="fl">7.81</span>, <span class="fl">8.84</span>, <span class="fl">6.08</span>, <span class="fl">5.39</span>, <span class="fl">8.15</span>, <span class="fl">6.42</span>, <span class="fl">5.73</span>]</span>
+<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>x4 <span class="op">=</span> [<span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">19</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>]</span>
+<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>y4 <span class="op">=</span> [<span class="fl">6.58</span>, <span class="fl">5.76</span>, <span class="fl">7.71</span>, <span class="fl">8.84</span>, <span class="fl">8.47</span>, <span class="fl">7.04</span>, <span class="fl">5.25</span>, <span class="fl">12.50</span>, <span class="fl">5.56</span>, <span class="fl">7.91</span>, <span class="fl">6.89</span>]</span>
+<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a>anscombe <span class="op">=</span> {</span>
+<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a>    <span class="st">"I"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y1)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a>    <span class="st">"II"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y2)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a>    <span class="st">"III"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y3)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>    <span class="st">"IV"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x4, y4)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>}</span>
+<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot the scatter plot and line of best fit</span></span>
+<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i, dataset <span class="kw">in</span> <span class="bu">enumerate</span>([<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]):</span>
+<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a>    x, y <span class="op">=</span> ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>]</span>
+<span id="cb6-22"><a href="#cb6-22" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb6-23"><a href="#cb6-23" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb6-24"><a href="#cb6-24" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].scatter(x, y, alpha<span class="op">=</span><span class="fl">0.6</span>, color<span class="op">=</span><span class="st">"red"</span>)  <span class="co"># plot the x, y points</span></span>
+<span id="cb6-25"><a href="#cb6-25" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].plot(x, yhat)  <span class="co"># plot the line of best fit</span></span>
+<span id="cb6-26"><a href="#cb6-26" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_xlabel(<span class="ss">f"$x_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb6-27"><a href="#cb6-27" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_ylabel(<span class="ss">f"$y_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb6-28"><a href="#cb6-28" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_title(<span class="ss">f"Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss">"</span>)</span>
+<span id="cb6-29"><a href="#cb6-29" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb6-30"><a href="#cb6-30" aria-hidden="true" tabindex="-1"></a>plt.show()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+<div class="cell-output cell-output-display">
+<div>
+<figure class="figure">
+<p><img src="intro_to_modeling_files/figure-html/cell-7-output-1.png" class="img-fluid figure-img"></p>
+</figure>
+</div>
+</div>
+</div>
+<p>While these four sets of data points look very different, they have nearly identical means <span class="math inline">\(\bar x\)</span>, <span class="math inline">\(\bar y\)</span>, standard deviations <span class="math inline">\(\sigma_x\)</span>, <span class="math inline">\(\sigma_y\)</span>, correlation <span class="math inline">\(r\)</span>, and RMSE: the values agree to two decimal places. If we looked only at these summary statistics, we would probably conclude that the datasets are similar.</p>
+<div id="831b3b49" class="cell" data-execution_count="7">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> dataset <span class="kw">in</span> [<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]:</span>
+<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"&gt;&gt;&gt; Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss">:"</span>)</span>
+<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>    fig <span class="op">=</span> least_squares_evaluation(ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>], visualize<span class="op">=</span>NO_VIZ)</span>
+<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>()</span>
+<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+<div class="cell-output cell-output-stdout">
+<pre><code>&gt;&gt;&gt; Dataset I:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.119
+
+
+&gt;&gt;&gt; Dataset II:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.119
+
+
+&gt;&gt;&gt; Dataset III:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.118
+
+
+&gt;&gt;&gt; Dataset IV:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.817
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.118
+
+</code></pre>
+</div>
+</div>
+<p>We may also wish to visualize the model’s <strong>residuals</strong>, defined as the difference between the observed and predicted <span class="math inline">\(y_i\)</span> values (<span class="math inline">\(e_i = y_i - \hat{y}_i\)</span>). Residuals show how far “off” each prediction is from the observed value. Recall that you explored this concept in <a href="https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html?highlight=heteroscedasticity#detecting-heteroscedasticity">Data 8</a>: a good regression fit should display no clear pattern in its plot of residuals. The residual plots for Anscombe’s quartet are displayed below. Note that only the first plot shows no clear pattern in the residuals. This indicates that SLR is not the best choice of model for the remaining three datasets.</p>
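+<p>As a minimal, self-contained sketch of the definition <span class="math inline">\(e_i = y_i - \hat{y}_i\)</span>, the snippet below computes the residuals for Dataset I using only NumPy. It does not use the notebook's helper functions; the names <code>theta_0</code>, <code>theta_1</code>, <code>y_hat</code>, and <code>residuals</code> are chosen here for illustration. It also checks two standard properties of a least squares line with an intercept: the residuals sum to zero and are uncorrelated with <span class="math inline">\(x\)</span>.</p>

```python
import numpy as np

# Anscombe's Dataset I, copied from the data-loading cell
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Least squares estimates for SLR:
# slope = r * sigma_y / sigma_x, intercept = y_mean - slope * x_mean
r = np.corrcoef(x, y)[0, 1]
theta_1 = r * np.std(y) / np.std(x)
theta_0 = np.mean(y) - theta_1 * np.mean(x)

# Predictions and residuals e_i = y_i - y_hat_i
y_hat = theta_0 + theta_1 * x
residuals = y - y_hat

print(np.round(residuals, 2))

# Properties of a least squares fit with an intercept term
print(np.isclose(residuals.sum(), 0))                   # residuals sum to zero
print(np.isclose(np.corrcoef(x, residuals)[0, 1], 0))   # no linear association with x
```

+<p>Because these two properties hold for every least squares line, they cannot tell us whether the model is appropriate. Judging that requires looking at the shape of the residual plot.</p>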
+<!-- <img src="images/residual.png" alt='residual' width='600'> -->
+<div id="2a6b1d76" class="cell" data-execution_count="8">
+<details class="code-fold">
+<summary>Code</summary>
+<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Residual visualization</span></span>
+<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i, dataset <span class="kw">in</span> <span class="bu">enumerate</span>([<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]):</span>
+<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a>    x, y <span class="op">=</span> ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>]</span>
+<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].scatter(</span>
+<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a>        x, y <span class="op">-</span> yhat, alpha<span class="op">=</span><span class="fl">0.6</span>, color<span class="op">=</span><span class="st">"red"</span></span>
+<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a>    )  <span class="co"># plot the residuals against x</span></span>
+<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].plot(</span>
+<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a>        x, np.zeros_like(x), color<span class="op">=</span><span class="st">"black"</span></span>
+<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a>    )  <span class="co"># plot the zero-residual reference line</span></span>
+<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_xlabel(<span class="ss">f"$x_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_ylabel(<span class="ss">f"$e_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_title(<span class="ss">f"Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss"> Residuals"</span>)</span>
+<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb9-19"><a href="#cb9-19" aria-hidden="true" tabindex="-1"></a>plt.show()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
+<div class="cell-output cell-output-display">
+<div>
+<figure class="figure">
+<p><img src="intro_to_modeling_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
+</figure>
+</div>
+</div>
+</div>
+
+
+<!-- -->
+
+</section>
+</section>
+
+</main> <!-- /main -->
+<script id="quarto-html-after-body" type="application/javascript">
+window.document.addEventListener("DOMContentLoaded", function (event) {
+  const toggleBodyColorMode = (bsSheetEl) => {
+    const mode = bsSheetEl.getAttribute("data-mode");
+    const bodyEl = window.document.querySelector("body");
+    if (mode === "dark") {
+      bodyEl.classList.add("quarto-dark");
+      bodyEl.classList.remove("quarto-light");
+    } else {
+      bodyEl.classList.add("quarto-light");
+      bodyEl.classList.remove("quarto-dark");
+    }
+  }
+  const toggleBodyColorPrimary = () => {
+    const bsSheetEl = window.document.querySelector("link#quarto-bootstrap");
+    if (bsSheetEl) {
+      toggleBodyColorMode(bsSheetEl);
+    }
+  }
+  toggleBodyColorPrimary();  
+  const icon = "";
+  const anchorJS = new window.AnchorJS();
+  anchorJS.options = {
+    placement: 'right',
+    icon: icon
+  };
+  anchorJS.add('.anchored');
+  const isCodeAnnotation = (el) => {
+    for (const clz of el.classList) {
+      if (clz.startsWith('code-annotation-')) {                     
+        return true;
+      }
+    }
+    return false;
+  }
+  const clipboard = new window.ClipboardJS('.code-copy-button', {
+    text: function(trigger) {
+      const codeEl = trigger.previousElementSibling.cloneNode(true);
+      for (const childEl of codeEl.children) {
+        if (isCodeAnnotation(childEl)) {
+          childEl.remove();
+        }
+      }
+      return codeEl.innerText;
+    }
+  });
+  clipboard.on('success', function(e) {
+    // button target
+    const button = e.trigger;
+    // don't keep focus
+    button.blur();
+    // flash "checked"
+    button.classList.add('code-copy-button-checked');
+    var currentTitle = button.getAttribute("title");
+    button.setAttribute("title", "Copied!");
+    let tooltip;
+    if (window.bootstrap) {
+      button.setAttribute("data-bs-toggle", "tooltip");
+      button.setAttribute("data-bs-placement", "left");
+      button.setAttribute("data-bs-title", "Copied!");
+      tooltip = new bootstrap.Tooltip(button, 
+        { trigger: "manual", 
+          customClass: "code-copy-button-tooltip",
+          offset: [0, -8]});
+      tooltip.show();    
+    }
+    setTimeout(function() {
+      if (tooltip) {
+        tooltip.hide();
+        button.removeAttribute("data-bs-title");
+        button.removeAttribute("data-bs-toggle");
+        button.removeAttribute("data-bs-placement");
+      }
+      button.setAttribute("title", currentTitle);
+      button.classList.remove('code-copy-button-checked');
+    }, 1000);
+    // clear code selection
+    e.clearSelection();
+  });
+  const viewSource = window.document.getElementById('quarto-view-source') ||
+                     window.document.getElementById('quarto-code-tools-source');
+  if (viewSource) {
+    const sourceUrl = viewSource.getAttribute("data-quarto-source-url");
+    viewSource.addEventListener("click", function(e) {
+      if (sourceUrl) {
+        // rstudio viewer pane
+        if (/\bcapabilities=\b/.test(window.location)) {
+          window.open(sourceUrl);
+        } else {
+          window.location.href = sourceUrl;
+        }
+      } else {
+        const modal = new bootstrap.Modal(document.getElementById('quarto-embedded-source-code-modal'));
+        modal.show();
+      }
+      return false;
+    });
+  }
+  function toggleCodeHandler(show) {
+    return function(e) {
+      const detailsSrc = window.document.querySelectorAll(".cell > details > .sourceCode");
+      for (let i=0; i<detailsSrc.length; i++) {
+        const details = detailsSrc[i].parentElement;
+        if (show) {
+          details.open = true;
+        } else {
+          details.removeAttribute("open");
+        }
+      }
+      const cellCodeDivs = window.document.querySelectorAll(".cell > .sourceCode");
+      const fromCls = show ? "hidden" : "unhidden";
+      const toCls = show ? "unhidden" : "hidden";
+      for (let i=0; i<cellCodeDivs.length; i++) {
+        const codeDiv = cellCodeDivs[i];
+        if (codeDiv.classList.contains(fromCls)) {
+          codeDiv.classList.remove(fromCls);
+          codeDiv.classList.add(toCls);
+        } 
+      }
+      return false;
+    }
+  }
+  const hideAllCode = window.document.getElementById("quarto-hide-all-code");
+  if (hideAllCode) {
+    hideAllCode.addEventListener("click", toggleCodeHandler(false));
+  }
+  const showAllCode = window.document.getElementById("quarto-show-all-code");
+  if (showAllCode) {
+    showAllCode.addEventListener("click", toggleCodeHandler(true));
+  }
+  function tippyHover(el, contentFn, onTriggerFn, onUntriggerFn) {
+    const config = {
+      allowHTML: true,
+      maxWidth: 500,
+      delay: 100,
+      arrow: false,
+      appendTo: function(el) {
+          return el.parentElement;
+      },
+      interactive: true,
+      interactiveBorder: 10,
+      theme: 'quarto',
+      placement: 'bottom-start',
+    };
+    if (contentFn) {
+      config.content = contentFn;
+    }
+    if (onTriggerFn) {
+      config.onTrigger = onTriggerFn;
+    }
+    if (onUntriggerFn) {
+      config.onUntrigger = onUntriggerFn;
+    }
+    window.tippy(el, config); 
+  }
+  const noterefs = window.document.querySelectorAll('a[role="doc-noteref"]');
+  for (var i=0; i<noterefs.length; i++) {
+    const ref = noterefs[i];
+    tippyHover(ref, function() {
+      // use id or data attribute instead here
+      let href = ref.getAttribute('data-footnote-href') || ref.getAttribute('href');
+      try { href = new URL(href).hash; } catch {}
+      const id = href.replace(/^#\/?/, "");
+      const note = window.document.getElementById(id);
+      return note.innerHTML;
+    });
+  }
+  const xrefs = window.document.querySelectorAll('a.quarto-xref');
+  const processXRef = (id, note) => {
+    // Strip column container classes
+    const stripColumnClz = (el) => {
+      el.classList.remove("page-full", "page-columns");
+      if (el.children) {
+        for (const child of el.children) {
+          stripColumnClz(child);
+        }
+      }
+    }
+    stripColumnClz(note)
+    if (id === null || id.startsWith('sec-')) {
+      // Special case sections, only their first couple elements
+      const container = document.createElement("div");
+      if (note.children && note.children.length > 2) {
+        container.appendChild(note.children[0].cloneNode(true));
+        for (let i = 1; i < note.children.length; i++) {
+          const child = note.children[i];
+          if (child.tagName === "P" && child.innerText === "") {
+            continue;
+          } else {
+            container.appendChild(child.cloneNode(true));
+            break;
+          }
+        }
+        if (window.Quarto?.typesetMath) {
+          window.Quarto.typesetMath(container);
+        }
+        return container.innerHTML
+      } else {
+        if (window.Quarto?.typesetMath) {
+          window.Quarto.typesetMath(note);
+        }
+        return note.innerHTML;
+      }
+    } else {
+      // Remove any anchor links if they are present
+      const anchorLink = note.querySelector('a.anchorjs-link');
+      if (anchorLink) {
+        anchorLink.remove();
+      }
+      if (window.Quarto?.typesetMath) {
+        window.Quarto.typesetMath(note);
+      }
+      // TODO in 1.5, we should make sure this works without a callout special case
+      if (note.classList.contains("callout")) {
+        return note.outerHTML;
+      } else {
+        return note.innerHTML;
+      }
+    }
+  }
+  for (var i=0; i<xrefs.length; i++) {
+    const xref = xrefs[i];
+    tippyHover(xref, undefined, function(instance) {
+      instance.disable();
+      let url = xref.getAttribute('href');
+      let hash = undefined; 
+      if (url.startsWith('#')) {
+        hash = url;
+      } else {
+        try { hash = new URL(url).hash; } catch {}
+      }
+      if (hash) {
+        const id = hash.replace(/^#\/?/, "");
+        const note = window.document.getElementById(id);
+        if (note !== null) {
+          try {
+            const html = processXRef(id, note.cloneNode(true));
+            instance.setContent(html);
+          } finally {
+            instance.enable();
+            instance.show();
+          }
+        } else {
+          // See if we can fetch this
+          fetch(url.split('#')[0])
+          .then(res => res.text())
+          .then(html => {
+            const parser = new DOMParser();
+            const htmlDoc = parser.parseFromString(html, "text/html");
+            const note = htmlDoc.getElementById(id);
+            if (note !== null) {
+              const html = processXRef(id, note);
+              instance.setContent(html);
+            } 
+          }).finally(() => {
+            instance.enable();
+            instance.show();
+          });
+        }
+      } else {
+        // See if we can fetch a full url (with no hash to target)
+        // This is a special case and we should probably do some content thinning / targeting
+        fetch(url)
+        .then(res => res.text())
+        .then(html => {
+          const parser = new DOMParser();
+          const htmlDoc = parser.parseFromString(html, "text/html");
+          const note = htmlDoc.querySelector('main.content');
+          if (note !== null) {
+            // This should only happen for chapter cross references
+            // (since there is no id in the URL)
+            // remove the first header
+            if (note.children.length > 0 && note.children[0].tagName === "HEADER") {
+              note.children[0].remove();
+            }
+            const html = processXRef(null, note);
+            instance.setContent(html);
+          } 
+        }).finally(() => {
+          instance.enable();
+          instance.show();
+        });
+      }
+    }, function(instance) {
+    });
+  }
+      let selectedAnnoteEl;
+      const selectorForAnnotation = ( cell, annotation) => {
+        let cellAttr = 'data-code-cell="' + cell + '"';
+        let lineAttr = 'data-code-annotation="' +  annotation + '"';
+        const selector = 'span[' + cellAttr + '][' + lineAttr + ']';
+        return selector;
+      }
+      const selectCodeLines = (annoteEl) => {
+        const doc = window.document;
+        const targetCell = annoteEl.getAttribute("data-target-cell");
+        const targetAnnotation = annoteEl.getAttribute("data-target-annotation");
+        const annoteSpan = window.document.querySelector(selectorForAnnotation(targetCell, targetAnnotation));
+        const lines = annoteSpan.getAttribute("data-code-lines").split(",");
+        const lineIds = lines.map((line) => {
+          return targetCell + "-" + line;
+        })
+        let top = null;
+        let height = null;
+        let parent = null;
+        if (lineIds.length > 0) {
+            //compute the position of the single el (top and bottom and make a div)
+            const el = window.document.getElementById(lineIds[0]);
+            top = el.offsetTop;
+            height = el.offsetHeight;
+            parent = el.parentElement.parentElement;
+          if (lineIds.length > 1) {
+            const lastEl = window.document.getElementById(lineIds[lineIds.length - 1]);
+            const bottom = lastEl.offsetTop + lastEl.offsetHeight;
+            height = bottom - top;
+          }
+          if (top !== null && height !== null && parent !== null) {
+            // cook up a div (if necessary) and position it 
+            let div = window.document.getElementById("code-annotation-line-highlight");
+            if (div === null) {
+              div = window.document.createElement("div");
+              div.setAttribute("id", "code-annotation-line-highlight");
+              div.style.position = 'absolute';
+              parent.appendChild(div);
+            }
+            div.style.top = top - 2 + "px";
+            div.style.height = height + 4 + "px";
+            div.style.left = 0;
+            let gutterDiv = window.document.getElementById("code-annotation-line-highlight-gutter");
+            if (gutterDiv === null) {
+              gutterDiv = window.document.createElement("div");
+              gutterDiv.setAttribute("id", "code-annotation-line-highlight-gutter");
+              gutterDiv.style.position = 'absolute';
+              const codeCell = window.document.getElementById(targetCell);
+              const gutter = codeCell.querySelector('.code-annotation-gutter');
+              gutter.appendChild(gutterDiv);
+            }
+            gutterDiv.style.top = top - 2 + "px";
+            gutterDiv.style.height = height + 4 + "px";
+          }
+          selectedAnnoteEl = annoteEl;
+        }
+      };
+      const unselectCodeLines = () => {
+        const elementsIds = ["code-annotation-line-highlight", "code-annotation-line-highlight-gutter"];
+        elementsIds.forEach((elId) => {
+          const div = window.document.getElementById(elId);
+          if (div) {
+            div.remove();
+          }
+        });
+        selectedAnnoteEl = undefined;
+      };
+        // Handle positioning of the toggle
+    window.addEventListener(
+      "resize",
+      throttle(() => {
+        elRect = undefined;
+        if (selectedAnnoteEl) {
+          selectCodeLines(selectedAnnoteEl);
+        }
+      }, 10)
+    );
+    function throttle(fn, ms) {
+    let throttle = false;
+    let timer;
+      return (...args) => {
+        if(!throttle) { // first call gets through
+            fn.apply(this, args);
+            throttle = true;
+        } else { // all the others get throttled
+            if(timer) clearTimeout(timer); // cancel #2
+            timer = setTimeout(() => {
+              fn.apply(this, args);
+              timer = throttle = false;
+            }, ms);
+        }
+      };
+    }
+      // Attach click handler to the DT
+      const annoteDls = window.document.querySelectorAll('dt[data-target-cell]');
+      for (const annoteDlNode of annoteDls) {
+        annoteDlNode.addEventListener('click', (event) => {
+          const clickedEl = event.target;
+          if (clickedEl !== selectedAnnoteEl) {
+            unselectCodeLines();
+            const activeEl = window.document.querySelector('dt[data-target-cell].code-annotation-active');
+            if (activeEl) {
+              activeEl.classList.remove('code-annotation-active');
+            }
+            selectCodeLines(clickedEl);
+            clickedEl.classList.add('code-annotation-active');
+          } else {
+            // Unselect the line
+            unselectCodeLines();
+            clickedEl.classList.remove('code-annotation-active');
+          }
+        });
+      }
+  const findCites = (el) => {
+    const parentEl = el.parentElement;
+    if (parentEl) {
+      const cites = parentEl.dataset.cites;
+      if (cites) {
+        return {
+          el,
+          cites: cites.split(' ')
+        };
+      } else {
+        return findCites(el.parentElement)
+      }
+    } else {
+      return undefined;
+    }
+  };
+  var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]');
+  for (var i=0; i<bibliorefs.length; i++) {
+    const ref = bibliorefs[i];
+    const citeInfo = findCites(ref);
+    if (citeInfo) {
+      tippyHover(citeInfo.el, function() {
+        var popup = window.document.createElement('div');
+        citeInfo.cites.forEach(function(cite) {
+          var citeDiv = window.document.createElement('div');
+          citeDiv.classList.add('hanging-indent');
+          citeDiv.classList.add('csl-entry');
+          var biblioDiv = window.document.getElementById('ref-' + cite);
+          if (biblioDiv) {
+            citeDiv.innerHTML = biblioDiv.innerHTML;
+          }
+          popup.appendChild(citeDiv);
+        });
+        return popup.innerHTML;
+      });
+    }
+  }
+});
+</script>
+<nav class="page-navigation column-body">
+  <div class="nav-page nav-page-previous">
+      <a href="../sampling/sampling.html" class="pagination-link">
+        <i class="bi bi-arrow-left-short"></i> <span class="nav-page-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span>
+      </a>          
+  </div>
+  <div class="nav-page nav-page-next">
+  </div>
+</nav><div class="modal fade" id="quarto-embedded-source-code-modal" tabindex="-1" aria-labelledby="quarto-embedded-source-code-modal-label" aria-hidden="true"><div class="modal-dialog modal-dialog-scrollable"><div class="modal-content"><div class="modal-header"><h5 class="modal-title" id="quarto-embedded-source-code-modal-label">Source Code</h5><button class="btn-close" data-bs-dismiss="modal"></button></div><div class="modal-body"><div class="">
+<div class="sourceCode" id="cb10" data-shortcodes="false"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">---</span></span>
+<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="an">title:</span><span class="co"> Introduction to Modeling</span></span>
+<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="an">execute:</span></span>
+<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="co">  echo: true</span></span>
+<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="an">format:</span></span>
+<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="co">  html:</span></span>
+<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="co">    code-fold: true</span></span>
+<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="co">    code-tools: true</span></span>
+<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="co">    toc: true</span></span>
+<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="co">    toc-title: Introduction to Modeling</span></span>
+<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a><span class="co">    page-layout: full</span></span>
+<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="co">    theme:</span></span>
+<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="co">      - cosmo</span></span>
+<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a><span class="co">      - cerulean</span></span>
+<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a><span class="co">    callout-icon: false</span></span>
+<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a><span class="an">jupyter:</span><span class="co"> python3</span></span>
+<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a><span class="co">---</span></span>
+<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>::: {.callout-note collapse="false"}</span>
+<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a><span class="fu">## Learning Outcomes</span></span>
+<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Understand what models are and how to carry out the four-step modeling process.</span>
+<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Define the concept of loss and gain familiarity with $L_1$ and $L_2$ loss.</span>
+<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Fit the Simple Linear Regression model using minimization techniques.</span>
+<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a>:::</span>
+<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a>Up until this point in the semester, we've focused on analyzing datasets. We've looked into the early stages of the data science lifecycle, focusing on the programming tools, visualization techniques, and data cleaning methods needed for data analysis.</span>
+<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a>This lecture marks a shift in focus. We will move away from examining datasets to actually *using* our data to better understand the world. Specifically, the next sequence of lectures will explore predictive modeling: generating models to make some predictions about the world around us. In this lecture, we'll introduce the conceptual framework for setting up a modeling task. In the next few lectures, we'll put this framework into practice by implementing various kinds of models.</span>
+<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a><span class="fu">## What is a Model?</span></span>
+<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-32"><a href="#cb10-32" aria-hidden="true" tabindex="-1"></a>A model is an **idealized representation** of a system. A system is a set of principles or procedures according to which something functions. We live in a world full of systems: the procedure of turning on a light happens according to a specific set of rules dictating the flow of electricity. The truth behind how any event occurs is usually complex, and many times the specifics are unknown. The workings of the world can be viewed as its own giant procedure. Models seek to simplify the world and distill it into workable pieces.</span>
+<span id="cb10-33"><a href="#cb10-33" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-34"><a href="#cb10-34" aria-hidden="true" tabindex="-1"></a>Example:</span>
+<span id="cb10-35"><a href="#cb10-35" aria-hidden="true" tabindex="-1"></a>We model the fall of an object on Earth as subject to a constant acceleration of $9.81 m/s^2$ due to gravity.</span>
+<span id="cb10-36"><a href="#cb10-36" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-37"><a href="#cb10-37" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>While this describes the behavior of our system, it is merely an approximation.</span>
+<span id="cb10-38"><a href="#cb10-38" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>It doesn’t account for the effects of air resistance, local variations in gravity, etc.</span>
+<span id="cb10-39"><a href="#cb10-39" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>In practice, it’s accurate enough to be useful!</span>
+<span id="cb10-40"><a href="#cb10-40" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-41"><a href="#cb10-41" aria-hidden="true" tabindex="-1"></a><span class="fu">### Reasons for Building Models</span></span>
+<span id="cb10-42"><a href="#cb10-42" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-43"><a href="#cb10-43" aria-hidden="true" tabindex="-1"></a>Why do we want to build models? As far as data scientists and statisticians are concerned, there are three reasons, and each implies a different focus on modeling.</span>
+<span id="cb10-44"><a href="#cb10-44" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-45"><a href="#cb10-45" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>To explain complex phenomena occurring in the world we live in. Examples of this might be:</span>
+<span id="cb10-46"><a href="#cb10-46" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-47"><a href="#cb10-47" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>How is the parents' average height related to their children's average height?</span>
+<span id="cb10-48"><a href="#cb10-48" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>How do an object’s velocity and acceleration affect how far it travels? (Physics: $d = d_0 + vt + \frac{1}{2}at^2$) </span>
+<span id="cb10-49"><a href="#cb10-49" aria-hidden="true" tabindex="-1"></a>    </span>
+<span id="cb10-50"><a href="#cb10-50" aria-hidden="true" tabindex="-1"></a>    In these cases, we care about creating models that are *simple and interpretable*, allowing us to understand what the relationships between our variables are.</span>
+<span id="cb10-51"><a href="#cb10-51" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-52"><a href="#cb10-52" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>To make accurate predictions about unseen data. Some examples include:</span>
+<span id="cb10-53"><a href="#cb10-53" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-54"><a href="#cb10-54" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>Can we predict if an email is spam or not?</span>
+<span id="cb10-55"><a href="#cb10-55" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>Can we generate a one-sentence summary of this 10-page long article?</span>
+<span id="cb10-56"><a href="#cb10-56" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-57"><a href="#cb10-57" aria-hidden="true" tabindex="-1"></a>    When making predictions, we care most about accuracy, even at the cost of having an uninterpretable model. Such models are sometimes called black-box models and are common in fields like deep learning.</span>
+<span id="cb10-58"><a href="#cb10-58" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-59"><a href="#cb10-59" aria-hidden="true" tabindex="-1"></a><span class="ss">3. </span>To measure the causal effects of one event on some other event. For example,</span>
+<span id="cb10-60"><a href="#cb10-60" aria-hidden="true" tabindex="-1"></a>   </span>
+<span id="cb10-61"><a href="#cb10-61" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>Does smoking *cause* lung cancer?</span>
+<span id="cb10-62"><a href="#cb10-62" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span>Does a job training program *cause* increases in employment and wages?</span>
+<span id="cb10-63"><a href="#cb10-63" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-64"><a href="#cb10-64" aria-hidden="true" tabindex="-1"></a>    This is a much harder question because most statistical tools are designed to infer association, not causation. We will not focus on this task in Data 100, but you can take other advanced classes on causal inference (e.g., Stat 156, Data 102) if you are intrigued! </span>
+<span id="cb10-65"><a href="#cb10-65" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-66"><a href="#cb10-66" aria-hidden="true" tabindex="-1"></a>Most of the time, we aim to strike a balance between building **interpretable** models and building **accurate** models.</span>
+<span id="cb10-67"><a href="#cb10-67" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-68"><a href="#cb10-68" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-69"><a href="#cb10-69" aria-hidden="true" tabindex="-1"></a><span class="fu">### Common Types of Models</span></span>
+<span id="cb10-70"><a href="#cb10-70" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-71"><a href="#cb10-71" aria-hidden="true" tabindex="-1"></a>In general, models can be split into two categories:</span>
+<span id="cb10-72"><a href="#cb10-72" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-73"><a href="#cb10-73" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>Deterministic physical (mechanistic) models: Laws that govern how the world works.</span>
+<span id="cb10-74"><a href="#cb10-74" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-75"><a href="#cb10-75" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span><span class="co">[</span><span class="ot">Kepler's Third Law of Planetary Motion (1619)</span><span class="co">](https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion#Third_law)</span>: The ratio of the square of an object's orbital period with the cube of the semi-major axis of its orbit is the same for all objects orbiting the same primary.</span>
+<span id="cb10-76"><a href="#cb10-76" aria-hidden="true" tabindex="-1"></a><span class="ss">        - </span>$T^2 \propto R^3$</span>
+<span id="cb10-77"><a href="#cb10-77" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-78"><a href="#cb10-78" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span><span class="co">[</span><span class="ot">Newton's Laws: motion and gravitation (1687)</span><span class="co">](https://en.wikipedia.org/wiki/Newton%27s_laws_of_motion)</span>: Newton’s second law of motion models the relationship between the mass of an object and the force required to accelerate it.</span>
+<span id="cb10-79"><a href="#cb10-79" aria-hidden="true" tabindex="-1"></a><span class="ss">        - </span>$F = ma$</span>
+<span id="cb10-80"><a href="#cb10-80" aria-hidden="true" tabindex="-1"></a><span class="ss">        - </span>$F_g = G \frac{m_1 m_2}{r^2}$</span>
+<span id="cb10-81"><a href="#cb10-81" aria-hidden="true" tabindex="-1"></a>&lt;br&gt;</span>
+<span id="cb10-82"><a href="#cb10-82" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-83"><a href="#cb10-83" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>Probabilistic models: Models that attempt to understand how random processes evolve. These are more general and can be used to describe many phenomena in the real world. These models commonly make simplifying assumptions about the nature of the world.</span>
+<span id="cb10-84"><a href="#cb10-84" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-85"><a href="#cb10-85" aria-hidden="true" tabindex="-1"></a><span class="ss">    - </span><span class="co">[</span><span class="ot">Poisson Process models</span><span class="co">](https://en.wikipedia.org/wiki/Poisson_point_process)</span>: Used to model random events that happen with some probability at any point in time and are strictly increasing in count, such as the arrival of customers at a store. </span>
+<span id="cb10-86"><a href="#cb10-86" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-87"><a href="#cb10-87" aria-hidden="true" tabindex="-1"></a>Note: These specific models are not in the scope of Data 100 and exist to serve as motivation.</span>
+<span id="cb10-88"><a href="#cb10-88" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-89"><a href="#cb10-89" aria-hidden="true" tabindex="-1"></a><span class="fu">## Simple Linear Regression </span></span>
+<span id="cb10-90"><a href="#cb10-90" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-91"><a href="#cb10-91" aria-hidden="true" tabindex="-1"></a>The **regression line** is the unique straight line that minimizes the **mean squared error** of estimation among all straight lines. As with any straight line, it can be defined by a slope and a y-intercept:</span>
+<span id="cb10-92"><a href="#cb10-92" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-93"><a href="#cb10-93" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$\text{slope} = r \cdot \frac{\text{Standard Deviation of } y}{\text{Standard Deviation of }x}$</span>
+<span id="cb10-94"><a href="#cb10-94" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$y\text{-intercept} = \text{average of }y - \text{slope}\cdot\text{average of }x$</span>
+<span id="cb10-95"><a href="#cb10-95" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$\text{regression estimate} = y\text{-intercept} + \text{slope}\cdot x$</span>
+<span id="cb10-96"><a href="#cb10-96" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$\text{residual} =\text{observed }y - \text{regression estimate}$</span>
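+
+As a quick worked example of these four formulas, suppose (purely for illustration; these numbers are hypothetical and not taken from the lecture's data) that $r = 0.5$, the standard deviations of $x$ and $y$ are $2$ and $4$, and their averages are $10$ and $30$:
+
+$$\text{slope} = 0.5 \cdot \frac{4}{2} = 1, \qquad y\text{-intercept} = 30 - 1 \cdot 10 = 20$$
+
+$$\text{regression estimate at } x = 12: \quad 20 + 1 \cdot 12 = 32$$
+
+If the observed $y$ at $x = 12$ were $35$, the residual would be $35 - 32 = 3$.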
+<span id="cb10-97"><a href="#cb10-97" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-100"><a href="#cb10-100" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-101"><a href="#cb10-101" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
+<span id="cb10-102"><a href="#cb10-102" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
+<span id="cb10-103"><a href="#cb10-103" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
+<span id="cb10-104"><a href="#cb10-104" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
+<span id="cb10-105"><a href="#cb10-105" aria-hidden="true" tabindex="-1"></a><span class="co"># Set random seed for consistency </span></span>
+<span id="cb10-106"><a href="#cb10-106" aria-hidden="true" tabindex="-1"></a>np.random.seed(<span class="dv">43</span>)</span>
+<span id="cb10-107"><a href="#cb10-107" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">'default'</span>) </span>
+<span id="cb10-108"><a href="#cb10-108" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-109"><a href="#cb10-109" aria-hidden="true" tabindex="-1"></a><span class="co"># Generate noisy linear data for plotting</span></span>
+<span id="cb10-110"><a href="#cb10-110" aria-hidden="true" tabindex="-1"></a>x <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb10-111"><a href="#cb10-111" aria-hidden="true" tabindex="-1"></a>y <span class="op">=</span> x <span class="op">*</span> <span class="fl">0.5</span> <span class="op">-</span> <span class="dv">1</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb10-112"><a href="#cb10-112" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-113"><a href="#cb10-113" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot regression line</span></span>
+<span id="cb10-114"><a href="#cb10-114" aria-hidden="true" tabindex="-1"></a>sns.regplot(x<span class="op">=</span>x,y<span class="op">=</span>y)<span class="op">;</span></span>
+<span id="cb10-115"><a href="#cb10-115" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-116"><a href="#cb10-116" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-117"><a href="#cb10-117" aria-hidden="true" tabindex="-1"></a><span class="fu">### Notations and Definitions</span></span>
+<span id="cb10-118"><a href="#cb10-118" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-119"><a href="#cb10-119" aria-hidden="true" tabindex="-1"></a>For a pair of variables $x$ and $y$ representing our data $\mathcal{D} = <span class="sc">\{</span>(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)<span class="sc">\}</span>$, we denote their means/averages as $\bar x$ and $\bar y$ and standard deviations as $\sigma_x$ and $\sigma_y$.</span>
+<span id="cb10-120"><a href="#cb10-120" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-121"><a href="#cb10-121" aria-hidden="true" tabindex="-1"></a><span class="fu">#### Standard Units</span></span>
+<span id="cb10-122"><a href="#cb10-122" aria-hidden="true" tabindex="-1"></a>A variable is represented in standard units if the following are true:</span>
+<span id="cb10-123"><a href="#cb10-123" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-124"><a href="#cb10-124" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>0 in standard units is equal to the mean ($\bar{x}$) in the original variable's units.</span>
+<span id="cb10-125"><a href="#cb10-125" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>An increase of 1 standard unit is an increase of 1 standard deviation ($\sigma_x$) in the original variable's units.</span>
+<span id="cb10-126"><a href="#cb10-126" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-127"><a href="#cb10-127" aria-hidden="true" tabindex="-1"></a>To convert a variable $x_i$ into standard units, we subtract its mean from it and divide it by its standard deviation. For example, $x_i$ in standard units is $\frac{x_i - \bar x}{\sigma_x}$.</span>
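+
+A small worked example may help here. The three values $2$, $4$, and $6$ are made up for illustration, and the sketch assumes the population standard deviation (dividing by $n$ rather than $n - 1$):
+
+$$\bar{x} = \frac{2 + 4 + 6}{3} = 4, \qquad \sigma_x = \sqrt{\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3}} = \sqrt{\frac{8}{3}} \approx 1.63$$
+
+$$\frac{6 - \bar{x}}{\sigma_x} = \frac{6 - 4}{\sqrt{8/3}} \approx 1.22$$
+
+So the value $6$ lies about $1.22$ standard deviations above the mean, while the mean itself ($4$) maps to $0$ in standard units.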
+<span id="cb10-128"><a href="#cb10-128" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-129"><a href="#cb10-129" aria-hidden="true" tabindex="-1"></a><span class="fu">#### Correlation</span></span>
+<span id="cb10-130"><a href="#cb10-130" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-131"><a href="#cb10-131" aria-hidden="true" tabindex="-1"></a>The correlation ($r$) is the average of the product of $x$ and $y$, both measured in *standard units*.</span>
+<span id="cb10-132"><a href="#cb10-132" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-133"><a href="#cb10-133" aria-hidden="true" tabindex="-1"></a>$$r = \frac{1}{n} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$</span>
+<span id="cb10-134"><a href="#cb10-134" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-135"><a href="#cb10-135" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>Correlation measures the strength of a **linear association** between two variables.</span>
+<span id="cb10-136"><a href="#cb10-136" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>Correlations range between -1 and 1: $|r| \leq 1$, with $r=1$ indicating perfect positive linear association, and $r=-1$ indicating perfect negative linear association. The closer $r$ is to $0$, the weaker the linear association is.</span>
+<span id="cb10-137"><a href="#cb10-137" aria-hidden="true" tabindex="-1"></a><span class="ss">3. </span>Correlation says nothing about causation or non-linear association. Correlation does **not** imply causation. When $r = 0$, the two variables are uncorrelated. However, they could still be related through some non-linear relationship.</span>
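+
+To make the last point concrete, here is a tiny hypothetical dataset (chosen for illustration, not drawn from the lecture) in which $y = x^2$ exactly: the points $(-1, 1)$, $(0, 0)$, and $(1, 1)$, so $\bar{x} = 0$ and $\bar{y} = \frac{2}{3}$. The sum of products of deviations in the numerator of $r$ vanishes:
+
+$$\sum_{i=1}^{3} (x_i - \bar{x})(y_i - \bar{y}) = (-1)\left(\frac{1}{3}\right) + (0)\left(-\frac{2}{3}\right) + (1)\left(\frac{1}{3}\right) = 0$$
+
+Since neither standard deviation is zero, $r = 0$ even though $y$ is completely determined by $x$.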
+<span id="cb10-138"><a href="#cb10-138" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-141"><a href="#cb10-141" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-142"><a href="#cb10-142" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> plot_and_get_corr(ax, x, y, title):</span>
+<span id="cb10-143"><a href="#cb10-143" aria-hidden="true" tabindex="-1"></a>    ax.set_xlim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>)</span>
+<span id="cb10-144"><a href="#cb10-144" aria-hidden="true" tabindex="-1"></a>    ax.set_ylim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>)</span>
+<span id="cb10-145"><a href="#cb10-145" aria-hidden="true" tabindex="-1"></a>    ax.set_xticks([])</span>
+<span id="cb10-146"><a href="#cb10-146" aria-hidden="true" tabindex="-1"></a>    ax.set_yticks([])</span>
+<span id="cb10-147"><a href="#cb10-147" aria-hidden="true" tabindex="-1"></a>    ax.scatter(x, y, alpha <span class="op">=</span> <span class="fl">0.73</span>)</span>
+<span id="cb10-148"><a href="#cb10-148" aria-hidden="true" tabindex="-1"></a>    r <span class="op">=</span> np.corrcoef(x, y)[<span class="dv">0</span>, <span class="dv">1</span>]</span>
+<span id="cb10-149"><a href="#cb10-149" aria-hidden="true" tabindex="-1"></a>    ax.set_title(title <span class="op">+</span> <span class="st">" (corr: </span><span class="sc">{}</span><span class="st">)"</span>.<span class="bu">format</span>(r.<span class="bu">round</span>(<span class="dv">2</span>)))</span>
+<span id="cb10-150"><a href="#cb10-150" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> r</span>
+<span id="cb10-151"><a href="#cb10-151" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-152"><a href="#cb10-152" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize <span class="op">=</span> (<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb10-153"><a href="#cb10-153" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-154"><a href="#cb10-154" aria-hidden="true" tabindex="-1"></a><span class="co"># Just noise</span></span>
+<span id="cb10-155"><a href="#cb10-155" aria-hidden="true" tabindex="-1"></a>x1, y1 <span class="op">=</span> np.random.randn(<span class="dv">2</span>, <span class="dv">100</span>)</span>
+<span id="cb10-156"><a href="#cb10-156" aria-hidden="true" tabindex="-1"></a>corr1 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">0</span>, <span class="dv">0</span>], x1, y1, title <span class="op">=</span> <span class="st">"noise"</span>)</span>
+<span id="cb10-157"><a href="#cb10-157" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-158"><a href="#cb10-158" aria-hidden="true" tabindex="-1"></a><span class="co"># Strong linear</span></span>
+<span id="cb10-159"><a href="#cb10-159" aria-hidden="true" tabindex="-1"></a>x2 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb10-160"><a href="#cb10-160" aria-hidden="true" tabindex="-1"></a>y2 <span class="op">=</span> x2 <span class="op">*</span> <span class="fl">0.5</span> <span class="op">-</span> <span class="dv">1</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb10-161"><a href="#cb10-161" aria-hidden="true" tabindex="-1"></a>corr2 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">0</span>, <span class="dv">1</span>], x2, y2, title <span class="op">=</span> <span class="st">"strong linear"</span>)</span>
+<span id="cb10-162"><a href="#cb10-162" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-163"><a href="#cb10-163" aria-hidden="true" tabindex="-1"></a><span class="co"># Unequal spread</span></span>
+<span id="cb10-164"><a href="#cb10-164" aria-hidden="true" tabindex="-1"></a>x3 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb10-165"><a href="#cb10-165" aria-hidden="true" tabindex="-1"></a>y3 <span class="op">=</span> <span class="op">-</span> x3<span class="op">/</span><span class="dv">3</span> <span class="op">+</span> np.random.randn(<span class="dv">100</span>)<span class="op">*</span>(x3)<span class="op">/</span><span class="fl">2.5</span></span>
+<span id="cb10-166"><a href="#cb10-166" aria-hidden="true" tabindex="-1"></a>corr3 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">1</span>, <span class="dv">0</span>], x3, y3, title <span class="op">=</span> <span class="st">"unequal spread"</span>)</span>
+<span id="cb10-167"><a href="#cb10-167" aria-hidden="true" tabindex="-1"></a>extent <span class="op">=</span> axs[<span class="dv">1</span>, <span class="dv">0</span>].get_window_extent().transformed(fig.dpi_scale_trans.inverted())</span>
+<span id="cb10-168"><a href="#cb10-168" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-169"><a href="#cb10-169" aria-hidden="true" tabindex="-1"></a><span class="co"># Strong non-linear</span></span>
+<span id="cb10-170"><a href="#cb10-170" aria-hidden="true" tabindex="-1"></a>x4 <span class="op">=</span> np.linspace(<span class="op">-</span><span class="dv">3</span>, <span class="dv">3</span>, <span class="dv">100</span>)</span>
+<span id="cb10-171"><a href="#cb10-171" aria-hidden="true" tabindex="-1"></a>y4 <span class="op">=</span> <span class="dv">2</span><span class="op">*</span>np.sin(x4 <span class="op">-</span> <span class="fl">1.5</span>) <span class="op">+</span> np.random.randn(<span class="dv">100</span>) <span class="op">*</span> <span class="fl">0.3</span></span>
+<span id="cb10-172"><a href="#cb10-172" aria-hidden="true" tabindex="-1"></a>corr4 <span class="op">=</span> plot_and_get_corr(axs[<span class="dv">1</span>, <span class="dv">1</span>], x4, y4, title <span class="op">=</span> <span class="st">"strong non-linear"</span>)</span>
+<span id="cb10-173"><a href="#cb10-173" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-174"><a href="#cb10-174" aria-hidden="true" tabindex="-1"></a>plt.show()</span>
+<span id="cb10-175"><a href="#cb10-175" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-176"><a href="#cb10-176" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-177"><a href="#cb10-177" aria-hidden="true" tabindex="-1"></a><span class="fu">### Alternate Form</span></span>
+<span id="cb10-178"><a href="#cb10-178" aria-hidden="true" tabindex="-1"></a>When the variables $y$ and $x$ are measured in *standard units*, the regression line for predicting $y$ based on $x$ has slope $r$ and passes through the origin.</span>
+<span id="cb10-179"><a href="#cb10-179" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-180"><a href="#cb10-180" aria-hidden="true" tabindex="-1"></a> $$\hat{y}_{su} = r \cdot x_{su}$$</span>
+<span id="cb10-181"><a href="#cb10-181" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-182"><a href="#cb10-182" aria-hidden="true" tabindex="-1"></a><span class="al">![](images/reg_line_1.png)</span></span>
+<span id="cb10-183"><a href="#cb10-183" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-184"><a href="#cb10-184" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>In the original units, this becomes</span>
+<span id="cb10-185"><a href="#cb10-185" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-186"><a href="#cb10-186" aria-hidden="true" tabindex="-1"></a>$$\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}$$</span>
+<span id="cb10-187"><a href="#cb10-187" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-188"><a href="#cb10-188" aria-hidden="true" tabindex="-1"></a><span class="al">![](images/reg_line_2.png)</span></span>
+<span id="cb10-189"><a href="#cb10-189" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-190"><a href="#cb10-190" aria-hidden="true" tabindex="-1"></a><span class="fu">### Derivation</span></span>
+<span id="cb10-191"><a href="#cb10-191" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-192"><a href="#cb10-192" aria-hidden="true" tabindex="-1"></a>Starting from the top, we have our claimed form of the regression line, and we want to show that it is equivalent to the optimal linear regression line: $\hat{y} = \hat{a} + \hat{b}x$.</span>
+<span id="cb10-193"><a href="#cb10-193" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-194"><a href="#cb10-194" aria-hidden="true" tabindex="-1"></a>Recall: </span>
+<span id="cb10-195"><a href="#cb10-195" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-196"><a href="#cb10-196" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$\hat{b} = r \cdot \frac{\text{Standard Deviation of }y}{\text{Standard Deviation of }x}$</span>
+<span id="cb10-197"><a href="#cb10-197" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>$\hat{a} = \text{average of }y - \text{slope}\cdot\text{average of }x$</span>
+<span id="cb10-198"><a href="#cb10-198" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-199"><a href="#cb10-199" aria-hidden="true" tabindex="-1"></a>:::{.callout}</span>
+<span id="cb10-200"><a href="#cb10-200" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-201"><a href="#cb10-201" aria-hidden="true" tabindex="-1"></a>Proof: </span>
+<span id="cb10-202"><a href="#cb10-202" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-203"><a href="#cb10-203" aria-hidden="true" tabindex="-1"></a>$$\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}$$</span>
+<span id="cb10-204"><a href="#cb10-204" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-205"><a href="#cb10-205" aria-hidden="true" tabindex="-1"></a>Multiply by $\sigma_y$, and add $\bar{y}$ on both sides.</span>
+<span id="cb10-206"><a href="#cb10-206" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-207"><a href="#cb10-207" aria-hidden="true" tabindex="-1"></a>$$\hat{y} = \sigma_y \cdot r \cdot \frac{x - \bar{x}}{\sigma_x} + \bar{y}$$</span>
+<span id="cb10-208"><a href="#cb10-208" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-209"><a href="#cb10-209" aria-hidden="true" tabindex="-1"></a>Distribute the coefficient $\sigma_{y}\cdot r$ over the $\frac{x - \bar{x}}{\sigma_x}$ term, then group the terms that do not depend on $x$.</span>
+<span id="cb10-210"><a href="#cb10-210" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-211"><a href="#cb10-211" aria-hidden="true" tabindex="-1"></a>$$\hat{y} = (\frac{r\sigma_y}{\sigma_x} ) \cdot x + (\bar{y} - (\frac{r\sigma_y}{\sigma_x} ) \bar{x})$$</span>
+<span id="cb10-212"><a href="#cb10-212" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-213"><a href="#cb10-213" aria-hidden="true" tabindex="-1"></a>We now see that we have a line that matches our claim:</span>
+<span id="cb10-214"><a href="#cb10-214" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-215"><a href="#cb10-215" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>slope: $r\cdot\frac{\text{SD of y}}{\text{SD of x}} = r\cdot\frac{\sigma_y}{\sigma_x}$</span>
+<span id="cb10-216"><a href="#cb10-216" aria-hidden="true" tabindex="-1"></a><span class="ss">- </span>intercept: $\bar{y} - \text{slope}\cdot \bar{x}$</span>
+<span id="cb10-217"><a href="#cb10-217" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-218"><a href="#cb10-218" aria-hidden="true" tabindex="-1"></a>Note that the error for the $i$-th datapoint is: $e_i = y_i - \hat{y}_i$</span>
+<span id="cb10-219"><a href="#cb10-219" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-220"><a href="#cb10-220" aria-hidden="true" tabindex="-1"></a>:::</span>
+<span id="cb10-221"><a href="#cb10-221" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-222"><a href="#cb10-222" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-223"><a href="#cb10-223" aria-hidden="true" tabindex="-1"></a><span class="fu">## The Modeling Process</span></span>
+<span id="cb10-224"><a href="#cb10-224" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-225"><a href="#cb10-225" aria-hidden="true" tabindex="-1"></a>At a high level, a model is a way of representing a system. In Data 100, we'll treat a model as some mathematical rule we use to describe the relationship between variables. </span>
+<span id="cb10-226"><a href="#cb10-226" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-227"><a href="#cb10-227" aria-hidden="true" tabindex="-1"></a>What variables are we modeling? Typically, we use a subset of the variables in our sample of collected data to model another variable in this data. To put this more formally, say we have the following dataset $\mathcal{D}$:</span>
+<span id="cb10-228"><a href="#cb10-228" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-229"><a href="#cb10-229" aria-hidden="true" tabindex="-1"></a>$$\mathcal{D} = <span class="sc">\{</span>(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)<span class="sc">\}</span>$$</span>
+<span id="cb10-230"><a href="#cb10-230" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-231"><a href="#cb10-231" aria-hidden="true" tabindex="-1"></a>Each pair of values $(x_i, y_i)$ represents a datapoint. In a modeling setting, we call these **observations**. $y_i$ is the dependent variable we are trying to model, also called an **output** or **response**. $x_i$ is the independent variable inputted into the model to make predictions, also known as a **feature**. </span>
+<span id="cb10-232"><a href="#cb10-232" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-233"><a href="#cb10-233" aria-hidden="true" tabindex="-1"></a>Our goal in modeling is to use the observed data $\mathcal{D}$ to predict the output variable $y_i$. We denote each prediction as $\hat{y}_i$ (read: "y hat sub i").</span>
+<span id="cb10-234"><a href="#cb10-234" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-235"><a href="#cb10-235" aria-hidden="true" tabindex="-1"></a>How do we generate these predictions? Some examples of models we'll encounter in the next few lectures are given below:</span>
+<span id="cb10-236"><a href="#cb10-236" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-237"><a href="#cb10-237" aria-hidden="true" tabindex="-1"></a>$$\hat{y}_i = \theta$$</span>
+<span id="cb10-238"><a href="#cb10-238" aria-hidden="true" tabindex="-1"></a>$$\hat{y}_i = \theta_0 + \theta_1 x_i$$</span>
+<span id="cb10-239"><a href="#cb10-239" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-240"><a href="#cb10-240" aria-hidden="true" tabindex="-1"></a>The examples above are known as **parametric models**. They relate the collected data, $x_i$, to the prediction we make, $\hat{y}_i$. A few parameters ($\theta$, $\theta_0$, $\theta_1$) are used to describe the relationship between $x_i$ and $\hat{y}_i$.</span>
+<span id="cb10-241"><a href="#cb10-241" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-242"><a href="#cb10-242" aria-hidden="true" tabindex="-1"></a>Notice that we don't immediately know the values of these parameters. While the features, $x_i$, are taken from our observed data, we need to decide what values to give $\theta$, $\theta_0$, and $\theta_1$ ourselves. This is the heart of parametric modeling: *what parameter values should we choose so our model makes the best possible predictions?*</span>
+<span id="cb10-243"><a href="#cb10-243" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-244"><a href="#cb10-244" aria-hidden="true" tabindex="-1"></a>To choose our model parameters, we'll work through the **modeling process**. </span>
+<span id="cb10-245"><a href="#cb10-245" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-246"><a href="#cb10-246" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>Choose a model: how should we represent the world?</span>
+<span id="cb10-247"><a href="#cb10-247" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>Choose a loss function: how do we quantify prediction error?</span>
+<span id="cb10-248"><a href="#cb10-248" aria-hidden="true" tabindex="-1"></a><span class="ss">3. </span>Fit the model: how do we choose the best parameters of our model given our data?</span>
+<span id="cb10-249"><a href="#cb10-249" aria-hidden="true" tabindex="-1"></a><span class="ss">4. </span>Evaluate model performance: how do we evaluate whether this process gave rise to a good model?</span>
+<span id="cb10-250"><a href="#cb10-250" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-251"><a href="#cb10-251" aria-hidden="true" tabindex="-1"></a><span class="fu">## Choosing a Model</span></span>
+<span id="cb10-252"><a href="#cb10-252" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-253"><a href="#cb10-253" aria-hidden="true" tabindex="-1"></a>Our first step is choosing a model: defining the mathematical rule that describes the relationship between the features, $x_i$, and predictions $\hat{y}_i$. </span>
+<span id="cb10-254"><a href="#cb10-254" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-255"><a href="#cb10-255" aria-hidden="true" tabindex="-1"></a>In <span class="co">[</span><span class="ot">Data 8</span><span class="co">](https://inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)</span>, you learned about the **Simple Linear Regression (SLR) model**. You learned that the model takes the form:</span>
+<span id="cb10-256"><a href="#cb10-256" aria-hidden="true" tabindex="-1"></a>$$\hat{y}_i = a + bx_i$$</span>
+<span id="cb10-257"><a href="#cb10-257" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-258"><a href="#cb10-258" aria-hidden="true" tabindex="-1"></a>In Data 100, we'll use slightly different notation: we will replace $a$ with $\theta_0$ and $b$ with $\theta_1$. This will allow us to use the same notation when we explore more complex models later on in the course.</span>
+<span id="cb10-259"><a href="#cb10-259" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-260"><a href="#cb10-260" aria-hidden="true" tabindex="-1"></a>$$\hat{y}_i = \theta_0 + \theta_1 x_i$$</span>
+<span id="cb10-261"><a href="#cb10-261" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-262"><a href="#cb10-262" aria-hidden="true" tabindex="-1"></a>The parameters of the SLR model are $\theta_0$, also called the intercept term, and $\theta_1$, also called the slope term. To create an effective model, we want to choose values for $\theta_0$ and $\theta_1$ that most accurately predict the output variable. The "best" fitting model parameters are given the special names: $\hat{\theta}_0$ and $\hat{\theta}_1$; they are the specific parameter values that allow our model to generate the best possible predictions.</span>
+<span id="cb10-263"><a href="#cb10-263" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-264"><a href="#cb10-264" aria-hidden="true" tabindex="-1"></a>In Data 8, you learned that the best SLR model parameters are:</span>
+<span id="cb10-265"><a href="#cb10-265" aria-hidden="true" tabindex="-1"></a>$$\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x} \qquad \qquad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}$$</span>
+<span id="cb10-266"><a href="#cb10-266" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-267"><a href="#cb10-267" aria-hidden="true" tabindex="-1"></a>A quick reminder on notation:</span>
+<span id="cb10-268"><a href="#cb10-268" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-269"><a href="#cb10-269" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>$\bar{y}$ and $\bar{x}$ indicate the mean value of $y$ and $x$, respectively</span>
+<span id="cb10-270"><a href="#cb10-270" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>$\sigma_y$ and $\sigma_x$ indicate the standard deviations of $y$ and $x$</span>
+<span id="cb10-271"><a href="#cb10-271" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>$r$ is the <span class="co">[</span><span class="ot">correlation coefficient</span><span class="co">](https://inferentialthinking.com/chapters/15/1/Correlation.html#the-correlation-coefficient)</span>, defined as the average of the product of $x$ and $y$ measured in standard units: $\frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y})$</span>
+<span id="cb10-272"><a href="#cb10-272" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-273"><a href="#cb10-273" aria-hidden="true" tabindex="-1"></a>In Data 100, we want to understand *how* to derive these best model coefficients. To do so, we'll introduce the concept of a loss function.</span>
+<span id="cb10-274"><a href="#cb10-274" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-275"><a href="#cb10-275" aria-hidden="true" tabindex="-1"></a><span class="fu">## Choosing a Loss Function</span></span>
+<span id="cb10-276"><a href="#cb10-276" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-277"><a href="#cb10-277" aria-hidden="true" tabindex="-1"></a>We've talked about the idea of creating the "best" possible predictions. This raises the question: how do we decide how "good" or "bad" our model's predictions are?</span>
+<span id="cb10-278"><a href="#cb10-278" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-279"><a href="#cb10-279" aria-hidden="true" tabindex="-1"></a>A **loss function** characterizes the cost, error, or fit resulting from a particular choice of model or model parameters. This function, $L(y, \hat{y})$, quantifies how "bad" or "far off" a single prediction by our model is from a true, observed value in our collected data. </span>
+<span id="cb10-280"><a href="#cb10-280" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-281"><a href="#cb10-281" aria-hidden="true" tabindex="-1"></a>The choice of loss function for a particular model will affect the accuracy and computational cost of estimation, and it'll also depend on the estimation task at hand. For example, </span>
+<span id="cb10-282"><a href="#cb10-282" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-283"><a href="#cb10-283" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Are outputs quantitative or qualitative? </span>
+<span id="cb10-284"><a href="#cb10-284" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Do outliers matter? </span>
+<span id="cb10-285"><a href="#cb10-285" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Are all errors equally costly? (e.g., a false negative on a cancer test is arguably more dangerous than a false positive) </span>
+<span id="cb10-286"><a href="#cb10-286" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-287"><a href="#cb10-287" aria-hidden="true" tabindex="-1"></a>Regardless of the specific function used, a loss function should follow two basic principles:</span>
+<span id="cb10-288"><a href="#cb10-288" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-289"><a href="#cb10-289" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>If the prediction $\hat{y}_i$ is *close* to the actual value $y_i$, loss should be low.</span>
+<span id="cb10-290"><a href="#cb10-290" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>If the prediction $\hat{y}_i$ is *far* from the actual value $y_i$, loss should be high.</span>
+<span id="cb10-291"><a href="#cb10-291" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-292"><a href="#cb10-292" aria-hidden="true" tabindex="-1"></a>Two common choices of loss function are squared loss and absolute loss. </span>
+<span id="cb10-293"><a href="#cb10-293" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-294"><a href="#cb10-294" aria-hidden="true" tabindex="-1"></a>**Squared loss**, also known as **L2 loss**, computes loss as the square of the difference between the observed $y_i$ and predicted $\hat{y}_i$:</span>
+<span id="cb10-295"><a href="#cb10-295" aria-hidden="true" tabindex="-1"></a>$$L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$</span>
+<span id="cb10-296"><a href="#cb10-296" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-297"><a href="#cb10-297" aria-hidden="true" tabindex="-1"></a>**Absolute loss**, also known as **L1 loss**, computes loss as the absolute difference between the observed $y_i$ and predicted $\hat{y}_i$:</span>
+<span id="cb10-298"><a href="#cb10-298" aria-hidden="true" tabindex="-1"></a>$$L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|$$</span>
+<span id="cb10-299"><a href="#cb10-299" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-300"><a href="#cb10-300" aria-hidden="true" tabindex="-1"></a>L1 and L2 loss give us a tool for quantifying our model's performance on a single data point. This is a good start, but ideally, we want to understand how our model performs across our *entire* dataset. A natural way to do this is to compute the average loss across all data points in the dataset. This is known as the **cost function**, $\hat{R}(\theta)$:</span>
+<span id="cb10-301"><a href="#cb10-301" aria-hidden="true" tabindex="-1"></a>$$\hat{R}(\theta) = \frac{1}{n} \sum^n_{i=1} L(y_i, \hat{y}_i)$$</span>
+<span id="cb10-302"><a href="#cb10-302" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-303"><a href="#cb10-303" aria-hidden="true" tabindex="-1"></a>The cost function has many names in the statistics literature. You may also encounter the terms:</span>
+<span id="cb10-304"><a href="#cb10-304" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-305"><a href="#cb10-305" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Empirical risk (this is why we give the cost function the name $R$)</span>
+<span id="cb10-306"><a href="#cb10-306" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Error function</span>
+<span id="cb10-307"><a href="#cb10-307" aria-hidden="true" tabindex="-1"></a><span class="ss">* </span>Average loss</span>
+<span id="cb10-308"><a href="#cb10-308" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-309"><a href="#cb10-309" aria-hidden="true" tabindex="-1"></a>We can substitute our L1 and L2 loss into the cost function definition. The **Mean Squared Error (MSE)** is the average squared loss across a dataset:</span>
+<span id="cb10-310"><a href="#cb10-310" aria-hidden="true" tabindex="-1"></a>$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$</span>
+<span id="cb10-311"><a href="#cb10-311" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-312"><a href="#cb10-312" aria-hidden="true" tabindex="-1"></a>The **Mean Absolute Error (MAE)** is the average absolute loss across a dataset:</span>
+<span id="cb10-313"><a href="#cb10-313" aria-hidden="true" tabindex="-1"></a>$$\text{MAE}= \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$</span>
+<span id="cb10-314"><a href="#cb10-314" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-315"><a href="#cb10-315" aria-hidden="true" tabindex="-1"></a><span class="fu">## Fitting the Model</span></span>
+<span id="cb10-316"><a href="#cb10-316" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-317"><a href="#cb10-317" aria-hidden="true" tabindex="-1"></a>Now that we've established the concept of a loss function, we can return to our original goal of choosing model parameters. Specifically, we want to choose the best set of model parameters that will minimize the model's cost on our dataset. This process is called fitting the model.</span>
+<span id="cb10-318"><a href="#cb10-318" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-319"><a href="#cb10-319" aria-hidden="true" tabindex="-1"></a>We know from calculus that a function has a local minimum at a point where (1) its first derivative is equal to zero and (2) its second derivative is positive. We often call the function being minimized the **objective function** (our objective is to find its minimum).</span>
+<span id="cb10-320"><a href="#cb10-320" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-321"><a href="#cb10-321" aria-hidden="true" tabindex="-1"></a>To find the optimal model parameter, we:</span>
+<span id="cb10-322"><a href="#cb10-322" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-323"><a href="#cb10-323" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>Take the derivative of the cost function with respect to that parameter</span>
+<span id="cb10-324"><a href="#cb10-324" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>Set the derivative equal to 0</span>
+<span id="cb10-325"><a href="#cb10-325" aria-hidden="true" tabindex="-1"></a><span class="ss">3. </span>Solve for the parameter</span>
+<span id="cb10-326"><a href="#cb10-326" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-327"><a href="#cb10-327" aria-hidden="true" tabindex="-1"></a>We repeat this process for each parameter present in the model. For now, we'll disregard the second derivative condition. </span>
+<span id="cb10-328"><a href="#cb10-328" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-329"><a href="#cb10-329" aria-hidden="true" tabindex="-1"></a>To help us make sense of this process, let's put it into action by deriving the optimal model parameters for simple linear regression using the mean squared error as our cost function. Remember: although the notation may look tricky, all we are doing is following the three steps above!</span>
+<span id="cb10-330"><a href="#cb10-330" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-331"><a href="#cb10-331" aria-hidden="true" tabindex="-1"></a>Step 1: take the derivative of the cost function with respect to each model parameter. We substitute the SLR model, $\hat{y}_i = \theta_0+\theta_1 x_i$, into the definition of MSE above and differentiate with respect to $\theta_0$ and $\theta_1$.</span>
+<span id="cb10-332"><a href="#cb10-332" aria-hidden="true" tabindex="-1"></a>$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2$$</span>
+<span id="cb10-333"><a href="#cb10-333" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-334"><a href="#cb10-334" aria-hidden="true" tabindex="-1"></a>$$\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)$$</span>
+<span id="cb10-335"><a href="#cb10-335" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-336"><a href="#cb10-336" aria-hidden="true" tabindex="-1"></a>$$\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)x_i$$</span>
+<span id="cb10-337"><a href="#cb10-337" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-338"><a href="#cb10-338" aria-hidden="true" tabindex="-1"></a>Let's walk through these derivations in more depth, starting with the derivative of MSE with respect to $\theta_0$.</span>
+<span id="cb10-339"><a href="#cb10-339" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-340"><a href="#cb10-340" aria-hidden="true" tabindex="-1"></a>Given our MSE above, we know that:</span>
+<span id="cb10-341"><a href="#cb10-341" aria-hidden="true" tabindex="-1"></a>$$\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}$$</span>
+<span id="cb10-342"><a href="#cb10-342" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-343"><a href="#cb10-343" aria-hidden="true" tabindex="-1"></a>Noting that the derivative of a sum is equivalent to the sum of the derivatives, this then becomes:</span>
+<span id="cb10-344"><a href="#cb10-344" aria-hidden="true" tabindex="-1"></a>$$ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_0} {(y_i - \theta_0 - \theta_1 x_i)}^{2}$$</span>
+<span id="cb10-345"><a href="#cb10-345" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-346"><a href="#cb10-346" aria-hidden="true" tabindex="-1"></a>We can then apply the chain rule.</span>
+<span id="cb10-347"><a href="#cb10-347" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-348"><a href="#cb10-348" aria-hidden="true" tabindex="-1"></a>$$ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot{(y_i - \theta_0 - \theta_1 x_i)}\cdot(-1)$$</span>
+<span id="cb10-349"><a href="#cb10-349" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-350"><a href="#cb10-350" aria-hidden="true" tabindex="-1"></a>Finally, we can simplify the constants, leaving us with our answer. </span>
+<span id="cb10-351"><a href="#cb10-351" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-352"><a href="#cb10-352" aria-hidden="true" tabindex="-1"></a>$$\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n}{(y_i - \theta_0 - \theta_1 x_i)}$$</span>
+<span id="cb10-353"><a href="#cb10-353" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-354"><a href="#cb10-354" aria-hidden="true" tabindex="-1"></a>Following the same procedure, we can take the derivative of MSE with respect to  $\theta_1$.</span>
+<span id="cb10-355"><a href="#cb10-355" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-356"><a href="#cb10-356" aria-hidden="true" tabindex="-1"></a>$$\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}$$</span>
+<span id="cb10-357"><a href="#cb10-357" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-358"><a href="#cb10-358" aria-hidden="true" tabindex="-1"></a>$$ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_1} {(y_i - \theta_0 - \theta_1 x_i)}^{2}$$</span>
+<span id="cb10-359"><a href="#cb10-359" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-360"><a href="#cb10-360" aria-hidden="true" tabindex="-1"></a>$$ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot{(y_i - \theta_0 - \theta_1 x_i)}\cdot(-x_i)$$</span>
+<span id="cb10-361"><a href="#cb10-361" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-362"><a href="#cb10-362" aria-hidden="true" tabindex="-1"></a>$$= \frac{-2}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}x_i$$</span>
+<span id="cb10-363"><a href="#cb10-363" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-364"><a href="#cb10-364" aria-hidden="true" tabindex="-1"></a>Step 2: set the derivatives equal to 0. After simplifying terms, this produces two **estimating equations**. The best set of model parameters $(\hat{\theta}_0, \hat{\theta}_1)$ *must* satisfy these two optimality conditions.</span>
+<span id="cb10-365"><a href="#cb10-365" aria-hidden="true" tabindex="-1"></a>$$0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$$</span>
+<span id="cb10-366"><a href="#cb10-366" aria-hidden="true" tabindex="-1"></a>$$0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)x_i \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)x_i = 0$$</span>
+<span id="cb10-367"><a href="#cb10-367" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-368"><a href="#cb10-368" aria-hidden="true" tabindex="-1"></a>Step 3: solve the estimating equations to compute estimates for $\hat{\theta}_0$ and $\hat{\theta}_1$.</span>
+<span id="cb10-369"><a href="#cb10-369" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-370"><a href="#cb10-370" aria-hidden="true" tabindex="-1"></a>Taking the first equation gives the estimate of $\hat{\theta}_0$:</span>
+<span id="cb10-371"><a href="#cb10-371" aria-hidden="true" tabindex="-1"></a>$$\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) = 0 $$ </span>
+<span id="cb10-372"><a href="#cb10-372" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-373"><a href="#cb10-373" aria-hidden="true" tabindex="-1"></a>$$\left(\frac{1}{n} \sum_{i=1}^n y_i \right) - \hat{\theta}_0 - \hat{\theta}_1\left(\frac{1}{n} \sum_{i=1}^n x_i \right) = 0$$</span>
+<span id="cb10-374"><a href="#cb10-374" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-375"><a href="#cb10-375" aria-hidden="true" tabindex="-1"></a>$$ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$</span>
+<span id="cb10-376"><a href="#cb10-376" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-377"><a href="#cb10-377" aria-hidden="true" tabindex="-1"></a>With a bit more maneuvering, the second equation gives the estimate of $\hat{\theta}_1$. Start by multiplying the first estimating equation by $\bar{x}$, then subtract the result from the second estimating equation.</span>
+<span id="cb10-378"><a href="#cb10-378" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-379"><a href="#cb10-379" aria-hidden="true" tabindex="-1"></a>$$\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)x_i - \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\bar{x} = 0 $$</span>
+<span id="cb10-380"><a href="#cb10-380" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-381"><a href="#cb10-381" aria-hidden="true" tabindex="-1"></a>$$\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)(x_i - \bar{x}) = 0 $$</span>
+<span id="cb10-382"><a href="#cb10-382" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-383"><a href="#cb10-383" aria-hidden="true" tabindex="-1"></a>Next, plug in $\hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_i = \bar{y} + \hat{\theta}_1(x_i - \bar{x})$:</span>
+<span id="cb10-384"><a href="#cb10-384" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-385"><a href="#cb10-385" aria-hidden="true" tabindex="-1"></a>$$\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y} - \hat{\theta}_1(x_i - \bar{x}))(x_i - \bar{x}) = 0 $$</span>
+<span id="cb10-386"><a href="#cb10-386" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-387"><a href="#cb10-387" aria-hidden="true" tabindex="-1"></a>$$\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\theta}_1 \times \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2</span>
+<span id="cb10-388"><a href="#cb10-388" aria-hidden="true" tabindex="-1"></a>$$</span>
+<span id="cb10-389"><a href="#cb10-389" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-390"><a href="#cb10-390" aria-hidden="true" tabindex="-1"></a>By using the definition of correlation $\left(r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y}) \right)$ and standard deviation $\left(\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \right)$, we can conclude:</span>
+<span id="cb10-391"><a href="#cb10-391" aria-hidden="true" tabindex="-1"></a>$$r \sigma_x \sigma_y = \hat{\theta}_1 \times \sigma_x^2$$</span>
+<span id="cb10-392"><a href="#cb10-392" aria-hidden="true" tabindex="-1"></a>$$\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}$$</span>
+<span id="cb10-393"><a href="#cb10-393" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-394"><a href="#cb10-394" aria-hidden="true" tabindex="-1"></a>Just as was given in Data 8! </span>
+<span id="cb10-395"><a href="#cb10-395" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-396"><a href="#cb10-396" aria-hidden="true" tabindex="-1"></a>Remember, this derivation found the optimal model parameters for SLR when using the MSE cost function. If we had used a different model or different loss function, we likely would have found different values for the best model parameters. However, regardless of the model and loss used, we can *always* follow these three steps to fit the model.</span>
+<span id="cb10-397"><a href="#cb10-397" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-398"><a href="#cb10-398" aria-hidden="true" tabindex="-1"></a><span class="fu">## Evaluating the SLR Model</span></span>
+<span id="cb10-399"><a href="#cb10-399" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-400"><a href="#cb10-400" aria-hidden="true" tabindex="-1"></a>Now that we've explored the mathematics behind (1) choosing a model, (2) choosing a loss function, and (3) fitting the model, we're left with one final question – how "good" are the predictions made by this "best" fitted model? To determine this, we can:</span>
+<span id="cb10-401"><a href="#cb10-401" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-402"><a href="#cb10-402" aria-hidden="true" tabindex="-1"></a><span class="ss">1. </span>Visualize data and compute statistics:</span>
+<span id="cb10-403"><a href="#cb10-403" aria-hidden="true" tabindex="-1"></a><span class="ss">   - </span>Plot the original data.</span>
+<span id="cb10-404"><a href="#cb10-404" aria-hidden="true" tabindex="-1"></a><span class="ss">   - </span>Compute each column's mean and standard deviation. If the mean and standard deviation of our predictions are close to those of the original observed $y_i$'s, we might be inclined to say that our model has done well.</span>
+<span id="cb10-405"><a href="#cb10-405" aria-hidden="true" tabindex="-1"></a><span class="ss">   - </span>(If we're fitting a linear model) Compute the correlation $r$. A large magnitude for the correlation coefficient between the feature and response variables could also indicate that our model has done well.    </span>
+<span id="cb10-406"><a href="#cb10-406" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-407"><a href="#cb10-407" aria-hidden="true" tabindex="-1"></a><span class="ss">2. </span>Performance metrics:</span>
+<span id="cb10-408"><a href="#cb10-408" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-409"><a href="#cb10-409" aria-hidden="true" tabindex="-1"></a><span class="ss">   - </span>We can take the **Root Mean Squared Error (RMSE)**.</span>
+<span id="cb10-410"><a href="#cb10-410" aria-hidden="true" tabindex="-1"></a><span class="ss">     - </span>It's the square root of the mean squared error (MSE), which is the average loss that we've been minimizing to determine optimal model parameters.</span>
+<span id="cb10-411"><a href="#cb10-411" aria-hidden="true" tabindex="-1"></a><span class="ss">     - </span>RMSE is in the same units as $y$.</span>
+<span id="cb10-412"><a href="#cb10-412" aria-hidden="true" tabindex="-1"></a><span class="ss">     - </span>A lower RMSE indicates more "accurate" predictions, as we have a lower "average loss" across the data.</span>
+<span id="cb10-413"><a href="#cb10-413" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-414"><a href="#cb10-414" aria-hidden="true" tabindex="-1"></a>   $$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$</span>
+<span id="cb10-415"><a href="#cb10-415" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-416"><a href="#cb10-416" aria-hidden="true" tabindex="-1"></a><span class="ss">3. </span>Visualization:</span>
+<span id="cb10-417"><a href="#cb10-417" aria-hidden="true" tabindex="-1"></a><span class="ss">   - </span>Look at the residual plot of $e_i = y_i - \hat{y}_i$ to visualize the difference between actual and predicted values. A good residual plot should not show any pattern between the input features $x_i$ and the residual values $e_i$.</span>
+<span id="cb10-418"><a href="#cb10-418" aria-hidden="true" tabindex="-1"></a></span>
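The RMSE metric from the list above can be sketched in a few lines. The function name `rmse` and the small arrays are assumptions chosen for illustration; they do not come from the course code.

```python
import numpy as np


def rmse(y, y_hat):
    """Root mean squared error between observations y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Square root of the average squared residual, so the result is in units of y
    return np.sqrt(np.mean((y - y_hat) ** 2))


# Hypothetical observations and predictions
y_obs = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_obs, y_pred))
```

Here the residuals are 1, 0, and -2, so the MSE is 5/3 and the RMSE is its square root, roughly 1.29 in the units of $y$.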
+<span id="cb10-419"><a href="#cb10-419" aria-hidden="true" tabindex="-1"></a>To illustrate this process, let's take a look at **Anscombe's quartet**.</span>
+<span id="cb10-420"><a href="#cb10-420" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-421"><a href="#cb10-421" aria-hidden="true" tabindex="-1"></a><span class="fu">### Four Mysterious Datasets (Anscombe’s quartet)</span></span>
+<span id="cb10-422"><a href="#cb10-422" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-423"><a href="#cb10-423" aria-hidden="true" tabindex="-1"></a>Let's take a look at four different datasets.</span>
+<span id="cb10-424"><a href="#cb10-424" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-427"><a href="#cb10-427" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-428"><a href="#cb10-428" aria-hidden="true" tabindex="-1"></a><span class="co">#| code-fold: true</span></span>
+<span id="cb10-429"><a href="#cb10-429" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
+<span id="cb10-430"><a href="#cb10-430" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
+<span id="cb10-431"><a href="#cb10-431" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
+<span id="cb10-432"><a href="#cb10-432" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>matplotlib inline</span>
+<span id="cb10-433"><a href="#cb10-433" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
+<span id="cb10-434"><a href="#cb10-434" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> itertools</span>
+<span id="cb10-435"><a href="#cb10-435" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> mpl_toolkits.mplot3d <span class="im">import</span> Axes3D</span>
+<span id="cb10-436"><a href="#cb10-436" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-437"><a href="#cb10-437" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-440"><a href="#cb10-440" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-441"><a href="#cb10-441" aria-hidden="true" tabindex="-1"></a><span class="co">#| code-fold: true</span></span>
+<span id="cb10-442"><a href="#cb10-442" aria-hidden="true" tabindex="-1"></a><span class="co"># Big font helper</span></span>
+<span id="cb10-443"><a href="#cb10-443" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> adjust_fontsize(size<span class="op">=</span><span class="va">None</span>):</span>
+<span id="cb10-444"><a href="#cb10-444" aria-hidden="true" tabindex="-1"></a>    SMALL_SIZE <span class="op">=</span> <span class="dv">8</span></span>
+<span id="cb10-445"><a href="#cb10-445" aria-hidden="true" tabindex="-1"></a>    MEDIUM_SIZE <span class="op">=</span> <span class="dv">10</span></span>
+<span id="cb10-446"><a href="#cb10-446" aria-hidden="true" tabindex="-1"></a>    BIGGER_SIZE <span class="op">=</span> <span class="dv">12</span></span>
+<span id="cb10-447"><a href="#cb10-447" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> size <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:</span>
+<span id="cb10-448"><a href="#cb10-448" aria-hidden="true" tabindex="-1"></a>        SMALL_SIZE <span class="op">=</span> MEDIUM_SIZE <span class="op">=</span> BIGGER_SIZE <span class="op">=</span> size</span>
+<span id="cb10-449"><a href="#cb10-449" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-450"><a href="#cb10-450" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"font"</span>, size<span class="op">=</span>SMALL_SIZE)  <span class="co"># controls default text sizes</span></span>
+<span id="cb10-451"><a href="#cb10-451" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"axes"</span>, titlesize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the axes title</span></span>
+<span id="cb10-452"><a href="#cb10-452" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"axes"</span>, labelsize<span class="op">=</span>MEDIUM_SIZE)  <span class="co"># fontsize of the x and y labels</span></span>
+<span id="cb10-453"><a href="#cb10-453" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"xtick"</span>, labelsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the tick labels</span></span>
+<span id="cb10-454"><a href="#cb10-454" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"ytick"</span>, labelsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># fontsize of the tick labels</span></span>
+<span id="cb10-455"><a href="#cb10-455" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"legend"</span>, fontsize<span class="op">=</span>SMALL_SIZE)  <span class="co"># legend fontsize</span></span>
+<span id="cb10-456"><a href="#cb10-456" aria-hidden="true" tabindex="-1"></a>    plt.rc(<span class="st">"figure"</span>, titlesize<span class="op">=</span>BIGGER_SIZE)  <span class="co"># fontsize of the figure title</span></span>
+<span id="cb10-457"><a href="#cb10-457" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-458"><a href="#cb10-458" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-459"><a href="#cb10-459" aria-hidden="true" tabindex="-1"></a><span class="co"># Helper functions</span></span>
+<span id="cb10-460"><a href="#cb10-460" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> standard_units(x):</span>
+<span id="cb10-461"><a href="#cb10-461" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> (x <span class="op">-</span> np.mean(x)) <span class="op">/</span> np.std(x)</span>
+<span id="cb10-462"><a href="#cb10-462" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-463"><a href="#cb10-463" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-464"><a href="#cb10-464" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> correlation(x, y):</span>
+<span id="cb10-465"><a href="#cb10-465" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean(standard_units(x) <span class="op">*</span> standard_units(y))</span>
+<span id="cb10-466"><a href="#cb10-466" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-467"><a href="#cb10-467" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-468"><a href="#cb10-468" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> slope(x, y):</span>
+<span id="cb10-469"><a href="#cb10-469" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> correlation(x, y) <span class="op">*</span> np.std(y) <span class="op">/</span> np.std(x)</span>
+<span id="cb10-470"><a href="#cb10-470" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-471"><a href="#cb10-471" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-472"><a href="#cb10-472" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> intercept(x, y):</span>
+<span id="cb10-473"><a href="#cb10-473" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean(y) <span class="op">-</span> slope(x, y) <span class="op">*</span> np.mean(x)</span>
+<span id="cb10-474"><a href="#cb10-474" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-475"><a href="#cb10-475" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-476"><a href="#cb10-476" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> fit_least_squares(x, y):</span>
+<span id="cb10-477"><a href="#cb10-477" aria-hidden="true" tabindex="-1"></a>    theta_0 <span class="op">=</span> intercept(x, y)</span>
+<span id="cb10-478"><a href="#cb10-478" aria-hidden="true" tabindex="-1"></a>    theta_1 <span class="op">=</span> slope(x, y)</span>
+<span id="cb10-479"><a href="#cb10-479" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> theta_0, theta_1</span>
+<span id="cb10-480"><a href="#cb10-480" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-481"><a href="#cb10-481" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-482"><a href="#cb10-482" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> predict(x, theta_0, theta_1):</span>
+<span id="cb10-483"><a href="#cb10-483" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> theta_0 <span class="op">+</span> theta_1 <span class="op">*</span> x</span>
+<span id="cb10-484"><a href="#cb10-484" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-485"><a href="#cb10-485" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-486"><a href="#cb10-486" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> compute_mse(y, yhat):</span>
+<span id="cb10-487"><a href="#cb10-487" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> np.mean((y <span class="op">-</span> yhat) <span class="op">**</span> <span class="dv">2</span>)</span>
+<span id="cb10-488"><a href="#cb10-488" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-489"><a href="#cb10-489" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-490"><a href="#cb10-490" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">"default"</span>)  <span class="co"># Revert style to default mpl</span></span>
+<span id="cb10-491"><a href="#cb10-491" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-492"><a href="#cb10-492" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-495"><a href="#cb10-495" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-496"><a href="#cb10-496" aria-hidden="true" tabindex="-1"></a>plt.style.use(<span class="st">"default"</span>)  <span class="co"># Revert style to default mpl</span></span>
+<span id="cb10-497"><a href="#cb10-497" aria-hidden="true" tabindex="-1"></a>NO_VIZ, RESID, RESID_SCATTER <span class="op">=</span> <span class="bu">range</span>(<span class="dv">3</span>)</span>
+<span id="cb10-498"><a href="#cb10-498" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-499"><a href="#cb10-499" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-500"><a href="#cb10-500" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> least_squares_evaluation(x, y, visualize<span class="op">=</span>NO_VIZ):</span>
+<span id="cb10-501"><a href="#cb10-501" aria-hidden="true" tabindex="-1"></a>    <span class="co"># statistics</span></span>
+<span id="cb10-502"><a href="#cb10-502" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"x_mean : </span><span class="sc">{</span>np<span class="sc">.</span>mean(x)<span class="sc">:.2f}</span><span class="ss">, y_mean : </span><span class="sc">{</span>np<span class="sc">.</span>mean(y)<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb10-503"><a href="#cb10-503" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"x_stdev: </span><span class="sc">{</span>np<span class="sc">.</span>std(x)<span class="sc">:.2f}</span><span class="ss">, y_stdev: </span><span class="sc">{</span>np<span class="sc">.</span>std(y)<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb10-504"><a href="#cb10-504" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"r = Correlation(x, y): </span><span class="sc">{</span>correlation(x, y)<span class="sc">:.3f}</span><span class="ss">"</span>)</span>
+<span id="cb10-505"><a href="#cb10-505" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-506"><a href="#cb10-506" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Performance metrics</span></span>
+<span id="cb10-507"><a href="#cb10-507" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb10-508"><a href="#cb10-508" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb10-509"><a href="#cb10-509" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"theta_0: </span><span class="sc">{</span>ahat<span class="sc">:.2f}</span><span class="ss">, theta_1: </span><span class="sc">{</span>bhat<span class="sc">:.2f}</span><span class="ss">"</span>)</span>
+<span id="cb10-510"><a href="#cb10-510" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"RMSE: </span><span class="sc">{</span>np<span class="sc">.</span>sqrt(compute_mse(y, yhat))<span class="sc">:.3f}</span><span class="ss">"</span>)</span>
+<span id="cb10-511"><a href="#cb10-511" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-512"><a href="#cb10-512" aria-hidden="true" tabindex="-1"></a>    <span class="co"># visualization</span></span>
+<span id="cb10-513"><a href="#cb10-513" aria-hidden="true" tabindex="-1"></a>    fig, ax_resid <span class="op">=</span> <span class="va">None</span>, <span class="va">None</span></span>
+<span id="cb10-514"><a href="#cb10-514" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> visualize <span class="op">==</span> RESID_SCATTER:</span>
+<span id="cb10-515"><a href="#cb10-515" aria-hidden="true" tabindex="-1"></a>        fig, axs <span class="op">=</span> plt.subplots(<span class="dv">1</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">8</span>, <span class="dv">3</span>))</span>
+<span id="cb10-516"><a href="#cb10-516" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].scatter(x, y)</span>
+<span id="cb10-517"><a href="#cb10-517" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].plot(x, yhat)</span>
+<span id="cb10-518"><a href="#cb10-518" aria-hidden="true" tabindex="-1"></a>        axs[<span class="dv">0</span>].set_title(<span class="st">"LS fit"</span>)</span>
+<span id="cb10-519"><a href="#cb10-519" aria-hidden="true" tabindex="-1"></a>        ax_resid <span class="op">=</span> axs[<span class="dv">1</span>]</span>
+<span id="cb10-520"><a href="#cb10-520" aria-hidden="true" tabindex="-1"></a>    <span class="cf">elif</span> visualize <span class="op">==</span> RESID:</span>
+<span id="cb10-521"><a href="#cb10-521" aria-hidden="true" tabindex="-1"></a>        fig <span class="op">=</span> plt.figure(figsize<span class="op">=</span>(<span class="dv">4</span>, <span class="dv">3</span>))</span>
+<span id="cb10-522"><a href="#cb10-522" aria-hidden="true" tabindex="-1"></a>        ax_resid <span class="op">=</span> plt.gca()</span>
+<span id="cb10-523"><a href="#cb10-523" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-524"><a href="#cb10-524" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> ax_resid <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>:</span>
+<span id="cb10-525"><a href="#cb10-525" aria-hidden="true" tabindex="-1"></a>        ax_resid.scatter(x, y <span class="op">-</span> yhat, color<span class="op">=</span><span class="st">"red"</span>)</span>
+<span id="cb10-526"><a href="#cb10-526" aria-hidden="true" tabindex="-1"></a>        ax_resid.axhline(<span class="dv">0</span>, color<span class="op">=</span><span class="st">"black"</span>)  <span class="co"># zero line spanning the full x range</span></span>
+<span id="cb10-527"><a href="#cb10-527" aria-hidden="true" tabindex="-1"></a>        ax_resid.set_title(<span class="st">"Residuals"</span>)</span>
+<span id="cb10-528"><a href="#cb10-528" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-529"><a href="#cb10-529" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> fig</span>
+<span id="cb10-530"><a href="#cb10-530" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-531"><a href="#cb10-531" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-534"><a href="#cb10-534" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-535"><a href="#cb10-535" aria-hidden="true" tabindex="-1"></a><span class="co">#| code-fold: true</span></span>
+<span id="cb10-536"><a href="#cb10-536" aria-hidden="true" tabindex="-1"></a><span class="co"># Load in four different datasets: I, II, III, IV</span></span>
+<span id="cb10-537"><a href="#cb10-537" aria-hidden="true" tabindex="-1"></a>x <span class="op">=</span> [<span class="dv">10</span>, <span class="dv">8</span>, <span class="dv">13</span>, <span class="dv">9</span>, <span class="dv">11</span>, <span class="dv">14</span>, <span class="dv">6</span>, <span class="dv">4</span>, <span class="dv">12</span>, <span class="dv">7</span>, <span class="dv">5</span>]</span>
+<span id="cb10-538"><a href="#cb10-538" aria-hidden="true" tabindex="-1"></a>y1 <span class="op">=</span> [<span class="fl">8.04</span>, <span class="fl">6.95</span>, <span class="fl">7.58</span>, <span class="fl">8.81</span>, <span class="fl">8.33</span>, <span class="fl">9.96</span>, <span class="fl">7.24</span>, <span class="fl">4.26</span>, <span class="fl">10.84</span>, <span class="fl">4.82</span>, <span class="fl">5.68</span>]</span>
+<span id="cb10-539"><a href="#cb10-539" aria-hidden="true" tabindex="-1"></a>y2 <span class="op">=</span> [<span class="fl">9.14</span>, <span class="fl">8.14</span>, <span class="fl">8.74</span>, <span class="fl">8.77</span>, <span class="fl">9.26</span>, <span class="fl">8.10</span>, <span class="fl">6.13</span>, <span class="fl">3.10</span>, <span class="fl">9.13</span>, <span class="fl">7.26</span>, <span class="fl">4.74</span>]</span>
+<span id="cb10-540"><a href="#cb10-540" aria-hidden="true" tabindex="-1"></a>y3 <span class="op">=</span> [<span class="fl">7.46</span>, <span class="fl">6.77</span>, <span class="fl">12.74</span>, <span class="fl">7.11</span>, <span class="fl">7.81</span>, <span class="fl">8.84</span>, <span class="fl">6.08</span>, <span class="fl">5.39</span>, <span class="fl">8.15</span>, <span class="fl">6.42</span>, <span class="fl">5.73</span>]</span>
+<span id="cb10-541"><a href="#cb10-541" aria-hidden="true" tabindex="-1"></a>x4 <span class="op">=</span> [<span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">19</span>, <span class="dv">8</span>, <span class="dv">8</span>, <span class="dv">8</span>]</span>
+<span id="cb10-542"><a href="#cb10-542" aria-hidden="true" tabindex="-1"></a>y4 <span class="op">=</span> [<span class="fl">6.58</span>, <span class="fl">5.76</span>, <span class="fl">7.71</span>, <span class="fl">8.84</span>, <span class="fl">8.47</span>, <span class="fl">7.04</span>, <span class="fl">5.25</span>, <span class="fl">12.50</span>, <span class="fl">5.56</span>, <span class="fl">7.91</span>, <span class="fl">6.89</span>]</span>
+<span id="cb10-543"><a href="#cb10-543" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-544"><a href="#cb10-544" aria-hidden="true" tabindex="-1"></a>anscombe <span class="op">=</span> {</span>
+<span id="cb10-545"><a href="#cb10-545" aria-hidden="true" tabindex="-1"></a>    <span class="st">"I"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y1)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb10-546"><a href="#cb10-546" aria-hidden="true" tabindex="-1"></a>    <span class="st">"II"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y2)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb10-547"><a href="#cb10-547" aria-hidden="true" tabindex="-1"></a>    <span class="st">"III"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x, y3)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb10-548"><a href="#cb10-548" aria-hidden="true" tabindex="-1"></a>    <span class="st">"IV"</span>: pd.DataFrame(<span class="bu">list</span>(<span class="bu">zip</span>(x4, y4)), columns<span class="op">=</span>[<span class="st">"x"</span>, <span class="st">"y"</span>]),</span>
+<span id="cb10-549"><a href="#cb10-549" aria-hidden="true" tabindex="-1"></a>}</span>
+<span id="cb10-550"><a href="#cb10-550" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-551"><a href="#cb10-551" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot the scatter plot and line of best fit</span></span>
+<span id="cb10-552"><a href="#cb10-552" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb10-553"><a href="#cb10-553" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-554"><a href="#cb10-554" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i, dataset <span class="kw">in</span> <span class="bu">enumerate</span>([<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]):</span>
+<span id="cb10-555"><a href="#cb10-555" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb10-556"><a href="#cb10-556" aria-hidden="true" tabindex="-1"></a>    x, y <span class="op">=</span> ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>]</span>
+<span id="cb10-557"><a href="#cb10-557" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb10-558"><a href="#cb10-558" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb10-559"><a href="#cb10-559" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].scatter(x, y, alpha<span class="op">=</span><span class="fl">0.6</span>, color<span class="op">=</span><span class="st">"red"</span>)  <span class="co"># plot the x, y points</span></span>
+<span id="cb10-560"><a href="#cb10-560" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].plot(x, yhat)  <span class="co"># plot the line of best fit</span></span>
+<span id="cb10-561"><a href="#cb10-561" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_xlabel(<span class="ss">f"$x_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb10-562"><a href="#cb10-562" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_ylabel(<span class="ss">f"$y_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb10-563"><a href="#cb10-563" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_title(<span class="ss">f"Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss">"</span>)</span>
+<span id="cb10-564"><a href="#cb10-564" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-565"><a href="#cb10-565" aria-hidden="true" tabindex="-1"></a>plt.show()</span>
+<span id="cb10-566"><a href="#cb10-566" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-567"><a href="#cb10-567" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-568"><a href="#cb10-568" aria-hidden="true" tabindex="-1"></a>While these four sets of datapoints look very different, they all have nearly identical means $\bar x$, $\bar y$, standard deviations $\sigma_x$, $\sigma_y$, correlation $r$, and RMSE! If we looked only at these statistics, we would probably be inclined to say that these datasets are similar.</span>
+<span id="cb10-569"><a href="#cb10-569" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-572"><a href="#cb10-572" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-573"><a href="#cb10-573" aria-hidden="true" tabindex="-1"></a><span class="co">#| code-fold: true</span></span>
+<span id="cb10-574"><a href="#cb10-574" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> dataset <span class="kw">in</span> [<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]:</span>
+<span id="cb10-575"><a href="#cb10-575" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f"&gt;&gt;&gt; Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss">:"</span>)</span>
+<span id="cb10-576"><a href="#cb10-576" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb10-577"><a href="#cb10-577" aria-hidden="true" tabindex="-1"></a>    fig <span class="op">=</span> least_squares_evaluation(ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>], visualize<span class="op">=</span>NO_VIZ)</span>
+<span id="cb10-578"><a href="#cb10-578" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>()</span>
+<span id="cb10-579"><a href="#cb10-579" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>()</span>
+<span id="cb10-580"><a href="#cb10-580" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+<span id="cb10-581"><a href="#cb10-581" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-582"><a href="#cb10-582" aria-hidden="true" tabindex="-1"></a>We may also wish to visualize the model's **residuals**, defined as the difference between each observed value $y_i$ and its prediction $\hat{y}_i$ ($e_i = y_i - \hat{y}_i$). This gives a high-level view of how far "off" each prediction is from the true observed value. Recall that you explored this concept in <span class="co">[</span><span class="ot">Data 8</span><span class="co">](https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html?highlight=heteroscedasticity#detecting-heteroscedasticity)</span>: a good regression fit should display no clear pattern in its plot of residuals. The residual plots for Anscombe's quartet are displayed below. Note how only the first plot shows no clear pattern in its residuals. This suggests that SLR is not the best choice of model for the remaining three datasets.</span>
+<span id="cb10-583"><a href="#cb10-583" aria-hidden="true" tabindex="-1"></a></span>
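To see why a pattern in the residuals is a warning sign, here is a minimal sketch that fits a line to deliberately curved data. The arrays are hypothetical and `np.polyfit` stands in for the least squares fit; it is not the helper code used elsewhere in this section.

```python
import numpy as np

# Hypothetical data with a purely quadratic relationship
x = np.arange(-3, 4, dtype=float)
y = x ** 2

# Least squares line: polyfit returns (slope, intercept) for deg=1
theta_1, theta_0 = np.polyfit(x, y, deg=1)
residuals = y - (theta_0 + theta_1 * x)

# The residuals of a least squares line with an intercept always sum to
# (numerically) zero, so the sum alone cannot reveal a poor model choice
print(np.round(residuals, 2))
print(np.isclose(residuals.sum(), 0.0))
```

The residuals are positive at both ends of the $x$ range and negative in the middle. That U-shape is the kind of pattern a residual plot exposes, even though the residuals still average out to zero.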
+<span id="cb10-584"><a href="#cb10-584" aria-hidden="true" tabindex="-1"></a><span class="co">&lt;!-- &lt;img src="images/residual.png" alt='residual' width='600'&gt; --&gt;</span></span>
+<span id="cb10-585"><a href="#cb10-585" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-588"><a href="#cb10-588" aria-hidden="true" tabindex="-1"></a><span class="in">```{python}</span></span>
+<span id="cb10-589"><a href="#cb10-589" aria-hidden="true" tabindex="-1"></a><span class="co">#| code-fold: true</span></span>
+<span id="cb10-590"><a href="#cb10-590" aria-hidden="true" tabindex="-1"></a><span class="co"># Residual visualization</span></span>
+<span id="cb10-591"><a href="#cb10-591" aria-hidden="true" tabindex="-1"></a>fig, axs <span class="op">=</span> plt.subplots(<span class="dv">2</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">10</span>))</span>
+<span id="cb10-592"><a href="#cb10-592" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-593"><a href="#cb10-593" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i, dataset <span class="kw">in</span> <span class="bu">enumerate</span>([<span class="st">"I"</span>, <span class="st">"II"</span>, <span class="st">"III"</span>, <span class="st">"IV"</span>]):</span>
+<span id="cb10-594"><a href="#cb10-594" aria-hidden="true" tabindex="-1"></a>    ans <span class="op">=</span> anscombe[dataset]</span>
+<span id="cb10-595"><a href="#cb10-595" aria-hidden="true" tabindex="-1"></a>    x, y <span class="op">=</span> ans[<span class="st">"x"</span>], ans[<span class="st">"y"</span>]</span>
+<span id="cb10-596"><a href="#cb10-596" aria-hidden="true" tabindex="-1"></a>    ahat, bhat <span class="op">=</span> fit_least_squares(x, y)</span>
+<span id="cb10-597"><a href="#cb10-597" aria-hidden="true" tabindex="-1"></a>    yhat <span class="op">=</span> predict(x, ahat, bhat)</span>
+<span id="cb10-598"><a href="#cb10-598" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].scatter(</span>
+<span id="cb10-599"><a href="#cb10-599" aria-hidden="true" tabindex="-1"></a>        x, y <span class="op">-</span> yhat, alpha<span class="op">=</span><span class="fl">0.6</span>, color<span class="op">=</span><span class="st">"red"</span></span>
+<span id="cb10-600"><a href="#cb10-600" aria-hidden="true" tabindex="-1"></a>    )  <span class="co"># plot the residuals (y - yhat) against x</span></span>
+<span id="cb10-601"><a href="#cb10-601" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].plot(</span>
+<span id="cb10-602"><a href="#cb10-602" aria-hidden="true" tabindex="-1"></a>        x, np.zeros_like(x), color<span class="op">=</span><span class="st">"black"</span></span>
+<span id="cb10-603"><a href="#cb10-603" aria-hidden="true" tabindex="-1"></a>    )  <span class="co"># plot the zero-residual reference line</span></span>
+<span id="cb10-604"><a href="#cb10-604" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_xlabel(<span class="ss">f"$x_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb10-605"><a href="#cb10-605" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_ylabel(<span class="ss">f"$e_</span><span class="sc">{</span>i<span class="op">+</span><span class="dv">1</span><span class="sc">}</span><span class="ss">$"</span>)</span>
+<span id="cb10-606"><a href="#cb10-606" aria-hidden="true" tabindex="-1"></a>    axs[i <span class="op">//</span> <span class="dv">2</span>, i <span class="op">%</span> <span class="dv">2</span>].set_title(<span class="ss">f"Dataset </span><span class="sc">{</span>dataset<span class="sc">}</span><span class="ss"> Residuals"</span>)</span>
+<span id="cb10-607"><a href="#cb10-607" aria-hidden="true" tabindex="-1"></a></span>
+<span id="cb10-608"><a href="#cb10-608" aria-hidden="true" tabindex="-1"></a>plt.show()</span>
+<span id="cb10-609"><a href="#cb10-609" aria-hidden="true" tabindex="-1"></a><span class="in">```</span></span>
+</code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</div></div></div></div></div>
+</div> <!-- /content -->
+
+
+
+
+</body></html>
\ No newline at end of file
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-2-output-1.png b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-2-output-1.png
new file mode 100644
index 00000000..fe24beb6
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-2-output-1.png differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-3-output-1.png b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-3-output-1.png
new file mode 100644
index 00000000..77b58e06
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-3-output-1.png differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-7-output-1.png b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-7-output-1.png
new file mode 100644
index 00000000..3b44a8d7
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-7-output-1.png differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-9-output-1.png b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-9-output-1.png
new file mode 100644
index 00000000..87f9e7be
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-html/cell-9-output-1.png differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-2-output-1.pdf b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-2-output-1.pdf
new file mode 100644
index 00000000..9820e715
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-2-output-1.pdf differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-3-output-1.pdf b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-3-output-1.pdf
new file mode 100644
index 00000000..0dff430a
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-3-output-1.pdf differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-7-output-1.pdf b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-7-output-1.pdf
new file mode 100644
index 00000000..336cbb6e
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-7-output-1.pdf differ
diff --git a/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-9-output-1.pdf b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-9-output-1.pdf
new file mode 100644
index 00000000..68ea58fd
Binary files /dev/null and b/docs/intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-9-output-1.pdf differ
diff --git a/docs/pandas_1/pandas_1.html b/docs/pandas_1/pandas_1.html
index a3aed9ae..ddbe72cc 100644
--- a/docs/pandas_1/pandas_1.html
+++ b/docs/pandas_1/pandas_1.html
@@ -237,6 +237,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -355,7 +361,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="tabular-data"><span class
 <section id="series-dataframes-and-indices" class="level2" data-number="2.2">
 <h2 data-number="2.2" class="anchored" data-anchor-id="series-dataframes-and-indices"><span class="header-section-number">2.2</span> <code>Series</code>, <code>DataFrame</code>s, and Indices</h2>
 <p>To begin our work in <code>pandas</code>, we must first import the library into our Python environment. This will allow us to use <code>pandas</code> data structures and methods in our code.</p>
-<div id="4d31a8d7" class="cell" data-execution_count="1">
+<div id="0f9a7ca7" class="cell" data-execution_count="1">
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co"># `pd` is the conventional alias for Pandas, as `np` is for NumPy</span></span>
 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
@@ -378,7 +384,7 @@ <h3 data-number="2.2.1" class="anchored" data-anchor-id="series"><span class="he
 <li>A sequence of data labels called the <strong>index</strong>.</li>
 </ul>
 <p>In the cell below, we create a <code>Series</code> named <code>s</code>.</p>
-<div id="fd9b4f67" class="cell" data-execution_count="2">
+<div id="99ba0ef5" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>s <span class="op">=</span> pd.Series([<span class="st">"welcome"</span>, <span class="st">"to"</span>, <span class="st">"data 100"</span>])</span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>s</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="2">
@@ -388,14 +394,14 @@ <h3 data-number="2.2.1" class="anchored" data-anchor-id="series"><span class="he
 dtype: object</code></pre>
 </div>
 </div>
-<div id="200f1f04" class="cell" data-execution_count="3">
+<div id="503a2c24" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a> <span class="co"># Accessing data values within the Series</span></span>
 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> s.values</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="3">
 <pre><code>array(['welcome', 'to', 'data 100'], dtype=object)</code></pre>
 </div>
 </div>
-<div id="91087707" class="cell" data-execution_count="4">
+<div id="4730f3e1" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a> <span class="co"># Accessing the Index of the Series</span></span>
 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> s.index</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="4">
@@ -403,7 +409,7 @@ <h3 data-number="2.2.1" class="anchored" data-anchor-id="series"><span class="he
 </div>
 </div>
 <p>By default, the <code>index</code> of a <code>Series</code> is a sequential list of integers beginning from 0. Optionally, a manually specified list of desired indices can be passed to the <code>index</code> argument.</p>
-<div id="eb07db5f" class="cell" data-execution_count="5">
+<div id="4ebfdc5b" class="cell" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>s <span class="op">=</span> pd.Series([<span class="op">-</span><span class="dv">1</span>, <span class="dv">10</span>, <span class="dv">2</span>], index <span class="op">=</span> [<span class="st">"a"</span>, <span class="st">"b"</span>, <span class="st">"c"</span>])</span>
 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>s</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="5">
@@ -413,14 +419,14 @@ <h3 data-number="2.2.1" class="anchored" data-anchor-id="series"><span class="he
 dtype: int64</code></pre>
 </div>
 </div>
-<div id="388cff6b" class="cell" data-execution_count="6">
+<div id="39b5197a" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>s.index</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="6">
 <pre><code>Index(['a', 'b', 'c'], dtype='object')</code></pre>
 </div>
 </div>
 <p>Indices can also be changed after initialization.</p>
-<div id="a31c7562" class="cell" data-execution_count="7">
+<div id="7e74cdbf" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>s.index <span class="op">=</span> [<span class="st">"first"</span>, <span class="st">"second"</span>, <span class="st">"third"</span>]</span>
 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>s</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="7">
@@ -430,7 +436,7 @@ <h3 data-number="2.2.1" class="anchored" data-anchor-id="series"><span class="he
 dtype: int64</code></pre>
 </div>
 </div>
-<div id="34a166f2" class="cell" data-execution_count="8">
+<div id="c43615dd" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>s.index</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="8">
 <pre><code>Index(['first', 'second', 'third'], dtype='object')</code></pre>
@@ -445,7 +451,7 @@ <h4 data-number="2.2.1.1" class="anchored" data-anchor-id="selection-in-series">
 <li>A filtering condition.</li>
 </ol>
 <p>To demonstrate this, let’s define a new Series <code>s</code>.</p>
-<div id="64226f2a" class="cell" data-execution_count="9">
+<div id="c5ee8de6" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>s <span class="op">=</span> pd.Series([<span class="dv">4</span>, <span class="op">-</span><span class="dv">2</span>, <span class="dv">0</span>, <span class="dv">6</span>], index <span class="op">=</span> [<span class="st">"a"</span>, <span class="st">"b"</span>, <span class="st">"c"</span>, <span class="st">"d"</span>])</span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>s</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="9">
@@ -458,7 +464,7 @@ <h4 data-number="2.2.1.1" class="anchored" data-anchor-id="selection-in-series">
 </div>
 <section id="a-single-label" class="level5" data-number="2.2.1.1.1">
 <h5 data-number="2.2.1.1.1" class="anchored" data-anchor-id="a-single-label"><span class="header-section-number">2.2.1.1.1</span> A Single Label</h5>
-<div id="df9ef640" class="cell" data-execution_count="10">
+<div id="abe5a498" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We return the value stored at the index label "a"</span></span>
 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>s[<span class="st">"a"</span>] </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="10">
@@ -468,7 +474,7 @@ <h5 data-number="2.2.1.1.1" class="anchored" data-anchor-id="a-single-label"><sp
 </section>
 <section id="a-list-of-labels" class="level5" data-number="2.2.1.1.2">
 <h5 data-number="2.2.1.1.2" class="anchored" data-anchor-id="a-list-of-labels"><span class="header-section-number">2.2.1.1.2</span> A List of Labels</h5>
-<div id="aaecd78f" class="cell" data-execution_count="11">
+<div id="4c904250" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We return a Series of the values stored at the index labels "a" and "c"</span></span>
 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a>s[[<span class="st">"a"</span>, <span class="st">"c"</span>]] </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="11">
@@ -482,7 +488,7 @@ <h5 data-number="2.2.1.1.2" class="anchored" data-anchor-id="a-list-of-labels"><
 <h5 data-number="2.2.1.1.3" class="anchored" data-anchor-id="a-filtering-condition"><span class="header-section-number">2.2.1.1.3</span> A Filtering Condition</h5>
 <p>Perhaps the most interesting (and useful) method of selecting data from a <code>Series</code> is by using a filtering condition.</p>
 <p>First, we apply a boolean operation to the <code>Series</code>. This creates <strong>a new <code>Series</code> of boolean values</strong>.</p>
-<div id="826086da" class="cell" data-execution_count="12">
+<div id="6480093e" class="cell" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Filter condition: select all elements greater than 0</span></span>
 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>s <span class="op">&gt;</span> <span class="dv">0</span> </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="12">
@@ -494,7 +500,7 @@ <h5 data-number="2.2.1.1.3" class="anchored" data-anchor-id="a-filtering-conditi
 </div>
 </div>
 <p>We then use this boolean condition to index into our original <code>Series</code>. <code>pandas</code> will select only the entries in the original <code>Series</code> that satisfy the condition.</p>
-<div id="56ce3143" class="cell" data-execution_count="13">
+<div id="8d5e25fc" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb24"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a>s[s <span class="op">&gt;</span> <span class="dv">0</span>] </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="13">
 <pre><code>a    4
@@ -524,7 +530,7 @@ <h4 data-number="2.2.2.1" class="anchored" data-anchor-id="creating-a-dataframe"
 <h5 data-number="2.2.2.1.1" class="anchored" data-anchor-id="from-a-csv-file"><span class="header-section-number">2.2.2.1.1</span> From a CSV file</h5>
 <p>In Data 100, our data are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a <code>DataFrame</code> by passing the data path as an argument to the following <code>pandas</code> function. <br>  <code>pd.read_csv("filename.csv")</code></p>
 <p>With our new understanding of <code>pandas</code> in hand, let’s return to the <code>elections</code> dataset from before. Now, we can recognize that it is represented as a <code>pandas</code> <code>DataFrame</code>.</p>
-<div id="a5ca19b6" class="cell" data-execution_count="14">
+<div id="cfc6cdbd" class="cell" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a>elections <span class="op">=</span> pd.read_csv(<span class="st">"data/elections.csv"</span>)</span>
 <span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a>elections</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="14">
@@ -656,7 +662,7 @@ <h5 data-number="2.2.2.1.1" class="anchored" data-anchor-id="from-a-csv-file"><s
 <h5 data-number="2.2.2.1.2" class="anchored" data-anchor-id="using-a-list-and-column-names"><span class="header-section-number">2.2.2.1.2</span> Using a List and Column Name(s)</h5>
 <p>We’ll now explore creating a <code>DataFrame</code> with data of our own.</p>
 <p>Consider the following examples. The first code cell creates a <code>DataFrame</code> with a single column <code>Numbers</code>.</p>
-<div id="815a159d" class="cell" data-execution_count="15">
+<div id="42cd5309" class="cell" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb28"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a>df_list <span class="op">=</span> pd.DataFrame([<span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>], columns<span class="op">=</span>[<span class="st">"Numbers"</span>])</span>
 <span id="cb28-2"><a href="#cb28-2" aria-hidden="true" tabindex="-1"></a>df_list</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="15">
@@ -690,7 +696,7 @@ <h5 data-number="2.2.2.1.2" class="anchored" data-anchor-id="using-a-list-and-co
 </div>
 </div>
 <p>The second creates a <code>DataFrame</code> with the columns <code>Number</code> and <code>Description</code>. Notice how a 2D list of values is required to initialize the second <code>DataFrame</code>: each nested list represents a single row of data.</p>
-<div id="b61eab2d" class="cell" data-execution_count="16">
+<div id="a518cb41" class="cell" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a>df_list <span class="op">=</span> pd.DataFrame([[<span class="dv">1</span>, <span class="st">"one"</span>], [<span class="dv">2</span>, <span class="st">"two"</span>]], columns <span class="op">=</span> [<span class="st">"Number"</span>, <span class="st">"Description"</span>])</span>
 <span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a>df_list</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="16">
@@ -727,7 +733,7 @@ <h5 data-number="2.2.2.1.2" class="anchored" data-anchor-id="using-a-list-and-co
 <h5 data-number="2.2.2.1.3" class="anchored" data-anchor-id="from-a-dictionary"><span class="header-section-number">2.2.2.1.3</span> From a Dictionary</h5>
 <p>A third (and more common) way to create a <code>DataFrame</code> is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.</p>
 <p>Below are two ways of implementing this approach. The first is based on specifying the columns of the <code>DataFrame</code>, whereas the second is based on specifying the rows of the <code>DataFrame</code>.</p>
-<div id="332bb4d5" class="cell" data-execution_count="17">
+<div id="560ed704" class="cell" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb30"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true" tabindex="-1"></a>df_dict <span class="op">=</span> pd.DataFrame({</span>
 <span id="cb30-2"><a href="#cb30-2" aria-hidden="true" tabindex="-1"></a>    <span class="st">"Fruit"</span>: [<span class="st">"Strawberry"</span>, <span class="st">"Orange"</span>], </span>
 <span id="cb30-3"><a href="#cb30-3" aria-hidden="true" tabindex="-1"></a>    <span class="st">"Price"</span>: [<span class="fl">5.49</span>, <span class="fl">3.99</span>]</span>
@@ -762,7 +768,7 @@ <h5 data-number="2.2.2.1.3" class="anchored" data-anchor-id="from-a-dictionary">
 </div>
 </div>
 </div>
-<div id="02f7eb53" class="cell" data-execution_count="18">
+<div id="f682e0dd" class="cell" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb31"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a>df_dict <span class="op">=</span> pd.DataFrame(</span>
 <span id="cb31-2"><a href="#cb31-2" aria-hidden="true" tabindex="-1"></a>    [</span>
 <span id="cb31-3"><a href="#cb31-3" aria-hidden="true" tabindex="-1"></a>        {<span class="st">"Fruit"</span>:<span class="st">"Strawberry"</span>, <span class="st">"Price"</span>:<span class="fl">5.49</span>}, </span>
@@ -804,14 +810,14 @@ <h5 data-number="2.2.2.1.3" class="anchored" data-anchor-id="from-a-dictionary">
 <h5 data-number="2.2.2.1.4" class="anchored" data-anchor-id="from-a-series"><span class="header-section-number">2.2.2.1.4</span> From a <code>Series</code></h5>
 <p>Earlier, we explained how a <code>Series</code> is synonymous with a column in a <code>DataFrame</code>. It follows, then, that a <code>DataFrame</code> is equivalent to a collection of <code>Series</code>, which all share the same <code>Index</code>.</p>
 <p>In fact, we can initialize a <code>DataFrame</code> by merging two or more <code>Series</code>. Consider the <code>Series</code> <code>s_a</code> and <code>s_b</code>.</p>
-<div id="2d66060a" class="cell" data-execution_count="19">
+<div id="f3230340" class="cell" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb32"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Notice how our indices, or row labels, are the same</span></span>
 <span id="cb32-2"><a href="#cb32-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb32-3"><a href="#cb32-3" aria-hidden="true" tabindex="-1"></a>s_a <span class="op">=</span> pd.Series([<span class="st">"a1"</span>, <span class="st">"a2"</span>, <span class="st">"a3"</span>], index <span class="op">=</span> [<span class="st">"r1"</span>, <span class="st">"r2"</span>, <span class="st">"r3"</span>])</span>
 <span id="cb32-4"><a href="#cb32-4" aria-hidden="true" tabindex="-1"></a>s_b <span class="op">=</span> pd.Series([<span class="st">"b1"</span>, <span class="st">"b2"</span>, <span class="st">"b3"</span>], index <span class="op">=</span> [<span class="st">"r1"</span>, <span class="st">"r2"</span>, <span class="st">"r3"</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>We can turn individual <code>Series</code> into a <code>DataFrame</code> using two common methods (shown below):</p>
-<div id="361b91db" class="cell" data-execution_count="20">
+<div id="79c49fb4" class="cell" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb33"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a>pd.DataFrame(s_a)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="20">
 <div>
@@ -843,7 +849,7 @@ <h5 data-number="2.2.2.1.4" class="anchored" data-anchor-id="from-a-series"><spa
 </div>
 </div>
 </div>
-<div id="06c4c931" class="cell" data-execution_count="21">
+<div id="6770fde6" class="cell" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb34"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a>s_b.to_frame()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="21">
 <div>
@@ -876,7 +882,7 @@ <h5 data-number="2.2.2.1.4" class="anchored" data-anchor-id="from-a-series"><spa
 </div>
 </div>
 <p>To merge the two <code>Series</code> and specify their column names, we use the following syntax:</p>
-<div id="a507e6cf" class="cell" data-execution_count="22">
+<div id="f50f25fd" class="cell" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb35"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true" tabindex="-1"></a>pd.DataFrame({</span>
 <span id="cb35-2"><a href="#cb35-2" aria-hidden="true" tabindex="-1"></a>    <span class="st">"A-column"</span>: s_a, </span>
 <span id="cb35-3"><a href="#cb35-3" aria-hidden="true" tabindex="-1"></a>    <span class="st">"B-column"</span>: s_b</span>
@@ -921,7 +927,7 @@ <h5 data-number="2.2.2.1.4" class="anchored" data-anchor-id="from-a-series"><spa
 <section id="indices" class="level3" data-number="2.2.3">
 <h3 data-number="2.2.3" class="anchored" data-anchor-id="indices"><span class="header-section-number">2.2.3</span> Indices</h3>
 <p>On a more technical note, an index doesn’t have to be an integer, nor does it have to be unique. For example, we can set the index of the <code>elections</code> <code>DataFrame</code> to be the name of presidential candidates.</p>
-<div id="b9221716" class="cell" data-execution_count="23">
+<div id="b9adbf1e" class="cell" data-execution_count="23">
 <div class="sourceCode cell-code" id="cb36"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Creating a DataFrame from a CSV file and specifying the index column</span></span>
 <span id="cb36-2"><a href="#cb36-2" aria-hidden="true" tabindex="-1"></a>elections <span class="op">=</span> pd.read_csv(<span class="st">"data/elections.csv"</span>, index_col <span class="op">=</span> <span class="st">"Candidate"</span>)</span>
 <span id="cb36-3"><a href="#cb36-3" aria-hidden="true" tabindex="-1"></a>elections</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1045,7 +1051,7 @@ <h3 data-number="2.2.3" class="anchored" data-anchor-id="indices"><span class="h
 </div>
 </div>
 <p>We can also select a new column and set it as the index of the <code>DataFrame</code>. For example, we can set the index of the <code>elections</code> <code>DataFrame</code> to represent the candidate’s party.</p>
-<div id="7871a7ba" class="cell" data-execution_count="24">
+<div id="8f277544" class="cell" data-execution_count="24">
 <div class="sourceCode cell-code" id="cb37"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a>elections.reset_index(inplace <span class="op">=</span> <span class="va">True</span>) <span class="co"># Resetting the index so we can set it again</span></span>
 <span id="cb37-2"><a href="#cb37-2" aria-hidden="true" tabindex="-1"></a><span class="co"># This sets the index to the "Party" column</span></span>
 <span id="cb37-3"><a href="#cb37-3" aria-hidden="true" tabindex="-1"></a>elections.set_index(<span class="st">"Party"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1169,7 +1175,7 @@ <h3 data-number="2.2.3" class="anchored" data-anchor-id="indices"><span class="h
 </div>
 </div>
 <p>And, if we’d like, we can revert the index back to the default list of integers.</p>
-<div id="99939153" class="cell" data-execution_count="25">
+<div id="5464fc21" class="cell" data-execution_count="25">
 <div class="sourceCode cell-code" id="cb38"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This resets the index to be the default list of integers</span></span>
 <span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>elections.reset_index(inplace<span class="op">=</span><span class="va">True</span>) </span>
 <span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a>elections.index</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1192,7 +1198,7 @@ <h3 data-number="2.2.3" class="anchored" data-anchor-id="indices"><span class="h
 <h2 data-number="2.3" class="anchored" data-anchor-id="dataframe-attributes-index-columns-and-shape"><span class="header-section-number">2.3</span> <code>DataFrame</code> Attributes: Index, Columns, and Shape</h2>
 <p>On the other hand, column names in a <code>DataFrame</code> are almost always unique. Looking back to the <code>elections</code> dataset, it wouldn’t make sense to have two columns named <code>"Candidate"</code>. Sometimes, you’ll want to extract these different values, in particular, the list of row and column labels.</p>
 <p>For index/row labels, use <code>DataFrame.index</code>:</p>
-<div id="9b4b30ab" class="cell" data-execution_count="26">
+<div id="12c6edf4" class="cell" data-execution_count="26">
 <div class="sourceCode cell-code" id="cb40"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><a href="#cb40-1" aria-hidden="true" tabindex="-1"></a>elections.set_index(<span class="st">"Party"</span>, inplace <span class="op">=</span> <span class="va">True</span>)</span>
 <span id="cb40-2"><a href="#cb40-2" aria-hidden="true" tabindex="-1"></a>elections.index</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="26">
@@ -1207,14 +1213,14 @@ <h2 data-number="2.3" class="anchored" data-anchor-id="dataframe-attributes-inde
 </div>
 </div>
 <p>For column labels, use <code>DataFrame.columns</code>:</p>
-<div id="370885fb" class="cell" data-execution_count="27">
+<div id="f47eef48" class="cell" data-execution_count="27">
 <div class="sourceCode cell-code" id="cb42"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a>elections.columns</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="27">
 <pre><code>Index(['index', 'Candidate', 'Year', 'Popular vote', 'Result', '%'], dtype='object')</code></pre>
 </div>
 </div>
 <p>And for the shape of the <code>DataFrame</code>, we can use <code>DataFrame.shape</code> to get the number of rows followed by the number of columns:</p>
-<div id="304c355c" class="cell" data-execution_count="28">
+<div id="374cfef3" class="cell" data-execution_count="28">
 <div class="sourceCode cell-code" id="cb44"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><a href="#cb44-1" aria-hidden="true" tabindex="-1"></a>elections.shape</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="28">
 <pre><code>(182, 6)</code></pre>
@@ -1243,13 +1249,13 @@ <h2 data-number="2.4" class="anchored" data-anchor-id="slicing-in-dataframes"><s
 <h3 data-number="2.4.1" class="anchored" data-anchor-id="extracting-data-with-.head-and-.tail"><span class="header-section-number">2.4.1</span> Extracting data with <code>.head</code> and <code>.tail</code></h3>
 <p>The simplest scenario in which we want to extract data is when we simply want to select the first or last few rows of the <code>DataFrame</code>.</p>
 <p>To extract the first <code>n</code> rows of a <code>DataFrame</code> <code>df</code>, we use the syntax <code>df.head(n)</code>.</p>
-<div id="f26b4978" class="cell" data-execution_count="29">
+<div id="41e06438" class="cell" data-execution_count="29">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb46"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><a href="#cb46-1" aria-hidden="true" tabindex="-1"></a>elections <span class="op">=</span> pd.read_csv(<span class="st">"data/elections.csv"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 </div>
-<div id="181893bb" class="cell" data-execution_count="30">
+<div id="234e762f" class="cell" data-execution_count="30">
 <div class="sourceCode cell-code" id="cb47"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb47-1"><a href="#cb47-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Extract the first 5 rows of the DataFrame</span></span>
 <span id="cb47-2"><a href="#cb47-2" aria-hidden="true" tabindex="-1"></a>elections.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="30">
@@ -1321,7 +1327,7 @@ <h3 data-number="2.4.1" class="anchored" data-anchor-id="extracting-data-with-.h
 </div>
 </div>
 <p>Similarly, calling <code>df.tail(n)</code> allows us to extract the last <code>n</code> rows of the <code>DataFrame</code>.</p>
-<div id="419e591a" class="cell" data-execution_count="31">
+<div id="d72b4804" class="cell" data-execution_count="31">
 <div class="sourceCode cell-code" id="cb48"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1"><a href="#cb48-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Extract the last 5 rows of the DataFrame</span></span>
 <span id="cb48-2"><a href="#cb48-2" aria-hidden="true" tabindex="-1"></a>elections.tail(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="31">
@@ -1407,14 +1413,14 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 <li>A list.</li>
 </ul>
 <p>For example, to select a single value, we can select the row labeled <code>0</code> and the column labeled <code>Candidate</code> from the <code>elections</code> <code>DataFrame</code>.</p>
-<div id="af567b10" class="cell" data-execution_count="32">
+<div id="4ce9dfa6" class="cell" data-execution_count="32">
 <div class="sourceCode cell-code" id="cb49"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1"><a href="#cb49-1" aria-hidden="true" tabindex="-1"></a>elections.loc[<span class="dv">0</span>, <span class="st">'Candidate'</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="32">
 <pre><code>'Andrew Jackson'</code></pre>
 </div>
 </div>
 <p>Keep in mind that passing a single label (rather than a list) for just one of the two arguments produces a <code>Series</code>. Below, we’ve extracted a subset of the <code>"Popular vote"</code> column as a <code>Series</code>.</p>
-<div id="ced0162d" class="cell" data-execution_count="33">
+<div id="2dcb67c6" class="cell" data-execution_count="33">
 <div class="sourceCode cell-code" id="cb51"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1"><a href="#cb51-1" aria-hidden="true" tabindex="-1"></a>elections.loc[[<span class="dv">87</span>, <span class="dv">25</span>, <span class="dv">179</span>], <span class="st">"Popular vote"</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="33">
 <pre><code>87     15761254
@@ -1424,7 +1430,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 </div>
 <p>Note that if we pass <code>"Popular vote"</code> as a list, the output will be a <code>DataFrame</code>.</p>
-<div id="6090c608" class="cell" data-execution_count="34">
+<div id="7676ff73" class="cell" data-execution_count="34">
 <div class="sourceCode cell-code" id="cb53"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb53-1"><a href="#cb53-1" aria-hidden="true" tabindex="-1"></a>elections.loc[[<span class="dv">87</span>, <span class="dv">25</span>, <span class="dv">179</span>], [<span class="st">"Popular vote"</span>]]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="34">
 <div>
@@ -1457,7 +1463,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 </div>
 <p>To select <em>multiple</em> rows and columns, we can use Python slice notation. Here, we select the rows from labels <code>0</code> to <code>3</code> and the columns from labels <code>"Year"</code> to <code>"Popular vote"</code>. Notice that unlike Python slicing, <code>.loc</code> is <em>inclusive</em> of the upper bound.</p>
-<div id="cbed5828" class="cell" data-execution_count="35">
+<div id="f989d6fa" class="cell" data-execution_count="35">
 <div class="sourceCode cell-code" id="cb54"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1"><a href="#cb54-1" aria-hidden="true" tabindex="-1"></a>elections.loc[<span class="dv">0</span>:<span class="dv">3</span>, <span class="st">'Year'</span>:<span class="st">'Popular vote'</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="35">
 <div>
@@ -1509,7 +1515,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 </div>
 <p>Suppose that instead, we want to extract <em>all</em> column values for the first four rows in the <code>elections</code> <code>DataFrame</code>. The shorthand <code>:</code> is useful for this.</p>
-<div id="bf2dcc6c" class="cell" data-execution_count="36">
+<div id="7b8e093a" class="cell" data-execution_count="36">
 <div class="sourceCode cell-code" id="cb55"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb55-1"><a href="#cb55-1" aria-hidden="true" tabindex="-1"></a>elections.loc[<span class="dv">0</span>:<span class="dv">3</span>, :]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="36">
 <div>
@@ -1571,7 +1577,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 </div>
 <p>We can use the same shorthand to extract all rows.</p>
-<div id="2dc98689" class="cell" data-execution_count="37">
+<div id="abb34a26" class="cell" data-execution_count="37">
 <div class="sourceCode cell-code" id="cb56"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb56-1"><a href="#cb56-1" aria-hidden="true" tabindex="-1"></a>elections.loc[:, [<span class="st">"Year"</span>, <span class="st">"Candidate"</span>, <span class="st">"Result"</span>]]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="37">
 <div>
@@ -1662,7 +1668,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 <p>There are a couple of things we should note. Firstly, unlike conventional Python, <code>pandas</code> allows us to slice string values (in our example, the column labels). Secondly, slicing with <code>.loc</code> is <em>inclusive</em>. Notice how our resulting <code>DataFrame</code> includes every row and column between and including the slice labels we specified.</p>
 <p>Equivalently, we can use a list to obtain multiple rows and columns in our <code>elections</code> <code>DataFrame</code>.</p>
-<div id="1e4c92b2" class="cell" data-execution_count="38">
+<div id="b96fd8e0" class="cell" data-execution_count="38">
 <div class="sourceCode cell-code" id="cb57"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb57-1"><a href="#cb57-1" aria-hidden="true" tabindex="-1"></a>elections.loc[[<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>], [<span class="st">'Year'</span>, <span class="st">'Candidate'</span>, <span class="st">'Party'</span>, <span class="st">'Popular vote'</span>]]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="38">
 <div>
@@ -1714,7 +1720,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 </div>
 </div>
 <p>Lastly, we can interchange list and slicing notation.</p>
-<div id="80767560" class="cell" data-execution_count="39">
+<div id="2f9b7c38" class="cell" data-execution_count="39">
 <div class="sourceCode cell-code" id="cb58"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb58-1"><a href="#cb58-1" aria-hidden="true" tabindex="-1"></a>elections.loc[[<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>], :]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="39">
 <div>
@@ -1780,7 +1786,7 @@ <h3 data-number="2.4.2" class="anchored" data-anchor-id="label-based-extraction-
 <h3 data-number="2.4.3" class="anchored" data-anchor-id="integer-based-extraction-indexing-with-.iloc"><span class="header-section-number">2.4.3</span> Integer-based Extraction: Indexing with <code>.iloc</code></h3>
 <p>Slicing with <code>.iloc</code> works similarly to <code>.loc</code>. However, <code>.iloc</code> uses the <em>index positions</em> of rows and columns rather than the labels (think to yourself: <strong>l</strong>oc uses <strong>l</strong>abels; <strong>i</strong>loc uses <strong>i</strong>ndices). The arguments to <code>.iloc</code> also behave similarly: single values, lists, slices, and any combination of these are permitted.</p>
 <p>Let’s begin reproducing our results from above. We’ll begin by selecting the first presidential candidate in our <code>elections</code> <code>DataFrame</code>:</p>
-<div id="21d6c186" class="cell" data-execution_count="40">
+<div id="f986a80d" class="cell" data-execution_count="40">
 <div class="sourceCode cell-code" id="cb59"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb59-1"><a href="#cb59-1" aria-hidden="true" tabindex="-1"></a><span class="co"># elections.loc[0, "Candidate"] - Previous approach</span></span>
 <span id="cb59-2"><a href="#cb59-2" aria-hidden="true" tabindex="-1"></a>elections.iloc[<span class="dv">0</span>, <span class="dv">1</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="40">
@@ -1789,7 +1795,7 @@ <h3 data-number="2.4.3" class="anchored" data-anchor-id="integer-based-extractio
 </div>
 <p>Notice that the first argument to both <code>.loc</code> and <code>.iloc</code> is the same. This is because the row with a label of <code>0</code> is conveniently in the <span class="math inline">\(0^{\text{th}}\)</span> (equivalently, the first) position of the <code>elections</code> <code>DataFrame</code>. Generally, this is true of any <code>DataFrame</code> where the row labels increase by one from 0.</p>
 <p>And, as before, if we pass a single value for just one of the two arguments, our result is a <code>Series</code>.</p>
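 <p>As a minimal sketch of what happens when row labels and positions stop lining up, consider the toy example below. The <code>toy</code> <code>DataFrame</code> and its values are hypothetical and are not taken from the <code>elections</code> dataset; the point is only that filtering keeps the original labels, so label <code>2</code> ends up in position <code>0</code>.</p>

```python
import pandas as pd

# Hypothetical toy data (not the real elections dataset)
toy = pd.DataFrame(
    {"Candidate": ["A", "B", "C", "D"], "Year": [2000, 2004, 2008, 2012]}
)

# Filtering keeps the original row labels, so `later` has labels 2 and 3
later = toy[toy["Year"] > 2004]

by_label = later.loc[2, "Candidate"]  # the row whose *label* is 2
by_position = later.iloc[0, 0]        # the row in *position* 0, column in position 0

print(list(later.index))      # [2, 3]
print(by_label, by_position)  # C C
```

 <p>Here <code>later.loc[0, "Candidate"]</code> would raise a <code>KeyError</code>, because no row is labeled <code>0</code> any more, while <code>later.iloc[0, 0]</code> still works.</p>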
-<div id="8036da51" class="cell" data-execution_count="41">
+<div id="4eb7d55d" class="cell" data-execution_count="41">
 <div class="sourceCode cell-code" id="cb61"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb61-1"><a href="#cb61-1" aria-hidden="true" tabindex="-1"></a>elections.iloc[[<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>],<span class="dv">1</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="41">
 <pre><code>1    John Quincy Adams
@@ -1799,7 +1805,7 @@ <h3 data-number="2.4.3" class="anchored" data-anchor-id="integer-based-extractio
 </div>
 </div>
 <p>However, when we select the first four rows and columns using <code>.iloc</code>, we notice something.</p>
-<div id="bd5516b0" class="cell" data-execution_count="42">
+<div id="25d6c01e" class="cell" data-execution_count="42">
 <div class="sourceCode cell-code" id="cb63"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb63-1"><a href="#cb63-1" aria-hidden="true" tabindex="-1"></a><span class="co"># elections.loc[0:3, 'Year':'Popular vote'] - Previous approach</span></span>
 <span id="cb63-2"><a href="#cb63-2" aria-hidden="true" tabindex="-1"></a>elections.iloc[<span class="dv">0</span>:<span class="dv">4</span>, <span class="dv">0</span>:<span class="dv">4</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="42">
@@ -1853,7 +1859,7 @@ <h3 data-number="2.4.3" class="anchored" data-anchor-id="integer-based-extractio
 </div>
 <p>Slicing is no longer inclusive in <code>.iloc</code> — it’s <em>exclusive</em>. In other words, the right end of a slice is not included when using <code>.iloc</code>. This is one of the subtleties of <code>pandas</code> syntax; you will get used to it with practice.</p>
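 <p>To make the difference concrete, here is a small sketch on a hypothetical five-row <code>DataFrame</code> (the data and the name <code>toy</code> are made up for illustration). The same-looking slice <code>0:3</code> returns four rows with <code>.loc</code> and three rows with <code>.iloc</code>.</p>

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame(
    {
        "Year": [1824, 1824, 1828, 1828, 1832],
        "Candidate": ["A", "B", "C", "D", "E"],
        "Party": ["P", "Q", "P", "Q", "P"],
    }
)

# .loc slices by label and includes both endpoints
label_slice = toy.loc[0:3, "Year":"Candidate"]

# .iloc slices by position and excludes the right endpoint
position_slice = toy.iloc[0:3, 0:2]

print(label_slice.shape)     # (4, 2)
print(position_slice.shape)  # (3, 2)

# Extending the positional slice by one reproduces the label-based result
print(toy.iloc[0:4, 0:2].equals(label_slice))  # True
```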
 <p>List behavior works just as expected.</p>
-<div id="a2e1555a" class="cell" data-execution_count="43">
+<div id="07869626" class="cell" data-execution_count="43">
 <div class="sourceCode cell-code" id="cb64"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb64-1"><a href="#cb64-1" aria-hidden="true" tabindex="-1"></a><span class="co">#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach</span></span>
 <span id="cb64-2"><a href="#cb64-2" aria-hidden="true" tabindex="-1"></a>elections.iloc[[<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>], [<span class="dv">0</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>]]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="43">
@@ -1906,7 +1912,7 @@ <h3 data-number="2.4.3" class="anchored" data-anchor-id="integer-based-extractio
 </div>
 </div>
 <p>And just like with <code>.loc</code>, we can use a colon with <code>.iloc</code> to extract all rows or columns.</p>
-<div id="78e9ee29" class="cell" data-execution_count="44">
+<div id="303e9067" class="cell" data-execution_count="44">
 <div class="sourceCode cell-code" id="cb65"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb65-1"><a href="#cb65-1" aria-hidden="true" tabindex="-1"></a>elections.iloc[:, <span class="dv">0</span>:<span class="dv">3</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="44">
 <div>
@@ -2014,7 +2020,7 @@ <h3 data-number="2.4.4" class="anchored" data-anchor-id="context-dependent-extra
 <section id="a-slice-of-row-numbers" class="level4" data-number="2.4.4.1">
 <h4 data-number="2.4.4.1" class="anchored" data-anchor-id="a-slice-of-row-numbers"><span class="header-section-number">2.4.4.1</span> A slice of row numbers</h4>
 <p>Say we wanted the first four rows of our <code>elections</code> <code>DataFrame</code>.</p>
-<div id="a36c2b1d" class="cell" data-execution_count="45">
+<div id="2a607ddc" class="cell" data-execution_count="45">
 <div class="sourceCode cell-code" id="cb66"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb66-1"><a href="#cb66-1" aria-hidden="true" tabindex="-1"></a>elections[<span class="dv">0</span>:<span class="dv">4</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="45">
 <div>
@@ -2079,7 +2085,7 @@ <h4 data-number="2.4.4.1" class="anchored" data-anchor-id="a-slice-of-row-number
 <section id="a-list-of-column-labels" class="level4" data-number="2.4.4.2">
 <h4 data-number="2.4.4.2" class="anchored" data-anchor-id="a-list-of-column-labels"><span class="header-section-number">2.4.4.2</span> A list of column labels</h4>
 <p>Suppose we now want the first four columns.</p>
-<div id="59652ad1" class="cell" data-execution_count="46">
+<div id="2bf5ad8e" class="cell" data-execution_count="46">
 <div class="sourceCode cell-code" id="cb67"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb67-1"><a href="#cb67-1" aria-hidden="true" tabindex="-1"></a>elections[[<span class="st">"Year"</span>, <span class="st">"Candidate"</span>, <span class="st">"Party"</span>, <span class="st">"Popular vote"</span>]]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="46">
 <div>
@@ -2184,7 +2190,7 @@ <h4 data-number="2.4.4.2" class="anchored" data-anchor-id="a-list-of-column-labe
 <section id="a-single-column-label" class="level4" data-number="2.4.4.3">
 <h4 data-number="2.4.4.3" class="anchored" data-anchor-id="a-single-column-label"><span class="header-section-number">2.4.4.3</span> A single-column label</h4>
 <p>Lastly, <code>[]</code> allows us to extract only the <code>"Candidate"</code> column.</p>
-<div id="59e8512f" class="cell" data-execution_count="47">
+<div id="f0e29255" class="cell" data-execution_count="47">
 <div class="sourceCode cell-code" id="cb68"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb68-1"><a href="#cb68-1" aria-hidden="true" tabindex="-1"></a>elections[<span class="st">"Candidate"</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="47">
 <pre><code>0         Andrew Jackson
diff --git a/docs/pandas_2/pandas_2.html b/docs/pandas_2/pandas_2.html
index 0371226e..2dbf40ce 100644
--- a/docs/pandas_2/pandas_2.html
+++ b/docs/pandas_2/pandas_2.html
@@ -208,6 +208,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -280,7 +286,7 @@ <h1 class="title"><span class="chapter-number">3</span>&nbsp; <span class="chapt
 <p>Last time, we introduced the <code>pandas</code> library as a toolkit for processing data. We learned the <code>DataFrame</code> and <code>Series</code> data structures, familiarized ourselves with the basic syntax for manipulating tabular data, and began writing our first lines of <code>pandas</code> code.</p>
 <p>In this lecture, we’ll start to dive into some advanced <code>pandas</code> syntax. You may find it helpful to follow along with a notebook of your own as we walk through these new pieces of code.</p>
 <p>We’ll start by loading the <code>babynames</code> dataset.</p>
-<div id="2fa95829" class="cell" data-execution_count="1">
+<div id="a6fcaa11" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This code pulls census data and loads it into a DataFrame</span></span>
@@ -373,7 +379,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 <p>Conditional selection allows us to select a subset of rows in a <code>DataFrame</code> that satisfy some specified condition.</p>
 <p>To understand how to use conditional selection, we must look at another possible input of the <code>.loc</code> and <code>[]</code> methods – a boolean array, which is simply an array or <code>Series</code> where each element is either <code>True</code> or <code>False</code>. This boolean array must have a length equal to the number of rows in the <code>DataFrame</code>. It will return all rows that correspond to a value of <code>True</code> in the array. We used a very similar technique when performing conditional extraction from a <code>Series</code> in the last lecture.</p>
 <p>To see this in action, let’s select all even-indexed rows in the first 10 rows of our <code>DataFrame</code>.</p>
-<div id="efd602aa" class="cell" data-execution_count="2">
+<div id="389fe08a" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Ask yourself: why is :9 the correct slice to select the first 10 rows?</span></span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>babynames_first_10_rows <span class="op">=</span> babynames.loc[:<span class="dv">9</span>, :]</span>
 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -442,7 +448,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 </div>
 <p>We can perform a similar operation using <code>.loc</code>.</p>
-<div id="027a73a9" class="cell" data-execution_count="3">
+<div id="498b6ea5" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>babynames_first_10_rows.loc[[<span class="va">True</span>, <span class="va">False</span>, <span class="va">True</span>, <span class="va">False</span>, <span class="va">True</span>, <span class="va">False</span>, <span class="va">True</span>, <span class="va">False</span>, <span class="va">True</span>, <span class="va">False</span>], :]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="3">
 <div>
@@ -508,7 +514,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 <p>These techniques worked well in this example, but you can imagine how tedious it might be to list out <code>True</code> and <code>False</code> for every row in a larger <code>DataFrame</code>. To make things easier, we can instead provide a logical condition as an input to <code>.loc</code> or <code>[]</code> that returns a boolean array with the necessary length.</p>
 <p>For example, to return all names associated with <code>F</code> sex:</p>
-<div id="2ccd6e1f" class="cell" data-execution_count="4">
+<div id="57edbacd" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First, use a logical condition to generate a boolean array</span></span>
 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>logical_operator <span class="op">=</span> (babynames[<span class="st">"Sex"</span>] <span class="op">==</span> <span class="st">"F"</span>)</span>
 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -578,7 +584,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 <p>Recall from the previous lecture that <code>.head()</code> will return only the first few rows in the <code>DataFrame</code>. In reality, <code>babynames[logical_operator]</code> contains as many rows as there are entries in the original <code>babynames</code> <code>DataFrame</code> with sex <code>"F"</code>.</p>
 <p>Here, <code>logical_operator</code> evaluates to a <code>Series</code> of boolean values with length 407428.</p>
-<div id="a7001602" class="cell" data-execution_count="5">
+<div id="71298911" class="cell" data-execution_count="5">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"There are a total of </span><span class="sc">{}</span><span class="st"> values in 'logical_operator'"</span>.<span class="bu">format</span>(<span class="bu">len</span>(logical_operator)))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -588,7 +594,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 </div>
 <p>Rows starting at row 0 and ending at row 239536 evaluate to <code>True</code> and are thus returned in the <code>DataFrame</code>. Rows from 239537 onwards evaluate to <code>False</code> and are omitted from the output.</p>
-<div id="d57a6e3d" class="cell" data-execution_count="6">
+<div id="c89a5699" class="cell" data-execution_count="6">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"The 0th item in this 'logical_operator' is: </span><span class="sc">{}</span><span class="st">"</span>.<span class="bu">format</span>(logical_operator.iloc[<span class="dv">0</span>]))</span>
@@ -603,7 +609,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 <p>Passing a <code>Series</code> as an argument to <code>babynames[]</code> has the same effect as using a boolean array. In fact, the <code>[]</code> selection operator can take a boolean <code>Series</code>, array, or list as its argument. These three are used interchangeably throughout the course.</p>
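 <p>As a minimal sketch of this interchangeability, the example below filters a small, hypothetical <code>toy</code> table (a stand-in, not the real <code>babynames</code> data) with the same mask expressed as a <code>Series</code>, a <code>NumPy</code> array, and a list. All variable names here are illustrative assumptions.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for `babynames`
toy = pd.DataFrame({
    "Name": ["Mary", "John", "Helen", "Liam"],
    "Sex": ["F", "M", "F", "M"],
})

mask_series = toy["Sex"] == "F"      # boolean Series
mask_array = mask_series.to_numpy()  # boolean NumPy array
mask_list = mask_series.tolist()     # plain Python list of booleans

from_series = toy[mask_series]
from_array = toy[mask_array]
from_list = toy[mask_list]

# Check whether all three masks select the same rows
same_rows = from_series.equals(from_array) and from_series.equals(from_list)
print(same_rows)
```

 <p>Each mask must have one boolean per row of the table being filtered.</p>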
 <p>We can also use <code>.loc</code> to achieve similar results.</p>
-<div id="6ebe014a" class="cell" data-execution_count="7">
+<div id="c8316fe0" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>babynames.loc[babynames[<span class="st">"Sex"</span>] <span class="op">==</span> <span class="st">"F"</span>].head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="7">
 <div>
@@ -701,7 +707,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </table>
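 <p>When combining multiple conditions with logical operators, we surround each individual condition with a set of parentheses <code>()</code>. This is needed because <code>&amp;</code> and <code>|</code> bind more tightly than comparison operators such as <code>==</code> and <code>&lt;</code>; without the parentheses, Python groups the expression differently than intended, which typically raises an error.</p>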
 <p>For example, if we want to return data on all names with sex <code>"F"</code> born before the year 2000, we can write:</p>
-<div id="5e0b1e9b" class="cell" data-execution_count="8">
+<div id="368441a8" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>babynames[(babynames[<span class="st">"Sex"</span>] <span class="op">==</span> <span class="st">"F"</span>) <span class="op">&amp;</span> (babynames[<span class="st">"Year"</span>] <span class="op">&lt;</span> <span class="dv">2000</span>)].head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="8">
 <div>
@@ -766,12 +772,12 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 </div>
 <p>Note that we’re working with <code>Series</code>, so using <code>and</code> in place of <code>&amp;</code>, or <code>or</code> in place of <code>|</code>, will raise an error.</p>
-<div id="541377a6" class="cell" data-execution_count="9">
+<div id="e0fc2782" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This line of code will raise a ValueError</span></span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="co"># babynames[(babynames["Sex"] == "F") and (babynames["Year"] &lt; 2000)].head()</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
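 <p>To see the error without breaking a notebook run, the sketch below catches it on a small, hypothetical <code>toy</code> table (not the real <code>babynames</code> data) and then applies the element-wise operators. The variable names are assumptions made for illustration.</p>

```python
import pandas as pd

# Hypothetical toy table standing in for `babynames`
toy = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "Year": [1995, 1990, 2005],
})

is_f = toy["Sex"] == "F"
before_2000 = toy["Year"] < 2000

try:
    toy[is_f and before_2000]
    raised = False
except ValueError:
    # `and` asks for a single truth value of an entire Series, which is ambiguous
    raised = True

both = toy[is_f & before_2000]    # element-wise AND: rows meeting both conditions
either = toy[is_f | before_2000]  # element-wise OR: rows meeting at least one condition

print(raised, len(both), len(either))
```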
 <p>If we want to return data on all names with sex <code>"F"</code> <em>or</em> all born before the year 2000, we can write:</p>
-<div id="2ea3568c" class="cell" data-execution_count="10">
+<div id="63119fd9" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>babynames[(babynames[<span class="st">"Sex"</span>] <span class="op">==</span> <span class="st">"F"</span>) <span class="op">|</span> (babynames[<span class="st">"Year"</span>] <span class="op">&lt;</span> <span class="dv">2000</span>)].head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="10">
 <div>
@@ -836,7 +842,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 </div>
 <p>Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions. In the example below, our boolean condition is long enough to extend for several lines of code.</p>
-<div id="55e8ec17" class="cell" data-execution_count="11">
+<div id="0f5e03af" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability</span></span>
 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>(</span>
 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>    babynames[(babynames[<span class="st">"Name"</span>] <span class="op">==</span> <span class="st">"Bella"</span>) <span class="op">|</span> </span>
@@ -908,7 +914,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 <p>Fortunately, <code>pandas</code> provides many alternative methods for constructing boolean filters.</p>
 <p>The <code>.isin</code> method is one such example. It evaluates whether the values in a <code>Series</code> are contained in a different sequence (list, array, or <code>Series</code>) of values. In the cell below, we achieve equivalent results to the <code>DataFrame</code> above with far more concise code.</p>
-<div id="e6cbc6d9" class="cell" data-execution_count="12">
+<div id="23baacba" class="cell" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>names <span class="op">=</span> [<span class="st">"Bella"</span>, <span class="st">"Alex"</span>, <span class="st">"Narges"</span>, <span class="st">"Lisa"</span>]</span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Name"</span>].isin(names).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="12">
@@ -920,7 +926,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 Name: Name, dtype: bool</code></pre>
 </div>
 </div>
-<div id="1fa63630" class="cell" data-execution_count="13">
+<div id="098a9c9e" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>babynames[babynames[<span class="st">"Name"</span>].isin(names)].head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="13">
 <div>
@@ -985,7 +991,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 </div>
 </div>
 <p>The <code>.str.startswith</code> method can be used to define a filter based on string values in a <code>Series</code> object. It checks whether each string value in a <code>Series</code> starts with a particular prefix, such as a single character.</p>
-<div id="f0d2bad9" class="cell" data-execution_count="14">
+<div id="15f1d6e9" class="cell" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Identify whether names begin with the letter "N"</span></span>
 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Name"</span>].<span class="bu">str</span>.startswith(<span class="st">"N"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="14">
@@ -997,7 +1003,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 Name: Name, dtype: bool</code></pre>
 </div>
 </div>
-<div id="86d4503f" class="cell" data-execution_count="15">
+<div id="cbe7e263" class="cell" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Extracting names that begin with the letter "N"</span></span>
 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>babynames[babynames[<span class="st">"Name"</span>].<span class="bu">str</span>.startswith(<span class="st">"N"</span>)].head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="15">
@@ -1067,7 +1073,7 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="conditional-selection"><s
 <h2 data-number="3.2" class="anchored" data-anchor-id="adding-removing-and-modifying-columns"><span class="header-section-number">3.2</span> Adding, Removing, and Modifying Columns</h2>
 <p>In many data science tasks, we may need to change the columns contained in our <code>DataFrame</code> in some way. Fortunately, the syntax to do so is fairly straightforward.</p>
 <p>To add a new column to a <code>DataFrame</code>, we use a syntax similar to that used when accessing an existing column. Specify the name of the new column by writing <code>df["column"]</code>, then assign this to a <code>Series</code> or array containing the values that will populate this column.</p>
-<div id="a072f8d4" class="cell" data-execution_count="16">
+<div id="1148c3c6" class="cell" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a Series of the length of each name. </span></span>
 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a>babyname_lengths <span class="op">=</span> babynames[<span class="st">"Name"</span>].<span class="bu">str</span>.<span class="bu">len</span>()</span>
 <span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1143,7 +1149,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="adding-removing-and-modif
 </div>
 </div>
 <p>If we need to later modify an existing column, we can do so by referencing this column again with the syntax <code>df["column"]</code>, then re-assigning it to a new <code>Series</code> or array of the appropriate length.</p>
-<div id="268dfa15" class="cell" data-execution_count="17">
+<div id="49f02133" class="cell" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Modify the “name_lengths” column to be one less than its original value</span></span>
 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"name_lengths"</span>] <span class="op">=</span> babynames[<span class="st">"name_lengths"</span>] <span class="op">-</span> <span class="dv">1</span></span>
 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a>babynames.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1216,7 +1222,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="adding-removing-and-modif
 </div>
 </div>
 <p>We can rename a column using the <code>.rename()</code> method. It takes in a dictionary that maps old column names to their new ones.</p>
-<div id="a9d24d4a" class="cell" data-execution_count="18">
+<div id="4041cf93" class="cell" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Rename “name_lengths” to “Length”</span></span>
 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>babynames <span class="op">=</span> babynames.rename(columns<span class="op">=</span>{<span class="st">"name_lengths"</span>:<span class="st">"Length"</span>})</span>
 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a>babynames.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1289,7 +1295,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="adding-removing-and-modif
 </div>
 </div>
 <p>If we want to remove a column or row of a <code>DataFrame</code>, we can call the <code>.drop</code> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">(documentation)</a> method. Use the <code>axis</code> parameter to specify whether a column or row should be dropped. Unless otherwise specified, <code>pandas</code> will assume that we are dropping a row by default.</p>
-<div id="10392f0d" class="cell" data-execution_count="19">
+<div id="1ec97982" class="cell" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Drop our new "Length" column from the DataFrame</span></span>
 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>babynames <span class="op">=</span> babynames.drop(<span class="st">"Length"</span>, axis<span class="op">=</span><span class="st">"columns"</span>)</span>
 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a>babynames.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1357,7 +1363,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="adding-removing-and-modif
 </div>
 <p>Notice that we <em>re-assigned</em> <code>babynames</code> to the result of <code>babynames.drop(...)</code>. This is a subtle but important point: <code>pandas</code> table operations <strong>do not occur in-place</strong>. Calling <code>df.drop(...)</code> will output a <em>copy</em> of <code>df</code> with the row/column of interest removed without modifying the original <code>df</code> table.</p>
 <p>In other words, if we simply call:</p>
-<div id="9d49b846" class="cell" data-execution_count="20">
+<div id="33d1b171" class="cell" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb24"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This creates a copy of `babynames` and removes the column "Name"...</span></span>
 <span id="cb24-2"><a href="#cb24-2" aria-hidden="true" tabindex="-1"></a>babynames.drop(<span class="st">"Name"</span>, axis<span class="op">=</span><span class="st">"columns"</span>)</span>
 <span id="cb24-3"><a href="#cb24-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1445,7 +1451,7 @@ <h2 data-number="3.3" class="anchored" data-anchor-id="useful-utility-functions"
 <section id="numpy" class="level3" data-number="3.3.1">
 <h3 data-number="3.3.1" class="anchored" data-anchor-id="numpy"><span class="header-section-number">3.3.1</span> <code>NumPy</code></h3>
 <p><code>pandas</code> is designed to work well with <code>NumPy</code>, the framework for array computations you encountered in <a href="https://www.data8.org/su23/reference/#array-functions-and-methods">Data 8</a>. Just about any <code>NumPy</code> function can be applied to <code>pandas</code> <code>DataFrame</code>s and <code>Series</code>.</p>
-<div id="95009965" class="cell" data-execution_count="21">
+<div id="629fe580" class="cell" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb25"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Pull out the number of babies named Yash each year</span></span>
 <span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a>yash_count <span class="op">=</span> babynames[babynames[<span class="st">"Name"</span>] <span class="op">==</span> <span class="st">"Yash"</span>][<span class="st">"Count"</span>]</span>
 <span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a>yash_count.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1458,14 +1464,14 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="numpy"><span class="hea
 Name: Count, dtype: int64</code></pre>
 </div>
 </div>
-<div id="2aa806b3" class="cell" data-execution_count="22">
+<div id="0ba042d4" class="cell" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Average number of babies named Yash each year</span></span>
 <span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a>np.mean(yash_count)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="22">
 <pre><code>np.float64(17.142857142857142)</code></pre>
 </div>
 </div>
-<div id="12ce74c5" class="cell" data-execution_count="23">
+<div id="05959505" class="cell" data-execution_count="23">
 <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Max number of babies named Yash born in any one year</span></span>
 <span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a>np.<span class="bu">max</span>(yash_count)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="23">
@@ -1477,14 +1483,14 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="numpy"><span class="hea
 <h3 data-number="3.3.2" class="anchored" data-anchor-id="shape-and-.size"><span class="header-section-number">3.3.2</span> <code>.shape</code> and <code>.size</code></h3>
 <p><code>.shape</code> and <code>.size</code> are attributes of <code>Series</code> and <code>DataFrame</code>s that measure the “amount” of data stored in the structure. For a <code>DataFrame</code>, <code>.shape</code> returns a tuple containing the number of rows and columns; for a <code>Series</code>, it returns a one-element tuple containing only the number of entries. <code>.size</code> gives the total number of elements in a structure, which for a <code>DataFrame</code> equals the number of rows times the number of columns.</p>
 <p>Many functions strictly require the dimensions of the arguments along certain axes to match. Checking these attributes is much faster than counting all of the items by hand.</p>
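 <p>The difference between the two structures is easiest to see on a small example. The sketch below uses a hypothetical three-row <code>toy</code> table rather than <code>babynames</code>; its contents are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical toy table: 3 rows, 2 columns
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy"],
    "Count": [295, 239, 220],
})

df_shape = toy.shape               # (num_rows, num_columns)
df_size = toy.size                 # num_rows * num_columns
series_shape = toy["Count"].shape  # one-element tuple: (num_entries,)
series_size = toy["Count"].size    # num_entries

print(df_shape, df_size, series_shape, series_size)
```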
-<div id="0f914ac0" class="cell" data-execution_count="24">
+<div id="d782e812" class="cell" data-execution_count="24">
 <div class="sourceCode cell-code" id="cb31"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Return the shape of the DataFrame, in the format (num_rows, num_columns)</span></span>
 <span id="cb31-2"><a href="#cb31-2" aria-hidden="true" tabindex="-1"></a>babynames.shape</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="24">
 <pre><code>(407428, 5)</code></pre>
 </div>
 </div>
-<div id="24cf1ba4" class="cell" data-execution_count="25">
+<div id="bfff134d" class="cell" data-execution_count="25">
 <div class="sourceCode cell-code" id="cb33"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Return the size of the DataFrame, equal to num_rows * num_columns</span></span>
 <span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a>babynames.size</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="25">
@@ -1495,7 +1501,7 @@ <h3 data-number="3.3.2" class="anchored" data-anchor-id="shape-and-.size"><span
 <section id="describe" class="level3" data-number="3.3.3">
 <h3 data-number="3.3.3" class="anchored" data-anchor-id="describe"><span class="header-section-number">3.3.3</span> <code>.describe()</code></h3>
 <p>If many statistics are required from a <code>DataFrame</code> (minimum value, maximum value, mean value, etc.), then <code>.describe()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html">(documentation)</a> can be used to compute all of them at once.</p>
-<div id="ecd36f29" class="cell" data-execution_count="26">
+<div id="3c90b6ea" class="cell" data-execution_count="26">
 <div class="sourceCode cell-code" id="cb35"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true" tabindex="-1"></a>babynames.describe()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="26">
 <div>
@@ -1557,7 +1563,7 @@ <h3 data-number="3.3.3" class="anchored" data-anchor-id="describe"><span class="
 </div>
 </div>
 <p>A different set of statistics will be reported if <code>.describe()</code> is called on a <code>Series</code>.</p>
-<div id="297fabea" class="cell" data-execution_count="27">
+<div id="5f80b9d5" class="cell" data-execution_count="27">
 <div class="sourceCode cell-code" id="cb36"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Sex"</span>].describe()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="27">
 <pre><code>count     407428
@@ -1572,7 +1578,7 @@ <h3 data-number="3.3.3" class="anchored" data-anchor-id="describe"><span class="
 <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="header-section-number">3.3.4</span> <code>.sample()</code></h3>
 <p>As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). <code>.sample()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html">(documentation)</a> lets us quickly select random entries (a row if called from a <code>DataFrame</code>, or a value if called from a <code>Series</code>).</p>
 <p>By default, <code>.sample()</code> selects entries <em>without</em> replacement. Pass in the argument <code>replace=True</code> to sample with replacement.</p>
-<div id="21fa6764" class="cell" data-execution_count="28">
+<div id="7b02ab58" class="cell" data-execution_count="28">
 <div class="sourceCode cell-code" id="cb38"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Sample a single row</span></span>
 <span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>babynames.sample()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="28">
@@ -1592,12 +1598,12 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 </thead>
 <tbody>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">377796</td>
+<td data-quarto-table-cell-role="th">44846</td>
 <td>CA</td>
-<td>M</td>
-<td>2012</td>
-<td>Maxson</td>
-<td>7</td>
+<td>F</td>
+<td>1961</td>
+<td>Darcie</td>
+<td>13</td>
 </tr>
 </tbody>
 </table>
@@ -1606,7 +1612,7 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 </div>
 </div>
 <p>Naturally, this can be chained with other methods and operators (<code>iloc</code>, etc.).</p>
-<div id="0f957755" class="cell" data-execution_count="29">
+<div id="5527d0f6" class="cell" data-execution_count="29">
 <div class="sourceCode cell-code" id="cb39"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Sample 5 random rows, and select all columns from integer position 2 onward</span></span>
 <span id="cb39-2"><a href="#cb39-2" aria-hidden="true" tabindex="-1"></a>babynames.sample(<span class="dv">5</span>).iloc[:, <span class="dv">2</span>:]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="29">
@@ -1624,34 +1630,34 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 </thead>
 <tbody>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">399452</td>
-<td>2020</td>
-<td>Kade</td>
-<td>51</td>
+<td data-quarto-table-cell-role="th">214600</td>
+<td>2016</td>
+<td>Aarna</td>
+<td>30</td>
 </tr>
 <tr class="even">
-<td data-quarto-table-cell-role="th">354928</td>
-<td>2004</td>
-<td>Ojani</td>
-<td>6</td>
+<td data-quarto-table-cell-role="th">284937</td>
+<td>1971</td>
+<td>Rigoberto</td>
+<td>36</td>
 </tr>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">94414</td>
-<td>1984</td>
-<td>Maggie</td>
-<td>62</td>
+<td data-quarto-table-cell-role="th">5580</td>
+<td>1922</td>
+<td>Claudine</td>
+<td>6</td>
 </tr>
 <tr class="even">
-<td data-quarto-table-cell-role="th">294766</td>
-<td>1978</td>
-<td>Emmanuel</td>
-<td>52</td>
+<td data-quarto-table-cell-role="th">241833</td>
+<td>1917</td>
+<td>Bernhard</td>
+<td>5</td>
 </tr>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">356937</td>
-<td>2005</td>
-<td>Derik</td>
-<td>10</td>
+<td data-quarto-table-cell-role="th">278221</td>
+<td>1965</td>
+<td>Johnson</td>
+<td>5</td>
 </tr>
 </tbody>
 </table>
@@ -1659,7 +1665,7 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 </div>
 </div>
 </div>
-<div id="77469572" class="cell" data-execution_count="30">
+<div id="908383ba" class="cell" data-execution_count="30">
 <div class="sourceCode cell-code" id="cb40"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><a href="#cb40-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Randomly sample 4 names from the year 2000, with replacement, and select all columns from integer position 2 onward</span></span>
 <span id="cb40-2"><a href="#cb40-2" aria-hidden="true" tabindex="-1"></a>babynames[babynames[<span class="st">"Year"</span>] <span class="op">==</span> <span class="dv">2000</span>].sample(<span class="dv">4</span>, replace <span class="op">=</span> <span class="va">True</span>).iloc[:, <span class="dv">2</span>:]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="30">
@@ -1677,28 +1683,28 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 </thead>
 <tbody>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">150626</td>
+<td data-quarto-table-cell-role="th">344323</td>
 <td>2000</td>
-<td>Karley</td>
-<td>15</td>
+<td>Remington</td>
+<td>7</td>
 </tr>
 <tr class="even">
-<td data-quarto-table-cell-role="th">152317</td>
+<td data-quarto-table-cell-role="th">150678</td>
 <td>2000</td>
-<td>Amyah</td>
-<td>5</td>
+<td>Jessalyn</td>
+<td>14</td>
 </tr>
 <tr class="odd">
-<td data-quarto-table-cell-role="th">149628</td>
+<td data-quarto-table-cell-role="th">344765</td>
 <td>2000</td>
-<td>Frida</td>
-<td>63</td>
+<td>Jaelyn</td>
+<td>5</td>
 </tr>
 <tr class="even">
-<td data-quarto-table-cell-role="th">343239</td>
+<td data-quarto-table-cell-role="th">151921</td>
 <td>2000</td>
-<td>Jaycob</td>
-<td>23</td>
+<td>Alajah</td>
+<td>6</td>
 </tr>
 </tbody>
 </table>
@@ -1711,7 +1717,7 @@ <h3 data-number="3.3.4" class="anchored" data-anchor-id="sample"><span class="he
 <h3 data-number="3.3.5" class="anchored" data-anchor-id="value_counts"><span class="header-section-number">3.3.5</span> <code>.value_counts()</code></h3>
 <p>The <code>Series.value_counts()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html">(documentation)</a> method counts the number of occurrences of each unique value in a <code>Series</code>. In other words, it <em>counts</em> the number of times each unique <em>value</em> appears. This is often useful for determining the most or least common entries in a <code>Series</code>.</p>
 <p>In the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the <code>"Name"</code> column of <code>babynames</code>. Note that the return value is also a <code>Series</code>.</p>
-<div id="1c3ef741" class="cell" data-execution_count="31">
+<div id="b228378e" class="cell" data-execution_count="31">
 <div class="sourceCode cell-code" id="cb41"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1"><a href="#cb41-1" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Name"</span>].value_counts().head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="31">
 <pre><code>Name
@@ -1727,7 +1733,7 @@ <h3 data-number="3.3.5" class="anchored" data-anchor-id="value_counts"><span cla
 <section id="unique" class="level3" data-number="3.3.6">
 <h3 data-number="3.3.6" class="anchored" data-anchor-id="unique"><span class="header-section-number">3.3.6</span> <code>.unique()</code></h3>
 <p>If we have a <code>Series</code> with many repeated values, then <code>.unique()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.unique.html">(documentation)</a> can be used to identify only the <em>unique</em> values. Here we return an array of all the distinct names in <code>babynames</code>.</p>
-<div id="c47f856e" class="cell" data-execution_count="32">
+<div id="c58d336d" class="cell" data-execution_count="32">
 <div class="sourceCode cell-code" id="cb43"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb43-1"><a href="#cb43-1" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Name"</span>].unique()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="32">
 <pre><code>array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
@@ -1738,7 +1744,7 @@ <h3 data-number="3.3.6" class="anchored" data-anchor-id="unique"><span class="he
 <section id="sort_values" class="level3" data-number="3.3.7">
 <h3 data-number="3.3.7" class="anchored" data-anchor-id="sort_values"><span class="header-section-number">3.3.7</span> <code>.sort_values()</code></h3>
 <p>Ordering a <code>DataFrame</code> can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are the largest 5 values. <code>.sort_values()</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html">(documentation)</a> allows us to order a <code>DataFrame</code> or <code>Series</code> by a specified column. Rows are returned in ascending order by default; passing <code>ascending=False</code> returns them in descending order instead.</p>
-<div id="642dae95" class="cell" data-execution_count="33">
+<div id="42e372fa" class="cell" data-execution_count="33">
 <div class="sourceCode cell-code" id="cb45"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1"><a href="#cb45-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Sort the "Count" column from highest to lowest</span></span>
 <span id="cb45-2"><a href="#cb45-2" aria-hidden="true" tabindex="-1"></a>babynames.sort_values(by<span class="op">=</span><span class="st">"Count"</span>, ascending<span class="op">=</span><span class="va">False</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="33">
@@ -1804,7 +1810,7 @@ <h3 data-number="3.3.7" class="anchored" data-anchor-id="sort_values"><span clas
 </div>
 </div>
 <p>Unlike when calling <code>.sort_values()</code> on a <code>DataFrame</code>, we do not need to explicitly specify the column used for sorting when calling <code>.sort_values()</code> on a <code>Series</code>. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.</p>
-<div id="dc0d57af" class="cell" data-execution_count="34">
+<div id="b503fdd7" class="cell" data-execution_count="34">
 <div class="sourceCode cell-code" id="cb46"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><a href="#cb46-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Sort the "Name" Series alphabetically</span></span>
 <span id="cb46-2"><a href="#cb46-2" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"Name"</span>].sort_values(ascending<span class="op">=</span><span class="va">True</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="34">
diff --git a/docs/pandas_3/pandas_3.html b/docs/pandas_3/pandas_3.html
index d79e56af..7105ce90 100644
--- a/docs/pandas_3/pandas_3.html
+++ b/docs/pandas_3/pandas_3.html
@@ -224,6 +224,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -311,7 +317,7 @@ <h2 id="toc-title">Pandas III</h2>
 <h2 data-number="4.1" class="anchored" data-anchor-id="custom-sorts"><span class="header-section-number">4.1</span> Custom Sorts</h2>
 <p>First, let’s finish our discussion about sorting. Let’s try to solve a sorting problem using different approaches. Assume we want to find the longest baby names and sort our data accordingly.</p>
 <p>We’ll start by loading the <code>babynames</code> dataset. Note that this dataset is filtered to only contain data from California.</p>
-<div id="9a430900" class="cell" data-execution_count="1">
+<div id="467862ec" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This code pulls census data and loads it into a DataFrame</span></span>
@@ -442,7 +448,7 @@ <h2 data-number="4.1" class="anchored" data-anchor-id="custom-sorts"><span class
 <section id="approach-1-create-a-temporary-column" class="level3" data-number="4.1.1">
 <h3 data-number="4.1.1" class="anchored" data-anchor-id="approach-1-create-a-temporary-column"><span class="header-section-number">4.1.1</span> Approach 1: Create a Temporary Column</h3>
 <p>One approach is to first create a column that contains the lengths of the names.</p>
-<div id="418a4a16" class="cell" data-execution_count="2">
+<div id="796b7f4e" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a Series of the length of each name</span></span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>babyname_lengths <span class="op">=</span> babynames[<span class="st">"Name"</span>].<span class="bu">str</span>.<span class="bu">len</span>()</span>
 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -518,7 +524,7 @@ <h3 data-number="4.1.1" class="anchored" data-anchor-id="approach-1-create-a-tem
 </div>
 </div>
 <p>We can then sort the <code>DataFrame</code> by that column using <code>.sort_values()</code>:</p>
-<div id="ebe6f60a" class="cell" data-execution_count="3">
+<div id="8588b084" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Sort by the temporary column</span></span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>babynames <span class="op">=</span> babynames.sort_values(by<span class="op">=</span><span class="st">"name_lengths"</span>, ascending<span class="op">=</span><span class="va">False</span>)</span>
 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>babynames.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -591,7 +597,7 @@ <h3 data-number="4.1.1" class="anchored" data-anchor-id="approach-1-create-a-tem
 </div>
 </div>
 <p>Finally, we can drop the <code>name_lengths</code> column from <code>babynames</code> to prevent our table from getting cluttered.</p>
-<div id="f142ba72" class="cell" data-execution_count="4">
+<div id="aaa250c8" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Drop the 'name_lengths' column</span></span>
 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>babynames <span class="op">=</span> babynames.drop(<span class="st">"name_lengths"</span>, axis<span class="op">=</span><span class="st">'columns'</span>)</span>
 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>babynames.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -661,7 +667,7 @@ <h3 data-number="4.1.1" class="anchored" data-anchor-id="approach-1-create-a-tem
 <section id="approach-2-sorting-using-the-key-argument" class="level3" data-number="4.1.2">
 <h3 data-number="4.1.2" class="anchored" data-anchor-id="approach-2-sorting-using-the-key-argument"><span class="header-section-number">4.1.2</span> Approach 2: Sorting using the <code>key</code> Argument</h3>
 <p>Another way to approach this is to use the <code>key</code> argument of <code>.sort_values()</code>. Here we can specify that we want to sort <code>"Name"</code> values by their length.</p>
-<div id="9a23d9a1" class="cell" data-execution_count="5">
+<div id="f0c94153" class="cell" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>babynames.sort_values(<span class="st">"Name"</span>, key<span class="op">=</span><span class="kw">lambda</span> x: x.<span class="bu">str</span>.<span class="bu">len</span>(), ascending<span class="op">=</span><span class="va">False</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="5">
 <div>
@@ -729,7 +735,7 @@ <h3 data-number="4.1.2" class="anchored" data-anchor-id="approach-2-sorting-usin
 <section id="approach-3-sorting-using-the-map-function" class="level3" data-number="4.1.3">
 <h3 data-number="4.1.3" class="anchored" data-anchor-id="approach-3-sorting-using-the-map-function"><span class="header-section-number">4.1.3</span> Approach 3: Sorting using the <code>map</code> Function</h3>
 <p>We can also use the <code>map</code> function on a <code>Series</code> to solve this. Say we want to sort the <code>babynames</code> table by the number of <code>"dr"</code>’s and <code>"ea"</code>’s in each <code>"Name"</code>. We’ll define the function <code>dr_ea_count</code> to help us out.</p>
-<div id="591fed6c" class="cell" data-execution_count="6">
+<div id="9776b2fc" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First, define a function to count the number of times "dr" or "ea" appear in each name</span></span>
 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> dr_ea_count(string):</span>
 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> string.count(<span class="st">'dr'</span>) <span class="op">+</span> string.count(<span class="st">'ea'</span>)</span>
@@ -809,7 +815,7 @@ <h3 data-number="4.1.3" class="anchored" data-anchor-id="approach-3-sorting-usin
 </div>
 </div>
 <p>We can drop the <code>dr_ea_count</code> column once we’re done using it to maintain a neat table.</p>
-<div id="e7f3afcb" class="cell" data-execution_count="7">
+<div id="cce2ff75" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Drop the `dr_ea_count` column</span></span>
 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>babynames <span class="op">=</span> babynames.drop(<span class="st">"dr_ea_count"</span>, axis <span class="op">=</span> <span class="st">'columns'</span>)</span>
 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>babynames.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -881,10 +887,10 @@ <h3 data-number="4.1.3" class="anchored" data-anchor-id="approach-3-sorting-usin
 <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.groupby"><span class="header-section-number">4.2</span> Aggregating Data with <code>.groupby</code></h2>
 <p>Up until this point, we have been working with individual rows of <code>DataFrame</code>s. As data scientists, we often wish to investigate trends across a larger <em>subset</em> of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our <code>DataFrame</code>. To do this, we’ll use <code>pandas</code> <code>GroupBy</code> objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.</p>
 <p>Let’s say we wanted to aggregate all rows in <code>babynames</code> for a given year.</p>
-<div id="aa2ada9a" class="cell" data-execution_count="8">
+<div id="1105be10" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Year"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="8">
-<pre><code>&lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x1175d0aa0&gt;</code></pre>
+<pre><code>&lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x10c075a00&gt;</code></pre>
 </div>
 </div>
 <p>What does this strange output mean? Calling <code>.groupby</code> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">(documentation)</a> has generated a <code>GroupBy</code> object. You can imagine this as a set of “mini” sub-<code>DataFrame</code>s, where each subframe contains all of the rows from <code>babynames</code> that correspond to a particular year.</p>
@@ -894,7 +900,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.gr
 </center>
 <p>We can’t work with a <code>GroupBy</code> object directly – that is why you saw that strange output earlier rather than a standard view of a <code>DataFrame</code>. To actually manipulate values within these “mini” <code>DataFrame</code>s, we’ll need to call an <em>aggregation method</em>. This is a method that tells <code>pandas</code> how to aggregate the values within the <code>GroupBy</code> object. Once the aggregation is applied, <code>pandas</code> will return a normal (now grouped) <code>DataFrame</code>.</p>
 <p>The first aggregation method we’ll consider is <code>.agg</code>. The <code>.agg</code> method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new <code>DataFrame</code> with one aggregated row per subframe. Let’s see this in action by finding the <code>sum</code> of all counts for each year in <code>babynames</code> – this is equivalent to finding the number of babies born in each year.</p>
-<div id="a9ae3f6c" class="cell" data-execution_count="9">
+<div id="c9e5c585" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>babynames[[<span class="st">"Year"</span>, <span class="st">"Count"</span>]].groupby(<span class="st">"Year"</span>).agg(<span class="st">"sum"</span>).head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="9">
 <div>
@@ -947,7 +953,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.gr
 </div>
 <p>Calling <code>.agg</code> has condensed each subframe back into a single row. This gives us our final output: a <code>DataFrame</code> that is now indexed by <code>"Year"</code>, with a single row for each unique year in the original <code>babynames</code> <code>DataFrame</code>.</p>
 <p>There are many different aggregation functions we can use, all of which are useful in different applications.</p>
-<div id="e77dddd4" class="cell" data-execution_count="10">
+<div id="8cf48903" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>babynames[[<span class="st">"Year"</span>, <span class="st">"Count"</span>]].groupby(<span class="st">"Year"</span>).agg(<span class="st">"min"</span>).head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="10">
 <div>
@@ -991,7 +997,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.gr
 </div>
 </div>
 </div>
-<div id="05fa26d7" class="cell" data-execution_count="11">
+<div id="69b3f22e" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>babynames[[<span class="st">"Year"</span>, <span class="st">"Count"</span>]].groupby(<span class="st">"Year"</span>).agg(<span class="st">"max"</span>).head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="11">
 <div>
@@ -1035,7 +1041,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.gr
 </div>
 </div>
 </div>
-<div id="8053701c" class="cell" data-execution_count="12">
+<div id="b9b38637" class="cell" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Same result, but now we explicitly tell pandas to only consider the "Count" column when summing</span></span>
 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Year"</span>)[[<span class="st">"Count"</span>]].agg(<span class="st">"sum"</span>).head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="12">
@@ -1089,7 +1095,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="aggregating-data-with-.gr
 <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions"><span class="header-section-number">4.2.1</span> Aggregation Functions</h3>
 <p>Because of this fairly broad requirement, <code>pandas</code> offers many ways of computing an aggregation.</p>
 <p><strong>Built-in</strong> Python operations – such as <code>sum</code>, <code>max</code>, and <code>min</code> – are automatically recognized by <code>pandas</code>.</p>
-<div id="bf53d0d4" class="cell" data-execution_count="13">
+<div id="3bfd1e3a" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co"># What is the minimum count for each name in any year?</span></span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Name"</span>)[[<span class="st">"Count"</span>]].agg(<span class="st">"min"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="13">
@@ -1134,7 +1140,7 @@ <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions">
 </div>
 </div>
 </div>
-<div id="876ad8d8" class="cell" data-execution_count="14">
+<div id="9bce41da" class="cell" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="co"># What is the largest single-year count of each name?</span></span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Name"</span>)[[<span class="st">"Count"</span>]].agg(<span class="st">"max"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="14">
@@ -1180,7 +1186,7 @@ <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions">
 </div>
 </div>
 <p>As mentioned previously, functions from the <code>NumPy</code> library, such as <code>np.mean</code>, <code>np.max</code>, <code>np.min</code>, and <code>np.sum</code>, are also fair game in <code>pandas</code>.</p>
-<div id="dd2cc874" class="cell" data-execution_count="15">
+<div id="f55852a6" class="cell" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co"># What is the average count for each name across all years?</span></span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Name"</span>)[[<span class="st">"Count"</span>]].agg(<span class="st">"mean"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="15">
@@ -1236,7 +1242,7 @@ <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions">
 </ul>
 <p>The latter two entries in this list – <code>"first"</code> and <code>"last"</code> – are unique to <code>pandas</code>. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where <em>multiple</em> columns in a group share identical information. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.</p>
 <p>Let’s illustrate this with an example. Say we add a new column to <code>babynames</code> that contains the first letter of each name.</p>
-<div id="2612e103" class="cell" data-execution_count="16">
+<div id="9f7987e3" class="cell" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Imagine we had an additional column, "First Letter". We'll explain this code next week</span></span>
 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a>babynames[<span class="st">"First Letter"</span>] <span class="op">=</span> babynames[<span class="st">"Name"</span>].<span class="bu">str</span>[<span class="dv">0</span>]</span>
 <span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1301,7 +1307,7 @@ <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions">
 <figcaption>Aggregating using “first”</figcaption>
 </figure>
 </div>
-<div id="4bbd9e6b" class="cell" data-execution_count="17">
+<div id="9d974500" class="cell" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>babynames_new.groupby(<span class="st">"Name"</span>).agg({<span class="st">"First Letter"</span>:<span class="st">"first"</span>, <span class="st">"Year"</span>:<span class="st">"max"</span>}).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="17">
 <div>
@@ -1356,7 +1362,7 @@ <h3 data-number="4.2.1" class="anchored" data-anchor-id="aggregation-functions">
 <section id="plotting-birth-counts" class="level3" data-number="4.2.2">
 <h3 data-number="4.2.2" class="anchored" data-anchor-id="plotting-birth-counts"><span class="header-section-number">4.2.2</span> Plotting Birth Counts</h3>
 <p>Let’s use <code>.agg</code> to find the total number of babies born in each year. Recall that using <code>.agg</code> with <code>.groupby()</code> follows the format: <code>df.groupby(column_name).agg(aggregation_function)</code>. The line of code below gives us the total number of babies born in each year.</p>
-<div id="0094ed26" class="cell" data-execution_count="18">
+<div id="e8153ebb" class="cell" data-execution_count="18">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Year"</span>)[[<span class="st">"Count"</span>]].agg(<span class="bu">sum</span>).head(<span class="dv">5</span>)</span>
@@ -1366,7 +1372,7 @@ <h3 data-number="4.2.2" class="anchored" data-anchor-id="plotting-birth-counts">
 <span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="co"># babynames.groupby("Year").sum(numeric_only=True)</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 <div class="cell-output cell-output-stderr">
-<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/390646742.py:1: FutureWarning:
+<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/390646742.py:1: FutureWarning:
 
 The provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
 </code></pre>
@@ -1416,7 +1422,7 @@ <h3 data-number="4.2.2" class="anchored" data-anchor-id="plotting-birth-counts">
 <p>Here’s an illustration of the process:</p>
 <p><img src="images/aggregation.png" alt="aggregation" width="600"></p>
 <p>Plotting the <code>DataFrame</code> we obtain tells an interesting story.</p>
-<div id="131a01f5" class="cell" data-execution_count="19">
+<div id="b15eaaa6" class="cell" data-execution_count="19">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> plotly.express <span class="im">as</span> px</span>
@@ -1424,9 +1430,9 @@ <h3 data-number="4.2.2" class="anchored" data-anchor-id="plotting-birth-counts">
 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a>px.line(puzzle2, y <span class="op">=</span> <span class="st">"Count"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 <div class="cell-output cell-output-display">
-<div>                            <div id="09577baa-e7ff-444c-a35e-1aca491620c3" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("09577baa-e7ff-444c-a35e-1aca491620c3")) {                    Plotly.newPlot(                        "09577baa-e7ff-444c-a35e-1aca491620c3",                        [{"hovertemplate":"Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"","orientation":"v","showlegend":false,"x":[1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[9163,9983,17946,22094,26926,35835,37501,39916,44692,45119,54142,58983,61004,67917,74451,73493,72910,74201,74264,72108,75294,71467,69522,66895,69789,71603,74932,83738,91626,93461,102627,114296,142033,159813,164349,171764,204945,232313,229033,233625,235582,250468,271681,287484,297099,304567,324186,340083,337562,345901,358544,363926,360475,361897,355386,336567,319421,318819,321040,333671,342411,310020,287239,275036,286947,290518,302547,315011,322241,343070,365973,382156,390581,394608,404961,425583,435964,453824,480602,512615,552647,549317,541054,524983,509302,494635,483288,468412,464300,460844,471649,466934,467742,477651,480892,484503,494971,497627,483360,460305,4446
19,437818,439402,431945,440683,431317,427015,411058,395436,386996,362882,362582,360023],"yaxis":"y","type":"scatter"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333
333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"wh
ite","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d922
1"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
+<div>                            <div id="8a8fcb45-64c6-4089-b63f-b985e66e2d63" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("8a8fcb45-64c6-4089-b63f-b985e66e2d63")) {                    Plotly.newPlot(                        "8a8fcb45-64c6-4089-b63f-b985e66e2d63",                        [{"hovertemplate":"Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"","orientation":"v","showlegend":false,"x":[1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[9163,9983,17946,22094,26926,35835,37501,39916,44692,45119,54142,58983,61004,67917,74451,73493,72910,74201,74264,72108,75294,71467,69522,66895,69789,71603,74932,83738,91626,93461,102627,114296,142033,159813,164349,171764,204945,232313,229033,233625,235582,250468,271681,287484,297099,304567,324186,340083,337562,345901,358544,363926,360475,361897,355386,336567,319421,318819,321040,333671,342411,310020,287239,275036,286947,290518,302547,315011,322241,343070,365973,382156,390581,394608,404961,425583,435964,453824,480602,512615,552647,549317,541054,524983,509302,494635,483288,468412,464300,460844,471649,466934,467742,477651,480892,484503,494971,497627,483360,460305,4446
19,437818,439402,431945,440683,431317,427015,411058,395436,386996,362882,362582,360023],"yaxis":"y","type":"scatter"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333
333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"wh
ite","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d922
1"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
                             
-var gd = document.getElementById('09577baa-e7ff-444c-a35e-1aca491620c3');
+var gd = document.getElementById('8a8fcb45-64c6-4089-b63f-b985e66e2d63');
 var x = new MutationObserver(function (mutations, observer) {{
         var display = window.getComputedStyle(gd).display;
         if (!display || display === 'none') {{
@@ -1470,7 +1476,7 @@ <h3 data-number="4.2.4" class="anchored" data-anchor-id="revisiting-the-.agg-fun
 <pre><code>babynames.groupby("Year").mean().head()</code></pre>
 <p>We can now put this all into practice. Say we want to find the baby name with sex “F” that has fallen in popularity the most in California. To calculate this, we can first create a metric: “Ratio to Peak” (RTP). The RTP is the ratio of babies born with a given name in 2022 to the <em>maximum</em> number of babies born with the name in <em>any</em> year.</p>
 <p>Let’s start by calculating this for one name, “Jennifer”.</p>
-<div id="15991a69" class="cell" data-execution_count="20">
+<div id="906d3480" class="cell" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We filter by babies with sex "F" and sort by "Year"</span></span>
 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>f_babynames <span class="op">=</span> babynames[babynames[<span class="st">"Sex"</span>] <span class="op">==</span> <span class="st">"F"</span>]</span>
 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a>f_babynames <span class="op">=</span> f_babynames.sort_values([<span class="st">"Year"</span>])</span>
@@ -1489,7 +1495,7 @@ <h3 data-number="4.2.4" class="anchored" data-anchor-id="revisiting-the-.agg-fun
 </div>
 </div>
 <p>By creating a function to calculate RTP and applying it to our <code>DataFrame</code> by using <code>.groupby()</code>, we can easily compute the RTP for all names at once!</p>
-<div id="2dcc036a" class="cell" data-execution_count="21">
+<div id="29a0495d" class="cell" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb25"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> ratio_to_peak(series):</span>
 <span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> series.iloc[<span class="op">-</span><span class="dv">1</span>] <span class="op">/</span> <span class="bu">max</span>(series)</span>
 <span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1557,7 +1563,7 @@ <h3 data-number="4.2.5" class="anchored" data-anchor-id="nuisance-columns"><span
 <section id="renaming-columns-after-grouping" class="level3" data-number="4.2.6">
 <h3 data-number="4.2.6" class="anchored" data-anchor-id="renaming-columns-after-grouping"><span class="header-section-number">4.2.6</span> Renaming Columns After Grouping</h3>
 <p>By default, <code>.groupby</code> will not rename any aggregated columns. As we can see in the table above, the aggregated column is still named <code>Count</code> even though it now represents the RTP. For better readability, we can rename <code>Count</code> to <code>Count RTP</code>.</p>
-<div id="694e2254" class="cell" data-execution_count="22">
+<div id="ed093b39" class="cell" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb26"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a>rtp_table <span class="op">=</span> rtp_table.rename(columns <span class="op">=</span> {<span class="st">"Count"</span>: <span class="st">"Count RTP"</span>})</span>
 <span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a>rtp_table</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="22">
@@ -1644,7 +1650,7 @@ <h3 data-number="4.2.6" class="anchored" data-anchor-id="renaming-columns-after-
 <section id="some-data-science-payoff" class="level3" data-number="4.2.7">
 <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payoff"><span class="header-section-number">4.2.7</span> Some Data Science Payoff</h3>
 <p>By sorting <code>rtp_table</code>, we can see the names whose popularity has decreased the most.</p>
-<div id="718ff2d0" class="cell" data-execution_count="23">
+<div id="5af25537" class="cell" data-execution_count="23">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a>rtp_table <span class="op">=</span> rtp_table.rename(columns <span class="op">=</span> {<span class="st">"Count"</span>: <span class="st">"Count RTP"</span>})</span>
 <span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a>rtp_table.sort_values(<span class="st">"Count RTP"</span>).head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="23">
@@ -1697,16 +1703,16 @@ <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payof
 </div>
 </div>
 <p>To visualize the above <code>DataFrame</code>, let’s look at the line plot below:</p>
-<div id="af0a4c6a" class="cell" data-execution_count="24">
+<div id="a0e5072f" class="cell" data-execution_count="24">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb28"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> plotly.express <span class="im">as</span> px</span>
 <span id="cb28-2"><a href="#cb28-2" aria-hidden="true" tabindex="-1"></a>px.line(f_babynames[f_babynames[<span class="st">"Name"</span>] <span class="op">==</span> <span class="st">"Debra"</span>], x <span class="op">=</span> <span class="st">"Year"</span>, y <span class="op">=</span> <span class="st">"Count"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 <div class="cell-output cell-output-display">
-<div>                            <div id="ce7c593e-0966-4924-b9a4-e73aea4f9b07" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("ce7c593e-0966-4924-b9a4-e73aea4f9b07")) {                    Plotly.newPlot(                        "ce7c593e-0966-4924-b9a4-e73aea4f9b07",                        [{"hovertemplate":"Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"","orientation":"v","showlegend":false,"x":[1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2012,2013,2016],"xaxis":"x","y":[7,7,8,15,19,20,56,92,199,601,1510,2351,3295,3784,3969,3755,3318,2660,2290,2014,1647,1592,1430,1287,1154,958,818,748,647,547,463,318,242,236,159,151,151,164,130,141,97,114,97,95,93,64,78,69,71,51,62,41,34,28,28,12,14,16,10,13,14,10,7,12,13,12,13,6,7,5,8,5],"yaxis":"y","type":"scatter"}],                        
{"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"
],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinec
olor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":tr
ue,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
+<div>                            <div id="8a1d68f8-3bce-4fab-b1f9-725dfd246ba2" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("8a1d68f8-3bce-4fab-b1f9-725dfd246ba2")) {                    Plotly.newPlot(                        "8a1d68f8-3bce-4fab-b1f9-725dfd246ba2",                        [{"hovertemplate":"Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"","orientation":"v","showlegend":false,"x":[1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2012,2013,2016],"xaxis":"x","y":[7,7,8,15,19,20,56,92,199,601,1510,2351,3295,3784,3969,3755,3318,2660,2290,2014,1647,1592,1430,1287,1154,958,818,748,647,547,463,318,242,236,159,151,151,164,130,141,97,114,97,95,93,64,78,69,71,51,62,41,34,28,28,12,14,16,10,13,14,10,7,12,13,12,13,6,7,5,8,5],"yaxis":"y","type":"scatter"}],                        
{"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"
],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinec
olor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":tr
ue,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
                             
-var gd = document.getElementById('ce7c593e-0966-4924-b9a4-e73aea4f9b07');
+var gd = document.getElementById('8a1d68f8-3bce-4fab-b1f9-725dfd246ba2');
 var x = new MutationObserver(function (mutations, observer) {{
         var display = window.getComputedStyle(gd).display;
         if (!display || display === 'none') {{
@@ -1732,7 +1738,7 @@ <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payof
 </div>
 </div>
 <p>We can get the list of the top 10 names and then plot popularity with the following code:</p>
-<div id="e1a1a9fb" class="cell" data-execution_count="25">
+<div id="cd8e91e6" class="cell" data-execution_count="25">
 <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a>top10 <span class="op">=</span> rtp_table.sort_values(<span class="st">"Count RTP"</span>).head(<span class="dv">10</span>).index</span>
 <span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a>px.line(</span>
 <span id="cb29-3"><a href="#cb29-3" aria-hidden="true" tabindex="-1"></a>    f_babynames[f_babynames[<span class="st">"Name"</span>].isin(top10)], </span>
@@ -1747,9 +1753,9 @@ <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payof
 </code></pre>
 </div>
 <div class="cell-output cell-output-display">
-<div>                            <div id="3df35394-40ca-4a22-bbe4-6d85fd7ce80a" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("3df35394-40ca-4a22-bbe4-6d85fd7ce80a")) {                    Plotly.newPlot(                        "3df35394-40ca-4a22-bbe4-6d85fd7ce80a",                        [{"hovertemplate":"Name=Carol<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Carol","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Carol","orientation":"v","showlegend":true,"x":[1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[8,13,17,16,26,38,59,47,55,48,64,67,74,94,138,153,151,148,193,279,270,297,367,453,559,669,873,1015,1050,1109,1079,1339,1672,1937,2089,2138,2152,2201,1954,1779,1737,1734,1727,1597,1684,1651,1704,1703,1545,1480,1359,1283,1191,993,1034,815,622,577,543,468,366,267,223,187,173,146,145,145,121,132,123,128,106,114,111,101,120,107,108,134,150,136,129,89,92,75,87,64,61,46,64,33,43,47,52,76,62,38,44,26,17,47,31,36,24,13,25,18,29,20,17,8,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Susan<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Susan","line":{"color":"#EF553B","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name"
:"Susan","orientation":"v","showlegend":true,"x":[1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[6,8,8,10,16,17,15,20,22,21,19,15,22,26,32,29,43,25,37,63,47,63,74,101,118,138,183,271,433,630,795,1058,1380,1596,1991,2689,2831,3338,3180,3260,3346,3424,3753,3934,3900,3771,3631,3504,3123,3145,3135,2952,2839,2535,2008,1825,1644,1367,1232,1070,861,651,530,552,496,456,437,424,409,420,361,391,352,338,273,280,272,286,267,272,260,196,202,172,152,152,114,116,103,100,104,85,76,70,71,74,53,56,41,39,43,28,44,26,45,22,26,22,19,17,8,13],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Tina<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Tina","line":{"color":"#00cc96","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Tina","orientation":"v","showlegend":true,"x":[1915,1916,1917,1918,1920,1921,1922,1924,1925,1927,1928,1929,1930,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[5,6,5,5,5,7,5,9,5,8,8,5,10,10,7,8,12,9,28,45,43,53,64,80,80,88,92,128,168,163,177,366,569,569,700,753,889,1045,1228,1212,1129,1202,1282,1342,1402,1302,1248,1091,941,6
34,642,546,450,370,414,363,335,371,310,268,271,310,238,252,252,208,180,196,163,171,147,121,111,91,80,83,90,80,67,64,63,69,36,37,47,39,39,27,39,28,46,38,33,36,26,21,15,13,6],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Cheryl<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Cheryl","line":{"color":"#ab63fa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Cheryl","orientation":"v","showlegend":true,"x":[1930,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2011,2012,2013,2014,2015,2016,2017,2018,2019,2021,2022],"xaxis":"x","y":[6,8,12,10,16,76,49,42,48,87,377,759,801,1063,1093,1021,916,903,993,955,1058,1465,1639,1715,1833,1832,1639,1624,1565,1420,1295,1207,1051,950,899,751,635,550,428,371,293,271,236,199,178,303,299,272,204,229,164,135,129,130,98,106,88,90,65,55,39,47,38,30,30,19,22,24,14,11,16,17,16,13,21,14,11,15,12,10,12,15,8,10,9,6,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Michele<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Michele","line":{"color":"#FFA15A","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Michele","orientation":"v","showlegend":true,"x":[1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2022],"xaxis":"x","y":[7,8,5,8,18,34,113,132,166,171,172,253,213,335,295,306,401,421,500,498,464,454,470,506,576,763
,766,775,768,796,1037,1033,1111,1016,973,700,702,571,494,484,437,390,381,305,281,223,230,227,200,162,206,146,143,164,137,142,125,104,82,65,52,47,45,38,28,37,27,22,28,16,21,15,15,11,14,7,5,10,6,11,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Debbie<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Debbie","line":{"color":"#19d3f3","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Debbie","orientation":"v","showlegend":true,"x":[1936,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2014,2015,2016,2017,2021],"xaxis":"x","y":[5,9,9,10,16,11,32,74,91,115,120,191,233,300,427,697,902,1313,1656,1776,1675,1547,1458,1215,1004,648,504,415,338,279,243,192,145,108,108,92,72,64,87,91,81,65,79,67,74,64,56,71,78,93,85,78,50,61,70,53,46,39,22,28,19,11,16,14,13,8,21,10,11,10,12,8,9,6,5,5,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Terri<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Terri","line":{"color":"#FF6692","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Terri","orientation":"v","showlegend":true,"x":[1938,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2004,2005,2006,2016,2021,2022],"xaxis":"x","y":[6,8,12,26,32,38,65,99,130,132,168,154,236,306,379,542,604,685,839,875,1052,964,937,902,826,737,486,448,398,323,312,263,191,153,120,106,81,59,84,57,44,49,47,53,44,36,37,35,32,34,20,26,29,15,19,22,11,15,12,13,11,14,9,7,6,7,5,5,5,5],"yaxis":"
y","type":"scatter"},{"hovertemplate":"Name=Shannon<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Shannon","line":{"color":"#B6E880","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Shannon","orientation":"v","showlegend":true,"x":[1938,1939,1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[6,9,6,10,14,19,25,16,34,23,34,43,51,59,73,83,111,106,126,129,161,145,206,216,305,409,441,516,587,932,1419,1650,1436,1198,1090,1127,982,1218,1136,1052,991,923,968,969,971,945,872,803,699,642,597,527,493,594,615,531,438,428,366,303,217,199,200,165,133,133,110,90,88,63,42,43,37,41,32,19,31,22,17,14,21,8,8,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Debra<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Debra","line":{"color":"#FF97FF","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Debra","orientation":"v","showlegend":true,"x":[1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2012,2013,2016],"xaxis":"x","y":[7,7,8,15,19,20,56,92,199,601,1510,2351,3295,3784,3969,3755,3318,2660,2290,2014,1647,1592,1430,1287,1154,958,818,748,647,547,463,318,242,236,159,151,151,164,130,141,97,114,97,95,93,64,78,69,71,51,62,41,34,28,28,12,14,16,10,13,14,10,7,12,13,12,13,6,7,5,8,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Tammy<br>Year=%{x}<br>Count=%{y}<extra>
</extra>","legendgroup":"Tammy","line":{"color":"#FECB52","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Tammy","orientation":"v","showlegend":true,"x":[1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2019,2022],"xaxis":"x","y":[7,5,10,9,12,13,10,9,9,13,28,14,26,37,368,746,990,1038,1136,1223,1539,1273,1219,1168,1143,1099,977,1013,859,704,544,421,392,328,275,229,227,181,168,157,96,120,102,85,120,88,85,94,77,82,74,61,49,45,45,54,50,47,49,45,44,36,30,24,29,14,16,12,11,9,5,13,9,15,11,7,5],"yaxis":"y","type":"scatter"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,
"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"o
utlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential"
:[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"title":{"te
xt":"Name"},"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
+<div>                            <div id="d8989a65-ac02-4505-9434-8447bd46bfbb" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("d8989a65-ac02-4505-9434-8447bd46bfbb")) {                    Plotly.newPlot(                        "d8989a65-ac02-4505-9434-8447bd46bfbb",                        [{"hovertemplate":"Name=Carol<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Carol","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Carol","orientation":"v","showlegend":true,"x":[1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[8,13,17,16,26,38,59,47,55,48,64,67,74,94,138,153,151,148,193,279,270,297,367,453,559,669,873,1015,1050,1109,1079,1339,1672,1937,2089,2138,2152,2201,1954,1779,1737,1734,1727,1597,1684,1651,1704,1703,1545,1480,1359,1283,1191,993,1034,815,622,577,543,468,366,267,223,187,173,146,145,145,121,132,123,128,106,114,111,101,120,107,108,134,150,136,129,89,92,75,87,64,61,46,64,33,43,47,52,76,62,38,44,26,17,47,31,36,24,13,25,18,29,20,17,8,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Susan<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Susan","line":{"color":"#EF553B","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name"
:"Susan","orientation":"v","showlegend":true,"x":[1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[6,8,8,10,16,17,15,20,22,21,19,15,22,26,32,29,43,25,37,63,47,63,74,101,118,138,183,271,433,630,795,1058,1380,1596,1991,2689,2831,3338,3180,3260,3346,3424,3753,3934,3900,3771,3631,3504,3123,3145,3135,2952,2839,2535,2008,1825,1644,1367,1232,1070,861,651,530,552,496,456,437,424,409,420,361,391,352,338,273,280,272,286,267,272,260,196,202,172,152,152,114,116,103,100,104,85,76,70,71,74,53,56,41,39,43,28,44,26,45,22,26,22,19,17,8,13],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Tina<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Tina","line":{"color":"#00cc96","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Tina","orientation":"v","showlegend":true,"x":[1915,1916,1917,1918,1920,1921,1922,1924,1925,1927,1928,1929,1930,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[5,6,5,5,5,7,5,9,5,8,8,5,10,10,7,8,12,9,28,45,43,53,64,80,80,88,92,128,168,163,177,366,569,569,700,753,889,1045,1228,1212,1129,1202,1282,1342,1402,1302,1248,1091,941,6
34,642,546,450,370,414,363,335,371,310,268,271,310,238,252,252,208,180,196,163,171,147,121,111,91,80,83,90,80,67,64,63,69,36,37,47,39,39,27,39,28,46,38,33,36,26,21,15,13,6],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Cheryl<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Cheryl","line":{"color":"#ab63fa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Cheryl","orientation":"v","showlegend":true,"x":[1930,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2011,2012,2013,2014,2015,2016,2017,2018,2019,2021,2022],"xaxis":"x","y":[6,8,12,10,16,76,49,42,48,87,377,759,801,1063,1093,1021,916,903,993,955,1058,1465,1639,1715,1833,1832,1639,1624,1565,1420,1295,1207,1051,950,899,751,635,550,428,371,293,271,236,199,178,303,299,272,204,229,164,135,129,130,98,106,88,90,65,55,39,47,38,30,30,19,22,24,14,11,16,17,16,13,21,14,11,15,12,10,12,15,8,10,9,6,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Michele<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Michele","line":{"color":"#FFA15A","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Michele","orientation":"v","showlegend":true,"x":[1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2022],"xaxis":"x","y":[7,8,5,8,18,34,113,132,166,171,172,253,213,335,295,306,401,421,500,498,464,454,470,506,576,763
,766,775,768,796,1037,1033,1111,1016,973,700,702,571,494,484,437,390,381,305,281,223,230,227,200,162,206,146,143,164,137,142,125,104,82,65,52,47,45,38,28,37,27,22,28,16,21,15,15,11,14,7,5,10,6,11,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Debbie<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Debbie","line":{"color":"#19d3f3","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Debbie","orientation":"v","showlegend":true,"x":[1936,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2014,2015,2016,2017,2021],"xaxis":"x","y":[5,9,9,10,16,11,32,74,91,115,120,191,233,300,427,697,902,1313,1656,1776,1675,1547,1458,1215,1004,648,504,415,338,279,243,192,145,108,108,92,72,64,87,91,81,65,79,67,74,64,56,71,78,93,85,78,50,61,70,53,46,39,22,28,19,11,16,14,13,8,21,10,11,10,12,8,9,6,5,5,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Terri<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Terri","line":{"color":"#FF6692","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Terri","orientation":"v","showlegend":true,"x":[1938,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2004,2005,2006,2016,2021,2022],"xaxis":"x","y":[6,8,12,26,32,38,65,99,130,132,168,154,236,306,379,542,604,685,839,875,1052,964,937,902,826,737,486,448,398,323,312,263,191,153,120,106,81,59,84,57,44,49,47,53,44,36,37,35,32,34,20,26,29,15,19,22,11,15,12,13,11,14,9,7,6,7,5,5,5,5],"yaxis":"
y","type":"scatter"},{"hovertemplate":"Name=Shannon<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Shannon","line":{"color":"#B6E880","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Shannon","orientation":"v","showlegend":true,"x":[1938,1939,1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022],"xaxis":"x","y":[6,9,6,10,14,19,25,16,34,23,34,43,51,59,73,83,111,106,126,129,161,145,206,216,305,409,441,516,587,932,1419,1650,1436,1198,1090,1127,982,1218,1136,1052,991,923,968,969,971,945,872,803,699,642,597,527,493,594,615,531,438,428,366,303,217,199,200,165,133,133,110,90,88,63,42,43,37,41,32,19,31,22,17,14,21,8,8,7],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Debra<br>Year=%{x}<br>Count=%{y}<extra></extra>","legendgroup":"Debra","line":{"color":"#FF97FF","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Debra","orientation":"v","showlegend":true,"x":[1940,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2012,2013,2016],"xaxis":"x","y":[7,7,8,15,19,20,56,92,199,601,1510,2351,3295,3784,3969,3755,3318,2660,2290,2014,1647,1592,1430,1287,1154,958,818,748,647,547,463,318,242,236,159,151,151,164,130,141,97,114,97,95,93,64,78,69,71,51,62,41,34,28,28,12,14,16,10,13,14,10,7,12,13,12,13,6,7,5,8,5],"yaxis":"y","type":"scatter"},{"hovertemplate":"Name=Tammy<br>Year=%{x}<br>Count=%{y}<extra>
</extra>","legendgroup":"Tammy","line":{"color":"#FECB52","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines","name":"Tammy","orientation":"v","showlegend":true,"x":[1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2019,2022],"xaxis":"x","y":[7,5,10,9,12,13,10,9,9,13,28,14,26,37,368,746,990,1038,1136,1223,1539,1273,1219,1168,1143,1099,977,1013,859,704,544,421,392,328,275,229,227,181,168,157,96,120,102,85,120,88,85,94,77,82,74,61,49,45,45,54,50,47,49,45,44,36,30,24,29,14,16,12,11,9,5,13,9,15,11,7,5],"yaxis":"y","type":"scatter"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,
"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"o
utlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential"
:[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Year"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"Count"}},"legend":{"title":{"te
xt":"Name"},"tracegroupgap":0}},                        {"responsive": true}                    ).then(function(){
                             
-var gd = document.getElementById('3df35394-40ca-4a22-bbe4-6d85fd7ce80a');
+var gd = document.getElementById('d8989a65-ac02-4505-9434-8447bd46bfbb');
 var x = new MutationObserver(function (mutations, observer) {{
         var display = window.getComputedStyle(gd).display;
         if (!display || display === 'none') {{
@@ -1775,7 +1781,7 @@ <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payof
 </div>
 </div>
 <p>As a quick exercise, consider what code would compute the total number of babies with each name.</p>
-<div id="215d2c20" class="cell" data-execution_count="26">
+<div id="fe396ca5" class="cell" data-execution_count="26">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb31"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a>babynames.groupby(<span class="st">"Name"</span>)[[<span class="st">"Count"</span>]].agg(<span class="st">"sum"</span>).head()</span>
@@ -1829,7 +1835,7 @@ <h3 data-number="4.2.7" class="anchored" data-anchor-id="some-data-science-payof
 <section id="groupby-continued" class="level2" data-number="4.3">
 <h2 data-number="4.3" class="anchored" data-anchor-id="groupby-continued"><span class="header-section-number">4.3</span> <code>.groupby()</code>, Continued</h2>
 <p>We’ll work with the <code>elections</code> <code>DataFrame</code> again.</p>
-<div id="06a518c9" class="cell" data-execution_count="27">
+<div id="8369d6de" class="cell" data-execution_count="27">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb32"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
@@ -1909,7 +1915,7 @@ <h2 data-number="4.3" class="anchored" data-anchor-id="groupby-continued"><span
 <section id="raw-groupby-objects" class="level3" data-number="4.3.1">
 <h3 data-number="4.3.1" class="anchored" data-anchor-id="raw-groupby-objects"><span class="header-section-number">4.3.1</span> Raw <code>GroupBy</code> Objects</h3>
 <p>The result of <code>groupby</code> applied to a <code>DataFrame</code> is a <code>DataFrameGroupBy</code> object, <strong>not</strong> a <code>DataFrame</code>.</p>
-<div id="15424471" class="cell" data-execution_count="28">
+<div id="91f30a86" class="cell" data-execution_count="28">
 <div class="sourceCode cell-code" id="cb33"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a>grouped_by_year <span class="op">=</span> elections.groupby(<span class="st">"Year"</span>)</span>
 <span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span>(grouped_by_year)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="28">
@@ -1917,14 +1923,14 @@ <h3 data-number="4.3.1" class="anchored" data-anchor-id="raw-groupby-objects"><s
 </div>
 </div>
 <p>There are several ways to look into <code>DataFrameGroupBy</code> objects:</p>
-<div id="d4f0a621" class="cell" data-execution_count="29">
+<div id="5c52aef8" class="cell" data-execution_count="29">
 <div class="sourceCode cell-code" id="cb35"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true" tabindex="-1"></a>grouped_by_party <span class="op">=</span> elections.groupby(<span class="st">"Party"</span>)</span>
 <span id="cb35-2"><a href="#cb35-2" aria-hidden="true" tabindex="-1"></a>grouped_by_party.groups</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="29">
 <pre><code>{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}</code></pre>
 </div>
 </div>
-<div id="2d701a60" class="cell" data-execution_count="30">
+<div id="04a67214" class="cell" data-execution_count="30">
 <div class="sourceCode cell-code" id="cb37"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a>grouped_by_party.get_group(<span class="st">"Socialist"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="30">
 <div>
@@ -2052,7 +2058,7 @@ <h3 data-number="4.3.2" class="anchored" data-anchor-id="other-groupby-methods">
 <li><a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html#pandas.core.groupby.DataFrameGroupBy.count"><code>.count</code></a>: creates a new <strong><code>DataFrame</code></strong> with the number of entries, excluding missing values.</li>
 </ul>
 <p>Let’s illustrate some examples by creating a <code>DataFrame</code> called <code>df</code>.</p>
-<div id="b2faa901" class="cell" data-execution_count="31">
+<div id="f83218c3" class="cell" data-execution_count="31">
 <div class="sourceCode cell-code" id="cb38"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true" tabindex="-1"></a>df <span class="op">=</span> pd.DataFrame({<span class="st">'letter'</span>:[<span class="st">'A'</span>,<span class="st">'A'</span>,<span class="st">'B'</span>,<span class="st">'C'</span>,<span class="st">'C'</span>,<span class="st">'C'</span>], </span>
 <span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>                   <span class="st">'num'</span>:[<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,np.nan,<span class="dv">4</span>], </span>
 <span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a>                   <span class="st">'state'</span>:[np.nan, <span class="st">'tx'</span>, <span class="st">'fl'</span>, <span class="st">'hi'</span>, np.nan, <span class="st">'ak'</span>]})</span>
@@ -2114,7 +2120,7 @@ <h3 data-number="4.3.2" class="anchored" data-anchor-id="other-groupby-methods">
 </div>
 </div>
 <p>Note the slight difference between <code>.size()</code> and <code>.count()</code>: while <code>.size()</code> returns a <code>Series</code> and counts the number of entries including the missing values, <code>.count()</code> returns a <code>DataFrame</code> and counts the number of entries in each column <em>excluding missing values</em>.</p>
-<div id="a0af1912" class="cell" data-execution_count="32">
+<div id="6fb4bcae" class="cell" data-execution_count="32">
 <div class="sourceCode cell-code" id="cb39"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a>df.groupby(<span class="st">"letter"</span>).size()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="32">
 <pre><code>letter
@@ -2124,7 +2130,7 @@ <h3 data-number="4.3.2" class="anchored" data-anchor-id="other-groupby-methods">
 dtype: int64</code></pre>
 </div>
 </div>
-<div id="421d5c4d" class="cell" data-execution_count="33">
+<div id="8fba9b01" class="cell" data-execution_count="33">
 <div class="sourceCode cell-code" id="cb41"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1"><a href="#cb41-1" aria-hidden="true" tabindex="-1"></a>df.groupby(<span class="st">"letter"</span>).count()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="33">
 <div>
@@ -2166,7 +2172,7 @@ <h3 data-number="4.3.2" class="anchored" data-anchor-id="other-groupby-methods">
 </div>
 </div>
 <p>You might recall that the <code>value_counts()</code> method from the previous note does something similar. It turns out <code>value_counts()</code> and <code>groupby.size()</code> return the same counts, except that <code>value_counts()</code> automatically sorts the resulting <code>Series</code> in descending order.</p>
-<div id="ee42103e" class="cell" data-execution_count="34">
+<div id="14de8b05" class="cell" data-execution_count="34">
 <div class="sourceCode cell-code" id="cb42"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a>df[<span class="st">"letter"</span>].value_counts()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="34">
 <pre><code>letter
@@ -2206,7 +2212,7 @@ <h3 data-number="4.3.3" class="anchored" data-anchor-id="filtering-by-group"><sp
 <li>Return all <code>DataFrame</code> rows that correspond to these years</li>
 </ul>
 <p>For each year, we need to find the maximum <code>%</code> among <em>all</em> rows for that year. If this maximum <code>%</code> is lower than 45%, we will tell <code>pandas</code> to keep all rows corresponding to that year.</p>
-<div id="feb41d96" class="cell" data-execution_count="35">
+<div id="3a989d64" class="cell" data-execution_count="35">
 <div class="sourceCode cell-code" id="cb44"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><a href="#cb44-1" aria-hidden="true" tabindex="-1"></a>elections.groupby(<span class="st">"Year"</span>).<span class="bu">filter</span>(<span class="kw">lambda</span> sf: sf[<span class="st">"%"</span>].<span class="bu">max</span>() <span class="op">&lt;</span> <span class="dv">45</span>).head(<span class="dv">9</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="35">
 <div>
@@ -2321,10 +2327,10 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 <p>What if we wish to aggregate our <code>DataFrame</code> using a non-standard function – for example, a function of our own design? We can do so by combining <code>.agg</code> with <code>lambda</code> expressions.</p>
 <p>Let’s first consider a puzzle to jog our memory. We will attempt to find the <code>Candidate</code> from each <code>Party</code> with the highest <code>%</code> of votes.</p>
 <p>A naive approach may be to group by the <code>Party</code> column and aggregate by the maximum.</p>
-<div id="720d01c6" class="cell" data-execution_count="36">
+<div id="c24a963b" class="cell" data-execution_count="36">
 <div class="sourceCode cell-code" id="cb45"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1"><a href="#cb45-1" aria-hidden="true" tabindex="-1"></a>elections.groupby(<span class="st">"Party"</span>).agg(<span class="bu">max</span>).head(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stderr">
-<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/4278286395.py:1: FutureWarning:
+<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/4278286395.py:1: FutureWarning:
 
 The provided callable &lt;built-in function max&gt; is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
 </code></pre>
@@ -2452,7 +2458,7 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 <li>Group by <code>Party</code> and select the first row of each sub-<code>DataFrame</code></li>
 </ol>
 <p>While it may seem unintuitive, sorting <code>elections</code> in descending order of <code>%</code> is extremely helpful. If we then group by <code>Party</code>, the first row of each sub-<code>DataFrame</code> will contain information about the <code>Candidate</code> with the highest vote <code>%</code>.</p>
-<div id="7fe084b9" class="cell" data-execution_count="37">
+<div id="f7d5d740" class="cell" data-execution_count="37">
 <div class="sourceCode cell-code" id="cb47"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb47-1"><a href="#cb47-1" aria-hidden="true" tabindex="-1"></a>elections_sorted_by_percent <span class="op">=</span> elections.sort_values(<span class="st">"%"</span>, ascending<span class="op">=</span><span class="va">False</span>)</span>
 <span id="cb47-2"><a href="#cb47-2" aria-hidden="true" tabindex="-1"></a>elections_sorted_by_percent.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="37">
@@ -2523,7 +2529,7 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 </div>
 </div>
 </div>
-<div id="04e351fb" class="cell" data-execution_count="38">
+<div id="38a15ca4" class="cell" data-execution_count="38">
 <div class="sourceCode cell-code" id="cb48"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1"><a href="#cb48-1" aria-hidden="true" tabindex="-1"></a>elections_sorted_by_percent.groupby(<span class="st">"Party"</span>).agg(<span class="kw">lambda</span> x : x.iloc[<span class="dv">0</span>]).head(<span class="dv">10</span>)</span>
 <span id="cb48-2"><a href="#cb48-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb48-3"><a href="#cb48-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Equivalent to the below code</span></span>
@@ -2644,7 +2650,7 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 <p>More generally, <code>lambda</code> functions let us define custom aggregation functions that aren’t built into <code>pandas</code>. When passed to <code>.agg</code>, the <code>lambda</code> is called on each column of each sub-<code>DataFrame</code>, so its input parameter <code>x</code> is a <code>Series</code>. This is why <code>lambda x : x.iloc[0]</code> selects the first row of each group.</p>
 <p>In fact, there are a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We’ve given a few examples below.</p>
 <p><strong>Note</strong>: Understanding these alternative solutions is not required. They are given to demonstrate the vast number of problem-solving approaches in <code>pandas</code>.</p>
-<div id="9a0f659e" class="cell" data-execution_count="39">
+<div id="f67a4227" class="cell" data-execution_count="39">
 <div class="sourceCode cell-code" id="cb49"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1"><a href="#cb49-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Using the idxmax function</span></span>
 <span id="cb49-2"><a href="#cb49-2" aria-hidden="true" tabindex="-1"></a>best_per_party <span class="op">=</span> elections.loc[elections.groupby(<span class="st">'Party'</span>)[<span class="st">'%'</span>].idxmax()]</span>
 <span id="cb49-3"><a href="#cb49-3" aria-hidden="true" tabindex="-1"></a>best_per_party.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -2716,7 +2722,7 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 </div>
 </div>
 </div>
-<div id="ef6735fe" class="cell" data-execution_count="40">
+<div id="b6b8661f" class="cell" data-execution_count="40">
 <div class="sourceCode cell-code" id="cb50"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb50-1"><a href="#cb50-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Using the .drop_duplicates function</span></span>
 <span id="cb50-2"><a href="#cb50-2" aria-hidden="true" tabindex="-1"></a>best_per_party2 <span class="op">=</span> elections.sort_values(<span class="st">'%'</span>).drop_duplicates([<span class="st">'Party'</span>], keep<span class="op">=</span><span class="st">'last'</span>)</span>
 <span id="cb50-3"><a href="#cb50-3" aria-hidden="true" tabindex="-1"></a>best_per_party2.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -2794,7 +2800,7 @@ <h3 data-number="4.3.4" class="anchored" data-anchor-id="aggregation-with-lambda
 <h2 data-number="4.4" class="anchored" data-anchor-id="aggregating-data-with-pivot-tables"><span class="header-section-number">4.4</span> Aggregating Data with Pivot Tables</h2>
 <p>We now know that <code>.groupby</code> gives us the ability to group and aggregate data across our <code>DataFrame</code>. The examples above formed groups using just one column in the <code>DataFrame</code>. It’s possible to group by multiple columns at once by passing in a list of column names to <code>.groupby</code>.</p>
 <p>Let’s consider the <code>babynames</code> dataset again. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we’ll group by <em>both</em> the <code>"Year"</code> and <code>"Sex"</code> columns.</p>
-<div id="cb73cf99" class="cell" data-execution_count="41">
+<div id="48515cf2" class="cell" data-execution_count="41">
 <div class="sourceCode cell-code" id="cb51"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1"><a href="#cb51-1" aria-hidden="true" tabindex="-1"></a>babynames.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="41">
 <div>
@@ -2864,12 +2870,12 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="aggregating-data-with-piv
 </div>
 </div>
 </div>
-<div id="2f1b25e8" class="cell" data-execution_count="42">
+<div id="597346b2" class="cell" data-execution_count="42">
 <div class="sourceCode cell-code" id="cb52"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><a href="#cb52-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Find the total number of baby names associated with each sex for each </span></span>
 <span id="cb52-2"><a href="#cb52-2" aria-hidden="true" tabindex="-1"></a><span class="co"># year in the data</span></span>
 <span id="cb52-3"><a href="#cb52-3" aria-hidden="true" tabindex="-1"></a>babynames.groupby([<span class="st">"Year"</span>, <span class="st">"Sex"</span>])[[<span class="st">"Count"</span>]].agg(<span class="bu">sum</span>).head(<span class="dv">6</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stderr">
-<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/3186035650.py:3: FutureWarning:
+<pre><code>/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/3186035650.py:3: FutureWarning:
 
 The provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
 </code></pre>
@@ -2931,7 +2937,7 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="aggregating-data-with-piv
 <p>Here’s an illustration of the process:</p>
 <p><img src="images/pivot.png" alt="groupby_demo" width="600"></p>
 <p>The best way to understand pivot tables is to see one in action. Let’s return to our original goal of summing the total number of names associated with each combination of year and sex. We’ll call the <code>pandas</code> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html"><code>.pivot_table</code></a> method to create a new table.</p>
-<div id="9e720dae" class="cell" data-execution_count="43">
+<div id="e8fe8bc0" class="cell" data-execution_count="43">
 <div class="sourceCode cell-code" id="cb54"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1"><a href="#cb54-1" aria-hidden="true" tabindex="-1"></a><span class="co"># The `pivot_table` method is used to generate a Pandas pivot table</span></span>
 <span id="cb54-2"><a href="#cb54-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
 <span id="cb54-3"><a href="#cb54-3" aria-hidden="true" tabindex="-1"></a>babynames.pivot_table(</span>
@@ -2998,7 +3004,7 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="aggregating-data-with-piv
 <li><code>aggfunc = np.sum</code> tells <code>pandas</code> what function to use when aggregating the data specified by <code>values</code>. Here, we are summing the name counts for each pair of <code>"Year"</code> and <code>"Sex"</code></li>
 </ul>
 <p>We can even include multiple values in the index or columns of our pivot tables.</p>
-<div id="587000e9" class="cell" data-execution_count="44">
+<div id="e5aec3fa" class="cell" data-execution_count="44">
 <div class="sourceCode cell-code" id="cb55"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb55-1"><a href="#cb55-1" aria-hidden="true" tabindex="-1"></a>babynames_pivot <span class="op">=</span> babynames.pivot_table(</span>
 <span id="cb55-2"><a href="#cb55-2" aria-hidden="true" tabindex="-1"></a>    index<span class="op">=</span><span class="st">"Year"</span>,     <span class="co"># the rows (turned into index)</span></span>
 <span id="cb55-3"><a href="#cb55-3" aria-hidden="true" tabindex="-1"></a>    columns<span class="op">=</span><span class="st">"Sex"</span>,    <span class="co"># the column values</span></span>
@@ -3087,7 +3093,7 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="aggregating-data-with-piv
 <h2 data-number="4.5" class="anchored" data-anchor-id="joining-tables"><span class="header-section-number">4.5</span> Joining Tables</h2>
 <p>When working on data science projects, we’re unlikely to have absolutely all the data we want contained in a single <code>DataFrame</code> – a real-world data scientist needs to grapple with data coming from multiple sources. If we have access to multiple datasets with related information, we can join two or more tables into a single <code>DataFrame</code>.</p>
 <p>To put this into practice, we’ll revisit the <code>elections</code> dataset.</p>
-<div id="2dfb1d0d" class="cell" data-execution_count="45">
+<div id="93f92e9f" class="cell" data-execution_count="45">
 <div class="sourceCode cell-code" id="cb56"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb56-1"><a href="#cb56-1" aria-hidden="true" tabindex="-1"></a>elections.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="45">
 <div>
@@ -3159,7 +3165,7 @@ <h2 data-number="4.5" class="anchored" data-anchor-id="joining-tables"><span cla
 </div>
 <p>Say we want to understand the popularity of the names of each presidential candidate in 2022. To do this, we’ll need the combined data of <code>babynames</code> <em>and</em> <code>elections</code>.</p>
 <p>We’ll start by creating a new column containing the first name of each presidential candidate. This will help us join each name in <code>elections</code> to the corresponding name data in <code>babynames</code>.</p>
-<div id="c880c03b" class="cell" data-execution_count="46">
+<div id="74cb48bf" class="cell" data-execution_count="46">
 <div class="sourceCode cell-code" id="cb57"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb57-1"><a href="#cb57-1" aria-hidden="true" tabindex="-1"></a><span class="co"># This `str` operation splits each candidate's full name at each </span></span>
 <span id="cb57-2"><a href="#cb57-2" aria-hidden="true" tabindex="-1"></a><span class="co"># blank space, then takes just the candidate's first name</span></span>
 <span id="cb57-3"><a href="#cb57-3" aria-hidden="true" tabindex="-1"></a>elections[<span class="st">"First Name"</span>] <span class="op">=</span> elections[<span class="st">"Candidate"</span>].<span class="bu">str</span>.split().<span class="bu">str</span>[<span class="dv">0</span>]</span>
@@ -3238,7 +3244,7 @@ <h2 data-number="4.5" class="anchored" data-anchor-id="joining-tables"><span cla
 </div>
 </div>
 </div>
-<div id="fda6a86f" class="cell" data-execution_count="47">
+<div id="8072a6c4" class="cell" data-execution_count="47">
 <div class="sourceCode cell-code" id="cb58"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb58-1"><a href="#cb58-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Here, we'll only consider `babynames` data from 2022</span></span>
 <span id="cb58-2"><a href="#cb58-2" aria-hidden="true" tabindex="-1"></a>babynames_2022 <span class="op">=</span> babynames[babynames[<span class="st">"Year"</span>]<span class="op">==</span><span class="dv">2022</span>]</span>
 <span id="cb58-3"><a href="#cb58-3" aria-hidden="true" tabindex="-1"></a>babynames_2022.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -3311,7 +3317,7 @@ <h2 data-number="4.5" class="anchored" data-anchor-id="joining-tables"><span cla
 </div>
 </div>
 <p>Now, we’re ready to join the two tables. <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html"><code>pd.merge</code></a> is the <code>pandas</code> function used to join <code>DataFrame</code>s together.</p>
-<div id="5ef8058e" class="cell" data-execution_count="48">
+<div id="19a3eb31" class="cell" data-execution_count="48">
 <div class="sourceCode cell-code" id="cb59"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb59-1"><a href="#cb59-1" aria-hidden="true" tabindex="-1"></a>merged <span class="op">=</span> pd.merge(left <span class="op">=</span> elections, right <span class="op">=</span> babynames_2022, <span class="op">\</span></span>
 <span id="cb59-2"><a href="#cb59-2" aria-hidden="true" tabindex="-1"></a>                  left_on <span class="op">=</span> <span class="st">"First Name"</span>, right_on <span class="op">=</span> <span class="st">"Name"</span>)</span>
 <span id="cb59-3"><a href="#cb59-3" aria-hidden="true" tabindex="-1"></a>merged.head()</span>
diff --git a/docs/regex/regex.html b/docs/regex/regex.html
index 27ddcf46..dc47d5ee 100644
--- a/docs/regex/regex.html
+++ b/docs/regex/regex.html
@@ -208,6 +208,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -403,7 +409,7 @@ <h2 data-number="6.2" class="anchored" data-anchor-id="python-string-methods"><s
 <section id="canonicalization" class="level3" data-number="6.2.1">
 <h3 data-number="6.2.1" class="anchored" data-anchor-id="canonicalization"><span class="header-section-number">6.2.1</span> Canonicalization</h3>
 <p>Assume we want to merge the given tables.</p>
-<div id="9f3576b3" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="1">
+<div id="db659bc6" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
@@ -415,7 +421,7 @@ <h3 data-number="6.2.1" class="anchored" data-anchor-id="canonicalization"><span
 <span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>    county_and_pop <span class="op">=</span> pd.read_csv(f)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 </div>
-<div id="aafa84d5" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="2">
+<div id="34bc47b1" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>display(county_and_state), display(county_and_pop)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -498,7 +504,7 @@ <h3 data-number="6.2.1" class="anchored" data-anchor-id="canonicalization"><span
 <section id="canonicalization-with-python-string-manipulation" class="level4" data-number="6.2.1.1">
 <h4 data-number="6.2.1.1" class="anchored" data-anchor-id="canonicalization-with-python-string-manipulation"><span class="header-section-number">6.2.1.1</span> Canonicalization with Python String Manipulation</h4>
 <p>The following function uses Python string manipulation to convert a single county name into canonical form. It does so by eliminating whitespace, punctuation, and unnecessary text.</p>
-<div id="efba4d89" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="3">
+<div id="4eaa01cd" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> canonicalize_county(county_name):</span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> (</span>
 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>        county_name</span>
@@ -516,7 +522,7 @@ <h4 data-number="6.2.1.1" class="anchored" data-anchor-id="canonicalization-with
 </div>
 </div>
 <p>We will use the <code>pandas</code> <code>map</code> function to apply the <code>canonicalize_county</code> function to every row in both <code>DataFrame</code>s. In doing so, we’ll create a new column in each called <code>clean_county_python</code> with the canonical form.</p>
-<div id="9b5105bd" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="4">
+<div id="3630efa8" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>county_and_pop[<span class="st">'clean_county_python'</span>] <span class="op">=</span> county_and_pop[<span class="st">'County'</span>].<span class="bu">map</span>(canonicalize_county)</span>
 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>county_and_state[<span class="st">'clean_county_python'</span>] <span class="op">=</span> county_and_state[<span class="st">'County'</span>].<span class="bu">map</span>(canonicalize_county)</span>
 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>display(county_and_state), display(county_and_pop)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -612,7 +618,7 @@ <h4 data-number="6.2.1.1" class="anchored" data-anchor-id="canonicalization-with
 <h4 data-number="6.2.1.2" class="anchored" data-anchor-id="canonicalization-with-pandas-series-methods"><span class="header-section-number">6.2.1.2</span> Canonicalization with Pandas Series Methods</h4>
 <p>Alternatively, we can use <code>pandas</code> <code>Series</code> methods to create this standardized column. To do so, we must access the <code>.str</code> attribute of our <code>Series</code> object before calling any string methods, like <code>.lower</code> and <code>.replace</code>. Notice how these method names match their equivalent built-in Python string methods.</p>
 <p>Chaining multiple <code>Series</code> methods in this manner eliminates the need to use the <code>map</code> function (as this code is vectorized).</p>
-<div id="7b0fb8e7" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="5">
+<div id="c5605d40" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> canonicalize_county_series(county_series):</span>
 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> (</span>
 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>        county_series</span>
@@ -730,7 +736,7 @@ <h4 data-number="6.2.1.2" class="anchored" data-anchor-id="canonicalization-with
 <h3 data-number="6.2.2" class="anchored" data-anchor-id="extraction"><span class="header-section-number">6.2.2</span> Extraction</h3>
 <p>Extraction explores the idea of obtaining useful information from text data. This will be particularly important in model building, which we’ll study in a few weeks.</p>
 <p>Say we want to read some data from a <code>.txt</code> file.</p>
-<div id="da1937b5" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="6">
+<div id="a990a2e0" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">'data/log.txt'</span>, <span class="st">'r'</span>) <span class="im">as</span> f:</span>
 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>    log_lines <span class="op">=</span> f.readlines()</span>
 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -743,7 +749,7 @@ <h3 data-number="6.2.2" class="anchored" data-anchor-id="extraction"><span class
 </div>
 <p>Suppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won’t work.</p>
 <p>Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by <code>/</code> and <code>:</code>. We can home in on this region of text and split the data on these characters. Python’s built-in <code>.split</code> string method makes this easy.</p>
-<div id="3648e4f1" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="7">
+<div id="309e6138" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>first <span class="op">=</span> log_lines[<span class="dv">0</span>] <span class="co"># Only considering the first row of data</span></span>
 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>pertinent <span class="op">=</span> first.split(<span class="st">"["</span>)[<span class="dv">1</span>].split(<span class="st">']'</span>)[<span class="dv">0</span>]</span>
@@ -773,7 +779,7 @@ <h3 data-number="6.2.2" class="anchored" data-anchor-id="extraction"><span class
 <h2 data-number="6.3" class="anchored" data-anchor-id="regex-basics"><span class="header-section-number">6.3</span> RegEx Basics</h2>
 <p>A <strong>regular expression (“RegEx”)</strong> is a sequence of characters that specifies a search pattern. Regular expressions are written to extract specific information from text. They are essentially a small programming language embedded in Python, made available through the <code>re</code> module. As such, they have their own syntax and methods for various capabilities.</p>
 <p>Regular expressions are useful in many applications beyond data science. For example, Social Security Numbers (SSNs) are often validated with regular expressions.</p>
-<div id="5975fed7" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="8">
+<div id="7e8186b1" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co">r"[0-9]{3}-[0-9]{2}-[0-9]{4}"</span> <span class="co"># Regular Expression Syntax</span></span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="co"># 3 of any digit, then a dash,</span></span>
@@ -1076,7 +1082,7 @@ <h3 data-number="6.5.1" class="anchored" data-anchor-id="greediness"><span class
 <section id="examples-2" class="level3" data-number="6.5.2">
 <h3 data-number="6.5.2" class="anchored" data-anchor-id="examples-2"><span class="header-section-number">6.5.2</span> Examples</h3>
 <p>Let’s revisit our earlier problem of extracting date/time data from the given <code>.txt</code> files. Here is how the data looked.</p>
-<div id="b2ff0cb0" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="9">
+<div id="d56d693c" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>log_lines[<span class="dv">0</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="9">
 <pre><code>'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'</code></pre>
@@ -1109,7 +1115,7 @@ <h4 data-number="6.6.1.1" class="anchored" data-anchor-id="canonicalization-with
 <p>The regular expression here removes text surrounded by <code>&lt;&gt;</code> (also known as HTML tags).</p>
 <p>In order, the pattern matches: (1) a single <code>&lt;</code>; (2) one or more characters that are not a <code>&gt;</code>, such as div, td valign…, /td, /div; (3) a single <code>&gt;</code>.</p>
 <p>Any substring in <code>text</code> that fulfills all three conditions will be replaced by <code>''</code>.</p>
-<div id="daa64e30" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="10">
+<div id="47d3bcc4" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> re</span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a>text <span class="op">=</span> <span class="st">"&lt;div&gt;&lt;td valign='top'&gt;Moo&lt;/td&gt;&lt;/div&gt;"</span></span>
@@ -1126,7 +1132,7 @@ <h4 data-number="6.6.1.1" class="anchored" data-anchor-id="canonicalization-with
 <h4 data-number="6.6.1.2" class="anchored" data-anchor-id="canonicalization-with-pandas"><span class="header-section-number">6.6.1.2</span> Canonicalization with <code>pandas</code></h4>
 <p>We can also use regular expressions with <code>pandas</code> <code>Series</code> methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: <br> <code>ser.str.replace(pattern, repl, regex=True)</code>.</p>
 <p>Consider the following <code>DataFrame</code> <code>html_data</code> with a single column.</p>
-<div id="389a9b0b" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="11">
+<div id="90e613e8" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="11">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a>data <span class="op">=</span> {<span class="st">"HTML"</span>: [<span class="st">"&lt;div&gt;&lt;td valign='top'&gt;Moo&lt;/td&gt;&lt;/div&gt;"</span>, <span class="op">\</span></span>
@@ -1135,7 +1141,7 @@ <h4 data-number="6.6.1.2" class="anchored" data-anchor-id="canonicalization-with
 <span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a>html_data <span class="op">=</span> pd.DataFrame(data)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 </div>
-<div id="7c1eb58a" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="12">
+<div id="772ff87a" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>html_data</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="12">
 <div>
@@ -1167,7 +1173,7 @@ <h4 data-number="6.6.1.2" class="anchored" data-anchor-id="canonicalization-with
 </div>
 </div>
 </div>
-<div id="9f39c885" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="13">
+<div id="74bfb0d3" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>pattern <span class="op">=</span> <span class="vs">r"&lt;[^&gt;]+&gt;"</span></span>
 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>html_data[<span class="st">'HTML'</span>].<span class="bu">str</span>.replace(pattern, <span class="st">''</span>, regex<span class="op">=</span><span class="va">True</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="13">
@@ -1185,7 +1191,7 @@ <h3 data-number="6.6.2" class="anchored" data-anchor-id="extraction-1"><span cla
 <h4 data-number="6.6.2.1" class="anchored" data-anchor-id="extraction-with-regex"><span class="header-section-number">6.6.2.1</span> Extraction with RegEx</h4>
 <p>Just like with canonicalization, the <code>re</code> module provides capability to extract relevant text from a string: <br> <code>re.findall(pattern, text)</code>. This function returns a list of all matches to <code>pattern</code>.</p>
 <p>Using the familiar regular expression for Social Security Numbers:</p>
-<div id="4dffefae" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="14">
+<div id="49110ff8" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a>text <span class="op">=</span> <span class="st">"My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."</span></span>
 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a>pattern <span class="op">=</span> <span class="vs">r"[0-9]</span><span class="sc">{3}</span><span class="vs">-[0-9]</span><span class="sc">{2}</span><span class="vs">-[0-9]</span><span class="sc">{4}</span><span class="vs">"</span></span>
 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a>re.findall(pattern, text)  </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -1198,7 +1204,7 @@ <h4 data-number="6.6.2.1" class="anchored" data-anchor-id="extraction-with-regex
 <h4 data-number="6.6.2.2" class="anchored" data-anchor-id="extraction-with-pandas"><span class="header-section-number">6.6.2.2</span> Extraction with <code>pandas</code></h4>
 <p><code>pandas</code> similarly provides extraction functionality on a <code>Series</code> of data: <code>ser.str.findall(pattern)</code>.</p>
 <p>Consider the following <code>DataFrame</code> <code>ssn_data</code>.</p>
-<div id="b913f269" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="15">
+<div id="0b9498a2" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="15">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a>data <span class="op">=</span> {<span class="st">"SSN"</span>: [<span class="st">"987-65-4321"</span>, <span class="st">"forty"</span>, <span class="op">\</span></span>
@@ -1207,7 +1213,7 @@ <h4 data-number="6.6.2.2" class="anchored" data-anchor-id="extraction-with-panda
 <span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a>ssn_data <span class="op">=</span> pd.DataFrame(data)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 </div>
-<div id="92d42923" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="16">
+<div id="03c40cc4" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb24"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a>ssn_data</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="16">
 <div>
@@ -1243,7 +1249,7 @@ <h4 data-number="6.6.2.2" class="anchored" data-anchor-id="extraction-with-panda
 </div>
 </div>
 </div>
-<div id="bd8d7d7f" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="17">
+<div id="099990db" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb25"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a>ssn_data[<span class="st">"SSN"</span>].<span class="bu">str</span>.findall(pattern)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="17">
 <pre><code>0                 [987-65-4321]
@@ -1255,7 +1261,7 @@ <h4 data-number="6.6.2.2" class="anchored" data-anchor-id="extraction-with-panda
 </div>
 <p>This function returns a list for every row containing the pattern matches in a given string.</p>
 <p>As you may expect, there are similar <code>pandas</code> equivalents for other <code>re</code> functions as well. <code>Series.str.extract</code> takes in a pattern and returns a <code>DataFrame</code> of each capture group’s first match in the string. In contrast, <code>Series.str.extractall</code> returns a multi-indexed <code>DataFrame</code> of all matches for each capture group. You can see the difference in the outputs below:</p>
-<div id="335b0805" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="18">
+<div id="0d0b0431" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a>pattern_cg <span class="op">=</span> <span class="vs">r"([0-9]</span><span class="sc">{3}</span><span class="vs">)-([0-9]</span><span class="sc">{2}</span><span class="vs">)-([0-9]</span><span class="sc">{4}</span><span class="vs">)"</span></span>
 <span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a>ssn_data[<span class="st">"SSN"</span>].<span class="bu">str</span>.extract(pattern_cg)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="18">
@@ -1302,7 +1308,7 @@ <h4 data-number="6.6.2.2" class="anchored" data-anchor-id="extraction-with-panda
 </div>
 </div>
 </div>
-<div id="7b5efc3a" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="19">
+<div id="cc410707" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb28"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a>ssn_data[<span class="st">"SSN"</span>].<span class="bu">str</span>.extractall(pattern_cg)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="19">
 <div>
@@ -1367,12 +1373,12 @@ <h3 data-number="6.6.3" class="anchored" data-anchor-id="regular-expression-capt
 <p>Let’s take a look at an example.</p>
 <section id="example-1" class="level4" data-number="6.6.3.1">
 <h4 data-number="6.6.3.1" class="anchored" data-anchor-id="example-1"><span class="header-section-number">6.6.3.1</span> Example 1</h4>
-<div id="e0992a8e" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="20">
+<div id="7998dd0d" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a>text <span class="op">=</span> <span class="st">"Observations: 03:04:53 - Horse awakens. </span><span class="ch">\</span></span>
 <span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a><span class="st">        03:05:14 - Horse goes back to sleep."</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>Say we want to capture all occurrences of time data (hour, minute, and second) as <em>separate entities</em>.</p>
-<div id="ec95afff" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="21">
+<div id="07147958" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb30"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true" tabindex="-1"></a>pattern_1 <span class="op">=</span> <span class="vs">r"(\d\d):(\d\d):(\d\d)"</span></span>
 <span id="cb30-2"><a href="#cb30-2" aria-hidden="true" tabindex="-1"></a>re.findall(pattern_1, text)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="21">
@@ -1381,7 +1387,7 @@ <h4 data-number="6.6.3.1" class="anchored" data-anchor-id="example-1"><span clas
 </div>
 <p>Notice how the given pattern has 3 capture groups, each specified by the regular expression <code>(\d\d)</code>. When a pattern contains capture groups, <code>re.findall</code> returns a list of tuples, one per match, with each tuple holding the text captured by the 3 groups.</p>
 <p>The capture groups in a pattern do not need to be written identically. We can use the <code>(\d{2})</code> shorthand to extract the same data.</p>
-<div id="0f7f4694" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="22">
+<div id="31bc6059" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb32"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a>pattern_2 <span class="op">=</span> <span class="vs">r"(\d\d):(\d\d):(\d</span><span class="sc">{2}</span><span class="vs">)"</span></span>
 <span id="cb32-2"><a href="#cb32-2" aria-hidden="true" tabindex="-1"></a>re.findall(pattern_2, text)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="22">
@@ -1392,14 +1398,14 @@ <h4 data-number="6.6.3.1" class="anchored" data-anchor-id="example-1"><span clas
 <section id="example-2" class="level4" data-number="6.6.3.2">
 <h4 data-number="6.6.3.2" class="anchored" data-anchor-id="example-2"><span class="header-section-number">6.6.3.2</span> Example 2</h4>
 <p>With the notion of capture groups, convince yourself how the following regular expression works.</p>
-<div id="5c2ed8ef" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="23">
+<div id="d7e39ed2" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="23">
 <div class="sourceCode cell-code" id="cb34"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a>first <span class="op">=</span> log_lines[<span class="dv">0</span>]</span>
 <span id="cb34-2"><a href="#cb34-2" aria-hidden="true" tabindex="-1"></a>first</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="23">
 <pre><code>'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'</code></pre>
 </div>
 </div>
-<div id="91208840" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="24">
+<div id="8189bd81" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="24">
 <div class="sourceCode cell-code" id="cb36"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a>pattern <span class="op">=</span> <span class="vs">r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'</span></span>
 <span id="cb36-2"><a href="#cb36-2" aria-hidden="true" tabindex="-1"></a>day, month, year, hour, minute, second, time_zone <span class="op">=</span> re.findall(pattern, first)[<span class="dv">0</span>]</span>
 <span id="cb36-3"><a href="#cb36-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(day, month, year, hour, minute, second, time_zone)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
diff --git a/docs/sampling/sampling.html b/docs/sampling/sampling.html
index 70ecfed7..e97ff766 100644
--- a/docs/sampling/sampling.html
+++ b/docs/sampling/sampling.html
@@ -64,6 +64,7 @@
 <script src="../site_libs/quarto-search/fuse.min.js"></script>
 <script src="../site_libs/quarto-search/quarto-search.js"></script>
 <meta name="quarto:offset" content="../">
+<link href="../intro_to_modeling/intro_to_modeling.html" rel="next">
 <link href="../visualization_2/visualization_2.html" rel="prev">
 <link href="../data100_logo.png" rel="icon" type="image/png">
 <script src="../site_libs/quarto-html/quarto.js"></script>
@@ -207,6 +208,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link active">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -445,7 +452,7 @@ <h3 data-number="9.3.3" class="anchored" data-anchor-id="demo-barbie-v.-oppenhei
 <li>There are only two movies they can watch on July 21st: Barbie and Oppenheimer.</li>
 <li>Every resident watches a movie (either Barbie or Oppenheimer) on July 21st.</li>
 </ul>
-<div id="4018131a" class="cell" data-execution_count="1">
+<div id="0573d922" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
@@ -459,7 +466,7 @@ <h3 data-number="9.3.3" class="anchored" data-anchor-id="demo-barbie-v.-oppenhei
 <span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>rng <span class="op">=</span> np.random.default_rng()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </details>
 </div>
-<div id="2873b907" class="cell" data-execution_count="2">
+<div id="4e37fca2" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>movie <span class="op">=</span> pd.read_csv(<span class="st">"data/movie.csv"</span>)</span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co"># create a 1/0 int that indicates Barbie vote</span></span>
@@ -522,7 +529,7 @@ <h3 data-number="9.3.3" class="anchored" data-anchor-id="demo-barbie-v.-oppenhei
 </div>
 </div>
 <p>What fraction of Berkeley residents chose Barbie?</p>
-<div id="cdae49a7" class="cell" data-execution_count="3">
+<div id="bc433b82" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>actual_barbie <span class="op">=</span> np.mean(movie[<span class="st">"barbie"</span>])</span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>actual_barbie</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="3">
@@ -533,7 +540,7 @@ <h3 data-number="9.3.3" class="anchored" data-anchor-id="demo-barbie-v.-oppenhei
 <section id="convenience-sample-retirees" class="level4" data-number="9.3.3.1">
 <h4 data-number="9.3.3.1" class="anchored" data-anchor-id="convenience-sample-retirees"><span class="header-section-number">9.3.3.1</span> Convenience Sample: Retirees</h4>
 <p>Let’s take a convenience sample of people who have retired (&gt;= 65 years old). What proportion of them went to see Barbie instead of Oppenheimer?</p>
-<div id="3e70ece1" class="cell" data-execution_count="4">
+<div id="9fa897db" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>convenience_sample <span class="op">=</span> movie[movie[<span class="st">'age'</span>] <span class="op">&gt;=</span> <span class="dv">65</span>] <span class="co"># take a convenience sample of retirees</span></span>
 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>np.mean(convenience_sample[<span class="st">"barbie"</span>]) <span class="co"># what proportion of them saw Barbie? </span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="4">
@@ -541,14 +548,14 @@ <h4 data-number="9.3.3.1" class="anchored" data-anchor-id="convenience-sample-re
 </div>
 </div>
 <p>Based on this result, we would have predicted that Oppenheimer would win! What happened? Is it possible that our sample is too small or noisy?</p>
-<div id="ba546126" class="cell" data-execution_count="5">
+<div id="bc85203b" class="cell" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># what's the size of our sample? </span></span>
 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="bu">len</span>(convenience_sample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="5">
 <pre><code>359396</code></pre>
 </div>
 </div>
-<div id="dd372717" class="cell" data-execution_count="6">
+<div id="0c387e70" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># what proportion of our data is in the convenience sample? </span></span>
 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="bu">len</span>(convenience_sample)<span class="op">/</span><span class="bu">len</span>(movie)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="6">
@@ -560,7 +567,7 @@ <h4 data-number="9.3.3.1" class="anchored" data-anchor-id="convenience-sample-re
 <section id="check-for-bias" class="level4" data-number="9.3.3.2">
 <h4 data-number="9.3.3.2" class="anchored" data-anchor-id="check-for-bias"><span class="header-section-number">9.3.3.2</span> Check for Bias</h4>
 <p>Let us aggregate all choices by age and visualize the fraction of Barbie views, split by gender.</p>
-<div id="b1bc82cf" class="cell" data-execution_count="7">
+<div id="ca3bb25c" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>votes_by_barbie <span class="op">=</span> movie.groupby([<span class="st">"age"</span>,<span class="st">"is_male"</span>]).agg(<span class="st">"mean"</span>, numeric_only<span class="op">=</span><span class="va">True</span>).reset_index()</span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>votes_by_barbie.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="7">
@@ -613,7 +620,7 @@ <h4 data-number="9.3.3.2" class="anchored" data-anchor-id="check-for-bias"><span
 </div>
 </div>
 </div>
-<div id="53d9cf88" class="cell" data-execution_count="8">
+<div id="ebefd4a6" class="cell" data-execution_count="8">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># A common matplotlib/seaborn pattern: create the figure and axes object, pass ax</span></span>
@@ -644,17 +651,17 @@ <h4 data-number="9.3.3.2" class="anchored" data-anchor-id="check-for-bias"><span
 <section id="simple-random-sample" class="level4" data-number="9.3.3.3">
 <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"><span class="header-section-number">9.3.3.3</span> Simple Random Sample</h4>
 <p>Suppose we took a simple random sample (SRS) of the same size as our retiree sample:</p>
-<div id="95b0c01f" class="cell" data-execution_count="9">
+<div id="be8fe525" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>n <span class="op">=</span> <span class="bu">len</span>(convenience_sample)</span>
 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>random_sample <span class="op">=</span> movie.sample(n, replace <span class="op">=</span> <span class="va">False</span>) <span class="co">## By default, replace = False</span></span>
 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>np.mean(random_sample[<span class="st">"barbie"</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="9">
-<pre><code>np.float64(0.52963861590001)</code></pre>
+<pre><code>np.float64(0.5309408006766909)</code></pre>
 </div>
 </div>
 <p>This is very close to the actual vote of 0.5302792307692308!</p>
 <p>It turns out that we can get similar results with a <strong>much smaller sample size</strong>, say, 800:</p>
-<div id="41cddee7" class="cell" data-execution_count="10">
+<div id="b03c23c3" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>n <span class="op">=</span> <span class="dv">800</span></span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a>random_sample <span class="op">=</span> movie.sample(n, replace <span class="op">=</span> <span class="va">False</span>)</span>
 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -667,7 +674,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
 <span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a>Markdown(<span class="ss">f"**Actual** = </span><span class="sc">{</span>actual_barbie<span class="sc">:.4f}</span><span class="ss">, **Sample** = </span><span class="sc">{</span>sample_barbie<span class="sc">:.4f}</span><span class="ss">, "</span></span>
 <span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a>         <span class="ss">f"**Err** = </span><span class="sc">{</span><span class="dv">100</span><span class="op">*</span>err<span class="sc">:.2f}</span><span class="ss">%."</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display cell-output-markdown" data-execution_count="10">
-<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5300, <strong>Err</strong> = 0.05%.</p>
+<p><strong>Actual</strong> = 0.5303, <strong>Sample</strong> = 0.5188, <strong>Err</strong> = 2.17%.</p>
 </div>
 </div>
 <p>We’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.</p>
@@ -676,7 +683,7 @@ <h4 data-number="9.3.3.3" class="anchored" data-anchor-id="simple-random-sample"
 <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-error"><span class="header-section-number">9.3.3.4</span> Quantifying Chance Error</h4>
 <p>In our SRS of size 800, what would be our chance error?</p>
 <p>Let’s simulate 1000 versions of taking the 800-sized SRS from before:</p>
-<div id="e8d52157" class="cell" data-execution_count="11">
+<div id="020f0738" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>nrep <span class="op">=</span> <span class="dv">1000</span>   <span class="co"># number of simulations</span></span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>n <span class="op">=</span> <span class="dv">800</span>       <span class="co"># size of our sample</span></span>
 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a>poll_result <span class="op">=</span> []</span>
@@ -684,7 +691,7 @@ <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-er
 <span id="cb16-5"><a href="#cb16-5" aria-hidden="true" tabindex="-1"></a>    random_sample <span class="op">=</span> movie.sample(n, replace <span class="op">=</span> <span class="va">False</span>)</span>
 <span id="cb16-6"><a href="#cb16-6" aria-hidden="true" tabindex="-1"></a>    poll_result.append(np.mean(random_sample[<span class="st">"barbie"</span>]))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
-<div id="dcdf09fa" class="cell" data-execution_count="12">
+<div id="b08a1f5e" class="cell" data-execution_count="12">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a>fig, ax <span class="op">=</span> plt.subplots()</span>
@@ -700,21 +707,21 @@ <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-er
 <div class="cell-output cell-output-display">
 <div>
 <figure class="figure">
-<p><img src="sampling_files/figure-html/cell-13-output-2.png" width="605" height="421" class="figure-img"></p>
+<p><img src="sampling_files/figure-html/cell-13-output-2.png" width="605" height="424" class="figure-img"></p>
 </figure>
 </div>
 </div>
 </div>
 <p>What fraction of these simulated samples would have predicted Barbie?</p>
-<div id="3d52ad34" class="cell" data-execution_count="13">
+<div id="0ea3140e" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>poll_result <span class="op">=</span> pd.Series(poll_result)</span>
 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>np.<span class="bu">sum</span>(poll_result <span class="op">&gt;</span> <span class="fl">0.5</span>)<span class="op">/</span><span class="dv">1000</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="13">
-<pre><code>np.float64(0.959)</code></pre>
+<pre><code>np.float64(0.946)</code></pre>
 </div>
 </div>
 <p>You can see the curve looks roughly Gaussian/normal. Using KDE:</p>
-<div id="17081f39" class="cell" data-execution_count="14">
+<div id="67e9bded" class="cell" data-execution_count="14">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a>sns.histplot(poll_result, stat<span class="op">=</span><span class="st">'density'</span>, kde<span class="op">=</span><span class="va">True</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -728,7 +735,7 @@ <h4 data-number="9.3.3.4" class="anchored" data-anchor-id="quantifying-chance-er
 <div class="cell-output cell-output-display">
 <div>
 <figure class="figure">
-<p><img src="sampling_files/figure-html/cell-15-output-2.png" width="605" height="421" class="figure-img"></p>
+<p><img src="sampling_files/figure-html/cell-15-output-2.png" width="605" height="424" class="figure-img"></p>
 </figure>
 </div>
 </div>
@@ -1138,6 +1145,9 @@ <h2 data-number="9.4" class="anchored" data-anchor-id="summary"><span class="hea
       </a>          
   </div>
   <div class="nav-page nav-page-next">
+      <a href="../intro_to_modeling/intro_to_modeling.html" class="pagination-link" aria-label="<span class='chapter-number'>10</span>&nbsp; <span class='chapter-title'>Introduction to Modeling</span>">
+        <span class="nav-page-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span> <i class="bi bi-arrow-right-short"></i>
+      </a>
   </div>
 </nav>
 </div> <!-- /content -->
diff --git a/docs/sampling/sampling_files/figure-html/cell-13-output-2.png b/docs/sampling/sampling_files/figure-html/cell-13-output-2.png
index 22ab187e..da335396 100644
Binary files a/docs/sampling/sampling_files/figure-html/cell-13-output-2.png and b/docs/sampling/sampling_files/figure-html/cell-13-output-2.png differ
diff --git a/docs/sampling/sampling_files/figure-html/cell-15-output-2.png b/docs/sampling/sampling_files/figure-html/cell-15-output-2.png
index 0093da81..c5e3d61c 100644
Binary files a/docs/sampling/sampling_files/figure-html/cell-15-output-2.png and b/docs/sampling/sampling_files/figure-html/cell-15-output-2.png differ
diff --git a/docs/search.json b/docs/search.json
index 20aa5818..faecf1cd 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -144,7 +144,7 @@
     "href": "pandas_2/pandas_2.html#useful-utility-functions",
     "title": "3  Pandas II",
     "section": "3.3 Useful Utility Functions",
-    "text": "3.3 Useful Utility Functions\npandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.\nDiscussing all functionality offered by pandas could take an entire semester! We will walk you through the most commonly-used functions and encourage you to explore and experiment on your own.\n\nNumPy and built-in function support\n.shape\n.size\n.describe()\n.sample()\n.value_counts()\n.unique()\n.sort_values()\n\nThe pandas documentation will be a valuable resource in Data 100 and beyond.\n\n3.3.1 NumPy\npandas is designed to work well with NumPy, the framework for array computations you encountered in Data 8. Just about any NumPy function can be applied to pandas DataFrames and Series.\n\n# Pull out the number of babies named Yash each year\nyash_count = babynames[babynames[\"Name\"] == \"Yash\"][\"Count\"]\nyash_count.head()\n\n331824     8\n334114     9\n336390    11\n338773    12\n341387    10\nName: Count, dtype: int64\n\n\n\n# Average number of babies named Yash each year\nnp.mean(yash_count)\n\nnp.float64(17.142857142857142)\n\n\n\n# Max number of babies named Yash born in any one year\nnp.max(yash_count)\n\nnp.int64(29)\n\n\n\n\n3.3.2 .shape and .size\n.shape and .size are attributes of Series and DataFrames that measure the “amount” of data stored in the structure. Calling .shape returns a tuple containing the number of rows and columns present in the DataFrame or Series. .size is used to find the total number of elements in a structure, equivalent to the number of rows times the number of columns.\nMany functions strictly require the dimensions of the arguments along certain axes to match. 
Calling these dimension-finding functions is much faster than counting all of the items by hand.\n\n# Return the shape of the DataFrame, in the format (num_rows, num_columns)\nbabynames.shape\n\n(407428, 5)\n\n\n\n# Return the size of the DataFrame, equal to num_rows * num_columns\nbabynames.size\n\n2037140\n\n\n\n\n3.3.3 .describe()\nIf many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() (documentation) can be used to compute all of them at once.\n\nbabynames.describe()\n\n\n\n\n\n\n\n\nYear\nCount\n\n\n\n\ncount\n407428.000000\n407428.000000\n\n\nmean\n1985.733609\n79.543456\n\n\nstd\n27.007660\n293.698654\n\n\nmin\n1910.000000\n5.000000\n\n\n25%\n1969.000000\n7.000000\n\n\n50%\n1992.000000\n13.000000\n\n\n75%\n2008.000000\n38.000000\n\n\nmax\n2022.000000\n8260.000000\n\n\n\n\n\n\n\nA different set of statistics will be reported if .describe() is called on a Series.\n\nbabynames[\"Sex\"].describe()\n\ncount     407428\nunique         2\ntop            F\nfreq      239537\nName: Sex, dtype: object\n\n\n\n\n3.3.4 .sample()\nAs we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample() (documentation) lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).\nBy default, .sample() selects entries without replacement. 
Pass in the argument replace=True to sample with replacement.\n\n# Sample a single row\nbabynames.sample()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\n\n\n\n\n377796\nCA\nM\n2012\nMaxson\n7\n\n\n\n\n\n\n\nNaturally, this can be chained with other methods and operators (iloc, etc.).\n\n# Sample 5 random rows, and select all columns after column 2\nbabynames.sample(5).iloc[:, 2:]\n\n\n\n\n\n\n\n\nYear\nName\nCount\n\n\n\n\n399452\n2020\nKade\n51\n\n\n354928\n2004\nOjani\n6\n\n\n94414\n1984\nMaggie\n62\n\n\n294766\n1978\nEmmanuel\n52\n\n\n356937\n2005\nDerik\n10\n\n\n\n\n\n\n\n\n# Randomly sample 4 names from the year 2000, with replacement, and select all columns after column 2\nbabynames[babynames[\"Year\"] == 2000].sample(4, replace = True).iloc[:, 2:]\n\n\n\n\n\n\n\n\nYear\nName\nCount\n\n\n\n\n150626\n2000\nKarley\n15\n\n\n152317\n2000\nAmyah\n5\n\n\n149628\n2000\nFrida\n63\n\n\n343239\n2000\nJaycob\n23\n\n\n\n\n\n\n\n\n\n3.3.5 .value_counts()\nThe Series.value_counts() (documentation) method counts the number of occurrence of each unique value in a Series. In other words, it counts the number of times each unique value appears. This is often useful for determining the most or least common entries in a Series.\nIn the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the \"Name\" column of babynames. Note that the return value is also a Series.\n\nbabynames[\"Name\"].value_counts().head()\n\nName\nJean         223\nFrancis      221\nGuadalupe    218\nJessie       217\nMarion       214\nName: count, dtype: int64\n\n\n\n\n3.3.6 .unique()\nIf we have a Series with many repeated values, then .unique() (documentation) can be used to identify only the unique values. 
Here we return an array of all the names in babynames.\n\nbabynames[\"Name\"].unique()\n\narray(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],\n      dtype=object)\n\n\n\n\n3.3.7 .sort_values()\nOrdering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are the largest 5 values. .sort_values (documentation) allows us to order a DataFrame or Series by a specified column. We can choose to either receive the rows in ascending order (default) or descending order.\n\n# Sort the \"Count\" column from highest to lowest\nbabynames.sort_values(by=\"Count\", ascending=False).head()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\n\n\n\n\n268041\nCA\nM\n1957\nMichael\n8260\n\n\n267017\nCA\nM\n1956\nMichael\n8258\n\n\n317387\nCA\nM\n1990\nMichael\n8246\n\n\n281850\nCA\nM\n1969\nMichael\n8245\n\n\n283146\nCA\nM\n1970\nMichael\n8196\n\n\n\n\n\n\n\nUnlike when calling .sort_values() on a DataFrame, we do not need to explicitly specify the column used for sorting when calling .sort_values() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.\n\n# Sort the \"Name\" Series alphabetically\nbabynames[\"Name\"].sort_values(ascending=True).head()\n\n366001      Aadan\n384005      Aadan\n369120      Aadan\n398211    Aadarsh\n370306      Aaden\nName: Name, dtype: object",
+    "text": "3.3 Useful Utility Functions\npandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.\nDiscussing all functionality offered by pandas could take an entire semester! We will walk you through the most commonly-used functions and encourage you to explore and experiment on your own.\n\nNumPy and built-in function support\n.shape\n.size\n.describe()\n.sample()\n.value_counts()\n.unique()\n.sort_values()\n\nThe pandas documentation will be a valuable resource in Data 100 and beyond.\n\n3.3.1 NumPy\npandas is designed to work well with NumPy, the framework for array computations you encountered in Data 8. Just about any NumPy function can be applied to pandas DataFrames and Series.\n\n# Pull out the number of babies named Yash each year\nyash_count = babynames[babynames[\"Name\"] == \"Yash\"][\"Count\"]\nyash_count.head()\n\n331824     8\n334114     9\n336390    11\n338773    12\n341387    10\nName: Count, dtype: int64\n\n\n\n# Average number of babies named Yash each year\nnp.mean(yash_count)\n\nnp.float64(17.142857142857142)\n\n\n\n# Max number of babies named Yash born in any one year\nnp.max(yash_count)\n\nnp.int64(29)\n\n\n\n\n3.3.2 .shape and .size\n.shape and .size are attributes of Series and DataFrames that measure the “amount” of data stored in the structure. Calling .shape returns a tuple containing the number of rows and columns present in the DataFrame or Series. .size is used to find the total number of elements in a structure, equivalent to the number of rows times the number of columns.\nMany functions strictly require the dimensions of the arguments along certain axes to match. 
Calling these dimension-finding functions is much faster than counting all of the items by hand.\n\n# Return the shape of the DataFrame, in the format (num_rows, num_columns)\nbabynames.shape\n\n(407428, 5)\n\n\n\n# Return the size of the DataFrame, equal to num_rows * num_columns\nbabynames.size\n\n2037140\n\n\n\n\n3.3.3 .describe()\nIf many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() (documentation) can be used to compute all of them at once.\n\nbabynames.describe()\n\n\n\n\n\n\n\n\nYear\nCount\n\n\n\n\ncount\n407428.000000\n407428.000000\n\n\nmean\n1985.733609\n79.543456\n\n\nstd\n27.007660\n293.698654\n\n\nmin\n1910.000000\n5.000000\n\n\n25%\n1969.000000\n7.000000\n\n\n50%\n1992.000000\n13.000000\n\n\n75%\n2008.000000\n38.000000\n\n\nmax\n2022.000000\n8260.000000\n\n\n\n\n\n\n\nA different set of statistics will be reported if .describe() is called on a Series.\n\nbabynames[\"Sex\"].describe()\n\ncount     407428\nunique         2\ntop            F\nfreq      239537\nName: Sex, dtype: object\n\n\n\n\n3.3.4 .sample()\nAs we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample() (documentation) lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).\nBy default, .sample() selects entries without replacement. 
Pass in the argument replace=True to sample with replacement.\n\n# Sample a single row\nbabynames.sample()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\n\n\n\n\n44846\nCA\nF\n1961\nDarcie\n13\n\n\n\n\n\n\n\nNaturally, this can be chained with other methods and operators (iloc, etc.).\n\n# Sample 5 random rows, and select all columns after column 2\nbabynames.sample(5).iloc[:, 2:]\n\n\n\n\n\n\n\n\nYear\nName\nCount\n\n\n\n\n214600\n2016\nAarna\n30\n\n\n284937\n1971\nRigoberto\n36\n\n\n5580\n1922\nClaudine\n6\n\n\n241833\n1917\nBernhard\n5\n\n\n278221\n1965\nJohnson\n5\n\n\n\n\n\n\n\n\n# Randomly sample 4 names from the year 2000, with replacement, and select all columns after column 2\nbabynames[babynames[\"Year\"] == 2000].sample(4, replace = True).iloc[:, 2:]\n\n\n\n\n\n\n\n\nYear\nName\nCount\n\n\n\n\n344323\n2000\nRemington\n7\n\n\n150678\n2000\nJessalyn\n14\n\n\n344765\n2000\nJaelyn\n5\n\n\n151921\n2000\nAlajah\n6\n\n\n\n\n\n\n\n\n\n3.3.5 .value_counts()\nThe Series.value_counts() (documentation) method counts the number of occurrence of each unique value in a Series. In other words, it counts the number of times each unique value appears. This is often useful for determining the most or least common entries in a Series.\nIn the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the \"Name\" column of babynames. Note that the return value is also a Series.\n\nbabynames[\"Name\"].value_counts().head()\n\nName\nJean         223\nFrancis      221\nGuadalupe    218\nJessie       217\nMarion       214\nName: count, dtype: int64\n\n\n\n\n3.3.6 .unique()\nIf we have a Series with many repeated values, then .unique() (documentation) can be used to identify only the unique values. 
Here we return an array of all the names in babynames.\n\nbabynames[\"Name\"].unique()\n\narray(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],\n      dtype=object)\n\n\n\n\n3.3.7 .sort_values()\nOrdering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a row sorted in descending order (that is, from highest to lowest) are the largest 5 values. .sort_values (documentation) allows us to order a DataFrame or Series by a specified column. We can choose to either receive the rows in ascending order (default) or descending order.\n\n# Sort the \"Count\" column from highest to lowest\nbabynames.sort_values(by=\"Count\", ascending=False).head()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\n\n\n\n\n268041\nCA\nM\n1957\nMichael\n8260\n\n\n267017\nCA\nM\n1956\nMichael\n8258\n\n\n317387\nCA\nM\n1990\nMichael\n8246\n\n\n281850\nCA\nM\n1969\nMichael\n8245\n\n\n283146\nCA\nM\n1970\nMichael\n8196\n\n\n\n\n\n\n\nUnlike when calling .value_counts() on a DataFrame, we do not need to explicitly specify the column used for sorting when calling .value_counts() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.\n\n# Sort the \"Name\" Series alphabetically\nbabynames[\"Name\"].sort_values(ascending=True).head()\n\n366001      Aadan\n384005      Aadan\n369120      Aadan\n398211    Aadarsh\n370306      Aaden\nName: Name, dtype: object",
     "crumbs": [
       "<span class='chapter-number'>3</span>  <span class='chapter-title'>Pandas II</span>"
     ]
@@ -184,7 +184,7 @@
     "href": "pandas_3/pandas_3.html#aggregating-data-with-.groupby",
     "title": "4  Pandas III",
     "section": "4.2 Aggregating Data with .groupby",
-    "text": "4.2 Aggregating Data with .groupby\nUp until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.\nLet’s say we wanted to aggregate all rows in babynames for a given year.\n\nbabynames.groupby(\"Year\")\n\n&lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x1175d0aa0&gt;\n\n\nWhat does this strange output mean? Calling .groupby (documentation) has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.\nThe diagram below shows a simplified view of babynames to help illustrate this idea.\n\n\n\nWe can’t work with a GroupBy object directly – that is why you saw that strange output earlier rather than a standard view of a DataFrame. To actually manipulate values within these “mini” DataFrames, we’ll need to call an aggregation method. This is a method that tells pandas how to aggregate the values within the GroupBy object. Once the aggregation is applied, pandas will return a normal (now grouped) DataFrame.\nThe first aggregation method we’ll consider is .agg. The .agg method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. 
Let’s see this in action by finding the sum of all counts for each year in babynames – this is equivalent to finding the number of babies born in each year.\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"sum\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nWe can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of babynames, which is why we see smaller values for the summed counts.\n\n\n\nPerforming an aggregation\n\n\nCalling .agg has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by \"Year\", with a single row for each unique year in the original babynames DataFrame.\nThere are many different aggregation functions we can use, all of which are useful in different applications.\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"min\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n5\n\n\n1911\n5\n\n\n1912\n5\n\n\n1913\n5\n\n\n1914\n5\n\n\n\n\n\n\n\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"max\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n295\n\n\n1911\n390\n\n\n1912\n534\n\n\n1913\n614\n\n\n1914\n773\n\n\n\n\n\n\n\n\n# Same result, but now we explicitly tell pandas to only consider the \"Count\" column when summing\nbabynames.groupby(\"Year\")[[\"Count\"]].agg(\"sum\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nThere are many different aggregations that can be applied to the grouped data. 
The primary requirement is that an aggregation function must:\n\nTake in a Series of data (a single column of the grouped subframe).\nReturn a single value that aggregates this Series.\n\n\n4.2.1 Aggregation Functions\nBecause of this fairly broad requirement, pandas offers many ways of computing an aggregation.\nIn-built Python operations – such as sum, max, and min – are automatically recognized by pandas.\n\n# What is the minimum count for each name in any year?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"min\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n5\n\n\nAadarsh\n6\n\n\nAaden\n10\n\n\nAadhav\n6\n\n\nAadhini\n6\n\n\n\n\n\n\n\n\n# What is the largest single-year count of each name?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"max\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n7\n\n\nAadarsh\n6\n\n\nAaden\n158\n\n\nAadhav\n8\n\n\nAadhini\n6\n\n\n\n\n\n\n\nAs mentioned previously, functions from the NumPy library, such as np.mean, np.max, np.min, and np.sum, are also fair game in pandas.\n\n# What is the average count for each name across all years?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"mean\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n6.000000\n\n\nAadarsh\n6.000000\n\n\nAaden\n46.214286\n\n\nAadhav\n6.750000\n\n\nAadhini\n6.000000\n\n\n\n\n\n\n\npandas also offers a number of in-built functions. Functions that are native to pandas can be referenced using their string name within a call to .agg. Some examples include:\n\n.agg(\"sum\")\n.agg(\"max\")\n.agg(\"min\")\n.agg(\"mean\")\n.agg(\"first\")\n.agg(\"last\")\n\nThe latter two entries in this list – \"first\" and \"last\" – are unique to pandas. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where multiple columns in a group share identical information. 
To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.\nLet’s illustrate this with an example. Say we add a new column to babynames that contains the first letter of each name.\n\n# Imagine we had an additional column, \"First Letter\". We'll explain this code next week\nbabynames[\"First Letter\"] = babynames[\"Name\"].str[0]\n\n# We construct a simplified DataFrame containing just a subset of columns\nbabynames_new = babynames[[\"Name\", \"First Letter\", \"Year\"]]\nbabynames_new.head()\n\n\n\n\n\n\n\n\nName\nFirst Letter\nYear\n\n\n\n\n115957\nDeandrea\nD\n1990\n\n\n101976\nDeandrea\nD\n1986\n\n\n131029\nLeandrea\nL\n1994\n\n\n108731\nDeandrea\nD\n1988\n\n\n308131\nDeandrea\nD\n1985\n\n\n\n\n\n\n\nIf we form groups for each name in the dataset, \"First Letter\" will be the same for all members of the group. This means that if we simply select the first entry for \"First Letter\" in the group, we’ll represent all data in that group.\nWe can use a dictionary to apply different aggregation functions to each column during grouping.\n\n\n\nAggregating using “first”\n\n\n\nbabynames_new.groupby(\"Name\").agg({\"First Letter\":\"first\", \"Year\":\"max\"}).head()\n\n\n\n\n\n\n\n\nFirst Letter\nYear\n\n\nName\n\n\n\n\n\n\nAadan\nA\n2014\n\n\nAadarsh\nA\n2019\n\n\nAaden\nA\n2020\n\n\nAadhav\nA\n2019\n\n\nAadhini\nA\n2022\n\n\n\n\n\n\n\n\n\n4.2.2 Plotting Birth Counts\nLet’s use .agg to find the total number of babies born in each year. Recall that using .agg with .groupby() follows the format: df.groupby(column_name).agg(aggregation_function). 
The line of code below gives us the total number of babies born in each year.\n\n\nCode\nbabynames.groupby(\"Year\")[[\"Count\"]].agg(sum).head(5)\n# Alternative 1\n# babynames.groupby(\"Year\")[[\"Count\"]].sum()\n# Alternative 2\n# babynames.groupby(\"Year\").sum(numeric_only=True)\n\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/390646742.py:1: FutureWarning:\n\nThe provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n\n\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nHere’s an illustration of the process:\n\nPlotting the Dataframe we obtain tells an interesting story.\n\n\nCode\nimport plotly.express as px\npuzzle2 = babynames.groupby(\"Year\")[[\"Count\"]].agg(\"sum\")\npx.line(puzzle2, y = \"Count\")\n\n\n                                                \n\n\nA word of warning: we made an enormous assumption when we decided to use this dataset to estimate birth rate. According to this article from the Legistlative Analyst Office, the true number of babies born in California in 2020 was 421,275. However, our plot shows 362,882 babies —— what happened?\n\n\n4.2.3 Summary of the .groupby() Function\nA groupby operation involves some combination of splitting a DataFrame into grouped subframes, applying a function, and combining the results.\nFor some arbitrary DataFrame df below, the code df.groupby(\"year\").agg(sum) does the following:\n\nSplits the DataFrame into sub-DataFrames with rows belonging to the same year.\nApplies the sum function to each column of each sub-DataFrame.\nCombines the results of sum into a single DataFrame, indexed by year.\n\n\n\n\n4.2.4 Revisiting the .agg() Function\n.agg() can take in any function that aggregates several values into one summary value. 
Some commonly-used aggregation functions can even be called directly, without explicit use of .agg(). For example, we can call .mean() on .groupby():\nbabynames.groupby(\"Year\").mean().head()\nWe can now put this all into practice. Say we want to find the baby name with sex “F” that has fallen in popularity the most in California. To calculate this, we can first create a metric: “Ratio to Peak” (RTP). The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies born with the name in any year.\nLet’s start with calculating this for one baby, “Jennifer”.\n\n# We filter by babies with sex \"F\" and sort by \"Year\"\nf_babynames = babynames[babynames[\"Sex\"] == \"F\"]\nf_babynames = f_babynames.sort_values([\"Year\"])\n\n# Determine how many Jennifers were born in CA per year\njenn_counts_series = f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"]\n\n# Determine the max number of Jennifers born in a year and the number born in 2022 \n# to calculate RTP\nmax_jenn = max(f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"])\ncurr_jenn = f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"].iloc[-1]\nrtp = curr_jenn / max_jenn\nrtp\n\nnp.float64(0.018796372629843364)\n\n\nBy creating a function to calculate RTP and applying it to our DataFrame by using .groupby(), we can easily compute the RTP for all names at once!\n\ndef ratio_to_peak(series):\n    return series.iloc[-1] / max(series)\n\n#Using .groupby() to apply the function\nrtp_table = f_babynames.groupby(\"Name\")[[\"Year\", \"Count\"]].agg(ratio_to_peak)\nrtp_table.head()\n\n\n\n\n\n\n\n\nYear\nCount\n\n\nName\n\n\n\n\n\n\nAadhini\n1.0\n1.000000\n\n\nAadhira\n1.0\n0.500000\n\n\nAadhya\n1.0\n0.660000\n\n\nAadya\n1.0\n0.586207\n\n\nAahana\n1.0\n0.269231\n\n\n\n\n\n\n\nIn the rows shown above, we can see that every row shown has a Year value of 1.0.\nThis is the “pandas-ification” of logic you saw in Data 8. 
Much of the logic you’ve learned in Data 8 will serve you well in Data 100.\n\n\n4.2.5 Nuisance Columns\nNote that you must be careful with which columns you apply the .agg() function to. If we were to apply our function to the table as a whole by doing f_babynames.groupby(\"Name\").agg(ratio_to_peak), executing our .agg() call would result in a TypeError.\n\nWe can avoid this issue (and prevent unintentional loss of data) by explicitly selecting column(s) we want to apply our aggregation function to BEFORE calling .agg(),\n\n\n4.2.6 Renaming Columns After Grouping\nBy default, .groupby will not rename any aggregated columns. As we can see in the table above, the aggregated column is still named Count even though it now represents the RTP. For better readability, we can rename Count to Count RTP\n\nrtp_table = rtp_table.rename(columns = {\"Count\": \"Count RTP\"})\nrtp_table\n\n\n\n\n\n\n\n\nYear\nCount RTP\n\n\nName\n\n\n\n\n\n\nAadhini\n1.0\n1.000000\n\n\nAadhira\n1.0\n0.500000\n\n\nAadhya\n1.0\n0.660000\n\n\nAadya\n1.0\n0.586207\n\n\nAahana\n1.0\n0.269231\n\n\n...\n...\n...\n\n\nZyanya\n1.0\n0.466667\n\n\nZyla\n1.0\n1.000000\n\n\nZylah\n1.0\n1.000000\n\n\nZyra\n1.0\n1.000000\n\n\nZyrah\n1.0\n0.833333\n\n\n\n\n13782 rows × 2 columns\n\n\n\n\n\n4.2.7 Some Data Science Payoff\nBy sorting rtp_table, we can see the names whose popularity has decreased the most.\n\nrtp_table = rtp_table.rename(columns = {\"Count\": \"Count RTP\"})\nrtp_table.sort_values(\"Count RTP\").head()\n\n\n\n\n\n\n\n\nYear\nCount RTP\n\n\nName\n\n\n\n\n\n\nDebra\n1.0\n0.001260\n\n\nDebbie\n1.0\n0.002815\n\n\nCarol\n1.0\n0.003180\n\n\nTammy\n1.0\n0.003249\n\n\nSusan\n1.0\n0.003305\n\n\n\n\n\n\n\nTo visualize the above DataFrame, let’s look at the line plot below:\n\n\nCode\nimport plotly.express as px\npx.line(f_babynames[f_babynames[\"Name\"] == \"Debra\"], x = \"Year\", y = \"Count\")\n\n\n                                                \n\n\nWe can get the list of the top 10 names and then 
plot popularity with the following code:\n\ntop10 = rtp_table.sort_values(\"Count RTP\").head(10).index\npx.line(\n    f_babynames[f_babynames[\"Name\"].isin(top10)], \n    x = \"Year\", \n    y = \"Count\", \n    color = \"Name\"\n)\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/plotly/express/_core.py:1980: FutureWarning:\n\nWhen grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.\n\n\n\n                                                \n\n\nAs a quick exercise, consider what code would compute the total number of babies with each name.\n\n\nCode\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"sum\").head()\n# alternative solution: \n# babynames.groupby(\"Name\")[[\"Count\"]].sum()\n\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n18\n\n\nAadarsh\n6\n\n\nAaden\n647\n\n\nAadhav\n27\n\n\nAadhini\n6",
+    "text": "4.2 Aggregating Data with .groupby\nUp until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.\nLet’s say we wanted to aggregate all rows in babynames for a given year.\n\nbabynames.groupby(\"Year\")\n\n&lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x10c075a00&gt;\n\n\nWhat does this strange output mean? Calling .groupby (documentation) has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.\nThe diagram below shows a simplified view of babynames to help illustrate this idea.\n\n\n\nWe can’t work with a GroupBy object directly – that is why you saw that strange output earlier rather than a standard view of a DataFrame. To actually manipulate values within these “mini” DataFrames, we’ll need to call an aggregation method. This is a method that tells pandas how to aggregate the values within the GroupBy object. Once the aggregation is applied, pandas will return a normal (now grouped) DataFrame.\nThe first aggregation method we’ll consider is .agg. The .agg method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. 
Let’s see this in action by finding the sum of all counts for each year in babynames – this is equivalent to finding the number of babies born in each year.\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"sum\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nWe can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of babynames, which is why we see smaller values for the summed counts.\n\n\n\nPerforming an aggregation\n\n\nCalling .agg has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by \"Year\", with a single row for each unique year in the original babynames DataFrame.\nThere are many different aggregation functions we can use, all of which are useful in different applications.\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"min\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n5\n\n\n1911\n5\n\n\n1912\n5\n\n\n1913\n5\n\n\n1914\n5\n\n\n\n\n\n\n\n\nbabynames[[\"Year\", \"Count\"]].groupby(\"Year\").agg(\"max\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n295\n\n\n1911\n390\n\n\n1912\n534\n\n\n1913\n614\n\n\n1914\n773\n\n\n\n\n\n\n\n\n# Same result, but now we explicitly tell pandas to only consider the \"Count\" column when summing\nbabynames.groupby(\"Year\")[[\"Count\"]].agg(\"sum\").head(5)\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nThere are many different aggregations that can be applied to the grouped data. 
The primary requirement is that an aggregation function must:\n\nTake in a Series of data (a single column of the grouped subframe).\nReturn a single value that aggregates this Series.\n\n\n4.2.1 Aggregation Functions\nBecause of this fairly broad requirement, pandas offers many ways of computing an aggregation.\nIn-built Python operations – such as sum, max, and min – are automatically recognized by pandas.\n\n# What is the minimum count for each name in any year?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"min\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n5\n\n\nAadarsh\n6\n\n\nAaden\n10\n\n\nAadhav\n6\n\n\nAadhini\n6\n\n\n\n\n\n\n\n\n# What is the largest single-year count of each name?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"max\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n7\n\n\nAadarsh\n6\n\n\nAaden\n158\n\n\nAadhav\n8\n\n\nAadhini\n6\n\n\n\n\n\n\n\nAs mentioned previously, functions from the NumPy library, such as np.mean, np.max, np.min, and np.sum, are also fair game in pandas.\n\n# What is the average count for each name across all years?\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"mean\").head()\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n6.000000\n\n\nAadarsh\n6.000000\n\n\nAaden\n46.214286\n\n\nAadhav\n6.750000\n\n\nAadhini\n6.000000\n\n\n\n\n\n\n\npandas also offers a number of in-built functions. Functions that are native to pandas can be referenced using their string name within a call to .agg. Some examples include:\n\n.agg(\"sum\")\n.agg(\"max\")\n.agg(\"min\")\n.agg(\"mean\")\n.agg(\"first\")\n.agg(\"last\")\n\nThe latter two entries in this list – \"first\" and \"last\" – are unique to pandas. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where multiple columns in a group share identical information. 
To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.\nLet’s illustrate this with an example. Say we add a new column to babynames that contains the first letter of each name.\n\n# Imagine we had an additional column, \"First Letter\". We'll explain this code next week\nbabynames[\"First Letter\"] = babynames[\"Name\"].str[0]\n\n# We construct a simplified DataFrame containing just a subset of columns\nbabynames_new = babynames[[\"Name\", \"First Letter\", \"Year\"]]\nbabynames_new.head()\n\n\n\n\n\n\n\n\nName\nFirst Letter\nYear\n\n\n\n\n115957\nDeandrea\nD\n1990\n\n\n101976\nDeandrea\nD\n1986\n\n\n131029\nLeandrea\nL\n1994\n\n\n108731\nDeandrea\nD\n1988\n\n\n308131\nDeandrea\nD\n1985\n\n\n\n\n\n\n\nIf we form groups for each name in the dataset, \"First Letter\" will be the same for all members of the group. This means that if we simply select the first entry for \"First Letter\" in the group, we’ll represent all data in that group.\nWe can use a dictionary to apply different aggregation functions to each column during grouping.\n\n\n\nAggregating using “first”\n\n\n\nbabynames_new.groupby(\"Name\").agg({\"First Letter\":\"first\", \"Year\":\"max\"}).head()\n\n\n\n\n\n\n\n\nFirst Letter\nYear\n\n\nName\n\n\n\n\n\n\nAadan\nA\n2014\n\n\nAadarsh\nA\n2019\n\n\nAaden\nA\n2020\n\n\nAadhav\nA\n2019\n\n\nAadhini\nA\n2022\n\n\n\n\n\n\n\n\n\n4.2.2 Plotting Birth Counts\nLet’s use .agg to find the total number of babies born in each year. Recall that using .agg with .groupby() follows the format: df.groupby(column_name).agg(aggregation_function). 
The line of code below gives us the total number of babies born in each year.\n\n\nCode\nbabynames.groupby(\"Year\")[[\"Count\"]].agg(sum).head(5)\n# Alternative 1\n# babynames.groupby(\"Year\")[[\"Count\"]].sum()\n# Alternative 2\n# babynames.groupby(\"Year\").sum(numeric_only=True)\n\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/390646742.py:1: FutureWarning:\n\nThe provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n\n\n\n\n\n\n\n\n\n\nCount\n\n\nYear\n\n\n\n\n\n1910\n9163\n\n\n1911\n9983\n\n\n1912\n17946\n\n\n1913\n22094\n\n\n1914\n26926\n\n\n\n\n\n\n\nHere’s an illustration of the process:\n\nPlotting the Dataframe we obtain tells an interesting story.\n\n\nCode\nimport plotly.express as px\npuzzle2 = babynames.groupby(\"Year\")[[\"Count\"]].agg(\"sum\")\npx.line(puzzle2, y = \"Count\")\n\n\n                                                \n\n\nA word of warning: we made an enormous assumption when we decided to use this dataset to estimate birth rate. According to this article from the Legistlative Analyst Office, the true number of babies born in California in 2020 was 421,275. However, our plot shows 362,882 babies —— what happened?\n\n\n4.2.3 Summary of the .groupby() Function\nA groupby operation involves some combination of splitting a DataFrame into grouped subframes, applying a function, and combining the results.\nFor some arbitrary DataFrame df below, the code df.groupby(\"year\").agg(sum) does the following:\n\nSplits the DataFrame into sub-DataFrames with rows belonging to the same year.\nApplies the sum function to each column of each sub-DataFrame.\nCombines the results of sum into a single DataFrame, indexed by year.\n\n\n\n\n4.2.4 Revisiting the .agg() Function\n.agg() can take in any function that aggregates several values into one summary value. 
Some commonly-used aggregation functions can even be called directly, without explicit use of .agg(). For example, we can call .mean() on .groupby():\nbabynames.groupby(\"Year\").mean().head()\nWe can now put this all into practice. Say we want to find the baby name with sex “F” that has fallen in popularity the most in California. To calculate this, we can first create a metric: “Ratio to Peak” (RTP). The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies born with the name in any year.\nLet’s start with calculating this for one baby, “Jennifer”.\n\n# We filter by babies with sex \"F\" and sort by \"Year\"\nf_babynames = babynames[babynames[\"Sex\"] == \"F\"]\nf_babynames = f_babynames.sort_values([\"Year\"])\n\n# Determine how many Jennifers were born in CA per year\njenn_counts_series = f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"]\n\n# Determine the max number of Jennifers born in a year and the number born in 2022 \n# to calculate RTP\nmax_jenn = max(f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"])\ncurr_jenn = f_babynames[f_babynames[\"Name\"] == \"Jennifer\"][\"Count\"].iloc[-1]\nrtp = curr_jenn / max_jenn\nrtp\n\nnp.float64(0.018796372629843364)\n\n\nBy creating a function to calculate RTP and applying it to our DataFrame by using .groupby(), we can easily compute the RTP for all names at once!\n\ndef ratio_to_peak(series):\n    return series.iloc[-1] / max(series)\n\n#Using .groupby() to apply the function\nrtp_table = f_babynames.groupby(\"Name\")[[\"Year\", \"Count\"]].agg(ratio_to_peak)\nrtp_table.head()\n\n\n\n\n\n\n\n\nYear\nCount\n\n\nName\n\n\n\n\n\n\nAadhini\n1.0\n1.000000\n\n\nAadhira\n1.0\n0.500000\n\n\nAadhya\n1.0\n0.660000\n\n\nAadya\n1.0\n0.586207\n\n\nAahana\n1.0\n0.269231\n\n\n\n\n\n\n\nIn the rows shown above, we can see that every row shown has a Year value of 1.0.\nThis is the “pandas-ification” of logic you saw in Data 8. 
Much of the logic you’ve learned in Data 8 will serve you well in Data 100.\n\n\n4.2.5 Nuisance Columns\nNote that you must be careful with which columns you apply the .agg() function to. If we were to apply our function to the table as a whole by doing f_babynames.groupby(\"Name\").agg(ratio_to_peak), executing our .agg() call would result in a TypeError.\n\nWe can avoid this issue (and prevent unintentional loss of data) by explicitly selecting column(s) we want to apply our aggregation function to BEFORE calling .agg(),\n\n\n4.2.6 Renaming Columns After Grouping\nBy default, .groupby will not rename any aggregated columns. As we can see in the table above, the aggregated column is still named Count even though it now represents the RTP. For better readability, we can rename Count to Count RTP\n\nrtp_table = rtp_table.rename(columns = {\"Count\": \"Count RTP\"})\nrtp_table\n\n\n\n\n\n\n\n\nYear\nCount RTP\n\n\nName\n\n\n\n\n\n\nAadhini\n1.0\n1.000000\n\n\nAadhira\n1.0\n0.500000\n\n\nAadhya\n1.0\n0.660000\n\n\nAadya\n1.0\n0.586207\n\n\nAahana\n1.0\n0.269231\n\n\n...\n...\n...\n\n\nZyanya\n1.0\n0.466667\n\n\nZyla\n1.0\n1.000000\n\n\nZylah\n1.0\n1.000000\n\n\nZyra\n1.0\n1.000000\n\n\nZyrah\n1.0\n0.833333\n\n\n\n\n13782 rows × 2 columns\n\n\n\n\n\n4.2.7 Some Data Science Payoff\nBy sorting rtp_table, we can see the names whose popularity has decreased the most.\n\nrtp_table = rtp_table.rename(columns = {\"Count\": \"Count RTP\"})\nrtp_table.sort_values(\"Count RTP\").head()\n\n\n\n\n\n\n\n\nYear\nCount RTP\n\n\nName\n\n\n\n\n\n\nDebra\n1.0\n0.001260\n\n\nDebbie\n1.0\n0.002815\n\n\nCarol\n1.0\n0.003180\n\n\nTammy\n1.0\n0.003249\n\n\nSusan\n1.0\n0.003305\n\n\n\n\n\n\n\nTo visualize the above DataFrame, let’s look at the line plot below:\n\n\nCode\nimport plotly.express as px\npx.line(f_babynames[f_babynames[\"Name\"] == \"Debra\"], x = \"Year\", y = \"Count\")\n\n\n                                                \n\n\nWe can get the list of the top 10 names and then 
plot popularity with the following code:\n\ntop10 = rtp_table.sort_values(\"Count RTP\").head(10).index\npx.line(\n    f_babynames[f_babynames[\"Name\"].isin(top10)], \n    x = \"Year\", \n    y = \"Count\", \n    color = \"Name\"\n)\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/plotly/express/_core.py:1980: FutureWarning:\n\nWhen grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.\n\n\n\n                                                \n\n\nAs a quick exercise, consider what code would compute the total number of babies with each name.\n\n\nCode\nbabynames.groupby(\"Name\")[[\"Count\"]].agg(\"sum\").head()\n# alternative solution: \n# babynames.groupby(\"Name\")[[\"Count\"]].sum()\n\n\n\n\n\n\n\n\n\nCount\n\n\nName\n\n\n\n\n\nAadan\n18\n\n\nAadarsh\n6\n\n\nAaden\n647\n\n\nAadhav\n27\n\n\nAadhini\n6",
     "crumbs": [
       "<span class='chapter-number'>4</span>  <span class='chapter-title'>Pandas III</span>"
     ]
@@ -194,7 +194,7 @@
     "href": "pandas_3/pandas_3.html#groupby-continued",
     "title": "4  Pandas III",
     "section": "4.3 .groupby(), Continued",
-    "text": "4.3 .groupby(), Continued\nWe’ll work with the elections DataFrame again.\n\n\nCode\nimport pandas as pd\nimport numpy as np\n\nelections = pd.read_csv(\"data/elections.csv\")\nelections.head(5)\n\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n0\n1824\nAndrew Jackson\nDemocratic-Republican\n151271\nloss\n57.210122\n\n\n1\n1824\nJohn Quincy Adams\nDemocratic-Republican\n113142\nwin\n42.789878\n\n\n2\n1828\nAndrew Jackson\nDemocratic\n642806\nwin\n56.203927\n\n\n3\n1828\nJohn Quincy Adams\nNational Republican\n500897\nloss\n43.796073\n\n\n4\n1832\nAndrew Jackson\nDemocratic\n702735\nwin\n54.574789\n\n\n\n\n\n\n\n\n4.3.1 Raw GroupBy Objects\nThe result of groupby applied to a DataFrame is a DataFrameGroupBy object, not a DataFrame.\n\ngrouped_by_year = elections.groupby(\"Year\")\ntype(grouped_by_year)\n\npandas.core.groupby.generic.DataFrameGroupBy\n\n\nThere are several ways to look into DataFrameGroupBy objects:\n\ngrouped_by_party = elections.groupby(\"Party\")\ngrouped_by_party.groups\n\n{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 
101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}\n\n\n\ngrouped_by_party.get_group(\"Socialist\")\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n58\n1904\nEugene V. Debs\nSocialist\n402810\nloss\n2.985897\n\n\n62\n1908\nEugene V. Debs\nSocialist\n420852\nloss\n2.850866\n\n\n66\n1912\nEugene V. Debs\nSocialist\n901551\nloss\n6.004354\n\n\n71\n1916\nAllan L. Benson\nSocialist\n590524\nloss\n3.194193\n\n\n76\n1920\nEugene V. Debs\nSocialist\n913693\nloss\n3.428282\n\n\n85\n1928\nNorman Thomas\nSocialist\n267478\nloss\n0.728623\n\n\n88\n1932\nNorman Thomas\nSocialist\n884885\nloss\n2.236211\n\n\n92\n1936\nNorman Thomas\nSocialist\n187910\nloss\n0.412876\n\n\n95\n1940\nNorman Thomas\nSocialist\n116599\nloss\n0.234237\n\n\n102\n1948\nNorman Thomas\nSocialist\n139569\nloss\n0.286312\n\n\n\n\n\n\n\n\n\n4.3.2 Other GroupBy Methods\nThere are many aggregation methods we can use with .agg. 
Some useful options are:\n\n.mean: creates a new DataFrame with the mean value of each group\n.sum: creates a new DataFrame with the sum of each group\n.max and .min: creates a new DataFrame with the maximum/minimum value of each group\n.first and .last: creates a new DataFrame with the first/last row in each group\n.size: creates a new Series with the number of entries in each group\n.count: creates a new DataFrame with the number of entries, excluding missing values.\n\nLet’s illustrate some examples by creating a DataFrame called df.\n\ndf = pd.DataFrame({'letter':['A','A','B','C','C','C'], \n                   'num':[1,2,3,4,np.nan,4], \n                   'state':[np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})\ndf\n\n\n\n\n\n\n\n\nletter\nnum\nstate\n\n\n\n\n0\nA\n1.0\nNaN\n\n\n1\nA\n2.0\ntx\n\n\n2\nB\n3.0\nfl\n\n\n3\nC\n4.0\nhi\n\n\n4\nC\nNaN\nNaN\n\n\n5\nC\n4.0\nak\n\n\n\n\n\n\n\nNote the slight difference between .size() and .count(): while .size() returns a Series and counts the number of entries including the missing values, .count() returns a DataFrame and counts the number of entries in each column excluding missing values.\n\ndf.groupby(\"letter\").size()\n\nletter\nA    2\nB    1\nC    3\ndtype: int64\n\n\n\ndf.groupby(\"letter\").count()\n\n\n\n\n\n\n\n\nnum\nstate\n\n\nletter\n\n\n\n\n\n\nA\n2\n1\n\n\nB\n1\n1\n\n\nC\n2\n2\n\n\n\n\n\n\n\nYou might recall that the value_counts() function in the previous note does something similar. It turns out value_counts() and groupby.size() are the same, except value_counts() sorts the resulting Series in descending order automatically.\n\ndf[\"letter\"].value_counts()\n\nletter\nC    3\nA    2\nB    1\nName: count, dtype: int64\n\n\nThese (and other) aggregation functions are so common that pandas allows for writing shorthand. 
Instead of explicitly stating the use of .agg, we can call the function directly on the GroupBy object.\nFor example, the following are equivalent:\n\nelections.groupby(\"Candidate\").agg(mean)\nelections.groupby(\"Candidate\").mean()\n\nThere are many other methods that pandas supports. You can check them out on the pandas documentation.\n\n\n4.3.3 Filtering by Group\nAnother common use for GroupBy objects is to filter data by group.\ngroupby.filter takes an argument func, where func is a function that:\n\nTakes a DataFrame object as input\nReturns a single True or False.\n\ngroupby.filter applies func to each group/sub-DataFrame:\n\nIf func returns True for a group, then all rows belonging to the group are preserved.\nIf func returns False for a group, then all rows belonging to that group are filtered out.\n\nIn other words, sub-DataFrames that correspond to True are returned in the final result, whereas those with a False value are not. Importantly, groupby.filter is different from groupby.agg in that an entire sub-DataFrame is returned in the final DataFrame, not just a single row. As a result, groupby.filter preserves the original indices and the column we grouped on does NOT become the index!\n\nTo illustrate how this happens, let’s go back to the elections dataset. Say we want to identify “tight” election years – that is, we want to find all rows that correspond to election years where all candidates in that year won a similar portion of the total vote. Specifically, let’s find all rows corresponding to a year where no candidate won more than 45% of the total vote.\nIn other words, we want to:\n\nFind the years where the maximum % in that year is less than 45%\nReturn all DataFrame rows that correspond to these years\n\nFor each year, we need to find the maximum % among all rows for that year. 
If this maximum % is lower than 45%, we will tell pandas to keep all rows corresponding to that year.\n\nelections.groupby(\"Year\").filter(lambda sf: sf[\"%\"].max() &lt; 45).head(9)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n23\n1860\nAbraham Lincoln\nRepublican\n1855993\nwin\n39.699408\n\n\n24\n1860\nJohn Bell\nConstitutional Union\n590901\nloss\n12.639283\n\n\n25\n1860\nJohn C. Breckinridge\nSouthern Democratic\n848019\nloss\n18.138998\n\n\n26\n1860\nStephen A. Douglas\nNorthern Democratic\n1380202\nloss\n29.522311\n\n\n66\n1912\nEugene V. Debs\nSocialist\n901551\nloss\n6.004354\n\n\n67\n1912\nEugene W. Chafin\nProhibition\n208156\nloss\n1.386325\n\n\n68\n1912\nTheodore Roosevelt\nProgressive\n4122721\nloss\n27.457433\n\n\n69\n1912\nWilliam Taft\nRepublican\n3486242\nloss\n23.218466\n\n\n70\n1912\nWoodrow Wilson\nDemocratic\n6296284\nwin\n41.933422\n\n\n\n\n\n\n\nWhat’s going on here? In this example, we’ve defined our filtering function, func, to be lambda sf: sf[\"%\"].max() &lt; 45. This filtering function will find the maximum \"%\" value among all entries in the grouped sub-DataFrame, which we call sf. If the maximum value is less than 45, then the filter function will return True and all rows in that grouped sub-DataFrame will appear in the final output DataFrame.\nExamine the DataFrame above. Notice how, in this preview of the first 9 rows, all entries from the years 1860 and 1912 appear. This means that in 1860 and 1912, no candidate in that year won more than 45% of the total vote.\nYou may ask: how is the groupby.filter procedure different to the boolean filtering we’ve seen previously? Boolean filtering considers individual rows when applying a boolean condition. For example, the code elections[elections[\"%\"] &lt; 45] will check the \"%\" value of every single row in elections; if it is less than 45, then that row will be kept in the output. 
groupby.filter, in contrast, applies a boolean condition across all rows in a group. If not all rows in that group satisfy the condition specified by the filter, the entire group will be discarded in the output.\n\n\n4.3.4 Aggregation with lambda Functions\nWhat if we wish to aggregate our DataFrame using a non-standard function – for example, a function of our own design? We can do so by combining .agg with lambda expressions.\nLet’s first consider a puzzle to jog our memory. We will attempt to find the Candidate from each Party with the highest % of votes.\nA naive approach may be to group by the Party column and aggregate by the maximum.\n\nelections.groupby(\"Party\").agg(max).head(10)\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/4278286395.py:1: FutureWarning:\n\nThe provided callable &lt;built-in function max&gt; is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"max\" instead.\n\n\n\n\n\n\n\n\n\n\nYear\nCandidate\nPopular vote\nResult\n%\n\n\nParty\n\n\n\n\n\n\n\n\n\nAmerican\n1976\nThomas J. Anderson\n873053\nloss\n21.554001\n\n\nAmerican Independent\n1976\nLester Maddox\n9901118\nloss\n13.571218\n\n\nAnti-Masonic\n1832\nWilliam Wirt\n100715\nloss\n7.821583\n\n\nAnti-Monopoly\n1884\nBenjamin Butler\n134294\nloss\n1.335838\n\n\nCitizens\n1980\nBarry Commoner\n233052\nloss\n0.270182\n\n\nCommunist\n1932\nWilliam Z. Foster\n103307\nloss\n0.261069\n\n\nConstitution\n2016\nMichael Peroutka\n203091\nloss\n0.152398\n\n\nConstitutional Union\n1860\nJohn Bell\n590901\nloss\n12.639283\n\n\nDemocratic\n2020\nWoodrow Wilson\n81268924\nwin\n61.344703\n\n\nDemocratic-Republican\n1824\nJohn Quincy Adams\n151271\nwin\n57.210122\n\n\n\n\n\n\n\nThis approach is clearly wrong – the DataFrame claims that Woodrow Wilson won the presidency in 2020.\nWhy is this happening? Here, the max aggregation function is taken over every column independently. 
Among Democrats, max is computing:\n\nThe most recent Year a Democratic candidate ran for president (2020)\nThe Candidate with the alphabetically “largest” name (“Woodrow Wilson”)\nThe Result with the alphabetically “largest” outcome (“win”)\n\nInstead, let’s try a different approach. We will:\n\nSort the DataFrame so that rows are in descending order of %\nGroup by Party and select the first row of each sub-DataFrame\n\nWhile it may seem unintuitive, sorting elections by descending order of % is extremely helpful. If we then group by Party, the first row of each GroupBy object will contain information about the Candidate with the highest voter %.\n\nelections_sorted_by_percent = elections.sort_values(\"%\", ascending=False)\nelections_sorted_by_percent.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n114\n1964\nLyndon Johnson\nDemocratic\n43127041\nwin\n61.344703\n\n\n91\n1936\nFranklin Roosevelt\nDemocratic\n27752648\nwin\n60.978107\n\n\n120\n1972\nRichard Nixon\nRepublican\n47168710\nwin\n60.907806\n\n\n79\n1920\nWarren Harding\nRepublican\n16144093\nwin\n60.574501\n\n\n133\n1984\nRonald Reagan\nRepublican\n54455472\nwin\n59.023326\n\n\n\n\n\n\n\n\nelections_sorted_by_percent.groupby(\"Party\").agg(lambda x : x.iloc[0]).head(10)\n\n# Equivalent to the below code\n# elections_sorted_by_percent.groupby(\"Party\").agg('first').head(10)\n\n\n\n\n\n\n\n\nYear\nCandidate\nPopular vote\nResult\n%\n\n\nParty\n\n\n\n\n\n\n\n\n\nAmerican\n1856\nMillard Fillmore\n873053\nloss\n21.554001\n\n\nAmerican Independent\n1968\nGeorge Wallace\n9901118\nloss\n13.571218\n\n\nAnti-Masonic\n1832\nWilliam Wirt\n100715\nloss\n7.821583\n\n\nAnti-Monopoly\n1884\nBenjamin Butler\n134294\nloss\n1.335838\n\n\nCitizens\n1980\nBarry Commoner\n233052\nloss\n0.270182\n\n\nCommunist\n1932\nWilliam Z. 
Foster\n103307\nloss\n0.261069\n\n\nConstitution\n2008\nChuck Baldwin\n199750\nloss\n0.152398\n\n\nConstitutional Union\n1860\nJohn Bell\n590901\nloss\n12.639283\n\n\nDemocratic\n1964\nLyndon Johnson\n43127041\nwin\n61.344703\n\n\nDemocratic-Republican\n1824\nAndrew Jackson\n151271\nloss\n57.210122\n\n\n\n\n\n\n\nHere’s an illustration of the process:\n\nNotice how our code correctly determines that Lyndon Johnson from the Democratic Party has the highest voter %.\nMore generally, lambda functions are used to design custom aggregation functions that aren’t pre-defined by Python. The input parameter x to the lambda function is a GroupBy object. Therefore, it should make sense why lambda x : x.iloc[0] selects the first row in each groupby object.\nIn fact, there’s a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We’ve given a few examples below.\nNote: Understanding these alternative solutions is not required. 
They are given to demonstrate the vast number of problem-solving approaches in pandas.\n\n# Using the idxmax function\nbest_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]\nbest_per_party.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n22\n1856\nMillard Fillmore\nAmerican\n873053\nloss\n21.554001\n\n\n115\n1968\nGeorge Wallace\nAmerican Independent\n9901118\nloss\n13.571218\n\n\n6\n1832\nWilliam Wirt\nAnti-Masonic\n100715\nloss\n7.821583\n\n\n38\n1884\nBenjamin Butler\nAnti-Monopoly\n134294\nloss\n1.335838\n\n\n127\n1980\nBarry Commoner\nCitizens\n233052\nloss\n0.270182\n\n\n\n\n\n\n\n\n# Using the .drop_duplicates function\nbest_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')\nbest_per_party2.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n148\n1996\nJohn Hagelin\nNatural Law\n113670\nloss\n0.118219\n\n\n164\n2008\nChuck Baldwin\nConstitution\n199750\nloss\n0.152398\n\n\n110\n1956\nT. Coleman Andrews\nStates' Rights\n107929\nloss\n0.174883\n\n\n147\n1996\nHoward Phillips\nTaxpayers\n184656\nloss\n0.192045\n\n\n136\n1988\nLenora Fulani\nNew Alliance\n217221\nloss\n0.237804",
+    "text": "4.3 .groupby(), Continued\nWe’ll work with the elections DataFrame again.\n\n\nCode\nimport pandas as pd\nimport numpy as np\n\nelections = pd.read_csv(\"data/elections.csv\")\nelections.head(5)\n\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n0\n1824\nAndrew Jackson\nDemocratic-Republican\n151271\nloss\n57.210122\n\n\n1\n1824\nJohn Quincy Adams\nDemocratic-Republican\n113142\nwin\n42.789878\n\n\n2\n1828\nAndrew Jackson\nDemocratic\n642806\nwin\n56.203927\n\n\n3\n1828\nJohn Quincy Adams\nNational Republican\n500897\nloss\n43.796073\n\n\n4\n1832\nAndrew Jackson\nDemocratic\n702735\nwin\n54.574789\n\n\n\n\n\n\n\n\n4.3.1 Raw GroupBy Objects\nThe result of groupby applied to a DataFrame is a DataFrameGroupBy object, not a DataFrame.\n\ngrouped_by_year = elections.groupby(\"Year\")\ntype(grouped_by_year)\n\npandas.core.groupby.generic.DataFrameGroupBy\n\n\nThere are several ways to look into DataFrameGroupBy objects:\n\ngrouped_by_party = elections.groupby(\"Party\")\ngrouped_by_party.groups\n\n{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 
101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}\n\n\n\ngrouped_by_party.get_group(\"Socialist\")\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n58\n1904\nEugene V. Debs\nSocialist\n402810\nloss\n2.985897\n\n\n62\n1908\nEugene V. Debs\nSocialist\n420852\nloss\n2.850866\n\n\n66\n1912\nEugene V. Debs\nSocialist\n901551\nloss\n6.004354\n\n\n71\n1916\nAllan L. Benson\nSocialist\n590524\nloss\n3.194193\n\n\n76\n1920\nEugene V. Debs\nSocialist\n913693\nloss\n3.428282\n\n\n85\n1928\nNorman Thomas\nSocialist\n267478\nloss\n0.728623\n\n\n88\n1932\nNorman Thomas\nSocialist\n884885\nloss\n2.236211\n\n\n92\n1936\nNorman Thomas\nSocialist\n187910\nloss\n0.412876\n\n\n95\n1940\nNorman Thomas\nSocialist\n116599\nloss\n0.234237\n\n\n102\n1948\nNorman Thomas\nSocialist\n139569\nloss\n0.286312\n\n\n\n\n\n\n\n\n\n4.3.2 Other GroupBy Methods\nThere are many aggregation methods we can use with .agg. 
Some useful options are:\n\n.mean: creates a new DataFrame with the mean value of each group\n.sum: creates a new DataFrame with the sum of each group\n.max and .min: creates a new DataFrame with the maximum/minimum value of each group\n.first and .last: creates a new DataFrame with the first/last row in each group\n.size: creates a new Series with the number of entries in each group\n.count: creates a new DataFrame with the number of entries, excluding missing values.\n\nLet’s illustrate some examples by creating a DataFrame called df.\n\ndf = pd.DataFrame({'letter':['A','A','B','C','C','C'], \n                   'num':[1,2,3,4,np.nan,4], \n                   'state':[np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})\ndf\n\n\n\n\n\n\n\n\nletter\nnum\nstate\n\n\n\n\n0\nA\n1.0\nNaN\n\n\n1\nA\n2.0\ntx\n\n\n2\nB\n3.0\nfl\n\n\n3\nC\n4.0\nhi\n\n\n4\nC\nNaN\nNaN\n\n\n5\nC\n4.0\nak\n\n\n\n\n\n\n\nNote the slight difference between .size() and .count(): while .size() returns a Series and counts the number of entries including the missing values, .count() returns a DataFrame and counts the number of entries in each column excluding missing values.\n\ndf.groupby(\"letter\").size()\n\nletter\nA    2\nB    1\nC    3\ndtype: int64\n\n\n\ndf.groupby(\"letter\").count()\n\n\n\n\n\n\n\n\nnum\nstate\n\n\nletter\n\n\n\n\n\n\nA\n2\n1\n\n\nB\n1\n1\n\n\nC\n2\n2\n\n\n\n\n\n\n\nYou might recall that the value_counts() function in the previous note does something similar. It turns out value_counts() and groupby.size() are the same, except value_counts() sorts the resulting Series in descending order automatically.\n\ndf[\"letter\"].value_counts()\n\nletter\nC    3\nA    2\nB    1\nName: count, dtype: int64\n\n\nThese (and other) aggregation functions are so common that pandas allows for writing shorthand. 
Instead of explicitly stating the use of .agg, we can call the function directly on the GroupBy object.\nFor example, the following are equivalent:\n\nelections.groupby(\"Candidate\").agg(\"mean\")\nelections.groupby(\"Candidate\").mean()\n\nThere are many other methods that pandas supports. You can check them out in the pandas documentation.\n\n\n4.3.3 Filtering by Group\nAnother common use for GroupBy objects is to filter data by group.\ngroupby.filter takes an argument func, where func is a function that:\n\nTakes a DataFrame object as input\nReturns a single True or False.\n\ngroupby.filter applies func to each group/sub-DataFrame:\n\nIf func returns True for a group, then all rows belonging to the group are preserved.\nIf func returns False for a group, then all rows belonging to that group are filtered out.\n\nIn other words, sub-DataFrames that correspond to True are returned in the final result, whereas those with a False value are not. Importantly, groupby.filter is different from groupby.agg in that an entire sub-DataFrame is returned in the final DataFrame, not just a single row. As a result, groupby.filter preserves the original indices and the column we grouped on does NOT become the index!\n\nTo illustrate how this happens, let’s go back to the elections dataset. Say we want to identify “tight” election years – that is, we want to find all rows that correspond to election years where all candidates in that year won a similar portion of the total vote. Specifically, let’s find all rows corresponding to a year where no candidate won more than 45% of the total vote.\nIn other words, we want to:\n\nFind the years where the maximum % in that year is less than 45%\nReturn all DataFrame rows that correspond to these years\n\nFor each year, we need to find the maximum % among all rows for that year. 
If this maximum % is lower than 45%, we will tell pandas to keep all rows corresponding to that year.\n\nelections.groupby(\"Year\").filter(lambda sf: sf[\"%\"].max() &lt; 45).head(9)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n23\n1860\nAbraham Lincoln\nRepublican\n1855993\nwin\n39.699408\n\n\n24\n1860\nJohn Bell\nConstitutional Union\n590901\nloss\n12.639283\n\n\n25\n1860\nJohn C. Breckinridge\nSouthern Democratic\n848019\nloss\n18.138998\n\n\n26\n1860\nStephen A. Douglas\nNorthern Democratic\n1380202\nloss\n29.522311\n\n\n66\n1912\nEugene V. Debs\nSocialist\n901551\nloss\n6.004354\n\n\n67\n1912\nEugene W. Chafin\nProhibition\n208156\nloss\n1.386325\n\n\n68\n1912\nTheodore Roosevelt\nProgressive\n4122721\nloss\n27.457433\n\n\n69\n1912\nWilliam Taft\nRepublican\n3486242\nloss\n23.218466\n\n\n70\n1912\nWoodrow Wilson\nDemocratic\n6296284\nwin\n41.933422\n\n\n\n\n\n\n\nWhat’s going on here? In this example, we’ve defined our filtering function, func, to be lambda sf: sf[\"%\"].max() &lt; 45. This filtering function will find the maximum \"%\" value among all entries in the grouped sub-DataFrame, which we call sf. If the maximum value is less than 45, then the filter function will return True and all rows in that grouped sub-DataFrame will appear in the final output DataFrame.\nExamine the DataFrame above. Notice how, in this preview of the first 9 rows, all entries from the years 1860 and 1912 appear. This means that in 1860 and 1912, no candidate in that year won more than 45% of the total vote.\nYou may ask: how is the groupby.filter procedure different to the boolean filtering we’ve seen previously? Boolean filtering considers individual rows when applying a boolean condition. For example, the code elections[elections[\"%\"] &lt; 45] will check the \"%\" value of every single row in elections; if it is less than 45, then that row will be kept in the output. 
groupby.filter, in contrast, evaluates a single boolean condition on each group as a whole. If a group fails the condition specified by the filter, every row in that group is discarded from the output.\n\n\n4.3.4 Aggregation with lambda Functions\nWhat if we wish to aggregate our DataFrame using a non-standard function – for example, a function of our own design? We can do so by combining .agg with lambda expressions.\nLet’s first consider a puzzle to jog our memory. We will attempt to find the Candidate from each Party with the highest % of votes.\nA naive approach may be to group by the Party column and aggregate by the maximum.\n\nelections.groupby(\"Party\").agg(max).head(10)\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/4278286395.py:1: FutureWarning:\n\nThe provided callable &lt;built-in function max&gt; is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"max\" instead.\n\n\n\n\n\n\n\n\n\n\nYear\nCandidate\nPopular vote\nResult\n%\n\n\nParty\n\n\n\n\n\n\n\n\n\nAmerican\n1976\nThomas J. Anderson\n873053\nloss\n21.554001\n\n\nAmerican Independent\n1976\nLester Maddox\n9901118\nloss\n13.571218\n\n\nAnti-Masonic\n1832\nWilliam Wirt\n100715\nloss\n7.821583\n\n\nAnti-Monopoly\n1884\nBenjamin Butler\n134294\nloss\n1.335838\n\n\nCitizens\n1980\nBarry Commoner\n233052\nloss\n0.270182\n\n\nCommunist\n1932\nWilliam Z. Foster\n103307\nloss\n0.261069\n\n\nConstitution\n2016\nMichael Peroutka\n203091\nloss\n0.152398\n\n\nConstitutional Union\n1860\nJohn Bell\n590901\nloss\n12.639283\n\n\nDemocratic\n2020\nWoodrow Wilson\n81268924\nwin\n61.344703\n\n\nDemocratic-Republican\n1824\nJohn Quincy Adams\n151271\nwin\n57.210122\n\n\n\n\n\n\n\nThis approach is clearly wrong – the DataFrame claims that Woodrow Wilson won the presidency in 2020.\nWhy is this happening? Here, the max aggregation function is taken over every column independently. 
Among Democrats, max is computing:\n\nThe most recent Year a Democratic candidate ran for president (2020)\nThe Candidate with the alphabetically “largest” name (“Woodrow Wilson”)\nThe Result with the alphabetically “largest” outcome (“win”)\n\nInstead, let’s try a different approach. We will:\n\nSort the DataFrame so that rows are in descending order of %\nGroup by Party and select the first row of each sub-DataFrame\n\nWhile it may seem unintuitive, sorting elections by descending order of % is extremely helpful. If we then group by Party, the first row of each GroupBy object will contain information about the Candidate with the highest voter %.\n\nelections_sorted_by_percent = elections.sort_values(\"%\", ascending=False)\nelections_sorted_by_percent.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n114\n1964\nLyndon Johnson\nDemocratic\n43127041\nwin\n61.344703\n\n\n91\n1936\nFranklin Roosevelt\nDemocratic\n27752648\nwin\n60.978107\n\n\n120\n1972\nRichard Nixon\nRepublican\n47168710\nwin\n60.907806\n\n\n79\n1920\nWarren Harding\nRepublican\n16144093\nwin\n60.574501\n\n\n133\n1984\nRonald Reagan\nRepublican\n54455472\nwin\n59.023326\n\n\n\n\n\n\n\n\nelections_sorted_by_percent.groupby(\"Party\").agg(lambda x : x.iloc[0]).head(10)\n\n# Equivalent to the below code\n# elections_sorted_by_percent.groupby(\"Party\").agg('first').head(10)\n\n\n\n\n\n\n\n\nYear\nCandidate\nPopular vote\nResult\n%\n\n\nParty\n\n\n\n\n\n\n\n\n\nAmerican\n1856\nMillard Fillmore\n873053\nloss\n21.554001\n\n\nAmerican Independent\n1968\nGeorge Wallace\n9901118\nloss\n13.571218\n\n\nAnti-Masonic\n1832\nWilliam Wirt\n100715\nloss\n7.821583\n\n\nAnti-Monopoly\n1884\nBenjamin Butler\n134294\nloss\n1.335838\n\n\nCitizens\n1980\nBarry Commoner\n233052\nloss\n0.270182\n\n\nCommunist\n1932\nWilliam Z. 
Foster\n103307\nloss\n0.261069\n\n\nConstitution\n2008\nChuck Baldwin\n199750\nloss\n0.152398\n\n\nConstitutional Union\n1860\nJohn Bell\n590901\nloss\n12.639283\n\n\nDemocratic\n1964\nLyndon Johnson\n43127041\nwin\n61.344703\n\n\nDemocratic-Republican\n1824\nAndrew Jackson\n151271\nloss\n57.210122\n\n\n\n\n\n\n\nHere’s an illustration of the process:\n\nNotice how our code correctly determines that Lyndon Johnson from the Democratic Party has the highest voter %.\nMore generally, lambda functions are used to design custom aggregation functions that aren’t pre-defined by pandas. With .agg, the input parameter x to the lambda function is a Series holding one column of one group’s sub-DataFrame. This is why lambda x : x.iloc[0] selects the first entry of every column in each group, which together make up the first row of that group.\nIn fact, there are a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We’ve given a few examples below.\nNote: Understanding these alternative solutions is not required. 
They are given to demonstrate the vast number of problem-solving approaches in pandas.\n\n# Using the idxmax function\nbest_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]\nbest_per_party.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n22\n1856\nMillard Fillmore\nAmerican\n873053\nloss\n21.554001\n\n\n115\n1968\nGeorge Wallace\nAmerican Independent\n9901118\nloss\n13.571218\n\n\n6\n1832\nWilliam Wirt\nAnti-Masonic\n100715\nloss\n7.821583\n\n\n38\n1884\nBenjamin Butler\nAnti-Monopoly\n134294\nloss\n1.335838\n\n\n127\n1980\nBarry Commoner\nCitizens\n233052\nloss\n0.270182\n\n\n\n\n\n\n\n\n# Using the .drop_duplicates function\nbest_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')\nbest_per_party2.head(5)\n\n\n\n\n\n\n\n\nYear\nCandidate\nParty\nPopular vote\nResult\n%\n\n\n\n\n148\n1996\nJohn Hagelin\nNatural Law\n113670\nloss\n0.118219\n\n\n164\n2008\nChuck Baldwin\nConstitution\n199750\nloss\n0.152398\n\n\n110\n1956\nT. Coleman Andrews\nStates' Rights\n107929\nloss\n0.174883\n\n\n147\n1996\nHoward Phillips\nTaxpayers\n184656\nloss\n0.192045\n\n\n136\n1988\nLenora Fulani\nNew Alliance\n217221\nloss\n0.237804",
     "crumbs": [
       "<span class='chapter-number'>4</span>  <span class='chapter-title'>Pandas III</span>"
     ]
@@ -204,7 +204,7 @@
     "href": "pandas_3/pandas_3.html#aggregating-data-with-pivot-tables",
     "title": "4  Pandas III",
     "section": "4.4 Aggregating Data with Pivot Tables",
-    "text": "4.4 Aggregating Data with Pivot Tables\nWe know now that .groupby gives us the ability to group and aggregate data across our DataFrame. The examples above formed groups using just one column in the DataFrame. It’s possible to group by multiple columns at once by passing in a list of column names to .groupby.\nLet’s consider the babynames dataset again. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we’ll group by both the \"Year\" and \"Sex\" columns.\n\nbabynames.head()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\nFirst Letter\n\n\n\n\n115957\nCA\nF\n1990\nDeandrea\n5\nD\n\n\n101976\nCA\nF\n1986\nDeandrea\n6\nD\n\n\n131029\nCA\nF\n1994\nLeandrea\n5\nL\n\n\n108731\nCA\nF\n1988\nDeandrea\n5\nD\n\n\n308131\nCA\nM\n1985\nDeandrea\n6\nD\n\n\n\n\n\n\n\n\n# Find the total number of baby names associated with each sex for each \n# year in the data\nbabynames.groupby([\"Year\", \"Sex\"])[[\"Count\"]].agg(sum).head(6)\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73670/3186035650.py:3: FutureWarning:\n\nThe provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n\n\n\n\n\n\n\n\n\n\n\nCount\n\n\nYear\nSex\n\n\n\n\n\n1910\nF\n5950\n\n\nM\n3213\n\n\n1911\nF\n6602\n\n\nM\n3381\n\n\n1912\nF\n9804\n\n\nM\n8142\n\n\n\n\n\n\n\nNotice that both \"Year\" and \"Sex\" serve as the index of the DataFrame (they are both rendered in bold). We’ve created a multi-index DataFrame where two different index values, the year and sex, are used to uniquely identify each row.\nThis isn’t the most intuitive way of representing this data – and, because multi-indexed DataFrames have multiple dimensions in their index, they can often be difficult to use.\nAnother strategy to aggregate across two columns is to create a pivot table. 
You saw these back in Data 8. One set of values is used to create the index of the pivot table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.\nHere’s an illustration of the process:\n\nThe best way to understand pivot tables is to see one in action. Let’s return to our original goal of summing the total number of names associated with each combination of year and sex. We’ll call the pandas .pivot_table method to create a new table.\n\n# The `pivot_table` method is used to generate a Pandas pivot table\nimport numpy as np\nbabynames.pivot_table(\n    index = \"Year\",\n    columns = \"Sex\",    \n    values = \"Count\", \n    aggfunc = \"sum\", \n).head(5)\n\n\n\n\n\n\n\nSex\nF\nM\n\n\nYear\n\n\n\n\n\n\n1910\n5950\n3213\n\n\n1911\n6602\n3381\n\n\n1912\n9804\n8142\n\n\n1913\n11860\n10234\n\n\n1914\n13815\n13111\n\n\n\n\n\n\n\nLooks a lot better! Now, our DataFrame is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of \"Year\" and \"Sex\".\nLet’s take a closer look at the code implemented above.\n\nindex = \"Year\" specifies the column name in the original DataFrame that should be used as the index of the pivot table\ncolumns = \"Sex\" specifies the column name in the original DataFrame that should be used to generate the columns of the pivot table\nvalues = \"Count\" indicates what values from the original DataFrame should be used to populate the entry for each index-column combination\naggfunc = np.sum tells pandas what function to use when aggregating the data specified by values. 
Here, we are summing the name counts for each pair of \"Year\" and \"Sex\"\n\nWe can even include multiple values in the index or columns of our pivot tables.\n\nbabynames_pivot = babynames.pivot_table(\n    index=\"Year\",     # the rows (turned into index)\n    columns=\"Sex\",    # the column values\n    values=[\"Count\", \"Name\"], \n    aggfunc=\"max\",      # group operation\n)\nbabynames_pivot.head(6)\n\n\n\n\n\n\n\n\nCount\nName\n\n\nSex\nF\nM\nF\nM\n\n\nYear\n\n\n\n\n\n\n\n\n1910\n295\n237\nYvonne\nWilliam\n\n\n1911\n390\n214\nZelma\nWillis\n\n\n1912\n534\n501\nYvonne\nWoodrow\n\n\n1913\n584\n614\nZelma\nYoshio\n\n\n1914\n773\n769\nZelma\nYoshio\n\n\n1915\n998\n1033\nZita\nYukio\n\n\n\n\n\n\n\nNote that each row provides the number of girls and number of boys having that year’s most common name, and also lists the alphabetically largest girl name and boy name. The counts for number of girls/boys in the resulting DataFrame do not correspond to the names listed. For example, in 1910, the most popular girl name is given to 295 girls, but that name was likely not Yvonne.",
+    "text": "4.4 Aggregating Data with Pivot Tables\nWe know now that .groupby gives us the ability to group and aggregate data across our DataFrame. The examples above formed groups using just one column in the DataFrame. It’s possible to group by multiple columns at once by passing in a list of column names to .groupby.\nLet’s consider the babynames dataset again. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we’ll group by both the \"Year\" and \"Sex\" columns.\n\nbabynames.head()\n\n\n\n\n\n\n\n\nState\nSex\nYear\nName\nCount\nFirst Letter\n\n\n\n\n115957\nCA\nF\n1990\nDeandrea\n5\nD\n\n\n101976\nCA\nF\n1986\nDeandrea\n6\nD\n\n\n131029\nCA\nF\n1994\nLeandrea\n5\nL\n\n\n108731\nCA\nF\n1988\nDeandrea\n5\nD\n\n\n308131\nCA\nM\n1985\nDeandrea\n6\nD\n\n\n\n\n\n\n\n\n# Find the total number of baby names associated with each sex for each \n# year in the data\nbabynames.groupby([\"Year\", \"Sex\"])[[\"Count\"]].agg(sum).head(6)\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83762/3186035650.py:3: FutureWarning:\n\nThe provided callable &lt;built-in function sum&gt; is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n\n\n\n\n\n\n\n\n\n\n\nCount\n\n\nYear\nSex\n\n\n\n\n\n1910\nF\n5950\n\n\nM\n3213\n\n\n1911\nF\n6602\n\n\nM\n3381\n\n\n1912\nF\n9804\n\n\nM\n8142\n\n\n\n\n\n\n\nNotice that both \"Year\" and \"Sex\" serve as the index of the DataFrame (they are both rendered in bold). We’ve created a multi-index DataFrame where two different index values, the year and sex, are used to uniquely identify each row.\nThis isn’t the most intuitive way of representing this data – and, because multi-indexed DataFrames have multiple dimensions in their index, they can often be difficult to use.\nAnother strategy to aggregate across two columns is to create a pivot table. 
You saw these back in Data 8. One set of values is used to create the index of the pivot table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.\nHere’s an illustration of the process:\n\nThe best way to understand pivot tables is to see one in action. Let’s return to our original goal of summing the total number of names associated with each combination of year and sex. We’ll call the pandas .pivot_table method to create a new table.\n\n# The `pivot_table` method is used to generate a Pandas pivot table\nimport numpy as np\nbabynames.pivot_table(\n    index = \"Year\",\n    columns = \"Sex\",    \n    values = \"Count\", \n    aggfunc = \"sum\", \n).head(5)\n\n\n\n\n\n\n\nSex\nF\nM\n\n\nYear\n\n\n\n\n\n\n1910\n5950\n3213\n\n\n1911\n6602\n3381\n\n\n1912\n9804\n8142\n\n\n1913\n11860\n10234\n\n\n1914\n13815\n13111\n\n\n\n\n\n\n\nLooks a lot better! Now, our DataFrame is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of \"Year\" and \"Sex\".\nLet’s take a closer look at the code implemented above.\n\nindex = \"Year\" specifies the column name in the original DataFrame that should be used as the index of the pivot table\ncolumns = \"Sex\" specifies the column name in the original DataFrame that should be used to generate the columns of the pivot table\nvalues = \"Count\" indicates what values from the original DataFrame should be used to populate the entry for each index-column combination\naggfunc = \"sum\" tells pandas what function to use when aggregating the data specified by values. 
Here, we are summing the name counts for each pair of \"Year\" and \"Sex\"\n\nWe can even include multiple values in the index or columns of our pivot tables.\n\nbabynames_pivot = babynames.pivot_table(\n    index=\"Year\",     # the rows (turned into index)\n    columns=\"Sex\",    # the column values\n    values=[\"Count\", \"Name\"], \n    aggfunc=\"max\",      # group operation\n)\nbabynames_pivot.head(6)\n\n\n\n\n\n\n\n\nCount\nName\n\n\nSex\nF\nM\nF\nM\n\n\nYear\n\n\n\n\n\n\n\n\n1910\n295\n237\nYvonne\nWilliam\n\n\n1911\n390\n214\nZelma\nWillis\n\n\n1912\n534\n501\nYvonne\nWoodrow\n\n\n1913\n584\n614\nZelma\nYoshio\n\n\n1914\n773\n769\nZelma\nYoshio\n\n\n1915\n998\n1033\nZita\nYukio\n\n\n\n\n\n\n\nNote that each row provides the number of girls and number of boys having that year’s most common name, and also lists the alphabetically largest girl name and boy name. The counts for number of girls/boys in the resulting DataFrame do not correspond to the names listed. For example, in 1910, the most popular girl name is given to 295 girls, but that name was likely not Yvonne.",
     "crumbs": [
       "<span class='chapter-number'>4</span>  <span class='chapter-title'>Pandas III</span>"
     ]
@@ -254,7 +254,7 @@
     "href": "eda/eda.html#granularity-scope-and-temporality",
     "title": "5  Data Cleaning and EDA",
     "section": "5.2 Granularity, Scope, and Temporality",
-    "text": "5.2 Granularity, Scope, and Temporality\nAfter understanding the structure of the dataset, the next task is to determine what exactly the data represents. We’ll do so by considering the data’s granularity, scope, and temporality.\n\n5.2.1 Granularity\nThe granularity of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data’s granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.\n\n\n5.2.2 Scope\nThe scope of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.\n\n\n5.2.3 Temporality\nThe temporality of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.\nTime and date fields of a dataset could represent a few things:\n\nwhen the “event” happened\nwhen the data was collected, or when it was entered into the system\nwhen the data was copied into the database\n\nTo fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). 
We can represent Berkeley’s time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).\n\n5.2.3.1 Temporality with pandas’ dt accessors\nLet’s briefly look at how we can use pandas’ dt accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.\n\n\nCode\ncalls = pd.read_csv(\"data/Berkeley_PD_-_Calls_for_Service.csv\")\ncalls.head()\n\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n0\n21014296\nTHEFT MISD. (UNDER $950)\n04/01/2021 12:00:00 AM\n10:58\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n1\n21014391\nTHEFT MISD. (UNDER $950)\n04/01/2021 12:00:00 AM\n10:38\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n2\n21090494\nTHEFT MISD. (UNDER $950)\n04/19/2021 12:00:00 AM\n12:15\nLARCENY\n1\n06/15/2021 12:00:00 AM\n2100 BLOCK HASTE ST\\nBerkeley, CA\\n(37.864908,...\n2100 BLOCK HASTE ST\nBerkeley\nCA\n\n\n3\n21090204\nTHEFT FELONY (OVER $950)\n02/13/2021 12:00:00 AM\n17:00\nLARCENY\n6\n06/15/2021 12:00:00 AM\n2600 BLOCK WARRING ST\\nBerkeley, CA\\n(37.86393...\n2600 BLOCK WARRING ST\nBerkeley\nCA\n\n\n4\n21090179\nBURGLARY AUTO\n02/08/2021 12:00:00 AM\n6:20\nBURGLARY - VEHICLE\n1\n06/15/2021 12:00:00 AM\n2700 BLOCK GARBER ST\\nBerkeley, CA\\n(37.86066,...\n2700 BLOCK GARBER ST\nBerkeley\nCA\n\n\n\n\n\n\n\nLooks like there are three columns with dates/times: EVENTDT, EVENTTM, and InDbDate.\nMost likely, EVENTDT stands for the date when the event took place, EVENTTM stands for the time of day the event took place (in 24-hr format), and InDbDate is the date this call is recorded onto the database.\nIf we check the data type of these columns, we will see they are stored as strings. 
We can convert them to datetime objects using pandas to_datetime function.\n\ncalls[\"EVENTDT\"] = pd.to_datetime(calls[\"EVENTDT\"])\ncalls.head()\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73700/874729699.py:1: UserWarning:\n\nCould not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n\n\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n0\n21014296\nTHEFT MISD. (UNDER $950)\n2021-04-01\n10:58\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n1\n21014391\nTHEFT MISD. (UNDER $950)\n2021-04-01\n10:38\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n2\n21090494\nTHEFT MISD. (UNDER $950)\n2021-04-19\n12:15\nLARCENY\n1\n06/15/2021 12:00:00 AM\n2100 BLOCK HASTE ST\\nBerkeley, CA\\n(37.864908,...\n2100 BLOCK HASTE ST\nBerkeley\nCA\n\n\n3\n21090204\nTHEFT FELONY (OVER $950)\n2021-02-13\n17:00\nLARCENY\n6\n06/15/2021 12:00:00 AM\n2600 BLOCK WARRING ST\\nBerkeley, CA\\n(37.86393...\n2600 BLOCK WARRING ST\nBerkeley\nCA\n\n\n4\n21090179\nBURGLARY AUTO\n2021-02-08\n6:20\nBURGLARY - VEHICLE\n1\n06/15/2021 12:00:00 AM\n2700 BLOCK GARBER ST\\nBerkeley, CA\\n(37.86066,...\n2700 BLOCK GARBER ST\nBerkeley\nCA\n\n\n\n\n\n\n\nNow, we can use the dt accessor on this column.\nWe can get the month:\n\ncalls[\"EVENTDT\"].dt.month.head()\n\n0    4\n1    4\n2    4\n3    2\n4    2\nName: EVENTDT, dtype: int32\n\n\nWhich day of the week the date is on:\n\ncalls[\"EVENTDT\"].dt.dayofweek.head()\n\n0    3\n1    3\n2    0\n3    5\n4    0\nName: EVENTDT, dtype: int32\n\n\nCheck the mimimum values to see if there are any suspicious-looking, 70s 
dates:\n\ncalls.sort_values(\"EVENTDT\").head()\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n2513\n20057398\nBURGLARY COMMERCIAL\n2020-12-17\n16:05\nBURGLARY - COMMERCIAL\n4\n06/15/2021 12:00:00 AM\n600 BLOCK GILMAN ST\\nBerkeley, CA\\n(37.878405,...\n600 BLOCK GILMAN ST\nBerkeley\nCA\n\n\n624\n20057207\nASSAULT/BATTERY MISD.\n2020-12-17\n16:50\nASSAULT\n4\n06/15/2021 12:00:00 AM\n2100 BLOCK SHATTUCK AVE\\nBerkeley, CA\\n(37.871...\n2100 BLOCK SHATTUCK AVE\nBerkeley\nCA\n\n\n154\n20092214\nTHEFT FROM AUTO\n2020-12-17\n18:30\nLARCENY - FROM VEHICLE\n4\n06/15/2021 12:00:00 AM\n800 BLOCK SHATTUCK AVE\\nBerkeley, CA\\n(37.8918...\n800 BLOCK SHATTUCK AVE\nBerkeley\nCA\n\n\n659\n20057324\nTHEFT MISD. (UNDER $950)\n2020-12-17\n15:44\nLARCENY\n4\n06/15/2021 12:00:00 AM\n1800 BLOCK 4TH ST\\nBerkeley, CA\\n(37.869888, -...\n1800 BLOCK 4TH ST\nBerkeley\nCA\n\n\n993\n20057573\nBURGLARY RESIDENTIAL\n2020-12-17\n22:15\nBURGLARY - RESIDENTIAL\n4\n06/15/2021 12:00:00 AM\n1700 BLOCK STUART ST\\nBerkeley, CA\\n(37.857495...\n1700 BLOCK STUART ST\nBerkeley\nCA\n\n\n\n\n\n\n\nDoesn’t look like it! We are good!\nWe can also do many things with the dt accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on .dt accessor and time series/date functionality.",
+    "text": "5.2 Granularity, Scope, and Temporality\nAfter understanding the structure of the dataset, the next task is to determine what exactly the data represents. We’ll do so by considering the data’s granularity, scope, and temporality.\n\n5.2.1 Granularity\nThe granularity of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data’s granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.\n\n\n5.2.2 Scope\nThe scope of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.\n\n\n5.2.3 Temporality\nThe temporality of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.\nTime and date fields of a dataset could represent a few things:\n\nwhen the “event” happened\nwhen the data was collected, or when it was entered into the system\nwhen the data was copied into the database\n\nTo fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). 
We can represent Berkeley’s time zone, Pacific Time, as UTC-8 during standard time (PST) and UTC-7 during daylight saving time (PDT).\n\n5.2.3.1 Temporality with pandas’ dt accessors\nLet’s briefly look at how we can use pandas’ dt accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.\n\n\nCode\ncalls = pd.read_csv(\"data/Berkeley_PD_-_Calls_for_Service.csv\")\ncalls.head()\n\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n0\n21014296\nTHEFT MISD. (UNDER $950)\n04/01/2021 12:00:00 AM\n10:58\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n1\n21014391\nTHEFT MISD. (UNDER $950)\n04/01/2021 12:00:00 AM\n10:38\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n2\n21090494\nTHEFT MISD. (UNDER $950)\n04/19/2021 12:00:00 AM\n12:15\nLARCENY\n1\n06/15/2021 12:00:00 AM\n2100 BLOCK HASTE ST\\nBerkeley, CA\\n(37.864908,...\n2100 BLOCK HASTE ST\nBerkeley\nCA\n\n\n3\n21090204\nTHEFT FELONY (OVER $950)\n02/13/2021 12:00:00 AM\n17:00\nLARCENY\n6\n06/15/2021 12:00:00 AM\n2600 BLOCK WARRING ST\\nBerkeley, CA\\n(37.86393...\n2600 BLOCK WARRING ST\nBerkeley\nCA\n\n\n4\n21090179\nBURGLARY AUTO\n02/08/2021 12:00:00 AM\n6:20\nBURGLARY - VEHICLE\n1\n06/15/2021 12:00:00 AM\n2700 BLOCK GARBER ST\\nBerkeley, CA\\n(37.86066,...\n2700 BLOCK GARBER ST\nBerkeley\nCA\n\n\n\n\n\n\n\nLooks like there are three columns with dates/times: EVENTDT, EVENTTM, and InDbDate.\nMost likely, EVENTDT stands for the date when the event took place, EVENTTM stands for the time of day the event took place (in 24-hr format), and InDbDate is the date the call was recorded in the database.\nIf we check the data type of these columns, we will see they are stored as strings. 
We can convert them to datetime objects using the pandas to_datetime function.\n\ncalls[\"EVENTDT\"] = pd.to_datetime(calls[\"EVENTDT\"])\ncalls.head()\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83785/874729699.py:1: UserWarning:\n\nCould not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n\n\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n0\n21014296\nTHEFT MISD. (UNDER $950)\n2021-04-01\n10:58\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n1\n21014391\nTHEFT MISD. (UNDER $950)\n2021-04-01\n10:38\nLARCENY\n4\n06/15/2021 12:00:00 AM\nBerkeley, CA\\n(37.869058, -122.270455)\nNaN\nBerkeley\nCA\n\n\n2\n21090494\nTHEFT MISD. (UNDER $950)\n2021-04-19\n12:15\nLARCENY\n1\n06/15/2021 12:00:00 AM\n2100 BLOCK HASTE ST\\nBerkeley, CA\\n(37.864908,...\n2100 BLOCK HASTE ST\nBerkeley\nCA\n\n\n3\n21090204\nTHEFT FELONY (OVER $950)\n2021-02-13\n17:00\nLARCENY\n6\n06/15/2021 12:00:00 AM\n2600 BLOCK WARRING ST\\nBerkeley, CA\\n(37.86393...\n2600 BLOCK WARRING ST\nBerkeley\nCA\n\n\n4\n21090179\nBURGLARY AUTO\n2021-02-08\n6:20\nBURGLARY - VEHICLE\n1\n06/15/2021 12:00:00 AM\n2700 BLOCK GARBER ST\\nBerkeley, CA\\n(37.86066,...\n2700 BLOCK GARBER ST\nBerkeley\nCA\n\n\n\n\n\n\n\nNow, we can use the dt accessor on this column.\nWe can get the month:\n\ncalls[\"EVENTDT\"].dt.month.head()\n\n0    4\n1    4\n2    4\n3    2\n4    2\nName: EVENTDT, dtype: int32\n\n\nWhich day of the week the date is on:\n\ncalls[\"EVENTDT\"].dt.dayofweek.head()\n\n0    3\n1    3\n2    0\n3    5\n4    0\nName: EVENTDT, dtype: int32\n\n\nCheck the minimum values to see if there are any suspicious-looking, 70s 
dates:\n\ncalls.sort_values(\"EVENTDT\").head()\n\n\n\n\n\n\n\n\nCASENO\nOFFENSE\nEVENTDT\nEVENTTM\nCVLEGEND\nCVDOW\nInDbDate\nBlock_Location\nBLKADDR\nCity\nState\n\n\n\n\n2513\n20057398\nBURGLARY COMMERCIAL\n2020-12-17\n16:05\nBURGLARY - COMMERCIAL\n4\n06/15/2021 12:00:00 AM\n600 BLOCK GILMAN ST\\nBerkeley, CA\\n(37.878405,...\n600 BLOCK GILMAN ST\nBerkeley\nCA\n\n\n624\n20057207\nASSAULT/BATTERY MISD.\n2020-12-17\n16:50\nASSAULT\n4\n06/15/2021 12:00:00 AM\n2100 BLOCK SHATTUCK AVE\\nBerkeley, CA\\n(37.871...\n2100 BLOCK SHATTUCK AVE\nBerkeley\nCA\n\n\n154\n20092214\nTHEFT FROM AUTO\n2020-12-17\n18:30\nLARCENY - FROM VEHICLE\n4\n06/15/2021 12:00:00 AM\n800 BLOCK SHATTUCK AVE\\nBerkeley, CA\\n(37.8918...\n800 BLOCK SHATTUCK AVE\nBerkeley\nCA\n\n\n659\n20057324\nTHEFT MISD. (UNDER $950)\n2020-12-17\n15:44\nLARCENY\n4\n06/15/2021 12:00:00 AM\n1800 BLOCK 4TH ST\\nBerkeley, CA\\n(37.869888, -...\n1800 BLOCK 4TH ST\nBerkeley\nCA\n\n\n993\n20057573\nBURGLARY RESIDENTIAL\n2020-12-17\n22:15\nBURGLARY - RESIDENTIAL\n4\n06/15/2021 12:00:00 AM\n1700 BLOCK STUART ST\\nBerkeley, CA\\n(37.857495...\n1700 BLOCK STUART ST\nBerkeley\nCA\n\n\n\n\n\n\n\nDoesn’t look like it! We are good!\nWe can also do many things with the dt accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on .dt accessor and time series/date functionality.",
     "crumbs": [
       "<span class='chapter-number'>5</span>  <span class='chapter-title'>Data Cleaning and EDA</span>"
     ]
@@ -284,7 +284,7 @@
     "href": "eda/eda.html#eda-demo-2-mauna-loa-co2-data-a-lesson-in-data-faithfulness",
     "title": "5  Data Cleaning and EDA",
     "section": "5.5 EDA Demo 2: Mauna Loa CO2 Data – A Lesson in Data Faithfulness",
-    "text": "5.5 EDA Demo 2: Mauna Loa CO2 Data – A Lesson in Data Faithfulness\nMauna Loa Observatory has been monitoring CO2 concentrations since 1958.\n\nco2_file = \"data/co2_mm_mlo.txt\"\n\nLet’s do some EDA!!\n\n5.5.1 Reading this file into Pandas?\nLet’s instead check out this .txt file. Some questions to keep in mind: Do we trust this file extension? What structure is it?\nLines 71-78 (inclusive) are shown below:\nline number |                            file contents\n\n71          |   #            decimal     average   interpolated    trend    #days\n72          |   #             date                             (season corr)\n73          |   1958   3    1958.208      315.71      315.71      314.62     -1\n74          |   1958   4    1958.292      317.45      317.45      315.29     -1\n75          |   1958   5    1958.375      317.50      317.50      314.71     -1\n76          |   1958   6    1958.458      -99.99      317.10      314.85     -1\n77          |   1958   7    1958.542      315.86      315.86      314.98     -1\n78          |   1958   8    1958.625      314.93      314.93      315.94     -1\nNotice how:\n\nThe values are separated by white space, possibly tabs.\nThe data line up down the rows. 
For example, the month appears in 7th to 8th position of each line.\nThe 71st and 72nd lines in the file contain column headings split over two lines.\n\nWe can use read_csv to read the data into a pandas DataFrame, and we provide several arguments to specify that the separators are white space, there is no header (we will set our own column names), and to skip the first 72 rows of the file.\n\nco2 = pd.read_csv(\n    co2_file, header = None, skiprows = 72,\n    sep = r'\\s+'       #delimiter for continuous whitespace (stay tuned for regex next lecture))\n)\nco2.head()\n\n\n\n\n\n\n\n\n0\n1\n2\n3\n4\n5\n6\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nCongratulations! You’ve wrangled the data!\n\n…But our columns aren’t named. We need to do more EDA.\n\n\n5.5.2 Exploring Variable Feature Types\nThe NOAA webpage might have some useful tidbits (in this case it doesn’t).\nUsing this information, we’ll rerun pd.read_csv, but this time with some custom column names.\n\nco2 = pd.read_csv(\n    co2_file, header = None, skiprows = 72,\n    sep = '\\s+', #regex for continuous whitespace (next lecture)\n    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']\n)\nco2.head()\n\n&lt;&gt;:3: SyntaxWarning:\n\ninvalid escape sequence '\\s'\n\n&lt;&gt;:3: SyntaxWarning:\n\ninvalid escape sequence '\\s'\n\n/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73700/150137587.py:3: SyntaxWarning:\n\ninvalid escape sequence 
'\\s'\n\n\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\n\n\n5.5.3 Visualizing CO2\nScientific studies tend to have very clean data, right…? Let’s jump right in and make a time series plot of CO2 monthly averages.\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2);\n\n\n\n\n\n\n\n\n\nThe code above uses the seaborn plotting library (abbreviated sns). We will cover this in the Visualization lecture, but now you don’t need to worry about how it works!\nYikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some missing values. What happened here?\n\nco2.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\n\nco2.tail()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n733\n2019\n4\n2019.29\n413.32\n413.32\n410.49\n26\n\n\n734\n2019\n5\n2019.38\n414.66\n414.66\n411.20\n28\n\n\n735\n2019\n6\n2019.46\n413.92\n413.92\n411.58\n27\n\n\n736\n2019\n7\n2019.54\n411.77\n411.77\n411.43\n23\n\n\n737\n2019\n8\n2019.62\n409.95\n409.95\n411.84\n29\n\n\n\n\n\n\n\nSome data have unusual values like -1 and -99.99.\nLet’s check the description at the top of the file again.\n\n-1 signifies a missing value for the number of days Days the equipment was in operation that month.\n-99.99 denotes a missing monthly average Avg\n\nHow can we fix this? First, let’s explore other aspects of our data. 
Understanding our data will help us decide what to do with the missing values.\n\n\n\n5.5.4 Sanity Checks: Reasoning about the data\nFirst, we consider the shape of the data. How many rows should we have?\n\nIf chronological order, we should have one record per month.\nData from March 1958 to August 2019.\nWe should have $ 12 (2019-1957) - 2 - 4 = 738 $ records.\n\n\nco2.shape\n\n(738, 7)\n\n\nNice!! The number of rows (i.e. records) match our expectations.\nLet’s now check the quality of each feature.\n\n\n5.5.5 Understanding Missing Value 1: Days\nDays is a time field, so let’s analyze other time fields to see if there is an explanation for missing values of days of operation.\nLet’s start with months, Mo.\nAre we missing any records? The number of months should have 62 or 61 instances (March 1957-August 2019).\n\nco2[\"Mo\"].value_counts().sort_index()\n\nMo\n1     61\n2     61\n3     62\n4     62\n5     62\n6     62\n7     62\n8     62\n9     61\n10    61\n11    61\n12    61\nName: count, dtype: int64\n\n\nAs expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.\n\nNext let’s explore days Days itself, which is the number of days that the measurement equipment worked.\n\n\nCode\nsns.displot(co2['Days']);\nplt.title(\"Distribution of days feature\"); # suppresses unneeded plotting output\n\n\n\n\n\n\n\n\n\nIn terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. 
In addition, there are nearly 200 missing values–that’s about 27% of the data!\n\nFinally, let’s check the last time feature, year Yr.\nLet’s check to see if there is any connection between missing-ness and the year of the recording.\n\n\nCode\nsns.scatterplot(x=\"Yr\", y=\"Days\", data=co2);\nplt.title(\"Day field by Year\"); # the ; suppresses output\n\n\n\n\n\n\n\n\n\nObservations:\n\nAll of the missing data are in the early years of operation.\nIt appears there may have been problems with equipment in the mid to late 80s.\n\nPotential Next Steps:\n\nConfirm these explanations through documentation about the historical readings.\nMaybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.\n\n\n\n\n5.5.6 Understanding Missing Value 2: Avg\nNext, let’s return to the -99.99 values in Avg to analyze the overall quality of the CO2 measurements. We’ll plot a histogram of the average CO2 measurements\n\n\nCode\n# Histograms of average CO2 measurements\nsns.displot(co2['Avg']);\n\n\n\n\n\n\n\n\n\nThe non-missing values are in the 300-400 range (a regular range of CO2 levels).\nWe also see that there are only a few missing Avg values (&lt;1% of values). 
Let’s examine all of them:\n\nco2[co2[\"Avg\"] &lt; 0]\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n7\n1958\n10\n1958.79\n-99.99\n312.66\n315.61\n-1\n\n\n71\n1964\n2\n1964.12\n-99.99\n320.07\n319.61\n-1\n\n\n72\n1964\n3\n1964.21\n-99.99\n320.73\n319.55\n-1\n\n\n73\n1964\n4\n1964.29\n-99.99\n321.77\n319.48\n-1\n\n\n213\n1975\n12\n1975.96\n-99.99\n330.59\n331.60\n0\n\n\n313\n1984\n4\n1984.29\n-99.99\n346.84\n344.27\n2\n\n\n\n\n\n\n\nThere doesn’t seem to be a pattern to these values, other than that most records also were missing Days data.\n\n\n5.5.7 Drop, NaN, or Impute Missing Avg Data?\nHow should we address the invalid Avg data?\n\nDrop records\nSet to NaN\nImpute using some strategy\n\nRemember we want to fix the following plot:\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2)\nplt.title(\"CO2 Average By Month\");\n\n\n\n\n\n\n\n\n\nSince we are plotting Avg vs DecDate, we should just focus on dealing with missing values for Avg.\nLet’s consider a few options: 1. Drop those records 2. Replace -99.99 with NaN 3. Substitute it with a likely value for the average CO2?\nWhat do you think are the pros and cons of each possible action?\nLet’s examine each of these three options.\n\n# 1. Drop missing values\nco2_drop = co2[co2['Avg'] &gt; 0]\nco2_drop.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n5\n1958\n8\n1958.62\n314.93\n314.93\n315.94\n-1\n\n\n\n\n\n\n\n\n# 2. 
Replace NaN with -99.99\nco2_NA = co2.replace(-99.99, np.nan)\nco2_NA.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\nNaN\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nWe’ll also use a third version of the data.\nFirst, we note that the dataset already comes with a substitute value for the -99.99.\nFrom the file description:\n\nThe interpolated column includes average values from the preceding column (average) and interpolated values where data are missing. Interpolated values are computed in two steps…\n\nThe Int feature has values that exactly match those in Avg, except when Avg is -99.99, and then a reasonable estimate is used instead.\nSo, the third version of our data will use the Int feature instead of Avg.\n\n# 3. Use interpolated column which estimates missing Avg values\nco2_impute = co2.copy()\nco2_impute['Avg'] = co2['Int']\nco2_impute.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n317.10\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nWhat’s a reasonable estimate?\nTo answer this question, let’s zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).\n\n\nCode\n# results of plotting data in 1958\n\ndef line_and_points(data, ax, title):\n    # assumes single year, hence Mo\n    ax.plot('Mo', 'Avg', data=data)\n    ax.scatter('Mo', 'Avg', data=data)\n    ax.set_xlim(2, 13)\n    ax.set_title(title)\n    ax.set_xticks(np.arange(3, 13))\n\ndef data_year(data, year):\n    return data[data[\"Yr\"] == 1958]\n    \n# uses matplotlib 
subplots\n# you may see more next week; focus on output for now\nfig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)\n\nyear = 1958\nline_and_points(data_year(co2_drop, year), axes[0], title=\"1. Drop Missing\")\nline_and_points(data_year(co2_NA, year), axes[1], title=\"2. Missing Set to NaN\")\nline_and_points(data_year(co2_impute, year), axes[2], title=\"3. Missing Interpolated\")\n\nfig.suptitle(f\"Monthly Averages for {year}\")\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\nIn the big picture since there are only 7 Avg values missing (&lt;1% of 738 months), any of these approaches would work.\nHowever there is some appeal to option C, Imputing:\n\nShows seasonal trends for CO2\nWe are plotting all months in our data as a line plot\n\nLet’s replot our original figure with option 3:\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2_impute)\nplt.title(\"CO2 Average By Month, Imputed\");\n\n\n\n\n\n\n\n\n\nLooks pretty close to what we see on the NOAA website!\n\n\n5.5.8 Presenting the Data: A Discussion on Data Granularity\nFrom the description:\n\nMonthly measurements are averages of average day measurements.\nThe NOAA GML website has datasets for daily/hourly measurements too.\n\nThe data you present depends on your research question.\nHow do CO2 levels vary by season?\n\nYou might want to keep average monthly data.\n\nAre CO2 levels rising over the past 50+ years, consistent with global warming predictions?\n\nYou might be happier with a coarser granularity of average year data!\n\n\n\nCode\nco2_year = co2_impute.groupby('Yr').mean()\nsns.lineplot(x='Yr', y='Avg', data=co2_year)\nplt.title(\"CO2 Average By Year\");\n\n\n\n\n\n\n\n\n\nIndeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.",
+    "text": "5.5 EDA Demo 2: Mauna Loa CO2 Data – A Lesson in Data Faithfulness\nMauna Loa Observatory has been monitoring CO2 concentrations since 1958.\n\nco2_file = \"data/co2_mm_mlo.txt\"\n\nLet’s do some EDA!!\n\n5.5.1 Reading this file into Pandas?\nLet’s instead check out this .txt file. Some questions to keep in mind: Do we trust this file extension? What structure is it?\nLines 71-78 (inclusive) are shown below:\nline number |                            file contents\n\n71          |   #            decimal     average   interpolated    trend    #days\n72          |   #             date                             (season corr)\n73          |   1958   3    1958.208      315.71      315.71      314.62     -1\n74          |   1958   4    1958.292      317.45      317.45      315.29     -1\n75          |   1958   5    1958.375      317.50      317.50      314.71     -1\n76          |   1958   6    1958.458      -99.99      317.10      314.85     -1\n77          |   1958   7    1958.542      315.86      315.86      314.98     -1\n78          |   1958   8    1958.625      314.93      314.93      315.94     -1\nNotice how:\n\nThe values are separated by white space, possibly tabs.\nThe data line up down the rows. 
For example, the month appears in 7th to 8th position of each line.\nThe 71st and 72nd lines in the file contain column headings split over two lines.\n\nWe can use read_csv to read the data into a pandas DataFrame, and we provide several arguments to specify that the separators are white space, there is no header (we will set our own column names), and to skip the first 72 rows of the file.\n\nco2 = pd.read_csv(\n    co2_file, header = None, skiprows = 72,\n    sep = r'\\s+'       # delimiter for continuous whitespace (stay tuned for regex next lecture)\n)\nco2.head()\n\n\n\n\n\n\n\n\n0\n1\n2\n3\n4\n5\n6\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nCongratulations! You’ve wrangled the data!\n\n…But our columns aren’t named. We need to do more EDA.\n\n\n5.5.2 Exploring Variable Feature Types\nThe NOAA webpage might have some useful tidbits (in this case it doesn’t).\nUsing this information, we’ll rerun pd.read_csv, but this time with some custom column names.\n\nco2 = pd.read_csv(\n    co2_file, header = None, skiprows = 72,\n    sep = r'\\s+', # raw-string regex for continuous whitespace (next lecture)\n    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']\n)\nco2.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\n\n\n5.5.3 Visualizing CO2\nScientific studies tend to have very clean data, right…? Let’s jump right in and make a time series plot of CO2 monthly averages.\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2);\n\n\n\n\n\n\n\n\n\nThe code above uses the seaborn plotting library (abbreviated sns). We will cover this in the Visualization lecture, but for now you don’t need to worry about how it works!\nYikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some missing values. What happened here?\n\nco2.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\n\nco2.tail()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n733\n2019\n4\n2019.29\n413.32\n413.32\n410.49\n26\n\n\n734\n2019\n5\n2019.38\n414.66\n414.66\n411.20\n28\n\n\n735\n2019\n6\n2019.46\n413.92\n413.92\n411.58\n27\n\n\n736\n2019\n7\n2019.54\n411.77\n411.77\n411.43\n23\n\n\n737\n2019\n8\n2019.62\n409.95\n409.95\n411.84\n29\n\n\n\n\n\n\n\nSome data have unusual values like -1 and -99.99.\nLet’s check the description at the top of the file again.\n\n-1 signifies a missing value for the number of days Days the equipment was in operation that month.\n-99.99 denotes a missing monthly average Avg.\n\nHow can we fix this? First, let’s explore other aspects of our data. 
Understanding our data will help us decide what to do with the missing values.\n\n\n\n5.5.4 Sanity Checks: Reasoning about the data\nFirst, we consider the shape of the data. How many rows should we have?\n\nIf the data are in chronological order, we should have one record per month.\nData from March 1958 to August 2019.\nWe should have 12 × (2019 - 1957) - 2 - 4 = 738 records: 62 calendar years of 12 months, minus January and February 1958, minus September through December 2019.\n\n\nco2.shape\n\n(738, 7)\n\n\nNice!! The number of rows (i.e. records) matches our expectations.\nLet’s now check the quality of each feature.\n\n\n5.5.5 Understanding Missing Value 1: Days\nDays is a time field, so let’s analyze other time fields to see if there is an explanation for missing values of days of operation.\nLet’s start with months, Mo.\nAre we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).\n\nco2[\"Mo\"].value_counts().sort_index()\n\nMo\n1     61\n2     61\n3     62\n4     62\n5     62\n6     62\n7     62\n8     62\n9     61\n10    61\n11    61\n12    61\nName: count, dtype: int64\n\n\nAs expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest have 62.\n\nNext let’s explore days Days itself, which is the number of days that the measurement equipment worked.\n\n\nCode\nsns.displot(co2['Days']);\nplt.title(\"Distribution of days feature\"); # the ; suppresses unneeded plotting output\n\n\n\n\n\n\n\n\n\nIn terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. 
In addition, there are nearly 200 missing values, which is about 27% of the data!\n\nFinally, let’s check the last time feature, year Yr.\nLet’s check to see if there is any connection between missing-ness and the year of the recording.\n\n\nCode\nsns.scatterplot(x=\"Yr\", y=\"Days\", data=co2);\nplt.title(\"Day field by Year\"); # the ; suppresses output\n\n\n\n\n\n\n\n\n\nObservations:\n\nAll of the missing data are in the early years of operation.\nIt appears there may have been problems with equipment in the mid to late 80s.\n\nPotential Next Steps:\n\nConfirm these explanations through documentation about the historical readings.\nMaybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.\n\n\n\n\n5.5.6 Understanding Missing Value 2: Avg\nNext, let’s return to the -99.99 values in Avg to analyze the overall quality of the CO2 measurements. We’ll plot a histogram of the average CO2 measurements.\n\n\nCode\n# Histograms of average CO2 measurements\nsns.displot(co2['Avg']);\n\n\n\n\n\n\n\n\n\nThe non-missing values are in the 300-400 range (a regular range of CO2 levels).\nWe also see that there are only a few missing Avg values (&lt;1% of values). 
Let’s examine all of them:\n\nco2[co2[\"Avg\"] &lt; 0]\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n3\n1958\n6\n1958.46\n-99.99\n317.10\n314.85\n-1\n\n\n7\n1958\n10\n1958.79\n-99.99\n312.66\n315.61\n-1\n\n\n71\n1964\n2\n1964.12\n-99.99\n320.07\n319.61\n-1\n\n\n72\n1964\n3\n1964.21\n-99.99\n320.73\n319.55\n-1\n\n\n73\n1964\n4\n1964.29\n-99.99\n321.77\n319.48\n-1\n\n\n213\n1975\n12\n1975.96\n-99.99\n330.59\n331.60\n0\n\n\n313\n1984\n4\n1984.29\n-99.99\n346.84\n344.27\n2\n\n\n\n\n\n\n\nThere doesn’t seem to be a pattern to these values, other than that most records also were missing Days data.\n\n\n5.5.7 Drop, NaN, or Impute Missing Avg Data?\nHow should we address the invalid Avg data?\n\nDrop records\nSet to NaN\nImpute using some strategy\n\nRemember we want to fix the following plot:\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2)\nplt.title(\"CO2 Average By Month\");\n\n\n\n\n\n\n\n\n\nSince we are plotting Avg vs DecDate, we should just focus on dealing with missing values for Avg.\nLet’s consider a few options: 1. Drop those records 2. Replace -99.99 with NaN 3. Substitute it with a likely value for the average CO2?\nWhat do you think are the pros and cons of each possible action?\nLet’s examine each of these three options.\n\n# 1. Drop missing values\nco2_drop = co2[co2['Avg'] &gt; 0]\nco2_drop.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n5\n1958\n8\n1958.62\n314.93\n314.93\n315.94\n-1\n\n\n\n\n\n\n\n\n# 2. 
Replace -99.99 with NaN\nco2_NA = co2.replace(-99.99, np.nan)\nco2_NA.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\nNaN\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nWe’ll also use a third version of the data.\nFirst, we note that the dataset already comes with a substitute value for the -99.99.\nFrom the file description:\n\nThe interpolated column includes average values from the preceding column (average) and interpolated values where data are missing. Interpolated values are computed in two steps…\n\nThe Int feature has values that exactly match those in Avg, except when Avg is -99.99, in which case a reasonable estimate is used instead.\nSo, the third version of our data will use the Int feature instead of Avg.\n\n# 3. Use interpolated column which estimates missing Avg values\nco2_impute = co2.copy()\nco2_impute['Avg'] = co2['Int']\nco2_impute.head()\n\n\n\n\n\n\n\n\nYr\nMo\nDecDate\nAvg\nInt\nTrend\nDays\n\n\n\n\n0\n1958\n3\n1958.21\n315.71\n315.71\n314.62\n-1\n\n\n1\n1958\n4\n1958.29\n317.45\n317.45\n315.29\n-1\n\n\n2\n1958\n5\n1958.38\n317.50\n317.50\n314.71\n-1\n\n\n3\n1958\n6\n1958.46\n317.10\n317.10\n314.85\n-1\n\n\n4\n1958\n7\n1958.54\n315.86\n315.86\n314.98\n-1\n\n\n\n\n\n\n\nWhat’s a reasonable estimate?\nTo answer this question, let’s zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).\n\n\nCode\n# results of plotting data in 1958\n\ndef line_and_points(data, ax, title):\n    # assumes single year, hence Mo\n    ax.plot('Mo', 'Avg', data=data)\n    ax.scatter('Mo', 'Avg', data=data)\n    ax.set_xlim(2, 13)\n    ax.set_title(title)\n    ax.set_xticks(np.arange(3, 13))\n\ndef data_year(data, year):\n    # keep only the rows for the requested year\n    return data[data[\"Yr\"] == year]\n    \n# uses matplotlib 
subplots\n# you may see more next week; focus on output for now\nfig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)\n\nyear = 1958\nline_and_points(data_year(co2_drop, year), axes[0], title=\"1. Drop Missing\")\nline_and_points(data_year(co2_NA, year), axes[1], title=\"2. Missing Set to NaN\")\nline_and_points(data_year(co2_impute, year), axes[2], title=\"3. Missing Interpolated\")\n\nfig.suptitle(f\"Monthly Averages for {year}\")\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\nIn the big picture, since there are only 7 Avg values missing (&lt;1% of 738 months), any of these approaches would work.\nHowever, there is some appeal to option 3, imputing:\n\nShows seasonal trends for CO2\nWe are plotting all months in our data as a line plot\n\nLet’s replot our original figure with option 3:\n\n\nCode\nsns.lineplot(x='DecDate', y='Avg', data=co2_impute)\nplt.title(\"CO2 Average By Month, Imputed\");\n\n\n\n\n\n\n\n\n\nLooks pretty close to what we see on the NOAA website!\n\n\n5.5.8 Presenting the Data: A Discussion on Data Granularity\nFrom the description:\n\nMonthly measurements are averages of daily averages.\nThe NOAA GML website has datasets for daily/hourly measurements too.\n\nThe data you present depends on your research question.\nHow do CO2 levels vary by season?\n\nYou might want to keep average monthly data.\n\nHave CO2 levels risen over the past 50+ years, consistent with global warming predictions?\n\nYou might be happier with a coarser granularity of yearly averages!\n\n\n\nCode\nco2_year = co2_impute.groupby('Yr').mean()\nsns.lineplot(x='Yr', y='Avg', data=co2_year)\nplt.title(\"CO2 Average By Year\");\n\n\n\n\n\n\n\n\n\nIndeed, we see a rise of nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.",
     "crumbs": [
       "<span class='chapter-number'>5</span>  <span class='chapter-title'>Data Cleaning and EDA</span>"
     ]
@@ -544,7 +544,7 @@
     "href": "sampling/sampling.html#probability-samples",
     "title": "9  Sampling",
     "section": "9.3 Probability Samples",
-    "text": "9.3 Probability Samples\nWhen sampling, it is essential to focus on the quality of the sample rather than the quantity of the sample. A huge sample size does not fix a bad sampling method. Our main goal is to gather a sample that is representative of the population it came from. In this section, we’ll explore the different types of sampling and their pros and cons.\nA convenience sample is whatever you can get ahold of; this type of sampling is non-random. Note that haphazard sampling is not necessarily random sampling; there are many potential sources of bias.\nIn a probability sample, we provide the chance that any specified set of individuals will be in the sample (individuals in the population can have different chances of being selected; they don’t all have to be uniform), and we sample at random based off this known chance. For this reason, probability samples are also called random samples. The randomness provides a few benefits:\n\nBecause we know the source probabilities, we can measure the errors.\nSampling at random gives us a more representative sample of the population, which reduces bias. (Note: this is only the case when the probability distribution we’re sampling from is accurate. Random samples using “bad” or inaccurate distributions can produce biased estimates of population quantities.)\nProbability samples allow us to estimate the bias and chance error, which helps us quantify uncertainty (more in a future lecture).\n\nThe real world is usually more complicated, and we often don’t know the initial probabilities. For example, we do not generally know the probability that a given bacterium is in a microbiome sample or whether people will answer when Gallup calls landlines. 
That being said, still we try to model probability sampling to the best of our ability even when the sampling or measurement process is not fully under our control.\nA few common random sampling schemes:\n\nA uniform random sample with replacement is a sample drawn uniformly at random with replacement.\n\nRandom doesn’t always mean “uniformly at random,” but in this specific context, it does.\nSome individuals in the population might get picked more than once.\n\nA simple random sample (SRS) is a sample drawn uniformly at random without replacement.\n\nEvery individual (and subset of individuals) has the same chance of being selected from the sampling frame.\nEvery pair has the same chance as every other pair.\nEvery triple has the same chance as every other triple.\nAnd so on.\n\nA stratified random sample, where random sampling is performed on strata (specific groups), and the groups together compose a sample.\n\n\n9.3.1 Example Scheme 1: Probability Sample\nSuppose we have 3 TA’s (Arman, Boyu, Charlie): I decide to sample 2 of them as follows:\n\nI choose A with probability 1.0\nI choose either B or C, each with a probability of 0.5.\n\nWe can list all the possible outcomes and their respective probabilities in a table:\n\n\n\nOutcome\nProbability\n\n\n\n\n{A, B}\n0.5\n\n\n{A, C}\n0.5\n\n\n{B, C}\n0\n\n\n\nThis is a probability sample (though not a great one). Of the 3 people in my population, I know the chance of getting each subset. Suppose I’m measuring the average distance TAs live from campus.\n\nThis scheme does not see the entire population!\nMy estimate using the single sample I take has some chance error depending on if I see AB or AC.\nThis scheme is biased towards A’s response.\n\n\n\n9.3.2 Example Scheme 2: Simple Random Sample\nConsider the following sampling scheme:\n\nA class roster has 1100 students listed alphabetically.\nPick one of the first 10 students on the list at random (e.g. 
Student 8).\nTo create your sample, take that student and every 10th student listed after that (e.g. Students 8, 18, 28, 38, etc.).\n\n\n\nIs this a probability sample?\n\nYes. For a sample [n, n + 10, n + 20, …, n + 1090], where 1 &lt;= n &lt;= 10, the probability of that sample is 1/10. Otherwise, the probability is 0.\nOnly 10 possible samples!\n\n\n\nDoes each student have the same probability of being selected?\n\nYes. Each student is chosen with a probability of 1/10.\n\n\n\nIs this a simple random sample?\n\nNo. The chance of selecting (8, 18) is 1/10; the chance of selecting (8, 9) is 0.\n\n\n\n9.3.3 Demo: Barbie v. Oppenheimer\nWe are trying to collect a sample from Berkeley residents to predict the which one of Barbie and Oppenheimer would perform better on their opening day, July 21st.\nFirst, let’s grab a dataset that has every single resident in Berkeley (this is a fake dataset) and which movie they actually watched on July 21st.\nLet’s load in the movie.csv table. We can assume that:\n\nis_male is a boolean that indicates if a resident identifies as male.\nThere are only two movies they can watch on July 21st: Barbie and Oppenheimer.\nEvery resident watches a movie (either Barbie or Oppenheimer) on July 21st.\n\n\n\nCode\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\n\nsns.set_theme(style='darkgrid', font_scale = 1.5,\n              rc={'figure.figsize':(7,5)})\n\nrng = np.random.default_rng()\n\n\n\nmovie = pd.read_csv(\"data/movie.csv\")\n\n# create a 1/0 int that indicates Barbie vote\nmovie['barbie'] = (movie['movie'] == 'Barbie').astype(int)\nmovie.head()\n\n\n\n\n\n\n\n\nage\nis_male\nmovie\nbarbie\n\n\n\n\n0\n35\nFalse\nBarbie\n1\n\n\n1\n42\nTrue\nOppenheimer\n0\n\n\n2\n55\nFalse\nBarbie\n1\n\n\n3\n77\nTrue\nOppenheimer\n0\n\n\n4\n31\nFalse\nBarbie\n1\n\n\n\n\n\n\n\nWhat fraction of Berkeley residents chose Barbie?\n\nactual_barbie = 
np.mean(movie[\"barbie\"])\nactual_barbie\n\nnp.float64(0.5302792307692308)\n\n\nThis is the actual outcome of the competition. Based on this result, Barbie would win. How did our sample of retirees do?\n\n9.3.3.1 Convenience Sample: Retirees\nLet’s take a convenience sample of people who have retired (&gt;= 65 years old). What proportion of them went to see Barbie instead of Oppenheimer?\n\nconvenience_sample = movie[movie['age'] &gt;= 65] # take a convenience sample of retirees\nnp.mean(convenience_sample[\"barbie\"]) # what proportion of them saw Barbie? \n\nnp.float64(0.3744755089093924)\n\n\nBased on this result, we would have predicted that Oppenheimer would win! What happened? Is it possible that our sample is too small or noisy?\n\n# what's the size of our sample? \nlen(convenience_sample)\n\n359396\n\n\n\n# what proportion of our data is in the convenience sample? \nlen(convenience_sample)/len(movie)\n\n0.27645846153846154\n\n\nSeems like our sample is rather large (roughly 360,000 people), so the error is likely not due to solely to chance.\n\n\n9.3.3.2 Check for Bias\nLet us aggregate all choices by age and visualize the fraction of Barbie views, split by gender.\n\nvotes_by_barbie = movie.groupby([\"age\",\"is_male\"]).agg(\"mean\", numeric_only=True).reset_index()\nvotes_by_barbie.head()\n\n\n\n\n\n\n\n\nage\nis_male\nbarbie\n\n\n\n\n0\n18\nFalse\n0.819594\n\n\n1\n18\nTrue\n0.667001\n\n\n2\n19\nFalse\n0.812214\n\n\n3\n19\nTrue\n0.661252\n\n\n4\n20\nFalse\n0.805281\n\n\n\n\n\n\n\n\n\nCode\n# A common matplotlib/seaborn pattern: create the figure and axes object, pass ax\n# to seaborn for drawing into, and later fine-tune the figure via ax.\nfig, ax = plt.subplots();\n\nred_blue = [\"#bf1518\", \"#397eb7\"]\nwith sns.color_palette(red_blue):\n    sns.pointplot(data=votes_by_barbie, x = \"age\", y = \"barbie\", hue = \"is_male\", ax=ax)\n\nnew_ticks = [i.get_text() for i in ax.get_xticklabels()]\nax.set_xticks(range(0, len(new_ticks), 10), 
new_ticks[::10])\nax.set_title(\"Preferences by Demographics\");\n\n\n\n\n\n\n\n\n\n\nWe see that retirees (in Berkeley) tend to watch Oppenheimer.\nWe also see that residents who identify as non-male tend to prefer Barbie.\n\n\n\n9.3.3.3 Simple Random Sample\nSuppose we took a simple random sample (SRS) of the same size as our retiree sample:\n\nn = len(convenience_sample)\nrandom_sample = movie.sample(n, replace = False) ## By default, replace = False\nnp.mean(random_sample[\"barbie\"])\n\nnp.float64(0.52963861590001)\n\n\nThis is very close to the actual vote of 0.5302792307692308!\nIt turns out that we can get similar results with a much smaller sample size, say, 800:\n\nn = 800\nrandom_sample = movie.sample(n, replace = False)\n\n# Compute the sample average and the resulting relative error\nsample_barbie = np.mean(random_sample[\"barbie\"])\nerr = abs(sample_barbie-actual_barbie)/actual_barbie\n\n# We can print output with Markdown formatting too...\nfrom IPython.display import Markdown\nMarkdown(f\"**Actual** = {actual_barbie:.4f}, **Sample** = {sample_barbie:.4f}, \"\n         f\"**Err** = {100*err:.2f}%.\")\n\nActual = 0.5303, Sample = 0.5300, Err = 0.05%.\n\n\nWe’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.\n\n\n9.3.3.4 Quantifying Chance Error\nIn our SRS of size 800, what would be our chance error?\nLet’s simulate 1000 versions of taking the 800-sized SRS from before:\n\nnrep = 1000   # number of simulations\nn = 800       # size of our sample\npoll_result = []\nfor i in range(0, nrep):\n    random_sample = movie.sample(n, replace = False)\n    poll_result.append(np.mean(random_sample[\"barbie\"]))\n\n\n\nCode\nfig, ax = plt.subplots()\nsns.histplot(poll_result, stat='density', ax=ax)\nax.axvline(actual_barbie, color=\"orange\", lw=4);\n\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/seaborn/_oldcore.py:1119: FutureWarning:\n\nuse_inf_as_na option is deprecated and will 
be removed in a future version. Convert inf values to NaN before operating instead.\n\n\n\n\n\n\n\n\n\n\nWhat fraction of these simulated samples would have predicted Barbie?\n\npoll_result = pd.Series(poll_result)\nnp.sum(poll_result &gt; 0.5)/1000\n\nnp.float64(0.959)\n\n\nYou can see the curve looks roughly Gaussian/normal. Using KDE:\n\n\nCode\nsns.histplot(poll_result, stat='density', kde=True);\n\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/seaborn/_oldcore.py:1119: FutureWarning:\n\nuse_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.",
+    "text": "9.3 Probability Samples\nWhen sampling, it is essential to focus on the quality of the sample rather than the quantity of the sample. A huge sample size does not fix a bad sampling method. Our main goal is to gather a sample that is representative of the population it came from. In this section, we’ll explore the different types of sampling and their pros and cons.\nA convenience sample is whatever you can get ahold of; this type of sampling is non-random. Note that haphazard sampling is not necessarily random sampling; there are many potential sources of bias.\nIn a probability sample, we provide the chance that any specified set of individuals will be in the sample (individuals in the population can have different chances of being selected; they don’t all have to be uniform), and we sample at random based off this known chance. For this reason, probability samples are also called random samples. The randomness provides a few benefits:\n\nBecause we know the source probabilities, we can measure the errors.\nSampling at random gives us a more representative sample of the population, which reduces bias. (Note: this is only the case when the probability distribution we’re sampling from is accurate. Random samples using “bad” or inaccurate distributions can produce biased estimates of population quantities.)\nProbability samples allow us to estimate the bias and chance error, which helps us quantify uncertainty (more in a future lecture).\n\nThe real world is usually more complicated, and we often don’t know the initial probabilities. For example, we do not generally know the probability that a given bacterium is in a microbiome sample or whether people will answer when Gallup calls landlines. 
That being said, still we try to model probability sampling to the best of our ability even when the sampling or measurement process is not fully under our control.\nA few common random sampling schemes:\n\nA uniform random sample with replacement is a sample drawn uniformly at random with replacement.\n\nRandom doesn’t always mean “uniformly at random,” but in this specific context, it does.\nSome individuals in the population might get picked more than once.\n\nA simple random sample (SRS) is a sample drawn uniformly at random without replacement.\n\nEvery individual (and subset of individuals) has the same chance of being selected from the sampling frame.\nEvery pair has the same chance as every other pair.\nEvery triple has the same chance as every other triple.\nAnd so on.\n\nA stratified random sample, where random sampling is performed on strata (specific groups), and the groups together compose a sample.\n\n\n9.3.1 Example Scheme 1: Probability Sample\nSuppose we have 3 TA’s (Arman, Boyu, Charlie): I decide to sample 2 of them as follows:\n\nI choose A with probability 1.0\nI choose either B or C, each with a probability of 0.5.\n\nWe can list all the possible outcomes and their respective probabilities in a table:\n\n\n\nOutcome\nProbability\n\n\n\n\n{A, B}\n0.5\n\n\n{A, C}\n0.5\n\n\n{B, C}\n0\n\n\n\nThis is a probability sample (though not a great one). Of the 3 people in my population, I know the chance of getting each subset. Suppose I’m measuring the average distance TAs live from campus.\n\nThis scheme does not see the entire population!\nMy estimate using the single sample I take has some chance error depending on if I see AB or AC.\nThis scheme is biased towards A’s response.\n\n\n\n9.3.2 Example Scheme 2: Simple Random Sample\nConsider the following sampling scheme:\n\nA class roster has 1100 students listed alphabetically.\nPick one of the first 10 students on the list at random (e.g. 
Student 8).\nTo create your sample, take that student and every 10th student listed after that (e.g. Students 8, 18, 28, 38, etc.).\n\n\n\nIs this a probability sample?\n\nYes. For a sample [n, n + 10, n + 20, …, n + 1090], where 1 &lt;= n &lt;= 10, the probability of that sample is 1/10. Otherwise, the probability is 0.\nOnly 10 possible samples!\n\n\n\nDoes each student have the same probability of being selected?\n\nYes. Each student is chosen with a probability of 1/10.\n\n\n\nIs this a simple random sample?\n\nNo. The chance of selecting (8, 18) is 1/10; the chance of selecting (8, 9) is 0.\n\n\n\n9.3.3 Demo: Barbie v. Oppenheimer\nWe are trying to collect a sample from Berkeley residents to predict which of Barbie and Oppenheimer would perform better on their opening day, July 21st.\nFirst, let’s grab a dataset that has every single resident in Berkeley (this is a fake dataset) and which movie they actually watched on July 21st.\nLet’s load in the movie.csv table. We can assume that:\n\nis_male is a boolean that indicates if a resident identifies as male.\nThere are only two movies they can watch on July 21st: Barbie and Oppenheimer.\nEvery resident watches a movie (either Barbie or Oppenheimer) on July 21st.\n\n\n\nCode\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\n\nsns.set_theme(style='darkgrid', font_scale = 1.5,\n              rc={'figure.figsize':(7,5)})\n\nrng = np.random.default_rng()\n\n\n\nmovie = pd.read_csv(\"data/movie.csv\")\n\n# create a 1/0 int that indicates Barbie vote\nmovie['barbie'] = (movie['movie'] == 'Barbie').astype(int)\nmovie.head()\n\n\n\n\n\n\n\n\nage\nis_male\nmovie\nbarbie\n\n\n\n\n0\n35\nFalse\nBarbie\n1\n\n\n1\n42\nTrue\nOppenheimer\n0\n\n\n2\n55\nFalse\nBarbie\n1\n\n\n3\n77\nTrue\nOppenheimer\n0\n\n\n4\n31\nFalse\nBarbie\n1\n\n\n\n\n\n\n\nWhat fraction of Berkeley residents chose Barbie?\n\nactual_barbie = 
np.mean(movie[\"barbie\"])\nactual_barbie\n\nnp.float64(0.5302792307692308)\n\n\nThis is the actual outcome of the competition. Based on this result, Barbie would win. How would a sample of retirees do?\n\n9.3.3.1 Convenience Sample: Retirees\nLet’s take a convenience sample of people who have retired (&gt;= 65 years old). What proportion of them went to see Barbie instead of Oppenheimer?\n\nconvenience_sample = movie[movie['age'] &gt;= 65] # take a convenience sample of retirees\nnp.mean(convenience_sample[\"barbie\"]) # what proportion of them saw Barbie? \n\nnp.float64(0.3744755089093924)\n\n\nBased on this result, we would have predicted that Oppenheimer would win! What happened? Is it possible that our sample is too small or noisy?\n\n# what's the size of our sample? \nlen(convenience_sample)\n\n359396\n\n\n\n# what proportion of our data is in the convenience sample? \nlen(convenience_sample)/len(movie)\n\n0.27645846153846154\n\n\nOur sample is rather large (roughly 360,000 people), so the error is unlikely to be due solely to chance.\n\n\n9.3.3.2 Check for Bias\nLet us aggregate all choices by age and visualize the fraction of Barbie views, split by gender.\n\nvotes_by_barbie = movie.groupby([\"age\",\"is_male\"]).agg(\"mean\", numeric_only=True).reset_index()\nvotes_by_barbie.head()\n\n\n\n\n\n\n\n\nage\nis_male\nbarbie\n\n\n\n\n0\n18\nFalse\n0.819594\n\n\n1\n18\nTrue\n0.667001\n\n\n2\n19\nFalse\n0.812214\n\n\n3\n19\nTrue\n0.661252\n\n\n4\n20\nFalse\n0.805281\n\n\n\n\n\n\n\n\n\nCode\n# A common matplotlib/seaborn pattern: create the figure and axes object, pass ax\n# to seaborn for drawing into, and later fine-tune the figure via ax.\nfig, ax = plt.subplots();\n\nred_blue = [\"#bf1518\", \"#397eb7\"]\nwith sns.color_palette(red_blue):\n    sns.pointplot(data=votes_by_barbie, x = \"age\", y = \"barbie\", hue = \"is_male\", ax=ax)\n\nnew_ticks = [i.get_text() for i in ax.get_xticklabels()]\nax.set_xticks(range(0, len(new_ticks), 10), 
new_ticks[::10])\nax.set_title(\"Preferences by Demographics\");\n\n\n\n\n\n\n\n\n\n\nWe see that retirees (in Berkeley) tend to watch Oppenheimer.\nWe also see that residents who identify as non-male tend to prefer Barbie.\n\n\n\n9.3.3.3 Simple Random Sample\nSuppose we took a simple random sample (SRS) of the same size as our retiree sample:\n\nn = len(convenience_sample)\nrandom_sample = movie.sample(n, replace = False) ## By default, replace = False\nnp.mean(random_sample[\"barbie\"])\n\nnp.float64(0.5309408006766909)\n\n\nThis is very close to the actual vote of 0.5302792307692308!\nIt turns out that we can get similar results with a much smaller sample size, say, 800:\n\nn = 800\nrandom_sample = movie.sample(n, replace = False)\n\n# Compute the sample average and the resulting relative error\nsample_barbie = np.mean(random_sample[\"barbie\"])\nerr = abs(sample_barbie-actual_barbie)/actual_barbie\n\n# We can print output with Markdown formatting too...\nfrom IPython.display import Markdown\nMarkdown(f\"**Actual** = {actual_barbie:.4f}, **Sample** = {sample_barbie:.4f}, \"\n         f\"**Err** = {100*err:.2f}%.\")\n\nActual = 0.5303, Sample = 0.5188, Err = 2.17%.\n\n\nWe’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.\n\n\n9.3.3.4 Quantifying Chance Error\nIn our SRS of size 800, what would be our chance error?\nLet’s simulate 1000 versions of taking the 800-sized SRS from before:\n\nnrep = 1000   # number of simulations\nn = 800       # size of our sample\npoll_result = []\nfor i in range(0, nrep):\n    random_sample = movie.sample(n, replace = False)\n    poll_result.append(np.mean(random_sample[\"barbie\"]))\n\n\n\nCode\nfig, ax = plt.subplots()\nsns.histplot(poll_result, stat='density', ax=ax)\nax.axvline(actual_barbie, color=\"orange\", lw=4);\n\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/seaborn/_oldcore.py:1119: FutureWarning:\n\nuse_inf_as_na option is deprecated and will 
be removed in a future version. Convert inf values to NaN before operating instead.\n\n\n\n\n\n\n\n\n\n\nWhat fraction of these simulated samples would have predicted Barbie?\n\npoll_result = pd.Series(poll_result)\nnp.sum(poll_result &gt; 0.5)/1000\n\nnp.float64(0.946)\n\n\nYou can see the curve looks roughly Gaussian/normal. Using KDE:\n\n\nCode\nsns.histplot(poll_result, stat='density', kde=True);\n\n\n/Users/nikhilreddy/course-notes/ds100env/lib/python3.12/site-packages/seaborn/_oldcore.py:1119: FutureWarning:\n\nuse_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.",
     "crumbs": [
       "<span class='chapter-number'>9</span>  <span class='chapter-title'>Sampling</span>"
     ]
@@ -558,5 +558,85 @@
     "crumbs": [
       "<span class='chapter-number'>9</span>  <span class='chapter-title'>Sampling</span>"
     ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html",
+    "href": "intro_to_modeling/intro_to_modeling.html",
+    "title": "10  Introduction to Modeling",
+    "section": "",
+    "text": "10.1 What is a Model?\nA model is an idealized representation of a system. A system is a set of principles or procedures according to which something functions. We live in a world full of systems: the procedure of turning on a light happens according to a specific set of rules dictating the flow of electricity. The truth behind how any event occurs is usually complex, and many times the specifics are unknown. The workings of the world can be viewed as its own giant procedure. Models seek to simplify the world and distill it into workable pieces.\nExample: We model the fall of an object on Earth as subject to a constant acceleration of \\(9.81 m/s^2\\) due to gravity.",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#what-is-a-model",
+    "href": "intro_to_modeling/intro_to_modeling.html#what-is-a-model",
+    "title": "10  Introduction to Modeling",
+    "section": "",
+    "text": "While this describes the behavior of our system, it is merely an approximation.\nIt doesn’t account for the effects of air resistance, local variations in gravity, etc.\nIn practice, it’s accurate enough to be useful!\n\n\n10.1.1 Reasons for Building Models\nWhy do we want to build models? As far as data scientists and statisticians are concerned, there are three reasons, and each implies a different focus on modeling.\n\nTo explain complex phenomena occurring in the world we live in. Examples of this might be:\n\nHow is the parents’ average height related to their children’s average height?\nHow do an object’s velocity and acceleration affect how far it travels? (Physics: \\(d = d_0 + vt + \\frac{1}{2}at^2\\))\n\nIn these cases, we care about creating models that are simple and interpretable, allowing us to understand what the relationships between our variables are.\nTo make accurate predictions about unseen data. Some examples include:\n\nCan we predict if an email is spam or not?\nCan we generate a one-sentence summary of this 10-page article?\n\nWhen making predictions, we care more about accuracy, even at the cost of having an uninterpretable model. These are sometimes called black-box models and are common in fields like deep learning.\nTo measure the causal effects of one event on some other event. For example,\n\nDoes smoking cause lung cancer?\nDoes a job training program cause increases in employment and wages?\n\nThis is a much harder question because most statistical tools are designed to infer association, not causation. 
We will not focus on this task in Data 100, but you can take other advanced classes on causal inference (e.g., Stat 156, Data 102) if you are intrigued!\n\nMost of the time, we aim to strike a balance between building interpretable models and building accurate models.\n\n\n10.1.2 Common Types of Models\nIn general, models can be split into two categories:\n\nDeterministic physical (mechanistic) models: Laws that govern how the world works.\n\nKepler’s Third Law of Planetary Motion (1619): The ratio of the square of an object’s orbital period to the cube of the semi-major axis of its orbit is the same for all objects orbiting the same primary.\n\n\\(T^2 \\propto R^3\\)\n\nNewton’s Laws: motion and gravitation (1687): Newton’s second law of motion models the relationship between the mass of an object and the force required to accelerate it.\n\n\\(F = ma\\)\n\\(F_g = G \\frac{m_1 m_2}{r^2}\\) \n\n\nProbabilistic models: Models that attempt to understand how random processes evolve. These are more general and can be used to describe many phenomena in the real world. These models commonly make simplifying assumptions about the nature of the world.\n\nPoisson Process models: Used to model random events that can happen at any point in time and whose cumulative count never decreases, such as the arrival of customers at a store.\n\n\nNote: These specific models are not in the scope of Data 100 and exist to serve as motivation.",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#simple-linear-regression",
+    "href": "intro_to_modeling/intro_to_modeling.html#simple-linear-regression",
+    "title": "10  Introduction to Modeling",
+    "section": "10.2 Simple Linear Regression",
+    "text": "10.2 Simple Linear Regression\nThe regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines. As with any straight line, it can be defined by a slope and a y-intercept:\n\n\\(\\text{slope} = r \\cdot \\frac{\\text{Standard Deviation of } y}{\\text{Standard Deviation of }x}\\)\n\\(y\\text{-intercept} = \\text{average of }y - \\text{slope}\\cdot\\text{average of }x\\)\n\\(\\text{regression estimate} = y\\text{-intercept} + \\text{slope}\\cdot\\text{}x\\)\n\\(\\text{residual} =\\text{observed }y - \\text{regression estimate}\\)\n\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n# Set random seed for consistency \nnp.random.seed(43)\nplt.style.use('default') \n\n# Generate random noise for plotting\nx = np.linspace(-3, 3, 100)\ny = x * 0.5 - 1 + np.random.randn(100) * 0.3\n\n# Plot regression line\nsns.regplot(x=x,y=y);\n\n\n\n\n\n\n\n\n\n\n10.2.1 Notations and Definitions\nFor a pair of variables \\(x\\) and \\(y\\) representing our data \\(\\mathcal{D} = \\{(x_1, y_1), (x_2, y_2), \\dots, (x_n, y_n)\\}\\), we denote their means/averages as \\(\\bar x\\) and \\(\\bar y\\) and standard deviations as \\(\\sigma_x\\) and \\(\\sigma_y\\).\n\n10.2.1.1 Standard Units\nA variable is represented in standard units if the following are true:\n\n0 in standard units is equal to the mean (\\(\\bar{x}\\)) in the original variable’s units.\nAn increase of 1 standard unit is an increase of 1 standard deviation (\\(\\sigma_x\\)) in the original variable’s units.\n\nTo convert a variable \\(x_i\\) into standard units, we subtract its mean from it and divide it by its standard deviation. 
For example, \\(x_i\\) in standard units is \\(\\frac{x_i - \\bar x}{\\sigma_x}\\).\n\n\n10.2.1.2 Correlation\nThe correlation (\\(r\\)) is the average of the product of \\(x\\) and \\(y\\), both measured in standard units.\n\\[r = \\frac{1}{n} \\sum_{i=1}^n (\\frac{x_i - \\bar{x}}{\\sigma_x})(\\frac{y_i - \\bar{y}}{\\sigma_y})\\]\n\nCorrelation measures the strength of a linear association between two variables.\nCorrelations range between -1 and 1: \\(|r| \\leq 1\\), with \\(r=1\\) indicating perfect positive linear association, and \\(r=-1\\) indicating perfect negative association. The closer \\(r\\) is to \\(0\\), the weaker the linear association is.\nCorrelation says nothing about causation and non-linear association. Correlation does not imply causation. When \\(r = 0\\), the two variables are uncorrelated. However, they could still be related through some non-linear relationship.\n\n\n\nCode\ndef plot_and_get_corr(ax, x, y, title):\n    ax.set_xlim(-3, 3)\n    ax.set_ylim(-3, 3)\n    ax.set_xticks([])\n    ax.set_yticks([])\n    ax.scatter(x, y, alpha = 0.73)\n    r = np.corrcoef(x, y)[0, 1]\n    ax.set_title(title + \" (corr: {})\".format(r.round(2)))\n    return r\n\nfig, axs = plt.subplots(2, 2, figsize = (10, 10))\n\n# Just noise\nx1, y1 = np.random.randn(2, 100)\ncorr1 = plot_and_get_corr(axs[0, 0], x1, y1, title = \"noise\")\n\n# Strong linear\nx2 = np.linspace(-3, 3, 100)\ny2 = x2 * 0.5 - 1 + np.random.randn(100) * 0.3\ncorr2 = plot_and_get_corr(axs[0, 1], x2, y2, title = \"strong linear\")\n\n# Unequal spread\nx3 = np.linspace(-3, 3, 100)\ny3 = - x3/3 + np.random.randn(100)*(x3)/2.5\ncorr3 = plot_and_get_corr(axs[1, 0], x3, y3, title = \"strong linear\")\nextent = axs[1, 0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())\n\n# Strong non-linear\nx4 = np.linspace(-3, 3, 100)\ny4 = 2*np.sin(x3 - 1.5) + np.random.randn(100) * 0.3\ncorr4 = plot_and_get_corr(axs[1, 1], x4, y4, title = \"strong 
non-linear\")\n\nplt.show()\n\n\n\n\n\n\n\n\n\n\n\n\n10.2.2 Alternate Form\nWhen the variables \\(y\\) and \\(x\\) are measured in standard units, the regression line for predicting \\(y\\) based on \\(x\\) has slope \\(r\\) and passes through the origin.\n\\[\\hat{y}_{su} = r \\cdot x_{su}\\]\n\n\nIn the original units, this becomes\n\n\\[\\frac{\\hat{y} - \\bar{y}}{\\sigma_y} = r \\cdot \\frac{x - \\bar{x}}{\\sigma_x}\\]\n\n\n\n10.2.3 Derivation\nStarting from the top, we have our claimed form of the regression line, and we want to show that it is equivalent to the optimal linear regression line: \\(\\hat{y} = \\hat{a} + \\hat{b}x\\).\nRecall:\n\n\\(\\hat{b} = r \\cdot \\frac{\\text{Standard Deviation of }y}{\\text{Standard Deviation of }x}\\)\n\\(\\hat{a} = \\text{average of }y - \\text{slope}\\cdot\\text{average of }x\\)\n\n\n\n\n\n\n\nProof:\n\\[\\frac{\\hat{y} - \\bar{y}}{\\sigma_y} = r \\cdot \\frac{x - \\bar{x}}{\\sigma_x}\\]\nMultiply by \\(\\sigma_y\\), and add \\(\\bar{y}\\) on both sides.\n\\[\\hat{y} = \\sigma_y \\cdot r \\cdot \\frac{x - \\bar{x}}{\\sigma_x} + \\bar{y}\\]\nDistribute coefficient \\(\\sigma_{y}\\cdot r\\) to the \\(\\frac{x - \\bar{x}}{\\sigma_x}\\) term\n\\[\\hat{y} = (\\frac{r\\sigma_y}{\\sigma_x} ) \\cdot x + (\\bar{y} - (\\frac{r\\sigma_y}{\\sigma_x} ) \\bar{x})\\]\nWe now see that we have a line that matches our claim:\n\nslope: \\(r\\cdot\\frac{\\text{SD of y}}{\\text{SD of x}} = r\\cdot\\frac{\\sigma_y}{\\sigma_x}\\)\nintercept: \\(\\bar{y} - \\text{slope}\\cdot \\bar{x}\\)\n\nNote that the error for the i-th datapoint is: \\(e_i = y_i - \\hat{y_i}\\)",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#the-modeling-process",
+    "href": "intro_to_modeling/intro_to_modeling.html#the-modeling-process",
+    "title": "10  Introduction to Modeling",
+    "section": "10.3 The Modeling Process",
+    "text": "10.3 The Modeling Process\nAt a high level, a model is a way of representing a system. In Data 100, we’ll treat a model as some mathematical rule we use to describe the relationship between variables.\nWhat variables are we modeling? Typically, we use a subset of the variables in our sample of collected data to model another variable in this data. To put this more formally, say we have the following dataset \\(\\mathcal{D}\\):\n\\[\\mathcal{D} = \\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\\}\\]\nEach pair of values \\((x_i, y_i)\\) represents a datapoint. In a modeling setting, we call these observations. \\(y_i\\) is the dependent variable we are trying to model, also called an output or response. \\(x_i\\) is the independent variable inputted into the model to make predictions, also known as a feature.\nOur goal in modeling is to use the observed data \\(\\mathcal{D}\\) to predict the output variable \\(y_i\\). We denote each prediction as \\(\\hat{y}_i\\) (read: “y hat sub i”).\nHow do we generate these predictions? Some examples of models we’ll encounter in the next few lectures are given below:\n\\[\\hat{y}_i = \\theta\\] \\[\\hat{y}_i = \\theta_0 + \\theta_1 x_i\\]\nThe examples above are known as parametric models. They relate the collected data, \\(x_i\\), to the prediction we make, \\(\\hat{y}_i\\). A few parameters (\\(\\theta\\), \\(\\theta_0\\), \\(\\theta_1\\)) are used to describe the relationship between \\(x_i\\) and \\(\\hat{y}_i\\).\nNotice that we don’t immediately know the values of these parameters. While the features, \\(x_i\\), are taken from our observed data, we need to decide what values to give \\(\\theta\\), \\(\\theta_0\\), and \\(\\theta_1\\) ourselves. 
This is the heart of parametric modeling: what parameter values should we choose so our model makes the best possible predictions?\nTo choose our model parameters, we’ll work through the modeling process.\n\nChoose a model: how should we represent the world?\nChoose a loss function: how do we quantify prediction error?\nFit the model: how do we choose the best parameters of our model given our data?\nEvaluate model performance: how do we evaluate whether this process gave rise to a good model?",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#choosing-a-model",
+    "href": "intro_to_modeling/intro_to_modeling.html#choosing-a-model",
+    "title": "10  Introduction to Modeling",
+    "section": "10.4 Choosing a Model",
+    "text": "10.4 Choosing a Model\nOur first step is choosing a model: defining the mathematical rule that describes the relationship between the features, \\(x_i\\), and predictions \\(\\hat{y}_i\\).\nIn Data 8, you learned about the Simple Linear Regression (SLR) model. You learned that the model takes the form: \\[\\hat{y}_i = a + bx_i\\]\nIn Data 100, we’ll use slightly different notation: we will replace \\(a\\) with \\(\\theta_0\\) and \\(b\\) with \\(\\theta_1\\). This will allow us to use the same notation when we explore more complex models later on in the course.\n\\[\\hat{y}_i = \\theta_0 + \\theta_1 x_i\\]\nThe parameters of the SLR model are \\(\\theta_0\\), also called the intercept term, and \\(\\theta_1\\), also called the slope term. To create an effective model, we want to choose values for \\(\\theta_0\\) and \\(\\theta_1\\) that most accurately predict the output variable. The “best” fitting model parameters are given the special names: \\(\\hat{\\theta}_0\\) and \\(\\hat{\\theta}_1\\); they are the specific parameter values that allow our model to generate the best possible predictions.\nIn Data 8, you learned that the best SLR model parameters are: \\[\\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1\\bar{x} \\qquad \\qquad \\hat{\\theta}_1 = r \\frac{\\sigma_y}{\\sigma_x}\\]\nA quick reminder on notation:\n\n\\(\\bar{y}\\) and \\(\\bar{x}\\) indicate the mean value of \\(y\\) and \\(x\\), respectively\n\\(\\sigma_y\\) and \\(\\sigma_x\\) indicate the standard deviations of \\(y\\) and \\(x\\)\n\\(r\\) is the correlation coefficient, defined as the average of the product of \\(x\\) and \\(y\\) measured in standard units: \\(\\frac{1}{n} \\sum_{i=1}^n (\\frac{x_i-\\bar{x}}{\\sigma_x})(\\frac{y_i-\\bar{y}}{\\sigma_y})\\)\n\nIn Data 100, we want to understand how to derive these best model coefficients. To do so, we’ll introduce the concept of a loss function.",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#choosing-a-loss-function",
+    "href": "intro_to_modeling/intro_to_modeling.html#choosing-a-loss-function",
+    "title": "10  Introduction to Modeling",
+    "section": "10.5 Choosing a Loss Function",
+    "text": "10.5 Choosing a Loss Function\nWe’ve talked about the idea of creating the “best” possible predictions. This begs the question: how do we decide how “good” or “bad” our model’s predictions are?\nA loss function characterizes the cost, error, or fit resulting from a particular choice of model or model parameters. This function, \\(L(y, \\hat{y})\\), quantifies how “bad” or “far off” a single prediction by our model is from a true, observed value in our collected data.\nThe choice of loss function for a particular model will affect the accuracy and computational cost of estimation, and it’ll also depend on the estimation task at hand. For example,\n\nAre outputs quantitative or qualitative?\nDo outliers matter?\nAre all errors equally costly? (e.g., a false negative on a cancer test is arguably more dangerous than a false positive)\n\nRegardless of the specific function used, a loss function should follow two basic principles:\n\nIf the prediction \\(\\hat{y}_i\\) is close to the actual value \\(y_i\\), loss should be low.\nIf the prediction \\(\\hat{y}_i\\) is far from the actual value \\(y_i\\), loss should be high.\n\nTwo common choices of loss function are squared loss and absolute loss.\nSquared loss, also known as L2 loss, computes loss as the square of the difference between the observed \\(y_i\\) and predicted \\(\\hat{y}_i\\): \\[L(y_i, \\hat{y}_i) = (y_i - \\hat{y}_i)^2\\]\nAbsolute loss, also known as L1 loss, computes loss as the absolute difference between the observed \\(y_i\\) and predicted \\(\\hat{y}_i\\): \\[L(y_i, \\hat{y}_i) = |y_i - \\hat{y}_i|\\]\nL1 and L2 loss give us a tool for quantifying our model’s performance on a single data point. This is a good start, but ideally, we want to understand how our model performs across our entire dataset. A natural way to do this is to compute the average loss across all data points in the dataset. 
This is known as the cost function, \\(\\hat{R}(\\theta)\\): \\[\\hat{R}(\\theta) = \\frac{1}{n} \\sum^n_{i=1} L(y_i, \\hat{y}_i)\\]\nThe cost function has many names in the statistics literature. You may also encounter the terms:\n\nEmpirical risk (this is why we give the cost function the name \\(R\\))\nError function\nAverage loss\n\nWe can substitute our L1 and L2 loss into the cost function definition. The Mean Squared Error (MSE) is the average squared loss across a dataset: \\[\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)^2\\]\nThe Mean Absolute Error (MAE) is the average absolute loss across a dataset: \\[\\text{MAE}= \\frac{1}{n} \\sum_{i=1}^n |y_i - \\hat{y}_i|\\]",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#fitting-the-model",
+    "href": "intro_to_modeling/intro_to_modeling.html#fitting-the-model",
+    "title": "10  Introduction to Modeling",
+    "section": "10.6 Fitting the Model",
+    "text": "10.6 Fitting the Model\nNow that we’ve established the concept of a loss function, we can return to our original goal of choosing model parameters. Specifically, we want to choose the best set of model parameters that will minimize the model’s cost on our dataset. This process is called fitting the model.\nWe know from calculus that a function is minimized when (1) its first derivative is equal to zero and (2) its second derivative is positive. We often call the function being minimized the objective function (our objective is to find its minimum).\nTo find the optimal model parameter, we:\n\nTake the derivative of the cost function with respect to that parameter\nSet the derivative equal to 0\nSolve for the parameter\n\nWe repeat this process for each parameter present in the model. For now, we’ll disregard the second derivative condition.\nTo help us make sense of this process, let’s put it into action by deriving the optimal model parameters for simple linear regression using the mean squared error as our cost function. Remember: although the notation may look tricky, all we are doing is following the three steps above!\nStep 1: take the derivative of the cost function with respect to each model parameter. We substitute the SLR model, \\(\\hat{y}_i = \\theta_0+\\theta_1 x_i\\), into the definition of MSE above and differentiate with respect to \\(\\theta_0\\) and \\(\\theta_1\\). 
\\[\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\theta_0 - \\theta_1 x_i)^2\\]\n\\[\\frac{\\partial}{\\partial \\theta_0} \\text{MSE} = \\frac{-2}{n} \\sum_{i=1}^{n} (y_i - \\theta_0 - \\theta_1 x_i)\\]\n\\[\\frac{\\partial}{\\partial \\theta_1} \\text{MSE} = \\frac{-2}{n} \\sum_{i=1}^{n} (y_i - \\theta_0 - \\theta_1 x_i)x_i\\]\nLet’s walk through these derivations in more depth, starting with the derivative of MSE with respect to \\(\\theta_0\\).\nGiven our MSE above, we know that: \\[\\frac{\\partial}{\\partial \\theta_0} \\text{MSE} = \\frac{\\partial}{\\partial \\theta_0} \\frac{1}{n} \\sum_{i=1}^{n} {(y_i - \\theta_0 - \\theta_1 x_i)}^{2}\\]\nNoting that the derivative of a sum is equivalent to the sum of derivatives, this then becomes: \\[ = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{\\partial}{\\partial \\theta_0} {(y_i - \\theta_0 - \\theta_1 x_i)}^{2}\\]\nWe can then apply the chain rule.\n\\[ = \\frac{1}{n} \\sum_{i=1}^{n} 2 \\cdot{(y_i - \\theta_0 - \\theta_1 x_i)}\\cdot(-1)\\]\nFinally, we can simplify the constants, leaving us with our answer.\n\\[\\frac{\\partial}{\\partial \\theta_0} \\text{MSE} = \\frac{-2}{n} \\sum_{i=1}^{n}{(y_i - \\theta_0 - \\theta_1 x_i)}\\]\nFollowing the same procedure, we can take the derivative of MSE with respect to \\(\\theta_1\\).\n\\[\\frac{\\partial}{\\partial \\theta_1} \\text{MSE} = \\frac{\\partial}{\\partial \\theta_1} \\frac{1}{n} \\sum_{i=1}^{n} {(y_i - \\theta_0 - \\theta_1 x_i)}^{2}\\]\n\\[ = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{\\partial}{\\partial \\theta_1} {(y_i - \\theta_0 - \\theta_1 x_i)}^{2}\\]\n\\[ = \\frac{1}{n} \\sum_{i=1}^{n} 2 \\cdot{(y_i - \\theta_0 - \\theta_1 x_i)}\\cdot(-x_i)\\]\n\\[= \\frac{-2}{n} \\sum_{i=1}^{n} {(y_i - \\theta_0 - \\theta_1 x_i)}x_i\\]\nStep 2: set the derivatives equal to 0. After simplifying terms, this produces two estimating equations. 
The best set of model parameters \\((\\hat{\\theta}_0, \\hat{\\theta}_1)\\) must satisfy these two optimality conditions. \\[0 = \\frac{-2}{n} \\sum_{i=1}^{n} (y_i - \\hat{\\theta}_0 - \\hat{\\theta}_1 x_i) \\Longleftrightarrow \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\hat{y}_i) = 0\\] \\[0 = \\frac{-2}{n} \\sum_{i=1}^{n} (y_i - \\hat{\\theta}_0 - \\hat{\\theta}_1 x_i)x_i \\Longleftrightarrow \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\hat{y}_i)x_i = 0\\]\nStep 3: solve the estimating equations to compute estimates for \\(\\hat{\\theta}_0\\) and \\(\\hat{\\theta}_1\\).\nTaking the first equation gives the estimate of \\(\\hat{\\theta}_0\\): \\[\\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{\\theta}_0 - \\hat{\\theta}_1 x_i) = 0 \\]\n\\[\\left(\\frac{1}{n} \\sum_{i=1}^n y_i \\right) - \\hat{\\theta}_0 - \\hat{\\theta}_1\\left(\\frac{1}{n} \\sum_{i=1}^n x_i \\right) = 0\\]\n\\[ \\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1 \\bar{x}\\]\nWith a bit more maneuvering, the second equation gives the estimate of \\(\\hat{\\theta}_1\\). 
Start by multiplying the first estimating equation by \\(\\bar{x}\\), then subtract the result from the second estimating equation.\n\\[\\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)x_i - \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)\\bar{x} = 0 \\]\n\\[\\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)(x_i - \\bar{x}) = 0 \\]\nNext, plug in \\(\\hat{y}_i = \\hat{\\theta}_0 + \\hat{\\theta}_1 x_i = \\bar{y} + \\hat{\\theta}_1(x_i - \\bar{x})\\):\n\\[\\frac{1}{n} \\sum_{i=1}^n (y_i - \\bar{y} - \\hat{\\theta}_1(x_i - \\bar{x}))(x_i - \\bar{x}) = 0 \\]\n\\[\\frac{1}{n} \\sum_{i=1}^n (y_i - \\bar{y})(x_i - \\bar{x}) = \\hat{\\theta}_1 \\times \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2\n\\]\nBy using the definition of correlation \\(\\left(r = \\frac{1}{n} \\sum_{i=1}^n (\\frac{x_i-\\bar{x}}{\\sigma_x})(\\frac{y_i-\\bar{y}}{\\sigma_y}) \\right)\\) and standard deviation \\(\\left(\\sigma_x = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2} \\right)\\), we can conclude: \\[r \\sigma_x \\sigma_y = \\hat{\\theta}_1 \\times \\sigma_x^2\\] \\[\\hat{\\theta}_1 = r \\frac{\\sigma_y}{\\sigma_x}\\]\nThis is exactly the formula given in Data 8!\nRemember, this derivation found the optimal model parameters for SLR when using the MSE cost function. If we had used a different model or different loss function, we likely would have found different values for the best model parameters. However, regardless of the model and loss used, we can always follow these three steps to fit the model.",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
+  },
+  {
+    "objectID": "intro_to_modeling/intro_to_modeling.html#evaluating-the-slr-model",
+    "href": "intro_to_modeling/intro_to_modeling.html#evaluating-the-slr-model",
+    "title": "10  Introduction to Modeling",
+    "section": "10.7 Evaluating the SLR Model",
+    "text": "10.7 Evaluating the SLR Model\nNow that we’ve explored the mathematics behind (1) choosing a model, (2) choosing a loss function, and (3) fitting the model, we’re left with one final question – how “good” are the predictions made by this “best” fitted model? To determine this, we can:\n\nVisualize data and compute statistics:\n\nPlot the original data.\nCompute each column’s mean and standard deviation. If the mean and standard deviation of our predictions are close to those of the original observed \\(y_i\\)’s, we might be inclined to say that our model has done well.\n(If we’re fitting a linear model) Compute the correlation \\(r\\). A large magnitude for the correlation coefficient between the feature and response variables could also indicate that our model has done well.\n\nPerformance metrics:\n\nWe can take the Root Mean Squared Error (RMSE).\n\nIt’s the square root of the mean squared error (MSE), which is the average loss that we’ve been minimizing to determine optimal model parameters.\nRMSE is in the same units as \\(y\\).\nA lower RMSE indicates more “accurate” predictions, as we have a lower “average loss” across the data.\n\n\n\\[\\text{RMSE} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)^2}\\]\nVisualization:\n\nLook at the residual plot of \\(e_i = y_i - \\hat{y_i}\\) to visualize the difference between actual and predicted values. 
A good residual plot should show no pattern between the input feature \\(x_i\\) and the residual values \\(e_i\\).\n\n\nTo illustrate this process, let’s take a look at Anscombe’s quartet.\n\n10.7.1 Four Mysterious Datasets (Anscombe’s quartet)\nLet’s take a look at four different datasets.\n\n\nCode\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n%matplotlib inline\nimport seaborn as sns\nimport itertools\nfrom mpl_toolkits.mplot3d import Axes3D\n\n\n\n\nCode\n# Big font helper\ndef adjust_fontsize(size=None):\n    SMALL_SIZE = 8\n    MEDIUM_SIZE = 10\n    BIGGER_SIZE = 12\n    if size is not None:\n        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size\n\n    plt.rc(\"font\", size=SMALL_SIZE)  # controls default text sizes\n    plt.rc(\"axes\", titlesize=SMALL_SIZE)  # fontsize of the axes title\n    plt.rc(\"axes\", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels\n    plt.rc(\"xtick\", labelsize=SMALL_SIZE)  # fontsize of the tick labels\n    plt.rc(\"ytick\", labelsize=SMALL_SIZE)  # fontsize of the tick labels\n    plt.rc(\"legend\", fontsize=SMALL_SIZE)  # legend fontsize\n    plt.rc(\"figure\", titlesize=BIGGER_SIZE)  # fontsize of the figure title\n\n\n# Helper functions\ndef standard_units(x):\n    return (x - np.mean(x)) / np.std(x)\n\n\ndef correlation(x, y):\n    return np.mean(standard_units(x) * standard_units(y))\n\n\ndef slope(x, y):\n    return correlation(x, y) * np.std(y) / np.std(x)\n\n\ndef intercept(x, y):\n    return np.mean(y) - slope(x, y) * np.mean(x)\n\n\ndef fit_least_squares(x, y):\n    theta_0 = intercept(x, y)\n    theta_1 = slope(x, y)\n    return theta_0, theta_1\n\n\ndef predict(x, theta_0, theta_1):\n    return theta_0 + theta_1 * x\n\n\ndef compute_mse(y, yhat):\n    return np.mean((y - yhat) ** 2)\n\n\nplt.style.use(\"default\")  # Revert style to default mpl\n\n\n\n\nCode\nplt.style.use(\"default\")  # Revert style to default mpl\nNO_VIZ, RESID, RESID_SCATTER = range(3)\n\n\ndef 
least_squares_evaluation(x, y, visualize=NO_VIZ):\n    # statistics\n    print(f\"x_mean : {np.mean(x):.2f}, y_mean : {np.mean(y):.2f}\")\n    print(f\"x_stdev: {np.std(x):.2f}, y_stdev: {np.std(y):.2f}\")\n    print(f\"r = Correlation(x, y): {correlation(x, y):.3f}\")\n\n    # Performance metrics\n    ahat, bhat = fit_least_squares(x, y)\n    yhat = predict(x, ahat, bhat)\n    print(f\"theta_0: {ahat:.2f}, theta_1: {bhat:.2f}\")\n    print(f\"RMSE: {np.sqrt(compute_mse(y, yhat)):.3f}\")\n\n    # visualization\n    fig, ax_resid = None, None\n    if visualize == RESID_SCATTER:\n        fig, axs = plt.subplots(1, 2, figsize=(8, 3))\n        axs[0].scatter(x, y)\n        axs[0].plot(x, yhat)\n        axs[0].set_title(\"LS fit\")\n        ax_resid = axs[1]\n    elif visualize == RESID:\n        fig = plt.figure(figsize=(4, 3))\n        ax_resid = plt.gca()\n\n    if ax_resid is not None:\n        ax_resid.scatter(x, y - yhat, color=\"red\")\n        ax_resid.plot([4, 14], [0, 0], color=\"black\")\n        ax_resid.set_title(\"Residuals\")\n\n    return fig\n\n\n\n\nCode\n# Load in four different datasets: I, II, III, IV\nx = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]\ny1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]\ny2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]\ny3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]\nx4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]\ny4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]\n\nanscombe = {\n    \"I\": pd.DataFrame(list(zip(x, y1)), columns=[\"x\", \"y\"]),\n    \"II\": pd.DataFrame(list(zip(x, y2)), columns=[\"x\", \"y\"]),\n    \"III\": pd.DataFrame(list(zip(x, y3)), columns=[\"x\", \"y\"]),\n    \"IV\": pd.DataFrame(list(zip(x4, y4)), columns=[\"x\", \"y\"]),\n}\n\n# Plot the scatter plot and line of best fit\nfig, axs = plt.subplots(2, 2, figsize=(10, 10))\n\nfor i, dataset in enumerate([\"I\", \"II\", \"III\", \"IV\"]):\n    ans = 
anscombe[dataset]\n    x, y = ans[\"x\"], ans[\"y\"]\n    ahat, bhat = fit_least_squares(x, y)\n    yhat = predict(x, ahat, bhat)\n    axs[i // 2, i % 2].scatter(x, y, alpha=0.6, color=\"red\")  # plot the x, y points\n    axs[i // 2, i % 2].plot(x, yhat)  # plot the line of best fit\n    axs[i // 2, i % 2].set_xlabel(f\"$x_{i+1}$\")\n    axs[i // 2, i % 2].set_ylabel(f\"$y_{i+1}$\")\n    axs[i // 2, i % 2].set_title(f\"Dataset {dataset}\")\n\nplt.show()\n\n\n\n\n\n\n\n\n\nWhile these four sets of datapoints look very different, they all have nearly identical means \\(\\bar x\\), \\(\\bar y\\), standard deviations \\(\\sigma_x\\), \\(\\sigma_y\\), correlation \\(r\\), and RMSE! If we looked only at these statistics, we would probably be inclined to say that these datasets are similar.\n\n\nCode\nfor dataset in [\"I\", \"II\", \"III\", \"IV\"]:\n    print(f\"&gt;&gt;&gt; Dataset {dataset}:\")\n    ans = anscombe[dataset]\n    fig = least_squares_evaluation(ans[\"x\"], ans[\"y\"], visualize=NO_VIZ)\n    print()\n    print()\n\n\n&gt;&gt;&gt; Dataset I:\nx_mean : 9.00, y_mean : 7.50\nx_stdev: 3.16, y_stdev: 1.94\nr = Correlation(x, y): 0.816\ntheta_0: 3.00, theta_1: 0.50\nRMSE: 1.119\n\n\n&gt;&gt;&gt; Dataset II:\nx_mean : 9.00, y_mean : 7.50\nx_stdev: 3.16, y_stdev: 1.94\nr = Correlation(x, y): 0.816\ntheta_0: 3.00, theta_1: 0.50\nRMSE: 1.119\n\n\n&gt;&gt;&gt; Dataset III:\nx_mean : 9.00, y_mean : 7.50\nx_stdev: 3.16, y_stdev: 1.94\nr = Correlation(x, y): 0.816\ntheta_0: 3.00, theta_1: 0.50\nRMSE: 1.118\n\n\n&gt;&gt;&gt; Dataset IV:\nx_mean : 9.00, y_mean : 7.50\nx_stdev: 3.16, y_stdev: 1.94\nr = Correlation(x, y): 0.817\ntheta_0: 3.00, theta_1: 0.50\nRMSE: 1.118\n\n\n\n\nWe may also wish to visualize the model’s residuals, defined as the difference between the observed and predicted \\(y_i\\) value (\\(e_i = y_i - \\hat{y}_i\\)). This gives a high-level view of how “off” each prediction is from the true observed value. 
Recall that you explored this concept in Data 8: a good regression fit should display no clear pattern in its plot of residuals. The residual plots for Anscombe’s quartet are displayed below. Note how only the first plot shows no clear pattern to the magnitude of residuals. This is an indication that SLR is not the best choice of model for the remaining three sets of points.\n\n\n\nCode\n# Residual visualization\nfig, axs = plt.subplots(2, 2, figsize=(10, 10))\n\nfor i, dataset in enumerate([\"I\", \"II\", \"III\", \"IV\"]):\n    ans = anscombe[dataset]\n    x, y = ans[\"x\"], ans[\"y\"]\n    ahat, bhat = fit_least_squares(x, y)\n    yhat = predict(x, ahat, bhat)\n    axs[i // 2, i % 2].scatter(\n        x, y - yhat, alpha=0.6, color=\"red\"\n    )  # plot the x, y points\n    axs[i // 2, i % 2].plot(\n        x, np.zeros_like(x), color=\"black\"\n    )  # plot the residual line\n    axs[i // 2, i % 2].set_xlabel(f\"$x_{i+1}$\")\n    axs[i // 2, i % 2].set_ylabel(f\"$e_{i+1}$\")\n    axs[i // 2, i % 2].set_title(f\"Dataset {dataset} Residuals\")\n\nplt.show()",
+    "crumbs": [
+      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Introduction to Modeling</span>"
+    ]
   }
 ]
\ No newline at end of file
diff --git a/docs/visualization_1/visualization_1.html b/docs/visualization_1/visualization_1.html
index 85d50407..bb2dbee6 100644
--- a/docs/visualization_1/visualization_1.html
+++ b/docs/visualization_1/visualization_1.html
@@ -237,6 +237,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -407,7 +413,7 @@ <h2 data-number="7.4" class="anchored" data-anchor-id="variable-types-should-inf
 <h2 data-number="7.5" class="anchored" data-anchor-id="qualitative-variables-bar-plots"><span class="header-section-number">7.5</span> Qualitative Variables: Bar Plots</h2>
 <p>A <strong>bar plot</strong> is one of the most common ways of displaying the <strong>distribution</strong> of a <strong>qualitative</strong> (categorical) variable. The length of a bar plot encodes the frequency of a category; the width encodes no useful information. The color <em>could</em> indicate a sub-category, but this is not necessarily the case.</p>
 <p>Let’s contextualize this in an example. We will use the World Bank dataset (<code>wb</code>) in our analysis.</p>
-<div id="c233d09a" class="cell" data-execution_count="1">
+<div id="adfd6c4b" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
@@ -581,7 +587,7 @@ <h2 data-number="7.5" class="anchored" data-anchor-id="qualitative-variables-bar
 <p>We can visualize the distribution of the <code>Continent</code> column using a bar plot. There are a few ways to do this.</p>
 <section id="plotting-in-pandas" class="level3" data-number="7.5.1">
 <h3 data-number="7.5.1" class="anchored" data-anchor-id="plotting-in-pandas"><span class="header-section-number">7.5.1</span> Plotting in Pandas</h3>
-<div id="5ff87d07" class="cell" data-execution_count="2">
+<div id="473966bb" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>wb[<span class="st">'Continent'</span>].value_counts().plot(kind<span class="op">=</span><span class="st">'bar'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -596,7 +602,7 @@ <h3 data-number="7.5.1" class="anchored" data-anchor-id="plotting-in-pandas"><sp
 </section>
 <section id="plotting-in-matplotlib" class="level3" data-number="7.5.2">
 <h3 data-number="7.5.2" class="anchored" data-anchor-id="plotting-in-matplotlib"><span class="header-section-number">7.5.2</span> Plotting in Matplotlib</h3>
-<div id="38908a50" class="cell" data-execution_count="3">
+<div id="d5991304" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="co"># matplotlib is typically given the alias plt</span></span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>continent <span class="op">=</span> wb[<span class="st">'Continent'</span>].value_counts()</span>
@@ -616,7 +622,7 @@ <h3 data-number="7.5.2" class="anchored" data-anchor-id="plotting-in-matplotlib"
 </section>
 <section id="plotting-in-seaborn" class="level3" data-number="7.5.3">
 <h3 data-number="7.5.3" class="anchored" data-anchor-id="plotting-in-seaborn"><span class="header-section-number">7.5.3</span> Plotting in <code>Seaborn</code></h3>
-<div id="d97e6a2c" class="cell" data-execution_count="4">
+<div id="b199d3f6" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns <span class="co"># seaborn is typically given the alias sns</span></span>
 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>sns.countplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">'Continent'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
@@ -639,7 +645,7 @@ <h3 data-number="7.5.3" class="anchored" data-anchor-id="plotting-in-seaborn"><s
 <section id="distributions-of-quantitative-variables" class="level2" data-number="7.6">
 <h2 data-number="7.6" class="anchored" data-anchor-id="distributions-of-quantitative-variables"><span class="header-section-number">7.6</span> Distributions of Quantitative Variables</h2>
 <p>Revisiting our example with the <code>wb</code> DataFrame, let’s plot the distribution of <code>Gross national income per capita</code>.</p>
-<div id="ac75936b" class="cell" data-execution_count="5">
+<div id="144e970a" class="cell" data-execution_count="5">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>wb.head(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -805,7 +811,7 @@ <h2 data-number="7.6" class="anchored" data-anchor-id="distributions-of-quantita
 </div>
 <p>How should we define our categories for this variable? In the previous example, these were a few unique values of the <code>Continent</code> column. If we use similar logic here, our categories are the different numerical values contained in the <code>Gross national income per capita</code> column.</p>
 <p>Under this assumption, let’s plot this distribution using the <code>seaborn.countplot</code> function.</p>
-<div id="c3ea410c" class="cell" data-execution_count="6">
+<div id="c2c648fd" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>sns.countplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">'Gross national income per capita, Atlas method: $: 2016'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -827,7 +833,7 @@ <h2 data-number="7.6" class="anchored" data-anchor-id="distributions-of-quantita
 <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-plots"><span class="header-section-number">7.6.1</span> Box Plots and Violin Plots</h3>
 <p>Box plots and violin plots are two very similar kinds of visualizations. Both display the distribution of a variable using information about <strong>quartiles</strong>.</p>
 <p>In a box plot, the width of the box at any point does not encode meaning. In a violin plot, the width of the plot indicates the density of the distribution at each possible value.</p>
-<div id="57f29f2a" class="cell" data-execution_count="7">
+<div id="769122bb" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>sns.boxplot(data<span class="op">=</span>wb, y<span class="op">=</span><span class="st">'Gross national income per capita, Atlas method: $: 2016'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -837,7 +843,7 @@ <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-pl
 </div>
 </div>
 </div>
-<div id="904f06ff" class="cell" data-execution_count="8">
+<div id="72deb76a" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>sns.violinplot(data<span class="op">=</span>wb, y<span class="op">=</span><span class="st">"Gross national income per capita, Atlas method: $: 2016"</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -854,7 +860,7 @@ <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-pl
 <li>The third quartile (Q3) represents the 75th percentile – 75% of the data is smaller than or equal to the third quartile.</li>
 </ul>
 <p>This means that the middle 50% of the data lies between the first and third quartiles. This is demonstrated in the histogram below. The three quartiles are marked with red vertical bars.</p>
-<div id="b3688b04" class="cell" data-execution_count="9">
+<div id="f746be35" class="cell" data-execution_count="9">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>gdp <span class="op">=</span> wb[<span class="st">'Gross domestic product: </span><span class="sc">% g</span><span class="st">rowth : 2016'</span>]</span>
@@ -893,7 +899,7 @@ <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-pl
 </div>
 </div>
 <p>In a box plot, the lower extent of the box lies at Q1, while the upper extent of the box lies at Q3. The horizontal line in the middle of the box corresponds to Q2 (equivalently, the median).</p>
-<div id="fed5fc46" class="cell" data-execution_count="10">
+<div id="0dc7ce83" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>sns.boxplot(data<span class="op">=</span>wb, y<span class="op">=</span><span class="st">'Gross domestic product: </span><span class="sc">% g</span><span class="st">rowth : 2016'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -909,7 +915,7 @@ <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-pl
 <img src="images/box_plot_diagram.png" width="600">
 </center>
 <p>A violin plot displays quartile information, albeit a bit more subtly through smoothed density curves. Look closely at the center vertical bar of the violin plot below; the three quartiles and “whiskers” are still present!</p>
-<div id="c81ca58e" class="cell" data-execution_count="11">
+<div id="db762b85" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>sns.violinplot(data<span class="op">=</span>wb, y<span class="op">=</span><span class="st">'Gross domestic product: </span><span class="sc">% g</span><span class="st">rowth : 2016'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -924,7 +930,7 @@ <h3 data-number="7.6.1" class="anchored" data-anchor-id="box-plots-and-violin-pl
 <h3 data-number="7.6.2" class="anchored" data-anchor-id="side-by-side-box-and-violin-plots"><span class="header-section-number">7.6.2</span> Side-by-Side Box and Violin Plots</h3>
 <p>Plotting side-by-side box or violin plots allows us to compare distributions across different categories. In other words, they enable us to plot both a qualitative variable and a quantitative continuous variable in one visualization.</p>
 <p>With <code>seaborn</code>, we can easily create side-by-side plots by specifying both an x and y column.</p>
-<div id="e51fd611" class="cell" data-execution_count="12">
+<div id="e6840324" class="cell" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>sns.boxplot(data<span class="op">=</span>wb, x<span class="op">=</span><span class="st">"Continent"</span>, y<span class="op">=</span><span class="st">'Gross domestic product: </span><span class="sc">% g</span><span class="st">rowth : 2016'</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
 <div>
@@ -941,7 +947,7 @@ <h3 data-number="7.6.3" class="anchored" data-anchor-id="histograms"><span class
 <section id="plotting-histograms" class="level4" data-number="7.6.3.1">
 <h4 data-number="7.6.3.1" class="anchored" data-anchor-id="plotting-histograms"><span class="header-section-number">7.6.3.1</span> Plotting Histograms</h4>
 <p>Below, we plot a histogram using matplotlib and seaborn. Which graph do you prefer?</p>
-<div id="d6db87c7" class="cell" data-execution_count="13">
+<div id="62366f9f" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co"># The `edgecolor` argument controls the color of the bin edges</span></span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>gni <span class="op">=</span> wb[<span class="st">"Gross national income per capita, Atlas method: $: 2016"</span>]</span>
 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>plt.hist(gni, density<span class="op">=</span><span class="va">True</span>, edgecolor<span class="op">=</span><span class="st">"white"</span>)</span>
@@ -958,7 +964,7 @@ <h4 data-number="7.6.3.1" class="anchored" data-anchor-id="plotting-histograms">
 </div>
 </div>
 </div>
-<div id="3c4a1861" class="cell" data-execution_count="14">
+<div id="6875d7ad" class="cell" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>sns.histplot(data<span class="op">=</span>wb, x<span class="op">=</span><span class="st">"Gross national income per capita, Atlas method: $: 2016"</span>, stat<span class="op">=</span><span class="st">"density"</span>)</span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a>plt.title(<span class="st">"Distribution of gross national income per capita"</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display">
@@ -975,14 +981,14 @@ <h4 data-number="7.6.3.2" class="anchored" data-anchor-id="overlaid-histograms">
 <p>We can overlay histograms (or density curves) to compare distributions across qualitative categories.</p>
 <p>The <code>hue</code> parameter of <code>sns.histplot</code> specifies the column that should be used to determine the color of each category. <code>hue</code> can be used in many <code>seaborn</code> plotting functions.</p>
 <p>Notice that the resulting plot includes a legend describing which color corresponds to each hemisphere – a legend should always be included if color is used to encode information in a visualization!</p>
-<div id="14e5e850" class="cell" data-execution_count="15">
+<div id="64e1eb1f" class="cell" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a new variable to store the hemisphere in which each country is located</span></span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>north <span class="op">=</span> [<span class="st">"Asia"</span>, <span class="st">"Europe"</span>, <span class="st">"N. America"</span>]</span>
 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a>south <span class="op">=</span> [<span class="st">"Africa"</span>, <span class="st">"Oceania"</span>, <span class="st">"S. America"</span>]</span>
 <span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a>wb.loc[wb[<span class="st">"Continent"</span>].isin(north), <span class="st">"Hemisphere"</span>] <span class="op">=</span> <span class="st">"Northern"</span></span>
 <span id="cb16-5"><a href="#cb16-5" aria-hidden="true" tabindex="-1"></a>wb.loc[wb[<span class="st">"Continent"</span>].isin(south), <span class="st">"Hemisphere"</span>] <span class="op">=</span> <span class="st">"Southern"</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
-<div id="e6f3cd1b" class="cell" data-execution_count="16">
+<div id="91f6e1dd" class="cell" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a>sns.histplot(data<span class="op">=</span>wb, x<span class="op">=</span><span class="st">"Gross national income per capita, Atlas method: $: 2016"</span>, hue<span class="op">=</span><span class="st">"Hemisphere"</span>, stat<span class="op">=</span><span class="st">"density"</span>)</span>
 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a>plt.title(<span class="st">"Distribution of gross national income per capita"</span>)<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stderr">
@@ -1012,7 +1018,7 @@ <h4 data-number="7.6.3.2" class="anchored" data-anchor-id="overlaid-histograms">
 </div>
 </div>
 <p>Again, each bin of a histogram is scaled such that its <strong>area</strong> is proportional to the <strong>percentage</strong> of all datapoints that it contains.</p>
-<div id="72212db7" class="cell" data-execution_count="17">
+<div id="7b3a12ec" class="cell" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>densities, bins, _ <span class="op">=</span> plt.hist(gni, density<span class="op">=</span><span class="va">True</span>, edgecolor<span class="op">=</span><span class="st">"white"</span>, bins<span class="op">=</span><span class="dv">5</span>)</span>
 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>plt.xlabel(<span class="st">"Gross national income per capita"</span>)</span>
 <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>plt.ylabel(<span class="st">"Density"</span>)</span>
@@ -1053,7 +1059,7 @@ <h4 data-number="7.6.3.3" class="anchored" data-anchor-id="evaluating-histograms
 <section id="skewness-and-tails" class="level5" data-number="7.6.3.3.1">
 <h5 data-number="7.6.3.3.1" class="anchored" data-anchor-id="skewness-and-tails"><span class="header-section-number">7.6.3.3.1</span> Skewness and Tails</h5>
 <p>The skew of a histogram describes the direction in which its “tail” extends. - A distribution with a long right tail is <strong>skewed right</strong> (such as <code>Gross national income per capita</code>). In a right-skewed distribution, the few large outliers “pull” the mean to the <strong>right</strong> of the median.</p>
-<div id="16f31693" class="cell" data-execution_count="18">
+<div id="ed081cce" class="cell" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a>sns.histplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">'Gross national income per capita, Atlas method: $: 2016'</span>, stat <span class="op">=</span> <span class="st">'density'</span>)<span class="op">;</span></span>
 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a>plt.title(<span class="st">'Distribution with a long right tail'</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="18">
@@ -1071,7 +1077,7 @@ <h5 data-number="7.6.3.3.1" class="anchored" data-anchor-id="skewness-and-tails"
 <li>A distribution with a long left tail is <strong>skewed left</strong> (such as <code>Access to an improved water source</code>). In a left-skewed distribution, the few small outliers “pull” the mean to the <strong>left</strong> of the median.</li>
 </ul>
 <p>In the case where a distribution has equal-sized right and left tails, it is <strong>symmetric</strong>. The mean is approximately <strong>equal</strong> to the median. Think of mean as the balancing point of the distribution.</p>
-<div id="a5364154" class="cell" data-execution_count="19">
+<div id="a7935a93" class="cell" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a>sns.histplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">'Access to an improved water source: </span><span class="sc">% o</span><span class="st">f population: 2015'</span>, stat <span class="op">=</span> <span class="st">'density'</span>)<span class="op">;</span></span>
 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>plt.title(<span class="st">'Distribution with a long left tail'</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-display" data-execution_count="19">
@@ -1094,7 +1100,7 @@ <h5 data-number="7.6.3.3.2" class="anchored" data-anchor-id="outliers"><span cla
 <h5 data-number="7.6.3.3.3" class="anchored" data-anchor-id="modes"><span class="header-section-number">7.6.3.3.3</span> Modes</h5>
 <p>In Data 100, we describe a “mode” of a histogram as a peak in the distribution. Often, however, it is difficult to determine what counts as its own “peak.” For example, the number of peaks in the distribution of HIV rates across different countries varies depending on the number of histogram bins we plot.</p>
 <p>If we set the number of bins to 5, the distribution appears unimodal.</p>
-<div id="b052b704" class="cell" data-execution_count="20">
+<div id="6a408121" class="cell" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb25"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Rename the very long column name for convenience</span></span>
 <span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a>wb <span class="op">=</span> wb.rename(columns<span class="op">=</span>{<span class="st">'Antiretroviral therapy coverage: </span><span class="sc">% o</span><span class="st">f people living with HIV: 2015'</span>:<span class="st">"HIV rate"</span>})</span>
 <span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a><span class="co"># With 5 bins, it seems that there is only one peak</span></span>
@@ -1108,7 +1114,7 @@ <h5 data-number="7.6.3.3.3" class="anchored" data-anchor-id="modes"><span class=
 </div>
 </div>
 </div>
-<div id="d00c03b7" class="cell" data-execution_count="21">
+<div id="39622539" class="cell" data-execution_count="21">
 <div class="sourceCode cell-code" id="cb26"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="co"># With 10 bins, there seem to be two peaks</span></span>
 <span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb26-3"><a href="#cb26-3" aria-hidden="true" tabindex="-1"></a>sns.histplot(data<span class="op">=</span>wb, x<span class="op">=</span><span class="st">"HIV rate"</span>, stat<span class="op">=</span><span class="st">"density"</span>, bins<span class="op">=</span><span class="dv">10</span>)</span>
@@ -1121,7 +1127,7 @@ <h5 data-number="7.6.3.3.3" class="anchored" data-anchor-id="modes"><span class=
 </div>
 </div>
 </div>
-<div id="fe26257b" class="cell" data-execution_count="22">
+<div id="417a4999" class="cell" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb27"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co"># And with 20 bins, it becomes hard to say what counts as a "peak"!</span></span>
 <span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb27-3"><a href="#cb27-3" aria-hidden="true" tabindex="-1"></a>sns.histplot(data<span class="op">=</span>wb, x <span class="op">=</span><span class="st">"HIV rate"</span>, stat<span class="op">=</span><span class="st">"density"</span>, bins<span class="op">=</span><span class="dv">20</span>)</span>
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-10-output-2.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-10-output-2.pdf
index 1f4bba42..1d6cee27 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-10-output-2.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-10-output-2.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-11-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-11-output-1.pdf
index eb5ef5ea..ed6cb1f4 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-11-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-11-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-12-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-12-output-1.pdf
index 5b897bf7..f1d4c753 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-12-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-12-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-13-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-13-output-1.pdf
index c40b3e12..e014471d 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-13-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-14-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-14-output-1.pdf
index 0ada410e..ed511991 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-14-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-15-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-15-output-1.pdf
index e0e077b2..8dc1ceaf 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-15-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-17-output-2.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-17-output-2.pdf
index 5cb19dc5..91d6fbd0 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-17-output-2.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-17-output-2.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-18-output-2.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-18-output-2.pdf
index f1928524..fb80ef3b 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-18-output-2.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-18-output-2.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-19-output-2.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-19-output-2.pdf
index c2185545..4414292a 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-19-output-2.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-19-output-2.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-20-output-2.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-20-output-2.pdf
index 27a20b33..f0e90c32 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-20-output-2.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-20-output-2.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-21-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-21-output-1.pdf
index 72a98f1d..ebdafae0 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-21-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-21-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-22-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-22-output-1.pdf
index 6afdb917..742012f7 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-22-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-22-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-23-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-23-output-1.pdf
index dabb2bf0..bf41cce7 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-23-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-23-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-3-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-3-output-1.pdf
index 794b83fe..2d73ebd1 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-3-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-3-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-4-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-4-output-1.pdf
index cac0f7d0..790b87a1 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-4-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-5-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-5-output-1.pdf
index a440c441..4bc515d7 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-5-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-7-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-7-output-1.pdf
index 47456772..aebefaf0 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-7-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-7-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-8-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-8-output-1.pdf
index 2314147b..7b33e247 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-8-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/visualization_1/visualization_1_files/figure-pdf/cell-9-output-1.pdf b/docs/visualization_1/visualization_1_files/figure-pdf/cell-9-output-1.pdf
index 906ccaa6..fefc78ca 100644
Binary files a/docs/visualization_1/visualization_1_files/figure-pdf/cell-9-output-1.pdf and b/docs/visualization_1/visualization_1_files/figure-pdf/cell-9-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2.html b/docs/visualization_2/visualization_2.html
index b453efc1..57e2e870 100644
--- a/docs/visualization_2/visualization_2.html
+++ b/docs/visualization_2/visualization_2.html
@@ -237,6 +237,12 @@
   <a href="../sampling/sampling.html" class="sidebar-item-text sidebar-link">
  <span class="menu-text"><span class="chapter-number">9</span>&nbsp; <span class="chapter-title">Sampling</span></span></a>
   </div>
+</li>
+        <li class="sidebar-item">
+  <div class="sidebar-item-container"> 
+  <a href="../intro_to_modeling/intro_to_modeling.html" class="sidebar-item-text sidebar-link">
+ <span class="menu-text"><span class="chapter-number">10</span>&nbsp; <span class="chapter-title">Introduction to Modeling</span></span></a>
+  </div>
 </li>
     </ul>
     </div>
@@ -355,7 +361,7 @@ <h3 data-number="8.1.1" class="anchored" data-anchor-id="kde-theory"><span class
 <p>A <strong>kernel density estimate (KDE)</strong> is a smooth, continuous function that approximates a curve. It allows us to represent general trends in a distribution without focusing on the details, which is useful for analyzing the broad structure of a dataset.</p>
 <p>More formally, a KDE attempts to approximate the underlying <strong>probability distribution</strong> from which our dataset was drawn. You may have encountered the idea of a probability distribution in your other classes; if not, we’ll discuss it at length in the next lecture. For now, you can think of a probability distribution as a description of how likely it is for us to sample a particular value in our dataset.</p>
 <p>A KDE curve estimates the probability density function of a random variable. Consider the example below, where we have used <code>sns.displot</code> to plot both a histogram (containing the data points we actually collected) and a KDE curve (representing the <em>approximated</em> probability distribution from which this data was drawn) using data from the World Bank dataset (<code>wb</code>).</p>
-<div id="21e0562e" class="cell" data-execution_count="1">
+<div id="597a3af3" class="cell" data-execution_count="1">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
@@ -530,7 +536,7 @@ <h3 data-number="8.1.1" class="anchored" data-anchor-id="kde-theory"><span class
 </div>
 </div>
 </div>
-<div id="7ebee8dc" class="cell" data-execution_count="2">
+<div id="586cd01d" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -559,7 +565,7 @@ <h3 data-number="8.1.2" class="anchored" data-anchor-id="constructing-a-kde"><sp
 </ol>
 <p>We’ll explain what a “kernel” is momentarily.</p>
 <p>To make things simpler, let’s construct a KDE for a small, artificially generated dataset of 5 datapoints: <span class="math inline">\([2.2, 2.8, 3.7, 5.3, 5.7]\)</span>. In the plot below, each vertical bar represents one data point.</p>
-<div id="d3ed998d" class="cell" data-execution_count="3">
+<div id="293a2687" class="cell" data-execution_count="3">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>data <span class="op">=</span> [<span class="fl">2.2</span>, <span class="fl">2.8</span>, <span class="fl">3.7</span>, <span class="fl">5.3</span>, <span class="fl">5.7</span>]</span>
@@ -580,7 +586,7 @@ <h3 data-number="8.1.2" class="anchored" data-anchor-id="constructing-a-kde"><sp
 </div>
 </div>
 <p>Our goal is to create the following KDE curve, which was generated automatically by <code>sns.kdeplot</code>.</p>
-<div id="0b8f0136" class="cell" data-execution_count="4">
+<div id="d32f673a" class="cell" data-execution_count="4">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>plt.xlabel(<span class="st">"Data"</span>)</span>
@@ -599,7 +605,7 @@ <h3 data-number="8.1.2" class="anchored" data-anchor-id="constructing-a-kde"><sp
 </div>
 </div>
 <p>Alternatively, we can use <code>sns.histplot</code>. You can also get a very similar result in a single call by requesting the KDE be added to the histogram, with <code>kde=True</code> and some extra keywords:</p>
-<div id="2dfcaf6e" class="cell" data-execution_count="5">
+<div id="0f04fd4c" class="cell" data-execution_count="5">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>plt.xlabel(<span class="st">"Data"</span>)</span>
@@ -622,7 +628,7 @@ <h4 data-number="8.1.2.1" class="anchored" data-anchor-id="step-1-place-a-kernel
 <p>A <strong>kernel</strong> is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just <em>one</em> of the datapoints in our dataset: <span class="math inline">\(2.2\)</span>. We obtained this datapoint by randomly sampling some information out in the real world (you can imagine <span class="math inline">\(2.2\)</span> as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than <span class="math inline">\(2.2\)</span>; it could also be lower than <span class="math inline">\(2.2\)</span>. We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. This means that our <em>kernel</em> – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. The area under any kernel should integrate to 1, representing the total probability of drawing a new datapoint.</p>
 <p>A <strong>bandwidth value</strong>, usually denoted by <span class="math inline">\(\alpha\)</span>, represents the width of the kernel. A large value of <span class="math inline">\(\alpha\)</span> will result in a wide, short kernel function, while a small value will result in a narrow, tall kernel.</p>
 <p>Below, we place a <strong>Gaussian kernel</strong>, plotted in orange, over the datapoint <span class="math inline">\(2.2\)</span>. A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.</p>
-<div id="d8a1bfb9" class="cell" data-execution_count="6">
+<div id="160a45f5" class="cell" data-execution_count="6">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> gaussian_kernel(x, z, a):</span>
@@ -650,7 +656,7 @@ <h4 data-number="8.1.2.1" class="anchored" data-anchor-id="step-1-place-a-kernel
 </div>
 </div>
 <p>To begin creating our KDE, we place a kernel on <em>each</em> datapoint in our dataset. For our dataset of 5 points, we will have 5 kernels.</p>
-<div id="e5bdeef5" class="cell" data-execution_count="7">
+<div id="f4bff648" class="cell" data-execution_count="7">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># You will work with the functions below in Lab 4</span></span>
@@ -702,7 +708,7 @@ <h4 data-number="8.1.2.1" class="anchored" data-anchor-id="step-1-place-a-kernel
 <h4 data-number="8.1.2.2" class="anchored" data-anchor-id="step-2-normalize-kernels-to-have-a-total-area-of-1"><span class="header-section-number">8.1.2.2</span> Step 2: Normalize Kernels to Have a Total Area of 1</h4>
 <p>Above, we said that <em>each</em> kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a <em>total</em> area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) <span class="math inline">\(\times\)</span> (area of 1 each) = 5. To avoid this, we will <strong>normalize</strong> each of our kernels. This involves multiplying each kernel by <span class="math inline">\(\frac{1}{\#\:\text{datapoints}}\)</span>.</p>
 <p>In the cell below, we multiply each of our 5 kernels by <span class="math inline">\(\frac{1}{5}\)</span> to apply normalization.</p>
-<div id="e1a1ada6" class="cell" data-execution_count="8">
+<div id="c1ee5c7b" class="cell" data-execution_count="8">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>plt.xlim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">10</span>)</span>
@@ -725,7 +731,7 @@ <h4 data-number="8.1.2.2" class="anchored" data-anchor-id="step-2-normalize-kern
 <section id="step-3-sum-the-normalized-kernels" class="level4" data-number="8.1.2.3">
 <h4 data-number="8.1.2.3" class="anchored" data-anchor-id="step-3-sum-the-normalized-kernels"><span class="header-section-number">8.1.2.3</span> Step 3: Sum the Normalized Kernels</h4>
 <p>Our KDE curve is the sum of the normalized kernels. Notice that the final curve is identical to the plot generated by <code>sns.kdeplot</code> we saw earlier!</p>
-<div id="f7ecd078" class="cell" data-execution_count="9">
+<div id="99ef6cbc" class="cell" data-execution_count="9">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>plt.xlim(<span class="op">-</span><span class="dv">3</span>, <span class="dv">10</span>)</span>
@@ -835,7 +841,7 @@ <h4 data-number="8.1.3.2" class="anchored" data-anchor-id="boxcar-kernel"><span
         0, &amp; \text{else }
     \end{cases}\]</span></p>
 <p>The boxcar kernel is seldom used in practice – we include it here to demonstrate that a kernel function can take whatever form you would like, provided it integrates to 1 and does not output negative values.</p>
-<div id="ca29aee4" class="cell" data-execution_count="10">
+<div id="cff8e58f" class="cell" data-execution_count="10">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> boxcar_kernel(alpha, x, z):</span>
@@ -882,7 +888,7 @@ <h2 data-number="8.2" class="anchored" data-anchor-id="diving-deeper-into-displo
 <p>As we saw earlier, we can use <code>seaborn</code>’s <code>displot</code> function to plot various distributions. In particular, <code>displot</code> allows you to specify the <code>kind</code> of plot and is a wrapper for <code>histplot</code>, <code>kdeplot</code>, and <code>ecdfplot</code>.</p>
 <p>Below, we can see a couple of examples of how <code>sns.displot</code> can be used to plot various distributions.</p>
 <p>First, we can plot a histogram by setting <code>kind</code> to <code>"hist"</code>. Note that here we’ve specified <code>stat = "density"</code> to normalize the histogram such that the area under the histogram is equal to 1.</p>
-<div id="c7879892" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="11">
+<div id="dcea686a" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>sns.displot(data<span class="op">=</span>wb, </span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>            x<span class="op">=</span><span class="st">"gni"</span>, </span>
 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>            kind<span class="op">=</span><span class="st">"hist"</span>, </span>
@@ -897,7 +903,7 @@ <h2 data-number="8.2" class="anchored" data-anchor-id="diving-deeper-into-displo
 </div>
 </div>
 <p>Now, what if we want to generate a KDE plot? We can set <code>kind = "kde"</code>!</p>
-<div id="025f0581" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="12">
+<div id="82e5b33a" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="12">
 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>sns.displot(data<span class="op">=</span>wb, </span>
 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>            x<span class="op">=</span><span class="st">"gni"</span>, </span>
 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>            kind<span class="op">=</span><span class="st">'kde'</span>)</span>
@@ -911,7 +917,7 @@ <h2 data-number="8.2" class="anchored" data-anchor-id="diving-deeper-into-displo
 </div>
 </div>
 <p>And finally, if we want to generate an Empirical Cumulative Distribution Function (ECDF), we can specify <code>kind = "ecdf"</code>.</p>
-<div id="ba5535fd" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="13">
+<div id="fe87c288" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>sns.displot(data<span class="op">=</span>wb, </span>
 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>            x<span class="op">=</span><span class="st">"gni"</span>, </span>
 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>            kind<span class="op">=</span><span class="st">'ecdf'</span>)</span>
@@ -932,7 +938,7 @@ <h2 data-number="8.3" class="anchored" data-anchor-id="relationships-between-qua
 <h4 data-number="8.3.0.1" class="anchored" data-anchor-id="scatter-plots"><span class="header-section-number">8.3.0.1</span> Scatter Plots</h4>
 <p><strong>Scatter plots</strong> are one of the most useful tools in representing the relationship between <strong>pairs</strong> of quantitative variables. They are particularly important in gauging the strength, or correlation, of the relationship between variables. Knowledge of these relationships can then motivate decisions in our modeling process.</p>
 <p>In <code>matplotlib</code>, we use the function <code>plt.scatter</code> to generate a scatter plot. Notice that, unlike our examples of plotting single-variable distributions, now we specify sequences of values to be plotted along the x-axis <em>and</em> the y-axis.</p>
-<div id="781a2ea2" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="14">
+<div id="2e192be4" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="14">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>plt.scatter(wb[<span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>], <span class="op">\</span></span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>            wb[<span class="st">'Adult literacy rate: Female: % ages 15 and older: 2005-14'</span>])</span>
 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -948,7 +954,7 @@ <h4 data-number="8.3.0.1" class="anchored" data-anchor-id="scatter-plots"><span
 </div>
 </div>
 <p>In <code>seaborn</code>, we call the function <code>sns.scatterplot</code>. We use the <code>x</code> and <code>y</code> parameters to indicate the values to be plotted along the x and y axes, respectively. By using the <code>hue</code> parameter, we can specify a third variable to be used for coloring each scatter point.</p>
-<div id="ab39307c" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="15">
+<div id="0405fbd5" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="15">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>sns.scatterplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>, <span class="op">\</span></span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a>               y <span class="op">=</span> <span class="st">"Adult literacy rate: Female: % ages 15 and older: 2005-14"</span>, </span>
 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a>               hue <span class="op">=</span> <span class="st">"Continent"</span>)</span>
@@ -971,7 +977,7 @@ <h5 data-number="8.3.0.1.1" class="anchored" data-anchor-id="overplotting"><span
 <li><strong>Jittering</strong> is the process of adding a small amount of random noise to all x and y values to slightly shift the position of each datapoint. By randomly shifting all the data by some small distance, we can discern individual points more clearly without modifying the major trends of the original dataset.</li>
 </ul>
 <p>In the cell below, we first jitter the data using <code>np.random.uniform</code>, then re-plot it with smaller markers. The resulting plot is much easier to interpret.</p>
-<div id="a4de2f37" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="16">
+<div id="2ed29765" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="16">
 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Setting a seed ensures that we produce the same plot each time</span></span>
 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="co"># This means that the course notes will not change each time you access them</span></span>
 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a>np.random.seed(<span class="dv">150</span>)</span>
@@ -1005,7 +1011,7 @@ <h5 data-number="8.3.0.1.1" class="anchored" data-anchor-id="overplotting"><span
 <h4 data-number="8.3.0.2" class="anchored" data-anchor-id="lmplot-and-jointplot"><span class="header-section-number">8.3.0.2</span> <code>lmplot</code> and <code>jointplot</code></h4>
 <p><code>seaborn</code> also includes several built-in functions for creating more sophisticated scatter plots. Two of the most commonly used examples are <code>sns.lmplot</code> and <code>sns.jointplot</code>.</p>
 <p><code>sns.lmplot</code> plots both a scatter plot <em>and</em> a linear regression line, all in one function call. We’ll discuss linear regression in a few lectures.</p>
-<div id="0048183b" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="17">
+<div id="9e936021" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="17">
 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a>sns.lmplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>, <span class="op">\</span></span>
 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a>           y <span class="op">=</span> <span class="st">"Adult literacy rate: Female: % ages 15 and older: 2005-14"</span>)</span>
 <span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1019,7 +1025,7 @@ <h4 data-number="8.3.0.2" class="anchored" data-anchor-id="lmplot-and-jointplot"
 </div>
 </div>
 <p><code>sns.jointplot</code> creates a visualization with three components: a scatter plot, a histogram of the distribution of x values, and a histogram of the distribution of y values.</p>
-<div id="b007454d" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="18">
+<div id="31f8d44f" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="18">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>sns.jointplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>, <span class="op">\</span></span>
 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>           y <span class="op">=</span> <span class="st">"Adult literacy rate: Female: % ages 15 and older: 2005-14"</span>)</span>
 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1040,7 +1046,7 @@ <h4 data-number="8.3.0.3" class="anchored" data-anchor-id="hex-plots"><span clas
 <p>For datasets with a very large number of datapoints, jittering is unlikely to fully resolve the issue of overplotting. In these cases, we can attempt to visualize our data by its <em>density</em>, rather than displaying each individual datapoint.</p>
 <p><strong>Hex plots</strong> can be thought of as two-dimensional histograms that show the joint distribution between two variables. This is particularly useful when working with very dense data. In a hex plot, the x-y plane is binned into hexagons. Hexagons that are darker in color indicate a greater density of data – that is, there are more data points that lie in the region enclosed by the hexagon.</p>
 <p>We can generate a hex plot using <code>sns.jointplot</code> modified with the <code>kind</code> parameter.</p>
-<div id="37df951f" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="19">
+<div id="ddb6e429" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="19">
 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>sns.jointplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>, <span class="op">\</span></span>
 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>              y <span class="op">=</span> <span class="st">"Adult literacy rate: Female: % ages 15 and older: 2005-14"</span>, <span class="op">\</span></span>
 <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>              kind <span class="op">=</span> <span class="st">"hex"</span>)</span>
@@ -1061,7 +1067,7 @@ <h4 data-number="8.3.0.3" class="anchored" data-anchor-id="hex-plots"><span clas
 <h4 data-number="8.3.0.4" class="anchored" data-anchor-id="contour-plots"><span class="header-section-number">8.3.0.4</span> Contour Plots</h4>
 <p><strong>Contour plots</strong> are an alternative way of plotting the joint distribution of two variables. You can think of them as the 2-dimensional versions of KDE plots. A contour plot can be interpreted in a similar way to a <a href="https://gisgeography.com/contour-lines-topographic-map/">topographic map</a>. Each contour line represents an area that has the same <em>density</em> of datapoints throughout the region. Contours marked with darker colors contain more datapoints (a higher density) in that region.</p>
 <p><code>sns.kdeplot</code> will generate a contour plot if we specify both x and y data.</p>
-<div id="2e605433" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="20">
+<div id="7466813d" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="20">
 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a>sns.kdeplot(data <span class="op">=</span> wb, x <span class="op">=</span> <span class="st">"per capita: </span><span class="sc">% g</span><span class="st">rowth: 2016"</span>, <span class="op">\</span></span>
 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a>            y <span class="op">=</span> <span class="st">"Adult literacy rate: Female: % ages 15 and older: 2005-14"</span>, <span class="op">\</span></span>
 <span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a>            fill <span class="op">=</span> <span class="va">True</span>)</span>
@@ -1083,7 +1089,7 @@ <h2 data-number="8.4" class="anchored" data-anchor-id="transformations"><span cl
 <p>Much of this was done to uncover insights in data, which will prove necessary when we begin building models of data later in the course. A strong graphical correlation between two variables hints at an underlying relationship that we may want to study in greater detail. However, relying on visual relationships alone is limiting - not all plots show association. The presence of outliers and other statistical anomalies makes it hard to interpret data.</p>
 <p><strong>Transformations</strong> are the process of manipulating data to find significant relationships between variables. These are often found by applying mathematical functions to variables that “transform” their range of possible values and highlight some previously hidden associations between data.</p>
 <p>To see why we may want to transform data, consider the following plot of adult literacy rates against gross national income.</p>
-<div id="7a654602" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="21">
+<div id="c238e83c" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="21">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Some data cleaning to help with the next example</span></span>
@@ -1127,7 +1133,7 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="linearization-and-apply
 </ul>
 <p>One function that produces this result is the <strong>log transformation</strong>. Taking the logarithm of a large number shrinks its magnitude dramatically, while taking the logarithm of a small number changes its value far less (to illustrate this, compare <span class="math inline">\(\log{(100)} \approx 4.61\)</span> with <span class="math inline">\(\log{(10)} \approx 2.30\)</span>).</p>
 <p>In Data 100 (and most upper-division STEM classes), <span class="math inline">\(\log\)</span> is used to refer to the natural logarithm with base <span class="math inline">\(e\)</span>.</p>
-<div id="9765ffae" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="22">
+<div id="2dca5c42" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="22">
 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co"># np.log takes the logarithm of an array or Series</span></span>
 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>plt.scatter(np.log(df[<span class="st">"inc"</span>]), df[<span class="st">"lit"</span>])</span>
 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1150,7 +1156,7 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="linearization-and-apply
 <li>Not substantially alter the scaling of small values of y (we do not want to drastically modify the lower end of the y axis, which is already distributed evenly on the vertical scale).</li>
 </ul>
 <p>In this case, it is helpful to apply a <strong>power transformation</strong> – that is, raise our y values to a power. Let’s try raising our adult literacy rate values to the power of 4. Large values raised to the power of 4 will increase in magnitude proportionally much more than small values raised to the power of 4 (consider the difference between <span class="math inline">\(2^4 = 16\)</span> and <span class="math inline">\(200^4 = 1600000000\)</span>).</p>
-<div id="608abc66" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="23">
+<div id="7fad5a15" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="23">
 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Apply a log transformation to the x values and a power transformation to the y values</span></span>
 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>plt.scatter(np.log(df[<span class="st">"inc"</span>]), df[<span class="st">"lit"</span>]<span class="op">**</span><span class="dv">4</span>)</span>
 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -1171,7 +1177,7 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="linearization-and-apply
 <p><span class="math display">\[y^4 = m(\log{x}) + b\]</span></p>
 <p>Where <span class="math inline">\(m\)</span> represents the slope of the linear fit, while <span class="math inline">\(b\)</span> represents the intercept.</p>
 <p>The cell below computes <span class="math inline">\(m\)</span> and <span class="math inline">\(b\)</span> for our transformed data. We’ll discuss how this code was generated in a future lecture.</p>
-<div id="136d9746" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="24">
+<div id="32be344d" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="24">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb24"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="co"># The code below fits a linear regression model. We'll discuss it at length in a future lecture</span></span>
@@ -1209,7 +1215,7 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="linearization-and-apply
 <p>By rearranging the equation, we find a relationship between the untransformed variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>.</p>
 <p><span class="math display">\[y = [m(\log{x}) + b]^{(1/4)}\]</span></p>
 <p>When we plug in the values for <span class="math inline">\(m\)</span> and <span class="math inline">\(b\)</span> computed above, something interesting happens.</p>
-<div id="565c3dc3" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="25">
+<div id="ba30b549" class="cell" data-vscode="{&quot;languageId&quot;:&quot;python&quot;}" data-execution_count="25">
 <details class="code-fold">
 <summary>Code</summary>
 <div class="sourceCode cell-code" id="cb26"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Now, plug the values for m and b into the relationship between the untransformed x and y</span></span>
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png
index 733e9b37..8010863c 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf
index 8c55df05..1dcb2f96 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf
index 885421af..69f534be 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf
index 7d6552ed..caefd075 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf
index ee86546c..dbf1189e 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf
index 30d559bf..c5f8cd40 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf
index 12fe9a24..bd63185f 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf
index 88c58431..a90da7da 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf
index e1a8fdb0..6f3b896e 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf
index d2380e36..c08bc5af 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf
index b3b23e96..6dd30373 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf
index c7333899..69c831df 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf
index 98d44bf7..b7f152fd 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf
index 58b627db..5d7e1188 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf
index 19e27bb2..7c5b3e18 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf
index f1974c34..5e667abb 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf
index 6b69e383..c1442297 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf
index 5f6c26ef..fdd851d2 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf
index 07d21dd9..b7a452ca 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf
index b9ef9838..954eafe4 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf
index a14da02e..7ca51572 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf
index 5e01364c..98c3421d 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf
index 3eb3ac5b..4336c78e 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf
index 0938d2ff..6cdecb8e 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf
index 5e8ae3b7..38bf3e14 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf and b/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf differ
diff --git a/index.tex b/index.tex
index e6aa58b1..bd7a1bdd 100644
--- a/index.tex
+++ b/index.tex
@@ -247,7 +247,7 @@ \section*{About the Course Notes}\label{about-the-course-notes}
 
 \chapter{Introduction}\label{introduction}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -301,7 +301,7 @@ \chapter{Introduction}\label{introduction}
 allowing you to take data and produce useful insights on the world's
 most challenging and ambiguous problems.
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Course Goals}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Course Goals}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -319,7 +319,7 @@ \chapter{Introduction}\label{introduction}
 
 \end{tcolorbox}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Some Topics We'll Cover}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Some Topics We'll Cover}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -349,7 +349,7 @@ \chapter{Introduction}\label{introduction}
 
 \end{tcolorbox}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Prerequisites}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Prerequisites}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 To ensure that you can get the most out of the course content, please
 make sure that you are familiar with:
@@ -580,7 +580,7 @@ \section{Conclusion}\label{conclusion}
 
 \chapter{Pandas I}\label{pandas-i}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -1920,7 +1920,7 @@ \section{Parting Note}\label{parting-note}
 
 \chapter{Pandas II}\label{pandas-ii}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -2691,7 +2691,7 @@ \subsection{\texorpdfstring{\texttt{.sample()}}{.sample()}}\label{sample}
 \endhead
 \bottomrule\noalign{}
 \endlastfoot
-344574 & CA & M & 2000 & Tevin & 6 \\
+299916 & CA & M & 1981 & Reid & 21 \\
 \end{longtable}
 
 Naturally, this can be chained with other methods and operators
@@ -2711,11 +2711,11 @@ \subsection{\texorpdfstring{\texttt{.sample()}}{.sample()}}\label{sample}
 \endhead
 \bottomrule\noalign{}
 \endlastfoot
-185733 & 2009 & Rylie & 86 \\
-60362 & 1969 & Carolee & 7 \\
-271501 & 1960 & Ed & 105 \\
-171292 & 2005 & Ebelin & 7 \\
-377076 & 2012 & Camryn & 12 \\
+394504 & 2018 & Montgomery & 14 \\
+187825 & 2009 & Monzerrat & 9 \\
+326066 & 1993 & Delfino & 14 \\
+391405 & 2017 & Henri & 17 \\
+192230 & 2010 & Ruhi & 8 \\
 \end{longtable}
 
 \begin{Shaded}
@@ -2732,10 +2732,10 @@ \subsection{\texorpdfstring{\texttt{.sample()}}{.sample()}}\label{sample}
 \endhead
 \bottomrule\noalign{}
 \endlastfoot
-151204 & 2000 & Rebecka & 10 \\
-344478 & 2000 & Kabir & 6 \\
-344879 & 2000 & Patric & 5 \\
-152475 & 2000 & Hera & 5 \\
+150457 & 2000 & Keilani & 18 \\
+152403 & 2000 & Daisie & 5 \\
+150496 & 2000 & Emelyn & 17 \\
+152822 & 2000 & Zoya & 5 \\
 \end{longtable}
 
 \subsection{\texorpdfstring{\texttt{.value\_counts()}}{.value\_counts()}}\label{value_counts}
@@ -2857,7 +2857,7 @@ \section{Parting Note}\label{parting-note-1}
 
 \chapter{Pandas III}\label{pandas-iii}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -3127,7 +3127,7 @@ \section{\texorpdfstring{Aggregating Data with
 \end{Shaded}
 
 \begin{verbatim}
-<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10fdf0e90>
+<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10ead4e90>
 \end{verbatim}
 
 What does this strange output mean? Calling \texttt{.groupby}
@@ -3467,7 +3467,7 @@ \subsection{Plotting Birth Counts}\label{plotting-birth-counts}
 \end{Shaded}
 
 \begin{verbatim}
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73688/390646742.py:1: FutureWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83775/390646742.py:1: FutureWarning:
 
 The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
 \end{verbatim}
@@ -4118,7 +4118,7 @@ \subsection{\texorpdfstring{Aggregation with \texttt{lambda}
 \end{Shaded}
 
 \begin{verbatim}
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73688/4278286395.py:1: FutureWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83775/4278286395.py:1: FutureWarning:
 
 The provided callable <built-in function max> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
 \end{verbatim}
@@ -4347,7 +4347,7 @@ \section{Aggregating Data with Pivot
 \end{Shaded}
 
 \begin{verbatim}
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73688/3186035650.py:3: FutureWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83775/3186035650.py:3: FutureWarning:
 
 The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
 \end{verbatim}
@@ -4674,7 +4674,7 @@ \chapter{Data Cleaning and EDA}\label{data-cleaning-and-eda}
 \end{Highlighting}
 \end{Shaded}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -5518,7 +5518,7 @@ \subsubsection{\texorpdfstring{Temporality with \texttt{pandas}'
 \end{Shaded}
 
 \begin{verbatim}
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73731/874729699.py:1: UserWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83817/874729699.py:1: UserWarning:
 
 Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
 \end{verbatim}
@@ -6801,7 +6801,7 @@ \subsection{Exploring Variable Feature
 
 invalid escape sequence '\s'
 
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_73731/150137587.py:3: SyntaxWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_83817/150137587.py:3: SyntaxWarning:
 
 invalid escape sequence '\s'
 \end{verbatim}
@@ -7359,7 +7359,7 @@ \subsection{EDA and Data Wrangling}\label{eda-and-data-wrangling}
 
 \chapter{Regular Expressions}\label{regular-expressions}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -8578,7 +8578,7 @@ \section{Limitations of Regular
 
 \chapter{Visualization I}\label{visualization-i}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -9368,7 +9368,7 @@ \subsubsection{Evaluating Histograms}\label{evaluating-histograms}
 
 \chapter{Visualization II}\label{visualization-ii}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -10764,7 +10764,7 @@ \subsection{Harnessing Context}\label{harnessing-context}
 
 \chapter{Sampling}\label{sampling}
 
-\begin{tcolorbox}[enhanced jigsaw, breakable, colframe=quarto-callout-note-color-frame, left=2mm, bottomtitle=1mm, titlerule=0mm, toprule=.15mm, toptitle=1mm, leftrule=.75mm, colback=white, arc=.35mm, opacitybacktitle=0.6, bottomrule=.15mm, coltitle=black, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colbacktitle=quarto-callout-note-color!10!white, rightrule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
 
 \begin{itemize}
 \tightlist
@@ -11350,7 +11350,7 @@ \subsubsection{Simple Random Sample}\label{simple-random-sample}
 \end{Shaded}
 
 \begin{verbatim}
-np.float64(0.5297526961902748)
+np.float64(0.5298417344656035)
 \end{verbatim}
 
 This is very close to the actual vote of 0.5302792307692308!
@@ -11374,8 +11374,8 @@ \subsubsection{Simple Random Sample}\label{simple-random-sample}
 \end{Highlighting}
 \end{Shaded}
 
-\textbf{Actual} = 0.5303, \textbf{Sample} = 0.5200, \textbf{Err} =
-1.94\%.
+\textbf{Actual} = 0.5303, \textbf{Sample} = 0.5262, \textbf{Err} =
+0.76\%.
 
 We'll learn how to choose this number when we (re)learn the Central
 Limit Theorem later in the semester.
@@ -11450,6 +11450,996 @@ \section{Summary}\label{summary-1}
 between the sample and the population. Ultimately, the dataset doesn't
 tell us about the world behind the data.
 
+\bookmarksetup{startatroot}
+
+\chapter{Introduction to Modeling}\label{introduction-to-modeling}
+
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, coltitle=black, titlerule=0mm, bottomrule=.15mm, colbacktitle=quarto-callout-note-color!10!white, toprule=.15mm, opacitybacktitle=0.6, left=2mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Learning Outcomes}, colback=white, rightrule=.15mm, leftrule=.75mm, bottomtitle=1mm, breakable, toptitle=1mm, arc=.35mm, opacityback=0]
+
+\begin{itemize}
+\tightlist
+\item
+  Understand what models are and how to carry out the four-step modeling
+  process.
+\item
+  Define the concept of loss and gain familiarity with \(L_1\) and
+  \(L_2\) loss.
+\item
+  Fit the Simple Linear Regression model using minimization techniques.
+\end{itemize}
+
+\end{tcolorbox}
+
+Up until this point in the semester, we've focused on analyzing
+datasets. We've looked into the early stages of the data science
+lifecycle, focusing on the programming tools, visualization techniques,
+and data cleaning methods needed for data analysis.
+
+This lecture marks a shift in focus. We will move away from examining
+datasets to actually \emph{using} our data to better understand the
+world. Specifically, the next sequence of lectures will explore
+predictive modeling: building models to make predictions about
+the world around us. In this lecture, we'll introduce the conceptual
+framework for setting up a modeling task. In the next few lectures,
+we'll put this framework into practice by implementing various kinds of
+models.
+
+\section{What is a Model?}\label{what-is-a-model}
+
+A model is an \textbf{idealized representation} of a system. A system is
+a set of principles or procedures according to which something
+functions. We live in a world full of systems: the procedure of turning
+on a light happens according to a specific set of rules dictating the
+flow of electricity. The truth behind how any event occurs is usually
+complex, and many times the specifics are unknown. The workings of the
+world can be viewed as its own giant procedure. Models seek to simplify
+the world and distill it into workable pieces.
+
+Example: We model the fall of an object on Earth as subject to a
+constant acceleration of \(9.81 m/s^2\) due to gravity.
+
+\begin{itemize}
+\tightlist
+\item
+  While this describes the behavior of our system, it is merely an
+  approximation.
+\item
+  It doesn't account for the effects of air resistance, local variations
+  in gravity, etc.
+\item
+  In practice, it's accurate enough to be useful!
+\end{itemize}
+
+\subsection{Reasons for Building
+Models}\label{reasons-for-building-models}
+
+Why do we want to build models? As far as data scientists and
+statisticians are concerned, there are three reasons, and each implies a
+different focus on modeling.
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\item
+  To explain complex phenomena occurring in the world we live in.
+  Examples of this might be:
+
+  \begin{itemize}
+  \tightlist
+  \item
+    How is the parents' average height related to their children's
+    average height?
+  \item
+    How do an object's velocity and acceleration impact how far it
+    travels? (Physics: \(d = d_0 + vt + \frac{1}{2}at^2\))
+  \end{itemize}
+
+  In these cases, we care about creating models that are \emph{simple
+  and interpretable}, allowing us to understand what the relationships
+  between our variables are.
+\item
+  To make accurate predictions about unseen data. Some examples include:
+
+  \begin{itemize}
+  \tightlist
+  \item
+    Can we predict if an email is spam or not?
+  \item
+    Can we generate a one-sentence summary of this 10-page long article?
+  \end{itemize}
+
+  When making predictions, we care more about accuracy than about
+  interpretability, and we may accept an uninterpretable model in
+  exchange for better predictions. Such models are sometimes called
+  black-box models and are common in fields like deep learning.
+\item
+  To measure the causal effects of one event on some other event. For
+  example,
+
+  \begin{itemize}
+  \tightlist
+  \item
+    Does smoking \emph{cause} lung cancer?
+  \item
+    Does a job training program \emph{cause} increases in employment and
+    wages?
+  \end{itemize}
+
+  This is a much harder question because most statistical tools are
+  designed to infer association, not causation. We will not focus on
+  this task in Data 100, but you can take other advanced classes on
+  causal inference (e.g., Stat 156, Data 102) if you are intrigued!
+\end{enumerate}
+
+Most of the time, we aim to strike a balance between building
+\textbf{interpretable} models and building \textbf{accurate models}.
+
+\subsection{Common Types of Models}\label{common-types-of-models}
+
+In general, models can be split into two categories:
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\item
+  Deterministic physical (mechanistic) models: Laws that govern how the
+  world works.
+
+  \begin{itemize}
+  \tightlist
+  \item
+    \href{https://en.wikipedia.org/wiki/Kepler\%27s_laws_of_planetary_motion\#Third_law}{Kepler's
+    Third Law of Planetary Motion (1619)}: The ratio of the square of an
+    object's orbital period with the cube of the semi-major axis of its
+    orbit is the same for all objects orbiting the same primary.
+
+    \begin{itemize}
+    \tightlist
+    \item
+      \(T^2 \propto R^3\)
+    \end{itemize}
+  \item
+    \href{https://en.wikipedia.org/wiki/Newton\%27s_laws_of_motion}{Newton's
+    Laws: motion and gravitation (1687)}: Newton's second law of motion
+    models the relationship between the mass of an object and the force
+    required to accelerate it.
+
+    \begin{itemize}
+    \tightlist
+    \item
+      \(F = ma\)
+    \item
+      \(F_g = G \frac{m_1 m_2}{r^2}\)
+    \end{itemize}
+  \end{itemize}
+\item
+  Probabilistic models: Models that attempt to understand how random
+  processes evolve. These are more general and can be used to describe
+  many phenomena in the real world. These models commonly make
+  simplifying assumptions about the nature of the world.
+
+  \begin{itemize}
+  \tightlist
+  \item
+    \href{https://en.wikipedia.org/wiki/Poisson_point_process}{Poisson
+    Process models}: Used to model random events that happen with some
+    probability at any point in time and are strictly increasing in
+    count, such as the arrival of customers at a store.
+  \end{itemize}
+\end{enumerate}
+
+Note: These specific models are not in the scope of Data 100 and exist
+to serve as motivation.
+
+\section{Simple Linear Regression}\label{simple-linear-regression}
+
+The \textbf{regression line} is the unique straight line that minimizes
+the \textbf{mean squared error} of estimation among all straight lines.
+As with any straight line, it can be defined by a slope and a
+y-intercept:
+
+\begin{itemize}
+\tightlist
+\item
+  \(\text{slope} = r \cdot \frac{\text{Standard Deviation of } y}{\text{Standard Deviation of }x}\)
+\item
+  \(y\text{-intercept} = \text{average of }y - \text{slope}\cdot\text{average of }x\)
+\item
+  \(\text{regression estimate} = y\text{-intercept} + \text{slope}\cdot x\)
+\item
+  \(\text{residual} =\text{observed }y - \text{regression estimate}\)
+\end{itemize}
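+
+As a minimal sketch, the formulas above can be evaluated with NumPy on a
+small made-up dataset. The arrays \texttt{x} and \texttt{y} and the helper
+name \texttt{regression\_line} are assumptions made for illustration; the
+result is cross-checked against \texttt{np.polyfit}, which fits a least
+squares line.
+
```python
import numpy as np


def regression_line(x, y):
    """Return (slope, intercept) using slope = r * SD(y) / SD(x)."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept


# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = regression_line(x, y)
estimates = intercept + slope * x  # regression estimates
residuals = y - estimates          # observed y minus regression estimate

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
ls_slope, ls_intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```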
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\ImportTok{import}\NormalTok{ pandas }\ImportTok{as}\NormalTok{ pd}
+\ImportTok{import}\NormalTok{ numpy }\ImportTok{as}\NormalTok{ np}
+\ImportTok{import}\NormalTok{ matplotlib.pyplot }\ImportTok{as}\NormalTok{ plt}
+\ImportTok{import}\NormalTok{ seaborn }\ImportTok{as}\NormalTok{ sns}
+\CommentTok{\# Set random seed for consistency }
+\NormalTok{np.random.seed(}\DecValTok{43}\NormalTok{)}
+\NormalTok{plt.style.use(}\StringTok{\textquotesingle{}default\textquotesingle{}}\NormalTok{) }
+
+\CommentTok{\# Generate random noise for plotting}
+\NormalTok{x }\OperatorTok{=}\NormalTok{ np.linspace(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{100}\NormalTok{)}
+\NormalTok{y }\OperatorTok{=}\NormalTok{ x }\OperatorTok{*} \FloatTok{0.5} \OperatorTok{{-}} \DecValTok{1} \OperatorTok{+}\NormalTok{ np.random.randn(}\DecValTok{100}\NormalTok{) }\OperatorTok{*} \FloatTok{0.3}
+
+\CommentTok{\# Plot regression line}
+\NormalTok{sns.regplot(x}\OperatorTok{=}\NormalTok{x,y}\OperatorTok{=}\NormalTok{y)}\OperatorTok{;}
+\end{Highlighting}
+\end{Shaded}
+
+\includegraphics{intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-2-output-1.pdf}
+
+\subsection{Notations and Definitions}\label{notations-and-definitions}
+
+For a pair of variables \(x\) and \(y\) representing our data
+\(\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\), we
+denote their means/averages as \(\bar x\) and \(\bar y\) and standard
+deviations as \(\sigma_x\) and \(\sigma_y\).
+
+\subsubsection{Standard Units}\label{standard-units}
+
+A variable is represented in standard units if the following are true:
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\tightlist
+\item
+  0 in standard units is equal to the mean (\(\bar{x}\)) in the original
+  variable's units.
+\item
+  An increase of 1 standard unit is an increase of 1 standard deviation
+  (\(\sigma_x\)) in the original variable's units.
+\end{enumerate}
+
+To convert a variable \(x_i\) into standard units, we subtract its mean
+from it and divide it by its standard deviation. For example, \(x_i\) in
+standard units is \(\frac{x_i - \bar x}{\sigma_x}\).
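+
+As a small worked example (the numbers are made up for illustration),
+take the five values \(150, 160, 170, 180, 190\). Their mean is
+\(\bar{x} = 170\) and their standard deviation is
+\(\sigma_x = \sqrt{\frac{400 + 100 + 0 + 100 + 400}{5}} = \sqrt{200} \approx 14.14\),
+so the value \(190\) in standard units is
+
+\[\frac{190 - 170}{\sqrt{200}} \approx 1.41\]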
+
+\subsubsection{Correlation}\label{correlation}
+
+The correlation (\(r\)) is the average of the product of \(x\) and
+\(y\), both measured in \emph{standard units}.
+
+\[r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i - \bar{x}}{\sigma_x})(\frac{y_i - \bar{y}}{\sigma_y})\]
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\tightlist
+\item
+  Correlation measures the strength of a \textbf{linear association}
+  between two variables.
+\item
+  Correlations range between -1 and 1: \(|r| \leq 1\), with \(r=1\)
+  indicating perfect positive linear association, and \(r=-1\)
+  indicating perfect negative association. The closer \(r\) is to \(0\),
+  the weaker the linear association is.
+\item
+  Correlation says nothing about causation or non-linear association.
+  Correlation does \textbf{not} imply causation. When \(r = 0\), the two
+  variables are uncorrelated. However, they could still be related
+  through some non-linear relationship.
+\end{enumerate}
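+
+The definition and the properties above can be checked numerically. The
+following is a sketch under stated assumptions: the helper names
+\texttt{standard\_units} and \texttt{correlation} and the test inputs are
+chosen here for illustration. It computes \(r\) as the mean of the product
+of the two variables in standard units and tries it on a perfectly linear
+and a purely quadratic relationship.
+
```python
import numpy as np


def standard_units(x):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (x - np.mean(x)) / np.std(x)


def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))


# Inputs symmetric around 0, chosen only for illustration
x = np.linspace(-3, 3, 101)

r_linear = correlation(x, 2 * x + 1)    # perfect positive linear association
r_negative = correlation(x, -0.5 * x)   # perfect negative linear association
r_parabola = correlation(x, x ** 2)     # strong but non-linear relationship

print(r_linear, r_negative, r_parabola)
```
+
+The quadratic case has a correlation of essentially zero even though
+\(y\) is completely determined by \(x\), which is the situation described
+in the third point above.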
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\KeywordTok{def}\NormalTok{ plot\_and\_get\_corr(ax, x, y, title):}
+\NormalTok{    ax.set\_xlim(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{)}
+\NormalTok{    ax.set\_ylim(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{)}
+\NormalTok{    ax.set\_xticks([])}
+\NormalTok{    ax.set\_yticks([])}
+\NormalTok{    ax.scatter(x, y, alpha }\OperatorTok{=} \FloatTok{0.73}\NormalTok{)}
+\NormalTok{    r }\OperatorTok{=}\NormalTok{ np.corrcoef(x, y)[}\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{]}
+\NormalTok{    ax.set\_title(title }\OperatorTok{+} \StringTok{" (corr: }\SpecialCharTok{\{\}}\StringTok{)"}\NormalTok{.}\BuiltInTok{format}\NormalTok{(r.}\BuiltInTok{round}\NormalTok{(}\DecValTok{2}\NormalTok{)))}
+    \ControlFlowTok{return}\NormalTok{ r}
+
+\NormalTok{fig, axs }\OperatorTok{=}\NormalTok{ plt.subplots(}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{, figsize }\OperatorTok{=}\NormalTok{ (}\DecValTok{10}\NormalTok{, }\DecValTok{10}\NormalTok{))}
+
+\CommentTok{\# Just noise}
+\NormalTok{x1, y1 }\OperatorTok{=}\NormalTok{ np.random.randn(}\DecValTok{2}\NormalTok{, }\DecValTok{100}\NormalTok{)}
+\NormalTok{corr1 }\OperatorTok{=}\NormalTok{ plot\_and\_get\_corr(axs[}\DecValTok{0}\NormalTok{, }\DecValTok{0}\NormalTok{], x1, y1, title }\OperatorTok{=} \StringTok{"noise"}\NormalTok{)}
+
+\CommentTok{\# Strong linear}
+\NormalTok{x2 }\OperatorTok{=}\NormalTok{ np.linspace(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{100}\NormalTok{)}
+\NormalTok{y2 }\OperatorTok{=}\NormalTok{ x2 }\OperatorTok{*} \FloatTok{0.5} \OperatorTok{{-}} \DecValTok{1} \OperatorTok{+}\NormalTok{ np.random.randn(}\DecValTok{100}\NormalTok{) }\OperatorTok{*} \FloatTok{0.3}
+\NormalTok{corr2 }\OperatorTok{=}\NormalTok{ plot\_and\_get\_corr(axs[}\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{], x2, y2, title }\OperatorTok{=} \StringTok{"strong linear"}\NormalTok{)}
+
+\CommentTok{\# Unequal spread}
+\NormalTok{x3 }\OperatorTok{=}\NormalTok{ np.linspace(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{100}\NormalTok{)}
+\NormalTok{y3 }\OperatorTok{=} \OperatorTok{{-}}\NormalTok{ x3}\OperatorTok{/}\DecValTok{3} \OperatorTok{+}\NormalTok{ np.random.randn(}\DecValTok{100}\NormalTok{)}\OperatorTok{*}\NormalTok{(x3)}\OperatorTok{/}\FloatTok{2.5}
+\NormalTok{corr3 }\OperatorTok{=}\NormalTok{ plot\_and\_get\_corr(axs[}\DecValTok{1}\NormalTok{, }\DecValTok{0}\NormalTok{], x3, y3, title }\OperatorTok{=} \StringTok{"unequal spread"}\NormalTok{)}
+\NormalTok{extent }\OperatorTok{=}\NormalTok{ axs[}\DecValTok{1}\NormalTok{, }\DecValTok{0}\NormalTok{].get\_window\_extent().transformed(fig.dpi\_scale\_trans.inverted())}
+
+\CommentTok{\# Strong non{-}linear}
+\NormalTok{x4 }\OperatorTok{=}\NormalTok{ np.linspace(}\OperatorTok{{-}}\DecValTok{3}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{100}\NormalTok{)}
+\NormalTok{y4 }\OperatorTok{=} \DecValTok{2}\OperatorTok{*}\NormalTok{np.sin(x4 }\OperatorTok{{-}} \FloatTok{1.5}\NormalTok{) }\OperatorTok{+}\NormalTok{ np.random.randn(}\DecValTok{100}\NormalTok{) }\OperatorTok{*} \FloatTok{0.3}
+\NormalTok{corr4 }\OperatorTok{=}\NormalTok{ plot\_and\_get\_corr(axs[}\DecValTok{1}\NormalTok{, }\DecValTok{1}\NormalTok{], x4, y4, title }\OperatorTok{=} \StringTok{"strong non{-}linear"}\NormalTok{)}
+
+\NormalTok{plt.show()}
+\end{Highlighting}
+\end{Shaded}
+
+\includegraphics{intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-3-output-1.pdf}
+
+\subsection{Alternate Form}\label{alternate-form}
+
+When the variables \(y\) and \(x\) are measured in \emph{standard
+units}, the regression line for predicting \(y\) based on \(x\) has
+slope \(r\) and passes through the origin.
+
+\[\hat{y}_{su} = r \cdot x_{su}\]
+
+\includegraphics{intro_to_modeling/images/reg_line_1.png}
+
+\begin{itemize}
+\tightlist
+\item
+  In the original units, this becomes
+\end{itemize}
+
+\[\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}\]
+
+\includegraphics{intro_to_modeling/images/reg_line_2.png}
+
+\subsection{Derivation}\label{derivation}
+
+Starting from the top, we have our claimed form of the regression line,
+and we want to show that it is equivalent to the optimal linear
+regression line: \(\hat{y} = \hat{a} + \hat{b}x\).
+
+Recall:
+
+\begin{itemize}
+\tightlist
+\item
+  \(\hat{b} = r \cdot \frac{\text{Standard Deviation of }y}{\text{Standard Deviation of }x}\)
+\item
+  \(\hat{a} = \text{average of }y - \text{slope}\cdot\text{average of }x\)
+\end{itemize}
+
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-color-frame, colback=white, rightrule=.15mm, leftrule=.75mm, bottomrule=.15mm, arc=.35mm, breakable, toprule=.15mm, left=2mm, opacityback=0]
+
+Proof:
+
+\[\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}\]
+
+Multiply by \(\sigma_y\), and add \(\bar{y}\) on both sides.
+
+\[\hat{y} = \sigma_y \cdot r \cdot \frac{x - \bar{x}}{\sigma_x} + \bar{y}\]
+
+Distribute the coefficient \(\sigma_{y}\cdot r\) across the
+\(\frac{x - \bar{x}}{\sigma_x}\) term and regroup:
+
+\[\hat{y} = (\frac{r\sigma_y}{\sigma_x} ) \cdot x + (\bar{y} - (\frac{r\sigma_y}{\sigma_x} ) \bar{x})\]
+
+We now see that we have a line that matches our claim:
+
+\begin{itemize}
+\tightlist
+\item
+  slope:
+  \(r\cdot\frac{\text{SD of y}}{\text{SD of x}} = r\cdot\frac{\sigma_y}{\sigma_x}\)
+\item
+  intercept: \(\bar{y} - \text{slope}\cdot \bar{x}\)
+\end{itemize}
+
+Note that the error for the \(i\)-th datapoint is: \(e_i = y_i - \hat{y}_i\).
+
+\end{tcolorbox}
+
+\section{The Modeling Process}\label{the-modeling-process}
+
+At a high level, a model is a way of representing a system. In Data 100,
+we'll treat a model as some mathematical rule we use to describe the
+relationship between variables.
+
+What variables are we modeling? Typically, we use a subset of the
+variables in our sample of collected data to model another variable in
+this data. To put this more formally, say we have the following dataset
+\(\mathcal{D}\):
+
+\[\mathcal{D} = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\]
+
+Each pair of values \((x_i, y_i)\) represents a datapoint. In a modeling
+setting, we call these \textbf{observations}. \(y_i\) is the dependent
+variable we are trying to model, also called an \textbf{output} or
+\textbf{response}. \(x_i\) is the independent variable inputted into the
+model to make predictions, also known as a \textbf{feature}.
+
+Our goal in modeling is to use the observed data \(\mathcal{D}\) to
+predict the output variable \(y_i\). We denote each prediction as
+\(\hat{y}_i\) (read: ``y hat sub i'').
+
+How do we generate these predictions? Some examples of models we'll
+encounter in the next few lectures are given below:
+
+\[\hat{y}_i = \theta\] \[\hat{y}_i = \theta_0 + \theta_1 x_i\]
+
+The examples above are known as \textbf{parametric models}. They relate
+the collected data, \(x_i\), to the prediction we make, \(\hat{y}_i\). A
+few parameters (\(\theta\), \(\theta_0\), \(\theta_1\)) are used to
+describe the relationship between \(x_i\) and \(\hat{y}_i\).
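+
+In code, a parametric model is simply a function of the input and the
+parameters. The sketch below writes the two models above as Python
+functions; the function names \texttt{constant\_model} and
+\texttt{slr\_model} and the parameter values passed in are hypothetical
+and only meant to show how parameters turn inputs into predictions.
+
```python
import numpy as np


def constant_model(x, theta):
    """Constant model: predict the same value theta for every input."""
    return np.full(len(x), theta, dtype=float)


def slr_model(x, theta_0, theta_1):
    """Simple linear regression model: y_hat = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * np.asarray(x, dtype=float)


# Hypothetical inputs and parameter values, chosen only for illustration
x = np.array([0.0, 1.0, 2.0])
constant_preds = constant_model(x, theta=4.0)
slr_preds = slr_model(x, theta_0=1.0, theta_1=2.0)

print(constant_preds)
print(slr_preds)
```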
+
+Notice that we don't immediately know the values of these parameters.
+While the features, \(x_i\), are taken from our observed data, we need
+to decide what values to give \(\theta\), \(\theta_0\), and \(\theta_1\)
+ourselves. This is the heart of parametric modeling: \emph{what
+parameter values should we choose so our model makes the best possible
+predictions?}
+
+To choose our model parameters, we'll work through the \textbf{modeling
+process}.
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\tightlist
+\item
+  Choose a model: how should we represent the world?
+\item
+  Choose a loss function: how do we quantify prediction error?
+\item
+  Fit the model: how do we choose the best parameters of our model given
+  our data?
+\item
+  Evaluate model performance: how do we evaluate whether this process
+  gave rise to a good model?
+\end{enumerate}
+
+\section{Choosing a Model}\label{choosing-a-model}
+
+Our first step is choosing a model: defining the mathematical rule that
+describes the relationship between the features, \(x_i\), and
+predictions \(\hat{y}_i\).
+
+In
+\href{https://inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html}{Data
+8}, you learned about the \textbf{Simple Linear Regression (SLR) model}.
+You learned that the model takes the form: \[\hat{y}_i = a + bx_i\]
+
+In Data 100, we'll use slightly different notation: we will replace
+\(a\) with \(\theta_0\) and \(b\) with \(\theta_1\). This will allow us
+to use the same notation when we explore more complex models later on in
+the course.
+
+\[\hat{y}_i = \theta_0 + \theta_1 x_i\]
+
+The parameters of the SLR model are \(\theta_0\), also called the
+intercept term, and \(\theta_1\), also called the slope term. To create
+an effective model, we want to choose values for \(\theta_0\) and
+\(\theta_1\) that most accurately predict the output variable. The
+``best'' fitting model parameters are given the special names:
+\(\hat{\theta}_0\) and \(\hat{\theta}_1\); they are the specific
+parameter values that allow our model to generate the best possible
+predictions.
+
+In Data 8, you learned that the best SLR model parameters are:
+\[\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x} \qquad \qquad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]
+
+A quick reminder on notation:
+
+\begin{itemize}
+\tightlist
+\item
+  \(\bar{y}\) and \(\bar{x}\) indicate the mean value of \(y\) and
+  \(x\), respectively
+\item
+  \(\sigma_y\) and \(\sigma_x\) indicate the standard deviations of
+  \(y\) and \(x\)
+\item
+  \(r\) is the
+  \href{https://inferentialthinking.com/chapters/15/1/Correlation.html\#the-correlation-coefficient}{correlation
+  coefficient}, defined as the average of the product of \(x\) and \(y\)
+  measured in standard units:
+  \(\frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y})\)
+\end{itemize}
+
+In Data 100, we want to understand \emph{how} to derive these best model
+coefficients. To do so, we'll introduce the concept of a loss function.
+
+\section{Choosing a Loss Function}\label{choosing-a-loss-function}
+
+We've talked about the idea of creating the ``best'' possible
+predictions. This raises the question: how do we decide how ``good'' or
+``bad'' our model's predictions are?
+
+A \textbf{loss function} characterizes the cost, error, or fit resulting
+from a particular choice of model or model parameters. This function,
+\(L(y, \hat{y})\), quantifies how ``bad'' or ``far off'' a single
+prediction by our model is from a true, observed value in our collected
+data.
+
+The choice of loss function for a particular model will affect the
+accuracy and computational cost of estimation, and it'll also depend on
+the estimation task at hand. For example,
+
+\begin{itemize}
+\tightlist
+\item
+  Are outputs quantitative or qualitative?
+\item
+  Do outliers matter?
+\item
+  Are all errors equally costly? (e.g., a false negative on a cancer
+  test is arguably more dangerous than a false positive)
+\end{itemize}
+
+Regardless of the specific function used, a loss function should follow
+two basic principles:
+
+\begin{itemize}
+\tightlist
+\item
+  If the prediction \(\hat{y}_i\) is \emph{close} to the actual value
+  \(y_i\), loss should be low.
+\item
+  If the prediction \(\hat{y}_i\) is \emph{far} from the actual value
+  \(y_i\), loss should be high.
+\end{itemize}
+
+Two common choices of loss function are squared loss and absolute loss.
+
+\textbf{Squared loss}, also known as \textbf{L2 loss}, computes loss as
+the square of the difference between the observed \(y_i\) and predicted
+\(\hat{y}_i\): \[L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\]
+
+\textbf{Absolute loss}, also known as \textbf{L1 loss}, computes loss as
+the absolute difference between the observed \(y_i\) and predicted
+\(\hat{y}_i\): \[L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|\]
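+
+To make the difference concrete, here is a small worked example with
+made-up numbers. Suppose \(y_i = 10\). A prediction of \(\hat{y}_i = 7\)
+gives
+\[L_2 = (10 - 7)^2 = 9 \qquad \qquad L_1 = |10 - 7| = 3\]
+while a prediction of \(\hat{y}_i = 4\), which is off by twice as much,
+gives \[L_2 = (10 - 4)^2 = 36 \qquad \qquad L_1 = |10 - 4| = 6\]
+Doubling the error doubles the absolute loss but quadruples the squared
+loss, so squared loss penalizes large errors more heavily.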
+
+L1 and L2 loss give us a tool for quantifying our model's performance on
+a single data point. This is a good start, but ideally, we want to
+understand how our model performs across our \emph{entire} dataset. A
+natural way to do this is to compute the average loss across all data
+points in the dataset. This is known as the \textbf{cost function},
+\(\hat{R}(\theta)\):
+\[\hat{R}(\theta) = \frac{1}{n} \sum^n_{i=1} L(y_i, \hat{y}_i)\]
+
+The cost function has many names in the statistics literature. You may
+also encounter the terms:
+
+\begin{itemize}
+\tightlist
+\item
+  Empirical risk (this is why we give the cost function the name \(R\))
+\item
+  Error function
+\item
+  Average loss
+\end{itemize}
+
+We can substitute our L1 and L2 loss into the cost function definition.
+The \textbf{Mean Squared Error (MSE)} is the average squared loss across
+a dataset: \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
+
+The \textbf{Mean Absolute Error (MAE)} is the average absolute loss
+across a dataset:
+\[\text{MAE}= \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\]
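+
+Both cost functions are one-liners in NumPy. The sketch below uses the
+hypothetical helper names \texttt{mse} and \texttt{mae} and made-up
+observations and predictions; it is meant only to show the two averages
+side by side.
+
```python
import numpy as np


def mse(y, y_hat):
    """Mean squared error: average L2 loss over the dataset."""
    return np.mean((y - y_hat) ** 2)


def mae(y, y_hat):
    """Mean absolute error: average L1 loss over the dataset."""
    return np.mean(np.abs(y - y_hat))


# Hypothetical observations and predictions; the errors are 1, 0, -1, -4
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 13.0])

print(mse(y, y_hat))  # (1 + 0 + 1 + 16) / 4 = 4.5
print(mae(y, y_hat))  # (1 + 0 + 1 + 4) / 4 = 1.5
```
+
+The single large error of \(-4\) contributes \(16\) of the \(18\) total
+squared loss but only \(4\) of the \(6\) total absolute loss.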
+
+\section{Fitting the Model}\label{fitting-the-model}
+
+Now that we've established the concept of a loss function, we can return
+to our original goal of choosing model parameters. Specifically, we want
+to choose the best set of model parameters that will minimize the
+model's cost on our dataset. This process is called fitting the model.
+
+We know from calculus that a differentiable function has a local minimum
+at a point where (1) its first derivative is equal to zero and (2) its
+second derivative is positive. We often call the function being minimized
+the \textbf{objective function} (our objective is to find its minimum).
+
+To find the optimal model parameter, we:
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\tightlist
+\item
+  Take the derivative of the cost function with respect to that
+  parameter
+\item
+  Set the derivative equal to 0
+\item
+  Solve for the parameter
+\end{enumerate}
+
+We repeat this process for each parameter present in the model. For now,
+we'll disregard the second derivative condition.
+
+To help us make sense of this process, let's put it into action by
+deriving the optimal model parameters for simple linear regression using
+the mean squared error as our cost function. Remember: although the
+notation may look tricky, all we are doing is following the three steps
+above!
+
+Step 1: take the derivative of the cost function with respect to each
+model parameter. We substitute the SLR model,
+\(\hat{y}_i = \theta_0+\theta_1 x_i\), into the definition of MSE above
+and differentiate with respect to \(\theta_0\) and \(\theta_1\).
+\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2\]
+
+\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)\]
+
+\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)x_i\]
+
+Let's walk through these derivations in more depth, starting with the
+derivative of MSE with respect to \(\theta_0\).
+
+Given our MSE above, we know that:
+\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+
+Noting that the derivative of a sum is equivalent to the sum of the
+derivatives, this then becomes:
+\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_0} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+
+We can then apply the chain rule.
+
+\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot (y_i - \theta_0 - \theta_1 x_i) \cdot (-1)\]
+
+Finally, we can simplify the constants, leaving us with our answer.
+
+\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n}{(y_i - \theta_0 - \theta_1 x_i)}\]
+
+Following the same procedure, we can take the derivative of MSE with
+respect to \(\theta_1\).
+
+\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+
+\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_1} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+
+\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot (y_i - \theta_0 - \theta_1 x_i) \cdot (-x_i)\]
+
+\[= \frac{-2}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}x_i\]
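+
+One way to sanity-check the two partial derivatives is to compare them
+with finite differences. The sketch below does this on synthetic data;
+the function names, the random data, and the evaluation point
+\((\theta_0, \theta_1) = (0.2, 0.7)\) are all assumptions made for
+illustration.
+
```python
import numpy as np


def mse(theta_0, theta_1, x, y):
    """MSE of the SLR model y_hat = theta_0 + theta_1 * x."""
    return np.mean((y - theta_0 - theta_1 * x) ** 2)


def mse_gradient(theta_0, theta_1, x, y):
    """Partial derivatives of MSE, using the formulas derived above."""
    residual = y - theta_0 - theta_1 * x
    d_theta_0 = -2 * np.mean(residual)
    d_theta_1 = -2 * np.mean(residual * x)
    return np.array([d_theta_0, d_theta_1])


# Synthetic data, generated only for this check
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 * x - 0.5 + rng.normal(scale=0.3, size=50)

# Central finite differences at an arbitrary point
theta_0, theta_1, eps = 0.2, 0.7, 1e-6
numeric = np.array([
    (mse(theta_0 + eps, theta_1, x, y) - mse(theta_0 - eps, theta_1, x, y)) / (2 * eps),
    (mse(theta_0, theta_1 + eps, x, y) - mse(theta_0, theta_1 - eps, x, y)) / (2 * eps),
])
analytic = mse_gradient(theta_0, theta_1, x, y)

print(numeric)
print(analytic)
```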
+
+Step 2: set the derivatives equal to 0. After simplifying terms, this
+produces two \textbf{estimating equations}. The best set of model
+parameters \((\hat{\theta}_0, \hat{\theta}_1)\) \emph{must} satisfy
+these two optimality conditions.
+\[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0\]
+\[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)x_i \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)x_i = 0\]
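+
+The two estimating equations say that, at the optimum, the residuals
+average to zero and are uncorrelated with the feature. As a sketch on
+synthetic data (the data and variable names are assumptions for
+illustration), we can fit a least squares line with \texttt{np.polyfit}
+and evaluate both conditions.
+
```python
import numpy as np

# Synthetic data, generated only for this check
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 3.0 - 0.8 * x + rng.normal(scale=1.0, size=40)

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
theta_1_hat, theta_0_hat = np.polyfit(x, y, 1)
residuals = y - (theta_0_hat + theta_1_hat * x)

eq_1 = np.mean(residuals)      # first estimating equation
eq_2 = np.mean(residuals * x)  # second estimating equation

print(eq_1, eq_2)  # both should be zero up to floating point error
```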
+
+Step 3: solve the estimating equations to compute estimates for
+\(\hat{\theta}_0\) and \(\hat{\theta}_1\).
+
+Taking the first equation gives the estimate of \(\hat{\theta}_0\):
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) = 0 \]
+
+\[\left(\frac{1}{n} \sum_{i=1}^n y_i \right) - \hat{\theta}_0 - \hat{\theta}_1\left(\frac{1}{n} \sum_{i=1}^n x_i \right) = 0\]
+
+\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}\]
+
+With a bit more maneuvering, the second equation gives the estimate of
+\(\hat{\theta}_1\). Start by multiplying the first estimating equation
+by \(\bar{x}\), then subtracting the result from the second estimating
+equation.
+
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)x_i - \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\bar{x} = 0 \]
+
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)(x_i - \bar{x}) = 0 \]
+
+Next, plug in
+\(\hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_i = \bar{y} + \hat{\theta}_1(x_i - \bar{x})\):
+
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y} - \hat{\theta}_1(x_i - \bar{x}))(x_i - \bar{x}) = 0 \]
+
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\theta}_1 \times \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
+\]
+
+By using the definition of correlation
+\(\left(r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y}) \right)\)
+and standard deviation
+\(\left(\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \right)\),
+we can conclude:
+\[r \sigma_x \sigma_y = \hat{\theta}_1 \times \sigma_x^2\]
+\[\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]
+
+Just as was given in Data 8!
+
+Remember, this derivation found the optimal model parameters for SLR
+when using the MSE cost function. If we had used a different model or
+different loss function, we likely would have found different values for
+the best model parameters. However, regardless of the model and loss
+used, we can \emph{always} follow these three steps to fit the model.
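+
+The claim that a different loss can lead to different fitted parameters
+can be illustrated numerically. The sketch below assumes the constant
+model \(\hat{y}_i = \theta\) and a made-up dataset with one outlier, and
+evaluates MSE and MAE on a grid of candidate \(\theta\) values. The grid
+search is a brute-force stand-in used only for illustration, not the
+calculus-based procedure above.
+
```python
import numpy as np

# Hypothetical data with one large outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 20.0])

# Candidate parameter values for the constant model y_hat = theta
thetas = np.linspace(0, 20, 2001)

# Average loss over the dataset for each candidate theta
mse_values = np.array([np.mean((y - theta) ** 2) for theta in thetas])
mae_values = np.array([np.mean(np.abs(y - theta)) for theta in thetas])

best_theta_mse = thetas[np.argmin(mse_values)]
best_theta_mae = thetas[np.argmin(mae_values)]

print(best_theta_mse, best_theta_mae)
```
+
+On this data the MSE-minimizing value lands on the mean of \(y\)
+(\(6\)), while the MAE-minimizing value lands on the median (\(3\)), so
+the choice of loss changes the fitted parameter.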
+
+\section{Evaluating the SLR Model}\label{evaluating-the-slr-model}
+
+Now that we've explored the mathematics behind (1) choosing a model, (2)
+choosing a loss function, and (3) fitting the model, we're left with one
+final question -- how ``good'' are the predictions made by this ``best''
+fitted model? To determine this, we can:
+
+\begin{enumerate}
+\def\labelenumi{\arabic{enumi}.}
+\item
+  Visualize data and compute statistics:
+
+  \begin{itemize}
+  \tightlist
+  \item
+    Plot the original data.
+  \item
+    Compute each column's mean and standard deviation. If the mean and
+    standard deviation of our predictions are close to those of the
+    original observed \(y_i\)'s, we might be inclined to say that our
+    model has done well.
+  \item
+    (If we're fitting a linear model) Compute the correlation \(r\). A
+    large magnitude for the correlation coefficient between the feature
+    and response variables could also indicate that our model has done
+    well.
+  \end{itemize}
+\item
+  Performance metrics:
+
+  \begin{itemize}
+  \tightlist
+  \item
+    We can take the \textbf{Root Mean Squared Error (RMSE)}.
+
+    \begin{itemize}
+    \tightlist
+    \item
+      It's the square root of the mean squared error (MSE), which is the
+      average loss that we've been minimizing to determine optimal model
+      parameters.
+    \item
+      RMSE is in the same units as \(y\).
+    \item
+      A lower RMSE indicates more ``accurate'' predictions, as we have a
+      lower ``average loss'' across the data.
+    \end{itemize}
+  \end{itemize}
+
+  \[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}\]
+\item
+  Visualization:
+
+  \begin{itemize}
+  \tightlist
+  \item
+    Look at the residual plot of \(e_i = y_i - \hat{y}_i\) to visualize
+    the difference between actual and predicted values. A good
+    residual plot should not show any pattern between the input features
+    \(x_i\) and the residual values \(e_i\).
+  \end{itemize}
+\end{enumerate}
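+
+The performance metric and the residuals can be computed in a few lines.
+The sketch below is a minimal example on made-up data; the helper name
+\texttt{rmse} and the arrays are assumptions for illustration, and the
+fit uses \texttt{np.polyfit} as a stand-in for the formulas derived
+earlier.
+
```python
import numpy as np


def rmse(y, y_hat):
    """Root mean squared error, in the same units as y."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


# Hypothetical data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Fit the least squares line and compute predictions
theta_1_hat, theta_0_hat = np.polyfit(x, y, 1)
y_hat = theta_0_hat + theta_1_hat * x

# Residuals e_i = y_i - y_hat_i; these are what a residual plot shows against x
residuals = y - y_hat
model_rmse = rmse(y, y_hat)

print(model_rmse)
print(residuals)
```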
+
+To illustrate this process, let's take a look at \textbf{Anscombe's
+quartet}.
+
+\subsection{Four Mysterious Datasets (Anscombe's
+quartet)}\label{four-mysterious-datasets-anscombes-quartet}
+
+Let's take a look at four different datasets.
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\ImportTok{import}\NormalTok{ numpy }\ImportTok{as}\NormalTok{ np}
+\ImportTok{import}\NormalTok{ pandas }\ImportTok{as}\NormalTok{ pd}
+\ImportTok{import}\NormalTok{ matplotlib.pyplot }\ImportTok{as}\NormalTok{ plt}
+\OperatorTok{\%}\NormalTok{matplotlib inline}
+\ImportTok{import}\NormalTok{ seaborn }\ImportTok{as}\NormalTok{ sns}
+\ImportTok{import}\NormalTok{ itertools}
+\ImportTok{from}\NormalTok{ mpl\_toolkits.mplot3d }\ImportTok{import}\NormalTok{ Axes3D}
+\end{Highlighting}
+\end{Shaded}
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\CommentTok{\# Big font helper}
+\KeywordTok{def}\NormalTok{ adjust\_fontsize(size}\OperatorTok{=}\VariableTok{None}\NormalTok{):}
+\NormalTok{    SMALL\_SIZE }\OperatorTok{=} \DecValTok{8}
+\NormalTok{    MEDIUM\_SIZE }\OperatorTok{=} \DecValTok{10}
+\NormalTok{    BIGGER\_SIZE }\OperatorTok{=} \DecValTok{12}
+    \ControlFlowTok{if}\NormalTok{ size }\KeywordTok{is} \KeywordTok{not} \VariableTok{None}\NormalTok{:}
+\NormalTok{        SMALL\_SIZE }\OperatorTok{=}\NormalTok{ MEDIUM\_SIZE }\OperatorTok{=}\NormalTok{ BIGGER\_SIZE }\OperatorTok{=}\NormalTok{ size}
+
+\NormalTok{    plt.rc(}\StringTok{"font"}\NormalTok{, size}\OperatorTok{=}\NormalTok{SMALL\_SIZE)  }\CommentTok{\# controls default text sizes}
+\NormalTok{    plt.rc(}\StringTok{"axes"}\NormalTok{, titlesize}\OperatorTok{=}\NormalTok{SMALL\_SIZE)  }\CommentTok{\# fontsize of the axes title}
+\NormalTok{    plt.rc(}\StringTok{"axes"}\NormalTok{, labelsize}\OperatorTok{=}\NormalTok{MEDIUM\_SIZE)  }\CommentTok{\# fontsize of the x and y labels}
+\NormalTok{    plt.rc(}\StringTok{"xtick"}\NormalTok{, labelsize}\OperatorTok{=}\NormalTok{SMALL\_SIZE)  }\CommentTok{\# fontsize of the tick labels}
+\NormalTok{    plt.rc(}\StringTok{"ytick"}\NormalTok{, labelsize}\OperatorTok{=}\NormalTok{SMALL\_SIZE)  }\CommentTok{\# fontsize of the tick labels}
+\NormalTok{    plt.rc(}\StringTok{"legend"}\NormalTok{, fontsize}\OperatorTok{=}\NormalTok{SMALL\_SIZE)  }\CommentTok{\# legend fontsize}
+\NormalTok{    plt.rc(}\StringTok{"figure"}\NormalTok{, titlesize}\OperatorTok{=}\NormalTok{BIGGER\_SIZE)  }\CommentTok{\# fontsize of the figure title}
+
+
+\CommentTok{\# Helper functions}
+\KeywordTok{def}\NormalTok{ standard\_units(x):}
+    \ControlFlowTok{return}\NormalTok{ (x }\OperatorTok{{-}}\NormalTok{ np.mean(x)) }\OperatorTok{/}\NormalTok{ np.std(x)}
+
+
+\KeywordTok{def}\NormalTok{ correlation(x, y):}
+    \ControlFlowTok{return}\NormalTok{ np.mean(standard\_units(x) }\OperatorTok{*}\NormalTok{ standard\_units(y))}
+
+
+\KeywordTok{def}\NormalTok{ slope(x, y):}
+    \ControlFlowTok{return}\NormalTok{ correlation(x, y) }\OperatorTok{*}\NormalTok{ np.std(y) }\OperatorTok{/}\NormalTok{ np.std(x)}
+
+
+\KeywordTok{def}\NormalTok{ intercept(x, y):}
+    \ControlFlowTok{return}\NormalTok{ np.mean(y) }\OperatorTok{{-}}\NormalTok{ slope(x, y) }\OperatorTok{*}\NormalTok{ np.mean(x)}
+
+
+\KeywordTok{def}\NormalTok{ fit\_least\_squares(x, y):}
+\NormalTok{    theta\_0 }\OperatorTok{=}\NormalTok{ intercept(x, y)}
+\NormalTok{    theta\_1 }\OperatorTok{=}\NormalTok{ slope(x, y)}
+    \ControlFlowTok{return}\NormalTok{ theta\_0, theta\_1}
+
+
+\KeywordTok{def}\NormalTok{ predict(x, theta\_0, theta\_1):}
+    \ControlFlowTok{return}\NormalTok{ theta\_0 }\OperatorTok{+}\NormalTok{ theta\_1 }\OperatorTok{*}\NormalTok{ x}
+
+
+\KeywordTok{def}\NormalTok{ compute\_mse(y, yhat):}
+    \ControlFlowTok{return}\NormalTok{ np.mean((y }\OperatorTok{{-}}\NormalTok{ yhat) }\OperatorTok{**} \DecValTok{2}\NormalTok{)}
+
+
+\NormalTok{plt.style.use(}\StringTok{"default"}\NormalTok{)  }\CommentTok{\# Revert style to default mpl}
+\end{Highlighting}
+\end{Shaded}
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\CommentTok{\# Visualization modes for least\_squares\_evaluation}
+\NormalTok{NO\_VIZ, RESID, RESID\_SCATTER }\OperatorTok{=} \BuiltInTok{range}\NormalTok{(}\DecValTok{3}\NormalTok{)}
+
+
+\KeywordTok{def}\NormalTok{ least\_squares\_evaluation(x, y, visualize}\OperatorTok{=}\NormalTok{NO\_VIZ):}
+    \CommentTok{\# statistics}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"x\_mean : }\SpecialCharTok{\{}\NormalTok{np}\SpecialCharTok{.}\NormalTok{mean(x)}\SpecialCharTok{:.2f\}}\SpecialStringTok{, y\_mean : }\SpecialCharTok{\{}\NormalTok{np}\SpecialCharTok{.}\NormalTok{mean(y)}\SpecialCharTok{:.2f\}}\SpecialStringTok{"}\NormalTok{)}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"x\_stdev: }\SpecialCharTok{\{}\NormalTok{np}\SpecialCharTok{.}\NormalTok{std(x)}\SpecialCharTok{:.2f\}}\SpecialStringTok{, y\_stdev: }\SpecialCharTok{\{}\NormalTok{np}\SpecialCharTok{.}\NormalTok{std(y)}\SpecialCharTok{:.2f\}}\SpecialStringTok{"}\NormalTok{)}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"r = Correlation(x, y): }\SpecialCharTok{\{}\NormalTok{correlation(x, y)}\SpecialCharTok{:.3f\}}\SpecialStringTok{"}\NormalTok{)}
+
+    \CommentTok{\# Performance metrics}
+\NormalTok{    ahat, bhat }\OperatorTok{=}\NormalTok{ fit\_least\_squares(x, y)}
+\NormalTok{    yhat }\OperatorTok{=}\NormalTok{ predict(x, ahat, bhat)}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"theta\_0: }\SpecialCharTok{\{}\NormalTok{ahat}\SpecialCharTok{:.2f\}}\SpecialStringTok{, theta\_1: }\SpecialCharTok{\{}\NormalTok{bhat}\SpecialCharTok{:.2f\}}\SpecialStringTok{"}\NormalTok{)}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"RMSE: }\SpecialCharTok{\{}\NormalTok{np}\SpecialCharTok{.}\NormalTok{sqrt(compute\_mse(y, yhat))}\SpecialCharTok{:.3f\}}\SpecialStringTok{"}\NormalTok{)}
+
+    \CommentTok{\# visualization}
+\NormalTok{    fig, ax\_resid }\OperatorTok{=} \VariableTok{None}\NormalTok{, }\VariableTok{None}
+    \ControlFlowTok{if}\NormalTok{ visualize }\OperatorTok{==}\NormalTok{ RESID\_SCATTER:}
+\NormalTok{        fig, axs }\OperatorTok{=}\NormalTok{ plt.subplots(}\DecValTok{1}\NormalTok{, }\DecValTok{2}\NormalTok{, figsize}\OperatorTok{=}\NormalTok{(}\DecValTok{8}\NormalTok{, }\DecValTok{3}\NormalTok{))}
+\NormalTok{        axs[}\DecValTok{0}\NormalTok{].scatter(x, y)}
+\NormalTok{        axs[}\DecValTok{0}\NormalTok{].plot(x, yhat)}
+\NormalTok{        axs[}\DecValTok{0}\NormalTok{].set\_title(}\StringTok{"LS fit"}\NormalTok{)}
+\NormalTok{        ax\_resid }\OperatorTok{=}\NormalTok{ axs[}\DecValTok{1}\NormalTok{]}
+    \ControlFlowTok{elif}\NormalTok{ visualize }\OperatorTok{==}\NormalTok{ RESID:}
+\NormalTok{        fig }\OperatorTok{=}\NormalTok{ plt.figure(figsize}\OperatorTok{=}\NormalTok{(}\DecValTok{4}\NormalTok{, }\DecValTok{3}\NormalTok{))}
+\NormalTok{        ax\_resid }\OperatorTok{=}\NormalTok{ plt.gca()}
+
+    \ControlFlowTok{if}\NormalTok{ ax\_resid }\KeywordTok{is} \KeywordTok{not} \VariableTok{None}\NormalTok{:}
+\NormalTok{        ax\_resid.scatter(x, y }\OperatorTok{{-}}\NormalTok{ yhat, color}\OperatorTok{=}\StringTok{"red"}\NormalTok{)}
+\NormalTok{        ax\_resid.axhline(}\DecValTok{0}\NormalTok{, color}\OperatorTok{=}\StringTok{"black"}\NormalTok{)  }\CommentTok{\# zero line spanning the full x range}
+\NormalTok{        ax\_resid.set\_title(}\StringTok{"Residuals"}\NormalTok{)}
+
+    \ControlFlowTok{return}\NormalTok{ fig}
+\end{Highlighting}
+\end{Shaded}
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\CommentTok{\# Load in four different datasets: I, II, III, IV}
+\NormalTok{x }\OperatorTok{=}\NormalTok{ [}\DecValTok{10}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{13}\NormalTok{, }\DecValTok{9}\NormalTok{, }\DecValTok{11}\NormalTok{, }\DecValTok{14}\NormalTok{, }\DecValTok{6}\NormalTok{, }\DecValTok{4}\NormalTok{, }\DecValTok{12}\NormalTok{, }\DecValTok{7}\NormalTok{, }\DecValTok{5}\NormalTok{]}
+\NormalTok{y1 }\OperatorTok{=}\NormalTok{ [}\FloatTok{8.04}\NormalTok{, }\FloatTok{6.95}\NormalTok{, }\FloatTok{7.58}\NormalTok{, }\FloatTok{8.81}\NormalTok{, }\FloatTok{8.33}\NormalTok{, }\FloatTok{9.96}\NormalTok{, }\FloatTok{7.24}\NormalTok{, }\FloatTok{4.26}\NormalTok{, }\FloatTok{10.84}\NormalTok{, }\FloatTok{4.82}\NormalTok{, }\FloatTok{5.68}\NormalTok{]}
+\NormalTok{y2 }\OperatorTok{=}\NormalTok{ [}\FloatTok{9.14}\NormalTok{, }\FloatTok{8.14}\NormalTok{, }\FloatTok{8.74}\NormalTok{, }\FloatTok{8.77}\NormalTok{, }\FloatTok{9.26}\NormalTok{, }\FloatTok{8.10}\NormalTok{, }\FloatTok{6.13}\NormalTok{, }\FloatTok{3.10}\NormalTok{, }\FloatTok{9.13}\NormalTok{, }\FloatTok{7.26}\NormalTok{, }\FloatTok{4.74}\NormalTok{]}
+\NormalTok{y3 }\OperatorTok{=}\NormalTok{ [}\FloatTok{7.46}\NormalTok{, }\FloatTok{6.77}\NormalTok{, }\FloatTok{12.74}\NormalTok{, }\FloatTok{7.11}\NormalTok{, }\FloatTok{7.81}\NormalTok{, }\FloatTok{8.84}\NormalTok{, }\FloatTok{6.08}\NormalTok{, }\FloatTok{5.39}\NormalTok{, }\FloatTok{8.15}\NormalTok{, }\FloatTok{6.42}\NormalTok{, }\FloatTok{5.73}\NormalTok{]}
+\NormalTok{x4 }\OperatorTok{=}\NormalTok{ [}\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{19}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{, }\DecValTok{8}\NormalTok{]}
+\NormalTok{y4 }\OperatorTok{=}\NormalTok{ [}\FloatTok{6.58}\NormalTok{, }\FloatTok{5.76}\NormalTok{, }\FloatTok{7.71}\NormalTok{, }\FloatTok{8.84}\NormalTok{, }\FloatTok{8.47}\NormalTok{, }\FloatTok{7.04}\NormalTok{, }\FloatTok{5.25}\NormalTok{, }\FloatTok{12.50}\NormalTok{, }\FloatTok{5.56}\NormalTok{, }\FloatTok{7.91}\NormalTok{, }\FloatTok{6.89}\NormalTok{]}
+
+\NormalTok{anscombe }\OperatorTok{=}\NormalTok{ \{}
+    \StringTok{"I"}\NormalTok{: pd.DataFrame(}\BuiltInTok{list}\NormalTok{(}\BuiltInTok{zip}\NormalTok{(x, y1)), columns}\OperatorTok{=}\NormalTok{[}\StringTok{"x"}\NormalTok{, }\StringTok{"y"}\NormalTok{]),}
+    \StringTok{"II"}\NormalTok{: pd.DataFrame(}\BuiltInTok{list}\NormalTok{(}\BuiltInTok{zip}\NormalTok{(x, y2)), columns}\OperatorTok{=}\NormalTok{[}\StringTok{"x"}\NormalTok{, }\StringTok{"y"}\NormalTok{]),}
+    \StringTok{"III"}\NormalTok{: pd.DataFrame(}\BuiltInTok{list}\NormalTok{(}\BuiltInTok{zip}\NormalTok{(x, y3)), columns}\OperatorTok{=}\NormalTok{[}\StringTok{"x"}\NormalTok{, }\StringTok{"y"}\NormalTok{]),}
+    \StringTok{"IV"}\NormalTok{: pd.DataFrame(}\BuiltInTok{list}\NormalTok{(}\BuiltInTok{zip}\NormalTok{(x4, y4)), columns}\OperatorTok{=}\NormalTok{[}\StringTok{"x"}\NormalTok{, }\StringTok{"y"}\NormalTok{]),}
+\NormalTok{\}}
+
+\CommentTok{\# Plot the scatter plot and line of best fit}
+\NormalTok{fig, axs }\OperatorTok{=}\NormalTok{ plt.subplots(}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{, figsize}\OperatorTok{=}\NormalTok{(}\DecValTok{10}\NormalTok{, }\DecValTok{10}\NormalTok{))}
+
+\ControlFlowTok{for}\NormalTok{ i, dataset }\KeywordTok{in} \BuiltInTok{enumerate}\NormalTok{([}\StringTok{"I"}\NormalTok{, }\StringTok{"II"}\NormalTok{, }\StringTok{"III"}\NormalTok{, }\StringTok{"IV"}\NormalTok{]):}
+\NormalTok{    ans }\OperatorTok{=}\NormalTok{ anscombe[dataset]}
+\NormalTok{    x, y }\OperatorTok{=}\NormalTok{ ans[}\StringTok{"x"}\NormalTok{], ans[}\StringTok{"y"}\NormalTok{]}
+\NormalTok{    ahat, bhat }\OperatorTok{=}\NormalTok{ fit\_least\_squares(x, y)}
+\NormalTok{    yhat }\OperatorTok{=}\NormalTok{ predict(x, ahat, bhat)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].scatter(x, y, alpha}\OperatorTok{=}\FloatTok{0.6}\NormalTok{, color}\OperatorTok{=}\StringTok{"red"}\NormalTok{)  }\CommentTok{\# plot the x, y points}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].plot(x, yhat)  }\CommentTok{\# plot the line of best fit}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_xlabel(}\SpecialStringTok{f"$x\_}\SpecialCharTok{\{}\NormalTok{i}\OperatorTok{+}\DecValTok{1}\SpecialCharTok{\}}\SpecialStringTok{$"}\NormalTok{)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_ylabel(}\SpecialStringTok{f"$y\_}\SpecialCharTok{\{}\NormalTok{i}\OperatorTok{+}\DecValTok{1}\SpecialCharTok{\}}\SpecialStringTok{$"}\NormalTok{)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_title(}\SpecialStringTok{f"Dataset }\SpecialCharTok{\{}\NormalTok{dataset}\SpecialCharTok{\}}\SpecialStringTok{"}\NormalTok{)}
+
+\NormalTok{plt.show()}
+\end{Highlighting}
+\end{Shaded}
+
+\includegraphics{intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-7-output-1.pdf}
+
+While these four sets of datapoints look very different, they all have
+nearly identical means \(\bar x\), \(\bar y\), standard deviations
+\(\sigma_x\), \(\sigma_y\), correlation \(r\), and RMSE (they agree to
+about two decimal places)! If we looked only at these statistics, we
+would probably be inclined to say that these datasets are similar.
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\ControlFlowTok{for}\NormalTok{ dataset }\KeywordTok{in}\NormalTok{ [}\StringTok{"I"}\NormalTok{, }\StringTok{"II"}\NormalTok{, }\StringTok{"III"}\NormalTok{, }\StringTok{"IV"}\NormalTok{]:}
+    \BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"\textgreater{}\textgreater{}\textgreater{} Dataset }\SpecialCharTok{\{}\NormalTok{dataset}\SpecialCharTok{\}}\SpecialStringTok{:"}\NormalTok{)}
+\NormalTok{    ans }\OperatorTok{=}\NormalTok{ anscombe[dataset]}
+\NormalTok{    fig }\OperatorTok{=}\NormalTok{ least\_squares\_evaluation(ans[}\StringTok{"x"}\NormalTok{], ans[}\StringTok{"y"}\NormalTok{], visualize}\OperatorTok{=}\NormalTok{NO\_VIZ)}
+    \BuiltInTok{print}\NormalTok{()}
+    \BuiltInTok{print}\NormalTok{()}
+\end{Highlighting}
+\end{Shaded}
+
+\begin{verbatim}
+>>> Dataset I:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.119
+
+
+>>> Dataset II:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.119
+
+
+>>> Dataset III:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.816
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.118
+
+
+>>> Dataset IV:
+x_mean : 9.00, y_mean : 7.50
+x_stdev: 3.16, y_stdev: 1.94
+r = Correlation(x, y): 0.817
+theta_0: 3.00, theta_1: 0.50
+RMSE: 1.118
+
+\end{verbatim}
+
+We may also wish to visualize the model's \textbf{residuals}, defined as
+the difference between each observed value \(y_i\) and its prediction
+\(\hat{y}_i\) (\(e_i = y_i - \hat{y}_i\)). This gives a high-level view
+of how ``off'' each prediction is from the true observed value. Recall
+that you explored this concept in
+\href{https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html?highlight=heteroscedasticity\#detecting-heteroscedasticity}{Data
+8}: a good regression fit should display no clear pattern in its plot of
+residuals. The residual plots for Anscombe's quartet are displayed
+below. Only the first plot shows no clear pattern in its residuals,
+which indicates that SLR is not the best choice of model for the
+remaining three datasets.
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\CommentTok{\# Residual visualization}
+\NormalTok{fig, axs }\OperatorTok{=}\NormalTok{ plt.subplots(}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{, figsize}\OperatorTok{=}\NormalTok{(}\DecValTok{10}\NormalTok{, }\DecValTok{10}\NormalTok{))}
+
+\ControlFlowTok{for}\NormalTok{ i, dataset }\KeywordTok{in} \BuiltInTok{enumerate}\NormalTok{([}\StringTok{"I"}\NormalTok{, }\StringTok{"II"}\NormalTok{, }\StringTok{"III"}\NormalTok{, }\StringTok{"IV"}\NormalTok{]):}
+\NormalTok{    ans }\OperatorTok{=}\NormalTok{ anscombe[dataset]}
+\NormalTok{    x, y }\OperatorTok{=}\NormalTok{ ans[}\StringTok{"x"}\NormalTok{], ans[}\StringTok{"y"}\NormalTok{]}
+\NormalTok{    ahat, bhat }\OperatorTok{=}\NormalTok{ fit\_least\_squares(x, y)}
+\NormalTok{    yhat }\OperatorTok{=}\NormalTok{ predict(x, ahat, bhat)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].scatter(}
+\NormalTok{        x, y }\OperatorTok{{-}}\NormalTok{ yhat, alpha}\OperatorTok{=}\FloatTok{0.6}\NormalTok{, color}\OperatorTok{=}\StringTok{"red"}
+\NormalTok{    )  }\CommentTok{\# plot the residuals against x}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].plot(}
+\NormalTok{        x, np.zeros\_like(x), color}\OperatorTok{=}\StringTok{"black"}
+\NormalTok{    )  }\CommentTok{\# plot the zero line for reference}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_xlabel(}\SpecialStringTok{f"$x\_}\SpecialCharTok{\{}\NormalTok{i}\OperatorTok{+}\DecValTok{1}\SpecialCharTok{\}}\SpecialStringTok{$"}\NormalTok{)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_ylabel(}\SpecialStringTok{f"$e\_}\SpecialCharTok{\{}\NormalTok{i}\OperatorTok{+}\DecValTok{1}\SpecialCharTok{\}}\SpecialStringTok{$"}\NormalTok{)}
+\NormalTok{    axs[i }\OperatorTok{//} \DecValTok{2}\NormalTok{, i }\OperatorTok{\%} \DecValTok{2}\NormalTok{].set\_title(}\SpecialStringTok{f"Dataset }\SpecialCharTok{\{}\NormalTok{dataset}\SpecialCharTok{\}}\SpecialStringTok{ Residuals"}\NormalTok{)}
+
+\NormalTok{plt.show()}
+\end{Highlighting}
+\end{Shaded}
+
+\includegraphics{intro_to_modeling/intro_to_modeling_files/figure-pdf/cell-9-output-1.pdf}
+
 
 
 \end{document}
diff --git a/intro_to_modeling/intro_to_modeling.qmd b/intro_to_modeling/intro_to_modeling.qmd
index c3b2299a..f276938b 100644
--- a/intro_to_modeling/intro_to_modeling.qmd
+++ b/intro_to_modeling/intro_to_modeling.qmd
@@ -391,3 +391,204 @@ Just as was given in Data 8!
 
 Remember, this derivation found the optimal model parameters for SLR when using the MSE cost function. If we had used a different model or different loss function, we likely would have found different values for the best model parameters. However, regardless of the model and loss used, we can *always* follow these three steps to fit the model.
 
+## Evaluating the SLR Model
+
+Now that we've explored the mathematics behind (1) choosing a model, (2) choosing a loss function, and (3) fitting the model, we're left with one final question: how "good" are the predictions made by this "best" fitted model? To determine this, we can:
+
+1. Visualize data and compute statistics:
+   - Plot the original data.
+   - Compute each column's mean and standard deviation. If the mean and standard deviation of our predictions are close to those of the original observed $y_i$'s, we might be inclined to say that our model has done well.
+   - If we're fitting a linear model, compute the correlation $r$. A correlation coefficient with a large magnitude between the feature and response variables could also indicate that our model has done well.
+
+2. Performance metrics:
+
+   - We can compute the **Root Mean Squared Error (RMSE)**.
+     - It's the square root of the mean squared error (MSE), which is the average loss that we've been minimizing to determine optimal model parameters.
+     - RMSE is in the same units as $y$.
+     - A lower RMSE indicates more "accurate" predictions, as we have a lower "average loss" across the data.
+
+   $$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$
+
+3. Visualization:
+   - Look at the residual plot of $e_i = y_i - \hat{y}_i$ to visualize the difference between actual and predicted values. A good residual plot should not show any pattern between the input features $x_i$ and the residual values $e_i$.
+
+To illustrate this process, let's take a look at **Anscombe's quartet**.
+
+### Four Mysterious Datasets (Anscombe’s quartet)
+
+Let's take a look at four different datasets.
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+%matplotlib inline
+import seaborn as sns
+import itertools
+from mpl_toolkits.mplot3d import Axes3D
+```
+
+```{python}
+#| code-fold: true
+# Big font helper
+def adjust_fontsize(size=None):
+    SMALL_SIZE = 8
+    MEDIUM_SIZE = 10
+    BIGGER_SIZE = 12
+    if size is not None:
+        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size
+
+    plt.rc("font", size=SMALL_SIZE)  # controls default text sizes
+    plt.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
+    plt.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
+    plt.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
+    plt.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
+    plt.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
+    plt.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title
+
+
+# Helper functions
+def standard_units(x):
+    return (x - np.mean(x)) / np.std(x)
+
+
+def correlation(x, y):
+    return np.mean(standard_units(x) * standard_units(y))
+
+
+def slope(x, y):
+    return correlation(x, y) * np.std(y) / np.std(x)
+
+
+def intercept(x, y):
+    return np.mean(y) - slope(x, y) * np.mean(x)
+
+
+def fit_least_squares(x, y):
+    theta_0 = intercept(x, y)
+    theta_1 = slope(x, y)
+    return theta_0, theta_1
+
+
+def predict(x, theta_0, theta_1):
+    return theta_0 + theta_1 * x
+
+
+def compute_mse(y, yhat):
+    return np.mean((y - yhat) ** 2)
+
+
+plt.style.use("default")  # Revert style to default mpl
+```
+
+```{python}
+# Visualization modes for least_squares_evaluation
+NO_VIZ, RESID, RESID_SCATTER = range(3)
+
+
+def least_squares_evaluation(x, y, visualize=NO_VIZ):
+    # statistics
+    print(f"x_mean : {np.mean(x):.2f}, y_mean : {np.mean(y):.2f}")
+    print(f"x_stdev: {np.std(x):.2f}, y_stdev: {np.std(y):.2f}")
+    print(f"r = Correlation(x, y): {correlation(x, y):.3f}")
+
+    # Performance metrics
+    ahat, bhat = fit_least_squares(x, y)
+    yhat = predict(x, ahat, bhat)
+    print(f"theta_0: {ahat:.2f}, theta_1: {bhat:.2f}")
+    print(f"RMSE: {np.sqrt(compute_mse(y, yhat)):.3f}")
+
+    # visualization
+    fig, ax_resid = None, None
+    if visualize == RESID_SCATTER:
+        fig, axs = plt.subplots(1, 2, figsize=(8, 3))
+        axs[0].scatter(x, y)
+        axs[0].plot(x, yhat)
+        axs[0].set_title("LS fit")
+        ax_resid = axs[1]
+    elif visualize == RESID:
+        fig = plt.figure(figsize=(4, 3))
+        ax_resid = plt.gca()
+
+    if ax_resid is not None:
+        ax_resid.scatter(x, y - yhat, color="red")
+        ax_resid.axhline(0, color="black")  # zero line spanning the full x range
+        ax_resid.set_title("Residuals")
+
+    return fig
+```
+
+```{python}
+#| code-fold: true
+# Load in four different datasets: I, II, III, IV
+x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
+y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
+y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
+y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
+x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
+y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
+
+anscombe = {
+    "I": pd.DataFrame(list(zip(x, y1)), columns=["x", "y"]),
+    "II": pd.DataFrame(list(zip(x, y2)), columns=["x", "y"]),
+    "III": pd.DataFrame(list(zip(x, y3)), columns=["x", "y"]),
+    "IV": pd.DataFrame(list(zip(x4, y4)), columns=["x", "y"]),
+}
+
+# Plot the scatter plot and line of best fit
+fig, axs = plt.subplots(2, 2, figsize=(10, 10))
+
+for i, dataset in enumerate(["I", "II", "III", "IV"]):
+    ans = anscombe[dataset]
+    x, y = ans["x"], ans["y"]
+    ahat, bhat = fit_least_squares(x, y)
+    yhat = predict(x, ahat, bhat)
+    axs[i // 2, i % 2].scatter(x, y, alpha=0.6, color="red")  # plot the x, y points
+    axs[i // 2, i % 2].plot(x, yhat)  # plot the line of best fit
+    axs[i // 2, i % 2].set_xlabel(f"$x_{i+1}$")
+    axs[i // 2, i % 2].set_ylabel(f"$y_{i+1}$")
+    axs[i // 2, i % 2].set_title(f"Dataset {dataset}")
+
+plt.show()
+```
+
+While these four sets of datapoints look very different, they all have nearly identical means $\bar x$, $\bar y$, standard deviations $\sigma_x$, $\sigma_y$, correlation $r$, and RMSE (they agree to about two decimal places)! If we looked only at these statistics, we would probably be inclined to say that these datasets are similar.
+
+```{python}
+#| code-fold: true
+for dataset in ["I", "II", "III", "IV"]:
+    print(f">>> Dataset {dataset}:")
+    ans = anscombe[dataset]
+    fig = least_squares_evaluation(ans["x"], ans["y"], visualize=NO_VIZ)
+    print()
+    print()
+```
+
+We may also wish to visualize the model's **residuals**, defined as the difference between each observed value $y_i$ and its prediction $\hat{y}_i$ ($e_i = y_i - \hat{y}_i$). This gives a high-level view of how "off" each prediction is from the true observed value. Recall that you explored this concept in [Data 8](https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html?highlight=heteroscedasticity#detecting-heteroscedasticity): a good regression fit should display no clear pattern in its plot of residuals. The residual plots for Anscombe's quartet are displayed below. Only the first plot shows no clear pattern in its residuals, which indicates that SLR is not the best choice of model for the remaining three datasets.
+
+<!-- <img src="images/residual.png" alt='residual' width='600'> -->
+
+```{python}
+#| code-fold: true
+# Residual visualization
+fig, axs = plt.subplots(2, 2, figsize=(10, 10))
+
+for i, dataset in enumerate(["I", "II", "III", "IV"]):
+    ans = anscombe[dataset]
+    x, y = ans["x"], ans["y"]
+    ahat, bhat = fit_least_squares(x, y)
+    yhat = predict(x, ahat, bhat)
+    axs[i // 2, i % 2].scatter(
+        x, y - yhat, alpha=0.6, color="red"
+    )  # plot the residuals against x
+    axs[i // 2, i % 2].plot(
+        x, np.zeros_like(x), color="black"
+    )  # plot the zero line for reference
+    axs[i // 2, i % 2].set_xlabel(f"$x_{i+1}$")
+    axs[i // 2, i % 2].set_ylabel(f"$e_{i+1}$")
+    axs[i // 2, i % 2].set_title(f"Dataset {dataset} Residuals")
+
+plt.show()
+```
+