diff --git a/docs/search.json b/docs/search.json index 6601f8a3..491fc03a 100644 --- a/docs/search.json +++ b/docs/search.json @@ -34,7 +34,7 @@ "href": "visualization_2/visualization_2.html#kernel-density-estimation", "title": "8  Visualization II", "section": "", - "text": "8.1.1 KDE Theory\nA kernel density estimate (KDE) is a smooth, continuous function that approximates a curve. It allows us to represent general trends in a distribution without focusing on the details, which is useful for analyzing the broad structure of a dataset.\nMore formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn. You may have encountered the idea of a probability distribution in your other classes; if not, we’ll discuss it at length in the next lecture. For now, you can think of a probability distribution as a description of how likely it is for us to sample a particular value in our dataset.\nA KDE curve estimates the probability density function of a random variable. 
Consider the example below, where we have used sns.displot to plot both a histogram (containing the data points we actually collected) and a KDE curve (representing the approximated probability distribution from which this data was drawn) using data from the World Bank dataset (wb).\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport warnings \n\nwarnings.filterwarnings(\"ignore\", \"use_inf_as_na\") # Supresses distracting deprecation warnings\n\nwb = pd.read_csv(\"data/world_bank.csv\", index_col=0)\nwb = wb.rename(columns={'Antiretroviral therapy coverage: % of people living with HIV: 2015':\"HIV rate\",\n 'Gross national income per capita, Atlas method: $: 2016':'gni'})\nwb.head()\n\n\n\n\n\n\n\n\n\nContinent\nCountry\nPrimary completion rate: Male: % of relevant age group: 2015\nPrimary completion rate: Female: % of relevant age group: 2015\nLower secondary completion rate: Male: % of relevant age group: 2015\nLower secondary completion rate: Female: % of relevant age group: 2015\nYouth literacy rate: Male: % of ages 15-24: 2005-14\nYouth literacy rate: Female: % of ages 15-24: 2005-14\nAdult literacy rate: Male: % ages 15 and older: 2005-14\nAdult literacy rate: Female: % ages 15 and older: 2005-14\n...\nAccess to improved sanitation facilities: % of population: 1990\nAccess to improved sanitation facilities: % of population: 2015\nChild immunization rate: Measles: % of children ages 12-23 months: 2015\nChild immunization rate: DTP3: % of children ages 12-23 months: 2015\nChildren with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016\nChildren with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016\nChildren sleeping under treated bed nets: % of children under age 5: 2009-2016\nChildren with fever receiving antimalarial drugs: % of children under age 5 with fever: 
2009-2016\nTuberculosis: Treatment success rate: % of new cases: 2014\nTuberculosis: Cases detection rate: % of new estimated cases: 2015\n\n\n\n\n0\nAfrica\nAlgeria\n106.0\n105.0\n68.0\n85.0\n96.0\n92.0\n83.0\n68.0\n...\n80.0\n88.0\n95.0\n95.0\n66.0\n42.0\nNaN\nNaN\n88.0\n80.0\n\n\n1\nAfrica\nAngola\nNaN\nNaN\nNaN\nNaN\n79.0\n67.0\n82.0\n60.0\n...\n22.0\n52.0\n55.0\n64.0\nNaN\nNaN\n25.9\n28.3\n34.0\n64.0\n\n\n2\nAfrica\nBenin\n83.0\n73.0\n50.0\n37.0\n55.0\n31.0\n41.0\n18.0\n...\n7.0\n20.0\n75.0\n79.0\n23.0\n33.0\n72.7\n25.9\n89.0\n61.0\n\n\n3\nAfrica\nBotswana\n98.0\n101.0\n86.0\n87.0\n96.0\n99.0\n87.0\n89.0\n...\n39.0\n63.0\n97.0\n95.0\nNaN\nNaN\nNaN\nNaN\n77.0\n62.0\n\n\n5\nAfrica\nBurundi\n58.0\n66.0\n35.0\n30.0\n90.0\n88.0\n89.0\n85.0\n...\n42.0\n48.0\n93.0\n94.0\n55.0\n43.0\n53.8\n25.4\n91.0\n51.0\n\n\n\n\n5 rows × 47 columns\n\n\n\n\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\nsns.displot(data = wb, x = 'HIV rate', \\\n kde = True, stat = \"density\")\n\nplt.title(\"Distribution of HIV rates\");\n\n\n\n\n\n\n\n\nNotice that the smooth KDE curve is higher when the histogram bins are taller. You can think of the height of the KDE curve as representing how “probable” it is that we randomly sample a datapoint with the corresponding value. This intuitively makes sense – if we have already collected more datapoints with a particular value (resulting in a tall histogram bin), it is more likely that, if we randomly sample another datapoint, we will sample one with a similar value (resulting in a high KDE curve).\nThe area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. 
Hence, a KDE curve will always have an area under the curve of 1.\n\n\n8.1.2 Constructing a KDE\nWe perform kernel density estimation using three steps.\n\nPlace a kernel at each datapoint.\nNormalize the kernels to have a total area of 1 (across all kernels).\nSum the normalized kernels.\n\nWe’ll explain what a “kernel” is momentarily.\nTo make things simpler, let’s construct a KDE for a small, artificially generated dataset of 5 datapoints: \\([2.2, 2.8, 3.7, 5.3, 5.7]\\). In the plot below, each vertical bar represents one data point.\n\n\nCode\ndata = [2.2, 2.8, 3.7, 5.3, 5.7]\n\nsns.rugplot(data, height=0.3)\n\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\nOur goal is to create the following KDE curve, which was generated automatically by sns.kdeplot.\n\n\nCode\nsns.kdeplot(data)\n\nplt.xlabel(\"Data\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\n\n8.1.2.1 Step 1: Place a Kernel at Each Data Point\nTo begin generating a density curve, we need to choose a kernel and bandwidth value (\\(\\alpha\\)). What are these exactly?\nA kernel is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just one of the datapoints in our dataset: \\(2.2\\). We obtained this datapoint by randomly sampling some information out in the real world (you can imagine \\(2.2\\) as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than \\(2.2\\); it could also be lower than \\(2.2\\). We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. 
This means that our kernel – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. The area under any kernel should integrate to 1, representing the total probability of drawing a new datapoint.\nA bandwidth value, usually denoted by \\(\\alpha\\), represents the width of the kernel. A large value of \\(\\alpha\\) will result in a wide, short kernel function, while a small value with result in a narrow, tall kernel.\nBelow, we place a Gaussian kernel, plotted in orange, over the datapoint \\(2.2\\). A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.\n\n\nCode\ndef gaussian_kernel(x, z, a):\n # We'll discuss where this mathematical formulation came from later\n return (1/np.sqrt(2*np.pi*a**2)) * np.exp((-(x - z)**2 / (2 * a**2)))\n\n# Plot our datapoint\nsns.rugplot([2.2], height=0.3)\n\n# Plot the kernel\nx = np.linspace(-3, 10, 1000)\nplt.plot(x, gaussian_kernel(x, 2.2, 1))\n\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\nTo begin creating our KDE, we place a kernel on each datapoint in our dataset. 
For our dataset of 5 points, we will have 5 kernels.\n\n\nCode\n# You will work with the functions below in Lab 4\ndef create_kde(kernel, pts, a):\n # Takes in a kernel, set of points, and alpha\n # Returns the KDE as a function\n def f(x):\n output = 0\n for pt in pts:\n output += kernel(x, pt, a)\n return output / len(pts) # Normalization factor\n return f\n\ndef plot_kde(kernel, pts, a):\n # Calls create_kde and plots the corresponding KDE\n f = create_kde(kernel, pts, a)\n x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)\n y = [f(xi) for xi in x]\n plt.plot(x, y);\n \ndef plot_separate_kernels(kernel, pts, a, norm=False):\n # Plots individual kernels, which are then summed to create the KDE\n x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)\n for pt in pts:\n y = kernel(x, pt, a)\n if norm:\n y /= len(pts)\n plt.plot(x, y)\n \n plt.show();\n \nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\nplot_separate_kernels(gaussian_kernel, data, a = 1)\n\n\n\n\n\n\n\n\n\n\n\n8.1.2.2 Step 2: Normalize Kernels to Have a Total Area of 1\nAbove, we said that each kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a total area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) \\(\\times\\) (area of 1 each) = 5. To avoid this, we will normalize each of our kernels. This involves multiplying each kernel by \\(\\frac{1}{\\#\\:\\text{datapoints}}\\).\nIn the cell below, we multiply each of our 5 kernels by \\(\\frac{1}{5}\\) to apply normalization.\n\n\nCode\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\n# The `norm` argument specifies whether or not to normalize the kernels\nplot_separate_kernels(gaussian_kernel, data, a = 1, norm = True)\n\n\n\n\n\n\n\n\n\n\n\n8.1.2.3 Step 3: Sum the Normalized Kernels\nOur KDE curve is the sum of the normalized kernels. 
Notice that the final curve is identical to the plot generated by sns.kdeplot we saw earlier!\n\n\nCode\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\nsns.kdeplot(data, bw_method=0.65)\nsns.histplot(data, stat=\"density\", bins=2);\n\n\n\n\n\n\n\n\n\nAn alternative method to generate the above KDE is shown below, this time using sns.histplot’s arguments.\n\n\nCode\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\nsns.histplot(data, bins=2, kde=True, stat=\"density\", kde_kws=dict(cut=3, bw_method=0.65))\n\n\n\n\n\n\n\n\n\n\n\n\n8.1.3 Kernel Functions and Bandwidths\n\n\n\nA general “KDE formula” function is given above.\n\n\\(K_{\\alpha}(x, x_i)\\) is the kernel centered on the observation i.\n\nEach kernel individually has area 1.\nx represents any number on the number line. It is the input to our function.\n\n\\(n\\) is the number of observed datapoints that we have.\n\nWe multiply by \\(\\frac{1}{n}\\) so that the total area of the KDE is still 1.\n\nEach \\(x_i \\in \\{x_1, x_2, \\dots, x_n\\}\\) represents an observed datapoint.\n\nThese are what we use to create our KDE by summing multiple shifted kernels centered at these points.\n\n\n\n\\(\\alpha\\) (alpha) is the bandwidth or smoothing parameter.\n\nA kernel (for our purposes) is a valid density function. This means it:\n\nMust be non-negative for all inputs.\nMust integrate to 1.\n\n\n8.1.3.1 Gaussian Kernel\nThe most common kernel is the Gaussian kernel. The Gaussian kernel is equivalent to the Gaussian probability density function (the Normal distribution), centered at the observed value with a standard deviation of (this is known as the bandwidth parameter).\n\\[K_a(x, x_i) = \\frac{1}{\\sqrt{2\\pi\\alpha^{2}}}e^{-\\frac{(x-x_i)^{2}}{2\\alpha^{2}}}\\]\nIn this formula:\n\n\\(x\\) (no subscript) represents any value along the x-axis of our plot\n\\(x_i\\) represents the \\(i\\) -th datapoint in our dataset. 
It is one of the values that we have actually collected in our data sampling process. In our example earlier, \\(x_i=2.2\\). Those of you who have taken a probability class may recognize \\(x_i\\) as the mean of the normal distribution.\nEach kernel is centered on our observed values, so its distribution mean is \\(x_i\\).\n\\(\\alpha\\) is the bandwidth parameter, representing the width of our kernel. More formally, \\(\\alpha\\) is the standard deviation of the Gaussian curve.\n\nA large value of \\(\\alpha\\) will produce a kernel that is wider and shorter – this leads to a smoother KDE when the kernels are summed together.\nA small value of \\(\\alpha\\) will produce a narrower, taller kernel, and, with it, a noisier KDE.\n\n\nThe details of this (admittedly intimidating) formula are less important than understanding its role in kernel density estimation – this equation gives us the shape of each kernel.\n\n\n\n\n\n\n\nGaussian Kernel, \\(\\alpha\\) = 0.1\nGaussian Kernel, \\(\\alpha\\) = 1\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGaussian Kernel, \\(\\alpha\\) = 2\nGaussian Kernel, \\(\\alpha\\) = 5\n\n\n\n\n\n\n\n\n\n\n\n8.1.3.2 Boxcar Kernel\nAnother example of a kernel is the Boxcar kernel. The boxcar kernel assigns a uniform density to points within a “window” of the observation, and a density of 0 elsewhere. 
The equation below is a boxcar kernel with the center at \\(x_i\\) and the bandwidth of \\(\\alpha\\).\n\\[K_a(x, x_i) = \\begin{cases}\n \\frac{1}{\\alpha}, & |x - x_i| \\le \\frac{\\alpha}{2}\\\\\n 0, & \\text{else }\n \\end{cases}\\]\nThe boxcar kernel is seldom used in practice – we include it here to demonstrate that a kernel function can take whatever form you would like, provided it integrates to 1 and does not output negative values.\n\n\nCode\ndef boxcar_kernel(alpha, x, z):\n return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha\n\nxs = np.linspace(-5, 5, 200)\nalpha=1\nkde_curve = [boxcar_kernel(alpha, x, 0) for x in xs]\nplt.plot(xs, kde_curve);\n\n\n\n\n\nThe Boxcar kernel centered at 0 with bandwidth \\(\\alpha\\) = 1.\n\n\n\n\nThe diagram on the right is how the density curve for our 5 point dataset would have looked had we used the Boxcar kernel with bandwidth \\(\\alpha\\) = 1.\n\n\n\n\n\n\n\nKDE\nBoxcar", + "text": "8.1.1 KDE Theory\nA kernel density estimate (KDE) is a smooth, continuous function that approximates a curve. It allows us to represent general trends in a distribution without focusing on the details, which is useful for analyzing the broad structure of a dataset.\nMore formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn. You may have encountered the idea of a probability distribution in your other classes; if not, we’ll discuss it at length in the next lecture. For now, you can think of a probability distribution as a description of how likely it is for us to sample a particular value in our dataset.\nA KDE curve estimates the probability density function of a random variable. 
Consider the example below, where we have used sns.displot to plot both a histogram (containing the data points we actually collected) and a KDE curve (representing the approximated probability distribution from which this data was drawn) using data from the World Bank dataset (wb).\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport warnings \n\nwarnings.filterwarnings(\"ignore\", \"use_inf_as_na\") # Supresses distracting deprecation warnings\n\nwb = pd.read_csv(\"data/world_bank.csv\", index_col=0)\nwb = wb.rename(columns={'Antiretroviral therapy coverage: % of people living with HIV: 2015':\"HIV rate\",\n 'Gross national income per capita, Atlas method: $: 2016':'gni'})\nwb.head()\n\n\n\n\n\n\n\n\n\nContinent\nCountry\nPrimary completion rate: Male: % of relevant age group: 2015\nPrimary completion rate: Female: % of relevant age group: 2015\nLower secondary completion rate: Male: % of relevant age group: 2015\nLower secondary completion rate: Female: % of relevant age group: 2015\nYouth literacy rate: Male: % of ages 15-24: 2005-14\nYouth literacy rate: Female: % of ages 15-24: 2005-14\nAdult literacy rate: Male: % ages 15 and older: 2005-14\nAdult literacy rate: Female: % ages 15 and older: 2005-14\n...\nAccess to improved sanitation facilities: % of population: 1990\nAccess to improved sanitation facilities: % of population: 2015\nChild immunization rate: Measles: % of children ages 12-23 months: 2015\nChild immunization rate: DTP3: % of children ages 12-23 months: 2015\nChildren with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016\nChildren with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016\nChildren sleeping under treated bed nets: % of children under age 5: 2009-2016\nChildren with fever receiving antimalarial drugs: % of children under age 5 with fever: 
2009-2016\nTuberculosis: Treatment success rate: % of new cases: 2014\nTuberculosis: Cases detection rate: % of new estimated cases: 2015\n\n\n\n\n0\nAfrica\nAlgeria\n106.0\n105.0\n68.0\n85.0\n96.0\n92.0\n83.0\n68.0\n...\n80.0\n88.0\n95.0\n95.0\n66.0\n42.0\nNaN\nNaN\n88.0\n80.0\n\n\n1\nAfrica\nAngola\nNaN\nNaN\nNaN\nNaN\n79.0\n67.0\n82.0\n60.0\n...\n22.0\n52.0\n55.0\n64.0\nNaN\nNaN\n25.9\n28.3\n34.0\n64.0\n\n\n2\nAfrica\nBenin\n83.0\n73.0\n50.0\n37.0\n55.0\n31.0\n41.0\n18.0\n...\n7.0\n20.0\n75.0\n79.0\n23.0\n33.0\n72.7\n25.9\n89.0\n61.0\n\n\n3\nAfrica\nBotswana\n98.0\n101.0\n86.0\n87.0\n96.0\n99.0\n87.0\n89.0\n...\n39.0\n63.0\n97.0\n95.0\nNaN\nNaN\nNaN\nNaN\n77.0\n62.0\n\n\n5\nAfrica\nBurundi\n58.0\n66.0\n35.0\n30.0\n90.0\n88.0\n89.0\n85.0\n...\n42.0\n48.0\n93.0\n94.0\n55.0\n43.0\n53.8\n25.4\n91.0\n51.0\n\n\n\n\n5 rows × 47 columns\n\n\n\n\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\nsns.displot(data = wb, x = 'HIV rate', \\\n kde = True, stat = \"density\")\n\nplt.title(\"Distribution of HIV rates\");\n\n\n\n\n\n\n\n\nNotice that the smooth KDE curve is higher when the histogram bins are taller. You can think of the height of the KDE curve as representing how “probable” it is that we randomly sample a datapoint with the corresponding value. This intuitively makes sense – if we have already collected more datapoints with a particular value (resulting in a tall histogram bin), it is more likely that, if we randomly sample another datapoint, we will sample one with a similar value (resulting in a high KDE curve).\nThe area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. 
Hence, a KDE curve will always have an area under the curve of 1.\n\n\n8.1.2 Constructing a KDE\nWe perform kernel density estimation using three steps.\n\nPlace a kernel at each datapoint.\nNormalize the kernels to have a total area of 1 (across all kernels).\nSum the normalized kernels.\n\nWe’ll explain what a “kernel” is momentarily.\nTo make things simpler, let’s construct a KDE for a small, artificially generated dataset of 5 datapoints: \\([2.2, 2.8, 3.7, 5.3, 5.7]\\). In the plot below, each vertical bar represents one data point.\n\n\nCode\ndata = [2.2, 2.8, 3.7, 5.3, 5.7]\n\nsns.rugplot(data, height=0.3)\n\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\nOur goal is to create the following KDE curve, which was generated automatically by sns.kdeplot.\n\n\nCode\nsns.kdeplot(data)\n\nplt.xlabel(\"Data\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\nAlternatively, we can use sns.histplot. This plot also visualizes the underlying bins as a histogram.\n\n\nCode\nsns.histplot(data, bins=2, kde=True, stat=\"density\", kde_kws=dict(cut=3, bw_method=0.65))\n\nplt.xlabel(\"Data\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\n\n8.1.2.1 Step 1: Place a Kernel at Each Data Point\nTo begin generating a density curve, we need to choose a kernel and bandwidth value (\\(\\alpha\\)). What are these exactly?\nA kernel is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just one of the datapoints in our dataset: \\(2.2\\). We obtained this datapoint by randomly sampling some information out in the real world (you can imagine \\(2.2\\) as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than \\(2.2\\); it could also be lower than \\(2.2\\). 
We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. This means that our kernel – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. The area under any kernel should integrate to 1, representing the total probability of drawing a new datapoint.\nA bandwidth value, usually denoted by \\(\\alpha\\), represents the width of the kernel. A large value of \\(\\alpha\\) will result in a wide, short kernel function, while a small value with result in a narrow, tall kernel.\nBelow, we place a Gaussian kernel, plotted in orange, over the datapoint \\(2.2\\). A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.\n\n\nCode\ndef gaussian_kernel(x, z, a):\n # We'll discuss where this mathematical formulation came from later\n return (1/np.sqrt(2*np.pi*a**2)) * np.exp((-(x - z)**2 / (2 * a**2)))\n\n# Plot our datapoint\nsns.rugplot([2.2], height=0.3)\n\n# Plot the kernel\nx = np.linspace(-3, 10, 1000)\nplt.plot(x, gaussian_kernel(x, 2.2, 1))\n\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5);\n\n\n\n\n\n\n\n\n\nTo begin creating our KDE, we place a kernel on each datapoint in our dataset. 
For our dataset of 5 points, we will have 5 kernels.\n\n\nCode\n# You will work with the functions below in Lab 4\ndef create_kde(kernel, pts, a):\n # Takes in a kernel, set of points, and alpha\n # Returns the KDE as a function\n def f(x):\n output = 0\n for pt in pts:\n output += kernel(x, pt, a)\n return output / len(pts) # Normalization factor\n return f\n\ndef plot_kde(kernel, pts, a):\n # Calls create_kde and plots the corresponding KDE\n f = create_kde(kernel, pts, a)\n x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)\n y = [f(xi) for xi in x]\n plt.plot(x, y);\n \ndef plot_separate_kernels(kernel, pts, a, norm=False):\n # Plots individual kernels, which are then summed to create the KDE\n x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)\n for pt in pts:\n y = kernel(x, pt, a)\n if norm:\n y /= len(pts)\n plt.plot(x, y)\n \n plt.show();\n \nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\nplot_separate_kernels(gaussian_kernel, data, a = 1)\n\n\n\n\n\n\n\n\n\n\n\n8.1.2.2 Step 2: Normalize Kernels to Have a Total Area of 1\nAbove, we said that each kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a total area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) \\(\\times\\) (area of 1 each) = 5. To avoid this, we will normalize each of our kernels. This involves multiplying each kernel by \\(\\frac{1}{\\#\\:\\text{datapoints}}\\).\nIn the cell below, we multiply each of our 5 kernels by \\(\\frac{1}{5}\\) to apply normalization.\n\n\nCode\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\n# The `norm` argument specifies whether or not to normalize the kernels\nplot_separate_kernels(gaussian_kernel, data, a = 1, norm = True)\n\n\n\n\n\n\n\n\n\n\n\n8.1.2.3 Step 3: Sum the Normalized Kernels\nOur KDE curve is the sum of the normalized kernels. 
Notice that the final curve is identical to the plot generated by sns.kdeplot we saw earlier!\n\n\nCode\nplt.xlim(-3, 10)\nplt.ylim(0, 0.5)\nplt.xlabel(\"Data\")\nplt.ylabel(\"Density\")\n\nplot_kde(gaussian_kernel, data, a=1)\n\n\n\n\n\n\n\n\n\n\n\n\n8.1.3 Kernel Functions and Bandwidths\n\n\n\nA general “KDE formula” function is given above.\n\n\\(K_{\\alpha}(x, x_i)\\) is the kernel centered on the observation i.\n\nEach kernel individually has area 1.\nx represents any number on the number line. It is the input to our function.\n\n\\(n\\) is the number of observed datapoints that we have.\n\nWe multiply by \\(\\frac{1}{n}\\) so that the total area of the KDE is still 1.\n\nEach \\(x_i \\in \\{x_1, x_2, \\dots, x_n\\}\\) represents an observed datapoint.\n\nThese are what we use to create our KDE by summing multiple shifted kernels centered at these points.\n\n\n\n\\(\\alpha\\) (alpha) is the bandwidth or smoothing parameter.\n\nA kernel (for our purposes) is a valid density function. This means it:\n\nMust be non-negative for all inputs.\nMust integrate to 1.\n\n\n8.1.3.1 Gaussian Kernel\nThe most common kernel is the Gaussian kernel. The Gaussian kernel is equivalent to the Gaussian probability density function (the Normal distribution), centered at the observed value with a standard deviation of (this is known as the bandwidth parameter).\n\\[K_a(x, x_i) = \\frac{1}{\\sqrt{2\\pi\\alpha^{2}}}e^{-\\frac{(x-x_i)^{2}}{2\\alpha^{2}}}\\]\nIn this formula:\n\n\\(x\\) (no subscript) represents any value along the x-axis of our plot\n\\(x_i\\) represents the \\(i\\) -th datapoint in our dataset. It is one of the values that we have actually collected in our data sampling process. In our example earlier, \\(x_i=2.2\\). 
Those of you who have taken a probability class may recognize \\(x_i\\) as the mean of the normal distribution.\nEach kernel is centered on our observed values, so its distribution mean is \\(x_i\\).\n\\(\\alpha\\) is the bandwidth parameter, representing the width of our kernel. More formally, \\(\\alpha\\) is the standard deviation of the Gaussian curve.\n\nA large value of \\(\\alpha\\) will produce a kernel that is wider and shorter – this leads to a smoother KDE when the kernels are summed together.\nA small value of \\(\\alpha\\) will produce a narrower, taller kernel, and, with it, a noisier KDE.\n\n\nThe details of this (admittedly intimidating) formula are less important than understanding its role in kernel density estimation – this equation gives us the shape of each kernel.\n\n\n\n\n\n\n\nGaussian Kernel, \\(\\alpha\\) = 0.1\nGaussian Kernel, \\(\\alpha\\) = 1\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGaussian Kernel, \\(\\alpha\\) = 2\nGaussian Kernel, \\(\\alpha\\) = 5\n\n\n\n\n\n\n\n\n\n\n\n8.1.3.2 Boxcar Kernel\nAnother example of a kernel is the Boxcar kernel. The boxcar kernel assigns a uniform density to points within a “window” of the observation, and a density of 0 elsewhere. 
The equation below is a boxcar kernel with the center at \\(x_i\\) and the bandwidth of \\(\\alpha\\).\n\\[K_a(x, x_i) = \\begin{cases}\n \\frac{1}{\\alpha}, & |x - x_i| \\le \\frac{\\alpha}{2}\\\\\n 0, & \\text{else }\n \\end{cases}\\]\nThe boxcar kernel is seldom used in practice – we include it here to demonstrate that a kernel function can take whatever form you would like, provided it integrates to 1 and does not output negative values.\n\n\nCode\ndef boxcar_kernel(alpha, x, z):\n return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha\n\nxs = np.linspace(-5, 5, 200)\nalpha=1\nkde_curve = [boxcar_kernel(alpha, x, 0) for x in xs]\nplt.plot(xs, kde_curve);\n\n\n\n\n\nThe Boxcar kernel centered at 0 with bandwidth \\(\\alpha\\) = 1.\n\n\n\n\nThe diagram on the right is how the density curve for our 5 point dataset would have looked had we used the Boxcar kernel with bandwidth \\(\\alpha\\) = 1.\n\n\n\n\n\n\n\nKDE\nBoxcar", "crumbs": [ "8  Visualization II" ] diff --git a/docs/visualization_2/visualization_2.html b/docs/visualization_2/visualization_2.html index 45ed0656..66fed2b4 100644 --- a/docs/visualization_2/visualization_2.html +++ b/docs/visualization_2/visualization_2.html @@ -348,7 +348,7 @@

A kernel density estimate (KDE) is a smooth, continuous function that approximates the shape of a distribution. It allows us to represent general trends in the data without focusing on individual details, which is useful for analyzing the broad structure of a dataset.

More formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn. You may have encountered the idea of a probability distribution in your other classes; if not, we’ll discuss it at length in the next lecture. For now, you can think of a probability distribution as a description of how likely it is for us to sample a particular value in our dataset.

A KDE curve estimates the probability density function of a random variable. Consider the example below, where we have used sns.displot to plot both a histogram (containing the data points we actually collected) and a KDE curve (representing the approximated probability distribution from which this data was drawn) using data from the World Bank dataset (wb).

-
+
Code
import pandas as pd
@@ -364,7 +364,7 @@ 

'Gross national income per capita, Atlas method: $: 2016':'gni'}) wb.head()

-
+
@@ -523,7 +523,7 @@

-
+
import seaborn as sns
 import matplotlib.pyplot as plt
 
@@ -552,7 +552,7 @@ 

We’ll explain what a “kernel” is momentarily.

To make things simpler, let’s construct a KDE for a small, artificially generated dataset of 5 datapoints: \([2.2, 2.8, 3.7, 5.3, 5.7]\). In the plot below, each vertical bar represents one data point.

-
+
Code
data = [2.2, 2.8, 3.7, 5.3, 5.7]
@@ -573,7 +573,7 @@ 

Our goal is to create the following KDE curve, which was generated automatically by sns.kdeplot.

-
+
Code
sns.kdeplot(data)
@@ -590,30 +590,15 @@ 

-
-

8.1.2.1 Step 1: Place a Kernel at Each Data Point

-

To begin generating a density curve, we need to choose a kernel and bandwidth value (\(\alpha\)). What are these exactly?

-

A kernel is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just one of the datapoints in our dataset: \(2.2\). We obtained this datapoint by randomly sampling some information out in the real world (you can imagine \(2.2\) as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than \(2.2\); it could also be lower than \(2.2\). We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. This means that our kernel – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. The area under any kernel should integrate to 1, representing the total probability of drawing a new datapoint.

-

A bandwidth value, usually denoted by \(\alpha\), represents the width of the kernel. A large value of \(\alpha\) will result in a wide, short kernel function, while a small value will result in a narrow, tall kernel.

-

Below, we place a Gaussian kernel, plotted in orange, over the datapoint \(2.2\). A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.

-
+

Alternatively, we can use sns.histplot. This plot also displays the underlying histogram bins alongside the KDE curve.

+
Code
-def gaussian_kernel(x, z, a):
-    # We'll discuss where this mathematical formulation came from later
-    return (1/np.sqrt(2*np.pi*a**2)) * np.exp((-(x - z)**2 / (2 * a**2)))
-
-# Plot our datapoint
-sns.rugplot([2.2], height=0.3)
-
-# Plot the kernel
-x = np.linspace(-3, 10, 1000)
-plt.plot(x, gaussian_kernel(x, 2.2, 1))
-
-plt.xlabel("Data")
-plt.ylabel("Density")
-plt.xlim(-3, 10)
-plt.ylim(0, 0.5);
+
+sns.histplot(data, bins=2, kde=True, stat="density", kde_kws=dict(cut=3, bw_method=0.65))
+
+plt.xlabel("Data")
+plt.xlim(-3, 10)
+plt.ylim(0, 0.5);
@@ -623,45 +608,30 @@

+
+

8.1.2.1 Step 1: Place a Kernel at Each Data Point

+

To begin generating a density curve, we need to choose a kernel and bandwidth value (\(\alpha\)). What are these exactly?

+

A kernel is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just one of the datapoints in our dataset: \(2.2\). We obtained this datapoint by randomly sampling some information from the real world (you can imagine \(2.2\) as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than \(2.2\); it could also be lower than \(2.2\). We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. This means that our kernel – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. Any kernel should integrate to 1, representing the total probability of drawing a new datapoint.

+

A bandwidth value, usually denoted by \(\alpha\), represents the width of the kernel. A large value of \(\alpha\) will result in a wide, short kernel function, while a small value will result in a narrow, tall kernel.

+

Below, we place a Gaussian kernel, plotted in orange, over the datapoint \(2.2\). A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.

+
Code -
# You will work with the functions below in Lab 4
-def create_kde(kernel, pts, a):
-    # Takes in a kernel, set of points, and alpha
-    # Returns the KDE as a function
-    def f(x):
-        output = 0
-        for pt in pts:
-            output += kernel(x, pt, a)
-        return output / len(pts) # Normalization factor
-    return f
+
def gaussian_kernel(x, z, a):
+    # We'll discuss where this mathematical formulation came from later
+    return (1/np.sqrt(2*np.pi*a**2)) * np.exp((-(x - z)**2 / (2 * a**2)))
+
+# Plot our datapoint
+sns.rugplot([2.2], height=0.3)
+
+# Plot the kernel
+x = np.linspace(-3, 10, 1000)
+plt.plot(x, gaussian_kernel(x, 2.2, 1))
 
-def plot_kde(kernel, pts, a):
-    # Calls create_kde and plots the corresponding KDE
-    f = create_kde(kernel, pts, a)
-    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
-    y = [f(xi) for xi in x]
-    plt.plot(x, y);
-    
-def plot_separate_kernels(kernel, pts, a, norm=False):
-    # Plots individual kernels, which are then summed to create the KDE
-    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
-    for pt in pts:
-        y = kernel(x, pt, a)
-        if norm:
-            y /= len(pts)
-        plt.plot(x, y)
-    
-    plt.show();
-    
-plt.xlim(-3, 10)
-plt.ylim(0, 0.5)
-plt.xlabel("Data")
-plt.ylabel("Density")
-
-plot_separate_kernels(gaussian_kernel, data, a = 1)
+plt.xlabel("Data")
+plt.ylabel("Density")
+plt.xlim(-3, 10)
+plt.ylim(0, 0.5);
@@ -671,21 +641,45 @@

-

8.1.2.2 Step 2: Normalize Kernels to Have a Total Area of 1

-

Above, we said that each kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a total area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) \(\times\) (area of 1 each) = 5. To avoid this, we will normalize each of our kernels. This involves multiplying each kernel by \(\frac{1}{\#\:\text{datapoints}}\).
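This normalization arithmetic is easy to verify numerically. The sketch below assumes a hypothetical 5-point dataset (`pts` is illustrative, not the actual data from the lecture) and uses the Gaussian kernel from this section:

```python
import numpy as np

def gaussian_kernel(x, z, a):
    # Gaussian (normal) density centered at z with bandwidth a
    return (1 / np.sqrt(2 * np.pi * a**2)) * np.exp(-((x - z) ** 2) / (2 * a**2))

pts = [2.2, 2.8, 3.7, 5.3, 5.7]  # hypothetical 5-point dataset
x = np.linspace(-20, 30, 100_001)
dx = x[1] - x[0]

raw = sum(gaussian_kernel(x, pt, 1) for pt in pts)
normalized = raw / len(pts)

print(round(raw.sum() * dx, 2))         # 5.0 — one unit of area per kernel
print(round(normalized.sum() * dx, 2))  # 1.0 — after multiplying by 1/5
```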

-

In the cell below, we multiply each of our 5 kernels by \(\frac{1}{5}\) to apply normalization.

-
+

To begin creating our KDE, we place a kernel on each datapoint in our dataset. For our dataset of 5 points, we will have 5 kernels.

+
Code -
plt.xlim(-3, 10)
-plt.ylim(0, 0.5)
-plt.xlabel("Data")
-plt.ylabel("Density")
-
-# The `norm` argument specifies whether or not to normalize the kernels
-plot_separate_kernels(gaussian_kernel, data, a = 1, norm = True)
+
# You will work with the functions below in Lab 4
+def create_kde(kernel, pts, a):
+    # Takes in a kernel, set of points, and alpha
+    # Returns the KDE as a function
+    def f(x):
+        output = 0
+        for pt in pts:
+            output += kernel(x, pt, a)
+        return output / len(pts) # Normalization factor
+    return f
+
+def plot_kde(kernel, pts, a):
+    # Calls create_kde and plots the corresponding KDE
+    f = create_kde(kernel, pts, a)
+    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
+    y = [f(xi) for xi in x]
+    plt.plot(x, y);
+    
+def plot_separate_kernels(kernel, pts, a, norm=False):
+    # Plots individual kernels, which are then summed to create the KDE
+    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
+    for pt in pts:
+        y = kernel(x, pt, a)
+        if norm:
+            y /= len(pts)
+        plt.plot(x, y)
+    
+    plt.show();
+    
+plt.xlim(-3, 10)
+plt.ylim(0, 0.5)
+plt.xlabel("Data")
+plt.ylabel("Density")
+
+plot_separate_kernels(gaussian_kernel, data, a = 1)
@@ -696,10 +690,11 @@

-

8.1.2.3 Step 3: Sum the Normalized Kernels

-

Our KDE curve is the sum of the normalized kernels. Notice that the final curve is identical to the plot generated by sns.kdeplot we saw earlier!
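In code, "sum the normalized kernels" is just an average of the kernels at each evaluation point. A minimal sketch, again assuming a hypothetical 5-point dataset:

```python
import numpy as np

def gaussian_kernel(x, z, a):
    # Gaussian (normal) density centered at z with bandwidth a
    return (1 / np.sqrt(2 * np.pi * a**2)) * np.exp(-((x - z) ** 2) / (2 * a**2))

def kde_at(x, pts, a):
    # The KDE at x is the average of the kernels evaluated at x
    return sum(gaussian_kernel(x, pt, a) for pt in pts) / len(pts)

pts = [2.2, 2.8, 3.7, 5.3, 5.7]  # hypothetical dataset
print(round(kde_at(3.0, pts, a=1), 4))  # estimated density near the cluster of points
```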

-
+
+

8.1.2.2 Step 2: Normalize Kernels to Have a Total Area of 1

+

Above, we said that each kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a total area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) \(\times\) (area of 1 each) = 5. To avoid this, we will normalize each of our kernels. This involves multiplying each kernel by \(\frac{1}{\#\:\text{datapoints}}\).

+

In the cell below, we multiply each of our 5 kernels by \(\frac{1}{5}\) to apply normalization.

+
Code
plt.xlim(-3, 10)
@@ -707,8 +702,8 @@ 

 plt.xlabel("Data")
 plt.ylabel("Density")
-sns.kdeplot(data, bw_method=0.65)
-sns.histplot(data, stat="density", bins=2);

+# The `norm` argument specifies whether or not to normalize the kernels
+plot_separate_kernels(gaussian_kernel, data, a = 1, norm = True)

@@ -718,8 +713,11 @@

+
+
+

8.1.2.3 Step 3: Sum the Normalized Kernels

+

Our KDE curve is the sum of the normalized kernels. Notice that the final curve is identical to the plot generated by sns.kdeplot we saw earlier!

+
Code
plt.xlim(-3, 10)
@@ -727,7 +725,7 @@ 

 plt.xlabel("Data")
 plt.ylabel("Density")
-sns.histplot(data, bins=2, kde=True, stat="density", kde_kws=dict(cut=3, bw_method=0.65))

+plot_kde(gaussian_kernel, data, a=1)
@@ -829,7 +827,7 @@

The boxcar kernel is seldom used in practice – we include it here to demonstrate that a kernel function can take whatever form you would like, provided it integrates to 1 and does not output negative values.

-
+
Code
def boxcar_kernel(alpha, x, z):
@@ -876,7 +874,7 @@ 

. Note that here we’ve specified stat="density" to normalize the histogram such that the area under the histogram is equal to 1.

-
+
sns.displot(data=wb, 
             x="gni", 
             kind="hist", 
@@ -891,7 +889,7 @@ 

!

-
+
sns.displot(data=wb, 
             x="gni", 
             kind='kde')
@@ -905,7 +903,7 @@ 

.

-
+
sns.displot(data=wb, 
             x="gni", 
             kind='ecdf')
@@ -926,7 +924,7 @@ 

8.3.0.1 Scatter Plots

Scatter plots are one of the most useful tools in representing the relationship between pairs of quantitative variables. They are particularly important in gauging the strength, or correlation, of the relationship between variables. Knowledge of these relationships can then motivate decisions in our modeling process.

In matplotlib, we use the function plt.scatter to generate a scatter plot. Notice that, unlike our examples of plotting single-variable distributions, now we specify sequences of values to be plotted along the x-axis and the y-axis.

-
+
plt.scatter(wb["per capita: % growth: 2016"], \
             wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'])
 
@@ -942,7 +940,7 @@ 

In seaborn, we call the function sns.scatterplot. We use the x and y parameters to indicate the values to be plotted along the x and y axes, respectively. By using the hue parameter, we can specify a third variable to be used for coloring each scatter point.

-
+
sns.scatterplot(data = wb, x = "per capita: % growth: 2016", \
                y = "Adult literacy rate: Female: % ages 15 and older: 2005-14", 
                hue = "Continent")
@@ -965,7 +963,7 @@ 
Jittering is the process of adding a small amount of random noise to all x and y values to slightly shift the position of each datapoint. By randomly shifting all the data by some small distance, we can discern individual points more clearly without modifying the major trends of the original dataset.

In the cell below, we first jitter the data using np.random.uniform, then re-plot it with smaller markers. The resulting plot is much easier to interpret.

-
+
# Setting a seed ensures that we produce the same plot each time
 # This means that the course notes will not change each time you access them
 np.random.seed(150)
@@ -999,7 +997,7 @@ 
8.3.0.2 lmplot and jointplot

seaborn also includes several built-in functions for creating more sophisticated scatter plots. Two of the most commonly used examples are sns.lmplot and sns.jointplot.

sns.lmplot plots both a scatter plot and a linear regression line, all in one function call. We’ll discuss linear regression in a few lectures.

-
+
sns.lmplot(data = wb, x = "per capita: % growth: 2016", \
            y = "Adult literacy rate: Female: % ages 15 and older: 2005-14")
 
@@ -1013,7 +1011,7 @@ 

sns.jointplot creates a visualization with three components: a scatter plot, a histogram of the distribution of x values, and a histogram of the distribution of y values.

-
+
sns.jointplot(data = wb, x = "per capita: % growth: 2016", \
            y = "Adult literacy rate: Female: % ages 15 and older: 2005-14")
 
@@ -1034,7 +1032,7 @@ 

For datasets with a very large number of datapoints, jittering is unlikely to fully resolve the issue of overplotting. In these cases, we can attempt to visualize our data by its density, rather than displaying each individual datapoint.

Hex plots can be thought of as two-dimensional histograms that show the joint distribution between two variables. This is particularly useful when working with very dense data. In a hex plot, the x-y plane is binned into hexagons. Hexagons that are darker in color indicate a greater density of data – that is, there are more data points that lie in the region enclosed by the hexagon.

We can generate a hex plot using sns.jointplot modified with the kind parameter.

-
+
sns.jointplot(data = wb, x = "per capita: % growth: 2016", \
               y = "Adult literacy rate: Female: % ages 15 and older: 2005-14", \
               kind = "hex")
@@ -1055,7 +1053,7 @@ 

8.3.0.4 Contour Plots

Contour plots are an alternative way of plotting the joint distribution of two variables. You can think of them as the 2-dimensional versions of KDE plots. A contour plot can be interpreted in a similar way to a topographic map. Each contour line represents an area that has the same density of datapoints throughout the region. Contours marked with darker colors contain more datapoints (a higher density) in that region.

sns.kdeplot will generate a contour plot if we specify both x and y data.

-
+
sns.kdeplot(data = wb, x = "per capita: % growth: 2016", \
             y = "Adult literacy rate: Female: % ages 15 and older: 2005-14", \
             fill = True)
@@ -1077,7 +1075,7 @@ 

Much of this was done to uncover insights in data, which will prove necessary when we begin building models of data later in the course. A strong graphical correlation between two variables hints at an underlying relationship that we may want to study in greater detail. However, relying on visual relationships alone is limiting; not all plots show association. The presence of outliers and other statistical anomalies also makes data hard to interpret.

Transformations are the process of manipulating data to find significant relationships between variables. These are often found by applying mathematical functions to variables that “transform” their range of possible values and highlight some previously hidden associations between data.

To see why we may want to transform data, consider the following plot of adult literacy rates against gross national income.

-
+
Code
# Some data cleaning to help with the next example
@@ -1121,7 +1119,7 @@ 

\(\log{(100)} = 4.61\) and \(\log{(10)} = 2.3\)).

In Data 100 (and most upper-division STEM classes), \(\log\) is used to refer to the natural logarithm with base \(e\).
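The figures quoted above come straight from np.log: values that differ by a multiplicative factor of 10 end up separated by only about 2.3 after the transformation.

```python
import numpy as np

# The natural log compresses large values far more than small ones:
# 10 and 100 differ by a factor of 10, but their logs differ by ~2.3
print(round(np.log(10), 2), round(np.log(100), 2))  # 2.3 4.61
```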

-
+
# np.log takes the logarithm of an array or Series
 plt.scatter(np.log(df["inc"]), df["lit"])
 
@@ -1144,7 +1142,7 @@ 

\(2^4 = 16\) and \(200^4 = 1600000000\)).

-
+
# Apply a log transformation to the x values and a power transformation to the y values
 plt.scatter(np.log(df["inc"]), df["lit"]**4)
 
@@ -1165,7 +1163,7 @@ 

\[y^4 = m(\log{x}) + b\]

Here, \(m\) represents the slope of the linear fit and \(b\) represents the intercept.

The cell below computes \(m\) and \(b\) for our transformed data. We’ll discuss how this code was generated in a future lecture.

-
+
Code
# The code below fits a linear regression model. We'll discuss it at length in a future lecture
@@ -1203,7 +1201,7 @@ 

\(x\) and \(y\).

\[y = [m(\log{x}) + b]^{(1/4)}\]

When we plug in the values for \(m\) and \(b\) computed above, something interesting happens.

-
+
Code
# Now, plug the values for m and b into the relationship between the untransformed x and y
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-10-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-10-output-1.png
index 27468a46..16876c4d 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-10-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-10-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-11-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-11-output-2.png
deleted file mode 100644
index e4a89e13..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-11-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-12-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-12-output-2.png
deleted file mode 100644
index c5644115..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-12-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-13-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-13-output-2.png
deleted file mode 100644
index c802680f..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-13-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png
index 7d55dffe..e33716b1 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-2.png
deleted file mode 100644
index 9d20a774..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-18-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-19-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-19-output-2.png
deleted file mode 100644
index cf460dcf..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-19-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-20-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-20-output-2.png
deleted file mode 100644
index 916ff06f..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-20-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-24-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-24-output-2.png
deleted file mode 100644
index c416ca30..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-24-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-25-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-25-output-1.png
deleted file mode 100644
index 2ad64eb3..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-25-output-1.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-3-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-3-output-2.png
deleted file mode 100644
index d90bc5f6..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-3-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-4-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-4-output-2.png
deleted file mode 100644
index 04b3e63e..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-4-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-5-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-5-output-2.png
deleted file mode 100644
index d45c62e3..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-5-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-1.png
index 1e3539b2..27468a46 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-2.png b/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-2.png
deleted file mode 100644
index 1e3539b2..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-6-output-2.png and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-7-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-7-output-1.png
index 9ba547f2..1e3539b2 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-7-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-7-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-8-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-8-output-1.png
index 35bf23c0..9ba547f2 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-8-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-8-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-html/cell-9-output-1.png b/docs/visualization_2/visualization_2_files/figure-html/cell-9-output-1.png
index a6f13e27..35bf23c0 100644
Binary files a/docs/visualization_2/visualization_2_files/figure-html/cell-9-output-1.png and b/docs/visualization_2/visualization_2_files/figure-html/cell-9-output-1.png differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf
deleted file mode 100644
index d6d1e7e5..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-10-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf
deleted file mode 100644
index 408e75a9..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-2.pdf
deleted file mode 100644
index bda8cdf0..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-11-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf
deleted file mode 100644
index a7a63844..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-2.pdf
deleted file mode 100644
index 882d332b..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-12-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf
deleted file mode 100644
index a75e6c3f..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-2.pdf
deleted file mode 100644
index 788a3158..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-13-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf
deleted file mode 100644
index c0cb322a..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-14-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf
deleted file mode 100644
index 595bbf18..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-15-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf
deleted file mode 100644
index 85c7cfd8..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-16-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf
deleted file mode 100644
index 7e9a6959..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-17-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf
deleted file mode 100644
index b2f5a4b2..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-2.pdf
deleted file mode 100644
index 9888e1bf..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-18-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf
deleted file mode 100644
index 613781de..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-2.pdf
deleted file mode 100644
index b6c6327e..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-19-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf
deleted file mode 100644
index 0488ae23..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-2.pdf
deleted file mode 100644
index 23c9dff9..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-20-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf
deleted file mode 100644
index c657f3ef..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-21-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf
deleted file mode 100644
index 8d6649fc..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-22-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf
deleted file mode 100644
index 97f795d1..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-23-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf
deleted file mode 100644
index 9bcedb5f..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-2.pdf
deleted file mode 100644
index 39d611fc..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-24-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-1.pdf
deleted file mode 100644
index 82ee7eb4..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf
deleted file mode 100644
index 30720779..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-25-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf
deleted file mode 100644
index 99efadda..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-26-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf
deleted file mode 100644
index 6d22cde7..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-2.pdf
deleted file mode 100644
index 6bec915d..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-3-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf
deleted file mode 100644
index b8c3370a..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-2.pdf
deleted file mode 100644
index a93e0bef..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-4-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf
deleted file mode 100644
index 6c1fd3af..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-2.pdf
deleted file mode 100644
index b28e4d7d..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-5-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf
deleted file mode 100644
index 2bb1512a..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-2.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-2.pdf
deleted file mode 100644
index 47a09495..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-6-output-2.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf
deleted file mode 100644
index fb8aff75..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-7-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf
deleted file mode 100644
index 188bdbfc..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-8-output-1.pdf and /dev/null differ
diff --git a/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf b/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf
deleted file mode 100644
index c6bbff68..00000000
Binary files a/docs/visualization_2/visualization_2_files/figure-pdf/cell-9-output-1.pdf and /dev/null differ
diff --git a/visualization_2/visualization_2.qmd b/visualization_2/visualization_2.qmd
index 5861eeed..1531766c 100644
--- a/visualization_2/visualization_2.qmd
+++ b/visualization_2/visualization_2.qmd
@@ -107,6 +107,17 @@ plt.xlim(-3, 10)
 plt.ylim(0, 0.5);
 ```
 
+Alternatively, we can use `sns.histplot` with `kde=True`, which overlays the KDE curve on the underlying histogram bins.
+
+```{python}
+#| code-fold: true
+sns.histplot(data, bins=2, kde=True, stat="density", kde_kws=dict(cut=3, bw_method=0.65))
+
+plt.xlabel("Data")
+plt.xlim(-3, 10)
+plt.ylim(0, 0.5);
+```
+
 #### Step 1: Place a Kernel at Each Data Point
 
 To begin generating a density curve, we need to choose a **kernel** and **bandwidth value ($\alpha$)**. What are these exactly? 
@@ -207,20 +218,7 @@ plt.ylim(0, 0.5)
 plt.xlabel("Data")
 plt.ylabel("Density")
 
-sns.kdeplot(data, bw_method=0.65)
-sns.histplot(data, stat="density", bins=2);
-```
-
-An alternative method to generate the above KDE is shown below, this time using `sns.histplot`'s arguments.
-
-```{python}
-#| code-fold: true
-plt.xlim(-3, 10)
-plt.ylim(0, 0.5)
-plt.xlabel("Data")
-plt.ylabel("Density")
-
-sns.histplot(data, bins=2, kde=True, stat="density", kde_kws=dict(cut=3, bw_method=0.65))
+plot_kde(gaussian_kernel, data, a=1)
 ```
 
 ### Kernel Functions and Bandwidths