From 47c809e904539d58fc2bcd52094a4264c9feb009 Mon Sep 17 00:00:00 2001 From: badabad <69161193+badabad@users.noreply.github.com> Date: Mon, 17 Aug 2020 18:34:55 -0500 Subject: [PATCH 1/4] Created using Colaboratory --- 1_Sprint_2_Statistics_Study_Guide.ipynb | 935 ++++++++++++++++++++++++ 1 file changed, 935 insertions(+) create mode 100644 1_Sprint_2_Statistics_Study_Guide.ipynb diff --git a/1_Sprint_2_Statistics_Study_Guide.ipynb b/1_Sprint_2_Statistics_Study_Guide.ipynb new file mode 100644 index 000000000..6749551b2 --- /dev/null +++ b/1_Sprint_2_Statistics_Study_Guide.ipynb @@ -0,0 +1,935 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of Unit 1 Sprint 2 - Statistics - Study Guide.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GTv68Uw5Zk-P", + "colab_type": "text" + }, + "source": [ + "This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important to prepare you for the sprint challenge as well as to be able to speak on these topics comfortably in interviews and on the job.\n", + "\n", + "If you get stuck or are unsure of something, remember the 20 minute rule. If that doesn't help, then research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead - they won't be there on your SC or during an interview. That being said, don't hesitate to ask for help if you truly are stuck.\n", + "\n", + "Have fun studying!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvSCoixx7rRe", + "colab_type": "text" + }, + "source": [ + "# Resources\n", + "\n", + "[Scipy Stats Documentation](https://docs.scipy.org/doc/scipy/reference/stats.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sDkirKu1B-Lw", + "colab_type": "text" + }, + "source": [ + "# General Terms" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iY916675DAXf", + "colab_type": "text" + }, + "source": [ + "Define the following terms. *Double click the text to edit the markdown cells.*\n", + "

\n", + "\n", + "**Normal Distribution:** A symmetric, bell-shaped probability distribution centered on the mean. It describes how frequently a variable's values occur\n", + "\n", + "**Standard Deviation:** sqrt(sum((x-u)^2)/N), where u is the mean and N is the number of values. It measures how far values typically fall from the mean\n", + "\n", + "**Z-Score:** (x-u)/sd, the number of standard deviations a value lies from the mean\n", + "\n", + "**P-Value:** The probability of seeing results at least as extreme as ours if the null hypothesis were true. A low p value (commonly below 0.05) is used to reject the null hypothesis\n", + "\n", + "**Null Hypothesis:** The \"boring\" or default outcome: there is no effect or no difference\n", + "\n", + "**Sample:** The portion of the population that we actually observe and use to represent the whole population\n", + "\n", + "**Statistical Significance:** A result is statistically significant when it would be unlikely to occur by chance alone if the null hypothesis were true" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTiR7Fh6FPH0", + "colab_type": "text" + }, + "source": [ + "# T-Test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L-NzA2VTFapj", + "colab_type": "text" + }, + "source": [ + "Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*\n", + "

\n", + "\n", + "1. What is a T-Test? What is it used for?\n", + "\n", + " A t-test checks whether a difference in means (between two groups, or between one group and a reference value) is larger than we would expect from chance alone. It is used to decide whether the data give enough evidence to reject the null hypothesis\n", + "\n", + "2. What is the difference between the normal distribution and the t-distribution?\n", + "\n", + " Both are bell shaped, but the t distribution is wider, with heavier tails. It is used when the sample is small and the population standard deviation is unknown, and it gets closer to the normal distribution as the sample size grows\n", + "\n", + "3. What is the difference between a 1-sample and a 2-sample t-test?\n", + "\n", + " A 1-sample t test checks whether a sample's mean differs from a hypothesized value\n", + " A 2-sample t test compares the means of two different samples and checks whether they differ" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_tZDJesBHeDB", + "colab_type": "text" + }, + "source": [ + "We are scientists running a drug trial and wanting to know whether our drug reduced patient symptoms. Below are the results (just random numbers), explain in 2-3 sentences whether or not the drug was effective. How can we tell that from the t-test?\n", + "\n", + "```\n", + "After running the t-test, look at the p value. Here it is about 0.69, which is high (above 0.05), so we fail to reject the null hypothesis: there is no evidence that the drug made a difference. 
If it were low (below 0.05), we would reject the null hypothesis\n", + "```\n", + "\n", + "What is likely our null hypothesis?\n", + "\n", + "```\n", + "The null hypothesis is that the drug has no effect: patients with and without the drug have the same average symptoms\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0ggDf6GE4mVU", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "2f9c09e4-eddf-4aa6-a8bc-787501dfa900" + }, + "source": [ + "from scipy import stats\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Get our \"results\" with random numbers\n", + "np.random.seed(42)\n", + "with_drug = stats.norm.rvs(loc=5, scale=10, size=500)\n", + "without_drug = stats.norm.rvs(loc=5, scale=10, size=500)\n", + "\n", + "# See if our drug made a difference\n", + "stats.ttest_ind(with_drug, without_drug)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=-0.40331379088750186, pvalue=0.6868037874359643)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5KJ4ZpQQPoIv", + "colab_type": "text" + }, + "source": [ + "Here is a dataframe of movie ratings. Divide the dataframe by gender and then use t-tests to show which movies have a statistically significant difference in rating when divided by gender. Give a sentence explanation of the results." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_HtmwEHBHTEb", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "outputId": "9cd88011-55a2-4141-a513-2aa0974dbcb6" + }, + "source": [ + "df = pd.DataFrame({'gender':['m','f','f','m','m','m','f','f','m','f'],\n", + " 'jurassic park':[10,9,10,9,9,10,10,10,9,9],\n", + " 'love actually':[6,9,10,7,6,7,10,10,5,8],\n", + " 'pacific rim':[10,3,4,8,9,8,5,4,9,3]})\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
genderjurassic parklove actuallypacific rim
0m10610
1f993
2f10104
3m978
4m969
\n", + "
" + ], + "text/plain": [ + " gender jurassic park love actually pacific rim\n", + "0 m 10 6 10\n", + "1 f 9 9 3\n", + "2 f 10 10 4\n", + "3 m 9 7 8\n", + "4 m 9 6 9" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bNDXqu-ZRDNe", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Divide the dataframe here\n", + "m = df[df['gender']=='m']\n", + "f = df[df['gender']=='f']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ReEWvQbmQrGz", + "colab_type": "text" + }, + "source": [ + "**Jurassic Park**\n", + "\n", + "Explanation of results:\n", + "\n", + "```\n", + "I fail to reject the null hypothesis that males and females rate the movie the same (p = 0.58), so the difference in ratings is not statistically significant\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iOIwQT5zPX59", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "2be93085-c03e-4fe9-e4dd-4bb292480171" + }, + "source": [ + "# T-Test Code Here\n", + "stats.ttest_ind(m['jurassic park'],f['jurassic park'])\n" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=-0.5773502691896236, pvalue=0.5795840000000014)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8GTFaWm-Q5RL", + "colab_type": "text" + }, + "source": [ + "**Love Actually**\n", + "\n", + "Explanation of results:\n", + "\n", + "```\n", + "Females rate Love Actually higher. 
The difference is statistically significant (p < 0.001)\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zlGdfuVhQ8e3", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "181550c3-8b03-42c6-cc1e-d8565e4165ac" + }, + "source": [ + "# T-Test Code Here\n", + "stats.ttest_ind(m['love actually'],f['love actually'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=-5.8423739467217715, pvalue=0.0003861022071216145)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JIZU8lzyQ80N", + "colab_type": "text" + }, + "source": [ + "**Pacific Rim**\n", + "\n", + "Explanation of results:\n", + "\n", + "```\n", + "Males rate Pacific Rim higher, and the difference is statistically significant (p < 0.001)\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KCN4M4SORBCZ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "483bf62f-75f7-4735-c54a-c3521aa6b530" + }, + "source": [ + "# T-Test Code Here\n", + "stats.ttest_ind(m['pacific rim'],f['pacific rim'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=9.449111825230684, pvalue=1.2936944097439082e-05)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hn-JhlRxRXQK", + "colab_type": "text" + }, + "source": [ + "# Confidence Interval" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zVKjVPipS9Ko", + "colab_type": "text" + }, + "source": [ + "Answer the following question as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*\n", + "

\n", + "\n", + "1. What is a confidence interval?\n", + "\n", + " A confidence interval is a range of values, calculated from a sample, that is likely to contain the true value of a population parameter such as the mean. At a 95% confidence level, about 95 out of 100 intervals built this way would contain the true value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ozcajm5PXPLc", + "colab_type": "text" + }, + "source": [ + "Using the movie rating data, graph the ratings with a confidence interval. After graphing the ratings with the confidence interval, write a brief explanation of how to interpret the graph.\n", + "\n", + "```\n", + "The bars show the average rating of each movie. The line on each bar marks the 95% confidence interval. We are 95% confident that the true average rating of each movie falls within that line\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Wg7BLdGXXMq", + "colab_type": "code", + "colab": {} + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Your Graph Code Here\n", + "def CI(data, confidence=.95):\n", + " #calculate confidence interval\n", + " sample = np.array(data)\n", + " s = np.std(sample, ddof=1)\n", + " n = np.size(sample)\n", + " standard_error = s/np.sqrt(n)\n", + " t = stats.t.ppf((1+confidence)/2, n-1)\n", + " #margin of error\n", + " MOE = t*standard_error\n", + " xbar = np.mean(sample)\n", + " lower = xbar-MOE\n", + " upper = xbar+MOE\n", + " return(lower,xbar,upper, MOE)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UuFE9e6JHkLC", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 282 + }, + "outputId": "663ae47b-baaa-4c1c-f002-0f623f6ed5be" + }, + "source": [ + "plt.bar(0,df['jurassic park'].mean(), yerr=CI(df['jurassic park'])[3])\n", + "plt.bar(1,df['love actually'].mean(), yerr=CI(df['love actually'])[3])\n", + "plt.bar(2,df['pacific rim'].mean(), yerr=CI(df['pacific rim'])[3])\n", + "plt.xticks([0,1,2],['Jurassic Park', 'Love Actually','Pacific Rim'])\n", + "plt.title(\"Movie Ratings\")\n", + 
"plt.show" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAP60lEQVR4nO3de5BkZX3G8e8TFhSUcHGnCAHigCEaBUWyXtAiLhctLwma8gIEFKzEjUlUMLEUK17Q8g/QGE0gl9ooiAlRVIgY4w2NoPGCzMICCytR11VB0FERRY0E/eWPc1bbcXZnprvZmXf2+6ma6vecPpff9Nv9zHtOd59JVSFJas+vLHYBkqThGOCS1CgDXJIaZYBLUqMMcElq1IrtubOVK1fW5OTk9tylJDVv3bp136qqiZnzt2uAT05OMjU1tT13KUnNS/KV2eZ7CkWSGmWAS1KjDHBJatScAZ7kvCTfTLJhYN7eSS5L8oX+dq97tkxJ0kzzGYG/DXjijHlnAB+rqoOBj/XTkqTtaM4Ar6pPAN+ZMfupwAV9+wLgaWOuS5I0h2HPge9TVbf27duAfba2YJI1SaaSTE1PTw+5O0nSTCO/iVnd9Wi3ek3aqlpbVauqatXExC99Dl2SNKRhA/wbSfYF6G+/Ob6SJEnzMWyAvw84pW+fAlw6nnKWh9WrV7N69erFLkPSMjefjxG+A/gM8MAkNyf5I+As4PFJvgAc209LkrajOa+FUlUnbuWuY8ZciyRpAfwmpiQ1ygCXpEYZ4JLUKANckhplgEtSo7brf+QZxeQZ/7nYJczbbZu+DbRT8+aznrLYJUgagiNwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXDs8L/+rVjXzOfCW/NofenVdSfc8R+CS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JKa439R6hjgktQoA1ySGjVSgCd5cZIbkmxI8o4k9x5XYZKkbRs6wJPsB7wIWFVVhwA7ASeMqzBJ0raNegplBbBrkhXAbsDXRy9JkjQfQwd4Vd0C/DXwVeBW4I6q+sjM5ZKsSTKVZGp6enr4SiVJv2CUUyh7AU8FDgR+HbhPkpNnLldVa6tqVVWtmpiYGL5SSdIvGOUUyrHAl6tquqr+D7gEeMx4ypIkzWWUAP8q8OgkuyUJcAywcTxlSZLmsmLYFavqyiTvAa4G7gauAdaOqzAtA2fusdgVzM/mH3S3rdQLcOYdi12BloChAxygql4NvHpMtUiSFsBvYkpSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWrUisUuQFpsl596n8UuQRqKI3BJapQBLkmNMsAlqVGeA5cEwKEXHLrYJczbpts2Ae3UfP0p198j23UELkmNMsAlqVEjBXiSPZO8J8nnk2xMcsS4CpMkbduo58D/FvhQVT0jyS7AbmOoSZI0D0MHeJI9gN8FTgWoqruAu8ZTliRpLqOcQjkQmAbOT3JNkrck+aWvtCVZk2QqydT09PQIu5MkDRolwFcAhwP/WFU
PB34AnDFzoapaW1WrqmrVxMTECLuTJA0aJcBvBm6uqiv76ffQBbokaTsYOsCr6jbga0ke2M86BrhxLFVJkuY06qdQXghc2H8CZRPw3NFLkiTNx0gBXlXrgVVjqkWStAB+E1OSGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVErFrsASVqog15+0GKXsCQ4ApekRhngktQoA1ySGjVygCfZKck1Sd4/joIkSfMzjhH4acDGMWxHkrQAIwV4kv2BpwBvGU85kqT5GnUE/mbgpcBPt7ZAkjVJppJMTU9Pj7g7SdIWQwd4kt8DvllV67a1XFWtrapVVbVqYmJi2N1JkmYYZQT+WOC4JJuBdwJHJ/nXsVQlSZrT0AFeVS+vqv2rahI4Afivqjp5bJVJkrbJz4FLUqPGci2UqrocuHwc25IkzY8jcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJatTQAZ7kgCQfT3JjkhuSnDbOwiRJ27ZihHXvBv6yqq5OsjuwLsllVXXjmGqTJG3D0CPwqrq1qq7u298HNgL7jaswSdK2jeUceJJJ4OHAlbPctybJVJKp6enpcexOksQYAjzJfYGLgdOr6nsz76+qtVW1qqpWTUxMjLo7SVJvpABPsjNdeF9YVZeMpyRJ0nyM8imUAG8FNlbV34yvJEnSfIwyAn8s8Gzg6CTr+58nj6kuSdIchv4YYVX9N5Ax1iJJWgC/iSlJjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0aKcCTPDHJTUm+mOSMcRUlSZrb0AGeZCfg74EnAQ8GTkzy4HEVJknatlFG4I8EvlhVm6rqLuCdwFPHU5YkaS4rRlh3P+BrA9M3A4+auVCSNcCafvLOJDeNsM+WrAS+tdhFzEfOXuwKloRm+guA12SxK1gKmumznDpyf91/tpmjBPi8VNVaYO09vZ+lJslUVa1a7Do0P/ZXe+yz0U6h3AIcMDC9fz9PkrQdjBLgVwEHJzkwyS7ACcD7xlOWJGkuQ59Cqaq7k7wA+DCwE3BeVd0wtsrat8OdNmqc/dWeHb7PUlWLXYMkaQh+E1OSGmWAS1Kjln2AJ7lzEfb56QUse2aSW5KsT7IhyXEL3NfmJCsXXuXSsr37qX+83zmP5Q5L8uQR9/WzPlqM5+P2luQnA8/ndyfZbYhtfCDJnn37RUk2JrkwyXELuWxH/9hfn+S6JFckuf/AffN+nS5Vyz7AF6K/PMDIquoxC1zlTVV1GPBM4Lwk8+qXcdW7o0ny23RvvB+Z5D5zLH4YMFKA74B+VFWHVdUhwF3A8xe6gap6clV9t5/8M+DxVXVSVb2vqs5a4OaOqqqHApcDrxjYx0Jfp0vODhHgSVYnef/A9LlJTu3bm5OcneRq4JlJnpfkqiTXJrl4y+ghyTP7EcW1ST7Rz3tIks/1o43rkhzcz79zYF8v60cA1ybZ5hOvqjYCdwMrk7w3ybokN/TfZt2yvTuTvDHJtcARA/N3TfLBJM8bw0O2JPSj38/2j+2/J9kryYOSfG5
gmckk1/ft3+lHWeuSfDjJvlvZ9InAvwAfYeDyD0kekeTTfV99LskewGuB4/s+Pr4/YnrJwDobkkz27Vn7bCu/29uTPG1g+sIky/FSFJ8EfjPJ7ye5Msk1ST6aZB+AJPdNcv7AKPnp/fzNSVYm+SfgIOCDSV6c5NQk5/bL7NM/L67tf+YK5M/QfYOcfv07+9vV/fPm0iSbkpyV5KT+OXB9kgfcA4/LeFTVsv4B7gRWA+8fmHcucGrf3gy8dOC++w20Xwe8sG9fD+zXt/fsb88BTurbuwC7btlnf/sk4NPAbv303rPUdybwkr79KODrQLYsC+wKbNhSF1DAswbW3wxMAh8FnrPYj/co/TTLvOuAx/Xt1wJv7tvrgQP79svoRlU794/1RD//eLqPts62r5uA3wCeAPzHQP9tAh7RT/8q3cdsTwXOna2/+ukNwORg/87SZ5uBlTOeG48D3tu39wC+DKxY7H4YZ1/2j9+lwJ8Ce/HzT739MfDGvn32ln7tp/ea5TEbbP+sP4CLgNP79k7AHrPUMrjum4E1s9S5GvgusC9wL7ovJL6mv++0wfqW2s89/lX6Rlw00D4kyeuAPYH70n3OHeBTwNuSvAu4pJ/3GeCvkuwPXFJVX5ix3WOB86vqhwBV9Z2t7P/FSU4Gvg8cX1XVn/f7g/7+A4CDgW8DPwEunrH+pcDrq+rC+f/KS1s/+t2zqq7oZ10AvLtvv4suoM/qb48HHggcAlyWBLoX9K2zbHcV8K2q+mqSW+hOWe1NNzK7taquAqiq7/XLL6TsrfXZL6mqK5L8Q5IJ4OnAxVV190J2toTtmmR93/4k8Fa6/rmoPyrahe4PFnSvkRO2rFhVty9gP0cDz+nX+wlwx1aW+3jfx3cCr9zKMldV1a0ASb5Ed3QG3cDtqAXUtF3tEKdQ6E5LDP6u955x/w8G2m8DXlBVhwKv2bJsVT2fbqR3ALAuyf2q6t+A44AfAR9IcvSQ9b2punOGR1bVJ5OspntiH1FVDwOuGaj5f/sn66BPAU/MAtOmYRcBz0ryW0D1fzgD3NA/jodV1aFV9YRZ1j0ReFCSzcCX6EbaT1/Avmd9Ls3RZ1vzduBk4LnAeQuoYan70UA/vLC6q5WeQzdyPhT4E+Z+bMbpKLqLQa2ne03P5scD7Z8OTP+U7XDNqGHtKAH+FeDBSe6V7p3tY7ax7O7ArUl2Bk7aMjPJA6rqyqp6FTANHJDkIGBTVf0d3Sj4oTO2dRnw3IHz6HvPs949gNur6odJHgQ8eo7lXwXcTnd99mWhqu4Abk9yZD/r2cAV/X1fojsSeSU/P3q6CZhIcgRAkp2TPGRwm+neHH4WcGhVTVbVJN058BP79fdN8oh+2d2TrKA7Ktp9YDObgcP7ZQ4HDuznL7TPoBssnN7/TjfOY/mW7cHPr5V0ysD8y4A/3zKRZK8FbPNjdKdnSLJTf9Q2q/7o5nTgOQt4HS55yzrA+xfgj6vqa3SH3Rv622u2sdorgSvpRrWfH5j/hv4NjQ1051qvpQuDDf3h4iF0I6qfqaoP0V0fZqpf5iXMz4eAFUk20p0m+Ow81jmN7tD19fPcx1KzW5KbB37+gu6F/oYk19F9GuS1A8tfRDd6fRdAP8p7BnB2ujd41wMz39Q6Erilqr4+MO8TdP+Q5H50p2LO6de/jG6U+HG6P/7rkxxPd/pq7yQ3AC8A/qffzoL7rKq+AWwEzp/74WnemcC7k6zjFy8B+zpgr/7N4GtZ2OmK04Cj0r2JvY6uH7eqP0XyDgb+YLRuWX+VPsnDgH+uqkcudi3STP2R2fXA4f0Rh7Qgy3YEnuT5dH9tXzHXstL2luRYutH3OYa3hrWsR+CStJwt2xG4JC13BrgkNcoAl6RGGeCS1CgDXJIa9f982ZVpVxJQ4QAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2kdB0Bcxaw3h", + "colab_type": "text" + }, + "source": [ + "# Chi Squared" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DOmy8rAhbnXj", + "colab_type": "text" + }, + "source": [ + "Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*\n", + "

\n", + "\n", + "1. What is a Chi Squared Test? What is it used for?\n", + "\n", + " It checks whether two categorical variables are related to each other, by comparing the counts we observed with the counts we would expect if the variables were independent\n", + "\n", + "2. What type of data is it used on?\n", + "\n", + " Categorical data (counts of observations in each category)\n", + "\n", + "3. What is a contingency table?\n", + "\n", + " A contingency table compares two categorical variables to each other. It shows how often each combination of categories occurs\n", + "\n", + "4. Define Degrees of Freedom\n", + "\n", + " The number of values in a calculation that are free to vary independently. For a contingency table it is (rows - 1) x (columns - 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8VTCMJBiSu_", + "colab_type": "text" + }, + "source": [ + "Use the `grades` dataframe below to complete the following:\n", + "- Create at least 2 contingency tables\n", + "- Use chi-squared tests to find 2 features that are independent of each other.\n", + " - Write a brief interpretation of the results\n", + "- Use chi-squared tests to find 2 features that are dependent on each other.\n", + " - Write a brief interpretation of the results" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xm4saRNNbGQd", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "outputId": "a6bf5ce9-54e0-4d90-8bdf-923a0c2b971c" + }, + "source": [ + "grades = pd.DataFrame({'good_standing':[True, True, False, False, False, True, True, False, True, True],\n", + " 'grade_1':['A', 'B', 'A', 'C', 'A', 'A', 'D', 'A', 'B', 'B'],\n", + " 'grade_2':['Pass', 'Pass', 'Fail', 'Fail', 'Fail','Pass', 'Pass', 'Fail', 'Pass', 'Fail'],\n", + " 'grade_3':[10, 5, 6, 10, 9, 9, 8, 7, 3, 9]})\n", + "grades.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
genderjurassic parklove actuallypacific rim
0m10610
1f993
2f10104
3m978
4m969
\n", + "
" + ], + "text/plain": [ + " gender jurassic park love actually pacific rim\n", + "0 m 10 6 10\n", + "1 f 9 9 3\n", + "2 f 10 10 4\n", + "3 m 9 7 8\n", + "4 m 9 6 9" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mwcJfWhzh6gJ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + }, + "outputId": "fdbbd2d7-e627-4bfb-c6cd-8cfa05d8db38" + }, + "source": [ + "# Contingency Table 1\n", + "\n", + "table1 = pd.crosstab(grades['good_standing'],grades['grade_1'])\n", + "stats.chi2_contingency(table1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(5.0, 0.1717971442967335, 3, array([[2. , 1.2, 0.4, 0.4],\n", + " [3. , 1.8, 0.6, 0.6]]))" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q5AEI6Lgkcfm", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Contingency Table 2\n", + "table2 = pd.crosstab(grades['good_standing'],grades['grade_2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JuK6pVIkkel1", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 117 + }, + "outputId": "aa1fd0aa-29c7-4a29-c8d0-fbd02da8adf5" + }, + "source": [ + "# Chi Squared, independent features\n", + "### Perform the chi-square test\n", + "stat, p, dof, expected = stats.chi2_contingency(table1, correction=False)\n", + "\n", + "### Print out the stats in a nice format\n", + "print('Expected values: \\n ', expected.round(2))\n", + "print('The degrees of freedom: ', dof)\n", + "print(f'The chi square statistics is: {stat:.3f}')\n", + "print(f'The p value is: {p:.6f}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Expected values: \n", + " [[2. 1.2 0.4 0.4]\n", + " [3. 
1.8 0.6 0.6]]\n", + "The degrees of freedom: 3\n", + "The chi square statistics is: 5.000\n", + "The p value is: 0.171797\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZsZrdkOHki-B", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 117 + }, + "outputId": "110ba39a-e093-4dec-d9e0-770e60cfe7a6" + }, + "source": [ + "# Chi Squared, dependent features\n", + "### Perform the chi-square test\n", + "stat, p, dof, expected = stats.chi2_contingency(table2, correction=False)\n", + "\n", + "### Print out the stats in a nice format\n", + "print('Expected values: \\n ', expected.round(2))\n", + "print('The degrees of freedom: ', dof)\n", + "print(f'The chi square statistics is: {stat:.3f}')\n", + "print(f'The p value is: {p:.6f}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Expected values: \n", + " [[2. 2.]\n", + " [3. 3.]]\n", + "The degrees of freedom: 1\n", + "The chi square statistics is: 6.667\n", + "The p value is: 0.009823\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5g6IXrsppE_j", + "colab_type": "text" + }, + "source": [ + "# Bayesian Statistics" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MjPRgVbxp_eN", + "colab_type": "text" + }, + "source": [ + "Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*\n", + "

\n", + "\n", + "1. What is the difference between Bayesian and Frequentist Statistics?\n", + "\n", + " Bayesian statistics combines prior knowledge with the observed data to update how likely a hypothesis is. Frequentist statistics relies only on the observed data and treats probability as a long-run frequency\n", + "\n", + "2. What is a prior belief? How is it used in Bayesian Statistics?\n", + "\n", + " A probability distribution that describes our beliefs about an uncertain variable before seeing the data. It is combined with the observed data to produce an updated (posterior) belief\n", + "\n", + "3. What is the law of total probability?\n", + "\n", + " You can find the probability of an event by adding up its probability across a set of distinct, non-overlapping cases that together cover every possibility\n", + "\n", + "4. What is the law of conditional probability?\n", + "\n", + " The probability of an event given that another event occurred: P(A|B) = P(A and B) / P(B)\n", + "\n", + "5. Give an example of when you might use bayesian statistics. Do not use an example given during the lecture or assignment.\n", + "\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8N39IjRS7Jix", + "colab_type": "text" + }, + "source": [ + "# Graphing" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r3GRbrZI7NIP", + "colab_type": "text" + }, + "source": [ + "Use any of the dataframes above and make two additional visualizations to explore the data. Make sure to include axis labels and a title for each graph." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ywKWLarY7khK", + "colab_type": "code", + "colab": {} + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TYVX3IYZ7kmO", + "colab_type": "code", + "colab": {} + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file From 9b42758b3dcd9f73480eb3b42462de8f16330f06 Mon Sep 17 00:00:00 2001 From: badabad <69161193+badabad@users.noreply.github.com> Date: Tue, 18 Aug 2020 11:04:50 -0500 Subject: [PATCH 2/4] Created using Colaboratory --- ..._123_Confidence_Intervals_Assignment.ipynb | 842 ++++++++++++++++++ 1 file changed, 842 insertions(+) create mode 100644 LS_DS_123_Confidence_Intervals_Assignment.ipynb diff --git a/LS_DS_123_Confidence_Intervals_Assignment.ipynb b/LS_DS_123_Confidence_Intervals_Assignment.ipynb new file mode 100644 index 000000000..ce7a97fed --- /dev/null +++ b/LS_DS_123_Confidence_Intervals_Assignment.ipynb @@ -0,0 +1,842 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of LS_DS_123_Confidence_Intervals_Assignment.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_c3L9CeANiJ", + "colab_type": "text" + }, + "source": [ + "## Confidence Intervals\n", + "\n", + "The following url can be used to access an abbreviated version of responses to Stack Overflow's 2018 Developer Survey. 
The original Survey had ~100k respondents but the data is quite dirty so I have selected a cleaner subset of it for you to use for your assignment.\n", + "\n", + "\n", + "\n", + "The provided dataset holds 14 columns of information about individuals who make less than 500k per year and who responded that they had: \n", + "\n", + "\"Participated in a full-time developer training program or bootcamp\"\n", + "\n", + "## Part 1 - Setting the Stage\n", + "\n", + "**1) Load the dataset**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yXwMJQGrAIbO", + "colab_type": "code", + "colab": {} + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from scipy import stats" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pYsFvm1dPygj", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 476 + }, + "outputId": "9baaceef-d4c5-4f0d-c2f5-9f01340ab4c8" + }, + "source": [ + "df=pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/bootcampers.csv')\n", + "print(df.shape)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(2761, 15)\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | Unnamed: 0 | Student | Employment | UndergradMajor | DevType | YearsCoding | YearsCodingProf | ConvertedSalary | EducationTypes | SelfTaughtTypes | TimeAfterBootcamp | LanguageWorkedWith | Gender | Age | RaceEthnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | No | Employed full-time | Computer science, computer engineering, or sof... | Back-end developer;Data scientist or machine l... | 12-14 years | 6-8 years | 120000.0 | Taken an online course in programming or softw... | The official documentation and/or standards fo... | I already had a full-time job as a developer w... | C;F#;Haskell;Python;Scala | Male | 25 - 34 years old | White or of European descent |
| 1 | 73 | No | Employed full-time | A humanities discipline (ex. literature, histo... | Back-end developer;Full-stack developer;System... | 0-2 years | 0-2 years | 36000.0 | Participated in a full-time developer training... | The official documentation and/or standards fo... | Four to six months | Java;JavaScript;SQL;HTML;CSS;Bash/Shell | Male | 25 - 34 years old | White or of European descent |
| 2 | 127 | Yes, full-time | Employed full-time | A business discipline (ex. accounting, finance... | Full-stack developer | 3-5 years | 3-5 years | 59980.0 | Taken an online course in programming or softw... | The official documentation and/or standards fo... | One to three months | C#;JavaScript;TypeScript;HTML;CSS | Male | 25 - 34 years old | East Asian |
| 3 | 140 | No | Employed full-time | A social science (ex. anthropology, psychology... | Data scientist or machine learning specialist;... | 9-11 years | 3-5 years | 70000.0 | Taken an online course in programming or softw... | Questions & answers on Stack Overflow;Tapping ... | I haven’t gotten a developer job | JavaScript;Python;SQL;VBA | Male | 25 - 34 years old | White or of European descent |
| 4 | 153 | No | Employed full-time | Computer science, computer engineering, or sof... | Mobile developer | 6-8 years | 3-5 years | 105000.0 | Taken an online course in programming or softw... | The official documentation and/or standards fo... | One to three months | C;Java;JavaScript;Objective-C;PHP;Python;Ruby;... | Male | 25 - 34 years old | White or of European descent |
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 Student ... Age RaceEthnicity\n", + "0 62 No ... 25 - 34 years old White or of European descent\n", + "1 73 No ... 25 - 34 years old White or of European descent\n", + "2 127 Yes, full-time ... 25 - 34 years old East Asian\n", + "3 140 No ... 25 - 34 years old White or of European descent\n", + "4 153 No ... 25 - 34 years old White or of European descent\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wOQ-9E8uYcni", + "colab_type": "text" + }, + "source": [ + "**2) Select two random samples from this dataset, one with a sample size of 20 and the other with a sample size of 200. (Use a `random_state` of `42` when selecting the samples)**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m1vuFGxVQo64", + "colab_type": "code", + "colab": {} + }, + "source": [ + "rs1 = df.sample(20,random_state=42)\n", + "rs2 = df.sample(200, random_state=42)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y2Rz-8O-YmF9", + "colab_type": "text" + }, + "source": [ + "**3) Calculate and report the sample means of the `ConvertedSalary` column for both of the samples.**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ti9x37XSQ_yL", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + }, + "outputId": "f9c063df-4518-4d25-b6ba-0199b9522ce3" + }, + "source": [ + "rs1_mean = rs1['ConvertedSalary'].mean()\n", + "print(rs1_mean)\n", + "rs2_mean = rs2['ConvertedSalary'].mean()\n", + "print(rs2_mean)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "55752.2\n", + "68551.255\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8hGxkMQ020th", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + 
"height": 369 + }, + "outputId": "e07f9302-42a8-4051-ca34-008d19bbb537" + }, + "source": [ + "r= rs1['ConvertedSalary']\n", + "r.head(21)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "367 74500.0\n", + "2759 60000.0\n", + "1330 86120.0\n", + "2750 60000.0\n", + "521 25047.0\n", + "819 10704.0\n", + "322 150000.0\n", + "1970 41124.0\n", + "365 135000.0\n", + "2512 62600.0\n", + "1973 6348.0\n", + "533 62507.0\n", + "2060 31309.0\n", + "807 42635.0\n", + "2724 14687.0\n", + "239 64417.0\n", + "2261 9600.0\n", + "2233 40196.0\n", + "1688 9706.0\n", + "1268 128544.0\n", + "Name: ConvertedSalary, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AtGASqgxYp5t", + "colab_type": "text" + }, + "source": [ + "**4) Both of these sample means are estimates of an underlying population value. Which sample mean do you trust more? Why? Would a non-technical audience have any idea about which of these values is more trustworthy?**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8oD8lp84YyvU", + "colab_type": "text" + }, + "source": [ + "I would trust the second sample more. Smaller sample sizes are more likely to be inaccurate because they can be thrown off by anomalies. 
I'm pretty sure even a non-technical audience would choose the larger sample." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-2ulIAGmYudp", + "colab_type": "text" + }, + "source": [ + "**5) Does just the point estimate (individual value of the sample mean) indicate to you anything about how much sampling error there could be with these estimates?**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hD5HloatYxoh", + "colab_type": "text" + }, + "source": [ + "No. A single point estimate says nothing about sampling error on its own. The two sample means differ by almost 13,000 (55,752 vs 68,551), which hints that at least one of them is off, but we need the sample size and the spread of the data (the standard error) to say by how much." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SpqgKQfTYvF0", + "colab_type": "text" + }, + "source": [ + "**6) What strategies could we use when reporting these numbers to not only report our estimates but also to give non-technical readers an idea about how far off our estimates might be due to sampling error?**\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h7N1WnTTYyD7", + "colab_type": "text" + }, + "source": [ + "Non-technical readers would be more convinced if they saw visuals, i.e. 
a graph\n", + "We could also calculate percent difference" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9tIHBlM_SyGR", + "colab_type": "text" + }, + "source": [ + "## Part 2 - Reporting CIs / MOEs along with our point estimates for more context.\n", + "\n", + "**1) Calculate and report a 95% confidence interval around both of the sample means from part 1.**\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hfCO0gHEUOrE", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def CI(data, confidence=.95):\n", + " #calculate confidence interval\n", + " sample = np.array(data)\n", + " s = np.std(sample, ddof=1)\n", + " n = np.size(sample)\n", + " standard_error = s/np.sqrt(n)\n", + " t = stats.t.ppf((1+confidence)/2, n-1)\n", + " #margin of error\n", + " MOE = t*standard_error\n", + " xbar = np.mean(sample)\n", + " lower = xbar-MOE\n", + " upper = xbar+MOE\n", + " return(lower,xbar,upper, MOE)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "r5OqA6GYUbUZ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "d9c59938-046a-4b34-9a7f-17bba79fc14c" + }, + "source": [ + "CI(rs1['ConvertedSalary'], .95)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(35805.10309625536, 55752.2, 75699.29690374463, 19947.096903744638)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lS7o7xwn7IBy", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "7e3b3343-cb96-4fbd-b7fc-c65c3a2c0fbf" + }, + "source": [ + "CI(rs2['ConvertedSalary'], .95)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(60593.058804479086, 68551.255, 76509.45119552092, 
7958.196195520917)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdiW7DHRZwK-", + "colab_type": "text" + }, + "source": [ + "**2) Which confidence interval is wider and why?**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p5KGlyvYZ1Qq", + "colab_type": "text" + }, + "source": [ + "The confidence interval of the first sample is much wider. Due to its lower sample size, the standard error of our estimate is much larger, causing the potential distribution of sample means to be much more spread out. A confidence interval catches 95% of this theoretical distribution of sample means, so if our standard error is larger, our confidence interval will be wider as well." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6N33K4NvZ13H", + "colab_type": "text" + }, + "source": [ + "**3) Report the mean and the margin of error for both of the sample means. What does the Margin of Error Represent?**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MfMIBftMU_rz", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 100 + }, + "outputId": "88648468-9e97-4f0a-9d2a-2ca0ea8551cc" + }, + "source": [ + "print(CI(rs1['ConvertedSalary'])[3])\n", + "print(f\"margin of error for sample 1 is {CI(rs1['ConvertedSalary'])[3]:.0f}\")\n", + "print(f\"margin of error for sample 2 is {CI(rs2['ConvertedSalary'])[3]:.0f}\")\n", + "print('mean of sample 1 is ' + str(rs1_mean))\n", + "print('mean of sample 2 is ' + str(rs2_mean))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "19947.096903744638\n", + "margin of error for sample 1 is 19947\n", + "margin of error for sample 2 is 7958\n", + "mean of sample 1 is 55752.2\n", + "mean of sample 2 is 68551.255\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JylFhQm_-x_x", + "colab_type": "text" + }, + "source": [ + "## Margin of error is just a 
measure of how inaccurate the results could be. A margin of error indicates how much the real results could vary from what we calculated" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_7NuYtHfaQav", + "colab_type": "text" + }, + "source": [ + "The margin of error gives an idea of how far off our estimates might be (with 95% confidence). We're trying to supply a plausible range for our parameter of interest (the true average salary of bootcamp grads)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zkg9EW9KTgGv", + "colab_type": "text" + }, + "source": [ + "## Part 3 - Communicate the Precision of Sample Estimates Graphically\n", + "\n", + "**1) Create a plot using `plt.errorbar` that compares both of the confidence intervals.** " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2QOQGVfsVfFZ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 282 + }, + "outputId": "ca9d90d1-89d3-4a05-dedc-0d4274525acc" + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.errorbar(0, rs1_mean, yerr= 19947, fmt='o')\n", + "plt.errorbar(1, rs2_mean, yerr=7958, fmt='o')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HsTtONTNawFi", + "colab_type": "text" + }, + "source": [ + "**2) Create a plot using `plt.bar` that compares both of the confidence intervals.**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UE1fxaoKW1Xg", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + }, + "outputId": "7dd36979-13b1-493c-fe4d-36e38d5936e9" + }, + "source": [ + "plt.bar(0, rs1_mean, yerr=19947)\n", + "plt.bar(1, rs2_mean, yerr=7958)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZGCzS-BFctob", + "colab_type": "text" + }, + "source": [ + "## Part 4 - Check for Understanding\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E181afPXezJ9", + "colab_type": "text" + }, + "source": [ + "**Calculate a Confidence Interval using the entire dataset. How precise do our estimates get?**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3puy99D6esLn", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "a566b3c1-0069-4a60-faba-cb79e1d270c2" + }, + "source": [ + "CI(df['ConvertedSalary'], .95)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(71299.82986224785, 73453.40420137631, 75606.97854050477, 2153.574339128457)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q-ACeVEbDfoX", + "colab_type": "text" + }, + "source": [ + "The estimate gets much more precise. The margin of error drops to about 2,154, compared with about 7,958 for the sample of 200 and about 19,947 for the sample of 20, because using every row shrinks the standard error. Note that the full dataset is still a sample of all bootcamp grads, not the whole population." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wbA0inEKexdW", + "colab_type": "text" + }, + "source": [ + "**What does \"precision\" mean in the context of statistical estimates and how is that different from \"accuracy?\"**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Btl5phUUd6L4", + "colab_type": "text" + }, + "source": [ + "\n", + "\n", + "Precision is a measure of how close repeated estimates are to each other (a narrow confidence interval means high precision). 
Accuracy is a measure of how close your estimate is to the true value.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4F-4ux7sd5pL", + "colab_type": "text" + }, + "source": [ + "**It is very common to misunderstand what is captured by a 95% confidence interval. What is the correct interpretation? ([Hint](https://www.statisticssolutions.com/misconceptions-about-confidence-intervals/))**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bSU07naHd6li", + "colab_type": "text" + }, + "source": [ + "If we repeated the sampling many times and built a 95% confidence interval from each sample, about 95% of those intervals would contain the true population mean. It does not mean that 95% of sample means (or of individual salaries) fall inside this one interval." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cvtnsGLJa4bW", + "colab_type": "text" + }, + "source": [ + "# Stretch Goals:\n", + "\n", + "1) Study the relationship between t-tests and confidence intervals.\n", + " - Find a sample mean that we have worked with and construct a 95% confidence interval around it. (find the lower and upper bounds)\n", + " - Run a 1-sample t-test with the null hypothesis value being just barely **outside** of the confidence interval. What is the p-value?\n", + " - Run a 1-sample t-test with the null hypothesis value being just barely **inside** of the confidence interval. What is the p-value?\n", + "\n", + " What does it mean when we say that the boundaries of the confidence interval are the boundaries of statistical significance in a 1-sample t-test?\n", + "\n", + "\n", + "2) Go back to our [congressional voting dataset](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) and build confidence intervals around the means of all of the issues for a single party. Compare all of those confidence intervals graphically on a single graph. \n", + "\n", + "3) Dive deeper into the [2018 Stack Overflow Survey](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey) results to see what cool things you can find."
+ ] + } + ] +} \ No newline at end of file From 4ff8be5c0662f1a83f5508fa8e2d55bd867fe772 Mon Sep 17 00:00:00 2001 From: badabad <69161193+badabad@users.noreply.github.com> Date: Tue, 18 Aug 2020 11:05:52 -0500 Subject: [PATCH 3/4] Created using Colaboratory --- ...ion_to_Bayesian_Inference_Assignment.ipynb | 568 ++++++++++++++++++ 1 file changed, 568 insertions(+) create mode 100644 LS_DS_124_Introduction_to_Bayesian_Inference_Assignment.ipynb diff --git a/LS_DS_124_Introduction_to_Bayesian_Inference_Assignment.ipynb b/LS_DS_124_Introduction_to_Bayesian_Inference_Assignment.ipynb new file mode 100644 index 000000000..b8f1fa9e5 --- /dev/null +++ b/LS_DS_124_Introduction_to_Bayesian_Inference_Assignment.ipynb @@ -0,0 +1,568 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of LS_DS_124_Introduction_to_Bayesian_Inference_Assignment.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H7OLbevlbd_Z", + "colab_type": "text" + }, + "source": [ + "# Lambda School Data Science Module 124\n", + "\n", + "## Introduction to Bayesian Inference\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P-DzzRk5bf0z", + "colab_type": "text" + }, + "source": [ + "## Assignment - Code it up!\n", + "\n", + "We used pure math to apply Bayes Theorem to drug tests. Now write Python code to reproduce the results! This is purposefully open ended - you'll have to think about how you should represent probabilities and events. 
You can and should look things up.\n", + "\n", + "Specific goals/targets:\n", + "\n", + "### 1) Write a function \n", + "\n", + "`def prob_drunk_given_positive(prob_drunk_prior, false_positive_rate, true_positive_rate):` \n", + "\n", + "You should only truly need these three values in order to apply Bayes Theorem. In this example, imagine that individuals are taking a breathalyzer test with an 8% false positive rate, a 100% true positive rate, and that our prior belief about drunk driving in the population is 1/1000. \n", + " - What is the probability that a person is drunk after one positive breathalyzer test?\n", + " - What is the probability that a person is drunk after two positive breathalyzer tests?\n", + " - How many positive breathalyzer tests are needed in order to have a probability that's greater than 95% that a person is drunk beyond the legal limit?\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xpVhZyUnbf7o", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# TODO - code!\n", + "def prob_drunk_given_positive(prob_drunk_prior, false_positive_rate, true_positive_rate):\n", + " #Returns P[person is drunk|positive test]\n", + " #P[drunk|positive]=(P[positive|drunk]*P[drunk])/P[positive]\n", + " #P[positive]=P[false positive]+P[true positive]\n", + " return (true_positive_rate * prob_drunk_prior)/(false_positive_rate*(1-prob_drunk_prior) + true_positive_rate*prob_drunk_prior)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qgANjxz8-nsW", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "2b253e3c-48ca-47a7-ddd3-383af99ad7fe" + }, + "source": [ + "a=prob_drunk_given_positive(1/1000,.08,1)\n", + "a" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.012357884330202669" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + 
}, + { + "cell_type": "markdown", + "metadata": { + "id": "N1vfFJ_DB1_p", + "colab_type": "text" + }, + "source": [ + "### ^ chance of being drunk after one positive test" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T5HoJWgLAksx", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "7d270406-2efe-4fd8-b85c-84fc2af1428a" + }, + "source": [ + "round(prob_drunk_given_positive(a, .08, 1), 4)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.1353" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qTkkl1XzB0h1", + "colab_type": "text" + }, + "source": [ + "### ^ chance of being drunk after two positive tests (the first posterior is used as the new prior)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHzWuIlXCE3f", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "3085b754-04ce-4a8c-8046-d2abefb1ff1f" + }, + "source": [ + "n = 0\n", + "p = 1/1000\n", + "while p <= .95:\n", + " p = prob_drunk_given_positive(p, .08, 1) # posterior becomes the next prior\n", + " n = n+1\n", + "print(n)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "4\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oTKMLQxqDZOM", + "colab_type": "text" + }, + "source": [ + "### 4 positive tests required for a >95% chance that the person is drunk" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nhOAsLjvCSLF", + "colab_type": "text" + }, + "source": [ + "### 2) In your own words, summarize the difference between Bayesian and Frequentist statistics\n", + "\n", + "If you're unsure where to start, check out [this blog post of Bayes theorem with Python](https://dataconomy.com/2015/02/introduction-to-bayes-theorem-with-python/)."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AADlIVfgDztY", + "colab_type": "text" + }, + "source": [ + "\n", + "\n", + "Bayesians combine prior beliefs with the observed data to get an updated (posterior) probability, whereas frequentists draw their conclusions only from the observations that are available, treating probability as a long-run frequency.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AeAGBxWjCTv3", + "colab_type": "text" + }, + "source": [ + "### 3) Use the following Template to help come up with ideas for your Build Sprint Project: \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YgIaBr__C5Q2", + "colab_type": "text" + }, + "source": [ + "---\n", + "\n", + "## Idea 1:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MwR6hTB1CiLR", + "colab_type": "text" + }, + "source": [ + "### You\n", + "What do you care about?\n", + "Something I am passionate about is health/fitness, bodybuilding, weightlifting, etc.\n", + "\n", + "What do you know about?\n", + "I know a decent amount about exercise science, nutrition, and pharmacology\n", + "\n", + "What decisions do you face?\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5owPwlSClm2", + "colab_type": "text" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lorLHSkzClzC", + "colab_type": "text" + }, + "source": [ + "### Seven templates\n", + "\n", + "In order to better understand the prompts here, please review the [Priceonomics Content Marketing Templates](https://priceonomics.com/introducing-priceonomics-content-marketing/)\n", + "\n", + "Can you apply the templates to your topics?\n", + "\n", + "**Geographic Variation:**\n", + "Number of gyms in each city\n", + "**Trend related to the news:**\n", + "Exercise trends post lockdown\n", + "**Who does that?:**\n", + "Which demographics exercise the most?\n", + "**Answering a question people care about:**\n", + "How prevalent is steroid use in the general population?\n", + "in sports?\n", +
"**Valuable to businesses:**\n", + "which gyms are the most successful?\n", + "**What's the most popular?:**\n", + "Most popular supplements\n", + "**Cost/Money rankings:**\n", + "Most profitable supplements or gym business models\n", + "Richest bodybuilders" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cjxeN9D7Cygt", + "colab_type": "text" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3OXCCIT6C_p0", + "colab_type": "text" + }, + "source": [ + "### Misconceptions\n", + "\n", + "What misconceptions do people have about your topic?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kZAkrgTUDCjF", + "colab_type": "text" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3tNrndE9DD3-", + "colab_type": "text" + }, + "source": [ + "### Examples\n", + "\n", + "What data storytelling example inspires you?\n", + "\n", + "Could you do a new hypothesis, for the same question?\n", + "\n", + "Could you do a new question, for the same topic?\n", + "\n", + "Could you do a new topic, with the same \"style\"?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-5NsWQy5DKoR", + "colab_type": "text" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9rPtMFB6DN7B", + "colab_type": "text" + }, + "source": [ + "###Data\n", + "\n", + "Where could you search for data about your topic?" 
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s9hqjOlhDTDa", + "colab_type": "text" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ld5MOaFaDVSK" + }, + "source": [ + "---\n", + "\n", + "## Idea 2:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "znhTWIlUDVSM" + }, + "source": [ + "### You\n", + "What do you care about?\n", + "video games\n", + "What do you know about?\n", + "\n", + "What decisions do you face?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3Gln5mYeDVSM" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "W-njBredDVSN" + }, + "source": [ + "### Seven templates\n", + "\n", + "In order to better understand the prompts here, please review the [Priceonomics Content Marketing Templates](https://priceonomics.com/introducing-priceonomics-content-marketing/)\n", + "\n", + "Can you apply the templates to your topics?\n", + "\n", + "**Geographic Variation:** Which regions play which games? How many people in those regions play them?\n", + "\n", + "**Trend related to the news:**\n", + "\n", + "**Who does that?:** Which demographics play the most games, and what kinds?\n", + "\n", + "**Answering a question people care about:**\n", + "\n", + "**Valuable to businesses:**\n", + "\n", + "**What's the most popular?:**\n", + "Most popular games? Most popular genres?\n", + "**Cost/Money rankings:**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "pkOqHHL3DVSN" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "njLQ3hLuDVSO" + }, + "source": [ + "### Misconceptions\n", + "\n", + "What misconceptions do people have about your topic?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "IzrUvaKeDVSO" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "L0X6X5WaDVSP" + }, + "source": [ + "### Examples\n", + "\n", + "What data storytelling example inspires you?\n", + "\n", + "Could you do a new hypothesis, for the same question?\n", + "\n", + "Could you do a new question, for the same topic?\n", + "\n", + "Could you do a new topic, with the same \"style\"?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "fDdHExa0DVSP" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "CxgSpiwYDVSQ" + }, + "source": [ + "###Data\n", + "\n", + "Where could you search for data about your topic?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uWgWjp3PQ3Sq", + "colab_type": "text" + }, + "source": [ + "## Resources" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QRgHqmYIQ9qn", + "colab_type": "text" + }, + "source": [ + "- [Worked example of Bayes rule calculation](https://en.wikipedia.org/wiki/Bayes'_theorem#Examples) (helpful as it fully breaks out the denominator)\n", + "- [Source code for mvsdist in scipy](https://github.com/scipy/scipy/blob/90534919e139d2a81c24bf08341734ff41a3db12/scipy/stats/morestats.py#L139)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GP7Jv1XvwtkX", + "colab_type": "text" + }, + "source": [ + "## Stretch Goals:\n", + "\n", + "- Go back and study the content from Modules 1 & 2 & 3 to make sure that you're really comfortable with them.\n", + "- Apply a Bayesian technique to a problem you previously worked (in an assignment or project work) on from a frequentist (standard) perspective\n", + "- Check out [PyMC3](https://docs.pymc.io/) (note this goes beyond hypothesis tests into modeling) - read the guides and work through some examples\n", + "- 
Take PyMC3 further - see if you can build something with it!" + ] + } + ] +} \ No newline at end of file From acd28b8e407c29743ffee1570cbf19d705974b46 Mon Sep 17 00:00:00 2001 From: badabad <69161193+badabad@users.noreply.github.com> Date: Tue, 18 Aug 2020 11:06:42 -0500 Subject: [PATCH 4/4] Created using Colaboratory --- ...21_Statistics_Probability_Assignment.ipynb | 1538 +++++++++++++++++ 1 file changed, 1538 insertions(+) create mode 100644 LS_DS_121_Statistics_Probability_Assignment.ipynb diff --git a/LS_DS_121_Statistics_Probability_Assignment.ipynb b/LS_DS_121_Statistics_Probability_Assignment.ipynb new file mode 100644 index 000000000..b580a42a1 --- /dev/null +++ b/LS_DS_121_Statistics_Probability_Assignment.ipynb @@ -0,0 +1,1538 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + }, + "colab": { + "name": "Copy of LS_DS_121_Statistics_Probability_Assignment.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sa5KWMO1ngPN", + "colab_type": "text" + }, + "source": [ + "\n", + "

\n", + "

\n", + "\n", + "## *Data Science Unit 1 Sprint 2 Assignment 1*\n", + "\n", + "# Apply the t-test to real data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gZDO3KBZOJNp", + "colab_type": "text" + }, + "source": [ + "## Practice 1-Sample T-tests\n", + "\n", + "One Sample t-tests determine whether or not a sample mean is statistically different from some known (or hypothesized) population mean. \n", + "\n", + "### 1) Load the Data\n", + "- Use the [automobile dataset](https://archive.ics.uci.edu/ml/datasets/Automobile)\n", + "- Fix the column headers\n", + "- Make sure NaNs are used to indicate missing values\n", + "\n", + "Feel free to add code cells and text cells as needed throughout the assignment." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cj-d0_8wWJjj", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 201 + }, + "outputId": "36c0519b-2f84-4db0-b9c3-e284477c776e" + }, + "source": [ + "### YOUR WORK HERE\n", + "import pandas as pd\n", + "import numpy as np\n", + "from scipy import stats\n", + "\n", + "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data\n" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "--2020-08-14 15:39:42-- https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data\n", + "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n", + "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 25936 (25K) [application/x-httpd-php]\n", + "Saving to: ‘imports-85.data’\n", + "\n", + "imports-85.data 100%[===================>] 25.33K --.-KB/s in 0.1s \n", + "\n", + "2020-08-14 15:39:42 (183 KB/s) - ‘imports-85.data’ saved [25936/25936]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "94yTDfALXBG9", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 278 + }, + "outputId": "d902a7b4-ce31-47e4-9c9b-de99e108f599" + }, + "source": [ + "auto = pd.read_csv('imports-85.data')\n", + "auto.columns = [\"symboling\", 'normalized-losses', 'make', 'fuel-type', 'aspiration' , 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', \n", + " 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', \n", + " 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']\n", + "auto=auto.replace('?', np.nan)\n", + "\n", + "auto.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>symboling</th><th>normalized-losses</th><th>make</th><th>fuel-type</th><th>aspiration</th><th>num-of-doors</th><th>body-style</th><th>drive-wheels</th><th>engine-location</th><th>wheel-base</th><th>length</th><th>width</th><th>height</th><th>curb-weight</th><th>engine-type</th><th>num-of-cylinders</th><th>engine-size</th><th>fuel-system</th><th>bore</th><th>stroke</th><th>compression-ratio</th><th>horsepower</th><th>peak-rpm</th><th>city-mpg</th><th>highway-mpg</th><th>price</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>3</td><td>NaN</td><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>convertible</td><td>rwd</td><td>front</td><td>88.6</td><td>168.8</td><td>64.1</td><td>48.8</td><td>2548</td><td>dohc</td><td>four</td><td>130</td><td>mpfi</td><td>3.47</td><td>2.68</td><td>9.0</td><td>111</td><td>5000</td><td>21</td><td>27</td><td>16500</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>NaN</td><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>hatchback</td><td>rwd</td><td>front</td><td>94.5</td><td>171.2</td><td>65.5</td><td>52.4</td><td>2823</td><td>ohcv</td><td>six</td><td>152</td><td>mpfi</td><td>2.68</td><td>3.47</td><td>9.0</td><td>154</td><td>5000</td><td>19</td><td>26</td><td>16500</td></tr>\n",
+ "<tr><th>2</th><td>2</td><td>164</td><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>99.8</td><td>176.6</td><td>66.2</td><td>54.3</td><td>2337</td><td>ohc</td><td>four</td><td>109</td><td>mpfi</td><td>3.19</td><td>3.40</td><td>10.0</td><td>102</td><td>5500</td><td>24</td><td>30</td><td>13950</td></tr>\n",
+ "<tr><th>3</th><td>2</td><td>164</td><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>4wd</td><td>front</td><td>99.4</td><td>176.6</td><td>66.4</td><td>54.3</td><td>2824</td><td>ohc</td><td>five</td><td>136</td><td>mpfi</td><td>3.19</td><td>3.40</td><td>8.0</td><td>115</td><td>5500</td><td>18</td><td>22</td><td>17450</td></tr>\n",
+ "<tr><th>4</th><td>2</td><td>NaN</td><td>audi</td><td>gas</td><td>std</td><td>two</td><td>sedan</td><td>fwd</td><td>front</td><td>99.8</td><td>177.3</td><td>66.3</td><td>53.1</td><td>2507</td><td>ohc</td><td>five</td><td>136</td><td>mpfi</td><td>3.19</td><td>3.40</td><td>8.5</td><td>110</td><td>5500</td><td>19</td><td>25</td><td>15250</td></tr>\n",
+ "</tbody>\n",
+ "</table>
\n", + "
" + ], + "text/plain": [ + " symboling normalized-losses make ... city-mpg highway-mpg price\n", + "0 3 NaN alfa-romero ... 21 27 16500\n", + "1 1 NaN alfa-romero ... 19 26 16500\n", + "2 2 164 audi ... 24 30 13950\n", + "3 2 164 audi ... 18 22 17450\n", + "4 2 NaN audi ... 19 25 15250\n", + "\n", + "[5 rows x 26 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FD-Pk07-Z-xk", + "colab_type": "text" + }, + "source": [ + "### 2) Pretend that this dataset represents the cars at a used car lot in your local town. \n", + "\n", + "- Use df.sample() to pick a random sample of 10 cars. Note that because this sample is random we are going to set the `random_state` so that all of us in the class get the same random sample. Please set your random state to `30` when using `df.sample()`\n", + "\n", + "Is your sample reflective of the population value with regard to highway-mpg? Find the mean for `highway-mpg` for the entire dataset and use a 1-sample t-test to compare your estimated sample mean to the population mean. Can you say that they are different? " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0xXMbIM9rU09", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + }, + "outputId": "6cc6cd6b-eacf-47eb-a4b5-e5b806f8a04f" + }, + "source": [ + "auto.sample(n=10, random_state=30)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>symboling</th><th>normalized-losses</th><th>make</th><th>fuel-type</th><th>aspiration</th><th>num-of-doors</th><th>body-style</th><th>drive-wheels</th><th>engine-location</th><th>wheel-base</th><th>length</th><th>width</th><th>height</th><th>curb-weight</th><th>engine-type</th><th>num-of-cylinders</th><th>engine-size</th><th>fuel-system</th><th>bore</th><th>stroke</th><th>compression-ratio</th><th>horsepower</th><th>peak-rpm</th><th>city-mpg</th><th>highway-mpg</th><th>price</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>146</th><td>0</td><td>89</td><td>subaru</td><td>gas</td><td>std</td><td>four</td><td>wagon</td><td>fwd</td><td>front</td><td>97.0</td><td>173.5</td><td>65.4</td><td>53.0</td><td>2455</td><td>ohcf</td><td>four</td><td>108</td><td>mpfi</td><td>3.62</td><td>2.64</td><td>9.0</td><td>94</td><td>5200</td><td>25</td><td>31</td><td>10198</td></tr>\n",
+ "<tr><th>67</th><td>-1</td><td>93</td><td>mercedes-benz</td><td>diesel</td><td>turbo</td><td>four</td><td>wagon</td><td>rwd</td><td>front</td><td>110.0</td><td>190.9</td><td>70.3</td><td>58.7</td><td>3750</td><td>ohc</td><td>five</td><td>183</td><td>idi</td><td>3.58</td><td>3.64</td><td>21.5</td><td>123</td><td>4350</td><td>22</td><td>25</td><td>28248</td></tr>\n",
+ "<tr><th>26</th><td>1</td><td>148</td><td>dodge</td><td>gas</td><td>turbo</td><td>NaN</td><td>sedan</td><td>fwd</td><td>front</td><td>93.7</td><td>157.3</td><td>63.8</td><td>50.6</td><td>2191</td><td>ohc</td><td>four</td><td>98</td><td>mpfi</td><td>3.03</td><td>3.39</td><td>7.6</td><td>102</td><td>5500</td><td>24</td><td>30</td><td>8558</td></tr>\n",
+ "<tr><th>155</th><td>0</td><td>91</td><td>toyota</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>95.7</td><td>166.3</td><td>64.4</td><td>53.0</td><td>2081</td><td>ohc</td><td>four</td><td>98</td><td>2bbl</td><td>3.19</td><td>3.03</td><td>9.0</td><td>70</td><td>4800</td><td>30</td><td>37</td><td>6938</td></tr>\n",
+ "<tr><th>135</th><td>3</td><td>150</td><td>saab</td><td>gas</td><td>turbo</td><td>two</td><td>hatchback</td><td>fwd</td><td>front</td><td>99.1</td><td>186.6</td><td>66.5</td><td>56.1</td><td>2808</td><td>dohc</td><td>four</td><td>121</td><td>mpfi</td><td>3.54</td><td>3.07</td><td>9.0</td><td>160</td><td>5500</td><td>19</td><td>26</td><td>18150</td></tr>\n",
+ "<tr><th>24</th><td>1</td><td>148</td><td>dodge</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>93.7</td><td>157.3</td><td>63.8</td><td>50.6</td><td>1989</td><td>ohc</td><td>four</td><td>90</td><td>2bbl</td><td>2.97</td><td>3.23</td><td>9.4</td><td>68</td><td>5500</td><td>31</td><td>38</td><td>6692</td></tr>\n",
+ "<tr><th>154</th><td>0</td><td>91</td><td>toyota</td><td>gas</td><td>std</td><td>four</td><td>wagon</td><td>4wd</td><td>front</td><td>95.7</td><td>169.7</td><td>63.6</td><td>59.1</td><td>3110</td><td>ohc</td><td>four</td><td>92</td><td>2bbl</td><td>3.05</td><td>3.03</td><td>9.0</td><td>62</td><td>4800</td><td>27</td><td>32</td><td>8778</td></tr>\n",
+ "<tr><th>195</th><td>-2</td><td>103</td><td>volvo</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>rwd</td><td>front</td><td>104.3</td><td>188.8</td><td>67.2</td><td>56.2</td><td>2935</td><td>ohc</td><td>four</td><td>141</td><td>mpfi</td><td>3.78</td><td>3.15</td><td>9.5</td><td>114</td><td>5400</td><td>24</td><td>28</td><td>15985</td></tr>\n",
+ "<tr><th>132</th><td>2</td><td>104</td><td>saab</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>99.1</td><td>186.6</td><td>66.5</td><td>56.1</td><td>2695</td><td>ohc</td><td>four</td><td>121</td><td>mpfi</td><td>3.54</td><td>3.07</td><td>9.3</td><td>110</td><td>5250</td><td>21</td><td>28</td><td>12170</td></tr>\n",
+ "<tr><th>194</th><td>-1</td><td>74</td><td>volvo</td><td>gas</td><td>std</td><td>four</td><td>wagon</td><td>rwd</td><td>front</td><td>104.3</td><td>188.8</td><td>67.2</td><td>57.5</td><td>3034</td><td>ohc</td><td>four</td><td>141</td><td>mpfi</td><td>3.78</td><td>3.15</td><td>9.5</td><td>114</td><td>5400</td><td>23</td><td>28</td><td>13415</td></tr>\n",
+ "</tbody>\n",
+ "</table>
\n", + "
" + ], + "text/plain": [ + " symboling normalized-losses make ... city-mpg highway-mpg price\n", + "146 0 89 subaru ... 25 31 10198\n", + "67 -1 93 mercedes-benz ... 22 25 28248\n", + "26 1 148 dodge ... 24 30 8558\n", + "155 0 91 toyota ... 30 37 6938\n", + "135 3 150 saab ... 19 26 18150\n", + "24 1 148 dodge ... 31 38 6692\n", + "154 0 91 toyota ... 27 32 8778\n", + "195 -2 103 volvo ... 24 28 15985\n", + "132 2 104 saab ... 21 28 12170\n", + "194 -1 74 volvo ... 23 28 13415\n", + "\n", + "[10 rows x 26 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G5vU1WIvoISQ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "bfbac652-432d-4b81-85f1-e33632ed798a" + }, + "source": [ + "auto['highway-mpg'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "30.769607843137255" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6qXiJGTwdG2N", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "2feb74c5-8f91-4b9b-cc29-306f9ebf3760" + }, + "source": [ + "### YOUR WORK HERE\n", + "ah = auto.sample(n=10,random_state=30)['highway-mpg']\n", + "auto.sample(n=10, random_state=30)['highway-mpg'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "30.3" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ss7QTFfao9_8", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "f15ce6e7-b172-4129-bf1e-6ef6e91c5c09" + }, + "source": [ + "auto['highway-mpg'].mean()" + ], + "execution_count": null, + "outputs": [ + { + 
"output_type": "execute_result", + "data": { + "text/plain": [ + "30.769607843137255" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y_IeXmU4p0at", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "8166f3cd-5858-48e9-89e2-38703e4949be" + }, + "source": [ + "stats.ttest_1samp(ah,auto['highway-mpg'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_1sampResult(statistic=-0.3415894425589548, pvalue=0.7405001258873907)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6teLk_dRpdxc", + "colab_type": "text" + }, + "source": [ + "### Although the sample mean (30.3) and the full-dataset mean highway mpg (30.77) are slightly different, they are close enough that the sample is a good representation of the population\n", + "\n", + "## Based on the 1-sample t-test (p = 0.74), I fail to reject the null hypothesis that the sample mean equals the population mean" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oxsx9cN7dUPp", + "colab_type": "text" + }, + "source": [ + "The salesman says the cars he sells typically have a fuel efficiency of about 35 miles per gallon on the highway. You want to verify his claim but can only test 10 cars. Using your sample of 10, test his claim and report your results." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6pXyclB9ZXCn", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "fc578f79-c310-43c1-f0c6-e66f25580120" + }, + "source": [ + "### YOUR WORK HERE\n", + "stats.ttest_1samp(ah,35)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_1sampResult(statistic=-3.4187469470305465, pvalue=0.007643069993182772)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "12bRFpQcuP8c", + "colab_type": "text" + }, + "source": [ + "## The p-value is 0.0076: if the true mean were 35 highway mpg, a sample mean at least this far from 35 would occur only about 0.76% of the time. I reject the salesman's claim at the 1% significance level" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yAkyHaEnbNYa", + "colab_type": "text" + }, + "source": [ + "Imagine that you now have the capacity to test 100 cars on the lot. Using the same random state of `30`, take a sample of 100 cars. Run a t-test to verify the salesman's claim with your new larger sample. Do you reach the same conclusion as you did with the sample size of 10? 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y3H81uMubpLh", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "8a7a08ad-f8be-41fb-f5d6-7539347ee3f7" + }, + "source": [ + "### YOUR WORK HERE\n", + "ah2 = auto.sample(n=100,random_state=30)['highway-mpg']\n", + "stats.ttest_1samp(ah2,35)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_1sampResult(statistic=-6.4827186446460106, pvalue=3.5481090083532426e-09)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gAvI8kv6u2B1", + "colab_type": "text" + }, + "source": [ + "## After repeating the t-test with the larger sample, the p-value is even smaller (about 3.5e-09), so the evidence against the salesman's claim is stronger. I still reject his claim of 35 highway mpg" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sDO-ufMDfpVA", + "colab_type": "text" + }, + "source": [ + "Why might these two t-tests using the same dataset lead to different conclusions about the salesman's claim?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8H6-cStNfpwS", + "colab_type": "text" + }, + "source": [ + "## The two samples come from the same dataset but differ in size. A larger sample gives a more precise estimate of the mean (smaller standard error), so a given gap from 35 yields a larger t-statistic and a smaller p-value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TPT-ZTt0PuIk", + "colab_type": "text" + }, + "source": [ + "## Practice 2-Sample T-tests\n", + "\n", + "Two Sample t-tests determine whether or not two sample means are statistically different from each other. 
\n", + "\n", + "This portion of your assignment requires you to determine which issues have \"statistically significant\" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). 
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PvFVLwekQxLH", + "colab_type": "text" + }, + "source": [ + "### 1) Load the data\n", + "\n", + "The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!\n", + "\n", + "- Read the dataset in from UCI, you'll need to provide a list of column headers\n", + "- Encode \"yes\" votes as 1 and \"no\" votes as 0. (You can use `df.replace()` to swap out these values)\n", + "- Use dataframe filtering to split the dataframe into two new dataframes based on party. Hold all republicans in one dataframe and all democrats in the other. These will be our two different \"samples.\"\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "__qLGHt5fXvU", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + }, + "outputId": "d436e133-bc18-4a02-ca70-d8a1098d7374" + }, + "source": [ + "### YOUR WORK HERE\n", + "df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data')\n", + "df.columns=('party', 'handicapped infants', 'water project', 'adoption of resolution', 'physician fee freeze', 'el salvador aid', 'religious groups in school', 'anti satellite test', 'aid to nicaraguan-contras',\n", + " 'mx-miscle', 'immigration', 'synfuels-corp-cuts', 'education spending', 'superfund sue', 'crime', 'duty free imports', 'export admin act SA')\n", + "df=df.replace(\"n\",0)\n", + "df=df.replace(\"y\",1)\n", + "df=df.replace(\"?\",np.nan)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>party</th><th>handicapped infants</th><th>water project</th><th>adoption of resolution</th><th>physician fee freeze</th><th>el salvador aid</th><th>religious groups in school</th><th>anti satellite test</th><th>aid to nicaraguan-contras</th><th>mx-miscle</th><th>immigration</th><th>synfuels-corp-cuts</th><th>education spending</th><th>superfund sue</th><th>crime</th><th>duty free imports</th><th>export admin act SA</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>republican</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>NaN</td></tr>\n",
+ "<tr><th>1</th><td>democrat</td><td>NaN</td><td>1.0</td><td>1.0</td><td>NaN</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
+ "<tr><th>2</th><td>democrat</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>NaN</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td></tr>\n",
+ "<tr><th>3</th><td>democrat</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>NaN</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
+ "<tr><th>4</th><td>democrat</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>
\n", + "
" + ], + "text/plain": [ + " party handicapped infants ... duty free imports export admin act SA\n", + "0 republican 0.0 ... 0.0 NaN\n", + "1 democrat NaN ... 0.0 0.0\n", + "2 democrat 0.0 ... 0.0 1.0\n", + "3 democrat 1.0 ... 1.0 1.0\n", + "4 democrat 0.0 ... 1.0 1.0\n", + "\n", + "[5 rows x 17 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3ToxAOXCyemp", + "colab_type": "code", + "colab": {} + }, + "source": [ + "republican=df[df['party']=='republican']\n", + "democrat=df[df['party']=='democrat']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f20v73e3RLAV", + "colab_type": "text" + }, + "source": [ + "### 2) Perform two sample T-tests on different issues and report the results.\n", + "\n", + "- Find an issue that democrats support more than republicans with p < 0.01 (significant at the 99% level).\n", + "- Find an issue that republicans support more than democrats with p < 0.01 (significant at the 99% level).\n", + "- Find an issue where the difference between republicans and democrats has p > 0.1 (Not significant at the 90% level - i.e. there may not be much of a difference between the two sample means)\n", + "\n", + "For each test that you run, please state your null and alternative hypotheses and then write a conclusion reflecting on them.\n", + "\n", + "Remember that two-sample t-tests will only tell us if the two groups are *different* from one another. We'll have to look at their sample means directly or use the sign on the t-statistic to know which of the two sample means is larger. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bCE3UgpbP69p", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "ad1416ef-6a39-48b9-cca6-7af1109771fa" + }, + "source": [ + "### YOUR WORK HERE\n", + "stats.ttest_ind(democrat['adoption of resolution'], republican['adoption of resolution'], nan_policy = 'omit')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=23.12119107755175, pvalue=6.013425749068062e-77)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2A6zbMxG3e99", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "88a10423-2f41-4d85-ea05-036d32b0d311" + }, + "source": [ + "democrat['adoption of resolution'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8884615384615384" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zpH0lWfP3fRv", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "39a7f4b0-5d7f-41d0-df8b-cb70b80a5d63" + }, + "source": [ + "republican['adoption of resolution'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.13496932515337423" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "p2xsu17J3wmf", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "97e630ad-d5f9-4914-f655-2453b2641e6f" + }, + "source": [ + "stats.ttest_ind(democrat['education spending'], republican['education spending'], 
nan_policy = 'omit')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ttest_indResult(statistic=-20.414298768685285, pvalue=4.967619782338976e-64)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_eDyT4MC34NQ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "a4ad00a0-3ebd-4ac2-bf99-bb3d46cc237d" + }, + "source": [ + "democrat['education spending'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.14457831325301204" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZhRgert734Uf", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 33 + }, + "outputId": "48aab007-0396-47fb-e28d-4dc75496c2bf" + }, + "source": [ + "republican['education spending'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8701298701298701" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sd-Qojly3-k-", + "colab_type": "text" + }, + "source": [ + "### For each issue, the null hypothesis is that democrats and republicans support it at the same rate, and the alternative is that the rates differ. Both nulls are rejected at p < 0.01. Democrats support the adoption of the budget resolution significantly more (mean 0.89 vs 0.13, t = 23.12, p = 6.0e-77). 
Republicans support education spending significantly more (mean 0.87 vs 0.14, t = -20.41, p = 5.0e-64)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C5pkmOuKgK9Y", + "colab_type": "text" + }, + "source": [ + "## Stretch Goals:\n", + "\n", + "### 1) Use functions and some iterator (for loop, .apply(), list comprehension, etc.) to perform two sample t-tests on every issue in the dataset in an automated fashion." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "z7c7K322gLeN", + "colab_type": "code", + "colab": {} + }, + "source": [ + "### YOUR WORK HERE" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wcA_AvvRAqW_", + "colab_type": "text" + }, + "source": [ + "\n", + "\n", + "### 2) Work on Performing a T-test without using Scipy in order to get \"under the hood\" and learn more thoroughly about this topic.\n", + "### Start with a 1-sample t-test\n", + " - Establish the conditions for your test \n", + " - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).\n", + " - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)\n", + "\n", + "### Be sure to check your work using Scipy!\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3pdMy-KoPjoM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "### YOUR WORK HERE" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3114xDdlPjFx", + "colab_type": "text" + }, + "source": [ + "### 3) Then try a 2-sample t-test\n", + " - Establish the conditions for your test \n", + " - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).\n", + " - Translate that t-statistic into a P-value. 
You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)\n", + "\n", + " ### Be sure to check your work using Scipy!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dPmXOHh1Cfea", + "colab_type": "code", + "colab": {} + }, + "source": [ + "### YOUR WORK HERE" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file
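The 1-sample stretch goal in the notebook above (calculate the t statistic, then translate it into a p-value without Scipy) could be sketched roughly as follows. This is only an illustrative sketch using the Python standard library: the ten `highway-mpg` values are transcribed from the sampled rows shown in the notebook output, the helper names (`t_statistic`, `t_pdf`, `two_sided_p_value`) are made up for this example, and the p-value is approximated by numerically integrating the Student's t density with Simpson's rule rather than by looking it up in a table.

```python
import math
import statistics

# highway-mpg values of the 10 sampled cars (random_state=30), transcribed
# from the notebook output; their mean is 30.3, matching the notebook.
sample = [31, 25, 30, 37, 26, 38, 32, 28, 28, 28]
claimed_mean = 35  # the salesman's claim


def t_statistic(values, mu):
    """One-sample t statistic: (sample mean - mu) / (s / sqrt(n))."""
    n = len(values)
    # statistics.stdev uses the n - 1 (sample) denominator
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu) / standard_error


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """Approximate two-sided p-value using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3      # P(0 <= T <= |t|)
    return 2 * (0.5 - area)   # probability mass in both tails


t = t_statistic(sample, claimed_mean)
p = two_sided_p_value(t, df=len(sample) - 1)
print(f"t = {t:.4f}, p = {p:.4f}")
```

The printed values can be compared against the `stats.ttest_1samp(ah,35)` output recorded in the notebook (statistic about -3.4187, pvalue about 0.0076), which is the "check your work using Scipy" step.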