diff --git a/notebooks/3.2 Exploratory data analysis II and working with texts.ipynb b/notebooks/3.2 Exploratory data analysis II and working with texts.ipynb deleted file mode 100644 index 3e279e0..0000000 --- a/notebooks/3.2 Exploratory data analysis II and working with texts.ipynb +++ /dev/null @@ -1,2564 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 3.2 Exploratory data analysis and working with texts\n", - "\n", - "In this notebook, we learn about:\n", - "1. descriptive statistics to explore data;\n", - "2. working with texts (hints)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Part 1: descriptive statistics\n", - "\n", - "*The goal of exploratory data analysis is to develop an understanding of your data. EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.* \n", - "\n", - "Key questions:\n", - "* Which kind of variation occurs within variables?\n", - "* Which kind of co-variation occurs between variables?\n", - "\n", - "https://r4ds.had.co.nz/exploratory-data-analysis.html" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# imports\n", - "\n", - "import os, codecs\n", - "import pandas as pd\n", - "import numpy as np\n", - "import seaborn as sns\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Import the dataset\n", - "Let us import the Venetian apprenticeship contracts dataset in memory." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "root_folder = \"../data/apprenticeship_venice/\"\n", - "df_contracts = pd.read_csv(codecs.open(os.path.join(root_folder,\"professions_data.csv\"), encoding=\"utf8\"), sep=\";\")\n", - "df_professions = pd.read_csv(codecs.open(os.path.join(root_folder,\"professions_classification.csv\"), encoding=\"utf8\"), sep=\",\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's take another look at the dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "<class 'pandas.core.frame.DataFrame'>\n", - "RangeIndex: 9653 entries, 0 to 9652\n", - "Data columns (total 47 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 page_title 9653 non-null object \n", - " 1 register 9653 non-null object \n", - " 2 annual_salary 7870 non-null float64\n", - " 3 a_profession 9653 non-null object \n", - " 4 profession_code_strict 9618 non-null object \n", - " 5 profession_code_gen 9614 non-null object \n", - " 6 profession_cat 9597 non-null object \n", - " 7 corporation 9350 non-null object \n", - " 8 keep_profession_a 9653 non-null int64 \n", - " 9 complete_profession_a 9653 non-null int64 \n", - " 10 enrolmentY 9628 non-null float64\n", - " 11 enrolmentM 9631 non-null float64\n", - " 12 startY 9533 non-null float64\n", - " 13 startM 9539 non-null float64\n", - " 14 length 9645 non-null float64\n", - " 15 has_fled 9653 non-null int64 \n", - " 16 m_profession 9535 non-null object \n", - " 17 m_profession_code_strict 9508 non-null object \n", - " 18 m_profession_code_gen 9506 non-null object \n", - " 19 m_profession_cat 9489 non-null object \n", - " 20 m_corporation 9276 non-null object \n", - " 21 keep_profession_m 9653 non-null int64 \n", - " 22 complete_profession_m 9653 non-null int64 \n", - " 23 m_gender 9554 non-null float64\n", 
- " 24 m_name 9623 non-null object \n", - " 25 m_surname 6960 non-null object \n", - " 26 m_patronimic 2620 non-null object \n", - " 27 m_atelier 1434 non-null object \n", - " 28 m_coords 9639 non-null object \n", - " 29 a_name 9653 non-null object \n", - " 30 a_age 9303 non-null float64\n", - " 31 a_gender 9522 non-null float64\n", - " 32 a_geo_origins 7149 non-null object \n", - " 33 a_geo_origins_std 4636 non-null object \n", - " 34 a_coords 9610 non-null object \n", - " 35 a_quondam 7848 non-null float64\n", - " 36 accommodation_master 9653 non-null int64 \n", - " 37 personal_care_master 9653 non-null int64 \n", - " 38 clothes_master 9653 non-null int64 \n", - " 39 generic_expenses_master 9653 non-null int64 \n", - " 40 salary_in_kind_master 9653 non-null int64 \n", - " 41 pledge_goods_master 9653 non-null int64 \n", - " 42 pledge_money_master 9653 non-null int64 \n", - " 43 salary_master 9653 non-null int64 \n", - " 44 female_guarantor 9653 non-null int64 \n", - " 45 period_cat 7891 non-null float64\n", - " 46 incremental_salary 9653 non-null int64 \n", - "dtypes: float64(11), int64(15), object(21)\n", - "memory usage: 3.5+ MB\n" - ] - } - ], - "source": [ - "df_contracts.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table rendering omitted: first 3 rows × 47 columns of df_contracts, identical to the text/plain output that follows]
" - ], - "text/plain": [ - " page_title \\\n", - "0 Carlo Della sosta (Orese) 1592-08-03 \n", - "1 Antonio quondam Andrea (squerariol) 1583-01-09 \n", - "2 Cristofollo di Zuane (batioro in carta) 1591-0... \n", - "\n", - " register annual_salary \\\n", - "0 asv, giustizia vecchia, accordi dei garzoni, 1... NaN \n", - "1 asv, giustizia vecchia, accordi dei garzoni, 1... 12.5 \n", - "2 asv, giustizia vecchia, accordi dei garzoni, 1... NaN \n", - "\n", - " a_profession profession_code_strict profession_code_gen \\\n", - "0 orese orese orefice \n", - "1 squerariol squerariol lavori allo squero \n", - "2 batioro batioro battioro \n", - "\n", - " profession_cat corporation \\\n", - "0 orefice Oresi \n", - "1 lavori allo squero Squerarioli \n", - "2 fabbricatore di foglie/fili/cordelle d'oro o a... Battioro \n", - "\n", - " keep_profession_a complete_profession_a ... personal_care_master \\\n", - "0 1 1 ... 1 \n", - "1 1 1 ... 0 \n", - "2 1 1 ... 0 \n", - "\n", - " clothes_master generic_expenses_master salary_in_kind_master \\\n", - "0 1 1 0 \n", - "1 0 1 0 \n", - "2 0 0 0 \n", - "\n", - " pledge_goods_master pledge_money_master salary_master female_guarantor \\\n", - "0 0 0 0 0 \n", - "1 0 0 1 0 \n", - "2 0 0 0 0 \n", - "\n", - " period_cat incremental_salary \n", - "0 NaN 0 \n", - "1 1.0 0 \n", - "2 NaN 0 \n", - "\n", - "[3 rows x 47 columns]" - ] - }, - "execution_count": 80, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_contracts.head(3)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['page_title', 'register', 'annual_salary', 'a_profession',\n", - " 'profession_code_strict', 'profession_code_gen', 'profession_cat',\n", - " 'corporation', 'keep_profession_a', 'complete_profession_a',\n", - " 'enrolmentY', 'enrolmentM', 'startY', 'startM', 'length', 'has_fled',\n", - " 'm_profession', 'm_profession_code_strict', 'm_profession_code_gen',\n", - " 
'm_profession_cat', 'm_corporation', 'keep_profession_m',\n", - " 'complete_profession_m', 'm_gender', 'm_name', 'm_surname',\n", - " 'm_patronimic', 'm_atelier', 'm_coords', 'a_name', 'a_age', 'a_gender',\n", - " 'a_geo_origins', 'a_geo_origins_std', 'a_coords', 'a_quondam',\n", - " 'accommodation_master', 'personal_care_master', 'clothes_master',\n", - " 'generic_expenses_master', 'salary_in_kind_master',\n", - " 'pledge_goods_master', 'pledge_money_master', 'salary_master',\n", - " 'female_guarantor', 'period_cat', 'incremental_salary'],\n", - " dtype='object')" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_contracts.columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Every row represents an apprenticeship contract. Contracts were registered both at the guild and at a public office. This is a sample of contracts from a much larger set of records.\n", - "\n", - "Some of the variables we will work with are:\n", - "* `annual_salary`: the annual salary paid to the apprentice, if any (in Venetian ducats).\n", - "* `a_profession` to `corporation`: increasingly generic classifications for the apprentice's stated profession.\n", - "* `startY` and `enrolmentY`: contract start and registration year, respectively.\n", - "* `length`: length of the contract, in years.\n", - "* `m_gender` and `a_gender`: gender of the master and of the apprentice, respectively.\n", - "* `a_age`: age of the apprentice at entry, in years.\n", - "* `female_guarantor`: whether at least one of the contract's guarantors was female (boolean, stored as 0/1)." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table rendering omitted: first 3 rows of df_professions (columns Trascrizione, Standard, Gruppo 0 to Gruppo 4, Corporazione), identical to the text/plain output that follows]
" - ], - "text/plain": [ - " Trascrizione Standard \\\n", - "0 al negotio del libraro librer \n", - "1 arte de far arpicordi arte de far arpicordi \n", - "2 arte de' colori arte dei colori \n", - "\n", - " Gruppo 0 Gruppo 1 \\\n", - "0 libraio librai - diverse specializzazioni \n", - "1 fabbricatore di arpicordi fabbricatore di strumenti musicali \n", - "2 fabbricazione/vendita di colori colori \n", - "\n", - " Gruppo 2 Gruppo 3 Gruppo 4 \\\n", - "0 stampa altre lavorazioni manifatturiere beni \n", - "1 musica altri servizi servizi \n", - "2 colori decorazioni e mestieri dell'arte beni \n", - "\n", - " Corporazione \n", - "0 libreri, stampatori e ligadori \n", - "1 NaN \n", - "2 spezieri " - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_professions.head(3)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The professions data frame contains a classification system for each profession as found in the records (transcription, first column). The last column is the guild (or corporation) that governed the given profession. This work was performed manually by historians. We don't use it here, as the classifications we need are already part of the main data frame." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Questions\n", - "\n", - "* Plot the distribution (histogram) of the apprentices' age, contract length, annual salary and start year.\n", - "* Calculate the proportion of female apprentices and masters, and of contracts with a female guarantor.\n", - "* How likely is it for a female apprentice to have a female master? And for a male apprentice?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "salaries_male_guarantor = df_contracts[df_contracts.female_guarantor == 0].annual_salary" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "salaries_female_guarantor = df_contracts[df_contracts.female_guarantor == 1].annual_salary" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "[base64-encoded PNG omitted: overlaid histograms of annual_salary for contracts without and with a female guarantor]\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "salaries_male_guarantor.hist()\n", - "salaries_female_guarantor.hist()" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "[base64-encoded PNG omitted: histogram of annual_salary with 100 bins]", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "df_contracts.annual_salary.hist(bins=100)" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAD4CAYAAADo30HgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVkklEQVR4nO3df4wc91nH8feD3Qa318YOoYexLRyQVUhifsSnECit7pRCTBPVARHkKlAHgqyitATkSnWoRPnHwoCC1BJSZHBUl0S9mjQlJqmhkclRIdUJcUh7cdw0LjGpE9eGNkl7JQpcePhjx7Df697d7s7u3l38fkmnnf3Od2aenZ3bz82PnYvMRJKks75roQuQJC0uBoMkqWAwSJIKBoMkqWAwSJIKyxe6gPlceOGFuX79+rb7f/vb3+b1r399/wqqyfrqsb56rK+epVTfkSNH/iMzv7erGWXmov7ZtGlTduLBBx/sqP+gWV891leP9dWzlOoDHskuP3c9lCRJKhgMkqSCwSBJKhgMkqSCwSBJKhgMkqSCwSBJKhgMkqSCwSBJKiz6W2K8mqzfeT87Nk5zw877257mxO6r+1iRJH0n9xgkSQWDQZJUMBgkSQWDQZJUMBgkSQWDQZJUMBgkSQWDQZJUMBgkSQWDQZJUmDcYIuKOiDgTEY83tf1xRHwpIr4YEZ+OiJVN426JiOMR8WREXNXUvikiJqtxH4mI6PmrkSTV1s4ew8eAzTPaHgAuzcwfBb4M3AIQERcDW4FLqmluj4hl1TQfBbYDG6qfmfOUJC0C8wZDZn4O+MaMts9m5nT19DCwthreAoxn5suZ+TRwHLg8IlYDb8zMz2dmAh8Hru3Ra5Ak9VA0Pqfn6RSxHrgvMy9tMe5vgU9m5p0RcRtwODPvrMbtBQ4CJ4Ddmfn2qv2twAcy85pZlredxt4Fw8PDm8bHx9t+QVNTUwwNDbXdf5Amn32R4RVw+qX2p9m45vz+FdTCYl5/YH11WV89S6m+sbGxI5k50s18at12OyI+CEwDd51tatEt52hvKTP3AHsARkZGcnR0tO2aJiYm6KT/IN1Q3Xb71sn2V/uJ60f7V1ALi3n9gfXVZX31nCv1dR0MEbENuAa4Mv9/t+MksK6p21rguap9bYt2SdIi09XlqhGxGfgA8M7M/M+mUQeArRFxXkRcROMk88OZeQr4VkRcUV2N9G7g3pq1S5L6YN49hoj4BDAKXBgRJ4EP0bgK6Tzggeqq08OZ+Z7MPBoR+4EnaBxiuikzX6lm9Zs0rnBaQeO8w8HevhRJUi/MGwyZ+a4WzXvn6L8L2NWi/RHgO05eS5IWF7/5LEkq1LoqSYvP+p33d9T/xO6r+1SJpKXKPQZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUsFgkCQVDAZJUmHeYIiIOyLiTEQ83tR2QUQ8EBFPVY+rmsbdEhHHI+LJiLiqqX1TRExW4z4SEdH7lyNJqqudPYaPAZtntO0EDmXmBuBQ9
ZyIuBjYClxSTXN7RCyrpvkosB3YUP3MnKckaRGYNxgy83PAN2Y0bwH2VcP7gGub2scz8+XMfBo4DlweEauBN2bm5zMzgY83TSNJWkSi8Tk9T6eI9cB9mXlp9fyFzFzZNP75zFwVEbcBhzPzzqp9L3AQOAHszsy3V+1vBT6QmdfMsrztNPYuGB4e3jQ+Pt72C5qammJoaKjt/oM0+eyLDK+A0y+1P83GNed3vIxOzJz/Yl5/YH11WV89S6m+sbGxI5k50s18lve0Kmh13iDnaG8pM/cAewBGRkZydHS07QImJibopP8g3bDzfnZsnObWyfZX+4nrRzteRidmzn8xrz+wvrqsr55zpb5ur0o6XR0eono8U7WfBNY19VsLPFe1r23RLklaZLoNhgPAtmp4G3BvU/vWiDgvIi6icZL54cw8BXwrIq6orkZ6d9M0kqRFZN5jGhHxCWAUuDAiTgIfAnYD+yPiRuAZ4DqAzDwaEfuBJ4Bp4KbMfKWa1W/SuMJpBY3zDgd7+kokST0xbzBk5rtmGXXlLP13AbtatD8CXNpRdZKkgfObz5KkQq+vStISs37GVUw7Nk7PeWXTid1X97skSQvMPQZJUsFgkCQVDAZJUsFzDDXMPD4vSa8G7jFIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgq1giEificijkbE4xHxiYj47oi4ICIeiIinqsdVTf1viYjjEfFkRFxVv3xJUq91HQwRsQb4LWAkMy8FlgFbgZ3AoczcAByqnhMRF1fjLwE2A7dHxLJ65UuSeq3uoaTlwIqIWA68DngO2ALsq8bvA66thrcA45n5cmY+DRwHLq+5fElSj0Vmdj9xxM3ALuAl4LOZeX1EvJCZK5v6PJ+ZqyLiNuBwZt5Zte8FDmbm3S3mux3YDjA8PLxpfHy87ZqmpqYYGhrq+jV1YvLZFzueZngFnH6p/f4b15zf0fy7qanZfPV1Wk+vDfL97Yb11WN99TTXNzY2diQzR7qZz/JuC6jOHWwBLgJeAP46In5lrklatLVMpczcA+wBGBkZydHR0bbrmpiYoJP+ddyw8/6Op9mxcZpbJ9tf7SeuH+1o/t3U1Gy++jqtp9cG+f52w/rqsb56elVfnUNJbweezsx/z8z/Bu4Bfho4HRGrAarHM1X/k8C6punX0jj0JElaROoEwzPAFRHxuogI4ErgGHAA2Fb12QbcWw0fALZGxHkRcRGwAXi4xvIlSX3Q9aGkzHwoIu4GHgWmgX+hcfhnCNgfETfSCI/rqv5HI2I/8ETV/6bMfKVm/ZKkHus6GAAy80PAh2Y0v0xj76FV/100TlZriVrf4TmME7uv7lMlkvrFbz5LkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpUOsmeuq/Tm9aJ0l1uccgSSoYDJKkgsEgSSoYDJKkgsEgSSoYDJKkgsEgSSoYDJKkgsEgSSoYDJKkQq1giIiVEXF3RHwpIo5FxE9FxAUR8UBEPFU9rmrqf0tEHI+IJyPiqvrlS5J6re4ew4eBv8vMHwZ+DDgG7AQOZeYG4FD1nIi4GNgKXAJsBm6PiGU1ly9J6rGugyEi3gi8DdgLkJn/lZkvAFuAfVW3fcC11fAWYDwzX87Mp4HjwOXdLl+S1B+Rmd1NGPHjwB7gCRp7C0eAm4FnM3NlU7/nM3NVRNwGHM7MO6v2vcDBzLy7xby3A9sBhoeHN42Pj7dd19TUFENDQ129pk5NPvtix9MMr4DTL/WhmB7pdX0b15zfu5kx2Pe3G9ZXj/XV01zf2NjYkcwc6WY+dW67vRy4DHhfZj4UER+mOmw0i2jR1jKVMnMPjdBhZGQkR0dH2y5qYmKCTvrXcUMXt8TesXGaWycX793Oe13fietHezYvGOz72w3rq8f66
ulVfXXOMZwETmbmQ9Xzu2kExemIWA1QPZ5p6r+uafq1wHM1li9J6oOugyEzvwZ8NSLeXDVdSeOw0gFgW9W2Dbi3Gj4AbI2I8yLiImAD8HC3y5ck9UfdYwbvA+6KiNcC/wr8Go2w2R8RNwLPANcBZObRiNhPIzymgZsy85Way5ck9VitYMjMx4BWJzeunKX/LmBXnWVKkvrLbz5LkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgoGgySpYDBIkgq1gyEilkXEv0TEfdXzCyLigYh4qnpc1dT3log4HhFPRsRVdZctSeq9Xuwx3Awca3q+EziUmRuAQ9VzIuJiYCtwCbAZuD0ilvVg+ZKkHqoVDBGxFrga+Mum5i3Avmp4H3BtU/t4Zr6cmU8Dx4HL6yxfktR7kZndTxxxN/AHwBuA92fmNRHxQmaubOrzfGauiojbgMOZeWfVvhc4mJl3t5jvdmA7wPDw8Kbx8fG2a5qammJoaKjr19SJyWdf7Hia4RVw+qU+FNMjva5v45rzezczBvv+dsP66rG+eprrGxsbO5KZI93MZ3m3BUTENcCZzDwSEaPtTNKirWUqZeYeYA/AyMhIjo62M/uGiYkJOulfxw077+94mh0bp7l1suvV3ne9ru/E9aM9mxcM9v3thvXVY3319Kq+Op8AbwHeGRHvAL4beGNE3AmcjojVmXkqIlYDZ6r+J4F1TdOvBZ6rsXxJUh90fY4hM2/JzLWZuZ7GSeV/yMxfAQ4A26pu24B7q+EDwNaIOC8iLgI2AA93XbkkqS/6cUxjN7A/Im4EngGuA8jMoxGxH3gCmAZuysxX+rB8SVINPQmGzJwAJqrhrwNXztJvF7CrF8uUJPWH33yWJBUMBklSYfFeN6lXhfUdXtJ7YvfVfapEUrvcY5AkFQwGSVLBYJAkFQwGSVLBYJAkFQwGSVLBYJAkFQwGSVLBYJAkFQwGSVLBYJAkFbxXUpNO7+sjSa9G7jFIkgoGgySpYDBIkgoGgySpYDBIkgpelSTNo5Or1XZsnGa0f6VIA2EwaMnz34dKveWhJElSoetgiIh1EfFgRByLiKMRcXPVfkFEPBART1WPq5qmuSUijkfEkxFxVS9egCSpt+rsMUwDOzLzR4ArgJsi4mJgJ3AoMzcAh6rnVOO2ApcAm4HbI2JZneIlSb3XdTBk5qnMfLQa/hZwDFgDbAH2Vd32AddWw1uA8cx8OTOfBo4Dl3e7fElSf/TkHENErAd+AngIGM7MU9AID+BNVbc1wFebJjtZtUmSFpHIzHoziBgC/hHYlZn3RMQLmbmyafzzmbkqIv4M+Hxm3lm17wU+k5mfajHP7cB2gOHh4U3j4+Nt1zM1NcXQ0FBXr2Xy2Re7mq4Twyvg9Et9X0zXFrq+jWvOn3N8q/e30/dtvmXM1Mn8h1fAmy7obP6DVOf3YxCsr57m+sbGxo5k5kg386l1uWpEvAb4FHBXZt5TNZ+OiNWZeSoiVgNnqvaTwLqmydcCz7Wab2buAfYAjIyM5OjoaNs1TUxM0En/ZjcM4O6qOzZOc+vk4r1KeKHrO3H96JzjW72/nb5v8y1jpk7mv2PjNL/c5fY3CHV+PwbB+urpVX11rkoKYC9wLDP/pGnUAWBbNbwNuLepfWtEnBcRFwEbgIe7Xb4kqT/q/Gn4FuBXgcmIeKxq+11gN7A/Im4EngGuA8jMoxGxH3iCxhVNN2XmKzWWL0nqg66DITP/CYhZRl85yzS7gF3dLlOS1H+L92C3zknz3d5ix8bpgZwLks5l3hJDklQwGCRJBYNBklQwGCRJBYNBklQwGCRJBYNBklTwewzSEuS/M1U/uccgSSoYDJKkgsEgSSoYDJKkgiefdc7p9MStdK5xj0GSVDAYJEkFg0GSVPAcg6SBm+08z2z/iMkv6
A2WewySpILBIEkqGAySpILnGKRzwNlj+rMdw5+p02P6/f5uSDfz97xE917VweAXmSSpc6/qYJCWAv+A0WIz8GCIiM3Ah4FlwF9m5u5B1yBpbobV/F7N/xNjoMEQEcuAPwN+FjgJ/HNEHMjMJwZZh9RPfqguDq/mD+5+G/Qew+XA8cz8V4CIGAe2AAaDpAXVTpC0e/K+2/k3W8igiswc3MIifgnYnJm/UT3/VeAnM/O9M/ptB7ZXT98MPNnBYi4E/qMH5faL9dVjffVYXz1Lqb4fyMzv7WYmg95jiBZt35FMmbkH2NPVAiIeycyRbqYdBOurx/rqsb56zpX6Bv0Ft5PAuqbna4HnBlyDJGkOgw6GfwY2RMRFEfFaYCtwYMA1SJLmMNBDSZk5HRHvBf6exuWqd2Tm0R4vpqtDUANkffVYXz3WV885Ud9ATz5LkhY/b6InSSoYDJKkwpINhojYHBFPRsTxiNjZYnxExEeq8V+MiMsGWNu6iHgwIo5FxNGIuLlFn9GIeDEiHqt+fm9Q9VXLPxERk9WyH2kxfiHX35ub1stjEfHNiPjtGX0Guv4i4o6IOBMRjze1XRARD0TEU9XjqlmmnXNb7WN9fxwRX6rev09HxMpZpp1zW+hjfb8fEc82vYfvmGXahVp/n2yq7UREPDbLtINYfy0/U/q2DWbmkvuhceL6K8APAq8FvgBcPKPPO4CDNL47cQXw0ADrWw1cVg2/Afhyi/pGgfsWcB2eAC6cY/yCrb8W7/XXaHxZZ8HWH/A24DLg8aa2PwJ2VsM7gT+cpf45t9U+1vdzwPJq+A9b1dfOttDH+n4feH8b7/+CrL8Z428Ffm8B11/Lz5R+bYNLdY/h/26tkZn/BZy9tUazLcDHs+EwsDIiVg+iuMw8lZmPVsPfAo4Bawax7B5asPU3w5XAVzLz3xZg2f8nMz8HfGNG8xZgXzW8D7i2xaTtbKt9qS8zP5uZ09XTwzS+N7QgZll/7Viw9XdWRATwy8Aner3cds3xmdKXbXCpBsMa4KtNz0/ynR+87fTpu4hYD/wE8FCL0T8VEV+IiIMRcclgKyOBz0bEkWjcgmSmRbH+aHzXZbZfyIVcfwDDmXkKGr+4wJta9Fks6/HXaewBtjLfttBP760Odd0xy2GQxbD+3gqczsynZhk/0PU34zOlL9vgUg2Gdm6t0dbtN/opIoaATwG/nZnfnDH6URqHR34M+FPgbwZZG/CWzLwM+Hngpoh424zxi2H9vRZ4J/DXLUYv9Ppr12JYjx8EpoG7Zuky37bQLx8Ffgj4ceAUjcM1My34+gPexdx7CwNbf/N8psw6WYu2OdfhUg2Gdm6tsaC334iI19B4A+/KzHtmjs/Mb2bmVDX8GeA1EXHhoOrLzOeqxzPAp2nsbjZbDLcv+Xng0cw8PXPEQq+/yumzh9eqxzMt+iz0drgNuAa4PqsDzjO1sS30RWaezsxXMvN/gL+YZbkLvf6WA78IfHK2PoNaf7N8pvRlG1yqwdDOrTUOAO+urq65Anjx7C5Xv1XHJPcCxzLzT2bp831VPyLichrvxdcHVN/rI+INZ4dpnKR8fEa3BVt/TWb9S20h11+TA8C2angbcG+LPgt2G5ho/FOsDwDvzMz/nKVPO9tCv+prPmf1C7Msd6Fvo/N24EuZebLVyEGtvzk+U/qzDfbzTHo/f2hcNfNlGmfbP1i1vQd4TzUcNP4p0FeASWBkgLX9DI1dtS8Cj1U/75hR33uBozSuEDgM/PQA6/vBarlfqGpYVOuvWv7raHzQn9/UtmDrj0ZAnQL+m8ZfYDcC3wMcAp6qHi+o+n4/8Jm5ttUB1XecxrHls9vgn8+sb7ZtYUD1/VW1bX2RxgfV6sW0/qr2j53d5pr6LsT6m+0zpS/boLfEkCQVluqhJElSnxgMkqSCwSBJKhgMkqSCwSBJKhgMkqSCwSBJKvwvlUId3avGll4AAAAASUVORK5CYII=", - "text/plain": [ 
- "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "df_contracts[df_contracts.annual_salary < 20].annual_salary.hist(bins=25)" - ] - }, - { - "cell_type": "code", - "execution_count": 86, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 86, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAATvklEQVR4nO3dfYxl9X3f8fenYBPCmgd3lRVmqZZKUImHxO1ONlQu7aztBmJbAquxtA41ILvaCOEqUUnLklYKkbXVNsrGFaJGXQvLUFKPaGwHZCAIo0wIEZTsEuJlIchr74ouILaOycLQlGTX3/5xz1Z3x/O0s3Pn3pnf+yWN5p7vPU/f+/CZM7975kyqCklSG/7OsHdAkrR8DH1JaoihL0kNMfQlqSGGviQ15PRh78B81q5dWxs2bBjIut955x3OOuusgax7VKz2Hu1v5VvtPQ6jv7Vr1/LYY489VlXXTL9v5EN/w4YN7N69eyDrnpycZHx8fCDrHhWrvUf7W/lWe4/D6i/J2pnqDu9IUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDRv4vcleiDdseHtq2D+74+NC2LWn0zXukn+TCJH+Y5KUk+5L8Sle/I8mrSZ7vvj7Wt8ztSfYneTnJ1X31jUn2dvfdmSSDaUuSNJOFHOkfBW6tqueSvA/Yk+Tx7r4vVtVv98+c5FJgC3AZ8AHg20kuqapjwN3AVuAZ4BHgGuDRpWlFkjSfeY/0q+r1qnquu/028BJwwRyLXAtMVNW7VXUA2A9sSnI+cHZVPV29f8x7H3DdqTYgSVq4nMw/Rk+yAXgSuBz4N8BNwFvAbnq/DbyZ5C7gmaq6v1vmHnpH8weBHVX10a5+FXBbVX1ihu1spfcbAevWrds4MTGxyPbmNjU1xZo1a5Z8vXtfPbLk61yoKy4454TpQfU4Kuxv5VvtPQ6rv82bN++pqrHp9QV/kJtkDfB14Fer6q0kdwNfAKr7vhP4LDDTOH3NUf/xYtUuYBfA2NhYDeqypIO65OlNw/wg9/rxE6a9bO3Kttr7g9Xf46j1t6BTNpO8h17g/25VfQOgqt6oqmNV9SPgy8CmbvZDwIV9i68HXuvq62eoS5KWyULO3glwD/BSVf1OX/38vtk+CbzQ3X4I2JLkjCQXARcDz1bV68DbSa7s1nkD8OAS9SFJWoCFDO98CPgMsDfJ813t14FPJ/kgvSGag8AvA1TVviQPAC/SO/Pnlu7MHYCbga8CZ9Ib5/fMHUlaRvOGflU9xczj8Y/Mscx2YPsM9d30PgSWJA2Bl2GQpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNWTe0E9yYZI/TPJSkn1JfqWrvz/J40m+230/r2+Z25PsT
/Jykqv76huT7O3uuzNJBtOWJGkmpy9gnqPArVX1XJL3AXuSPA7cBDxRVTuSbAO2AbcluRTYAlwGfAD4dpJLquoYcDewFXgGeAS4Bnh0qZtSWzZse3go2z244+ND2a50KuY90q+q16vque7228BLwAXAtcC93Wz3Atd1t68FJqrq3ao6AOwHNiU5Hzi7qp6uqgLu61tGkrQM0svfBc6cbACeBC4HXqmqc/vue7OqzktyF/BMVd3f1e+hdzR/ENhRVR/t6lcBt1XVJ2bYzlZ6vxGwbt26jRMTE4tqbj5TU1OsWbNmyde799UjS77OhbrignNOmB5Uj6NiamqKA0eODWXb0x/rQVjtzx+s/h6H1d/mzZv3VNXY9PpChncASLIG+Drwq1X11hzD8TPdUXPUf7xYtQvYBTA2Nlbj4+ML3c2TMjk5ySDWfdOQhhsADl4/fsL0oHocFZOTk+x86p2hbHv6Yz0Iq/35g9Xf46j1t6Czd5K8h17g/25VfaMrv9EN2dB9P9zVDwEX9i2+Hnitq6+foS5JWiYLOXsnwD3AS1X1O313PQTc2N2+EXiwr74lyRlJLgIuBp6tqteBt5Nc2a3zhr5lJEnLYCHDOx8CPgPsTfJ8V/t1YAfwQJLPAa8AnwKoqn1JHgBepHfmzy3dmTsANwNfBc6kN87vmTuStIzmDf2qeoqZx+MBPjLLMtuB7TPUd9P7EFiSNAT+Ra4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGrLgf6IiaXT4f4G1WB7pS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyLyhn+QrSQ4neaGvdkeSV5M83319rO++25PsT/Jykqv76huT7O3uuzNJlr4dSdJcFnKk/1XgmhnqX6yqD3ZfjwAkuRTYAlzWLfOlJKd1898NbAUu7r5mWqckaYDmDf2qehL44QLXdy0wUVXvVtUBYD+wKcn5wNlV9XRVFXAfcN0i91mStEjpZfA8MyUbgG9V1eXd9B3ATcBbwG7g1qp6M8ldwDNVdX833z3Ao8BBYEdVfbSrXwXcVlWfmGV7W+n9VsC6des2TkxMLL7DOUxNTbFmzZolX+/eV48s+ToX6ooLzjlhelA9joqpqSkOHDk2lG1Pf6wHYbbnb1ivsUH03MJrdBj9bd68eU9VjU2vn77I9d0NfAGo7vtO4LPATOP0NUd9RlW1C9gFMDY2VuPj44vczblNTk4yiHXftO3hJV/nQh28fvyE6UH1OComJyfZ+dQ7Q9n29Md6EGZ7/ob1GhtEzy28Rkepv0WdvVNVb1TVsar6EfBlYFN31yHgwr5Z1wOvdfX1M9QlSctoUaHfjdEf90ng+Jk9DwFbkpyR5CJ6H9g+W1WvA28nubI7a+cG4MFT2G9J0iLMO7yT5GvAOLA2ySHgN4DxJB+kN0RzEPhlgKral+QB4EXgKHBLVR0fcL2Z3plAZ9Ib5390CfuQJC3AvKFfVZ+eoXzPHPNvB7bPUN8NXH5SeydJWlL+Ra4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhoy7z9Gl6TjNmx7eMnXeesVR7lpnvUe3PHxJd9uqzzSl6SGrOoj/fmOShZyhCFJq4lH+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1JB5Qz/JV5IcTvJCX+39SR5P8t3u+3l9992eZH+Sl5Nc3VffmGRvd9+dSbL07UiS5rKQI/2vAtdMq20Dnqiqi4EnummSXApsAS7rlvlSktO6Ze4GtgIXd1/T1ylJGrB5Q7+qngR+OK18L
XBvd/te4Lq++kRVvVtVB4D9wKYk5wNnV9XTVVXAfX3LSJKWSXoZPM9MyQbgW1V1eTf9V1V1bt/9b1bVeUnuAp6pqvu7+j3Ao8BBYEdVfbSrXwXcVlWfmGV7W+n9VsC6des2TkxMLKq5va8emfP+dWfCG3+9qFWPrCsuOOeE6ampKdasWTOkvRm8qakpDhw5NpRtT3+sB2G252++1/ZKspD34XI81oMyrPfg5s2b91TV2PT6Ul9lc6Zx+pqjPqOq2gXsAhgbG6vx8fFF7cx8V9C89Yqj7Ny7ui40evD68ROmJycnWezjtxJMTk6y86l3hrLt6Y/1IMz2/K2mq8Mu5H24HI/1oIzae3CxZ++80Q3Z0H0/3NUPARf2zbceeK2rr5+hLklaRosN/YeAG7vbNwIP9tW3JDkjyUX0PrB9tqpeB95OcmV31s4NfctIkpbJvGMbSb4GjANrkxwCfgPYATyQ5HPAK8CnAKpqX5IHgBeBo8AtVXV8wPVmemcCnUlvnP/RJe1EkjSveUO/qj49y10fmWX+7cD2Geq7gctPau8kSUvKv8iVpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNOX3YOyBJ89mw7eGhbfvgjo8PbduD4JG+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGnFPpJDibZm+T5JLu72vuTPJ7ku9338/rmvz3J/iQvJ7n6VHdeknRyluJIf3NVfbCqxrrpbcATVXUx8EQ3TZJLgS3AZcA1wJeSnLYE25ckLdAghneuBe7tbt8LXNdXn6iqd6vqALAf2DSA7UuSZpGqWvzCyQHgTaCA/1pVu5L8VVWd2zfPm1V1XpK7gGeq6v6ufg/waFX93gzr3QpsBVi3bt3GiYmJRe3f3lePzHn/ujPhjb9e1KpH1hUXnHPC9NTUFGvWrBnS3gze1NQUB44cG8q2pz/WgzDb8zffa3slGfX34ak+z8N6D27evHlP3wjM/3eqF1z7UFW9luSngMeT/MUc82aG2ow/capqF7ALYGxsrMbHxxe1czfNc5GmW684ys69q+uacwevHz9henJyksU+fivB5OQkO596Zyjbnv5YD8Jsz998r+2VZNTfh6f6PI/ae/CUhneq6rXu+2Hgm/SGa95Icj5A9/1wN/sh4MK+xdcDr53K9iVJJ2fRoZ/krCTvO34b+HngBeAh4MZuthuBB7vbDwFbkpyR5CLgYuDZxW5fknTyTuV3qnXAN5McX89/r6o/SPKnwANJPge8AnwKoKr2JXkAeBE4CtxSVcMZjJWkRi069Kvq+8DPzFD/S+AjsyyzHdi+2G1Kkk6Nf5ErSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGLPofo0ut27Dt4YFv49YrjnLTMmxH7fBIX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1Jasiyh36Sa5K8nGR/km3LvX1Jatmyhn6S04D/AvwCcCnw6SSXLuc+SFLLlvtIfxOwv6q+X1V/A0wA1y7zPkhSs1JVy7ex5BeBa6rqX3XTnwF+rqo+P22+rcDWbvIfAC8PaJfWAj8Y0LpHxWrv0f5WvtXe4zD6+wFAVV0z/Y7lvp5+Zqj92E+dqtoF7Br4ziS7q2ps0NsZptXeo/2tfKu9x1Hrb7mHdw4BF/ZNrwdeW+Z9kKRmLXfo/ylwcZKLkrwX2AI8tMz7IEnNWtbhnao6muTzwGPAacBXqmrfcu7DNAMfQhoBq71H+1v5VnuPI9Xfsn6QK0kaLv8iV5IaYuhLUkNWX
egn+UqSw0le6KvdkeTVJM93Xx/r6u9Jcm+SvUleSnJ73zIbu/r+JHcmmel002U3U39d/V93l7fYl+S3+uq3dz28nOTqvvqK7y/JP0+yp+tjT5IP980/kv3ByT+H3X1/L8lUkl/rq41kj4t4jf50kqe7+t4kP9HVR7I/OOnX6WjlTFWtqi/gnwL/CHihr3YH8GszzPtLwER3+yeBg8CGbvpZ4B/T+9uCR4FfGHZvc/S3Gfg2cEY3/VPd90uBPwfOAC4Cvgector6+4fAB7rblwOv9i0zkv2dbI99938d+B/9r+NR7fEkn8PTge8AP9NN/91Rf40uoseRyplVd6RfVU8CP1zo7MBZSU4HzgT+BngryfnA2VX1dPWemfuA6waxvydrlv5uBnZU1bvdPIe7+rX0XmzvVtUBYD+wabX0V1V/VlXH/85jH/ATSc4Y5f7gpJ9DklwHfJ9ej8drI9vjSfb388B3qurPu/pfVtWxUe4PTrrHkcqZVRf6c/h8ku90v5ad19V+D3gHeB14BfjtqvohcAG9PyQ77lBXG1WXAFcl+Z9J/ijJz3b1C4D/1Tff8T5WS3/9/gXwZ90bbqX1B7P0mOQs4DbgN6fNv9J6nO05vASoJI8leS7Jv+vqK60/mL3HkcqZ5b4Mw7DcDXyB3k/cLwA7gc/SuwDcMeADwHnAHyf5Ngu8XMQIOZ3e/l8J/CzwQJK/z+x9rIr+uqMjklwG/Cd6R42w8vqD2Z/D3wS+WFVT04Z7V1qPs/V3OvBPutr/AZ5Isgd4a4Z1jHJ/MHuPI5UzTYR+Vb1x/HaSLwPf6iZ/CfiDqvpb4HCSPwHGgD+md4mI40b9chGHgG90Ifhskh/Ru8jTbJe9OMTq6O9/J1kPfBO4oaq+1zf/SuoPZu/x54Bf7D4UPBf4UZL/S2+MfyX1ONdr9I+q6gcASR6hN1Z+PyurP5i9x5HKmSaGd7qxs+M+CRz/xP0V4MPpOYveT+i/qKrXgbeTXNl9mn4D8OCy7vTJ+X3gwwBJLgHeS+8qew8BW7px7ouAi4FnV0t/Sc4FHgZur6o/OT7zCuwPZumxqq6qqg1VtQH4z8B/rKq7VmCPv8/Mr9HHgJ9O8pPdmPc/A15cgf3B7D2OVs4M+pPi5f4CvkZv7Oxv6f3k/Rzw34C99M4SeAg4v5t3Db0zIvYBLwL/tm89Y/R+OHwPuIvur5eH/TVLf++ld2T0AvAc8OG++f9918PL9J0ZsBr6A/4DvbHS5/u+jp8xMZL9LeY57FvuDk48e2cke1zEa/Rfdu/BF4DfGvX+FvE6Hamc8TIMktSQJoZ3JEk9hr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyP8DDE9W+GtusxEAAAAASUVORK5CYII=", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "df_contracts.startY.hist(bins=10)" - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.026105873821609893" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# proportion of female apprentices\n", - "1-(df_contracts.a_gender.sum()/df_contracts.shape[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.023723194861701047" - ] - }, - "execution_count": 82, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# proportion of female masters\n", - "1-(df_contracts.m_gender.sum()/df_contracts.shape[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 83, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.7310924369747899" - ] - }, - "execution_count": 83, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# prop female apprentices with male master\n", - "df_contracts[(df_contracts.a_gender == 0) & (df_contracts.startY < 1800)].m_gender.sum()\\\n", - " /df_contracts[(df_contracts.a_gender == 0) & (df_contracts.startY < 1800)].shape[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 84, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.9810528582193992" - ] - }, - "execution_count": 84, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# prop male apprentices with male master\n", - "df_contracts[(df_contracts.a_gender == 1) & (df_contracts.startY < 1800)].m_gender.sum()\\\n", - " /df_contracts[(df_contracts.a_gender == 1) & (df_contracts.startY < 1800)].shape[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Looking at empirical distributions" - ] - }, - { - 
"cell_type": "code", - "execution_count": 87, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 87, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAT2ElEQVR4nO3df4zk9X3f8eerh0OpiRM74NXlDveIdLbCj+QqVhTJbbTESbkYK+AqTg5RA7Wrsy0s2epVLaSR7MY6CbVx3FqJic4GgWWXCwqxQcEkITQrXAmK72zi44epD3OxlzvdySY1rGNdc/jdP+a7ZbLs7e7M7M3CfJ4PaTQzn+/38/1+5s3x2u985jvfSVUhSWrDP1jvAUiSxsfQl6SGGPqS1BBDX5IaYuhLUkNOW+8BrOSss86qLVu2DNzvBz/4Aa997WvXfkCvMtbhJdaixzr0THod9u/f/92qOntx+ys+9Lds2cK+ffsG7jc7O8vMzMzaD+hVxjq8xFr0WIeeSa9Dkr9eqt3pHUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1Jasgr/hu562XLDfcuu/zQTZePaSSStHY80pekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNWTH0k9ya5FiSx/ra/jDJo93tUJJHu/YtSX7Yt+wP+vpclORAkoNJPpkkp+QVSZJOajXn6d8G/B7w2YWGqvqNhcdJPg58v2/9p6tq2xLbuRnYCTwMfAnYDtw38IglSUNb8Ui/qh4EnltqWXe0/uvAHcttI8lG4HVV9VBVFb0/IFcOPFpJ0khG/UbuPweOVtU3+9rOTfI14Hngt6rqy8AmYK5vnbmubUlJdtJ7V8DU1BSzs7MDD2x+fn6ofgt2XXhi2eWjbHucRq3DJLEWPdahp9U6jBr6V/H3j/KPAG+qqu8luQj4YpLzgaXm7+tkG62qPcAegOnp6Rrmx4tH/dHj61a6DMPVw297nCb9x58HYS16rENPq3UYOvSTnAb8S+CihbaqOg4c7x7vT/I08GZ6R/ab+7pvBg4Pu29J0nBGOWXzl4BvVNX/n7ZJcnaSDd3jnwG2At+qqiPAC0ku6T4HuAa4e4R9S5KGsJpTNu8AHgLekmQuyXu7RTt4+Qe4vwB8PclfAX8EvL+qFj4E/gDwGeAg8DSeuSNJY7fi9E5VXXWS9uuWaLsLuOsk6+8DLhhwfJKkNeQ3ciWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGrOaH0W9NcizJY31tH03ybJJHu9vb+5bdmORgkqeSXNbXflGSA92yTybJ2r8cSdJyVnOkfxuwfYn2T1TVtu72JYAk5wE7gPO7Pp9KsqFb/2ZgJ7C1uy21TUnSKbRi6FfVg8Bzq9zeFcDeqjpeVc8AB4GLk2wEXldVD1VVAZ8FrhxyzJKkIZ02Qt8PJrkG2Afsqqq/ATYBD/etM9e1/V33eHH7kpLspPeugKmpKWZnZwce3Pz8/FD9Fuy68MSyy0fZ9jiNWodJYi16rENPq3UYNvRvBj4GVHf/ceA9wFLz9LVM+5Kqag+wB2B6erpmZmYGHuDs7CzD9Ftw3Q33Lrv80NXDb3ucRq3DJLEWPdahp9U6DHX2TlUdraoXq+pHwKeBi7tFc8A5fatuBg537ZuXaJckjdFQod/
N0S94J7BwZs89wI4kpyc5l94Hto9U1RHghSSXdGftXAPcPcK4JUlDWHF6J8kdwAxwVpI54CPATJJt9KZoDgHvA6iqx5PcCTwBnACur6oXu019gN6ZQGcA93U3SdIYrRj6VXXVEs23LLP+bmD3Eu37gAsGGp0kaU35jVxJaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ1Z8ecStbQtN9x70mWHbrp8jCORpNVb8Ug/ya1JjiV5rK/tvyT5RpKvJ/lCkp/s2rck+WGSR7vbH/T1uSjJgSQHk3wySU7JK5IkndRqpnduA7YvarsfuKCqfg7438CNfcuerqpt3e39fe03AzuBrd1t8TYlSafYiqFfVQ8Czy1q+/OqOtE9fRjYvNw2kmwEXldVD1VVAZ8FrhxqxJKkoa3FnP57gD/se35ukq8BzwO/VVVfBjYBc33rzHVtS0qyk967AqamppidnR14UPPz80P1W7DrwhMrr3QSo+x3rY1ah0liLXqsQ0+rdRgp9JP8R+AE8Pmu6Qjwpqr6XpKLgC8mOR9Yav6+TrbdqtoD7AGYnp6umZmZgcc2OzvLMP0WXLfMB7UrOXT18Ptda6PWYZJYix7r0NNqHYYO/STXAu8A3tZN2VBVx4Hj3eP9SZ4G3kzvyL5/CmgzcHjYfUuShjPUefpJtgP/AfjVqvrbvvazk2zoHv8MvQ9sv1VVR4AXklzSnbVzDXD3yKOXJA1kxSP9JHcAM8BZSeaAj9A7W+d04P7uzMuHuzN1fgH47SQngBeB91fVwofAH6B3JtAZwH3dTZI0RiuGflVdtUTzLSdZ9y7grpMs2wdcMNDoJElrysswSFJDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWpI07+Ru9zv3ErSJPJIX5IaYuhLUkOant45VVaaNjp00+VjGokk/X0e6UtSQwx9SWqIoS9JDTH0Jakhhr4kNWTF0E9ya5JjSR7ra3tDkvuTfLO7f33fshuTHEzyVJLL+tovSnKgW/bJdL+oLkkan9Uc6d8GbF/UdgPwQFVtBR7onpPkPGAHcH7X51NJNnR9bgZ2Alu72+JtSpJOsRVDv6oeBJ5b1HwFcHv3+Hbgyr72vVV1vKqeAQ4CFyfZCLyuqh6qqgI+29dHkjQmw345a6qqjgBU1ZEkb+zaNwEP960317X9Xfd4cfuSkuyk966AqakpZmdnBx7g/Pz8iv12XXhi4O2uhWFez7BWU4dWWIse69DTah3W+hu5S83T1zLtS6qqPcAegOnp6ZqZmRl4ILOzs6zU77r1uuDagR8su3gtv7G7mjq0wlr0WIeeVusw7Nk7R7spG7r7Y137HHBO33qbgcNd++Yl2iVJYzRs6N8DXNs9vha4u699R5LTk5xL7wPbR7qpoBeSXNKdtXNNXx9J0pisOL2T5A5gBjgryRzwEeAm4M4k7wW+DbwLoKoeT3In8ARwAri+ql7sNvUBemcCnQHc190kSWO0YuhX1VUnWfS2k6y/G9i9RPs+4IKBRidJWlN+I1eSGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyNChn+QtSR7tuz2f5MNJPprk2b72t/f1uTHJwSRPJblsbV6CJGm1Vvxh9JOpqqeAbQBJNgDPAl8A/jXwiar6nf71k5wH7ADOB34a+Iskb66qF4cdgyRpMGs1vfM24Omq+utl1rkC2FtVx6vqGeAgcPEa7V+StAprFfo7gDv6nn8wydeT3Jrk9V3bJuA7fevMdW2SpDFJVY22geTHgMPA+VV1NMkU8F2ggI8BG6vqPUl+H3ioqj7X9bsF+FJV3bXENncCOwGmpqY
u2rt378Djmp+f58wzz1x2nQPPfn/g7Y7DhZt+Ys22tZo6tMJa9FiHnkmvw6WXXrq/qqYXtw89p9/nV4CvVtVRgIV7gCSfBv6kezoHnNPXbzO9PxYvU1V7gD0A09PTNTMzM/CgZmdnWanfdTfcO/B2x+HQ1TNrtq3V1KEV1qLHOvS0Woe1mN65ir6pnSQb+5a9E3ise3wPsCPJ6UnOBbYCj6zB/iVJqzTSkX6SfwT8MvC+vub/nGQbvemdQwvLqurxJHcCTwAngOs9c0eSxmuk0K+qvwV+alHbu5dZfzewe5R9SpKG5zdyJakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIaMFPpJDiU5kOTRJPu6tjckuT/JN7v71/etf2OSg0meSnLZqIOXJA1mLY70L62qbVU13T2/AXigqrYCD3TPSXIesAM4H9gOfCrJhjXYvyRplU7F9M4VwO3d49uBK/va91bV8ap6BjgIXHwK9i9JOolRQ7+AP0+yP8nOrm2qqo4AdPdv7No3Ad/p6zvXtUmSxuS0Efu/taoOJ3kjcH+SbyyzbpZoqyVX7P0B2QkwNTXF7OzswAObn59fsd+uC08MvN1xGOb1nsxq6tAKa9FjHXparcNIoV9Vh7v7Y0m+QG+65miSjVV1JMlG4Fi3+hxwTl/3zcDhk2x3D7AHYHp6umZmZgYe2+zsLCv1u+6Gewfe7jgcunpmzba1mjq0wlr0WIeeVusw9PROktcm+fGFx8C/AB4D7gGu7Va7Fri7e3wPsCPJ6UnOBbYCjwy7f0nS4EY50p8CvpBkYTv/var+NMlXgDuTvBf4NvAugKp6PMmdwBPACeD6qnpxpNFLkgYydOhX1beAn1+i/XvA207SZzewe9h9SpJG4zdyJakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDVk1Esr6xTYsszVPw/ddPkYRyJp0nikL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDRk69JOck+QvkzyZ5PEkH+raP5rk2SSPdre39/W5McnBJE8luWwtXoAkafVGuQzDCWBXVX01yY8D+5Pc3y37RFX9Tv/KSc4DdgDnAz8N/EWSN1fViyOMQZI0gKGP9KvqSFV9tXv8AvAksGmZLlcAe6vqeFU9AxwELh52/5KkwaWqRt9IsgV4ELgA+LfAdcDzwD567wb+JsnvAQ9X1ee6PrcA91XVHy2xvZ3AToCpqamL9u7dO/CY5ufnOfPMM5dd58Cz3x94u+vtwk0/sezyxa9p6gw4+sPV9Z10q/k30QLr0DPpdbj00kv3V9X04vaRr7KZ5EzgLuDDVfV8kpuBjwHV3X8ceA+QJbov+RenqvYAewCmp6drZmZm4HHNzs6yUr/rlrma5SvVoatnll2++DXtuvAEHz9w2qr6TrrV/JtogXXoabUOI529k+Q19AL/81X1xwBVdbSqXqyqHwGf5qUpnDngnL7um4HDo+xfkjSYUc7eCXAL8GRV/W5f+8a+1d4JPNY9vgfYkeT0JOcCW4FHht2/JGlwo0zvvBV4N3AgyaNd228CVyXZRm/q5hDwPoCqejzJncAT9M78ud4zdyRpvIYO/ar6nyw9T/+lZfrsBnYPu09J0mj8Rq4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyMiXYdDk2LLCZSkO3XT5mEYi6VSZ6NBfKcRejSbxNUkaH6d3JKkhhr4kNcTQl6SGTPScvsbHD4GlVweP9CWpIR7pa6L1vwPZdeGJl/2ymO9A1BqP9CWpIYa+JDXE0JekhjinLzXmwLPff9lnGwv8jGPyGfoai+VO6ZzUoGnxNb9SLfXfYuGD/db+W4w99JNsB/4bsAH4TFXdNO4xtGrU6/as13V/Xqn
fAfA6SHo1GmvoJ9kA/D7wy8Ac8JUk91TVE+Mch15ZXqnh+UodlzSKcR/pXwwcrKpvASTZC1wBGPoa2ijh/Ep99zKKUd75jDqu5fY9yrZXek2v1j/Q6zEFmKo6JRtecmfJrwHbq+rfdM/fDfzTqvrgovV2Aju7p28Bnhpid2cB3x1huJPCOrzEWvRYh55Jr8M/rqqzFzeO+0g/S7S97K9OVe0B9oy0o2RfVU2Pso1JYB1eYi16rENPq3UY93n6c8A5fc83A4fHPAZJata4Q/8rwNYk5yb5MWAHcM+YxyBJzRrr9E5VnUjyQeDP6J2yeWtVPX6KdjfS9NAEsQ4vsRY91qGnyTqM9YNcSdL68to7ktQQQ1+SGjJxoZ9ke5KnkhxMcsN6j2ecktya5FiSx/ra3pDk/iTf7O5fv55jHIck5yT5yyRPJnk8yYe69qZqkeQfJnkkyV91dfhPXXtTdeiXZEOSryX5k+55c7WYqNDvu8zDrwDnAVclOW99RzVWtwHbF7XdADxQVVuBB7rnk+4EsKuqfha4BLi++3fQWi2OA79YVT8PbAO2J7mE9urQ70PAk33Pm6vFRIU+fZd5qKr/Cyxc5qEJVfUg8Nyi5iuA27vHtwNXjnNM66GqjlTVV7vHL9D7n3wTjdWieua7p6/pbkVjdViQZDNwOfCZvubmajFpob8J+E7f87murWVTVXUEemEIvHGdxzNWSbYA/wT4XzRYi24641HgGHB/VTVZh85/Bf498KO+tuZqMWmhv6rLPKgNSc4E7gI+XFXPr/d41kNVvVhV2+h9+/3iJBes85DWRZJ3AMeqav96j2W9TVroe5mHlzuaZCNAd39sncczFkleQy/wP19Vf9w1N1kLgKr6P8Asvc98WqzDW4FfTXKI3rTvLyb5HA3WYtJC38s8vNw9wLXd42uBu9dxLGORJMAtwJNV9bt9i5qqRZKzk/xk9/gM4JeAb9BYHQCq6saq2lxVW+jlwv+oqn9Fg7WYuG/kJnk7vbm7hcs87F7fEY1PkjuAGXqXjD0KfAT4InAn8Cbg28C7qmrxh70TJck/A74MHOCl+dvfpDev30wtkvwcvQ8nN9A7wLuzqn47yU/RUB0WSzID/LuqekeLtZi40JckndykTe9IkpZh6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SG/D8WXZv35dC0cAAAAABJRU5ErkJggg==", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "df_contracts[df_contracts.annual_salary < 50].annual_salary.hist(bins=40)" - ] - }, - { - "cell_type": "code", - "execution_count": 88, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 88, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD4CAYAAAAEhuazAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAQaUlEQVR4nO3df6xf9V3H8efLMpHQbRTZGkLRommMQBXlBknmzG02Rx1LYEZMCRklznRZINkS/ljZP5uaJo1xU5cJsROyks01jduksaISshtcMmTtgpYfQ5pRsbRpM2GMLgYte/vHPa3fdd/b9nt7f33P5/lIvvme8znnfM/nndP7uud+zvmepqqQJLXlJxa7A5KkhWf4S1KDDH9JapDhL0kNMvwlqUHnLXYHzuSSSy6p1atXn5z/wQ9+wIUXXrh4HZonfa0L+lubdY2fvtY2rK69e/d+t6reNtM2Sz78V69ezZ49e07OT01NMTk5uXgdmid9rQv6W5t1jZ++1jasriT/cbptHPaRpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGLflv+EpzbfXm3SOtf2DrjfPUE2nxeOYvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSg84Y/kkuT/K1JM8meTrJR7r2i5M8kuT57n3FwDb3JNmf5LkkNwy0X5tkX7fsM0kyP2VJkk7nbM78jwN3V9UvAtcDdya5EtgMPFpVa4BHu3m6ZRuAq4D1wL1JlnWfdR+wCVjTvdbPYS2SpLN0xvCvqsNV9a1u+jXgWeAy4CZge7faduDmbvomYEdVvV5VLwD7geuSXAq8paq+UVUFPDiwjSRpAWU6h89y5WQ18BhwNfBiVV00sOyVqlqR5LPA41X1ha79fuBh4ACwtare3bW/E/hYVb1vyH42Mf0XAitXrrx2x44dJ5cdO3aM5cuXj1blGOhrXbD0atv30qsjrb/2srcObV9qdc2VvtYF/a1tWF3r1q3bW1UTM21z1v+Hb5LlwJeBj1bV908zXD9sQZ2m/ccbq7YB2wAmJiZqcnLy5LKpqSkG5/uir3XB0qvtjlH/D9/bJoe2L7W65kpf64L+1jabus7qbp8kb2I6+L9YVV/pmo90Qzl070e79oPA5QObrwIOde2rhrRLkhbY2dztE+B+4Nmq+vTAol3Axm56I/DQQPuGJOcnuYLpC7tPVNVh4LUk13efefvANpKkBXQ2wz7vAD4A7EvyZNf2cWArsDPJB4EXgVsAqurpJDuBZ5i+U+jOqnqj2+7DwOeBC5i+DvDw3JQhSRrFGcO/qr7O8PF6gHfNsM0WYMuQ9j1MXyyWJC0iv+ErSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqk
OEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhp0xvBP8kCSo0meGmj7ZJKXkjzZvd47sOyeJPuTPJfkhoH2a5Ps65Z9JknmvhxJ0tk4mzP/zwPrh7T/aVVd073+HiDJlcAG4Kpum3uTLOvWvw/YBKzpXsM+U5K0AM4Y/lX1GPDyWX7eTcCOqnq9ql4A9gPXJbkUeEtVfaOqCngQuHmWfZYknaPzzmHbu5LcDuwB7q6qV4DLgMcH1jnYtf1vN31q+1BJNjH9VwIrV65kamrq5LJjx479yHxf9LUuWHq13b32+Ejrz9T3pVbXXOlrXdDf2mZT12zD/z7gj4Dq3j8F/B4wbBy/TtM+VFVtA7YBTExM1OTk5MllU1NTDM73RV/rgqVX2x2bd4+0/oHbJoe2L7W65kpf64L+1jabumZ1t09VHamqN6rqh8DngOu6RQeBywdWXQUc6tpXDWmXJC2CWYV/N4Z/wvuBE3cC7QI2JDk/yRVMX9h9oqoOA68lub67y+d24KFz6Lck6RyccdgnyZeASeCSJAeBTwCTSa5heujmAPAhgKp6OslO4BngOHBnVb3RfdSHmb5z6ALg4e4lSVoEZwz/qrp1SPP9p1l/C7BlSPse4OqReidJmhd+w1eSGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IadMb/yUtaSKs37x55mwNbb5yHnkj95pm/JDXI8JekBjnso7E3m6EiqXWe+UtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDzhj+SR5IcjTJUwNtFyd5JMnz3fuKgWX3JNmf5LkkNwy0X5tkX7fsM0ky9+VIks7G2Zz5fx5Yf0rbZuDRqloDPNrNk+RKYANwVbfNvUmWddvcB2wC1nSvUz9TkrRAzhj+VfUY8PIpzTcB27vp7cDNA+07qur1qnoB2A9cl+RS4C1V9Y2qKuDBgW0kSQss01l8hpWS1cDfVdXV3fz3quqigeWvVNWKJJ8FHq+qL3Tt9wMPAweArVX17q79ncDHqup9M+xvE9N/JbBy5cprd+zYcXLZsWPHWL58+eiVLnF9rQtGq23fS6/Oc29Gt/aytw5t7+sx62td0N/ahtW1bt26vVU1MdM2581xH4aN49dp2oeqqm3ANoCJiYmanJw8uWxqaorB+b7oa10wWm13bN49v52ZhQO3TQ5t7+sx62td0N/aZlPXbMP/SJJLq+pwN6RztGs/CFw+sN4q4FDXvmpIu7TkrZ7hF9Lda48P/WV1YOuN890l6ZzN9lbPXcDGbnoj8NBA+4Yk5ye5gukLu09U1WHgtSTXd3f53D6wjSRpgZ3xzD/Jl4BJ4JIkB4FPAFuBnUk+CLwI3AJQVU8n2Qk8AxwH7qyqN7qP+jDTdw5dwPR1gIfntBJJ0lk7Y/hX1a0zLHrXDOtvAbYMad8DXD1S7yRJ82KuL/hKzZvpGsFMvEagxeDjHSSpQYa/JDXI8JekBjnmr3m1evPuGe+Hl7R4PPOXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUHnLXYHJI1u9ebdI61/YOuN89QTjSvP/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUHnFP5JDiTZl+TJJHu6touTPJLk+e59xcD69yTZn+S5JDeca+clSbMzF2f+66rqmqqa6OY3A49W1Rrg0W6eJFcCG4CrgPXAvUmWzcH+J
Ukjmo9hn5uA7d30duDmgfYdVfV6Vb0A7Aeum4f9S5LOIFU1+42TF4BXgAL+sqq2JfleVV00sM4rVbUiyWeBx6vqC137/cDDVfU3Qz53E7AJYOXKldfu2LHj5LJjx46xfPnyWfd5qeprXfteepWVF8CR/17snsy9uapr7WVvHXmbfS+9Om/76Ou/RehvbcPqWrdu3d6BEZkfc65P9XxHVR1K8nbgkSTfPs26GdI29DdPVW0DtgFMTEzU5OTkyWVTU1MMzvdFX+u6Y/Nu7l57nE/t698DZOeqrgO3TY68zR2jPtVzhH309d8i9Le22dR1TsM+VXWoez8KfJXpYZwjSS4F6N6PdqsfBC4f2HwVcOhc9i9Jmp1Zh3+SC5O8+cQ08B7gKWAXsLFbbSPwUDe9C9iQ5PwkVwBrgCdmu39J0uydy9+sK4GvJjnxOX9dVf+Q5JvAziQfBF4EbgGoqqeT7ASeAY4Dd1bVG+fUe0nSrMw6/KvqO8AvD2n/L+BdM2yzBdgy231KkuaG3/CVpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDerfd+6lMbN6xEc1zPc+7l57nMn564qWCM/8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSg85b7A5ovKzevHuxuyBpDnjmL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIO/2aZx370htMvwlnbNRTyIObL1xnnqis2X4S1ryZvMXqr9gTm/Bwz/JeuDPgWXAX1XV1oXug6TTcziw/xY0/JMsA/4C+E3gIPDNJLuq6pmF7Mc48YdQ0nxY6DP/64D9VfUdgCQ7gJsAw19qyEKc1Azbx91rj3PHHO171GGlpXZdJFU1rzv4kZ0lvwOsr6rf7+Y/APxaVd11ynqbgE3d7C8Azw0svgT47gJ0d6H1tS7ob23WNX76Wtuwun62qt420wYLfeafIW0/9tunqrYB24Z+QLKnqibmumOLra91QX9rs67x09faZlPXQn/J6yBw+cD8KuDQAvdBkpq30OH/TWBNkiuS/CSwAdi1wH2QpOYt6LBPVR1Pchfwj0zf6vlAVT094scMHQ7qgb7WBf2tzbrGT19rG7muBb3gK0laGnywmyQ1yPCXpAaNTfgnWZ/kuST7k2xe7P7MpSQHkuxL8mSSPYvdn9lK8kCSo0meGmi7OMkjSZ7v3lcsZh9na4baPpnkpe64PZnkvYvZx9lIcnmSryV5NsnTST7StY/1cTtNXX04Zj+V5Ikk/9rV9gdd+0jHbCzG/LvHQvw7A4+FAG7ty2MhkhwAJqpqrL98kuQ3gGPAg1V1ddf2x8DLVbW1+6W9oqo+tpj9nI0ZavskcKyq/mQx+3YuklwKXFpV30ryZmAvcDNwB2N83E5T1+8y/scswIVVdSzJm4CvAx8BfpsRjtm4nPmffCxEVf0PcOKxEFpCquox4OVTmm8CtnfT25n+ARw7M9Q29qrqcFV9q5t+DXgWuIwxP26nqWvs1bRj3eybulcx4jEbl/C/DPjPgfmD9ORAdgr4pyR7u0db9MnKqjoM0z+QwNsXuT9z7a4k/9YNC43V0MipkqwGfgX4F3p03E6pC3pwzJIsS/IkcBR4pKpGPmbjEv5n9ViIMfaOqvpV4LeAO7shBi199wE/D1wDHAY+tai9OQdJlgNfBj5aVd9f7P7MlSF19eKYVdUbVXUN009JuC7J1aN+xriEf68fC1FVh7r3o8BXmR7m6osj3fjriXHYo4vcnzlTVUe6H8IfAp9jTI9bN278ZeCLVfWVrnnsj9uwuvpyzE6oqu8BU8B6Rjxm4xL+vX0sRJILuwtSJLkQeA/w1Om3Giu7gI3d9EbgoUXsy5w68YPWeT9jeNy6i4f3A89W1acHFo31cZuprp4cs7cluaibvgB4N/BtR
jxmY3G3D0B3S9af8f+PhdiyuD2aG0l+jumzfZh+3MZfj2ttSb4ETDL9eNkjwCeAvwV2Aj8DvAjcUlVjd+F0htommR4+KOAA8KETY67jIsmvA/8M7AN+2DV/nOnx8bE9bqep61bG/5j9EtMXdJcxfQK/s6r+MMlPM8IxG5vwlyTNnXEZ9pEkzSHDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXo/wCyR1tTDJOpXgAAAABJRU5ErkJggg==", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "df_contracts[df_contracts.a_age < 30].a_age.hist(bins=25)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Two very important distributions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Normal\n", - "\n", - "Also known as Gaussian, is a bell-shaped distribution with mass around the mean and exponentially decaying on the sides. It is fully characterized by the mean (center of mass) and standard deviation (spread).\n", - "\n", - "https://en.wikipedia.org/wiki/Normal_distribution" - ] - }, - { - "cell_type": "code", - "execution_count": 89, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 89, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "s1 = np.random.normal(5, 1, 10000)\n", - "sns.displot(s1)" - ] - }, - { - "cell_type": "code", - "execution_count": 90, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 90, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAD4CAYAAADSIzzWAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAN7ElEQVR4nO3df0zc933H8dfbwBI7XtTViVAG0W7VSammyWob1P2IVFmLvQBpuv25SZvJDwUJJkyzSMsmoTiR0P5arCX8MSlKt4FWZdrSTko2sBprnbZK2zrI0jkDB3+XXD1c16GXrCkGYg6/9wd3J+44jDHcve/M8yGhcF9/+X7fxnyefO97oJi7CwBQe/uiBwCAvYoAA0AQAgwAQQgwAAQhwAAQpHk7O991112eSqWqNAoA3JqmpqZ+5O53l2/fVoBTqZQmJyd3byoA2APM7PuVtnMLAgCCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIMi2/p9wwM0YGRlRkiQ1PefFixclSW1tbTU7Zzqd1sDAQM3Oh8ZHgFF1SZLo7XdmtHrg0zU7Z9PijyVJP/ykNl/iTYsf1uQ8uLUQYNTE6oFPa+mz3TU73/5z45JUs3MWzgdsB/eAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBrmBkZEQjIyPRYwB71l5Zg83RA9SjJEmiRwD2tL2yBrkCBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAgtQkwNlsVidOnFA2m92wLUkS9fX1qb+/X9lstuK+Wx2rIEkSPfzww0qSpGT/vr4+9fb2qq+vT5OTk+ru7taTTz6pJEnU39+vxx9/XEePHtWRI0d07Ngxvfvuu5qdnd10BgDVtbKyotnZWfX29hbfHnvsMXV2dqqzs1NTU1N69NFHdeTIET300EM6fvy4urq6imu/sO4LXSlX3pH1j7PZrPr7+4vNSJLkuk3aiZoEeHR0VGfPntXY2NiGbcPDw5qZmdH09LTGxsYq7rvVsQqGh4d15coVDQ8Pl+w/MzOj2dlZzczM6LnnntPi4qLOnz+v4eFhTU9P67333lMul5O09g+/vLyspaWlTWcAUF2XL1/W0tKSZmdni2/vv/++lpeXt
by8rJMnTyqTyUiSPvnkE124cEFLS0vFtV9Y94WulCvvyPrHo6Ojmp6eLjZjeHj4uk3aiaoHOJvN6vTp03J3nT59uvgdprCt8EmUpPHxcU1MTJTsu9WxCpIkKR4rk8koSRJls1lNTEyUHGNhYaH4/vpzVzIxMcFVMFBjhUZcz/p1vF4mk9HU1FTJui9fx+UdSZKk+HhiYmJDMzKZzKZN2qnmXT1aBaOjo7p27ZokaXV1VWNjY3L34rb1VlZWZGYl+z711FPXPVbhz9df9RYeHz58uHhlezOuXr2q3t5etbe33/QxsPbNcd9Vjx6jqvYtf6wk+YkGBwejR2l4c3NzO/r4kydPlqz7lZWVklaUd2R4eLj4eGVlRe6Vv1YrNWmntrwCNrNeM5s0s8n5+fltn+DMmTPFT0Yul9Obb75Zsq1c4S9f2HerYxWUX81mMhmdOXNm00/mjfroo4929PEAtmena25hYaFk3bt7SSvKO5LJZIqPr9eLSk3aqS2vgN39ZUkvS1JHR8e2a3b06FGNj48rl8upublZx44dk7sXt5UzM7l7cd+tjlWQSqVKIpxKpXT48GG98cYbNx1hM9Mjjzyyq9/x9qLBwUFNvXc5eoyqunb7nUp/plUvvvhi9CgN79SpU3r99ddv+uMPHjyoK1euFNe9mZW0orwj7e3tmpubUy6XK/ankkpN2qmq3wPu6enRvn1rp2lqatLx48dLtq3X0tKi5ubmkn23OlbB0NBQyb5DQ0Pq6ekpHu9mtLS0bJgBQHX19PTs6OOff/75knVfvo7LOzI0NFR83NLSopaWlorHrdSknap6gA8dOqTOzk6ZmTo7O3Xo0KGSbalUqrhvd3e3urq6Svbd6lgF6XS6eKxUKqV0Oq1Dhw6pq6ur5BgHDx4svr/+3JV0dXVtmAFAdRUacT3r1/F6qVRK999/f8m6L1/H5R1Jp9PFx11dXRuakUqlNm3STlX9RThp7TtOJpPZ8F0ok8noxIkTeuGFF2RmxT8v33erYxUMDQ1pcHCw5Gq4p6dH58+f1+rqqpqamvTEE0/o2WefVVtbm5555hmdOnVKy8vLunDhgnK5nFpaWtTU1FQyD4Daam1t1eLiou69997itpWVFV26dEnS2lXuyMiIMpmMbrvtNrW2tmp+fr649gvrfrN1XN6R8sdJkiiXy6mpqUlPP/20Xnrppar0wLZzf7Sjo8MnJyd3fYh6U3glm/t5u6NwD3jps901O+f+c+OSVLNz7j83rvu5B7xrbrU1aGZT7t5Rvp1fRQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAII0Rw9Qj9LpdPQIwJ62V9YgAa5gYGAgegRgT9sra5BbEAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABGmOHgB7Q9Pih9p/bryG58tKUs3O2bT4oaTWmpwLtw4CjKpLp9M1P+fFizlJUltbraLYGvL3RGMjwKi6gYGB6BGAusQ9YAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCmLvf+M5m85K+X+GP7pL0o90aqkoaYUapMeZsh
BmlxpizEWaUGmPOep7x59z97vKN2wrwZsxs0t07dnygKmqEGaXGmLMRZpQaY85GmFFqjDkbYcZy3IIAgCAEGACC7FaAX96l41RTI8woNcacjTCj1BhzNsKMUmPM2QgzltiVe8AAgO3jFgQABCHAABBkRwE2sz83sw/M7J3dGmi3mdm9ZvZtM5sxs/82s8HomcqZ2e1m9l0z+15+xuejZ9qMmTWZ2X+a2d9Hz7IZM8uY2Vkze9vMJqPn2YyZfcrMXjOzc/mvz1+Jnmk9M7sv/zksvH1sZl+NnqsSM3sqv3beMbNXzez26JluxI7uAZvZlyQtSBpz91/ctal2kZndI+ked3/LzH5a0pSk33T36eDRiszMJN3h7gtm1iLpO5IG3f3fgkfbwMx+X1KHpDvd/cvR81RiZhlJHe5erz+UL0kys1FJ/+Lur5jZT0k64O7/FzxWRWbWJOmipF9y90q/jBXGzNq0tmZ+wd2XzOxvJI27+1/GTra1HV0Bu/s/S/pwl2apCne/5O5v5d//iaQZSW2xU5XyNQv5hy35t7p7ddTM2iU9LOmV6FkanZndKelLkr4mSe5+tV7jm/egpP+pt/iu0yxpv5k1Szog6QfB89yQPXUP2MxSkj4v6d+DR9kg/9T+bUkfSHrT3etuRkl/KukPJF0LnmMrLulbZjZlZr3Rw2ziM5LmJf1F/pbOK2Z2R/RQ1/Fbkl6NHqISd78o6U8kXZB0SdKP3f1bsVPdmD0TYDM7KOkbkr7q7h9Hz1PO3Vfd/XOS2iV90czq6paOmX1Z0gfuPhU9yw14wN2/IKlL0u/lb5XVm2ZJX5D0Z+7+eUlXJP1h7EiV5W+PfEXS30bPUomZ/Yyk35D085J+VtIdZvY7sVPdmD0R4Px91W9I+rq7fzN6nuvJPw39J0mdsZNs8ICkr+Tvr/61pF8zs7+KHakyd/9B/r8fSPo7SV+MnaiiOUlz657pvKa1INejLklvufvl6EE2cVTS++4+7+4rkr4p6VeDZ7oht3yA8y9wfU3SjLufip6nEjO728w+lX9/v9a+oM6FDlXG3f/I3dvdPaW1p6P/6O51d5VhZnfkX2xV/in9r0uqu5/ScfcfSvpfM7svv+lBSXXzwnCZ31ad3n7IuyDpl83sQH69P6i113rq3k5/DO1VSf8q6T4zmzOzJ3ZnrF31gKTf1doVW+HHabqjhypzj6Rvm9l/SfoPrd0Drtsf86pzrZK+Y2bfk/RdSf/g7qeDZ9rMgKSv5//dPyfpj2PH2cjMDkg6prWryrqUfxbxmqS3JJ3VWtca4teS+VVkAAhyy9+CAIB6RYABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCD/Dzz21aUFqjuaAAAAAElFTkSuQmCC", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# for boxplots see https://en.wikipedia.org/wiki/Interquartile_range (or ask!)\n", - "sns.boxplot(x=s1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Heavy-tailed\n", - "Distributions with a small but non-negligible number of observations with very high values. Several probability distributions follow this pattern: https://en.wikipedia.org/wiki/Heavy-tailed_distribution#Common_heavy-tailed_distributions.\n", - "\n", - "We pick the lognormal here: https://en.wikipedia.org/wiki/Log-normal_distribution" - ] - }, - { - "cell_type": "code", - "execution_count": 91, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 91, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "s2 = np.random.lognormal(5, 1, 10000)\n", - "sns.displot(s2)" - ] - }, - { - "cell_type": "code", - "execution_count": 92, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 92, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAD4CAYAAADSIzzWAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAANZklEQVR4nO3dXWxb5R3H8d8/cQotHSpNu4oFVIOMVnE1oEKwTVM1gtZk07pLLlCzi2l3oVsvpiIiRRW+2TRVo5k0CRVN7d7QxNCGUFOt2cbdBEs3urKmLKZNIVkDxdV4WVOapM8ufOycOHYaG9v/xv5+JKv2eXnOOQ/pN86JKyyEIABA47V5nwAAtCoCDABOCDAAOCHAAOCEAAOAk0QlG2/atCkkk8k6nQoANKcTJ068H0LYXLy8ogAnk0mNjo7W7qwAoAWY2flSy7kFAQBOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4KSi/yfcpzU0NKRMJiNJmpqakiR1dXUV1qdSKfX39zfylADATUMDnMlk9PobY5pft1Htlz+QJE1/kjuF9suXGnkqAOCu4bcg5tdt1My2Xs2v69T8uk7NbOuNXm9s9KkAgCvuAQOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4CTRCMOMjQ0VJfx+vv7azouADRSQwKcyWRu6PEAwAO3IADACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcJ7xOoxsmTJyVJO3bscDl+IpGQmWl2dlYbN27UpUuXym67Zs0aSVJbW5s2bNig6enpwrqtW7eqra1N09PTGhoaUiqVkiRls1nt27dPk5OTSqfTOnTokMxMe/fu1cGDBzU4OKjOzk5ls1nt379fTzzxhA4cOKArV67owoULi8bKj5ffLr5/tfLjDQ4OStKS56WOE9+n3LGLxx0YGJCZ6emnn160TzabLbuukutdyTlVOy+1GM9DM1xDrdVzTngHXIW5uTnNzs5K0rLxlaSrV6/q6tWrunLlyqL4StL58+d17tw5zczMKJ1OF5YfPnxY4+PjmpmZ0eDgoMbGxnT69Gml02mdOnVKR44cKWx36tQppdNpnT59WmfPnl0yVvF28f2rlR/vyJEjJZ+XOk
58u5WOm7/u4n2WW1fJ9a7knCpR6/E8NMM11Fo952TVBdjrXW+9TUxMKJPJKJvNanh4uLD8448/XrRNCEHHjh1TJpPRsWPHFELQxMREybGk3Hfv+Hb5/bPZbFXnGR9veHhYw8PDhefljhPfp9yxi8c9evRoYd3w8HBhn+L5ia+r5HpXck7VzkstxvPQDNdQa/Wek4YEeGpqSplMRplMRm1XPix9Ilc+VCaT0Z49e5Z9NLN0Oq3Dhw8X3l2XMz8/r3Q6rWvXri07lpT77l283fz8fNXfzePjzc7Oam5urvC8+Lzzx4nvU+7Y5cbNv46/6y+3rpLrXck5VaLW43lohmuotXrPyXUDbGbfNbNRMxu9ePFiTQ+OxSYmJjQyMnLd7ebm5jQxMbEoRKXGkqSRkZEl283Nzen48eNVnWN8vBCCQghLnhcfJ75PuWMXjxsXQijsMzIysmh9fF0l17uSc6pErcfz0AzXUGv1npPrBjiE8GwIYXsIYfvmzZurOkhXV5dSqZRSqZSu3XxryW2u3XyrUqmUnnnmmWUfzSyZTKq7u/u62yUSCSWTSSUS5X+HmkwmJUnd3d1LtkskEnr00UerOsf4eGYmM1vyvPg48X3KHbt43DgzK+zT3d29aH18XSXXu5JzqkStx/PQDNdQa/Wek1V3D7iZDQwMqK+vTx0dHctu197eroGBAbW1lf/PNzAwIEnq6+tbsl17e7t2795d1TnGx+vo6Ch8cXZ0dCw57/xx4vuUO3a5cfOv8/v09fWVXVfJ9a7knCpR6/E8NMM11Fq952TVBfiVV17xPoW6SCaTSqVS6uzsVE9PT2H5+vXrF21jZtq5c6dSqZR27twpMyu82y0eS5I6OzsXbZffv9qP08TH6+npUU9PT+F5uePE9yl37OJxe3t7C+t6enoK+xTPT3xdJde7knOqdl5qMZ6HZriGWqv3nKzKzwF7q8fngPPvWKXcd92xsTFNTk5q//79Sz4HHH83ODExseRzwPGxireL71+t/Hj5cYqflzpO8T4rGXd8fFxmVvLda7l1lVzvSs6pErUez0MzXEOt1XNOrPgXHsvZvn17GB0drfgg8U8vnDj7rma29WrtmdzHjGa25d7prD1zVA/cvWVF93nz4zX7PWEAzcHMToQQthcvX3W3IACgWRBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHCSaMRBUqmUJCmTydR0PABYzRoS4P7+fknSnj17ajoeAKxm3IIAACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcBJotEHbL98SWvPHFX75awkae2Zo4Xl0pZGnw4AuGlogFOpVOH51NScJKmrKx/dLYvWA0Cza2iA+/v7G3k4ALihcQ8YAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcWQlj5xmYXJZ2v8libJL1f5b7NhHlYwFwsYC4WNONcbA0hbC5eWFGAPw0zGw0hbG/IwW5gzMMC5mIBc7GgleaCWxAA4IQAA4CTRgb42QYe60bGPCxgLhYwFwtaZi4adg8YALAYtyAAwAkBBgAndQ+wme00sz
fNLGNm++p9vEYzszvN7C9mNmZm/zKzPdHyjWZ23MzGoz9vi+3zZDQfb5rZ12LLHzCzU9G6g2ZmHtf0aZlZu5n9w8xejl635FyY2QYze8HMzkRfHw+34lyY2fejvxtvmNlvzOzmVpyHkkIIdXtIapf0lqS7Ja2RdFLSvfU8ZqMfkm6XdH/0/DOS/i3pXkk/krQvWr5P0g+j5/dG83CTpLui+WmP1r0m6WFJJmlYUo/39VU5J3sl/VrSy9HrlpwLSYclfSd6vkbShlabC0ldks5JWhu9/q2kb7faPJR71Psd8IOSMiGEsyGEq5Kel7SrzsdsqBDChRDC36PnH0kaU+6LbpdyfwEV/fmt6PkuSc+HED4JIZyTlJH0oJndLunWEMJfQ+6r7Uhsn1XDzO6Q9HVJh2KLW24uzOxWSV+R9JwkhRCuhhD+qxacC0kJSWvNLCFpnaT/qDXnYYl6B7hL0jux15PRsqZkZklJ90l6VdKWEMIFKRdpSZ+NNis3J13R8+Llq81PJP1A0rXYslaci7slXZT08+h2zCEzu0UtNhchhClJP5b0tqQLkj4IIfxRLTYP5dQ7wKXu0TTl597MbL2k30n6Xgjhw+U2LbEsLLN81TCzb0h6L4RwYqW7lFjWFHOh3Lu++yX9LIRwn6T/KfejdjlNORfRvd1dyt1O+JykW8zs8eV2KbFs1c9DOfUO8KSkO2Ov71Dux4+mYmYdysX3VyGEF6PF70Y/Nin6871oebk5mYyeFy9fTb4k6ZtmNqHc7aavmtkv1ZpzMSlpMoTwavT6BeWC3Gpz0S3pXAjhYghhVtKLkr6o1puHkuod4L9JusfM7jKzNZIek/RSnY/ZUNFvYp+TNBZCOBBb9ZKkvuh5n6Q/xJY/ZmY3mdldku6R9Fr0Y9hHZvZQNObu2D6rQgjhyRDCHSGEpHL/rf8cQnhcrTkX05LeMbPPR4sekXRarTcXb0t6yMzWRef/iHK/J2m1eSit3r/lk9Sr3CcD3pL0lPdvHetwfV9W7kehf0p6PXr0SuqU9CdJ49GfG2P7PBXNx5uK/SZX0nZJb0TrfqroXyquxoekHVr4FERLzoWkL0gajb42fi/ptlacC0n7JZ2JruEXyn3CoeXmodSDf4oMAE74l3AA4IQAA4ATAgwATggwADghwADghAADgBMCDABO/g+N/5VHP4nZywAAAABJRU5ErkJggg==", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "sns.boxplot(x=s2)" - ] - }, - { - "cell_type": "code", - "execution_count": 93, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 93, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# Why \"lognormal\"?\n", - "\n", - "sns.displot(np.log(s2))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Box plots\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Outliers, missing values\n", - "\n", - "An *outlier* is an observation far from the center of mass of the distribution. It might be an error or a genuine observation: this distinction requires domain knowledge. Outliers influence the outcomes of several statistics and machine learning methods: it is important to decide how to deal with them.\n", - "\n", - "A *missing value* is an observation without a value. There can be many reasons for a missing value: the value might not exist (hence its absence is informative and it should be left empty) or might not be known (hence the value exists but is missing from the dataset and it should be marked as NA).\n", - "\n", - "*One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary statistics\n", - "A statistic is a function of a collection of observations, or otherwise stated a measure over a distribution. \n", - "\n", - "A statistic is said to be *robust* if it is not sensitive to outliers.\n", - "\n", - "* Not robust: min, max, mean, standard deviation.\n", - "* Robust: mode, median, other quartiles.\n", - "\n", - "A closer look at the mean:\n", - "\n", - "$\\bar{x} = \\frac{1}{n} \\sum_{i}x_i$\n", - "\n", - "And variance (the standard deviation is the square root of the variance):\n", - "\n", - "$Var(x) = \\frac{1}{n} \\sum_{i}(x_i - \\bar{x})^2$\n", - "\n", - "The mean, the median, etc. 
are measures of location (e.g., the typical value); the variance is a measure of dispersion." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "code", - "execution_count": 95, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "4.993729761026251\n", - "241.22132048015996\n" - ] - } - ], - "source": [ - "# Not robust: min, max, mean, standard deviation\n", - "\n", - "print(np.mean(s1)) # should be close to 5\n", - "print(np.mean(s2))" - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "5.0017132760085286\n", - "148.79782622743468\n" - ] - } - ], - "source": [ - "# Robust: mode, median, other quartiles\n", - "\n", - "print(np.quantile(s1, 0.5)) # for s1, should be close to the mean and the mode\n", - "print(np.quantile(s2, 0.5))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Questions\n", - "\n", - "* Calculate the min, max, mode and standard deviation. *hint: explore the numpy documentation (numpy has no mode function, so for the mode look at scipy.stats)!*\n", - "* Calculate the 90% quantile values.\n", - "* Consider our normally distributed data in s1. Add an outlier (e.g., value 100). What happens to the mean and mode? Write down your answer and then check." - ] - }, - { - "cell_type": "code", - "execution_count": 97, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
annual_salarya_agelength
count7870.0000009303.0000009645.000000
mean5.91692114.2666885.005694
std6.9852142.9027701.462343
min0.1666671.0000000.083333
25%3.00000012.0000004.000000
50%4.00000014.0000005.000000
75%6.00000016.0000006.000000
max180.00000050.00000015.000000
\n", - "
" - ], - "text/plain": [ - " annual_salary a_age length\n", - "count 7870.000000 9303.000000 9645.000000\n", - "mean 5.916921 14.266688 5.005694\n", - "std 6.985214 2.902770 1.462343\n", - "min 0.166667 1.000000 0.083333\n", - "25% 3.000000 12.000000 4.000000\n", - "50% 4.000000 14.000000 5.000000\n", - "75% 6.000000 16.000000 6.000000\n", - "max 180.000000 50.000000 15.000000" - ] - }, - "execution_count": 97, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Let's explore our dataset\n", - "df_contracts[[\"annual_salary\",\"a_age\",\"length\"]].describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Relating two variables\n", - "\n", - "### Covariance\n", - "\n", - "Measure of association, specifically of the joint linear variability of two variables:\n", - "\n", - "$Cov(x, y) = \\frac{1}{n} \\sum_{i}(x_i - \\bar{x})(y_i - \\bar{y})$\n", - "\n", - "Its normalized version is called the (Pearson's) correlation coefficient:\n", - "\n", - "$\\rho_{x,y} = \\frac{Cov(x, y)}{\\sqrt{Var(x)} \\sqrt{Var(y)}}$\n", - "\n", - "Correlation is helpful to spot possible relations, but it is tricky to interpret and it is not exhaustive (for example, it only captures linear association).\n", - "\n", - "\n", - "\n", - "See: https://en.wikipedia.org/wiki/Covariance and https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.\n", - "\n", - "*Note: correlation is not causation!*" - ] - }, - { - "cell_type": "code", - "execution_count": 98, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
annual_salarya_agelength
annual_salary1.0000000.205404-0.361611
a_age0.2054041.000000-0.430062
length-0.361611-0.4300621.000000
\n", - "
" - ], - "text/plain": [ - " annual_salary a_age length\n", - "annual_salary 1.000000 0.205404 -0.361611\n", - "a_age 0.205404 1.000000 -0.430062\n", - "length -0.361611 -0.430062 1.000000" - ] - }, - "execution_count": 98, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_contracts[[\"annual_salary\",\"a_age\",\"length\"]].corr()" - ] - }, - { - "cell_type": "code", - "execution_count": 99, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 99, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "sns.scatterplot(x=df_contracts.length,y=df_contracts.annual_salary)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Other ways to measure correlation exist. For example, if you are interested in whether one variable tends to increase (or decrease) as another variable increases (or decreases), the *Spearman’s or Kendall’s rank correlation coefficients* might work well." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Questions\n", - "\n", - "* Try to explore the correlation of other variables in the dataset.\n", - "* Can you think of a possible explanation for the trend we see: older apprentices with a shorter contract getting on average a higher annual salary?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Sampling and uncertainty\n", - "\n", - "Often, we work with samples and we want the sample to be representative of the population it is taken from, in order to draw conclusions that generalise from the sample to the full population.\n", - "\n", - "Sampling is *tricky*. Samples have *variance* (variation between samples from the same population) and *bias* (systematic deviation from the population)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Further reading\n", - "\n", - "* For a complementary introduction to statistics and data analysis, see https://www.humanitiesdataanalysis.org/statistics-essentials/notebook.html.\n", - "* Related to statistics and data analysis is the realm of probability theory, which allows us to formally model and calculate the likelihood of events. For an introduction, see https://www.humanitiesdataanalysis.org/intro-probability/notebook.html." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Part 2: working with texts\n", - "\n", - "Let's get some basics (or a refresher) of working with texts in Python. Texts are sequences of discrete symbols (words or, more generically, tokens).\n", - "\n", - "Key challenge: representing text for further processing. Two mainstream approaches:\n", - "* *Bag of words*: a text is a collection of tokens, each occurring with a certain frequency and assumed to be independent of the others within the text. The mapping from texts to features is deterministic and straightforward: each text is represented as a vector whose size is that of the vocabulary.\n", - "* *Embeddings*: a method (typically a neural network) is used to learn a mapping from each token to a (usually small) vector representing it. A text can be represented in turn as an aggregation of these embeddings." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Import the dataset\n", - "Let us load Elon Musk's tweets dataset into memory.\n", - "\n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [], - "source": [ - "root_folder = \"../data/musk_tweets\"\n", - "df_elon = pd.read_csv(codecs.open(os.path.join(root_folder,\"elonmusk_tweets.csv\"), encoding=\"utf8\"), sep=\",\")\n", - "df_elon['text'] = df_elon['text'].str[1:]" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
idcreated_attext
08496368680522752002017-04-05 14:56:29'And so the robots spared humanity ... https:/...
18489887305850961922017-04-03 20:01:01\"@ForIn2020 @waltmossberg @mims @defcon_5 Exac...
28489430724234977282017-04-03 16:59:35'@waltmossberg @mims @defcon_5 Et tu, Walt?'
38489357050572800012017-04-03 16:30:19'Stormy weather in Shortville ...'
48484160495736586242017-04-02 06:05:23\"@DaveLeeBBC @verge Coal is dying due to nat g...
\n", - "
" - ], - "text/plain": [ - " id created_at \\\n", - "0 849636868052275200 2017-04-05 14:56:29 \n", - "1 848988730585096192 2017-04-03 20:01:01 \n", - "2 848943072423497728 2017-04-03 16:59:35 \n", - "3 848935705057280001 2017-04-03 16:30:19 \n", - "4 848416049573658624 2017-04-02 06:05:23 \n", - "\n", - " text \n", - "0 'And so the robots spared humanity ... https:/... \n", - "1 \"@ForIn2020 @waltmossberg @mims @defcon_5 Exac... \n", - "2 '@waltmossberg @mims @defcon_5 Et tu, Walt?' \n", - "3 'Stormy weather in Shortville ...' \n", - "4 \"@DaveLeeBBC @verge Coal is dying due to nat g... " - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_elon.head(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(2819, 3)" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_elon.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Natural Language Processing in Python" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [], - "source": [ - "# import some of the most popular libraries for NLP in Python\n", - "import spacy\n", - "import nltk\n", - "import string\n", - "import sklearn" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package punkt to\n", - "[nltk_data] /Users/giovannicolavizza/nltk_data...\n", - "[nltk_data] Unzipping tokenizers/punkt.zip.\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "nltk.download('punkt')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A typical NLP pipeline might look like the 
following:\n", - " \n", - "\n", - "\n", - "### Tokenization: splitting a text into constituent tokens" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [], - "source": [ - "from nltk.tokenize import TweetTokenizer, word_tokenize\n", - "tknzr = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl\"\n" - ] - } - ], - "source": [ - "example_tweet = df_elon.text[1]\n", - "print(example_tweet)" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['\"', '@ForIn2020', '@waltmossberg', '@mims', '@defcon_5', 'Exactly', '.', 'Tesla', 'is', 'absurdly', 'overvalued', 'if', 'based', 'on', 'the', 'past', ',', 'but', \"that's\", 'irr', '\\\\', 'xe2', '\\\\', 'x80', '\\\\', 'xa6', 'https://t.co/qQcTqkzgMl', '\"']\n", - "['``', '@', 'ForIn2020', '@', 'waltmossberg', '@', 'mims', '@', 'defcon_5', 'Exactly', '.', 'Tesla', 'is', 'absurdly', 'overvalued', 'if', 'based', 'on', 'the', 'past', ',', 'but', 'that', \"'s\", 'irr\\\\xe2\\\\x80\\\\xa6', 'https', ':', '//t.co/qQcTqkzgMl', \"''\"]\n" - ] - } - ], - "source": [ - "tkz1 = tknzr.tokenize(example_tweet)\n", - "print(tkz1)\n", - "tkz2 = word_tokenize(example_tweet)\n", - "print(tkz2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Question: can you spot what the Twitter tokenizer does differently from the standard one (look at handles, contractions and URLs)?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "string.punctuation" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [], - "source": [ - "# some more pre-processing\n", - "\n", - "def filter(tweet):\n", - " \n", - " # remove punctuation and short words and urls\n", - " tweet = [t for t in tweet if t not in string.punctuation and len(t) > 3 and not t.startswith(\"http\")]\n", - " return tweet\n", - "\n", - "def tokenize_and_string(tweet):\n", - " \n", - " tkz = tknzr.tokenize(tweet)\n", - " \n", - " tkz = filter(tkz)\n", - " \n", - " return \" \".join(tkz)" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['\"', '@ForIn2020', '@waltmossberg', '@mims', '@defcon_5', 'Exactly', '.', 'Tesla', 'is', 'absurdly', 'overvalued', 'if', 'based', 'on', 'the', 'past', ',', 'but', \"that's\", 'irr', '\\\\', 'xe2', '\\\\', 'x80', '\\\\', 'xa6', 'https://t.co/qQcTqkzgMl', '\"']\n", - "['@ForIn2020', '@waltmossberg', '@mims', '@defcon_5', 'Exactly', 'Tesla', 'absurdly', 'overvalued', 'based', 'past', \"that's\"]\n" - ] - } - ], - "source": [ - "print(tkz1)\n", - "print(filter(tkz1))" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "df_elon[\"clean_text\"] = df_elon[\"text\"].apply(tokenize_and_string)" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
idcreated_attextclean_text
08496368680522752002017-04-05 14:56:29'And so the robots spared humanity ... https:/...robots spared humanity
18489887305850961922017-04-03 20:01:01\"@ForIn2020 @waltmossberg @mims @defcon_5 Exac...@ForIn2020 @waltmossberg @mims @defcon_5 Exact...
28489430724234977282017-04-03 16:59:35'@waltmossberg @mims @defcon_5 Et tu, Walt?'@waltmossberg @mims @defcon_5 Walt
38489357050572800012017-04-03 16:30:19'Stormy weather in Shortville ...'Stormy weather Shortville
48484160495736586242017-04-02 06:05:23\"@DaveLeeBBC @verge Coal is dying due to nat g...@DaveLeeBBC @verge Coal dying fracking It's ba...
\n", - "
" - ], - "text/plain": [ - " id created_at \\\n", - "0 849636868052275200 2017-04-05 14:56:29 \n", - "1 848988730585096192 2017-04-03 20:01:01 \n", - "2 848943072423497728 2017-04-03 16:59:35 \n", - "3 848935705057280001 2017-04-03 16:30:19 \n", - "4 848416049573658624 2017-04-02 06:05:23 \n", - "\n", - " text \\\n", - "0 'And so the robots spared humanity ... https:/... \n", - "1 \"@ForIn2020 @waltmossberg @mims @defcon_5 Exac... \n", - "2 '@waltmossberg @mims @defcon_5 Et tu, Walt?' \n", - "3 'Stormy weather in Shortville ...' \n", - "4 \"@DaveLeeBBC @verge Coal is dying due to nat g... \n", - "\n", - " clean_text \n", - "0 robots spared humanity \n", - "1 @ForIn2020 @waltmossberg @mims @defcon_5 Exact... \n", - "2 @waltmossberg @mims @defcon_5 Walt \n", - "3 Stormy weather Shortville \n", - "4 @DaveLeeBBC @verge Coal dying fracking It's ba... " - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_elon.head(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [], - "source": [ - "# save cleaned up version\n", - "\n", - "df_elon.to_csv(os.path.join(root_folder,\"df_elon.csv\"), index=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Building a dictionary" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(2819, 7864)" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.feature_extraction.text import CountVectorizer\n", - "count_vect = CountVectorizer(lowercase=False, tokenizer=tknzr.tokenize)\n", - "X_count = count_vect.fit_transform(df_elon.clean_text)\n", - "X_count.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "6617" - ] - }, - "execution_count": 48, - "metadata": {}, - 
"output_type": "execute_result" - } - ], - "source": [ - "word_list = count_vect.get_feature_names_out() \n", - "count_list = X_count.toarray().sum(axis=0)\n", - "dictionary = dict(zip(word_list,count_list))\n", - "count_vect.vocabulary_.get(\"robots\")" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "3" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_count[:,count_vect.vocabulary_.get(\"robots\")].toarray().sum()" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "3" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dictionary[\"robots\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Questions\n", - "\n", - "* Find the tokens most used by Elon.\n", - "* Find the Twitter users most often mentioned by Elon (hint: handles start with @)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('Tesla', 322),\n", - " ('Model', 236),\n", - " ('that', 223),\n", - " ('will', 218),\n", - " ('with', 177),\n", - " ('@SpaceX', 169),\n", - " ('from', 163),\n", - " ('this', 159),\n", - " ('@TeslaMotors', 149),\n", - " ('launch', 124)]" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dictionary_list = sorted(dictionary.items(), key=lambda x:x[1], reverse=True)\n", - "[d for d in dictionary_list][:10]" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('@SpaceX', 169),\n", - " ('@TeslaMotors', 149),\n", - " ('@elonmusk', 85),\n", - " ('@NASA', 48),\n", - " ('@Space_Station', 19),\n", - " ('@FredericLambert', 17),\n", - " ('@ID_AA_Carmack', 15),\n", - " ('@WIRED', 14),\n", - " ('@vicentes', 14),\n", - " ('@BadAstronomer', 11)]" - ] - }, - "execution_count": 52, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dictionary_list_users = [d for d in dictionary_list if d[0].startswith('@')]\n", - "dictionary_list_users[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Representing tweets as vectors\n", - "\n", - "Texts are of variable length and need to be represented numerically in some way. Most typically, we represent them as *equally-sized vectors*.\n", - "\n", - "Actually, this is what we have already done! Let's take a closer look at `X_count` above." - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "id 849636868052275200\n", - "created_at 2017-04-05 14:56:29\n", - "text 'And so the robots spared humanity ... 
https:/...\n", - "clean_text robots spared humanity\n", - "Name: 0, dtype: object" - ] - }, - "execution_count": 53, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This is the first Tweet of the data frame\n", - "\n", - "df_elon.loc[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [], - "source": [ - "# let's get the vector representation for this Tweet\n", - "\n", - "vector_representation = X_count[0,:]" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "3" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# The entries sum to 3, as we would expect: the vector contains 1 in the columns of the 3 words that make up the cleaned Tweet. \n", - "# An entry would be higher than 1 if a given word occurred multiple times.\n", - "\n", - "np.sum(vector_representation)" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1\n", - "1\n", - "1\n" - ] - } - ], - "source": [ - "# Let's check that indeed the vector contains 1s for the right words\n", - "# Remember, the vector has shape (1 x size of the vocabulary)\n", - "\n", - "print(vector_representation[0,count_vect.vocabulary_.get(\"robots\")])\n", - "print(vector_representation[0,count_vect.vocabulary_.get(\"spared\")])\n", - "print(vector_representation[0,count_vect.vocabulary_.get(\"humanity\")])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Term Frequency - Inverse Document Frequency\n", - "We can use boolean counts (1/0) and raw counts (as we did before) to represent a Tweet over the space of the vocabulary, but there exist improvements on this basic idea. 
For example, the TF-IDF weighting scheme:\n", - "\n", - "$tfidf(t, d, D) = tf(t, d) \\cdot idf(t, D)$\n", - "\n", - "$tf(t, d) = f_{t,d}$\n", - "\n", - "$idf(t, D) = \\log \\Big( \\frac{|D|}{|\\{d \\in D: t \\in d\\}|} \\Big)$\n", - "\n", - "Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant of the idf and L2-normalizes each row, so its values differ from this textbook definition." - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(2819, 7864)" - ] - }, - "execution_count": 57, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.feature_extraction.text import TfidfVectorizer\n", - "count_vect = TfidfVectorizer(lowercase=False, tokenizer=tknzr.tokenize)\n", - "X_count_tfidf = count_vect.fit_transform(df_elon.clean_text)\n", - "X_count_tfidf.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1.7226760995112569" - ] - }, - "execution_count": 58, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_count_tfidf[0,:].sum()" - ] - }, - { - "cell_type": "code", - "execution_count": 59, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "3" - ] - }, - "execution_count": 59, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_count[0,:].sum()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Sparse vectors (mention)\n", - "How is Python representing these vectors in memory? Most of their cells are set to zero. \n", - "\n", - "We call any vector or matrix whose cells are mostly zero *sparse*.\n", - "There are efficient ways to store them in memory." 
- ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<1x7864 sparse matrix of type ''\n", - "\twith 3 stored elements in Compressed Sparse Row format>" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_count_tfidf[0,:]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Spacy pipelines\n", - "\n", - "Useful to construct sequences of pre-processing steps: https://spacy.io/usage/processing-pipelines." - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [], - "source": [ - "# Load a pre-trained pipeline (Web Small): https://spacy.io/usage/models\n", - "\n", - "#!python -m spacy download en_core_web_sm\n", - "nlp = spacy.load('en_core_web_sm')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*.. the model’s meta.json tells spaCy to use the language \"en\" and the pipeline [\"tagger\", \"parser\", \"ner\"]. spaCy will then initialize spacy.lang.en.English, and create each pipeline component and add it to the processing pipeline. 
It’ll then load in the model’s data from its data directory and return the modified Language class for you to use as the nlp object.*\n", - "\n", - "Let's create a simple pipeline that does **lemmatization**, **part of speech tagging** and **named entity recognition** using spaCy models.\n", - "\n", - "*If you don't know what these NLP tasks are, please ask!*" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [], - "source": [ - "tweet_pos = list()\n", - "tweet_ner = list()\n", - "tweet_lemmas = list()\n", - "\n", - "for tweet in df_elon.text.values:\n", - " spacy_tweet = nlp(tweet)\n", - " \n", - " local_tweet_pos = list()\n", - " local_tweet_ner = list()\n", - " local_tweet_lemmas = list()\n", - " \n", - " for sentence in list(spacy_tweet.sents):\n", - " # --- lemmatization, remove punctuation and stop words\n", - " local_tweet_lemmas.extend([token.lemma_ for token in sentence if not (token.is_punct or token.is_stop)])\n", - " local_tweet_pos.extend([token.pos_ for token in sentence if not (token.is_punct or token.is_stop)])\n", - " for ent in spacy_tweet.ents:\n", - " local_tweet_ner.append(ent)\n", - "\n", - " tweet_pos.append(local_tweet_pos)\n", - " tweet_ner.append(local_tweet_ner)\n", - " tweet_lemmas.append(local_tweet_lemmas)" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['robot', 'spare', 'humanity', 'https://t.co/v7JUJQWfCv']" - ] - }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tweet_lemmas[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['NOUN', 'VERB', 'NOUN', 'NOUN']" - ] - }, - "execution_count": 68, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tweet_pos[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": {}, - "outputs": [ - { 
- "data": { - "text/plain": [ - "[https://t.co/v7JUJQWfCv]" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tweet_ner[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Shortville]" - ] - }, - "execution_count": 70, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# but it actually works!\n", - "\n", - "tweet_ner[3]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Note: we are really just scratching the surface of spaCy, but it is worth knowing it's there.*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Searching tweets\n", - "\n", - "Once we have represented Tweets as vectors, we can easily find similar ones using basic operations such as filtering." - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "robots spared humanity\n" - ] - } - ], - "source": [ - "target = 0\n", - "print(df_elon.clean_text[target])" - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": {}, - "outputs": [], - "source": [ - "condition = X_count_tfidf[target,:] > 0" - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " (0, 5198)\tTrue\n", - " (0, 6617)\tTrue\n", - " (0, 6949)\tTrue\n" - ] - } - ], - "source": [ - "print(condition)" - ] - }, - { - "cell_type": "code", - "execution_count": 74, - "metadata": {}, - "outputs": [], - "source": [ - "X_filtered = X_count_tfidf[:,np.ravel(condition.toarray())]" - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<2819x3 sparse matrix of type ''\n", - "\twith 16 stored elements in Compressed Sparse Row 
format>" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_filtered" - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " (0, 0)\t0.495283407359234\n", - " (0, 2)\t0.6406029997190412\n", - " (0, 1)\t0.5867896924329815\n", - " (217, 0)\t0.2972381925908634\n", - " (271, 0)\t0.3284547085372313\n", - " (464, 0)\t0.2273880239746895\n", - " (473, 0)\t0.5667220639589731\n", - " (734, 1)\t0.3846355279044392\n", - " (940, 0)\t0.27312597149485407\n", - " (1004, 0)\t0.28161575586607157\n", - " (1550, 1)\t0.33303254164524276\n", - " (1862, 0)\t0.3196675199194523\n", - " (2493, 0)\t0.2685018991334563\n", - " (2559, 0)\t0.31145247014227906\n", - " (2565, 0)\t0.2645117238497897\n", - " (2661, 0)\t0.2729016388865858\n" - ] - } - ], - "source": [ - "print(X_filtered)" - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(array([ 0, 217, 271, 464, 473, 940, 1004, 1862, 2493, 2559, 2565,\n", - " 2661, 0, 734, 1550, 0], dtype=int32),\n", - " array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2], dtype=int32),\n", - " array([0.49528341, 0.29723819, 0.32845471, 0.22738802, 0.56672206,\n", - " 0.27312597, 0.28161576, 0.31966752, 0.2685019 , 0.31145247,\n", - " 0.26451172, 0.27290164, 0.58678969, 0.38463553, 0.33303254,\n", - " 0.640603 ]))" - ] - }, - "execution_count": 77, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from scipy import sparse\n", - "\n", - "sparse.find(X_filtered)" - ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": {}, - "outputs": [], - "source": [ - "tweet_indices = list(sparse.find(X_filtered)[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "TARGET: 
robots spared humanity\n", - "1)@JustBe74 important make humanity proud this case particular duty owed American taxpayer\n", - "2)@pud Faith restored humanity French toast money\n", - "3)humanity have exciting inspiring future cannot confined Earth forever @love_to_dream #APSpaceChat\n", - "4)@ShireeshAgrawal like humanity\n", - "5)Creating neural lace thing that really matters humanity achieve symbiosis with machines\n", - "6)@tzepr Certainly agree that first foremost triumph humanity cheering good spirit\n", - "7)@ReesAndersen @FLIxrisk believe that critical ensure good future humanity\n", - "8)@NASA #Mars hard x99s worth risks extend humanity x99s frontier beyond Earth Learn about neighbor planet\n", - "9)Astronomer Royal Martin Rees soon will robots take over world @Telegraph\n", - "10)@thelogicbox @IanrossWins Mars critical long-term survival humanity life Earth know\n", - "11)humanity wishes become multi-planet species then must figure move millions people Mars\n", - "12)Sure feels weird find myself defending robots\n", - "13)Neil Armstrong hero humanity spirit will carry stars\n" - ] - } - ], - "source": [ - "print(\"TARGET: \" + df_elon.clean_text[target])\n", - "\n", - "for n, tweet_index in enumerate(list(set(tweet_indices))):\n", - " if tweet_index != target:\n", - " print(str(n) +\")\"+ df_elon.clean_text[tweet_index])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Questions\n", - "\n", - "* Can you rank the matched tweets using their tf-idf weights, so as to put higher-weighted tweets first?\n", - "* What limitations do you think a bag-of-words representation has?\n", - "* Can you spot any limitations of this approach based on similarity measures over bag-of-words representations?"
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}
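One possible way to approach the first question is sketched below in plain Python, without scikit-learn, so that the formula stays visible. It uses the textbook definitions given in the TF-IDF section (raw term count for tf, log of the inverse document frequency for idf). The helper names `tfidf_vectors` and `rank_matches` and the `toy_tweets` list are hypothetical and not part of the notebook. Because `TfidfVectorizer` by default smooths the idf and L2-normalizes each row, the weights here will not match the notebook's values numerically.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    tf(t, d) is the raw count of t in d and
    idf(t, D) = log(|D| / number of documents containing t).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def rank_matches(docs, target):
    """Rank documents sharing at least one term with docs[target].

    The score is the summed tf-idf weight (in the candidate document)
    of the terms it shares with the target; highest score first,
    ties broken by document index.
    """
    vectors = tfidf_vectors(docs)
    target_terms = set(vectors[target])
    scores = []
    for i, vec in enumerate(vectors):
        if i == target:
            continue
        shared = target_terms & set(vec)
        if shared:
            scores.append((sum(vec[t] for t in shared), i))
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy data standing in for df_elon.clean_text.
toy_tweets = [
    "robots spared humanity",
    "humanity have exciting future",
    "defending robots feels weird",
    "mars is hard",
    "spared spared again",
]
ranking = rank_matches(toy_tweets, target=0)
for position, index in enumerate(ranking, start=1):
    print(f"{position}) {toy_tweets[index]}")
```

Summing the weights of the shared terms is only one of several reasonable scoring choices. Cosine similarity between the full vectors would also take document length into account.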