diff --git a/inference_causality/inference_causality.html b/inference_causality/inference_causality.html deleted file mode 100644 index c06b00a9..00000000 --- a/inference_causality/inference_causality.html +++ /dev/null @@ -1,1590 +0,0 @@ - - - - - - - - - -Causal Inference and Confounding - - - - - - - - - - - - - - - - - - - - - - - - -
- -
- -
-
-

Causal Inference and Confounding

-
- - - -
- - - - -
- - -
- - -
-
-
- -
-
-Learning Outcomes -
-
-
-
-
-
    -
  • Construct confidence intervals for hypothesis testing using bootstrapping
  • -
  • Understand the assumptions we make and their impact on our regression inference
  • -
  • Explore ways to overcome issues of multicollinearity
  • -
  • Compare regression correlation and causation
  • -
-
-
-
-

Last time, we introduced the idea of random variables and how they affect the data and model we construct. We also demonstrated the decomposition of model risk from a fitted model and dived into the bias-variance tradeoff.

-

In this lecture, we will explore regression inference via hypothesis testing, understand how to use bootstrapping under the right assumptions, and consider the environment of understanding causality in theory and in practice.

-
-

Parameter Inference: Interpreting Regression Coefficients

-

There are two main reasons why we build models:

-
    -
  • Prediction: using our model to make accurate predictions on unseen data.
  • -
  • Inference: using our model to draw conclusions about the underlying relationship(s) between our features and response. Its goal is to understand complex phenomena occurring in the world we live in. (In some machine learning contexts, “inference” instead refers to making predictions with an already-trained model; in this lecture we use the statistical meaning.)
  • -
-

Recall the framework we established in the last lecture. The true underlying relationship between the data points is given by \(Y = g(x) + \epsilon\), where \(g(x)\) is the true underlying relationship, and \(\epsilon\) represents randomness. If we assume \(g(x)\) is linear, we can express this relationship in terms of the unknown, true model parameters \(\theta\).

-

\[f_{\theta}(x) = g(x) + \epsilon = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon\]

-

Our model attempts to estimate each true population parameter \(\theta_i\) using the sample estimates \(\hat{\theta}_i\) calculated from the design matrix \(\Bbb{X}\) and response vector \(\Bbb{Y}\).

-

\[f_{\hat{\theta}}(x) = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \ldots + \hat{\theta}_p x_p\]

-

Let’s pause for a moment. At this point, we’re very used to working with the idea of a model parameter. But what exactly does each coefficient \(\theta_i\) actually mean? We can think of each \(\theta_i\) as a slope of the linear model – if all other variables are held constant, a unit change in \(x_i\) will result in a \(\theta_i\) change in \(f_{\theta}(x)\). Broadly speaking, a large value of \(\theta_i\) means that the feature \(x_i\) has a large effect on the response; conversely, a small value of \(\theta_i\) means that \(x_i\) has little effect on the response. In the extreme case, if the true parameter \(\theta_i\) is 0, then the feature \(x_i\) has no effect on \(Y(x)\).
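
To make the “slope” interpretation concrete, here is a minimal sketch with made-up parameter values (the numbers in `theta` and the helper `predict` are illustrative assumptions, not estimates from any dataset in these notes). It checks that increasing \(x_1\) by one unit while holding \(x_2\) fixed changes the model output by exactly \(\theta_1\).

```python
import numpy as np

# Hypothetical parameters for a two-feature linear model (illustrative values only)
theta = np.array([1.0, 2.5, -0.7])  # [theta_0, theta_1, theta_2]

def predict(x1, x2):
    """Evaluate f_theta(x) = theta_0 + theta_1 * x1 + theta_2 * x2."""
    return theta[0] + theta[1] * x1 + theta[2] * x2

# Increase x1 by one unit while holding x2 fixed at 10
change = predict(4.0, 10.0) - predict(3.0, 10.0)
print(change)  # matches theta_1 up to floating-point rounding
```

This only works because we can move \(x_1\) without moving \(x_2\); that assumption becomes important once features are related to one another.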

-

If the true parameter \(\theta_i\) for a particular feature is 0, this tells us something pretty significant about the world: there is no underlying relationship between \(x_i\) and \(Y(x)\)! How then, can we test if a parameter is actually 0? As a baseline, we go through our usual process of drawing a sample, using this data to fit a model, and computing an estimate \(\hat{\theta}_i\). However, we need to also consider the fact that if our random sample had come out differently, we may have found a different result for \(\hat{\theta}_i\). To infer if the true parameter \(\theta_i\) is 0, we want to draw our conclusion from the distribution of \(\hat{\theta}_i\) estimates we could have drawn across all other random samples. This is where hypothesis testing comes in handy!

-

To test if the true parameter \(\theta_i\) is 0, we construct a hypothesis test where our null hypothesis states that the true parameter \(\theta_i\) is 0, and the alternative hypothesis states that the true parameter \(\theta_i\) is not 0. If our p-value is smaller than our cutoff value (usually p=0.05), we reject the null hypothesis.

-
-
-

Bootstrap Resampling (Review)

-

To determine properties of the sampling distribution of an estimator (for example, its variance), we’d need to have access to the population so that we can consider all possible samples and compute an estimate for each sample.

-

-y_hat -

-

However, we don’t have access to the population; we only have one random sample from the population. How can we consider all possible samples if we only have one?

-

The idea of bootstrapping is to treat our random sample as a "population" and resample from it with replacement. Intuitively, a random sample resembles the population, so a random resample also resembles a random sample of the population.

-
-
-
- -
-
-Why must we resample with replacement? -
-
-
-
-
-

Given an original sample of size \(n\), we want a resample that has the same size \(n\) as the original. Sampling without replacement will give us the original sample with shuffled rows. Hence, when we calculate summary statistics like the average, our sample without replacement will always have the same average as the original sample, defeating the purpose of a bootstrap.
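
The point about replacement can be checked numerically. The sketch below uses a synthetic sample (an assumption made purely for illustration) and NumPy's `Generator.choice`: a size-\(n\) draw without replacement is just a permutation, so its mean is identical to the original mean, while a draw with replacement generally differs.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=50)  # synthetic stand-in for an observed sample

# Without replacement, a size-n "resample" is only a reshuffling of the sample
without = rng.choice(sample, size=len(sample), replace=False)
# With replacement, some points repeat and others are left out
with_repl = rng.choice(sample, size=len(sample), replace=True)

same_mean_without = bool(np.isclose(without.mean(), sample.mean()))
print(same_mean_without)                  # the permutation has the same mean
print(sample.mean(), with_repl.mean())    # these two typically differ
```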

-
-
-
-

-y_hat -

-

Bootstrap resampling is a technique for estimating the sampling distribution of an estimator. To execute it, we can follow the pseudocode below:

-
collect a random sample of size n (called the bootstrap population)
-
-initiate list of estimates
-
-repeat 10,000 times:
-    resample with replacement from bootstrap population
-    apply estimator f to the resample
-    store in list
-
-list of estimates is the bootstrapped sampling distribution of f
-
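
The pseudocode above can be sketched in Python as follows. This is one possible implementation, not the course's reference code: the function name `bootstrap`, its arguments, and the synthetic exponential sample are all assumptions for illustration.

```python
import numpy as np

def bootstrap(sample, estimator, num_repetitions=10_000, seed=None):
    """Return bootstrap estimates of `estimator` computed on resamples of `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    estimates = np.empty(num_repetitions)
    for i in range(num_repetitions):
        # resample with replacement from the bootstrap population
        resample = rng.choice(sample, size=n, replace=True)
        # apply the estimator to the resample and store the result
        estimates[i] = estimator(resample)
    return estimates

# Synthetic "observed" sample of size 50 (illustrative only)
rng = np.random.default_rng(42)
observed = rng.exponential(scale=3.0, size=50)

boot_means = bootstrap(observed, np.mean, num_repetitions=2_000, seed=0)
print(observed.mean(), boot_means.mean(), boot_means.std())
```

The center of `boot_means` sits near the mean of `observed`, not necessarily near the population mean, which is the caveat discussed in the text.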

How well does bootstrapping actually represent our population? The bootstrapped sampling distribution of an estimator does not exactly match the sampling distribution of that estimator, but it is often close. Similarly, the variance of the bootstrapped distribution is often close to the true variance of the estimator. The example below displays the results of different bootstraps from a known population using a sample size of \(n=50\).

-

-y_hat -

-

In the real world, we don’t know the population distribution. The center of the bootstrapped distribution is the estimator applied to our original sample, so we have no way of recovering the estimator’s true expected value; the center and spread of our bootstrap are approximations. The quality of our bootstrapped distribution also depends on the quality of our original sample; if our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for large samples, when the population distribution is not heavily skewed (no outliers), and when the estimator is “low variance” (insensitive to extreme values).

-
-

Simple Bootstrap Example

-

TODO

- -
-
-
-

Collinearity

-
-

Hypothesis Testing through Bootstrap: Snowy Plover Demo

-

An equivalent way to execute the hypothesis test described earlier is through bootstrapping (this equivalence can be proven through the duality argument, which is out of scope for this class). We use bootstrapping to compute approximate 95% confidence intervals for each \(\theta_i\). If the interval doesn’t contain 0, we reject the null hypothesis at the p=5% level. Otherwise, the data is consistent with the null, as the true parameter could possibly be 0.
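
As a rough sketch of this decision rule, the helpers below (hypothetical names `percentile_ci` and `rejects_zero`, not part of the original notes) compute a percentile interval from a list of bootstrapped estimates and check whether it excludes 0. The two arrays of “estimates” are synthetic draws used only to exercise the rule.

```python
import numpy as np

def percentile_ci(estimates, confidence_level=95):
    """Percentile interval: cut equal tails off the bootstrapped distribution."""
    tail = (100 - confidence_level) / 2
    lower, upper = np.percentile(estimates, [tail, 100 - tail])
    return lower, upper

def rejects_zero(estimates, confidence_level=95):
    """Reject the null (theta_i = 0) only if 0 lies outside the interval."""
    lower, upper = percentile_ci(estimates, confidence_level)
    return not (lower <= 0 <= upper)

# Synthetic stand-ins for bootstrapped estimates (illustrative only)
rng = np.random.default_rng(1)
far_from_zero = rng.normal(loc=5.0, scale=1.0, size=10_000)
near_zero = rng.normal(loc=0.2, scale=1.0, size=10_000)

print(rejects_zero(far_from_zero))  # interval excludes 0, so we reject
print(rejects_zero(near_zero))      # interval contains 0, so we fail to reject
```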

-

To show an example of this hypothesis testing process, we’ll work with the snowy plover dataset throughout this section. The data are about the eggs and newly-hatched chicks of the Snowy Plover. The data were collected at the Point Reyes National Seashore by a former student at Berkeley. Here’s a parent bird and some eggs.

-

-bvt -

-

Note that Egg Length and Egg Breadth (widest diameter) are measured in millimeters, and Egg Weight and Bird Weight are measured in grams; for comparison, a standard paper clip weighs about one gram.

-
-
import pandas as pd
-eggs = pd.read_csv("data/snowy_plover.csv")
-eggs.head(5)
-
-
-
   egg_weight  egg_length  egg_breadth  bird_weight
0         7.4       28.80        21.84          5.2
1         7.7       29.04        22.45          5.4
2         7.9       29.36        22.48          5.6
3         7.5       30.10        21.71          5.3
4         8.3       30.17        22.75          5.9
- -
-
-
-

Our goal will be to predict the weight of a newborn plover chick, which we assume follows the true relationship \(Y = f_{\theta}(x)\) below.

-

\[\text{bird\_weight} = \theta_0 + \theta_1 \text{egg\_weight} + \theta_2 \text{egg\_length} + \theta_3 \text{egg\_breadth} + \epsilon\]

-
    -
  • For each \(i\), the parameter \(\theta_i\) is a fixed number, but it is unobservable. We can only estimate it.
  • -
  • The random error \(\epsilon\) is also unobservable, but it is assumed to have expectation 0 and be independent and identically distributed across eggs.
  • -
-

Say we wish to determine if the egg_weight impacts the bird_weight of a chick – we want to infer if \(\theta_1\) is equal to 0.

-

First, we define our hypotheses:

-
    -
  • Null hypothesis: the true parameter \(\theta_1\) is 0; any variation is due to random chance.
  • -
  • Alternative hypothesis: the true parameter \(\theta_1\) is not 0.
  • -
-

Next, we use our data to fit a model \(\hat{Y} = f_{\hat{\theta}}(x)\) that approximates the relationship above. This gives us the observed value of \(\hat{\theta}_1\) found from our data.

-
-
from sklearn.linear_model import LinearRegression
-import numpy as np
-
-X = eggs[["egg_weight", "egg_length", "egg_breadth"]]
-Y = eggs["bird_weight"]
-
-model = LinearRegression()
-model.fit(X, Y)
-
-# This gives an array containing the fitted model parameter estimates
-thetas = model.coef_
-
-# Put the parameter estimates in a nice table for viewing
-display(pd.DataFrame(
-  [model.intercept_] + list(model.coef_),
-  columns=['theta_hat'],
-  index=['intercept', 'egg_weight', 'egg_length', 'egg_breadth']
-))
-
-# Note: this is the mean squared error (MSE); no square root is taken
-print("MSE", np.mean((Y - model.predict(X)) ** 2))
-
-
             theta_hat
intercept    -4.605670
egg_weight    0.431229
egg_length    0.066570
egg_breadth   0.215914
-
-
-
MSE 0.04547085380275766
-
-
-

Our single sample of data gives us the value of \(\hat{\theta}_1=0.431\). To get a sense of how this estimate might vary if we were to draw different random samples, we will use bootstrapping. To construct a bootstrap sample, we will draw a resample from the collected data that:

-
    -
  • Has the same sample size as the collected data
  • -
  • Is drawn with replacement (this ensures that we don’t draw the exact same sample every time!)
  • -
-

We draw a bootstrap sample, use this sample to fit a model, and record the result for \(\hat{\theta}_1\) on this bootstrapped sample. We then repeat this process many times to generate a bootstrapped empirical distribution of \(\hat{\theta}_1\). This gives us an estimate of what the true distribution of \(\hat{\theta}_1\) across all possible samples might look like.

-
-
# Set a random seed so you generate the same random sample as staff
-# In the "real world", we wouldn't do this
-import numpy as np
-np.random.seed(1337)
-
-# Set the sample size of each bootstrap sample
-n = len(eggs)
-
-# Create a list to store all the bootstrapped estimates
-estimates = []
-
-# Generate a bootstrap resample from `eggs` and find an estimate for theta_1 using this sample. 
-# Repeat 10000 times.
-for i in range(10000):
-    # draw a bootstrap sample
-    bootstrap_resample = eggs.sample(n, replace=True)
-    X_bootstrap = bootstrap_resample[["egg_weight", "egg_length", "egg_breadth"]]
-    Y_bootstrap = bootstrap_resample["bird_weight"]
-    
-    # use bootstrapped sample to fit a model
-    bootstrap_model = LinearRegression()
-    bootstrap_model.fit(X_bootstrap, Y_bootstrap)
-    bootstrap_thetas = bootstrap_model.coef_
-    
-    # record the result for theta_1
-    estimates.append(bootstrap_thetas[0])
-    
-# calculate the 95% confidence interval 
-lower = np.percentile(estimates, 2.5, axis=0)
-upper = np.percentile(estimates, 97.5, axis=0)
-conf_interval = (lower, upper)
-conf_interval
-
-
(-0.258648119568487, 1.103424385420405)
-
-
-

Our bootstrapped 95% confidence interval for \(\theta_1\) is \([-0.259, 1.103]\). Immediately, we can see that 0 is indeed contained in this interval – this means that we cannot conclude that \(\theta_1\) is non-zero! More formally, we fail to reject the null hypothesis (that \(\theta_1\) is 0) under a 5% p-value cutoff.

-

We can repeat this process to construct 95% confidence intervals for the other parameters of the model.

-
-
np.random.seed(1337)
-
-theta_0_estimates = []
-theta_1_estimates = []
-theta_2_estimates = []
-theta_3_estimates = []
-
-
-for i in range(10000):
-    bootstrap_resample = eggs.sample(n, replace=True)
-    X_bootstrap = bootstrap_resample[["egg_weight", "egg_length", "egg_breadth"]]
-    Y_bootstrap = bootstrap_resample["bird_weight"]
-    
-    bootstrap_model = LinearRegression()
-    bootstrap_model.fit(X_bootstrap, Y_bootstrap)
-    bootstrap_theta_0 = bootstrap_model.intercept_
-    bootstrap_theta_1, bootstrap_theta_2, bootstrap_theta_3 = bootstrap_model.coef_
-    
-    theta_0_estimates.append(bootstrap_theta_0)
-    theta_1_estimates.append(bootstrap_theta_1)
-    theta_2_estimates.append(bootstrap_theta_2)
-    theta_3_estimates.append(bootstrap_theta_3)
-    
-theta_0_lower, theta_0_upper = np.percentile(theta_0_estimates, 2.5), np.percentile(theta_0_estimates, 97.5)
-theta_1_lower, theta_1_upper = np.percentile(theta_1_estimates, 2.5), np.percentile(theta_1_estimates, 97.5)
-theta_2_lower, theta_2_upper = np.percentile(theta_2_estimates, 2.5), np.percentile(theta_2_estimates, 97.5)
-theta_3_lower, theta_3_upper = np.percentile(theta_3_estimates, 2.5), np.percentile(theta_3_estimates, 97.5)
-
-# Make a nice table to view results
-pd.DataFrame({"lower":[theta_0_lower, theta_1_lower, theta_2_lower, theta_3_lower], "upper":[theta_0_upper, \
-                theta_1_upper, theta_2_upper, theta_3_upper]}, index=["theta_0", "theta_1", "theta_2", "theta_3"])
-
-
-
             lower     upper
theta_0 -15.278542  5.161473
theta_1  -0.258648  1.103424
theta_2  -0.099138  0.208557
theta_3  -0.257141  0.758155
- -
-
-
-

Something’s off here. Notice that 0 is included in the 95% confidence interval for every parameter of the model. Using the interpretation we outlined above, this would suggest that we can’t say for certain that any of the input variables impact the response variable! This makes it seem like our model can’t make any predictions – and yet, each model we fit in our bootstrap experiment above could very much make predictions of \(Y\).

-

How can we explain this result? Think back to how we first interpreted the parameters of a linear model. We treated each \(\theta_i\) as a slope, where a unit increase in \(x_i\) leads to a \(\theta_i\) increase in \(Y\), if all other variables are held constant. It turns out that this last assumption is very important. If variables in our model are somehow related to one another, then it might not be possible to have a change in one of them while holding the others constant. This means that our interpretation framework is no longer valid! In the models we fit above, we incorporated egg_length, egg_breadth, and egg_weight as input variables. These variables are very likely related to one another – an egg with large egg_length and egg_breadth will likely be heavy in egg_weight. This means that the model parameters cannot be meaningfully interpreted as slopes.

-

To support this conclusion, we can visualize the relationships between our feature variables. Notice the strong positive association between the features.

-
-
import seaborn as sns
-sns.pairplot(eggs[["egg_length", "egg_breadth", "egg_weight", 'bird_weight']]);
-
-
-

-
-
-

This issue is known as collinearity, sometimes also called multicollinearity. Collinearity occurs when one feature can be predicted fairly accurately by a linear combination of the other features, which happens when one feature is highly correlated with the others.

-

Why is collinearity a problem? Its consequences span several aspects of the modeling process:

-
    -
  • Inference: Slopes can’t be interpreted for an inference task.
  • -
  • Model Variance: If features strongly influence one another, even small changes in the sampled data can lead to large changes in the estimated slopes.
  • -
  • Unique Solution: If one feature is a linear combination of the other features, the design matrix will not be full rank, and \(\mathbb{X}^{\top}\mathbb{X}\) is not invertible. This means that least squares does not have a unique solution. See this section of Course Note 12 for more on this.
  • -
-

The take-home point is that we need to be careful with what features we select for modeling. If two features likely encode similar information, it is often a good idea to choose only one of them as an input variable.
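
The rank-deficiency point above can be seen on synthetic data. This is a sketch under assumed, made-up features (`x1`, `x2`, and a third feature built from them), using `np.linalg.matrix_rank` and `np.linalg.cond`: an exact linear combination drops the rank of the design matrix, and a nearly exact one leaves \(\mathbb{X}^{\top}\mathbb{X}\) technically invertible but very badly conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

x3_exact = 2 * x1 - x2                                 # exact linear combination of x1 and x2
x3_near = x3_exact + rng.normal(scale=1e-3, size=n)    # almost, but not exactly, collinear

# Design matrices with an intercept column
X_exact = np.column_stack([np.ones(n), x1, x2, x3_exact])
X_near = np.column_stack([np.ones(n), x1, x2, x3_near])

rank_exact = np.linalg.matrix_rank(X_exact)   # fewer than 4 columns' worth of information
rank_near = np.linalg.matrix_rank(X_near)     # full rank, but only barely
cond_near = np.linalg.cond(X_near.T @ X_near) # large condition number: unstable estimates

print(rank_exact, rank_near, cond_near)
```

A large condition number is one way the “model variance” consequence shows up: small changes in the data can move the fitted coefficients a lot.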

-
-
-

A Simpler Model

-

Let us now consider a more interpretable model: we instead assume a true relationship using only egg weight:

-

\[f_\theta(x) = \theta_0 + \theta_1 \text{egg\_weight} + \epsilon\]

-
-
from sklearn.linear_model import LinearRegression
-X_int = eggs[["egg_weight"]]
-Y_int = eggs["bird_weight"]
-
-model_int = LinearRegression()
-
-model_int.fit(X_int, Y_int)
-
-# This gives an array containing the fitted model parameter estimates
-thetas_int = model_int.coef_
-
-# Put the parameter estimates in a nice table for viewing
-pd.DataFrame({"theta_hat":[model_int.intercept_, thetas_int[0]]}, index=["theta_0", "theta_1"])
-
-
-
         theta_hat
theta_0  -0.058272
theta_1   0.718515
- -
-
-
-
-
import matplotlib.pyplot as plt
-
-# Set a random seed so you generate the same random sample as staff
-# In the "real world", we wouldn't do this
-np.random.seed(1337)
-
-# Set the sample size of each bootstrap sample
-n = len(eggs)
-
-# Create a list to store all the bootstrapped estimates
-estimates_int = []
-
-# Generate a bootstrap resample from `eggs` and find an estimate for theta_1 using this sample. 
-# Repeat 10000 times.
-for i in range(10000):
-    bootstrap_resample_int = eggs.sample(n, replace=True)
-    X_bootstrap_int = bootstrap_resample_int[["egg_weight"]]
-    Y_bootstrap_int = bootstrap_resample_int["bird_weight"]
-    
-    bootstrap_model_int = LinearRegression()
-    bootstrap_model_int.fit(X_bootstrap_int, Y_bootstrap_int)
-    bootstrap_thetas_int = bootstrap_model_int.coef_
-    
-    estimates_int.append(bootstrap_thetas_int[0])
-
-plt.figure(dpi=120)
-sns.histplot(estimates_int, stat="density")
-plt.xlabel(r"$\hat{\theta}_1$")
-plt.title(r"Bootstrapped estimates $\hat{\theta}_1$ Under the Interpretable Model");
-
-
-

-
-
-

Notice how the interpretable model performs almost as well as our other model:

-
-
from sklearn.metrics import mean_squared_error
-
-# mean_squared_error returns the MSE (no square root is taken)
-mse = mean_squared_error(Y, model.predict(X))
-mse_int = mean_squared_error(Y_int, model_int.predict(X_int))
-print(f'MSE of Original Model: {mse}')
-print(f'MSE of Interpretable Model: {mse_int}')
-
-
-
MSE of Original Model: 0.04547085380275766
-MSE of Interpretable Model: 0.046493941375556846
-
-
-

Yet, the confidence interval for the true parameter \(\theta_{1}\) does not contain zero.

-
-
lower_int = np.percentile(estimates_int, 2.5)
-upper_int = np.percentile(estimates_int, 97.5)
-
-conf_interval_int = (lower_int, upper_int)
-conf_interval_int
-
-
-
(0.6029335250209632, 0.8208401738546206)
-
-
-

In retrospect, it’s no surprise that the weight of an egg best predicts the weight of a newly-hatched chick.

-

A model with highly correlated variables prevents us from interpreting how the variables are related to the prediction.

-
-
-

Reminder: Assumptions Matter

-

Keep the following in mind: All inference assumes that the regression model holds.

-
    -
  • If the model doesn’t hold, the inference might not be valid.
  • -
  • If the assumptions of the bootstrap don’t hold… -
      -
    • Sample size n is large
    • -
    • Sample is representative of population distribution (drawn i.i.d., unbiased)
    • -
    -…then the results of the bootstrap might not be valid.
  • -
-
-
-
-

[Bonus Content]

-

Note: the content in this section is not in scope.

- -
-

Prediction vs Causation

-

The difference between correlation/prediction vs. causation is best illustrated through examples.

-

Some questions about correlation / prediction include:

-
    -
  • Are homes with granite countertops worth more money?
  • -
  • Is college GPA higher for students who win a certain scholarship?
  • -
  • Are breastfed babies less likely to develop asthma?
  • -
  • Do cancer patients given some aggressive treatment have a higher 5-year survival rate?
  • -
  • Are people who smoke more likely to get cancer?
  • -
-

While these may sound like causal questions, they are not! Questions about causality are about the effects of interventions (not just passive observation). For example:

-
    -
  • How much do granite countertops raise the value of a house?
  • -
  • Does getting the scholarship improve students’ GPAs?
  • -
  • Does breastfeeding protect babies against asthma?
  • -
  • Does the treatment improve cancer survival?
  • -
  • Does smoking cause cancer?
  • -
-

Note, however, that regression coefficients are sometimes called “effects”, which can be deceptive!

-

When using data alone, predictive questions (e.g., are breastfed babies healthier?) can be answered, but causal questions (e.g., does breastfeeding improve babies’ health?) cannot. The reason is that an observed association can have many possible explanations. For example, possible explanations for why breastfed babies are healthier on average include:

-
    -
  • Causal effect: breastfeeding makes babies healthier
  • -
  • Reverse causality: healthier babies more likely to successfully breastfeed
  • -
  • Common cause: healthier / richer parents have healthier babies and are more likely to breastfeed
  • -
-

We cannot tell which explanations are true (or to what extent) just by observing (\(x\),\(y\)) pairs. Additionally, causal questions implicitly involve counterfactuals, events that didn’t happen. For example, we could ask, would the same breastfed babies have been less healthy if they hadn’t been breastfed? Explanation 1 from above implies they would be, but explanations 2 and 3 do not.

-
-
-

Confounders

-

Let T represent a treatment (for example, alcohol use), and Y represent an outcome (for example, lung cancer).

-

confounder

-

A confounder is a variable that affects both T and Y, distorting the correlation between them. In the example above, smoking is a plausible confounder: people who drink more may also be more likely to smoke, and smoking causes lung cancer. Confounders can be a measured covariate (a feature) or an unmeasured variable we don’t know about, and they generally cause problems, as the relationship between T and Y is really affected by data we cannot see. We commonly assume that all confounders are observed (this is also called ignorability).
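
A small simulation can show how a confounder distorts a regression coefficient. Everything below is synthetic and assumed for illustration: a confounder `z` drives both the treatment `t` and the outcome `y`, while `t` has no effect on `y` at all. Regressing `y` on `t` alone still produces a clearly non-zero slope; including `z` as a feature brings the slope on `t` back toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

z = rng.normal(size=n)              # confounder
t = z + rng.normal(size=n)          # treatment depends on the confounder
y = 2.0 * z + rng.normal(size=n)    # outcome depends on the confounder only, NOT on t

def ols(columns, response):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(response))] + columns)
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef

naive_slope = ols([t], y)[1]         # confounder omitted
adjusted_slope = ols([t, z], y)[1]   # confounder included as a feature

print(naive_slope, adjusted_slope)
```

Adjusting for `z` only works here because we observed it; an unmeasured confounder cannot be added to the design matrix.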

-
-
-

How to perform causal inference?

-

In a randomized experiment, participants are randomly assigned into two groups: treatment and control. A treatment is applied only to the treatment group. Because assignment is random, we can assume ignorability; we gather as many measurements as possible and compare them between the control and treatment groups to determine whether the treatment really causes the difference in outcomes or whether the difference is due to a confounding factor.
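
To illustrate why random assignment helps, the sketch below simulates the same outcome model twice (all quantities, including `true_effect` and the confounder `z`, are made up for this example). When units choose treatment based on `z`, the difference in group means is biased; when treatment is assigned by a coin flip, the difference in means lands close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_effect = 1.5                  # assumed causal effect of treatment on the outcome
z = rng.normal(size=n)             # confounder that also raises the outcome

def diff_in_means(t, y):
    """Average outcome in the treatment group minus the control group."""
    return y[t == 1].mean() - y[t == 0].mean()

# Observational setting: units with larger z are more likely to be treated
t_obs = (z + rng.normal(size=n) > 0).astype(int)
y_obs = true_effect * t_obs + 2.0 * z + rng.normal(size=n)

# Randomized setting: treatment is assigned by a fair coin flip, independent of z
t_rand = rng.integers(0, 2, size=n)
y_rand = true_effect * t_rand + 2.0 * z + rng.normal(size=n)

obs_estimate = diff_in_means(t_obs, y_obs)
rand_estimate = diff_in_means(t_rand, y_rand)
print(obs_estimate, rand_estimate)
```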

-

experiment

-

However, often, randomly assigning treatments is impractical or unethical. For example, assigning a treatment of cigarettes to test the effect of smoking on lungs would not only be impractical but also unethical.

-

An alternative to bypass this issue is to utilize observational studies. This can be done by obtaining two participant groups separated based on some identified treatment variable. Unlike randomized experiments, however, we cannot assume ignorability: the participants could have separated into the two groups based on other covariates! In addition, there could also be unmeasured confounders.

-

observational

- - -
-
- -
- - -
- - - - \ No newline at end of file diff --git a/inference_causality/inference_causality.qmd b/inference_causality/inference_causality.qmd index d4e520fe..bad48b48 100644 --- a/inference_causality/inference_causality.qmd +++ b/inference_causality/inference_causality.qmd @@ -7,7 +7,7 @@ format: code-fold: true code-tools: true toc: true - toc-title: Causal Inference and the Bootstrap + toc-title: Causal Inference and Confounding page-layout: full theme: - cosmo @@ -120,7 +120,117 @@ How well does bootstrapping actually represent our population? The bootstrapped In the real world, we don't know the population distribution. The center of the boostrapped distribution is the estimator applied to our original sample, so we have no way of recovering the estimator's true expected value; the center and spread of our bootstrap are *approximations*. The quality of our bootstrapped distribution also depends on the quality of our original sample; if our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for *large samples*, when the population distribution is *not heavily skewed* (no outliers), and when the estimator is *“low variance”* (insensitive to extreme values). ### Simple Bootstrap Example -TODO +Here we work through a simple example of the bootstrap when estimating the relationship between miles per gallon and the weight of a vehicle. + +Suppose we collected a sample of 20 cars from a population. For the purposes of this demo we will assume that the seaborn dataset is the population. 
The following is a visualization of our sample: + +```{python} +#| code-fold: true +#| vscode: {languageId: python} +import numpy as np +import pandas as pd +import sklearn.linear_model as lm +import seaborn as sns +import matplotlib.pyplot as plt + +np.random.seed(42) +mpg_sample = sns.load_dataset('mpg').sample(20) +sns.regplot(mpg_sample, x='weight', y='mpg',ci=False); +``` + +Fitting a linear model we get an estimate of the slope: + +```{python} +#| code-fold: false +#| vscode: {languageId: python} +model = lm.LinearRegression().fit(mpg_sample[['weight']], mpg_sample['mpg']) +model.coef_[0] +``` + +#### Bootstrap Implementation +Now let's use bootstrapping to estimate the distribution of that coefficient. Here we will construct a bootstrap function that takes an estimator function and uses that function to construct many bootstrap estimates of the slope. + +```{python} +#| code-fold: false +#| vscode: {languageId: python} +def estimator(sample): + model = lm.LinearRegression().fit(sample[['weight']], sample['mpg']) + return model.coef_[0] +``` + +The code below uses [```df.sample```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) to generate a bootstrap sample of the same size as the original sample. + +```{python} +#| code-fold: false +#| vscode: {languageId: python} +def bootstrap(sample, statistic, num_repetitions): + """ + Returns the statistic computed on a num_repetitions + bootstrap samples from sample. + """ + stats = [] + for i in np.arange(num_repetitions): + # Step 1: Sample the Sample + bootstrap_sample = sample.sample(frac=1, replace=True) + # Step 2: compute statistics on the sample of the sample + bootstrap_stat = statistic(bootstrap_sample) + # Accumulate the statistics + stats.append(bootstrap_stat) + return stats +``` + +After constructing MANY bootstrap slope estimates (in this case 10000), we can visualize the bootstrap distribution of the slope estimates. 
+ +```{python} +#| code-fold: true +#| vscode: {languageId: python} +bs_thetas = bootstrap(mpg_sample, estimator, 10000) +fig = plt.subplots(dpi=120) +sns.histplot(bs_thetas) +plt.title('Bootstrap Distribution of the Slope'); +``` + +#### Computing a Bootstrap CI +We can now compute the confidence interval for the slope using the percentiles of the empirical distribution. Here, we are looking for a 95% confidence interval, so we are looking for the values at the 2.5 and 97.5 percentile in the bootstrap samples to be the bounds of our interval. We find the interval to be the range below: + +```{python} +#| code-fold: true +#| vscode: {languageId: python} +def bootstrap_ci(bootstrap_samples, confidence_level=95): + """ + Returns the confidence interval for the bootstrap samples. + """ + lower_percentile = (100 - confidence_level) / 2 + upper_percentile = 100 - lower_percentile + return np.percentile(bootstrap_samples, [lower_percentile, upper_percentile]) +print(bootstrap_ci(bs_thetas)) +``` + +#### Comparing to the Population CIs +In practice you don't have access to the population, but in this specific example we had taken a sample from a larger dataset that we can pretend is the population. Let's compare to resampling from the larger dataset. 
Here is the 95% confidence interval for the slope when sampling 10000 times from the whole data: + +```{python} +#| code-fold: true +#| vscode: {languageId: python} +mpg_pop = sns.load_dataset('mpg') +theta_est = [estimator(mpg_pop.sample(20)) for i in range(10000)] +print(bootstrap_ci(theta_est)) +``` + +Visualizing the two distributions: + +```{python} +#| code-fold: true +#| vscode: {languageId: python} +fig = plt.subplots(dpi=120,figsize=(6,4)) +sns.histplot(bs_thetas, label='Bootstrap Thetas', alpha=0.7) +sns.histplot(theta_est, label='Population Sampled Thetas', alpha=0.7) +plt.legend() +plt.xlabel('value') +plt.title('Distribution of the Slope'); +``` + +Comparing the two distributions, we see that our bootstrapped sample distribution does not exactly match the sampling distribution of the population, but it is relatively close. This demonstrates the benefit of bootstrapping, as without knowing the actual population distribution, we are able to roughly approximate the true slope for the model by using only a single random sample of 20 cars. +--> ## Collinearity @@ -212,6 +321,8 @@ To show an example of this hypothesis testing process, we'll work with the [snow Note that `Egg Length` and `Egg Breadth` (widest diameter) are measured in millimeters, and `Egg Weight` and `Bird Weight` are measured in grams; for comparison, a standard paper clip weighs about one gram. 
```{python} +#| code-fold: true +#| vscode: {languageId: python} import pandas as pd eggs = pd.read_csv("data/snowy_plover.csv") eggs.head(5) @@ -235,6 +346,7 @@ Next, we use our data to fit a model $\hat{Y} = f_{\hat{\theta}}(x)$ that approx ```{python} #| code-fold: false +#| vscode: {languageId: python} from sklearn.linear_model import LinearRegression import numpy as np @@ -266,6 +378,7 @@ We draw a bootstrap sample, use this sample to fit a model, and record the resul ```{python} #| code-fold: false +#| vscode: {languageId: python} # Set a random seed so you generate the same random sample as staff # In the "real world", we wouldn't do this import numpy as np @@ -305,6 +418,7 @@ Our bootstrapped 95% confidence interval for $\theta_1$ is $[-0.259, 1.103]$. Im We can repeat this process to construct 95% confidence intervals for the other parameters of the model. ```{python} +#| vscode: {languageId: python} np.random.seed(1337) theta_0_estimates = [] @@ -345,6 +459,7 @@ How can we explain this result? Think back to how we first interpreted the param To support this conclusion, we can visualize the relationships between our feature variables. Notice the strong positive association between the features. 
```{python} +#| vscode: {languageId: python} import seaborn as sns sns.pairplot(eggs[["egg_length", "egg_breadth", "egg_weight", 'bird_weight']]); ``` @@ -366,6 +481,7 @@ Let us now consider a more interpretable model: we instead assume a true relatio $$f_\theta(x) = \theta_0 + \theta_1 \text{egg\_weight} + \epsilon$$ ```{python} +#| vscode: {languageId: python} from sklearn.linear_model import LinearRegression X_int = eggs[["egg_weight"]] Y_int = eggs["bird_weight"] @@ -383,6 +499,7 @@ pd.DataFrame({"theta_hat":[model_int.intercept_, thetas_int[0]]}, index=["theta_ ```{python} #| code-fold: true +#| vscode: {languageId: python} import matplotlib.pyplot as plt # Set a random seed so you generate the same random sample as staff @@ -417,6 +534,7 @@ plt.title(r"Bootstrapped estimates $\hat{\theta}_1$ Under the Interpretable Mode Notice how the interpretable model performs almost as well as our other model: ```{python} +#| vscode: {languageId: python} from sklearn.metrics import mean_squared_error rmse = mean_squared_error(Y, model.predict(X)) @@ -428,6 +546,7 @@ print(f'RMSE of Interpretable Model: {rmse_int}') Yet, the confidence interval for the true parameter $\theta_{1}$ does not contain zero. ```{python} +#| vscode: {languageId: python} lower_int = np.percentile(estimates_int, 2.5) upper_int = np.percentile(estimates_int, 97.5) @@ -514,4 +633,3 @@ An alternative to bypass this issue is to utilize **observational studies**. Thi observational -