edit bootstrapping
nsreddy16 committed Nov 1, 2024
1 parent 6dc27be commit 6a34f32
Showing 2 changed files with 58 additions and 31 deletions.
89 changes: 58 additions & 31 deletions inference_causality/inference_causality.qmd
@@ -21,7 +21,7 @@ jupyter:
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: ds100env
language: python
name: python3
---
@@ -68,22 +68,32 @@ In this lecture, we will explore regression inference via hypothesis testing, un
## Parameter Inference: Interpreting Regression Coefficients
There are two main reasons why we build models:

1. **Prediction**: using our model to make **accurate predictions** about unseen data
2. **Inference**: using our model to draw conclusions about the underlying relationship(s) between our features and response. We want to **understand the complex phenomena** occurring in the world we live in. While training is the process of fitting a model, inference is the *process of making predictions*.

Recall the framework we established in the last lecture. The relationship between datapoints is given by $Y = g(x) + \epsilon$, where $g(x)$ is the *true underlying relationship*, and $\epsilon$ represents randomness. If we assume $g(x)$ is linear, we can express this relationship in terms of the unknown, true model parameters $\theta$.

$$Y = g(x) + \epsilon = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon$$

Our model attempts to estimate each **true** and **unobserved population parameter** $\theta_i$ using the sample estimates $\hat{\theta}_i$ calculated from the design matrix $\Bbb{X}$ and response vector $\Bbb{Y}$.

$$f_{\hat{\theta}}(x) = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \ldots + \hat{\theta}_p x_p$$

Let's pause for a moment. At this point, we're very used to working with the idea of a model parameter. But what exactly does each coefficient $\theta_i$ actually *mean*? We can think of each $\theta_i$ as a *slope* of the linear model. If all other variables are held constant, a unit change in $x_i$ will result in a $\theta_i$ change in $f_{\theta}(x)$. Broadly speaking, a large value of $\theta_i$ means that the feature $x_i$ has a large effect on the response; conversely, a small value of $\theta_i$ means that $x_i$ has little effect on the response. In the extreme case, if the true parameter $\theta_i$ is 0, then the feature $x_i$ has **no effect** on $Y(x)$.
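
To make this concrete, suppose we simulate data from a known linear relationship with slope 3 (a small synthetic example, not the dataset used later in this note). Fitting a linear model, the estimated slope comes out close to 3, and a one-unit change in $x$ changes the prediction by roughly $\hat{\theta}_1$.

```{python}
#| code-fold: true
# Simulated illustration: data generated from a known linear relationship
import numpy as np
import sklearn.linear_model as lm

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 10, size=(n, 1))
true_theta_0, true_theta_1 = 2.0, 3.0
y = true_theta_0 + true_theta_1 * x[:, 0] + rng.normal(0, 1, size=n)  # Y = g(x) + epsilon

toy_model = lm.LinearRegression().fit(x, y)
print("Estimated slope:", toy_model.coef_[0])  # close to the true slope of 3
# A one-unit change in x changes the prediction by (approximately) theta_1_hat
print("Prediction change per unit of x:",
      toy_model.predict([[5.0]])[0] - toy_model.predict([[4.0]])[0])
```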

If the true parameter $\theta_i$ for a particular feature is 0, this tells us something pretty significant about the world: there is no underlying relationship between $x_i$ and $Y(x)$! But how can we test if a parameter is actually 0? As a baseline, we go through our usual process of drawing a sample, using this data to fit a model, and computing an estimate $\hat{\theta}_i$. However, we also need to consider that if our random sample comes out differently, we may find a different result for $\hat{\theta}_i$. To infer if the **true parameter** $\theta_i$ is 0, we want to draw our conclusion from the distribution of $\hat{\theta}_i$ estimates we could have drawn across all other random samples. This is where [hypothesis testing](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html) comes in handy!

To test if the true parameter $\theta_i$ is 0, we construct a **hypothesis test** where our **null hypothesis** states that the true parameter $\theta_i$ is 0, and the **alternative hypothesis** states that the true parameter $\theta_i$ is *not* 0. We can now use **confidence intervals to test the hypothesis**:

* Compute an approximate 95% confidence interval for the true parameter $\theta_i$.
* If the interval does not contain 0, reject the null hypothesis at the 5% level.
* Otherwise, the data are consistent with the null hypothesis (the true parameter *could* be 0).

<p align="center">
<img src="images/confidence_interval.png" alt='confidence_interval' width='650'>
</p>

For example, the 95% confidence interval shown above contains 0, so we cannot reject the null hypothesis. As a result, the true value of the population parameter $\theta$ could be 0.
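
In code, this decision rule is just a check of whether 0 lies inside the interval. The endpoints below are hypothetical placeholders; in practice, they would come from the bootstrap procedure reviewed in the next section.

```{python}
#| code-fold: true
# Hypothetical 95% confidence interval for a single parameter theta_i
lower, upper = -0.21, 0.84

if lower <= 0 <= upper:
    print("0 is in the interval: fail to reject the null (theta_i could be 0)")
else:
    print("0 is not in the interval: reject the null at the 5% level")
```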

## Review: Bootstrap Resampling

@@ -93,12 +103,12 @@ To determine the properties (e.g., variance) of the sampling distribution of an
<img src="images/population_samples.png" alt='y_hat' width='650'>
</p>

However, this can be quite **expensive** and **time-consuming**. Even more importantly, we don't have access to the population: we only have ***one* random sample from the population**. How can we consider all possible samples if we only have one?

Bootstrapping comes in handy here! With bootstrapping, we treat our random sample as a "population" and resample from it *with replacement*. Intuitively, a random sample is **representative of the population** (if it is big enough), so **sampling from our sample** approximates **sampling from the population**. When sampling, there are a couple things to keep in mind:

* We need to sample the same way we constructed the original sample. Typically, this involves taking a **simple random sample with replacement**.
* New samples **must be the same size** as the original sample. We need to accurately model the variability of our estimates. (Both rules are shown in the short sketch after this list.)
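
The sketch below draws one such resample with `pandas`: `frac=1` keeps the resample the same size as the original sample, and `replace=True` samples with replacement. The tiny DataFrame here is purely illustrative.

```{python}
#| code-fold: true
import pandas as pd

# A tiny illustrative "sample"; any DataFrame works the same way
sample_df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

# Bootstrap resample: same size as the original, drawn with replacement
boot_df = sample_df.sample(frac=1, replace=True)
print(len(boot_df) == len(sample_df))  # True: the sizes match
boot_df
```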

::: {.callout-warning collapse="true"}
### Why must we resample *with replacement*?
@@ -123,21 +133,28 @@ repeat 10,000 times:
list of estimates is the bootstrapped sampling distribution of f
```

From here, we can construct a 95% confidence interval by taking the 2.5th and 97.5th percentiles of our bootstrapped thetas. In NumPy, this could look like the following:
```
tail = (100 - 95)/2
ci = np.percentile(bs_thetas, [tail, 100-tail])
```


How well does bootstrapping actually represent our population? The bootstrapped sampling distribution of an estimator does not exactly match the sampling distribution of that estimator, but it is often close. Similarly, the variance of the bootstrapped distribution is often close to the true variance of the estimator. The example below displays the results of different bootstraps from a *known* population using a sample size of $n=50$.

<p align="center">
<img src="images/bootstrapped_samples.png" alt='y_hat' width='600'>
</p>

In the real world, we don't know the population distribution. The center of the bootstrapped distribution is the estimator applied to our original sample, so we have no way of understanding the estimator's true expected value; the **center and spread of our bootstrap are *approximations***. The bootstrap **does not improve our estimate**. The quality of our bootstrapped distribution also depends on the quality of our original sample. If our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for *large samples*, when the population distribution is *not heavily skewed* (no outliers), and when the estimator is *“low variance”* (insensitive to extreme values).
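
For intuition on that last point, here is a small simulated illustration (separate from the examples below): bootstrapping the *mean* of a sample behaves much better than bootstrapping the *maximum*, which is determined entirely by a single extreme value.

```{python}
#| code-fold: true
import numpy as np

rng = np.random.default_rng(100)
sample = rng.exponential(scale=1.0, size=50)  # one sample from a skewed population

boot_means, boot_maxes = [], []
for _ in range(5000):
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())
    boot_maxes.append(resample.max())

# The bootstrapped means take many distinct values and vary smoothly...
print("Distinct bootstrapped means: ", len(np.unique(np.round(boot_means, 4))))
# ...but a bootstrapped maximum can only ever equal one of the 50 observed values,
# so its distribution is lumpy and never exceeds the observed sample maximum.
print("Distinct bootstrapped maxima:", len(np.unique(boot_maxes)))
```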

### Simple Bootstrap Example
To get a better idea of how bootstrapping works in practice, let's walk through a simple example of bootstrapping to estimate the relationship between miles per gallon and the weight of a vehicle.

Suppose we collected a sample of 100 cars from a population. For the purposes of this demo, we will assume that the seaborn dataset represents the entire population. The following is a visualization of our sample:

```{python}
#| code-fold: true
import numpy as np
import pandas as pd
@@ -146,25 +163,35 @@ import sklearn.linear_model as lm
import seaborn as sns
import plotly.express as px
np.random.seed(42)
sample_size = 100
mpg = sns.load_dataset('mpg')
print("Full Data Size:", len(mpg))
mpg_sample = mpg.sample(sample_size)
print("Sample Size:", len(mpg_sample))
px.scatter(mpg_sample, x='weight', y='mpg', trendline='ols', width=800)
```

Fitting a linear model, we get an estimate of the slope:

```{python}
#| code-fold: false
model = lm.LinearRegression().fit(mpg_sample[['weight']], mpg_sample['mpg'])
model.coef_[0]
```

#### Bootstrap Implementation
We can use bootstrapping to estimate the distribution of that coefficient. Here we define a bootstrap function that takes an estimator function and uses it to compute many bootstrap estimates of the slope.

```{python}
#| code-fold: false
def estimator(sample):
model = lm.LinearRegression().fit(sample[['weight']], sample['mpg'])
return model.coef_[0]
```

The code below uses `df.sample` [(documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) to generate a bootstrap sample that is the same size as the original sample.

```{python}
#| code-fold: false
def bootstrap(sample, statistic, num_repetitions):
"""
@@ -179,21 +206,25 @@ def bootstrap(sample, statistic, num_repetitions):
bootstrap_stat = statistic(bootstrap_sample)
# Accumulate the statistics
stats.append(bootstrap_stat)
return stats
```

After constructing many bootstrap slope estimates (in this case, 10,000), we can visualize the distribution of these estimates.

```{python}
#| code-fold: true
# Construct 10,000 bootstrap slope estimates
bs_thetas = bootstrap(mpg_sample, estimator, 10000)
# Visualize the distribution of these estimates
px.histogram(bs_thetas, title='Bootstrap Distribution of the Slope',
width=800)
```

#### Computing a Bootstrap CI
We can now compute the confidence interval for the slopes using the percentiles of the empirical distribution. Here, we are looking for a 95% confidence interval, so we want values at the 2.5 and 97.5 percentiles of the bootstrap samples to be the bounds of our interval. To find the interval, we can use the function defined below.

```{python}
#| code-fold: true
def bootstrap_ci(bootstrap_samples, confidence_level=95):
"""
@@ -203,25 +234,29 @@ def bootstrap_ci(bootstrap_samples, confidence_level=95):
upper_percentile = 100 - lower_percentile
return np.percentile(bootstrap_samples, [lower_percentile, upper_percentile])
print(bootstrap_ci(bs_thetas))
```

#### Comparing to the Population CIs
In practice, you don't have access to the population. In this example, we took a sample from a larger dataset that we can treat as the population. Let's compare our results to what they would be if we had resampled from the larger dataset. Here is the 95% confidence interval for the slope when sampling 10,000 times from the entire dataset:

```{python}
#| code-fold: true
mpg_pop = sns.load_dataset('mpg')
theta_est = [estimator(mpg_pop.sample(sample_size)) for i in range(10000)]  # samples of the same size as our original sample
print(bootstrap_ci(theta_est))
```

Visualizing the two distributions:

```{python}
#| code-fold: true
thetas = pd.DataFrame({"bs_thetas": bs_thetas, "thetas": theta_est})
px.histogram(thetas.melt(), x='value', facet_row='variable',
title='Distribution of the Slope', width=800)
```

Although our bootstrapped sample distribution does not exactly match the sampling distribution of the population, we can see that it is relatively close. This demonstrates the benefit of bootstrapping — without knowing the actual population distribution, we can still roughly approximate the true slope for the model by using only a single random sample of 100 cars.


<!-- #### PurpleAir (chose to skip this section because it's too complex for the amount of pedagogical value it adds)
To show an example of this hypothesis testing process, we'll work with air quality measurement data. There are 2 common sources of air quality information: Air Quality System (AQS) and [PurpleAir sensors](https://www2.purpleair.com/). AQS is seen as the gold standard because it is high quality, well-calibrated, and publicly available. However, it is very expensive, and the sensors are far apart; reports are also delayed due to extensive calibration.
@@ -301,7 +336,7 @@ pd.DataFrame(PA).head()

### Hypothesis Testing Through Bootstrap: Snowy Plover Demo

We looked at a simple example of bootstrapping earlier, but now let's use bootstrapping to perform hypothesis testing. Recall that we use bootstrapping to compute approximate 95% confidence intervals for each $\theta_i$. If the interval doesn't contain 0, we reject the null hypothesis at the 5% level. Otherwise, the data are consistent with the null hypothesis, as the true parameter *could possibly* be 0.

To show an example of this hypothesis testing process, we'll work with the [snowy plover](https://www.audubon.org/field-guide/bird/snowy-plover) dataset throughout this section. The data are about the eggs and newly hatched chicks of the Snowy Plover. The data were collected at the Point Reyes National Seashore by a former [student at Berkeley](https://openlibrary.org/books/OL2038693M/BLSS_the_Berkeley_interactive_statistical_system). Here's a [parent bird and some eggs](http://cescos.fau.edu/jay/eps/articles/snowyplover.html).

@@ -313,7 +348,6 @@ Note that `Egg Length` and `Egg Breadth` (widest diameter) are measured in milli

```{python}
#| code-fold: true
import pandas as pd
eggs = pd.read_csv("data/snowy_plover.csv")
eggs.head(5)
@@ -336,7 +370,6 @@ Next, we use our data to fit a model $\hat{Y} = f_{\hat{\theta}}(x)$ that approx

```{python}
#| code-fold: false
from sklearn.linear_model import LinearRegression
import numpy as np
@@ -368,7 +401,6 @@ We draw a bootstrap sample, use this sample to fit a model, and record the resul

```{python}
#| code-fold: false
# Set a random seed so you generate the same random sample as staff
# In the "real world", we wouldn't do this
import numpy as np
@@ -403,12 +435,11 @@ conf_interval = (lower, upper)
conf_interval
```

Our bootstrapped 95% confidence interval for $\theta_1$ is $[-0.259, 1.103]$. Immediately, we can see that 0 *is* indeed contained in this interval – this means that we *cannot* conclude that $\theta_1$ is non-zero! More formally, we fail to reject the null hypothesis (that $\theta_1$ is 0) at a 5% cutoff.

We can repeat this process to construct 95% confidence intervals for the other parameters of the model.

```{python}
np.random.seed(1337)
theta_0_estimates = []
@@ -449,7 +480,6 @@ How can we explain this result? Think back to how we first interpreted the param
To support this conclusion, we can visualize the relationships between our feature variables. Notice the strong positive association between the features.

```{python}
import seaborn as sns
sns.pairplot(eggs[["egg_length", "egg_breadth", "egg_weight", 'bird_weight']]);
```
@@ -460,7 +490,7 @@ Why is collinearity a problem? Its consequences span several aspects of the mode

* **Inference**: Slopes can't be interpreted for an inference task.
* **Model Variance**: If features strongly influence one another, even small changes in the sampled data can lead to large changes in the estimated slopes.
* **Unique Solution**: If one feature is a linear combination of the other features, the design matrix will not be full rank, and $\mathbb{X}^{\top}\mathbb{X}$ is not invertible. This means that least squares does not have a unique solution. See [this section](https://ds100.org/course-notes/ols/ols.html#uniqueness-of-the-ols-solution) of Course Note 12 for more on this.

The take-home point is that we need to be careful with what features we select for modeling. If two features likely encode similar information, it is often a good idea to choose only one of them as an input variable.
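
One quick diagnostic is to look at the pairwise correlations of the candidate features; in the extreme case of an exact linear dependence, the design matrix also loses rank.

```{python}
#| code-fold: true
import numpy as np

# Pairwise correlations between the candidate features; values near 1 suggest
# that two features encode nearly the same information
print(eggs[["egg_length", "egg_breadth", "egg_weight"]].corr())

# If one feature were an exact linear combination of the others, this rank
# would drop below the number of columns and X^T X would not be invertible
X_features = eggs[["egg_length", "egg_breadth", "egg_weight"]].to_numpy()
print("Rank of feature matrix:", np.linalg.matrix_rank(X_features))
```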

@@ -471,7 +501,6 @@ Let us now consider a more interpretable model: we instead assume a true relatio
$$f_\theta(x) = \theta_0 + \theta_1 \text{egg\_weight} + \epsilon$$

```{python}
from sklearn.linear_model import LinearRegression
X_int = eggs[["egg_weight"]]
Y_int = eggs["bird_weight"]
@@ -489,7 +518,6 @@ pd.DataFrame({"theta_hat":[model_int.intercept_, thetas_int[0]]}, index=["theta_

```{python}
#| code-fold: true
import matplotlib.pyplot as plt
# Set a random seed so you generate the same random sample as staff
@@ -524,7 +552,6 @@ plt.title(r"Bootstrapped estimates $\hat{\theta}_1$ Under the Interpretable Mode
Notice how the interpretable model performs almost as well as our other model:

```{python}
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(Y, model.predict(X)))  # square root of the MSE gives the RMSE
@@ -536,7 +563,6 @@ print(f'RMSE of Interpretable Model: {rmse_int}')
Yet, the confidence interval for the true parameter $\theta_{1}$ does not contain zero.

```{python}
lower_int = np.percentile(estimates_int, 2.5)
upper_int = np.percentile(estimates_int, 97.5)
@@ -622,3 +648,4 @@ However, often, randomly assigning treatments is impractical or unethical. For e
An alternative that bypasses this issue is to use **observational studies**. This can be done by obtaining two participant groups separated based on some identified treatment variable. Unlike randomized experiments, however, we cannot assume ignorability here: the participants could have separated into two groups based on other covariates! In addition, there could also be unmeasured confounders.

<img src="images/observational.png" alt='observational' width='600'>
