diff --git a/inference_causality/images/confidence_interval.png b/inference_causality/images/confidence_interval.png new file mode 100644 index 00000000..45b34dee Binary files /dev/null and b/inference_causality/images/confidence_interval.png differ diff --git a/inference_causality/inference_causality.qmd b/inference_causality/inference_causality.qmd index efbb2855..e56ead26 100644 --- a/inference_causality/inference_causality.qmd +++ b/inference_causality/inference_causality.qmd @@ -21,7 +21,7 @@ jupyter: format_version: '1.0' jupytext_version: 1.16.1 kernelspec: - display_name: Python 3 (ipykernel) + display_name: ds100env language: python name: python3 --- @@ -68,22 +68,32 @@ In this lecture, we will explore regression inference via hypothesis testing, un ## Parameter Inference: Interpreting Regression Coefficients There are two main reasons why we build models: -1. **Prediction**: using our model to make accurate predictions about unseen data -2. **Inference**: using our model to draw conclusions about the underlying relationship(s) between our features and response. We want to understand the complex phenomena occurring in the world we live in. While training is the process of fitting a model, inference is the *process of making predictions*. +1. **Prediction**: using our model to make **accurate predictions** about unseen data +2. **Inference**: using our model to draw conclusions about the underlying relationship(s) between our features and response. We want to **understand the complex phenomena** occurring in the world we live in. While training is the process of fitting a model, inference is the *process of making predictions*. Recall the framework we established in the last lecture. The relationship between datapoints is given by $Y = g(x) + \epsilon$, where $g(x)$ is the *true underlying relationship*, and $\epsilon$ represents randomness. 
If we assume $g(x)$ is linear, we can express this relationship in terms of the unknown, true model parameters $\theta$. $$f_{\theta}(x) = g(x) + \epsilon = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon$$ -Our model attempts to estimate each true population parameter $\theta_i$ using the sample estimates $\hat{\theta}_i$ calculated from the design matrix $\Bbb{X}$ and response vector $\Bbb{Y}$. +Our model attempts to estimate each **true** and **unobserved population parameter** $\theta_i$ using the sample estimates $\hat{\theta}_i$ calculated from the design matrix $\Bbb{X}$ and response vector $\Bbb{Y}$. $$f_{\hat{\theta}}(x) = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \ldots + \hat{\theta}_p x_p$$ Let's pause for a moment. At this point, we're very used to working with the idea of a model parameter. But what exactly does each coefficient $\theta_i$ actually *mean*? We can think of each $\theta_i$ as a *slope* of the linear model. If all other variables are held constant, a unit change in $x_i$ will result in a $\theta_i$ change in $f_{\theta}(x)$. Broadly speaking, a large value of $\theta_i$ means that the feature $x_i$ has a large effect on the response; conversely, a small value of $\theta_i$ means that $x_i$ has little effect on the response. In the extreme case, if the true parameter $\theta_i$ is 0, then the feature $x_i$ has **no effect** on $Y(x)$. -If the true parameter $\theta_i$ for a particular feature is 0, this tells us something pretty significant about the world: there is no underlying relationship between $x_i$ and $Y(x)$! But how can we test if a parameter is actually 0? As a baseline, we go through our usual process of drawing a sample, using this data to fit a model, and computing an estimate $\hat{\theta}_i$. However, we also need to consider that if our random sample comes out differently, we may find a different result for $\hat{\theta}_i$. 
To infer if the true parameter $\theta_i$ is 0, we want to draw our conclusion from the distribution of $\hat{\theta}_i$ estimates we could have drawn across all other random samples. This is where [hypothesis testing](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html) comes in handy! +If the true parameter $\theta_i$ for a particular feature is 0, this tells us something pretty significant about the world: there is no underlying relationship between $x_i$ and $Y(x)$! But how can we test if a parameter is actually 0? As a baseline, we go through our usual process of drawing a sample, using this data to fit a model, and computing an estimate $\hat{\theta}_i$. However, we also need to consider that if our random sample comes out differently, we may find a different result for $\hat{\theta}_i$. To infer if the **true parameter** $\theta_i$ is 0, we want to draw our conclusion from the distribution of $\hat{\theta}_i$ estimates we could have drawn across all other random samples. This is where [hypothesis testing](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html) comes in handy! -To test if the true parameter $\theta_i$ is 0, we construct a **hypothesis test** where our null hypothesis states that the true parameter $\theta_i$ is 0, and the alternative hypothesis states that the true parameter $\theta_i$ is *not* 0. If our p-value is smaller than our cutoff value (usually p = 0.05), we reject the null hypothesis in favor of the alternative hypothesis. +To test if the true parameter $\theta_i$ is 0, we construct a **hypothesis test** where our **null hypothesis** states that the true parameter $\theta_i$ is 0, and the **alternative hypothesis** states that the true parameter $\theta_i$ is *not* 0. We can now use **confidence intervals to test the hypothesis**: + +* Compute an approximate 95% confidence interval +* If the interval does not contain 0, reject the null hypothesis at the 5% level. 
+* Otherwise, the data are consistent with the null hypothesis (the true parameter *could* be 0). + +
+ +
+ +For example, the 95% confidence interval shown above contains 0, so we cannot reject the null hypothesis. As a result, the true value of the population parameter $\theta$ could be 0. ## Review: Bootstrap Resampling @@ -93,12 +103,12 @@ To determine the properties (e.g., variance) of the sampling distribution of an -However, this can be quite expensive and time-consuming. Even more importantly, we don’t have access to the population —— we only have *one* random sample from the population. How can we consider all possible samples if we only have one? +However, this can be quite **expensive** and **time-consuming**. Even more importantly, we don’t have access to the population —— we only have ***one* random sample from the population**. How can we consider all possible samples if we only have one? -Bootstrapping comes in handy here! With bootstrapping, we treat our random sample as a "population" and resample from it *with replacement*. Intuitively, a random sample resembles the population (if it is big enough), so a random *resample* also resembles a random sample of the population. When sampling, there are a couple things to keep in mind: +Bootstrapping comes in handy here! With bootstrapping, we treat our random sample as a "population" and resample from it *with replacement*. Intuitively, a random sample is **representative of the population** (if it is big enough), so **sampling from our sample** approximates **sampling from the population**. When sampling, there are a couple things to keep in mind: -* We need to sample the same way we constructed the original sample. Typically, this involves taking a simple random sample with replacement. -* New samples must be the same size as the original sample. We need to accurately model the variability of our estimates. +* We need to sample the same way we constructed the original sample. Typically, this involves taking a **simple random sample with replacement**. 
+* New samples **must be the same size** as the original sample. We need to accurately model the variability of our estimates. ::: {.callout-warning collapse=\"true\"} ### Why must we resample *with replacement*? @@ -123,21 +133,28 @@ repeat 10,000 times: list of estimates is the bootstrapped sampling distribution of f ``` +From here, we can construct a 95% confidence interval by taking the 2.5% and (100 - 2.5)% percentiles of our bootstrapped thetas. In numpy, this could look like the following: +``` +tail = (100 - 95)/2 +ci = np.percentile(bs_thetas, [tail, 100-tail]) +``` + + How well does bootstrapping actually represent our population? The bootstrapped sampling distribution of an estimator does not exactly match the sampling distribution of that estimator, but it is often close. Similarly, the variance of the bootstrapped distribution is often close to the true variance of the estimator. The example below displays the results of different bootstraps from a *known* population using a sample size of $n=50$.-In the real world, we don't know the population distribution. The center of the bootstrapped distribution is the estimator applied to our original sample, so we have no way of understanding the estimator's true expected value; the center and spread of our bootstrap are *approximations*. The quality of our bootstrapped distribution also depends on the quality of our original sample. If our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for *large samples*, when the population distribution is *not heavily skewed* (no outliers), and when the estimator is *“low variance”* (insensitive to extreme values). +In the real world, we don't know the population distribution. 
The center of the bootstrapped distribution is the estimator applied to our original sample, so we have no way of understanding the estimator's true expected value; the **center and spread of our bootstrap are *approximations***. The bootstrap **does not improve our estimate**. The quality of our bootstrapped distribution also depends on the quality of our original sample. If our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for *large samples*, when the population distribution is *not heavily skewed* (no outliers), and when the estimator is *“low variance”* (insensitive to extreme values). - +Although our bootstrapped sample distribution does not exactly match the sampling distribution of the population, we can see that it is relatively close. This demonstrates the benefit of bootstrapping — without knowing the actual population distribution, we can still roughly approximate the true slope for the model by using only a single random sample of 20 cars. -Although our bootstrapped sample distribution does not exactly match the sampling distribution of the population, we can see that it is relatively close. This demonstrates the benefit of bootstrapping —— without knowing the actual population distribution, we can still roughly approximate the true slope for the model by using only a single random sample of 20 cars.
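The bootstrap procedure and the confidence-interval test described in this change can be put together in a short sketch. It is illustrative only, not the lecture's own implementation: the function name `bootstrap_slope_ci`, the synthetic data, and the use of `np.polyfit` for the fit are all assumptions made for this example. It resamples the rows with replacement at the original sample size, refits the slope each time, takes the 2.5th and 97.5th percentiles of the bootstrapped slopes, and rejects the null hypothesis that the slope is 0 if the interval excludes 0.

```python
import numpy as np

def bootstrap_slope_ci(x, y, n_boot=10_000, level=95, seed=0):
    """Percentile bootstrap confidence interval for a simple-regression slope.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    bs_thetas = np.empty(n_boot)
    for b in range(n_boot):
        # Resample row indices with replacement, keeping the original sample size.
        idx = rng.integers(0, n, size=n)
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, _intercept = np.polyfit(x[idx], y[idx], deg=1)
        bs_thetas[b] = slope

    # Same percentile construction as the numpy snippet in the text.
    tail = (100 - level) / 2
    lower, upper = np.percentile(bs_thetas, [tail, 100 - tail])
    return lower, upper

# Synthetic data (an assumption for this example): true slope 2, unit-variance noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)

lower, upper = bootstrap_slope_ci(x, y, n_boot=2000)

# Reject the null hypothesis (slope = 0) at the 5% level if 0 is outside the interval.
reject_null = not (lower <= 0 <= upper)
print(f"95% CI for slope: [{lower:.3f}, {upper:.3f}]; reject null: {reject_null}")
```

Resampling row indices (rather than `x` and `y` separately) keeps each feature paired with its response, which is what "sampling the same way we constructed the original sample" requires here.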