Note 19 updates v3
nsreddy16 committed Mar 21, 2024
1 parent c4350d0 commit 693eefc
Showing 1 changed file with 7 additions and 244 deletions.
251 changes: 7 additions & 244 deletions inference_causality/inference_causality.qmd
@@ -43,10 +43,10 @@ $$\begin{align}\text{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]\end{align}$$

::: {.callout-note collapse="false"}
## Learning Outcomes
* Construct confidence intervals for hypothesis testing
* Understand the assumptions we make and its impact on our regression inference
* Construct confidence intervals for hypothesis testing using bootstrapping
* Understand the assumptions we make and their impact on our regression inference
* Explore ways to overcome issues of multicollinearity
* Compare regression correlation and causation
* Experiment setup, confounding variables, average treatment effect, and covariate adjustment
:::

Last time, we introduced the idea of random variables and their effect on the observed relationships we use to fit models. We also demonstrated the decomposition of model risk from a fitted model.
@@ -482,88 +482,15 @@ Let $T$ represent a treatment (for example, alcohol use) and $Y$ represent an outcome (for example, lung cancer).

A **confounder** is a variable that affects both $T$ and $Y$, distorting the correlation between them. In the example above, a person's smoking habits could be such a confounder, since smoking is associated with both alcohol use and lung cancer. Confounders can be a measured covariate (a feature) or an unmeasured variable we don’t know about, and they generally cause problems: the relationship between $T$ and $Y$ is affected by data we may not be able to see.

**Common assumption:** all confounders are observed (**ignorability**)
**Common assumption:** all confounders are observed (also called **ignorability**)

### Terminology
### How to perform causal inference?

Let us define some terms that will help us understand causal effects.

In prediction, we had two kinds of variables:

- **Response** ($Y$): what we are trying to predict
- **Predictors** ($X$): inputs to our prediction

Other variables in causal inference include:

- **Response** ($Y$): the outcome of interest
- **Treatment** ($T$): the variable we might intervene on
- **Covariate** ($X$): other variables we measured that may affect $T$ and/or $Y$

For this lecture, $T$ is a **binary (0/1)** variable: $T_{i} = 1$ means individual $i$ received the treatment, and $T_{i} = 0$ means they did not.

### Neyman-Rubin Causal Model

Causal questions are about **counterfactuals**:

- What would have happened if T were different?
- What will happen if we set T differently in the future?

We assume every individual has two **potential outcomes**:

- $Y_{i}(1)$: value of $y_{i}$ if $T_{i} = 1$ (**treated outcome**)
- $Y_{i}(0)$: value of $y_{i}$ if $T_{i} = 0$ (**control outcome**)

For each individual in the data set, we observe:

- Covariates $x_{i}$
- Treatment $T_{i}$
- Response $y_{i} = Y_{i}(T_{i})$

We will assume that the tuples $(x_{i}, T_{i}, y_{i} = Y_{i}(T_{i}))$ are i.i.d. for $i = 1, ..., n$.
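
To make this setup concrete, here is a small, purely hypothetical sketch of what such a data set might look like in code. All numbers, column names, and the use of `pandas` are illustrative, not part of the note:

```python
import numpy as np
import pandas as pd

# Hypothetical observed data: one row per individual i.
# We see covariates x_i, the assigned treatment T_i, and y_i = Y_i(T_i).
df = pd.DataFrame({
    "x": [23, 35, 41, 29],      # a covariate (e.g., age)
    "T": [1, 0, 1, 0],          # binary treatment indicator
    "y": [7.1, 4.2, 6.8, 5.0],  # observed response Y_i(T_i)
})

# The two potential outcomes: only the one matching T_i is ever observed.
df["Y(1)"] = np.where(df["T"] == 1, df["y"], np.nan)
df["Y(0)"] = np.where(df["T"] == 0, df["y"], np.nan)
print(df)
```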

### Average Treatment Effect

For each individual, the **treatment effect** is $Y_{i}(1)-Y_{i}(0)$

The most common thing to estimate is the **Average Treatment Effect (ATE)**

$$ATE = \mathbb{E}[Y(1)-Y(0)] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$$

Can we just take the sample mean?

$$\hat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}(1) - Y_{i}(0)\right)$$

We cannot. Why? We only observe one of $Y_{i}(1)$, $Y_{i}(0)$.

**Fundamental problem of causal inference:** We only ever observe one potential outcome

To draw causal conclusions, we need some causal assumption relating the observed to the unobserved units

Instead of $\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}(1) - Y_{i}(0)\right)$, what if we took the difference between the sample means of the two groups?

$$\hat{ATE} = \frac{1}{n_{1}}\sum_{i: T_{i} = 1}{Y_{i}(1)} - \frac{1}{n_{0}}\sum_{i: T_{i} = 0}{Y_{i}(0)} = \frac{1}{n_{1}}\sum_{i: T_{i} = 1}{y_{i}} - \frac{1}{n_{0}}\sum_{i: T_{i} = 0}{y_{i}}$$

Is this estimator of $ATE$ unbiased? Not in general: if treatment assignment is related to the potential outcomes (for example, through confounders), the treated and control groups are not comparable samples, and this proposed $\hat{ATE}$ can be biased. Randomization fixes this.

If treatment assignment comes from random coin flips, then the treated units are an iid random sample of size $n_{1}$ from the population of $Y_{i}(1)$.

This means that,

$$\mathbb{E}[\frac{1}{n_{1}}\sum_{i: T_{i} = 1}{y_{i}}] = \mathbb{E}[Y_{i}(1)]$$

Similarly,

$$\mathbb{E}[\frac{1}{n_{0}}\sum_{i: T_{i} = 0}{y_{i}}] = \mathbb{E}[Y_{i}(0)]$$

which allows us to conclude that $\hat{ATE}$ is an unbiased estimator of $ATE$:

$$\mathbb{E}[\hat{ATE}] = ATE$$
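
As a quick sanity check of this unbiasedness claim, here is a small simulation sketch. The data-generating process (normal potential outcomes with a constant treatment effect of 2) is entirely made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical potential outcomes with a true ATE of 2.
Y0 = rng.normal(loc=5.0, scale=1.0, size=n)
Y1 = Y0 + 2.0
true_ate = np.mean(Y1 - Y0)

estimates = []
for _ in range(2000):
    T = rng.integers(0, 2, size=n)   # random coin-flip assignment
    y = np.where(T == 1, Y1, Y0)     # we only observe Y_i(T_i)
    estimates.append(y[T == 1].mean() - y[T == 0].mean())

print(true_ate, np.mean(estimates))  # the two should be very close
```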

### Randomized Experiments
In a **randomized experiment**, we randomly assign participants to two groups (the treatment group and the control group) and then apply the treatment to the treatment group only. We assume ignorability and gather as many measurements as possible.

However, randomly assigning treatments is often impractical or unethical. For example, assigning participants a treatment of cigarettes would be both.

An alternative to bypass this issue is to utilize **observational studies**.
An alternative to bypass this issue is to utilize **observational studies**. This can be done by obtaining two participant groups separated based on some identified treatment variable. Unlike randomized experiments, however, here we cannot assume ignorability: the participants could have separated into the two groups based on other covariates! In addition, there could also be unmeasured confounders.

Experiments:

@@ -573,168 +500,4 @@ Observational Study:

<img src="images/observational.png" alt='observational' width='600'>

### Covariate Adjustment

What to do about confounders?

- **Ignorability assumption:** all important confounders are in the data set!

**One idea:** come up with a model that includes them, such as:

$$Y_{i}(t) = \theta_{0} + \theta_{1}x_{1} + ... + \theta_{p}x_{p} + \tau t + \epsilon$$

**Question:** what is the $ATE$ in this model? It is $\tau$, since $Y_{i}(1) - Y_{i}(0) = \tau$ for every individual.
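
A minimal sketch of this idea, assuming the made-up data-generating process below and that the single confounder $x$ is observed: regress $y$ on the covariates plus the treatment indicator, and read off the coefficient on $t$ as the estimate of $\tau$. The use of `sklearn` and all constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 50_000

x = rng.normal(size=n)                         # observed confounder
T = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))  # treatment depends on x
y = 3.0 * x + 2.0 * T + rng.normal(size=n)     # true tau = 2

# Fit y ~ x + T; the coefficient on T estimates the ATE (tau).
features = np.column_stack([x, T])
model = LinearRegression().fit(features, y)
tau_hat = model.coef_[1]
print(tau_hat)  # close to 2
```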

This approach can work but is **fragile**. It breaks if:

- important covariates are missing, or
- the true dependence of the outcome on $x$ is nonlinear.

This style of analysis is sometimes pejoratively called **“causal inference.”**

<img src="images/ignorability.png" alt='ignorability' width='600'>

#### Covariate adjustment without parametric assumptions

What to do about confounders?

- **Ignorability assumption:** all possible confounders are in the data set!

**One idea:** come up with a model that includes them, such as:

$$Y_{i}(t) = f_{\theta}(x_{i}, t) + \epsilon$$

Then:

$$\hat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left(f_{\theta}(x_i, 1) - f_{\theta}(x_i, 0)\right)$$

With enough data, we may be able to learn $f_{\theta}$ very accurately

- This is very difficult if $x$ is high-dimensional or the true functional form is highly nonlinear
- Need additional assumption: **overlap**
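
Here is a sketch of that computation without a linear form, under the same ignorability and overlap assumptions. The choice of a random forest for $f_{\theta}$ and the synthetic data are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 20_000

x = rng.normal(size=n)
T = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = np.sin(x) + 2.0 * T + rng.normal(scale=0.5, size=n)  # true ATE = 2

# Fit f_theta(x, t) on the observed data.
features = np.column_stack([x, T])
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=50, random_state=0)
model.fit(features, y)

# Plug in t = 1 and t = 0 for every observed x_i and average the difference.
f1 = model.predict(np.column_stack([x, np.ones(n)]))
f0 = model.predict(np.column_stack([x, np.zeros(n)]))
ate_hat = np.mean(f1 - f0)
print(ate_hat)  # roughly 2
```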

### Other Methods

Causal inference is hard, and covariate adjustment is often not the best approach

Many other methods are some combination of:

- Modeling treatment T as a function of covariates x
- Modeling the outcome y as a function of x, T

What if we don’t believe in ignorability? Other methods instead look for a **natural experiment**, a setting in which the treatment is effectively assigned at random.

- Favorite example: **regression discontinuity**
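
As a hedged illustration of the regression discontinuity idea (the cutoff design and all numbers below are made up, and this is only the simplest global-linear version of the method): treatment is assigned when a running variable crosses a cutoff, and the causal effect is estimated as the jump in the fitted outcome at that cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

running = rng.uniform(-1, 1, size=n)  # running variable, cutoff at 0
T = (running >= 0).astype(int)        # treatment assigned by the cutoff
y = 1.5 * running + 2.0 * T + rng.normal(scale=0.5, size=n)  # true jump = 2

# Fit a line on each side of the cutoff and compare intercepts at 0.
left = running < 0
right = running >= 0
b_left = np.polyfit(running[left], y[left], deg=1)
b_right = np.polyfit(running[right], y[right], deg=1)

# polyfit returns [slope, intercept]; the intercept is the fitted value at 0.
jump = b_right[1] - b_left[1]
print(jump)  # close to 2
```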

## [Bonus] Proof of Bias-Variance Decomposition

This section walks through the detailed derivation of the Bias-Variance Decomposition in the Bias-Variance Tradeoff section earlier in Note 18.

:::{.callout collapse="true"}
### Click to show
We want to prove that the model risk can be decomposed as

$$
\begin{align*}
E\left[(Y(x)-\hat{Y}(x))^2\right] &= E[\epsilon^2] + \left(g(x)-E\left[\hat{Y}(x)\right]\right)^2 + E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right].
\end{align*}
$$

To prove this, we will first need the following lemma:

<center>If $V$ and $W$ are independent random variables then $E[VW] = E[V]E[W]$.</center>

We will prove this in the discrete finite case. Trust that it's true in greater generality.

The job is to calculate the weighted average of the values of $VW$, where the weights are the probabilities of those values. Here goes.

$$
\begin{align*}
E[VW] ~ &= ~ \sum_v\sum_w vwP(V=v \text{ and } W=w) \\
&= ~ \sum_v\sum_w vwP(V=v)P(W=w) ~~~~ \text{by independence} \\
&= ~ \sum_v vP(V=v)\sum_w wP(W=w) \\
&= ~ E[V]E[W]
\end{align*}
$$

Now we go into the actual proof:

### Goal
Decompose the model risk into recognizable components.

### Step 1
$$
\begin{align*}
\text{model risk} ~ &= ~ E\left[\left(Y - \hat{Y}(x)\right)^2 \right] \\
&= ~ E\left[\left(g(x) + \epsilon - \hat{Y}(x)\right)^2 \right] \\
&= ~ E\left[\left(\epsilon + \left(g(x)- \hat{Y}(x)\right)\right)^2 \right] \\
&= ~ E\left[\epsilon^2\right] + 2E\left[\epsilon \left(g(x)- \hat{Y}(x)\right)\right] + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right]\\
\end{align*}
$$

On the right hand side:

- The first term is the observation variance $\sigma^2$.
- The cross product term is 0 because $\epsilon$ is independent of $g(x) - \hat{Y}(x)$ and $E(\epsilon) = 0$
- The last term is the mean squared difference between our predicted value and the value of the true function at $x$

### Step 2
At this stage we have

$$
\text{model risk} ~ = ~ E\left[\epsilon^2\right] + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right]
$$

We don't yet have a good understanding of $g(x) - \hat{Y}(x)$. But we do understand the deviation $D_{\hat{Y}(x)} = \hat{Y}(x) - E\left[\hat{Y}(x)\right]$. We know that

- $E\left[D_{\hat{Y}(x)}\right] ~ = ~ 0$
- $E\left[D_{\hat{Y}(x)}^2\right] ~ = ~ \text{model variance}$

So let's add and subtract $E\left[\hat{Y}(x)\right]$ and see if that helps.

$$
g(x) - \hat{Y}(x) ~ = ~ \left(g(x) - E\left[\hat{Y}(x)\right] \right) + \left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)
$$

The first term on the right hand side is the model bias at $x$. The second term is $-D_{\hat{Y}(x)}$. So

$$
g(x) - \hat{Y}(x) ~ = ~ \text{model bias} - D_{\hat{Y}(x)}
$$

### Step 3

Remember that the model bias at $x$ is a constant, not a random variable. Think of it as your favorite number, say 10. Then
$$
\begin{align*}
E\left[ \left(g(x) - \hat{Y}(x)\right)^2 \right] ~ &= ~ \text{model bias}^2 - 2(\text{model bias})E\left[D_{\hat{Y}(x)}\right] + E\left[D_{\hat{Y}(x)}^2\right] \\
&= ~ \text{model bias}^2 - 0 + \text{model variance} \\
&= ~ \text{model bias}^2 + \text{model variance}
\end{align*}
$$

Again, the cross-product term is $0$ because $E\left[D_{\hat{Y}(x)}\right] ~ = ~ 0$.

### Step 4: Bias-Variance Decomposition

In Step 2 we had

$$
\text{model risk} ~ = ~ \text{observation variance} + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right]
$$

Step 3 showed

$$
E\left[ \left(g(x) - \hat{Y}(x)\right)^2 \right] ~ = ~ \text{model bias}^2 + \text{model variance}
$$

Thus we have shown the bias-variance decomposition:

$$
\text{model risk} = \text{observation variance} + \text{model bias}^2 + \text{model variance}.
$$

That is,

$$
E\left[(Y(x)-\hat{Y}(x))^2\right] = \sigma^2 + \left(E\left[\hat{Y}(x)\right] - g(x)\right)^2 + E\left[\left(\hat{Y}(x)-E\left[\hat{Y}(x)\right]\right)^2\right]
$$
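
As an optional numerical sanity check of this identity (the data-generating process, the linear model, and all constants below are made up for illustration), we can simulate many training sets drawn from $Y = g(x) + \epsilon$, fit a simple model on each, and compare the empirical model risk at a fixed $x$ with observation variance $+$ model bias$^2$ $+$ model variance:

```python
import numpy as np

rng = np.random.default_rng(4)
g = np.sin            # true function g(x)
sigma = 0.3           # observation noise SD, so observation variance = sigma**2
x0 = 1.0              # fixed query point x
n_train, reps = 30, 5_000

preds, sq_errors = [], []
for _ in range(reps):
    # Draw a fresh training set from Y = g(x) + epsilon and fit a line.
    x_train = rng.uniform(-2, 2, size=n_train)
    y_train = g(x_train) + rng.normal(scale=sigma, size=n_train)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    y_hat = slope * x0 + intercept
    preds.append(y_hat)

    # Compare against a fresh observation Y(x0) = g(x0) + epsilon.
    y_new = g(x0) + rng.normal(scale=sigma)
    sq_errors.append((y_new - y_hat) ** 2)

preds = np.array(preds)
model_risk = np.mean(sq_errors)
decomposition = sigma**2 + (g(x0) - preds.mean())**2 + preds.var()
print(model_risk, decomposition)  # the two should be close
```
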
:::
