forest bad model notebook

aangelopoulos · Oct 2, 2023 · 1c009c7
1 parent 8440ec3
commit 1c009c7
Showing 1 changed file with 53 additions and 7 deletions.
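
For orientation, the effect this notebook demonstrates can be reproduced on synthetic data. The sketch below is not the notebook's code and does not use the `ppi_py` API: it hand-codes the standard prediction-powered interval for a mean (without power tuning) next to the classical CLT interval, in pure Python. The data-generating process, the sample sizes `n` and `N`, `alpha = 0.1`, and all function names are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def classical_mean_ci(y_labeled, alpha=0.1):
    """CLT interval for the mean using only the labeled outcomes."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance(y_labeled) / len(y_labeled))
    center = fmean(y_labeled)
    return center - half, center + half


def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.1):
    """Prediction-powered interval for the mean (no power tuning)."""
    n, N = len(y_labeled), len(yhat_unlabeled)
    # Rectifier: the model's errors on the labeled data.
    rectifier = [y - f for y, f in zip(y_labeled, yhat_labeled)]
    center = fmean(yhat_unlabeled) + fmean(rectifier)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance(yhat_unlabeled) / N + variance(rectifier) / n)
    return center - half, center + half


def width(interval):
    return interval[1] - interval[0]


# Synthetic stand-in for the deforestation labels (an assumption, not the real data).
rng = random.Random(0)
n, N = 200, 5000
y_lab = [1.0 if rng.random() < 0.15 else 0.0 for _ in range(n)]
# Labels of the unlabeled set are never observed; they only drive the fake predictions.
y_unl = [1.0 if rng.random() < 0.15 else 0.0 for _ in range(N)]

# "Bad" model: predictions are pure noise, unrelated to the labels.
bad_lab = [rng.random() for _ in y_lab]
bad_unl = [rng.random() for _ in y_unl]

# "Good" model: predictions track the labels up to small Gaussian noise.
good_lab = [y + rng.gauss(0, 0.1) for y in y_lab]
good_unl = [y + rng.gauss(0, 0.1) for y in y_unl]

classical_width = width(classical_mean_ci(y_lab))
ppi_bad_width = width(ppi_mean_ci(y_lab, bad_lab, bad_unl))
ppi_good_width = width(ppi_mean_ci(y_lab, good_lab, good_unl))

print(f"classical:        {classical_width:.4f}")
print(f"PPI (bad model):  {ppi_bad_width:.4f}")
print(f"PPI (good model): {ppi_good_width:.4f}")
```

With uninformative predictions the rectifier `Y - Yhat` has larger variance than `Y` itself, so the prediction-powered interval comes out wider than the classical one; with accurate predictions the opposite holds.
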
diff --git a/examples/baselines/forest_badmodel.ipynb b/examples/baselines/forest_badmodel.ipynb
@@ -1,5 +1,24 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b1a2661f",
+   "metadata": {},
+   "source": [
+    "# Cases Where Prediction-Powered Inference is Underpowered: Bad Model\n",
+    "\n",
+    "The goal of this experiment is to demonstrate a case where prediction-powered inference is underpowered because the machine-learning model is not accurate enough.\n",
+    "The inferential target is the fraction of the Amazon rainforest lost between 2000 and 2015. The same problem is studied in the notebook [```forest.ipynb```](https://github.com/aangelopoulos/ppi_py/blob/main/examples/forest.ipynb); here, however, a deliberately worse predictive model is trained for the purpose of the demonstration."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c1f0f0a",
+   "metadata": {},
+   "source": [
+    "### Import necessary packages"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 1,
@@ -29,7 +48,9 @@
    "id": "5cf90ae6",
    "metadata": {},
    "source": [
-    "# Import the forest data set using the linear regression model"
+    "### Import the forest data set with predictions made via a linear model\n",
+    "\n",
+    "Load the data. The data set contains gold-standard deforestation labels (```Y```) and deforestation labels predicted via linear regression (```Yhat```)."
    ]
   },
   {
@@ -40,8 +61,8 @@
    "outputs": [],
    "source": [
     "data = np.load(\n",
-    "    \"../data/forest_badmodel.npz\"\n",
-    ")  # This data can be downloaded from this Google Drive link:\n",
+    "    \"forest_badmodel.npz\"\n",
+    ") \n",
     "Y_total = data[\"Y\"]\n",
     "Yhat_total = data[\"Yhat\"]"
    ]
@@ -51,7 +72,11 @@
    "id": "8969f9db",
    "metadata": {},
    "source": [
-    "# Problem setup"
+    "### Problem setup\n",
+    "\n",
+    "Specify the error level (```alpha```), range of values for the labeled data set size (```ns```), and number of trials (```num_trials```).\n",
+    "\n",
+    "Compute the ground-truth value of the estimand."
    ]
   },
   {
@@ -77,7 +102,14 @@
    "id": "83ce18be",
    "metadata": {},
    "source": [
-    "# Construct intervals"
+    "### Construct intervals\n",
+    "\n",
+    "Form confidence intervals for all methods and problem parameters. The results are collected in a dataframe with the following columns:\n",
+    "1. ```method``` (one of ```PPI```, ```Classical```, and ```Imputation```)\n",
+    "2. ```n``` (labeled data set size, takes values in ```ns```)\n",
+    "3. ```lower``` (lower endpoint of the confidence interval)\n",
+    "4. ```upper``` (upper endpoint of the confidence interval)\n",
+    "5. ```trial``` (index of trial, goes from ```0``` to ```num_trials-1```)"
    ]
   },
   {
@@ -164,7 +196,11 @@
    "id": "d15ba288",
    "metadata": {},
    "source": [
-    "# Plot results"
+    "### Plot results\n",
+    "\n",
+    "Plot:\n",
+    "1. Five randomly chosen intervals from the dataframe for PPI and the classical method, and the imputed interval;\n",
+    "2. The average interval width for PPI and the classical method, together with a scatterplot of the widths from the five random draws."
    ]
   },
   {
@@ -197,6 +233,16 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6f7398dd",
+   "metadata": {},
+   "source": [
+    "### Power experiment\n",
+    "\n",
+    "For PPI and the classical approach, find the smallest value of ```n``` such that the method has 80% power against the null hypothesis that there is no deforestation, $H_0: \\text{deforestation} \\leq 0$."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
@@ -285,7 +331,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.12"
+   "version": "3.8.15"
   }
  },
  "nbformat": 4,