
Commit

Release/1.0.0 (#29)
* started updates to readme

* changed readme image

* adjusted curves img in readme

* adjusted image again

* pylint clean-up

* started dividing up the GPS class

* more work on dividing GPS up

* started updating TMLE core

* made good progress on TMLE

* got the code working

* start revising unit tests. Done with tests of Core class

* finished revising tests

* whoops, needed more tests

* started making big changes to docs

* tested outside of project folder

* revised documentation

* final changes

* fixed docs

Co-authored-by: rkobrosly <[email protected]>
ronikobrosly and rkobrosly authored Jan 3, 2021
1 parent 50213c1 commit ab10e30
Showing 36 changed files with 2,285 additions and 1,193 deletions.
598 changes: 598 additions & 0 deletions .pylintrc

Large diffs are not rendered by default.

36 changes: 14 additions & 22 deletions README.md
@@ -4,30 +4,28 @@
[![codecov](https://codecov.io/gh/ronikobrosly/causal-curve/branch/master/graph/badge.svg)](https://codecov.io/gh/ronikobrosly/causal-curve)
[![DOI](https://zenodo.org/badge/256017107.svg)](https://zenodo.org/badge/latestdoi/256017107)

Python tools to perform causal inference when the treatment of interest is continuous.


<p align="center">
<img src="/imgs/curves.png" align="middle"/>
</p>





## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Documentation](#documentation)
- [Contributing](#contributing)
- [Citation](#citation)
- [References](#references)

## Overview

(**Version 1.0.0 released in January 2021!**)

There are many implemented methods to perform causal inference when your intervention of interest is binary,
but few methods exist to handle continuous treatments.

@@ -61,15 +59,6 @@ pip install .
[Documentation is available at readthedocs.org](https://causal-curve.readthedocs.io/en/latest/)


## Contributing

Your help is absolutely welcome! Please do reach out or create a feature branch!
@@ -83,19 +72,22 @@ Kobrosly, R. W., (2020). causal-curve: A Python Causal Inference Package to Esti
Galagate, D. Causal Inference with a Continuous Treatment and Outcome: Alternative
Estimators for Parametric Dose-Response function with Applications. PhD thesis, 2016.

Hirano K and Imbens GW. The propensity score with continuous treatments.
In: Gelman A and Meng XL (eds) Applied bayesian modeling and causal inference
from incomplete-data perspectives. Oxford, UK: Wiley, 2004, pp.73–84.

Imai K, Keele L, Tingley D. A General Approach to Causal Mediation Analysis. Psychological
Methods. 15(4), 2010, pp.309–334.

Kennedy EH, Ma Z, McHugh MD, Small DS. Nonparametric methods for doubly robust estimation
of continuous treatment effects. Journal of the Royal Statistical Society, Series B. 79(4), 2017, pp.1229-1245.

Moodie E and Stephens DA. Estimation of dose–response functions for
longitudinal data using the generalised propensity score. In: Statistical Methods in
Medical Research 21(2), 2010, pp.149–166.

van der Laan MJ and Gruber S. Collaborative double robust penalized targeted
maximum likelihood estimation. In: The International Journal of Biostatistics 6(1), 2010.

van der Laan MJ and Rubin D. Targeted maximum likelihood learning. In: ​U.C. Berkeley Division of
Biostatistics Working Paper Series, 2006.

6 changes: 4 additions & 2 deletions causal_curve/__init__.py
@@ -4,8 +4,10 @@

from statsmodels.genmod.generalized_linear_model import DomainWarning

from causal_curve.gps_classifier import GPS_Classifier
from causal_curve.gps_regressor import GPS_Regressor

from causal_curve.tmle_regressor import TMLE_Regressor
from causal_curve.mediation import Mediation


70 changes: 67 additions & 3 deletions causal_curve/core.py
@@ -1,14 +1,16 @@
"""
Core classes (with basic methods) that will be invoked when other model classes are defined
"""
import pkg_resources

import numpy as np
from scipy.stats import norm


class Core:
"""Base class for causal_curve module"""

__version__ = "1.0.0"

def get_params(self):
"""Returns a dict of all of the object's user-facing parameters
@@ -26,4 +28,66 @@ def get_params(self):
[(k, v) for k, v in list(attrs.items()) if (k[0] != "_") and (k[-1] != "_")]
)

def if_verbose_print(self, string):
"""Prints the input statement if verbose is set to True
Parameters
----------
string: str, some string to be printed
Returns
----------
None
"""
if self.verbose:
print(string)

@staticmethod
def rand_seed_wrapper(random_seed=None):
"""Sets the random seed using numpy
Parameters
----------
random_seed: int, random seed number
Returns
----------
None
"""
if random_seed is None:
pass
else:
np.random.seed(random_seed)

@staticmethod
def calculate_z_score(ci):
"""Calculates the critical z-score for a desired two-sided
confidence interval width.
Parameters
----------
ci: float, the confidence interval width (e.g. 0.95)
Returns
-------
Float, critical z-score value
"""
return norm.ppf((1 + ci) / 2)

@staticmethod
def clip_negatives(number):
"""Helper function to clip negative numbers to zero
Parameters
----------
number: int or float, any number that needs a floor at zero
Returns
-------
Int or float of modified value
"""
if number < 0:
return 0
return number
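The two static helpers added above are small enough to sanity-check in isolation. A minimal standalone sketch, substituting the stdlib `statistics.NormalDist` for the `scipy.stats.norm` import the module actually uses:

```python
from statistics import NormalDist  # stdlib stand-in for scipy.stats.norm


def calculate_z_score(ci):
    """Critical z-score for a two-sided interval of width `ci`."""
    # For a 95% interval, half of the remaining 5% sits in each tail,
    # so we invert the standard normal CDF at (1 + 0.95) / 2 = 0.975.
    return NormalDist().inv_cdf((1 + ci) / 2)


def clip_negatives(number):
    """Floor a value at zero."""
    return 0 if number < 0 else number


print(round(calculate_z_score(0.95), 2))  # → 1.96
print(clip_negatives(-0.25))              # → 0
```

The round-trip agrees with the familiar 1.96 critical value for a 95% interval, which is the sanity check these helpers need.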
110 changes: 110 additions & 0 deletions causal_curve/gps_classifier.py
@@ -0,0 +1,110 @@
"""
Defines the Generalized Propensity Score (GPS) classifier model class
"""

import numpy as np
from scipy.special import logit

from causal_curve.gps_core import GPS_Core


class GPS_Classifier(GPS_Core):
"""
A GPS tool that handles binary outcomes. Inherits from the GPS_Core
base class. See that base class's code and docstring for more details.
"""

def __init__(
self,
gps_family=None,
treatment_grid_num=100,
lower_grid_constraint=0.01,
upper_grid_constraint=0.99,
spline_order=3,
n_splines=30,
lambda_=0.5,
max_iter=100,
random_seed=None,
verbose=False,
):
GPS_Core.__init__(
self,
gps_family=gps_family,
treatment_grid_num=treatment_grid_num,
lower_grid_constraint=lower_grid_constraint,
upper_grid_constraint=upper_grid_constraint,
spline_order=spline_order,
n_splines=n_splines,
lambda_=lambda_,
max_iter=max_iter,
random_seed=random_seed,
verbose=verbose,
)

def _cdrc_predictions_binary(self, ci):
"""Returns the predictions of CDRC for each value of the treatment grid. Essentially,
we're making predictions using the original treatment and gps_at_grid.
To be used when the outcome of interest is binary.
"""
# To keep track of cdrc predictions, we create an empty 2d array of shape
# (n_samples, treatment_grid_num, 2). The last dimension is of length 2 because
# we are going to keep track of the point estimate (log-odds) of the prediction, as well as
# the standard error of the prediction interval (again, this is for the log odds)
cdrc_preds = np.zeros((len(self.T), self.treatment_grid_num, 2), dtype=float)

# Loop through each of the grid values, predict point estimate and get prediction interval
for i in range(0, self.treatment_grid_num):

temp_T = np.repeat(self.grid_values[i], repeats=len(self.T))
temp_gps = self.gps_at_grid[:, i]

temp_cdrc_preds = logit(
self.gam_results.predict_proba(np.column_stack((temp_T, temp_gps)))
)

temp_cdrc_interval = logit(
self.gam_results.confidence_intervals(
np.column_stack((temp_T, temp_gps)), width=ci
)
)

standard_error = (
temp_cdrc_interval[:, 1] - temp_cdrc_preds
) / self.calculate_z_score(ci)

cdrc_preds[:, i, 0] = temp_cdrc_preds
cdrc_preds[:, i, 1] = standard_error

return np.round(cdrc_preds, 3)
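The standard-error recovery in the loop above divides the half-width of the model's confidence interval by the critical z-score, so the full interval can later be rebuilt from just the point estimate and SE. A self-contained numeric sketch of that step, with hypothetical values standing in for the `gam_results` output:

```python
from statistics import NormalDist  # stdlib stand-in for scipy.stats.norm

ci = 0.95
z = NormalDist().inv_cdf((1 + ci) / 2)  # ≈ 1.96

# Hypothetical point estimate and upper CI bound on the log-odds scale,
# standing in for the predict_proba / confidence_intervals output above.
point_estimate = 0.40
upper_bound = 1.18

# Same recovery step as in _cdrc_predictions_binary:
# half-width of the interval divided by the critical z-score.
standard_error = (upper_bound - point_estimate) / z
print(round(standard_error, 3))  # → 0.398

# The interval can be rebuilt from (point, SE), which is why only these
# two numbers per grid value need to be stored in cdrc_preds.
rebuilt_upper = point_estimate + z * standard_error
print(round(rebuilt_upper, 2))  # → 1.18, matching upper_bound
```

Storing `(point, SE)` rather than the raw interval keeps the `cdrc_preds` array compact while losing no information.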

def estimate_log_odds(self, T):
"""Calculates the estimated log odds of the highest integer class. Can
only be used when the outcome is binary. Log odds can be estimated for a single
data point or in batch for many observations. Extrapolation will produce
untrustworthy results; the provided treatment should be within
the range of the training data.
Parameters
----------
T: Numpy array, shape (n_samples,)
A continuous treatment variable.
Returns
----------
array: Numpy array
Contains a set of log odds
"""
if self.outcome_type != "binary":
raise TypeError("Your outcome must be binary to use this function!")

return np.apply_along_axis(self._create_log_odds, 0, T.reshape(1, -1))

def _create_log_odds(self, T):
"""Takes a single treatment value and produces the log odds of the higher
integer class, in the case of a binary outcome.
"""
return logit(
self.gam_results.predict_proba(
np.array([T, self.gps_function(T).mean()]).reshape(1, -1)
)
)
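`estimate_log_odds` and `_create_log_odds` report results on the log-odds (logit) scale. A standalone sketch of the transform and its inverse, using `math.log` in place of the `scipy.special.logit` call the class actually makes, with hypothetical probabilities standing in for `predict_proba` output:

```python
import math


def logit(p):
    """Log-odds of probability p (stdlib stand-in for scipy.special.logit)."""
    return math.log(p / (1 - p))


def inv_logit(log_odds):
    """Map log-odds back to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical predicted probabilities for the positive class at a few
# treatment values.
probs = [0.5, 0.73, 0.9]
log_odds = [logit(p) for p in probs]
print([round(x, 3) for x in log_odds])  # logit(0.5) is exactly 0.0

# Round-trip check: inv_logit undoes logit.
assert all(abs(inv_logit(lo) - p) < 1e-12 for lo, p in zip(log_odds, probs))
```

A probability of 0.5 maps to log odds of 0, which makes the logit scale convenient for reading off whether the positive class is more or less likely than chance.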
