Handling missing data #13

Open
markmfredrickson opened this issue Nov 29, 2011 · 5 comments

Comments

@markmfredrickson
Owner

We need tests and handling for missing data in the various randomization distribution functions.

One wrinkle: what to do when removing missing observations leaves only a single treatment condition in a block (e.g. only treated units)? Warning? Error? Crossed fingers?

Thanks to Megan Eisenman for flagging this issue.

@jwbowers
Collaborator

Yep.

I know that in xBalance we have "good strata" and "bad strata" (i.e. strata with only one member or with no variation on X). Perhaps we can use some code from there?
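For concreteness, a minimal sketch of flagging such strata (hypothetical code, not lifted from xBalance):

```r
## Flag "bad" blocks: fewer than two non-missing units, or no
## variation in treatment after dropping NAs.
badBlocks <- function(z, blocks) {
  ok  <- !is.na(z) & !is.na(blocks)
  tab <- table(blocks[ok], z[ok])
  rowSums(tab) < 2 | rowSums(tab > 0) < 2
}

z      <- c(1, 0, 1, NA, 0)
blocks <- c(1, 1, 1, 2, 2)
badBlocks(z, blocks)  # block 2 flagged: one non-missing unit, all control
```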

Jake

@benthestatistician
Collaborator

You raise a subtle and important issue, Mark. I'd agree that the first order of business should be (writing code for) identifying cases with missing data. The next question is what to do with them once you've identified them.

You ask about how to handle the situation where casewise deletion has left you without variation in the treatment variable in some blocks, but that question raises another that may be more important: should we be defaulting to casewise deletion at all? In most situations I've encountered, some form of logical imputation makes more sense than casewise deletion. E.g., in a get-out-the-vote study, people who don't have a record as to whether they voted might be logically imputed to be no-votes. I would lean toward forcing the analyst to make a decision about how to handle the missing cases, rather than defaulting to casewise deletion.
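To illustrate the contrast in R (the voted vector and the zero imputation are hypothetical):

```r
voted <- c(1, NA, 0, 1, NA)   # turnout outcome with missing records

## casewise deletion keeps only complete cases:
sum(!is.na(voted))            # 3 of 5 units remain

## logical imputation: no turnout record means no recorded vote
voted.imputed <- ifelse(is.na(voted), 0, voted)
```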

Ben

@markmfredrickson
Owner Author

I think your points about casewise deletion are well said: let's not
make that decision for people.

In thinking about this more, let me list how the data are used in the
randomizationDistributionEngine and
parameterizedRandomizationDistribution. There are three vectors, all
of size n:

  • y: the outcome -- what we've primarily been talking about.
  • z: an indicator for treatment assignment. While it is not currently
    the case, I think it should be an error to have missing entries in
    this vector.
  • blocks: an indicator for block assignment (if NULL, everyone gets
    put in a single block). Again, it should probably be an error to have
    missingness here (see the sketch after this list).
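For illustration, the proposed checks might look like this (a hypothetical helper, not current package code):

```r
checkInputs <- function(y, z, blocks = NULL) {
  if (is.null(blocks)) blocks <- rep(1, length(y))   # everyone in one block
  if (any(is.na(z)))      stop("missing values in treatment indicator z")
  if (any(is.na(blocks))) stop("missing values in block indicator")
  stopifnot(length(y) == length(z), length(z) == length(blocks))
  blocks
}
```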

The z and blocks vectors are used to generate new treatment
assignments. Even if the vectors have no NAs, it is still possible to
have blocks with only treated or only control units. Should this be an
error? As an alternative, if we had z <- c(1,0,1,0,0) and
blocks <- c(1,1,1,2,2) (so no treated units in block 2), all treatment
assignments would agree on block 2. The entire omega matrix would be
composed of the observed z along with z1 <- c(1,1,0,0,0) and
z2 <- c(0,1,1,0,0). It may be worth a warning letting users know we are
doing this.
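To make the example concrete, here is an illustrative rendering (not the engine's actual code) of the omega matrix for that z and blocks:

```r
z      <- c(1, 0, 1, 0, 0)
blocks <- c(1, 1, 1, 2, 2)

## block 1 has three units, two of them treated; block 2 has no treated
## units, so its assignment never varies across permutations
omega <- apply(combn(3, 2), 2, function(ix) {
  zz <- rep(0, length(z))
  zz[ix] <- 1
  zz
})
omega  # columns: c(1,1,0,0,0), c(1,0,1,0,0), c(0,1,1,0,0); rows 4-5 constant
```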

The outcomes vector y is used in two places: (1) for each model of
effect, the data are adjusted consistent with that model (e.g.
adj.data <- moe(y, z, b)); (2) both the observed and adjusted data are
fed to the test statistic (e.g. observed.test.stat <- teststat(y, z, b)).

These uses suggest that it is the responsibility of the model and the
statistic to handle NAs. We aren't washing our hands of the problem, as
we are providing many of these out of the box, but I think it would
make sense to solve the problem at the moe and statistic level.
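If NA handling lives at the moe and statistic level, a statistic could look like the following sketch, which mirrors the teststat(y, z, b) convention above (hypothetical code, not the package's):

```r
meanDiffStat <- function(y, z, b) {
  ok <- !is.na(y)   # the statistic, not the engine, decides what to drop
  mean(y[ok & z == 1]) - mean(y[ok & z == 0])
}
```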

@jwbowers
Collaborator

jwbowers commented Dec 1, 2011

I think the issue is mostly in the test statistic. That is, if we are not relying on E(t(Z,R)) and Var(t(Z,R)) in closed form plus a Normal approximation, then the randomization distribution arising from a situation where we have blocks with all treated units, or blocks with no variation on outcomes, would still be valid --- it would just be more discrete and wider (if not also less Normalish) than the situation where we have variation on treatment and outcomes in each block.

We could give a warning when this happens, since it may well happen by accident more often than not. Or perhaps change the summary method for pRD objects to give a table or summary of the test statistics --- if there is only one, or it looks overly discrete to the analyst, then they would know that something is not quite right. [I.e., I am thinking about the nice summary method for optmatch as an exemplar here.]
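The kind of summary described here might look like this (a sketch only; how pRD objects store their test statistics is assumed):

```r
summarizeStats <- function(stats) {
  if (length(unique(stats)) == 1)
    warning("test statistic is constant across assignments")
  table(round(stats, 3))   # heavy discreteness shows up immediately
}
```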

Missing outcomes are a problem, as are missing treatments. I think we can tell the analyst what we do (delete a case if it is missing either treatment or outcome) and let them think about how to handle that situation. (Down the line, one might imagine some interesting applications to missing outcomes like those we use with covariates --- rather than impute the outcomes, we could think of the outcomes as multivariate, either having the distribution of the non-missing values or of the missing values.)

Missing treatments are also tough. It seems like these cases should mostly be deleted for now, unless the analyst has some model of missing treatments that imputes them outside of our package.
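A sketch of telling the analyst what was dropped rather than doing it silently (an illustrative helper, not package code):

```r
dropIncomplete <- function(y, z) {
  ok <- !is.na(y) & !is.na(z)
  if (any(!ok))
    message(sum(!ok), " case(s) dropped for missing treatment or outcome")
  list(y = y[ok], z = z[ok])
}
```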

Jake


@benthestatistician
Collaborator

Good discussion! Briefly:

  1. In the case that a block has no treatment variation (because of NAs or even without them), I vote for Mark's "alternative": throw a warning rather than an error and follow through with the rest of the analysis.
  2. I'm happy to have Mark & Jake follow their instincts on the remaining issues. When you need a tiebreaker, though, my vote would almost invariably go against casewise deletion when there is an alternative.

--Ben
