Handling missing data #13
Comments
Yep. I know that in xBalance we have "good strata" and "bad strata" (i.e. strata with only one member or with no variation on X). Perhaps we can use some code from there? Jake On Nov 29, 2011, at 11:22 AM, Mark Fredrickson wrote:
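The "good strata" / "bad strata" split described above can be sketched roughly as follows. This is an illustration in Python, not code from xBalance; the function name `classify_strata` and its arguments are hypothetical, and the rule is only the one stated in the comment (a stratum is bad if it has a single member or no variation on X).

```python
from collections import defaultdict

def classify_strata(strata, x):
    """Split stratum labels into 'good' and 'bad' (hypothetical helper).

    A stratum is 'bad' if it has only one member or if all of its
    members share the same value of x; otherwise it is 'good'.
    """
    values = defaultdict(list)
    for s, xi in zip(strata, x):
        values[s].append(xi)

    good, bad = [], []
    for s, xs in values.items():
        if len(xs) < 2 or len(set(xs)) < 2:
            bad.append(s)
        else:
            good.append(s)
    return good, bad

# Stratum "a" varies on x, "b" has one member, "c" has no variation.
strata = ["a", "a", "b", "c", "c"]
x = [1.0, 2.0, 3.0, 4.0, 4.0]
good, bad = classify_strata(strata, x)
print("good:", good, "bad:", bad)
```

Returning the labels, rather than silently dropping the bad strata, leaves the decision about what to do with them to the caller.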
You raise a subtle and important issue, Mark. I agree that the first order of business should be (writing code for) identifying cases with missing data. The next question is what to do with them once you've identified them. You ask how to handle the situation where casewise deletion has left some blocks without variation in the treatment variable, but that question raises another that may be more important: should we be defaulting to casewise deletion at all? In most situations I've encountered, some form of logical imputation makes more sense than casewise deletion. E.g., in a get-out-the-vote study, people with no record of whether they voted might be logically imputed as non-voters. I would lean toward forcing the analyst to decide how to handle the missing cases, rather than defaulting to casewise deletion. Ben
Ben Hansen
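One way to read the suggestion of forcing the analyst to decide is an interface with no default policy, which refuses to proceed when values are missing until a choice is made. The sketch below is a Python illustration under that assumption; `resolve_missing`, `policy`, and `fill` are hypothetical names, not part of the package.

```python
def resolve_missing(values, policy=None, fill=None):
    """Return (cleaned_values, kept_indices) under an explicit policy.

    policy="delete": drop cases whose value is None (casewise deletion).
    policy="impute": replace None with `fill` (logical imputation).
    policy=None:     raise if anything is missing, so the analyst must choose.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    if not missing:
        return list(values), list(range(len(values)))
    if policy is None:
        raise ValueError(
            f"{len(missing)} missing value(s); choose policy='delete' or 'impute'"
        )
    if policy == "delete":
        kept = [i for i, v in enumerate(values) if v is not None]
        return [values[i] for i in kept], kept
    if policy == "impute":
        if fill is None:
            raise ValueError("policy='impute' requires a fill value")
        cleaned = [fill if v is None else v for v in values]
        return cleaned, list(range(len(values)))
    raise ValueError(f"unknown policy: {policy!r}")

# Get-out-the-vote example: no turnout record is imputed as "did not vote".
voted = [1, None, 0, None, 1]
imputed, _ = resolve_missing(voted, policy="impute", fill=0)
deleted, kept = resolve_missing(voted, policy="delete")
print(imputed, deleted, kept)
```

Returning the kept indices lets the caller subset the treatment and block vectors consistently after deletion.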
I think your points about casewise deletion are well said: let's not In thinking about this more, let me list how the data are used in the
The z and b are used to generate new treatment assignments. Even if The outcomes vector y is used in two places: (1) for each model of These uses suggest that it is the responsibility of model and the
I think the issue is mostly in the test statistic. That is, if we are not relying on E(t(Z,R)) and Var(t(Z,R)) in closed form plus a Normal approximation, then the randomization distribution arising from a situation where we have blocks with all units treated, or blocks with no variation in outcomes, would still be valid; it would just be more discrete and wider (if not also less Normal-looking) than in the situation where we have variation in treatment and outcomes in each block. We could give a warning when this happens, since it may well happen by accident more often than not. Or perhaps we could change the summary method for pRD objects to give a table or summary of the test statistics: if there is only one value, or the distribution looks overly discrete to the analyst, then they would know that something is not quite right. [I am thinking of the nice summary method for optmatch as an exemplar here.] Missing outcomes are a problem, as are missing treatments. I think we can tell the analyst what we do (delete a case if it is missing either treatment or outcome) and let them think about how to handle that situation. (Down the line one might imagine some interesting applications to missing outcomes like those we use with covariates: rather than impute the outcomes, we could think of the outcomes as multivariate, either having the distribution of non-missing values or missing values.) Missing treatments are also tough. It seems these cases should mostly be deleted for now, unless the analyst has some model of missing treatments that imputes treatments outside of our package. Jake On Nov 30, 2011, at 4:19 PM, Mark Fredrickson wrote:
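To make the point about discreteness concrete, here is a small Python sketch that enumerates the exact randomization distribution by re-assigning treatment within blocks. It is an illustration only: the test statistic (the sum of treated outcomes) is an assumption chosen for simplicity, the function names are hypothetical, and full enumeration is feasible only for very small data sets.

```python
from collections import Counter, defaultdict
from itertools import combinations, product

def block_assignments(z_block):
    """All 0/1 assignments in one block with the same number treated."""
    n, m = len(z_block), sum(z_block)
    return [tuple(1 if i in chosen else 0 for i in range(n))
            for chosen in combinations(range(n), m)]

def randomization_distribution(z, b, y):
    """Exact distribution of t(Z, y) = sum of treated outcomes,
    enumerating every within-block re-assignment (hypothetical helper)."""
    members = defaultdict(list)
    for i, block in enumerate(b):
        members[block].append(i)

    per_block = []
    for idx in members.values():
        zs = [z[i] for i in idx]
        ys = [y[i] for i in idx]
        # Contribution of this block to t under each possible assignment.
        per_block.append([sum(yi for zi, yi in zip(a, ys) if zi)
                          for a in block_assignments(zs)])

    # Blocks are assigned independently, so combine contributions by product.
    return Counter(sum(parts) for parts in product(*per_block))

b = ["a", "a", "b", "b"]
y = [3, 5, 2, 4]
dist_full = randomization_distribution([1, 0, 1, 0], b, y)        # both blocks vary
dist_degenerate = randomization_distribution([1, 0, 1, 1], b, y)  # block "b" all treated
print(sorted(dist_full.items()))
print(sorted(dist_degenerate.items()))
```

In this toy example the all-treated block admits a single assignment, so it adds a constant to the statistic and the distribution has fewer distinct values. A summary method could report the number of distinct values so the analyst can spot this.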
Good discussion! Briefly:
Ben Hansen
We need tests and handling for missing data in the various randomization distribution functions.
One wrinkle: what to do when removing missing observations leads to single treatment conditions in a block (e.g. only treated units)? Warning? Error? Crossed-fingers?
Thanks to Megan Eisenman for flagging this issue.
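The wrinkle raised above (deletion leaving a block with a single treatment condition) could be detected along these lines. This is a Python sketch with hypothetical names (`drop_missing_and_check`, `on_degenerate`); whether to warn, raise an error, or do nothing is left as a parameter because the thread does not settle the question.

```python
import warnings
from collections import defaultdict

def drop_missing_and_check(z, b, y, on_degenerate="warn"):
    """Casewise-delete units with a missing z or y, then flag blocks
    left with a single treatment condition (hypothetical helper).

    Returns (kept_indices, degenerate_blocks). A block whose members
    are all deleted disappears entirely and is not reported here.
    """
    kept = [i for i in range(len(z)) if z[i] is not None and y[i] is not None]

    conditions = defaultdict(set)
    for i in kept:
        conditions[b[i]].add(z[i])

    degenerate = sorted(block for block, seen in conditions.items()
                        if len(seen) < 2)
    if degenerate:
        msg = f"blocks with one treatment condition after deletion: {degenerate}"
        if on_degenerate == "error":
            raise ValueError(msg)
        if on_degenerate == "warn":
            warnings.warn(msg)
    return kept, degenerate

# Deleting the control unit in block "a" leaves only a treated unit there.
z = [1, 0, 1, 0]
b = ["a", "a", "b", "b"]
y = [2.0, None, 1.5, 3.0]
kept, degenerate = drop_missing_and_check(z, b, y, on_degenerate="ignore")
print(kept, degenerate)
```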