Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove idle "(element weights)" entries in xbal objects #85

Open
benthestatistician opened this issue Jun 19, 2017 · 10 comments
Open

remove idle "(element weights)" entries in xbal objects #85

benthestatistician opened this issue Jun 19, 2017 · 10 comments
Assignees

Comments

@benthestatistician
Copy link
Collaborator

benthestatistician commented Jun 19, 2017

Element weights are used to normalize the figures that get compared in balanceTest calcs. By definition, they'll ordinarily show up as "1.00", raising the question of why they're being presented in tables and plots.

Reopening b/c the solution may have gone too far. See comments below. Descriptives calculation for (_non-null record_) row seems to need revisiting.

@benthestatistician benthestatistician self-assigned this Jun 19, 2017
@benthestatistician
Copy link
Collaborator Author

benthestatistician commented Jun 19, 2017

Probably the answer is that they shouldn't be presented there. However, I recall leaving them in because I thought there might be some sort of division-by-zero types edge cases where they'd show as something other than 1.00, and we'd want to keep the user informed.

I suspect that now I could think of a better handlings for those edge cases -- indeed, perhaps there are connections to @markmfredrickson's and my work on subgroups and balance. To do:

  • identify next steps on handling division-by-zero scenarios & presenting them to users, creating new issue as needed;
  • clean up xbal outputs, taking away un-meaningful 100% reports.

(My leftover cruft in the xbal objects was an impediment to @josherrickson's start on the ggbalance branch, disussed in #84 comments.)

@benthestatistician
Copy link
Collaborator Author

Here's a clue, I think, about what "(element weights)" should in some way be reflecting on:

we're now automatically treating the adjusted mean difference in element weights as a covariate

(from my commit notes to [master 2672306]).

@benthestatistician
Copy link
Collaborator Author

benthestatistician commented Aug 3, 2017

What I think I need to do is

  • retitle "(element weights)" rows to "(Intercept)"
  • Is there a test to enforce the behavior noted in [master 2672306], close to the passage quoted above, that adjusted mean diff variance = 0 entails NA for corresponding z-statistic? If not then add one.
  • internally to balanceTest or one of the functions it calls, check the "(Intercept)" entry to see if its z statistic is an NA; if so, delete that row before returning it.
  • Add tests the enforce returning the "(Intercept)" entry iff there's variance in the (element weighted and/or cluster size weighted) adjusted mean difference of (Intercept).
  • Update the docs. W/ these changes, the "(Intercept)" entry should only appear when the user has provided weights and/or clustering info, and it'll be a little less confusing that current "(element weights)"; but it'll still strike at least some as mysterious, so there should be a couple of sentences to dispell the mystery, and interpretive ambiguities. Present example with nonzero diff on "(_non-null record_)" due to having a record that was null but that contributed to design calcs.

@benthestatistician
Copy link
Collaborator Author

benthestatistician commented Dec 5, 2017

A commit I'm planning to push up in a day or two ("Changed meaning of 1st col NotMissing slots, NM.* slots"; should get a mention in comment thread to this issue) changes both the name and the meaning of this entry:

  • "Intercept" |--> "_non-null record_"
  • The column gets a TRUE iff record in question has a non-NA entry in at least one field of Covariates.

Purpose: suppose an RCT with some units lost to follow up before their covariates could be collected. These units were a part of the design, so we don't want them deleted before our calculations; if they had been deleted in this way then our inferences of first and second order assignment probabilities would be thrown off. But we should also be tracking imbalance on this form of data missingness. Thus, their inclusion.

Does this reasoning justify presenting the user with "(_non-null record_)" entries in tables and displays, even when there's no variation in this computed variable? I had been thinking that this was just one step along the way to cleanup of those displays, with the next step being to excise it from the displays in those cases. But maybe there's a benefit to reminding the user that balanceTest and similar functions aren't afraid of seeing null records, and have a use for them. In addition, if it's always there in the outputs of balanceTest() and similar functions, that simplifies programming with those functions. If we have print and plot methods for those objects, they might be set up to have an option to take these entries out, and/or we might facilitate that by always presenting them last. Action steps:

  • Get some experience looking at tables with "(_non-null record_)" entries
  • If those entries don't seem too weird, close this issue out w/o the removal of "(Intercept)"/"(_non-null record_)" entries I called for in comments earlier in this thread.

(Additional comment: After working through test failures associated with this change, I see that I had managed to mislead myself about the purpose of "elements weights"/"Intercept", through ambiguous inline comments. That column wasn't there to address any specific statistical edge case, it was there to ensure that the NotMissing slot would always have at least one column, even when there was no missingness, so that the transformation of it in aggregateDesigns and ops using it in alignDesignsByStrata, alignedToInferentials could work in a uniform way whether there was missingness or not. The same comment referenced above touches up those inline comments a little bit; hopefully with this column also gaining some meaning and purpose at the same time, similar ambiguities won't come back to bite us subsequently. )

@jwbowers
Copy link
Collaborator

jwbowers commented Dec 5, 2017 via email

benthestatistician added a commit that referenced this issue Dec 6, 2017
- "Intercept" column of NotMissing removed in favor of "_non-null record_"
- This column need not be all 1's
- If a term has no missingness, this is indicated in NM.terms,
  NM.Covariates by a 0.

cf. #85 , #89
@benthestatistician
Copy link
Collaborator Author

Commit ae0d5a7 removes these entries when they're idle, ie when Treatment and Control means on (_non-null record_) are both 1.

But in the process of this work I noticed a problem with the edge case of just 1 covariate that I don't presently have time to address; instead recording it here.

In test.balanceTest.R, test of "NAs properly handled", we have:

 n <- 20
df <- data.frame(Z = rep(c(0,1), n/2),
                   X1 = rnorm(n),
                   X2 = rnorm(n))
df$X1[1:3] <- NA

bt1 <- balanceTest(Z ~ X1, data = df)
> bt1
                    strata  Unstrat     
                    stat   std.diff    z
vars                                    
X1                             0.13 0.27
(_non-null record_)             NaN   NA

The (_non-null record_) entry is idle here, but separate from that there ought to be an entry for (X1), as we get out of the next example in the test:

  bt2 <- balanceTest(Z ~ X1 + X2, data = df)
> bt2
     strata  Unstrat     
                    stat   std.diff    z
vars                                    
X1                             0.13 0.27
X2                             1.00 2.03
(X1)                           0.27 0.61
(_non-null record_)             NaN   NA

The fact that there is probably a by-product of X1 being the only variable, but I haven't looked into it in any detail.

Commit ae0d5a7 includes a commented-out test that we can enable once this matter is addressed:

 expect_s3_class(bt1, "xbal")
  ## expect_true("(X1)" %in% dimnames(bt1[["results"]])[["vars"]])
  expect_false("(_non-null record_)" %in% dimnames(bt1[["results"]])[["vars"]])
  

Another to-do: give an example of a setup with "(_non-null record_)" Treatment and/or Control means other than 1.

benthestatistician added a commit that referenced this issue Jun 19, 2018
I'm hopeful that this takes care of the main issue, but

it does leave at least 2 loose ends;

see commented-out line of test.balanceTest.R, comments to #85.
@benthestatistician
Copy link
Collaborator Author

Preparing to push up some relevant commits on the i90-broom branch:

  • 2a89a02 addresses the problem w/ the 1 covariate edge case that was noted in my most recent comment above.
  • d12c6d6 tries out "(_any Xs recorded_)" as an alternative to "(_non-null record_)". (More informative; avoids begging the question of how R would know anything about a unit if its record is null.)

In the examples given in that last comment we're now seeing:

> bt1
                    strata():       --     
                    stat      std.diff    z
vars                                       
X1                                0.13 0.27
(_any Xs recorded_)               0.27   NA


> bt2
     strata():       --     
     stat      std.diff    z
vars                        
X1                 0.13 0.27
X2                 1.00 2.03
(X1)               0.27 0.61

(The fact that there's no "(_any Xs recorded_) " line in the second one flows from ae0d5a7, on master as well as in i90-broom .)

@benthestatistician
Copy link
Collaborator Author

Never got around to an example w/ "(_any Xs recorded_)" means other than 1, but there are now tests with this characteristic, and I'm hopeful that the new name will better indicate how it might come about. Closing; reopen for further discussion or fixes.

@benthestatistician
Copy link
Collaborator Author

The name "(_any Xs recorded_)" is more informative than prior iterations ("(element_weights)", "(_non-null record_)"), but I think it's too long. Example:

##                     strata():  thematch
##                     stat      Treatment Control adj.diff std.diff    z
## vars
## grdavg                          68      81       -13      -0.4    -1.4
## pctwht                          34      35       -0.6      0.0     0.8
## pctFRPlunch                     74      76       -1.9     -0.1    -0.6
## readpro                         58      60       -2.6     -0.1    -0.9
## readProficient                  45      44        0.26     0.0     0.4
## (readpro)                       0.98    0.97     0.014     0.1     1.3
## (readProficient)                0.76    0.74     0.01      0.0     1.0
## (_any Xs recorded_)             0.99    0.99     -0.0068  -0.1      NA
## ---Overall Test---
##          chisquare df   p.value
## thematch 8.09      7.00 0.32
## ---
## Signif. codes:

Notice how much horizontal space the variables column consumes, all b/c of this long label. (Additionally in some contexts the underscores mean it renders in italics. Not sure whether this is bug or feature.)

For a replacement how about "(any Xs?)". @jwbowers ?

@benthestatistician
Copy link
Collaborator Author

benthestatistician commented May 15, 2020

Note to self:

  • As opposed to the current string "(_non-null record_)", "*non-null records*" may be easier to read, and in any case would be ever-so-slightly shorter.
  • We do seem to have logic to avoid printing this when it's idle (100% all around).
  • We still do seem to need an example to show what it means when it's not idle. E.g., balT(z~x1+x2) with
z x1 x2
0 1 NA
0 NA NA
1 1 2

Except that presently, this isn't such an example, although it should be. Both treatment and control groups' means on (_non-null record_) get presented as 1's, although the Control mean should really be 1/2.

dat = matrix(c( 0,1,NA,  0,NA,NA,  1,1,2 ),
                      3,3, 
                      byrow=T, 
                      dimnames=list(1:3, c("z", "x1", "x2"))
                      )
dat  <- as.data.frame(dat)
balanceTest(z~x1+x2, dat=dat, report="all")[['results']]
, , strata = --

      stat
vars   Control Treatment std.diff adj.diff pooled.sd     z     p
  x1         1         1      NaN        0         0 0.707 0.959
  x2       NaN         2      NaN      NaN       NaN 0.707 0.959
  (x2)       0         1      Inf        1         0 1.414 0.472

attr(,"originals")
[1] 1 2 0
attr(,"term.labels")
[1] "x1" "x2"
attr(,"include.NA.flags")
[1] TRUE

Dropping inside the balT call in order to pull out the adjusted difference of totals generates a more specific indication of what should have been shown:

cbind(descriptives[,c(1,2,4),1, drop=T], adj.diff.of.totals=tmp$`--`$adj.diff.of.totals)
                    Control Treatment adj.diff adj.diff.of.totals
x1                        1         1        0                0.5
x2                      NaN         2      NaN                1.0
(x2)                      0         1        1                1.0
(_non-null record_)       1         1        0                0.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants