remove idle "(element weights)" entries in xbal objects #85

benthestatistician · 2017-06-19T18:42:50Z

Element weights are used to normalize the figures that get compared in balanceTest calcs. By definition, they'll ordinarily show up as "1.00", raising the question of why they're being presented in tables and plots.

Reopening b/c the solution may have gone too far. See comments below. Descriptives calculation for (_non-null record_) row seems to need revisiting.

The text was updated successfully, but these errors were encountered:

benthestatistician · 2017-06-19T18:50:13Z

Probably the answer is that they shouldn't be presented there. However, I recall leaving them in because I thought there might be some sort of division-by-zero types edge cases where they'd show as something other than 1.00, and we'd want to keep the user informed.

I suspect that now I could think of a better handlings for those edge cases -- indeed, perhaps there are connections to @markmfredrickson's and my work on subgroups and balance. To do:

identify next steps on handling division-by-zero scenarios & presenting them to users, creating new issue as needed;
~~clean up xbal outputs, taking away un-meaningful 100% reports~~.

(My leftover cruft in the xbal objects was an impediment to @josherrickson's start on the ggbalance branch, disussed in #84 comments.)

benthestatistician · 2017-08-03T03:07:50Z

Here's a clue, I think, about what "(element weights)" should in some way be reflecting on:

we're now automatically treating the adjusted mean difference in element weights as a covariate

(from my commit notes to [master 2672306]).

benthestatistician · 2017-08-03T14:35:57Z

What I think I need to do is

retitle "(element weights)" rows to "(Intercept)"
Is there a test to enforce the behavior noted in [master 2672306], close to the passage quoted above, that adjusted mean diff variance = 0 entails NA for corresponding z-statistic? If not then add one.
internally to balanceTest or one of the functions it calls, check the "(Intercept)" entry to see if its z statistic is an NA; if so, delete that row before returning it.
Add tests the enforce returning the "(Intercept)" entry iff there's variance in the (element weighted and/or cluster size weighted) adjusted mean difference of (Intercept).
Update the docs. W/ these changes, the "(Intercept)" entry should only appear when the user has provided weights and/or clustering info, and it'll be a little less confusing that current "(element weights)"; but it'll still strike at least some as mysterious, so there should be a couple of sentences to dispell the mystery, and interpretive ambiguities. Present example with nonzero diff on "(_non-null record_)" due to having a record that was null but that contributed to design calcs.

cf #85

benthestatistician · 2017-12-05T15:20:06Z

A commit I'm planning to push up in a day or two ("Changed meaning of 1st col NotMissing slots, NM.* slots"; should get a mention in comment thread to this issue) changes both the name and the meaning of this entry:

"Intercept" |--> "_non-null record_"
The column gets a TRUE iff record in question has a non-NA entry in at least one field of Covariates.

Purpose: suppose an RCT with some units lost to follow up before their covariates could be collected. These units were a part of the design, so we don't want them deleted before our calculations; if they had been deleted in this way then our inferences of first and second order assignment probabilities would be thrown off. But we should also be tracking imbalance on this form of data missingness. Thus, their inclusion.

Does this reasoning justify presenting the user with "(_non-null record_)" entries in tables and displays, even when there's no variation in this computed variable? I had been thinking that this was just one step along the way to cleanup of those displays, with the next step being to excise it from the displays in those cases. But maybe there's a benefit to reminding the user that balanceTest and similar functions aren't afraid of seeing null records, and have a use for them. In addition, if it's always there in the outputs of balanceTest() and similar functions, that simplifies programming with those functions. If we have print and plot methods for those objects, they might be set up to have an option to take these entries out, and/or we might facilitate that by always presenting them last. Action steps:

Get some experience looking at tables with "(_non-null record_)" entries
If those entries don't seem too weird, close this issue out w/o the removal of "(Intercept)"/"(_non-null record_)" entries I called for in comments earlier in this thread.

(Additional comment: After working through test failures associated with this change, I see that I had managed to mislead myself about the purpose of "elements weights"/"Intercept", through ambiguous inline comments. That column wasn't there to address any specific statistical edge case, it was there to ensure that the NotMissing slot would always have at least one column, even when there was no missingness, so that the transformation of it in aggregateDesigns and ops using it in alignDesignsByStrata, alignedToInferentials could work in a uniform way whether there was missingness or not. The same comment referenced above touches up those inline comments a little bit; hopefully with this column also gaining some meaning and purpose at the same time, similar ambiguities won't come back to bite us subsequently. )

jwbowers · 2017-12-05T16:44:04Z

I like those action steps and the instinct to be clear about treatment of missingness.

…

On Tue, Dec 5, 2017 at 9:20 AM, Ben ***@***.***> wrote: A commit I'm planning to push up in a day or two ("Changed meaning of 1st col NotMissing slots, NM.* slots"; should get a mention in comment thread to this issue) changes both the name and the meaning of this entry: - "Intercept" |--> "_non-null record_" - The column gets a TRUE iff record in question has a non-NA entry in at least one field of Covariates. Purpose: suppose an RCT with some units lost to follow up before their covariates could be collected. These units were a part of the design, so we don't want them deleted before our calculations; if they had been deleted in this way then our inferences of first and second order assignment probabilities would be thrown off. But we should also be tracking imbalance on this form of data missingness. Thus, their inclusion. Does this reasoning justify presenting the user with "(_non-null record_)" entries in tables and displays, even when there's no variation in this computed variable? I had been thinking that this was just one step along the way to cleanup of those displays, with the next step being to excise it from the displays in those cases. But maybe there's a benefit to reminding the user that balanceTest and similar functions aren't afraid of seeing null records, and have a use for them. In addition, if it's always there in the outputs of balanceTest() and similar functions, that simplifies programming with those functions. If we have print and plot methods for those objects, they might be set up to have an option to take these entries out, and/or we might facilitate that by always presenting them last. Action steps: - Get some experience looking at tables with "(_non-null record_)" entries - If those entries don't seem too weird, close this issue out w/o the removal of "(Intercept)"/"(_non-null record_)" entries I called for in comments earlier in this thread. (Additional comment: After working through test failures associated with this change, I see that I had managed to mislead myself about the purpose of "elements weights"/"Intercept", through ambiguous inline comments. That column wasn't there to address any specific statistical edge case, it was there to ensure that the NotMissing slot would always have at least one column, even when there was no missingness, so that the transformation of it in aggregateDesigns and ops using it in alignDesignsByStrata, alignedToInferentials could work in a uniform way whether there was missingness or not. The same comment referenced above touches up those inline comments a little bit; hopefully with this column also gaining some meaning and purpose at the same time, similar ambiguities won't come back to bite us subsequently. ) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#85 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAiyTFbh80lTJwJW7minzFSFrW6RAktZks5s9V8ngaJpZM4N-oKd> .

- "Intercept" column of NotMissing removed in favor of "_non-null record_" - This column need not be all 1's - If a term has no missingness, this is indicated in NM.terms, NM.Covariates by a 0. cf. #85 , #89

benthestatistician · 2018-06-19T22:44:01Z

Commit ae0d5a7 removes these entries when they're idle, ie when Treatment and Control means on (_non-null record_) are both 1.

But in the process of this work I noticed a problem with the edge case of just 1 covariate that I don't presently have time to address; instead recording it here.

In test.balanceTest.R, test of "NAs properly handled", we have:

 n <- 20
df <- data.frame(Z = rep(c(0,1), n/2),
                   X1 = rnorm(n),
                   X2 = rnorm(n))
df$X1[1:3] <- NA

bt1 <- balanceTest(Z ~ X1, data = df)
> bt1
                    strata  Unstrat     
                    stat   std.diff    z
vars                                    
X1                             0.13 0.27
(_non-null record_)             NaN   NA

The (_non-null record_) entry is idle here, but separate from that there ought to be an entry for (X1), as we get out of the next example in the test:

  bt2 <- balanceTest(Z ~ X1 + X2, data = df)
> bt2
     strata  Unstrat     
                    stat   std.diff    z
vars                                    
X1                             0.13 0.27
X2                             1.00 2.03
(X1)                           0.27 0.61
(_non-null record_)             NaN   NA

The fact that there is probably a by-product of X1 being the only variable, but I haven't looked into it in any detail.

Commit ae0d5a7 includes a commented-out test that we can enable once this matter is addressed:

 expect_s3_class(bt1, "xbal")
  ## expect_true("(X1)" %in% dimnames(bt1[["results"]])[["vars"]])
  expect_false("(_non-null record_)" %in% dimnames(bt1[["results"]])[["vars"]])

Another to-do: give an example of a setup with "(_non-null record_)" Treatment and/or Control means other than 1.

I'm hopeful that this takes care of the main issue, but it does leave at least 2 loose ends; see commented-out line of test.balanceTest.R, comments to #85.

benthestatistician · 2018-06-28T16:14:58Z

Preparing to push up some relevant commits on the i90-broom branch:

2a89a02 addresses the problem w/ the 1 covariate edge case that was noted in my most recent comment above.
d12c6d6 tries out "(_any Xs recorded_)" as an alternative to "(_non-null record_)". (More informative; avoids begging the question of how R would know anything about a unit if its record is null.)

In the examples given in that last comment we're now seeing:

> bt1
                    strata():       --     
                    stat      std.diff    z
vars                                       
X1                                0.13 0.27
(_any Xs recorded_)               0.27   NA


> bt2
     strata():       --     
     stat      std.diff    z
vars                        
X1                 0.13 0.27
X2                 1.00 2.03
(X1)               0.27 0.61

(The fact that there's no "(_any Xs recorded_) " line in the second one flows from ae0d5a7, on master as well as in i90-broom .)

benthestatistician · 2018-06-28T20:02:02Z

Never got around to an example w/ "(_any Xs recorded_)" means other than 1, but there are now tests with this characteristic, and I'm hopeful that the new name will better indicate how it might come about. Closing; reopen for further discussion or fixes.

benthestatistician · 2018-07-13T15:33:01Z

The name "(_any Xs recorded_)" is more informative than prior iterations ("(element_weights)", "(_non-null record_)"), but I think it's too long. Example:

##                     strata():  thematch
##                     stat      Treatment Control adj.diff std.diff    z
## vars
## grdavg                          68      81       -13      -0.4    -1.4
## pctwht                          34      35       -0.6      0.0     0.8
## pctFRPlunch                     74      76       -1.9     -0.1    -0.6
## readpro                         58      60       -2.6     -0.1    -0.9
## readProficient                  45      44        0.26     0.0     0.4
## (readpro)                       0.98    0.97     0.014     0.1     1.3
## (readProficient)                0.76    0.74     0.01      0.0     1.0
## (_any Xs recorded_)             0.99    0.99     -0.0068  -0.1      NA
## ---Overall Test---
##          chisquare df   p.value
## thematch 8.09      7.00 0.32
## ---
## Signif. codes:

Notice how much horizontal space the variables column consumes, all b/c of this long label. (Additionally in some contexts the underscores mean it renders in italics. Not sure whether this is bug or feature.)

For a replacement how about "(any Xs?)". @jwbowers ?

benthestatistician · 2020-05-15T15:37:27Z

Note to self:

As opposed to the current string "(_non-null record_)", "*non-null records*" may be easier to read, and in any case would be ever-so-slightly shorter.
We do seem to have logic to avoid printing this when it's idle (100% all around).
We still do seem to need an example to show what it means when it's not idle. E.g., balT(z~x1+x2) with

z	x1	x2
0	1	NA
0	NA	NA
1	1	2

Except that presently, this isn't such an example, although it should be. Both treatment and control groups' means on (_non-null record_) get presented as 1's, although the Control mean should really be 1/2.

dat = matrix(c( 0,1,NA,  0,NA,NA,  1,1,2 ),
                      3,3, 
                      byrow=T, 
                      dimnames=list(1:3, c("z", "x1", "x2"))
                      )
dat  <- as.data.frame(dat)
balanceTest(z~x1+x2, dat=dat, report="all")[['results']]
, , strata = --

      stat
vars   Control Treatment std.diff adj.diff pooled.sd     z     p
  x1         1         1      NaN        0         0 0.707 0.959
  x2       NaN         2      NaN      NaN       NaN 0.707 0.959
  (x2)       0         1      Inf        1         0 1.414 0.472

attr(,"originals")
[1] 1 2 0
attr(,"term.labels")
[1] "x1" "x2"
attr(,"include.NA.flags")
[1] TRUE

Dropping inside the balT call in order to pull out the adjusted difference of totals generates a more specific indication of what should have been shown:

cbind(descriptives[,c(1,2,4),1, drop=T], adj.diff.of.totals=tmp$`--`$adj.diff.of.totals)
                    Control Treatment adj.diff adj.diff.of.totals
x1                        1         1        0                0.5
x2                      NaN         2      NaN                1.0
(x2)                      0         1        1                1.0
(_non-null record_)       1         1        0                0.5

benthestatistician self-assigned this Jun 19, 2017

benthestatistician mentioned this issue Aug 3, 2017

switch plot.xbal to ggplot2 #84

Open

benthestatistician added a commit that referenced this issue Nov 30, 2017

"(element weights)" |--> "(Intercept)"

1d6844d

cf #85

benthestatistician added a commit that referenced this issue Jun 19, 2018

#85: remove idle xbal entries

ae0d5a7

I'm hopeful that this takes care of the main issue, but it does leave at least 2 loose ends; see commented-out line of test.balanceTest.R, comments to #85.

benthestatistician closed this as completed Jun 28, 2018

benthestatistician mentioned this issue May 15, 2020

balanceTest() examples for less common answers, function arguments #105

Open

3 tasks

benthestatistician mentioned this issue May 22, 2020

I47 align on totals #116

Merged

benthestatistician reopened this Jun 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove idle "(element weights)" entries in xbal objects #85

remove idle "(element weights)" entries in xbal objects #85

benthestatistician commented Jun 19, 2017 •

edited

Loading

benthestatistician commented Jun 19, 2017 •

edited

Loading

benthestatistician commented Aug 3, 2017

benthestatistician commented Aug 3, 2017 •

edited

Loading

benthestatistician commented Dec 5, 2017 •

edited

Loading

jwbowers commented Dec 5, 2017 via email

benthestatistician commented Jun 19, 2018

benthestatistician commented Jun 28, 2018

benthestatistician commented Jun 28, 2018

benthestatistician commented Jul 13, 2018

benthestatistician commented May 15, 2020 •

edited

Loading

remove idle "(element weights)" entries in xbal objects #85

remove idle "(element weights)" entries in xbal objects #85

Comments

benthestatistician commented Jun 19, 2017 • edited Loading

benthestatistician commented Jun 19, 2017 • edited Loading

benthestatistician commented Aug 3, 2017

benthestatistician commented Aug 3, 2017 • edited Loading

benthestatistician commented Dec 5, 2017 • edited Loading

jwbowers commented Dec 5, 2017 via email

benthestatistician commented Jun 19, 2018

benthestatistician commented Jun 28, 2018

benthestatistician commented Jun 28, 2018

benthestatistician commented Jul 13, 2018

benthestatistician commented May 15, 2020 • edited Loading

benthestatistician commented Jun 19, 2017 •

edited

Loading

benthestatistician commented Jun 19, 2017 •

edited

Loading

benthestatistician commented Aug 3, 2017 •

edited

Loading

benthestatistician commented Dec 5, 2017 •

edited

Loading

benthestatistician commented May 15, 2020 •

edited

Loading