Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch plot.xbal to ggplot2 #84

Open
josherrickson opened this issue Jun 16, 2017 · 20 comments
Open

switch plot.xbal to ggplot2 #84

josherrickson opened this issue Jun 16, 2017 · 20 comments
Assignees

Comments

@josherrickson
Copy link
Collaborator

Right now plot.xbal has a lot of issues (see eg #82 #80 and another unreported issue that vertical margins look wonky). Have we considered moving to ggplot2 instead of base R plotting? It would address a lot of the concerns immediately. We can get 90% of the way there very quickly:

library(optmatch)
library(RItools)
data(nuclearplants)

fm <- fullmatch(pr ~ ., data = nuclearplants)

bal <- xBalance(pr ~ . + strata(fm), data = nuclearplants)

x <- as.data.frame(RItools:::prepareXbalForPlot(bal))

library(tidyverse)
x <- rownames_to_column(x)
x <- gather(x, strata, values, -rowname)

# x<- mutate(values = abs(values)) # enable for `abs = TRUE`.

ggplot(x, aes(y = rowname, x = values, color = strata, shape = strata)) + 
  geom_vline(xintercept = 0) +
  geom_point() +
  theme(legend.position = "bottom")

rplot03

@markmfredrickson
Copy link
Owner

I would support this. I always wanted to like lattice graphics, but could never really get them to do what I wanted so I went with base on the last rewrite. As I recall, I think the big reason was I wanted to group dummies from factor variables. If we can do that, with ggplot, then I'm all for it.

@josherrickson
Copy link
Collaborator Author

ggplot, by default, is sorting the y-axis alphabetically, so as long as you don't have oddity like a variable named "XYc" and a factor variable named "XY with levels "a", "b" and "d", it will group all factors together.

@jwbowers
Copy link
Collaborator

I would also support this as long as the factor grouping could happen nicely. I agree with wanting to like but not really liking lattice graphics. I might advocate +theme_bw() just on the basis that it uses less ink (maybe there is a strong psycho-visual argument in favor of gray backgrounds?).

@benthestatistician
Copy link
Collaborator

@markmfredrickson @rstudley FYI I think we're going to need to address this issue in the process of RCT work on EE. (Everybody else, sorry for being cryptic.)

@josherrickson
Copy link
Collaborator Author

@benthestatistician To un-cryptic that a bit, are you implying this will be handled elsewhere and we can close this issue? Or that it would be beneficial to move ahead with this idea?

@rstudley
Copy link

rstudley commented Jun 19, 2017 via email

@benthestatistician
Copy link
Collaborator

benthestatistician commented Jun 19, 2017

@josherrickson I wasn't intending to imply anything one way or another about where the issue will be handled, rather to inform @markmfredrickson and @rstudley that the work being discussed under this thread connects to another stream of upcoming work that they (and I) will need to plan for. I just (cryptically) linked this issue to another one, leaving open the question of who should grab this issue and run with it.

The two issues are

  1. Should RItools's plot.xbal start fresh with a ggplot2 implementation, or fix up the existing base graphics implementation?
  2. Should (Mark's, Roger's and my) EE project continue to maintain and extend the existing base R based plot.xbal?
    My own answers would be, for the time being, yes and yes, with the EE project seeing now things go with the ggplot2 based function before switching over from what it's got.

In support of my yes to (1), I think that switching over to ggplot2 will make the plotting function easier to maintain. I defer to Jake and others on aesthetics, so if he's happy (after we accept his +theme_bw() amendment, which I support), then I'm happy too. I understand Mark's main concern to have been (largely) addressed. So there's nothing in the way of work getting started on this. In support of my yes to (2), Mark is currently responsible for implementation of EE's plotting, and I'm not sure it makes sense to ask him immediately set aside the work he's already done; also he seems to be the least familiar with ggplot2 of all of us. Rather, it seems to make more sense for work to proceed on these two streams in parallel, at least for the time being.

@josherrickson would you be willing to start us off with a new branch implementing a ggplot2-based plot.xbal? There needn't be any promise of completeness, just enough to play with. I'd like to play with it and potentially contribute; perhaps @jwbowers could do the same.

When Mark turns his attention back toward EE implementation, with @rstudley's blessing might ask him to contribute a little (<=5hrs) to whatever may remain. Even if these contributions won't necessarily get incorporated into EE immediately, I think the potential savings in coder time for the EE are sufficiently large that having Mark devote a bit of his time on that project to this effort is likely to save money for the EE project, on balance. (After an hour or two of working on the new implementation, he may well find that with a couple more hours spent adapting this work to EE he could spare himself another 10 hours that would have otherwise had to be spent.)

@rstudley
Copy link

rstudley commented Jun 19, 2017 via email

@josherrickson
Copy link
Collaborator Author

josherrickson commented Jun 19, 2017

Threw together a quick function here: 387950c.

data(nuclearplants)

library(optmatch)
fm <- fullmatch(pr ~ ct + cost + t1 + t2, data = nuclearplants)

bal <- balanceTest(pr ~ ct + cost + t1 + t2 + strata(fm), data = nuclearplants)

plot <- plotggxbal(bal, color = c('red', 'green'),
                   strata.labels = c("fm" = "Matched", 
                                     "Unstrat" = "Unmatched" ),
                   variable.labels = c("ct" = "Cooling Tower",
                                       "cost" = "Cost",
                                       "t1" = "Time 1",
                                       "t2" = "Time 2",
                                       "(element weight)" = "(element weight)"),
                   legend.position = "bottom",
                   legend.title = "Strata",
                   var.order = c("Time 2", "Cost", "Cooling Tower",
                                 "(element weight)", "Time 1"))

plot

rplot

Notes:

  1. I wrote this as a standalone function for now to postpone needing to think about how this will interact with the current plot.xbal.
  2. I used some other tidyverse packages (tidyr, dplyr, tibble) to simplify some code. They are not required if those dependencies are undesired; they can be replaced by base R data.frame hacking.
  3. I'm not familiar with the redesigned balanceTest call; it was producing the (element weight) object that I couldn't immediately see how to remove. I left it in, I'm sure there's some easy solution to remove it.
  4. The returned object is a gg object so it can be further tweaked, e.g. @jwbowers
library(ggplot2)
plot + theme_bw()

You can even override parts which were defined in the function.

plot + theme_bw() + theme(legend.position = "right")

@benthestatistician
Copy link
Collaborator

👍 🥇 Thanks @josherrickson !

I don't think the dependencies should be a problem. (Even in the EE, Mark's done some nice work to make it easier to keep an R installation current.)

@benthestatistician
Copy link
Collaborator

Would there be a reasonably easy way to represent groupings of variables on these plots, by adding

  1. Vertical spacing between defined groups of variables, and/or
  2. Variable group header labels, along the y-axis?

It would be something along lines of what you see here:

balplot-withsense-large-n

(Ignore "Outcome sensitivity" stuff at the bottom, at least for now.)

(Aside: current plans for #85 call for renaming "(element weights)" and removing it most of the time but not all of the time.)

@josherrickson
Copy link
Collaborator Author

josherrickson commented Aug 3, 2017

I can think of two ways to do that.

  1. Create "Reading", "Math", etc, variables whose values are NA. Would have to do some hijinks with factors and strings because it doesn't look like geom_point supports categorical y-axis values.

  2. Assign each variable a y-location (e.g. "Male" would be 1, "Female" 2, then skip to 6 for "Migrant, 2 yrs") and then manually create the y-axis labels, including label "Gender" at 4 even though it has no values.

I think 2 is more likely to be an easier implementation, but I'd be hesitant to do it - anyway I can think of actually implementing it would be fragile.

@josherrickson
Copy link
Collaborator Author

josherrickson commented Aug 4, 2017

I rethought 2., and think it might be feasible. I added some preliminary work to the branch in f2caf76. What remains is associating each "variable" (in quotes because the variables now include the group labels) with a y-value. Perhaps something like adding additional spacing above each group label?

Edit: The reason I rethought it was that ggplot is less fickle about plotting region and margins than base R - it should handle things properly in general.

@benthestatistician
Copy link
Collaborator

@josherrickson in an offline conversation you mentioned that tests were failing on the master branch in a way that was blocking your progress. When I run the tests on the master branch, the only failure I'm seeing is

Failed -------------------------------------------------------------------------
1. Failure: preparing xbalance objects for plotting, includes groups (@test.plot.xbal.R#216) 
all(...) isn't true.

more specifically, here:

   xbp <- RItools:::prepareXbalForPlot(xb)
   grps <- attr(xbp, "groups")
   expect_true(all(grps[!is.na(grps)] %in% c("x2", "x3", "x1:x2", "x2:x3", "x1:x3", "x1:x2:x3")))

Can you confirm that this is the trouble you have in mind?

(It happens that I don't have an RSVGTips on this computer, so it's skipping those tests, with a warning. If you're seeing issues related to RSVGTips, I'd encourage you to set them aside if possible.)

@josherrickson
Copy link
Collaborator Author

I get that error, an additional RSVGTips error, and several warnings. The warnings may be specific to my system.

Warnings -----------------------------------------------------------------------
1. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#154) - unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found

2. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#170) - unable to load shared object '/Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so':
  dlopen(/Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so, 6): Library not loaded: /opt/X11/lib/libcairo.2.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so
  Reason: image not found

3. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#170) - failed to load cairo DLL

4. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#173) - cannot remove file '/var/folders/k0/pq3644jx7v91201_5_sm73bh0000gn/T//RtmpOpdxyV/filee3c364f9abf3', reason 'No such file or directory'

Failed -------------------------------------------------------------------------
1. Failure: preparing xbalance objects for plotting, includes groups (@test.plot.xbal.R#216) 
all(...) isn't true.


2. Error: Plotting using RSVGTips (@test.plot.xbal.R#243) ----------------------
subscript out of bounds
1: .handleSimpleError(function (e) 
   {
       e$call <- sys.calls()[(frame + 11):(sys.nframe() - 2)]
       register_expectation(e, frame + 11, sys.nframe() - 2)
       signalCondition(e)
   }, "subscript out of bounds", quote(xb$results[, "std.diff", 2])) at /Users/josh/repositories/r/ritools/tests/testthat/test.plot.xbal.R:243
2: eval(code, test_env)

@markmfredrickson
Copy link
Owner

As per a conversation with Ben, we're going to use the ggplot function that Josh came up with for the results of balanceTest, rather than xBalance. I've been able to use ggplot's "facet" function to get the variable groups working:
issue-84-example

On the API, I've copied the existing plot function signature, which had fewer options than Josh had in his function. I did this because with ggplot, the result of our plot function can be manipulated to have different text, colors, etc. I've added some documentation on this point. Here is an example that updates the previous plot with the black and white theme and different text.

plot(bt) + ggplot2::theme_bw() + ggplot2::scale_color_manual(values = c("black", "black")) + ggplot2::labs(color = "My label", shape = "My label"

issue-84-example-bw

I'm going to clean up the other failing tests, and I think we might be able to close this issue.

@benthestatistician
Copy link
Collaborator

benthestatistician commented Jan 2, 2018 via email

@markmfredrickson
Copy link
Owner

@benthestatistician There is a blocking issue right now related to missing data in balanceTest (#92). I'll need to figure that out before I can show an example. I think the default behavior will be to create a facet for each variable with missingness and then group each variable with it's missing indicator, but I'll have to double check that.

@markmfredrickson
Copy link
Owner

Here is an updated version of the plot now that issue #92 was resolved. I fixed a small bug that was causing the missing indicator to appear in a different variable group. As you can see the (X1) column appears along with X1.

issue-84-with-nas

Comments?

@benthestatistician
Copy link
Collaborator

Interesting. As noted above my thinking had been to group all the missingness indicator vars together amongst themselves, perhaps as their own variable group. But perhaps we should discuss and consider a bit before going down that route. Here are my thoughts/reactions about the current rendering scheme:

  1. I'm guessing that using facets to create a dedicated group for missingness indicators sounds like a minor extension of what we've got now -- would you agree, @markmfredrickson?
  2. It can happen that multiple variables share a missingness pattern. Our current rule is to name that pattern with reference to the first term in the generating model formula that manifests it: if the model formula was Z ~ X1 + X2 +X3, with X1 and X2 sharing the same missingness pattern, that pattern is tagged as (X1), not (X2). I suspect that the current implementation would display an (X1) row once, beneath X1, and neither (X1) nor (X2) beneath X2. This seems a little misleading to the reader: If we're going to group missingness indicators with the variables they indicate missingness for, then we should be presenting an (X2) row, identical to the (X1) row, but positioned just below X2.
  3. If our practice is to present e.g. (X) just beneath X, it would be better to present the (_non-null record_) row at the bottom rather than at the top. Any ideas about how to neatly engineer this? (Tweaks to the character string identifying it being fair game if they help.)

@benthestatistician benthestatistician changed the title switch plot.xbal to ggplot2? switch plot.xbal to ggplot2 Aug 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants