Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to use function ppc_stat_grouped() on data including NAs #236

Open
juthzi opened this issue Sep 3, 2020 · 9 comments
Open

Trying to use function ppc_stat_grouped() on data including NAs #236

juthzi opened this issue Sep 3, 2020 · 9 comments

Comments

@juthzi
Copy link

juthzi commented Sep 3, 2020

Hello everyone,

I would like to use the function ppc_stat_grouped() on data that includes NAs.

model_subj_shift_brm %>% posterior_predict(draws = 5) %>% ppc_stat_grouped(y = df_all_subjects$answer, group = df_all_subjects$diagnosis, stat = "median")

I get the following error:
Error: NAs not allowed in 'y'.

I tried deleting NAs in that column:
model_subj_shift_brm %>% posterior_predict(draws = 5) %>% ppc_stat_grouped(y = na.omit(df_all_subjects$answer), group = df_all_subjects$diagnosis, stat = "median")

I get the following error message:
Error: 'y' must be a vector or 1D array.

Thank you for any advice on how to run this function.
Juliane

@jgabry
Copy link
Member

jgabry commented Sep 3, 2020

I think that's because the na.omit() function returns a weird object with class "omit":

df <- data.frame(a = c(1, 3, NA, 4))
na.omit(df$a)
[1] 1 3 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"

Maybe either of these would work:

df_all_subjects$answer[!is.na(df_all_subjects$answer)]
with(df_all_subjects, answer[!is.na(answer)])

@juthzi
Copy link
Author

juthzi commented Sep 4, 2020

Thank you, jgabry, for your advice. Indeed, it seems to work better with the following:
df_all_subjects$answer[!is.na(df_all_subjects$answer)]

However, now I get a different error:
Error: ncol(yrep) must be equal to length(y).

This makes sense, since the simulated values will not contain NA values. Does anybody have an idea how to work around this issue?

@jgabry
Copy link
Member

jgabry commented Sep 4, 2020

This makes sense, since the simulated values will not contain NA values.

Is that because you have missing data but are still making predictions for the missing data points?

Since you're using stat="median" one option is to create a variable, say y2, that is the same as y but the NAs are replaced with other values such that median(y2) = median(y, na.rm=TRUE). So if you had

y <- c(1, 2, 3, 4, NA)
median(y, na.rm=TRUE) # median 2.5

you can instead use

y2 <- c(1, 2, 3, 4, 2.5)
median(y2) # median still 2.5

In the future, I think we could relax the requirement of no NAs for the ppc_stat plots. We really don't need to error if there are NAs in y, we just need to compute the statistic dropping NAs.

@juthzi
Copy link
Author

juthzi commented Sep 5, 2020

Is that because you have missing data but are still making predictions for the missing data points?
Yes, that must be it.

As you suggested, I substituted the NAs with the subject's median and ran the above function again:
model_subj_shift_brm %>% posterior_predict(draws = 500) %>% ppc_stat_grouped(y = df_all_subjects$answer_na_is_median, group = df_all_subjects$diagnosis, stat = "median").

However, I get an error:

Error: ncol(yrep) must be equal to length(y).

The length of y and group are the same, but the length of posterior_predict is different. should I repeat the whole modelling process with the NA-substitutes before running the ppc_stat_grouped-function?

@jgabry
Copy link
Member

jgabry commented Sep 5, 2020

Hmm, I'm not sure I understand. Why is the number of columns of yrep not equal to the number of elements of y? If you replace (instead of remove) the NAs in y then shouldn't it have the same number of observations as yrep?

@jgabry
Copy link
Member

jgabry commented Sep 5, 2020

I just made a branch of bayesplot that should work if there are NAs in y for ppc_stat() and ppc_stat_grouped(). You can install it with:

devtools::install_github("stan-dev/bayesplot", ref = "ppc_stat_allow_NA")

Does that let you use the original code you wanted to use without errors?

@juthzi
Copy link
Author

juthzi commented Sep 6, 2020

Thank you for taking the time! I installed it and tried to run it. Now I do not get the NA-related error anymore. But I still get the error of different lengths:

Error: ncol(yrep) must be equal to length(y).

I do not understand this, either. When I ask for the lengths, I get the following:

length(df_all_subjects$answer)
[1] 2160
length(df_all_subjects$diagnosis)
[1] 2160
length(model_subj_shift_brm %>% posterior_predict(draws = 500))
[1] 42640000

So, the length of the posterior predictions is much longer than the other ones, but I thought this was normal, because I expect there will be many different draws.

Maybe I did something wrong when creating the model, or maybe I am using the function wrong?

@jgabry
Copy link
Member

jgabry commented Sep 10, 2020

Sorry for the delay. The length of the posterior predictions will be longer because there are many draws per observation. It's the number of columns in the predictions that needs to match the length of y. But apparently that's not the case here either. What do you get for ncol(model_subj_shift_brm %>% posterior_predict(draws = 500))?

Also, is it possible to share the data and code you're using for this so I can reproduce it on my end? That would probably help me figure this out much quicker. If the data is private then fake data that also demonstrates this problem is totally fine.

@juthzi
Copy link
Author

juthzi commented Oct 20, 2020

Sorry for my late reply. I ran into problems again. I downsized the dataset and uploaded it to this link, together with the brms model I created:
https://uni-koeln.sciebo.de/s/XbolWaNdeRjtj9v

When I run the following, I get 1066 columns:

ncol(model_subj_shift_brm_downsized %>% posterior_predict(draws = 5))

Then, I tried running the code (below), but I get the following error:

Error in validate_y(y, allow_NA = TRUE) : is.numeric(y) is not TRUE

# load packages
library(tidyverse)
library(brms)    
library(bayesplot)

## set working directory 
current_path = rstudioapi::getActiveDocumentContext()$path 
setwd(dirname(current_path))

# load the dataframe for all subjects
df_all_subjects_downsized <- read.csv('df_all_subjects_downsized.csv', stringsAsFactors=FALSE)
df_all_subjects_downsized <- df_all_subjects_downsized %>% 
  mutate(answer = as.numeric(as.character(answer)),
         item = as.numeric(as.character(item)),
         subject_shift = as.factor(subject_shift),
         diagnosis = as.factor(diagnosis))

# sum-code the factors to be able to interpret the effects from the model output and the intercept will be the mean
contrasts(df_all_subjects_downsized$subject_shift) <- c(-0.5, 0.5) # set the contrast manually
contrasts(df_all_subjects_downsized$diagnosis) <- c(-0.5, 0.5) # set the contrast manually

# run the model
model_subj_shift_brm_downsized = brm(answer ~ 1 + subject_shift * diagnosis + # 1 is included for clarity, it will be included anyway
                                       (1 + subject_shift|subj_uid) + (1 + subject_shift|item), 
                                     data = df_all_subjects_downsized %>% filter(type %in% c('c', 'd')), 
                                     family = cumulative('probit'), #indicates ordinal modelling
                                     warmup = 1000, 
                                     iter = 2000, # reduce iterations to e.g. 5000 and warmup to 1000, if you want to roughly get an impression. 
                                     # This many iterations takes quite some time. Remember that the number of warmups is included in the number of iterations. 
                                     # It is recommended to use a total of 40000 samples for later estimating the Bayes Factor, i.e. 10000 per chain. 
                                     # Therefore we need to use 12000 iterations if we do warmups for 2000 iterations. 
                                     save_all_pars = TRUE, # this is needed for later calculating Bayes factor
                                     set.seed(336))

# save/load the model in the working folder 
save(model_subj_shift_brm_downsized, file = 'model_subj_shift_brm_downsized')
load('model_subj_shift_brm_downsized')

model_subj_shift_brm_downsized %>%
  posterior_predict(draws = 5) %>% # take 5 instead of 500 for checking
  ppc_stat_grouped(y = model_subj_shift_brm_downsized$answer,
                   group = df_all_subjects_downsized$diagnosis,
                   stat = "median")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants