R inference error in calculating likelihood when data is missing/NA #286

Open
saraloo opened this issue Aug 7, 2024 · 1 comment
Labels: bug, high priority, r-inference

saraloo commented Aug 7, 2024

Describe the bug

Inconsistent and unwanted behaviour occurs when there are NAs in the ground truth fitting data. Currently, in classical R inference (inference_slot.R, around line 270), the observations/fitting data are read in and all NAs are replaced with 0. This is not the behaviour we want: the NAs should be kept and dealt with later, when we aggregate to whatever time period is required (e.g. a week).
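To illustrate why the zero replacement matters, here is a minimal base-R sketch with made-up daily counts. The vector, the week index, and the `sum_or_na` helper are hypothetical and are not taken from inference_slot.R. Replacing NAs with 0 makes a week with no reports look like a week with zero cases, while keeping the NAs leaves that decision to the aggregation step.

```r
# Hypothetical daily counts over two weeks; week 2 has no reported data at all.
daily <- c(3, NA, 5, 2, NA, 4, 1, rep(NA, 7))
week  <- rep(c(1, 2), each = 7)

# Current behaviour: NAs become 0 before aggregation.
zero_filled <- ifelse(is.na(daily), 0, daily)
weekly_zero_filled <- tapply(zero_filled, week, sum)

# Keeping NAs: sum what was observed, but report NA for a week with no data.
# (Assumed aggregation rule, for illustration only.)
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
weekly_na_kept <- tapply(daily, week, sum_or_na)

print(weekly_zero_filled)  # week 2 is indistinguishable from a true zero
print(weekly_na_kept)      # week 2 stays NA
```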

There is a downstream issue if ALL values are NA for a given subpopulation-outcome combination (e.g. in the Disparities round there is no Latino population in North Carolina, so all ground truth values are NA). Calculating the likelihood then raises an error.

To Reproduce

Using ground truth data file from Disparities round

# obs_subpop, config, gt_start_date and gt_end_date are assumed to be defined already
obs <- readr::read_csv("data/target_data_phase1_adjust.csv")
data_stats <- lapply(
  "37000",
  function(x) {
    # Keep only the rows for this subpopulation
    df <- obs[obs[[obs_subpop]] == x, ]
    inference::getStats(
      df,
      "date",
      "data_var",
      stat_list = config$inference$statistics,
      start_date = gt_start_date,
      end_date = gt_end_date
    )
  }
) %>%
  set_names("37000")

This gives a statistic

data_stats$`37000`$sum_case_latino
  date data_var
1   NA        1

But this needs to be compared to a simulation that presumably has a value for each date, so (1) the lengths of the variables do not match when calculating the likelihood, and (2) even if the statistic were all NAs to reflect the data, the likelihood could not be computed.

Expected behavior

I am not entirely sure what behaviour we want here.

Currently the workflow is:

  1. Read in the data file and simulation output as is
  2. Aggregate these to any time period required (dealing with NAs along the way, e.g. removing them if we specify remove_na: TRUE)
  3. Calculate the likelihood statistic, which has some errors if there are any NAs (I think?), and definitely if they are ALL NAs

re: 3 - there is an error in the logic here I think (?)

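# Excerpt from logLikStat, with add_one and distr set here so it runs standalone
add_one <- TRUE
distr <- "pois"
sim <- c(1, 2, 3, 4)
obs <- rep(NA, 4)
if (add_one) {
  eval <- sim + obs != 0          # NA wherever obs is NA
  sim[sim == 0 & eval == 1] <- 1
} else {
  eval <- as.logical(rep(1, length(obs)))
}
rc <- rep(0, length(obs))
if (distr == "pois") {
  # Errors when eval contains NA
  rc[eval] <- dpois(round(obs[eval]), sim[eval], log = TRUE)
}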

In the example above (code from the logLikStat function in the R inference package), if add_one = TRUE then

> eval
[1] NA NA NA NA

and calculating rc raises an error.

If add_one = FALSE then eval is a vector of TRUEs, and rc ends up as a vector of NAs, so the likelihood is NA anyway.
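Both cases can be reproduced in isolation with base R. This is only a sketch of the behaviour described above, not the package code; `eval_mask`, `err` and `rc_all` are names chosen for the illustration.

```r
sim <- c(1, 2, 3, 4)
obs <- rep(NA_real_, 4)

# add_one = TRUE: arithmetic with NA propagates, so the mask itself is all NA.
eval_mask <- sim + obs != 0

# Assigning several values through a logical index that contains NA is an
# error in R, so capture the message instead of stopping the script.
rc <- rep(0, length(obs))
err <- tryCatch(
  {
    rc[eval_mask] <- dpois(round(obs[eval_mask]), sim[eval_mask], log = TRUE)
    NULL
  },
  error = function(e) conditionMessage(e)
)
print(err)

# add_one = FALSE: the mask is all TRUE, so dpois() runs but returns NA for
# every date, and the summed log-likelihood is NA as well.
rc_all <- dpois(round(obs), sim, log = TRUE)
print(sum(rc_all))
```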

I'm lost about what exactly should happen here instead. 🫠 Will think about this more but adding it here for the moment.

saraloo added the bug and r-inference labels on Aug 7, 2024
saraloo self-assigned this on Aug 7, 2024
saraloo added the high priority label on Aug 7, 2024

saraloo commented Aug 8, 2024

Thinking about this a bit more, I think this is intertwined with #272. We do need some sort of filtering of simulation dates to match the observed ground truth data, but it was in the wrong spot before.

I think the modeled simulation output (after aggregation) should only be compared to the observed fitting data (after aggregation, removing NAs, etc.) for dates where the values are not NA and where the dates match after aggregation.
If there are no non-NA aggregated values for a given outcome-location combination, can we just set the likelihood for that statistic to 0?
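A rough sketch of what this proposal could look like, for discussion only. `loglik_observed_only` is a hypothetical helper, not a function in the inference package; it assumes both inputs are already aggregated data frames with `date` and `value` columns, uses a Poisson likelihood, ignores the add_one logic, and treats "likelihood of 0" as a contribution of 0 on the log scale.

```r
# Hypothetical helper: Poisson log-likelihood restricted to dates that are
# present in both series and have a non-missing observation.
loglik_observed_only <- function(obs_df, sim_df) {
  # Keep only dates that appear in both the data and the simulation.
  both <- merge(obs_df, sim_df, by = "date", suffixes = c("_obs", "_sim"))
  keep <- !is.na(both$value_obs) & !is.na(both$value_sim)
  if (!any(keep)) {
    return(0)  # assumption: no usable data contributes nothing on the log scale
  }
  sum(dpois(round(both$value_obs[keep]), both$value_sim[keep], log = TRUE))
}

# Made-up weekly values for illustration.
dates       <- as.Date("2024-01-07") + 7 * 0:3
sim_df      <- data.frame(date = dates, value = c(1, 2, 3, 4))
obs_all_na  <- data.frame(date = dates, value = rep(NA_real_, 4))
obs_partial <- data.frame(date = dates, value = c(2, NA, 3, NA))

print(loglik_observed_only(obs_all_na, sim_df))   # all NA: falls back to 0
print(loglik_observed_only(obs_partial, sim_df))  # uses the two observed weeks
```

Whether 0 is the right fallback depends on how the per-statistic values are combined afterwards; the sketch only shows that the NA mask and the date matching can happen in one place, before the density is evaluated.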

I'm in the process of stress testing these functions across all options; it gets complex with the aggregation options etc.

saraloo changed the title from "R inference error in calculating likelihood when data is missing/NA for all dates" to "R inference error in calculating likelihood when data is missing/NA" on Aug 9, 2024