R inference error in calculating likelihood when data is missing/NA #286

saraloo · 2024-08-07T17:18:43Z

Describe the bug

Inconsistent/unwanted behaviour when there are NAs in the ground truth fitting data. Currently, in classical R inference (inference_slot.R, ~line 270), the observations/fitting data are read in and all NAs are replaced with 0. This is not wanted behaviour: we want to keep the NAs so they can be dealt with later, when we aggregate to the time period we require (e.g. to a week).

There is a downstream issue if ALL values are NAs for a given subpopulation-outcome combination (e.g. in the Disparities round there is no Latino population in North Carolina, so all ground truth values are NAs). Calculating the likelihood then throws an error.
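As a minimal sketch of why the zero-fill matters, consider the made-up daily counts below. The values and variable names are hypothetical and are not taken from inference_slot.R. Once NAs become 0, unreported days are indistinguishable from days with no cases, and summary statistics are pulled downwards:

```r
# Hypothetical daily counts for one week; two days are unreported.
daily <- c(5, NA, 7, 6, NA, 4, 8)

# Zero-filling on read, similar in spirit to the behaviour described above.
zero_filled <- replace(daily, is.na(daily), 0)

# The mean over the week is biased low once missing days count as zeros.
mean_zero_filled <- mean(zero_filled)
mean_reported    <- mean(daily, na.rm = TRUE)

# After zero-filling there is no record left of which days were missing.
n_missing_before <- sum(is.na(daily))
n_missing_after  <- sum(is.na(zero_filled))
```

Keeping the NAs leaves that decision to the aggregation step, where an option such as remove_na can be applied deliberately.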

To Reproduce

Using the ground truth data file from the Disparities round:

obs <- readr::read_csv("data/target_data_phase1_adjust.csv")
data_stats <- lapply(
  "37000",
  function(x) {
    df <- obs[obs[[obs_subpop]] == x, ]
    inference::getStats(
      df,
      "date",
      "data_var",
      stat_list = config$inference$statistics,
      start_date = gt_start_date,
      end_date = gt_end_date
    )
  }) %>%
  set_names("37000")

This gives a statistic

data_stats$`37000`$sum_case_latino
  date data_var
1   NA        1

But this needs to be compared to a simulation that presumably has a value for each date, so (1) the lengths of the variables do not match when calculating the likelihood, and (2) even if the statistic were all NAs to reflect the data, the likelihood could not be computed.

Expected behavior

I am not entirely sure what behaviour we want here.

Currently the workflow is:

1. Read in the data file and the simulation output as is.
2. Aggregate these to whatever time period is required (this step deals with NAs, removing them if we specify remove_na: TRUE).
3. Calculate the likelihood statistic. This has some errors if there are any NAs (I think?), and definitely if they are ALL NAs.
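The aggregation step in this workflow can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the data are made up, and sum_or_na is a hypothetical helper that leaves a week as NA when none of its days were reported.

```r
# Hypothetical two weeks of daily data; the second week is entirely missing.
dates  <- seq(as.Date("2024-01-01"), by = "day", length.out = 14)
values <- c(1, 2, NA, 4, 5, 6, 7, rep(NA, 7))

# Aggregate to weeks. A week with no reported days stays NA instead of
# becoming 0, so the likelihood step can tell it apart from a true zero.
sum_or_na <- function(v) if (all(is.na(v))) NA_real_ else sum(v, na.rm = TRUE)
week_start <- cut(dates, breaks = "week")   # weeks start on Monday by default
weekly <- tapply(values, week_start, sum_or_na)

# The likelihood step would then use only the weeks that carry a value.
usable_weeks <- names(weekly)[!is.na(weekly)]
```

With this layout the first week aggregates over its six reported days and the second week remains NA, which is the case that currently breaks the likelihood calculation.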

re: 3 - there is an error in the logic here I think (?)

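add_one <- TRUE   # added so the snippet runs on its own; set to FALSE for the other branch
distr <- "pois"   # added so the snippet runs on its own
sim <- c(1,2,3,4)
obs <- rep(NA,4)
if (add_one) {
  eval <- sim + obs != 0
  sim[sim == 0 & eval == 1] = 1
} else {
  eval <- as.logical(rep(1, length(obs)))
}
rc <- rep(0, length(obs))
if (distr == "pois") {
  rc[eval] <- dpois(round(obs[eval]), sim[eval], log = T)
}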

In the example above (code from the logLikStat function in the R inference package), if add_one = TRUE then

> eval
[1] NA NA NA NA

and there is an error when calculating rc.

If add_one = FALSE then eval is a vector of TRUEs and rc ends up as a vector of NAs, which gives us a likelihood of NA anyway.

I'm lost about what exactly should happen here instead. 🫠 Will think about this more but adding it here for the moment.


saraloo · 2024-08-08T16:04:20Z

Thinking about this a bit more, I think this is entangled with #272. We do need some sort of filtering of simulation dates that matches the observed ground truth data, but it was in the wrong spot before.

I think the modeled simulation output (after aggregation) should only be compared to observed fitting data (after aggregation and removing NAs etc) for dates where the values are not NA, and dates match after aggregation...
If there are no aggregated values that are not NA for a given outcome-location combination, can we just set the likelihood for that statistic as 0?
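One way this proposal could look is sketched below. Everything here is an assumption for illustration: loglik_matched is a hypothetical helper, not a function in the inference package, the column names date and value are made up, a Poisson likelihood is used because it matches the snippet above, and "likelihood of 0" is read as a contribution of 0 on the log scale.

```r
# Hypothetical helper: compare simulation and observations only on dates
# where the aggregated observation is present and not NA.
loglik_matched <- function(sim_df, obs_df) {
  # Drop observed dates whose aggregated value is NA.
  obs_ok <- obs_df[!is.na(obs_df$value), , drop = FALSE]
  # Keep only dates that appear in both series.
  both <- merge(sim_df, obs_ok, by = "date", suffixes = c("_sim", "_obs"))
  # Nothing to compare: contribute 0 to the log-likelihood (assumption).
  if (nrow(both) == 0) {
    return(0)
  }
  sum(dpois(round(both$value_obs), both$value_sim, log = TRUE))
}

# Made-up weekly values for one outcome-location combination.
sim_df   <- data.frame(date = as.Date("2024-01-01") + 7 * 0:2, value = c(10, 12, 9))
obs_some <- data.frame(date = sim_df$date, value = c(11, NA, 8))
obs_none <- data.frame(date = sim_df$date, value = NA_real_)

ll_some <- loglik_matched(sim_df, obs_some)   # uses the two non-NA weeks
ll_none <- loglik_matched(sim_df, obs_none)   # all NA, so falls back to 0
```

Filtering before the density call sidesteps both problems described earlier: the vectors passed to dpois always have matching lengths, and no NA reaches the indexing or the sum.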

In the process of stress testing these functions for all options - complex with aggregation options etc...

saraloo added bug Defects or errors in the code. r-inference Relating to the R inference package. labels Aug 7, 2024

saraloo self-assigned this Aug 7, 2024

saraloo added the high priority High priority. label Aug 7, 2024

saraloo mentioned this issue Aug 8, 2024

Fixing issues with inference and NAs #292

Merged

saraloo changed the title ~~R inference error in calculating likelihood when data is missing/NA for all dates~~ R inference error in calculating likelihood when data is missing/NA Aug 9, 2024

TimothyWillard added this to the Modular Inference And Comparisons milestone Nov 18, 2024
