Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column type error when combining max(na.rm = TRUE) with any(is.na()) in grouped operation #6608

Open
dcolombara opened this issue Nov 8, 2024 · 2 comments

Comments

@dcolombara
Copy link

Description

I've encountered an error when combining max(..., na.rm = TRUE) with any(is.na(...)) in the same grouped operation. The operation fails with a type error, but works fine when na.rm = FALSE.

Expected behavior

The second operation below should work the same as the first one, just handling NAs differently via na.rm = TRUE.

Observed behavior

The operation fails with a type error suggesting column type inconsistency across groups, even though all input columns are integers.

Column 1 of result for group 3 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

Minimal reproducible example

Create sample data

library(data.table)

dt_test = data.table(
  hh_id = rep(1:3, each = 3),
  age = c(65L, 65L, NA_integer_,
          45L, 45L, 45L,
          NA_integer_, NA_integer_, NA_integer_), 
  disability = sample(0L:1L, 9, replace = T)
)

Works (when na.rm = FALSE)

dt_test[, .(
  age = max(age, na.rm = FALSE)
  ,any_disability = max(disability, na.rm = FALSE)
  ,any_disabilityNA = any(is.na(disability))
), by = hh_id]

Fails (when set na.rm = TRUE)

dt_test[, .(
  age = max(age, na.rm = TRUE)
  ,any_disability = max(disability, na.rm = TRUE)
  ,any_disabilityNA = any(is.na(disability))
), by = hh_id, verbose = TRUE]

Works (when comment out the any(...) )

dt_test[, .(
  age = max(age, na.rm = TRUE)
  , any_disability = max(disability, na.rm = TRUE)
  #, any_disabilityNA = any(is.na(disability))
), by = hh_id]

## Works (when comment out the max(age) ) 
dt_test[, .(
  # age = max(age, na.rm = TRUE),
  any_disability = max(disability, na.rm = TRUE)
  , any_disabilityNA = any(is.na(disability))
), by = hh_id]

Output of sessionInfo()

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.16.2

loaded via a namespace (and not attached):
[1] compiler_4.4.1 tools_4.4.1  
@ben-schwen
Copy link
Member

ben-schwen commented Nov 9, 2024

Ty for the report. This is part, due to the bug fix 35 of 1.15.0

DT[, min(intCol, na.rm=TRUE), by=grp] would return Inf for any groups containing all NAs, with a type change from integer to numeric to hold the Inf, and with warning. Similarly max would return -Inf. Now NA is returned for such all-NA groups, without warning or type change. This is almost-surely less surprising, more convenient, consistent, and efficient. There was no user request for this, likely because our desire to be consistent with base R in this regard was known (base::min(x, na.rm=TRUE) returns Inf with warning for all-NA input). Matt Dowle made this change when reworking internals, #5105. The old behavior seemed so bad, and since there was a warning too, it seemed appropriate to treat it as a bug.

So when you leave out the any part your query gets optimized with GForce to gmax which returns NA instead of -Inf for your 3rd group. You can also cast your column to double to avoid the error. Since most of us are no big fans of -Inf for max(NA_integer_), I'm not sure when or if we will fix this

@tdhock
Copy link
Member

tdhock commented Nov 11, 2024

there are several types of consistency to consider here

  • A consistency between base R and data table -- this was abandoned in rework gminmax #5105
  • B output type consistent with input type -- this was started in rework gminmax #5105
  • C consistency between what happens with GForce and without -- I think we are stuck with this inconsistency because of the decision to do B instead of A.

MRE below has good warnings and verbose output so I don't think we need to change anything (this behavior is expected and well documented).

> data.table(x=c(NA,1L))[,.(m=max(x)),by=x]
       x     m
   <int> <int>
1:    NA    NA
2:     1     1
> data.table(x=c(NA,1L))[,.(m=max(x,na.rm=TRUE)),by=x]
Error in `[.data.table`(data.table(x = c(NA, 1L)), , .(m = max(x, na.rm = TRUE)),  : 
  Column 1 of result for group 2 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
In addition: Warning message:
In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf
> data.table(x=c(NA,1L))[,.(m=max(x,na.rm=TRUE)),by=x,verbose=TRUE]
Detected that j uses these columns: <none>
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 2 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'list(max(x, na.rm = TRUE))'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... Error in `[.data.table`(data.table(x = c(NA, 1L)), , .(m = max(x, na.rm = TRUE)),  : 
  Column 1 of result for group 2 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
In addition: Warning message:
In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf
> data.table(x=c(NA,1L))[,.(m=as.integer(max(x,na.rm=TRUE))),by=x]
       x     m
   <int> <int>
1:    NA    NA
2:     1     1
Warning messages:
1: In max(x, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
2: In `[.data.table`(data.table(x = c(NA, 1L)), , .(m = as.integer(max(x,  :
  NAs introduced by coercion to integer range

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants