-
Notifications
You must be signed in to change notification settings - Fork 987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix type coercion in bmerge #6603
base: master
Are you sure you want to change the base?
Conversation
Generated via commit c9d3d74 Download link for the artifact containing the test results: ↓ atime-results.zip
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6603 +/- ##
=======================================
Coverage 98.60% 98.61%
=======================================
Files 79 79
Lines 14516 14542 +26
=======================================
+ Hits 14314 14340 +26
Misses 202 202 ☔ View full report in Codecov by Sentry. |
@MichaelChirico is it worth to make the (length(unique(icols))!=length(icols) || length(unique(xcols))!=length(xcols)) ensuring that either |
Sorry, I just saw that this is a CRAN requirement, checking now. |
Definitely thanks for triaging a fix here. I think we both agree it's pretty hack-y & ideally not needed. I'm not sure I fully understand the bug yet, but as stated in OP we're doing IMO |
Unfortunately this "contract" does not hold. I have a Windows dev version installed here and get the following > typeof(.Date(0L))
[1] "integer"
> typeof(as.Date(.Date(0L)))
[1] "integer" |
Thanks, also confirmed that on 4.4.1 |
This is strange as Kurt mentioned there wasn't an intention to do this but I figure it would have been fixed by now if that were the case. |
Update this fix now can convert into one direction from integer to double # this works
x = data.table(a=1L)
y = data.table(c=1L, d=2)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]
# this still needs to fixed
x = data.table(a=1)
y = data.table(c=1, d=2L)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)] |
Should I consider this PR in-progress for now? The diff has grown enough that it would help to add a brief overview of the changes to the PR description to orient reading |
Just received the "deadline" on this from Kurt:
|
@MichaelChirico the PR is ready for review. What I'm unsure about is if we always want to try to copy at |
@@ -34,86 +34,120 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos | |||
ans | |||
} | |||
|
|||
cast_with_atts = function(x, as.f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these helpers be declared in the {data.table} namespace instead (not exported)?
if bmerge()
gets called dozens/hundreds of times, these helpers would have to be re-created each time. my approach is usually to put such helpers outside function bodies unless it's strongly beneficial for them to inherit directly from the function environment (not the case here AFAICT).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if it's about matching how getClass
(now mergeType
) is handled, I would move that helper out too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see coerce_col
inherits verbose
.
Suggestion: verbose_msg = NULL
in the signature, then if (!is.null(verbose_msg))
to replace if (verbose)
I found myself wondering `ic`... "`i` character? `i` class?". Simpler to encode more info in the name
if (anyNA(i[[ic]]) && allNA(i[[ic]])) { | ||
if (verbose) catf("Coercing all-NA %s (%s) to type %s to match type of %s.\n", iname, iclass, xclass, xname) | ||
set(i, j=ic, value=match.fun(paste0("as.", xclass))(i[[ic]])) | ||
cfl = c("character", "logical", "factor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weren't factor
columns handled in the previous branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if x_merge_type
and i_merge_type
are both factor
"like". This branch is for handling coercing all NA values (mostly ofc coercing NA logical
)
if (wclass=="integer" || (wclass=="double" && !isReallyReal(w[[wc]]))) { | ||
if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which contains no fractions)" else "", nm[2L]) | ||
set(w, j=wc, value=bit64::as.integer64(w[[wc]])) | ||
} else stopf("Incompatible join types: %s is type integer64 but %s is type double and contains fractions", nm[2L], nm[1L]) | ||
} else { | ||
# just integer and double left | ||
if (iclass=="double") { | ||
if (!isReallyReal(i[[ic]])) { | ||
ic_idx = which(icol == icols) # check if on is joined on multiple conditions, #6602 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On first read I think "this can be doing a lot of duplicate work since it's basically for (icol in icols) icol==icols
", but actually it only applies to the subset icols[icols != xcols & i_merge_type %in% c("integer", "double", "complex")]
.
length(icols)
will not really be large in general (almost always <10
), the subset there is even smaller, I doubt there's any real concern about the "lost efficiency" here unless you see something obvious (my first instinct is duplicated()
but that's not quite right, everything else that comes to mind harms readability).
but it might be worth calling out the intentional redundancy in a comment.
IINM this also closes #6554 (cc @rikivillalba) |
@@ -15727,7 +15727,8 @@ DT = data.table(z = 1i) | |||
test(2069.33, DT[DT, on = 'z'], error = "Type 'complex' is not supported for joining/merging") | |||
|
|||
# forder verbose message when !isReallyReal Date, #1738 | |||
DT = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days"), 20, replace=TRUE)) | |||
date_dbl = as.Date(as.double(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days")), origin="1970-01-01") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need to edit this test? I'd've thought one effect of this PR is to make the old test keep working as intended...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we alter this test to become future proof and ensure that we have a indeed a double
, no matter what R
decides what the underlying type should be
# real double | ||
x = data.table(a=1) | ||
y = data.table(c=1.5, d=1L) | ||
test(2297.21, y[x, on=.(c == a, d == a)], data.table(c=1, d=1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help to include a non-join column in these tests, here at first glance it's surprising, but this is the right-join-by-default effect at play (i.e. there'd be no rows if nomatch=NULL
).
This reverts commit c9d3d74.
Closes #6602
Base does not encounter this problem since one join column can not be in multiple join conditions.
What do we do:
In bmerge we check if the types to merge x and i in
x[i]
are compatible. This also does some type coercion for example when the responsible columns are integer and double.If a single column of i is in multiple conditions with columns of x, all of the to be joined columns need to coerce, this is why we introduce the extra check
length(ic_idx)>1L
and then iterate over the corresponding columns of x.