Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix type coercion in bmerge #6603

Open
wants to merge 36 commits into
base: master
Choose a base branch
from
Open

fix type coercion in bmerge #6603

wants to merge 36 commits into from

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Nov 3, 2024

Closes #6602

Base does not encounter this problem since one join column can not be in multiple join conditions.

What do we do:
In bmerge we check if the types to merge x and i in x[i] are compatible. This also does some type coercion for example when the responsible columns are integer and double.
If a single column of i is in multiple conditions with columns of x, all of the to be joined columns need to coerce, this is why we introduce the extra check length(ic_idx)>1L and then iterate over the corresponding columns of x.

Copy link

github-actions bot commented Nov 3, 2024

Comparison Plot

Generated via commit c9d3d74

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 38 seconds
Installing different package versions 7 minutes and 54 seconds
Running and plotting the test cases 2 minutes and 22 seconds

Copy link

codecov bot commented Nov 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.61%. Comparing base (6a15f86) to head (e171500).
Report is 10 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6603   +/-   ##
=======================================
  Coverage   98.60%   98.61%           
=======================================
  Files          79       79           
  Lines       14516    14542   +26     
=======================================
+ Hits        14314    14340   +26     
  Misses        202      202           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ben-schwen ben-schwen marked this pull request as ready for review November 3, 2024 21:52
@ben-schwen
Copy link
Member Author

ben-schwen commented Nov 3, 2024

@MichaelChirico is it worth to make the if more specific to this corner case e.g.

(length(unique(icols))!=length(icols) || length(unique(xcols))!=length(xcols))

ensuring that either icols has a double condition or xcols has a double condition?

R/bmerge.R Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

Sorry, I just saw that this is a CRAN requirement, checking now.

R/bmerge.R Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

Definitely thanks for triaging a fix here. I think we both agree it's pretty hack-y & ideally not needed.

I'm not sure I fully understand the bug yet, but as stated in OP we're doing as.Date().

IMO as.Date() should always coerce to double storage, even if it's possible to get Date objects with integer storage. Therefore I would hold off until (1) R-core confirms that yes, as.Date(x) might return integer sometimes; and/or (2) I get a clearer understanding of the underlying issue here.

@ben-schwen
Copy link
Member Author

IMO as.Date() should always coerce to double storage, even if it's possible to get Date objects with integer storage. Therefore I would hold off until (1) R-core confirms that yes, as.Date(x) might return integer sometimes; and/or (2) I get a clearer understanding of the underlying issue here.

Unfortunately this "contract" does not hold. I have a Windows dev version installed here and get the following

> typeof(.Date(0L))
[1] "integer"
> typeof(as.Date(.Date(0L)))
[1] "integer"

@MichaelChirico
Copy link
Member

I have a Windows dev version installed here and get the following

Thanks, also confirmed that on 4.4.1

@TysonStanley
Copy link
Member

This is strange as Kurt mentioned there wasn't an intention to do this but I figure it would have been fixed by now if that were the case.

@ben-schwen
Copy link
Member Author

Update this fix now can convert into one direction from integer to double

# this works
x = data.table(a=1L)
y = data.table(c=1L, d=2)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]
# this still needs to fixed
x = data.table(a=1)
y = data.table(c=1, d=2L)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]

@MichaelChirico
Copy link
Member

this still needs to fixed

Should I consider this PR in-progress for now?

The diff has grown enough that it would help to add a brief overview of the changes to the PR description to orient reading

R/bmerge.R Outdated Show resolved Hide resolved
@ben-schwen ben-schwen marked this pull request as draft November 25, 2024 11:58
@TysonStanley
Copy link
Member

Just received the "deadline" on this from Kurt:

Please correct before 2024-12-09 to safely retain your package on CRAN. Note that this will be the final reminder.

@ben-schwen ben-schwen marked this pull request as ready for review November 27, 2024 18:33
@ben-schwen
Copy link
Member Author

@MichaelChirico the PR is ready for review. What I'm unsure about is if we always want to try to copy at cast_with_atts or maybe want to introduce an extra argument for that. Moreover, I tried to keep everything quite modular.

R/bmerge.R Outdated Show resolved Hide resolved
@@ -34,86 +34,120 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
ans
}

cast_with_atts = function(x, as.f) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these helpers be declared in the {data.table} namespace instead (not exported)?

if bmerge() gets called dozens/hundreds of times, these helpers would have to be re-created each time. my approach is usually to put such helpers outside function bodies unless it's strongly beneficial for them to inherit directly from the function environment (not the case here AFAICT).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if it's about matching how getClass (now mergeType) is handled, I would move that helper out too)

Copy link
Member

@MichaelChirico MichaelChirico Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see coerce_col inherits verbose.

Suggestion: verbose_msg = NULL in the signature, then if (!is.null(verbose_msg)) to replace if (verbose)

I found myself wondering `ic`... "`i` character? `i` class?". Simpler to encode more info in the name
if (anyNA(i[[ic]]) && allNA(i[[ic]])) {
if (verbose) catf("Coercing all-NA %s (%s) to type %s to match type of %s.\n", iname, iclass, xclass, xname)
set(i, j=ic, value=match.fun(paste0("as.", xclass))(i[[ic]]))
cfl = c("character", "logical", "factor")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weren't factor columns handled in the previous branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if x_merge_type and i_merge_type are both factor "like". This branch is for handling coercing all NA values (mostly ofc coercing NA logical)

if (wclass=="integer" || (wclass=="double" && !isReallyReal(w[[wc]]))) {
if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which contains no fractions)" else "", nm[2L])
set(w, j=wc, value=bit64::as.integer64(w[[wc]]))
} else stopf("Incompatible join types: %s is type integer64 but %s is type double and contains fractions", nm[2L], nm[1L])
} else {
# just integer and double left
if (iclass=="double") {
if (!isReallyReal(i[[ic]])) {
ic_idx = which(icol == icols) # check if on is joined on multiple conditions, #6602
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first read I think "this can be doing a lot of duplicate work since it's basically for (icol in icols) icol==icols", but actually it only applies to the subset icols[icols != xcols & i_merge_type %in% c("integer", "double", "complex")].

length(icols) will not really be large in general (almost always <10), the subset there is even smaller, I doubt there's any real concern about the "lost efficiency" here unless you see something obvious (my first instinct is duplicated() but that's not quite right, everything else that comes to mind harms readability).

but it might be worth calling out the intentional redundancy in a comment.

@MichaelChirico
Copy link
Member

IINM this also closes #6554 (cc @rikivillalba)

@MichaelChirico
Copy link
Member

MichaelChirico commented Nov 27, 2024

Aside for follow-up (#6628):

I think we should implement mergeType returning I?Date, POSIXct as separate cases. That would for example solve #3506 and could make the solution here clearer.

R/bmerge.R Outdated Show resolved Hide resolved
@@ -15727,7 +15727,8 @@ DT = data.table(z = 1i)
test(2069.33, DT[DT, on = 'z'], error = "Type 'complex' is not supported for joining/merging")

# forder verbose message when !isReallyReal Date, #1738
DT = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days"), 20, replace=TRUE))
date_dbl = as.Date(as.double(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days")), origin="1970-01-01")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need to edit this test? I'd've thought one effect of this PR is to make the old test keep working as intended...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we alter this test to become future proof and ensure that we have a indeed a double, no matter what R decides what the underlying type should be

# real double
x = data.table(a=1)
y = data.table(c=1.5, d=1L)
test(2297.21, y[x, on=.(c == a, d == a)], data.table(c=1, d=1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help to include a non-join column in these tests, here at first glance it's surprising, but this is the right-join-by-default effect at play (i.e. there'd be no rows if nomatch=NULL).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

as.Date can result in different underlying types
4 participants