fix: Ignore names of technical inner fields (of List and Map types) when comparing datatypes for logical equivalence #13522
Makes sense to me -- thank you @Blizzara
FYI @timsaucer this may be related to #13468
Very nice. I think there is another issue I’ve come across but didn’t trace all the way down when using delta-sr. I will test this today or tomorrow.
Should we make a similar change in This will unblock my specific problem, but I wonder if I should still keep #13468 going anyways. It seems like a good thing to maintain that inner field name, even if we generally ignore it. I could go either way.
// Don't compare the names of the technical inner field
// Usually "item" but that's not mandated
that depends what "logically equal" actually means.
The definition for this function is currently by-an-example, we should try to formalize what we mean.
I'm for it, can be added here or as another PR?
Whichever you like. This one was approved so we could merge and add another ticket. Or if you want to just add it in, I'll review it right away. Or if you want me to take it on, I can do that too.
I went ahead and added it in b714d2e. FYI @alamb since you had already approved. (As a followup - I wonder if we should do something similar with StringView etc? At least for the Substrait purposes, a Utf8 should probably be considered equal to Utf8View, maybe also LargeUtf8. I guess what we'd really need is the logical type system :D)
Which issue does this PR close?
Closes #13437
Rationale for this change
Explained more in the issue, but in short: in the Substrait consumer we check the schemas of the input dataset and the Substrait input relation using logically_equivalent_names_and_types(..). This then calls datatype_is_logically_equal(..) on all fields, which can fail if the technical inner fields of a list or map have differing names. That happens to be the case when reading lists from parquet, as the parquet reader uses "element" as the name vs DF (incl. the substrait consumer) mostly using "item".
I think this makes sense, since arguably the names here shouldn't matter, and since Arrow doesn't mandate any specific names for these fields, we should ignore them.
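For illustration, here is a minimal sketch of the idea using simplified stand-in types. `Field`, `DataType`, `field`, `logically_equal` and `parquet_vs_datafusion_lists` are hypothetical names made up for this example; they are not the arrow-rs or DataFusion definitions, and Map is left out for brevity. It assumes struct field names still matter while a list's inner field name does not.

```rust
// Simplified stand-ins for Arrow's Field and DataType (assumption: NOT the real arrow-rs types).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // The boxed field is the "technical" inner field of the list.
    List(Box<Field>),
    Struct(Vec<Field>),
}

// Hypothetical helper to keep the examples short.
fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Logical equality in this sketch: the name of a list's inner field is ignored,
/// while struct field names are still compared.
fn logically_equal(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        // Only recurse into the inner data types; skip the inner field name.
        (DataType::List(x), DataType::List(y)) => {
            logically_equal(&x.data_type, &y.data_type)
        }
        (DataType::Struct(xs), DataType::Struct(ys)) => {
            xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| {
                    x.name == y.name && logically_equal(&x.data_type, &y.data_type)
                })
        }
        // Everything else falls back to plain equality.
        _ => a == b,
    }
}

/// The same list type as the parquet reader ("element") and DF ("item") would name it.
fn parquet_vs_datafusion_lists() -> (DataType, DataType) {
    (
        DataType::List(Box::new(field("element", DataType::Int32))),
        DataType::List(Box::new(field("item", DataType::Int32))),
    )
}
```

With the derived `PartialEq` the two lists from `parquet_vs_datafusion_lists()` compare as different, since the inner names differ, while `logically_equal` treats them as the same type.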
What changes are included in this PR?
Ignore technical inner fields' names when comparing data types for logical equivalence.
Arguably we should ignore these names in all equivalence testing. That's a bigger change and might be hard to even enumerate all the places to check, so I only did the minimal thing I need here, but if it'd be preferred, I can try to expand to other cases as well - at least datatype_is_semantically_equal.
Are these changes tested?
Added unit test
Are there any user-facing changes?
No