Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ignore names of technical inner fields (of List and Map types) when comparing datatypes for logical equivalence #13522

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Nov 21, 2024

Which issue does this PR close?

Closes #13437

Rationale for this change

Explained more in the issue, but in short: In Substrait consumer we check schemas of the input dataset and the Substrait input relation using logically_equivalent_names_and_types(..). This then calls datatype_is_logically_equal(..) on all fields, which can fail if the technical inner fields of a list or map have differing names. That happens to be the case when reading lists from parquet, as the parquet reader uses "element" as the name vs DF (incl. the substrait consumer) mostly using "item".

I think this makes sense, since arguably the names here shouldn't matter, and since Arrow doesn't mandate any specific names for these fields, we should ignore them.

What changes are included in this PR?

Ignore technical inner fields' names when comparing data types for logical equivalence.

Arguably we should ignore these names in all equivalence testing. That's a bigger change and might be hard to even enumerate all the places to check, so I only did the minimal thing I need here, but if it'd be preferred, I can try to expand to other cases as well - at least datatype_is_semantically_equal.

Are these changes tested?

Added unit test

Are there any user-facing changes?

No

…hen comparing datatypes for logical equivalence
@github-actions github-actions bot added the common Related to common crate label Nov 21, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you @Blizzara

@alamb
Copy link
Contributor

alamb commented Nov 23, 2024

FYI @timsaucer this may be related to #13468

@timsaucer
Copy link
Contributor

Very nice. I think there is another issue I’ve come across but didn’t trace all the way down when using delta-sr.

I will test this today or tomorrow.

@timsaucer
Copy link
Contributor

timsaucer commented Nov 23, 2024

Should we make a similar change in datatype_is_semantically_equal ? Adding similar code there does get me to pass the test that delta-rs identified. I'm willing to push a commit up with this after I'd have some time to write a unit test.

This will unblock my specific problem, but I wonder if I should still keep #13468 going anyways. It seems like a good thing to maintain that inner field name, even if we generally ignore it. I could go either way.

Comment on lines +660 to +661
// Don't compare the names of the technical inner field
// Usually "item" but that's not mandated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that depends what "logically equal" actually means.
The definition for this function is currently by-an-example, we should try to formalize what we mean.

@Blizzara
Copy link
Contributor Author

Should we make a similar change in datatype_is_semantically_equal ? Adding similar code there does get me to pass the test that delta-rs identified. I'm willing to push a commit up with this after I'd have some time to write a unit test.

I'm for it, can be added here or as another PR?

@timsaucer
Copy link
Contributor

Whichever you like. This one was approved so we could merge and add another ticket. Or if you want to just add it in, I'll review it right away. Or if you want me to take it on, I can do that too

@Blizzara
Copy link
Contributor Author

Whichever you like. This one was approved so we could merge and add another ticket. Or if you want to just add it in, I'll review it right away. Or if you want me to take it on, I can do that too

I went ahead and added it in b714d2e. FYI @alamb since you had already approved.

(As a followup - I wonder if we should do something similar with StringView etc? At least for the Substrait purposes, a Utf8 should probably be considered equal to Utf8View, maybe also LargeUTF8.. I guess what we'd really need is the logical type system :D)

@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

I went ahead and added it in b714d2e. FYI @alamb since you had already approved.

Thank you for the really nice test coverage too. 🙏

@alamb alamb merged commit 1e67364 into apache:main Nov 25, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substrait consumer's schema check can fail for list (probs also map) columns
4 participants