-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix range out of index error with a temporary workaround #584
Conversation
I will add a custom Java importer implementation to use the fixed |
Looks like I need to copy more Arrow Java classes to make it work... |
Instead of a real fix which requires adding I re-ran the query and verified this workaround fixes the range out of index error. |
// HACK: For the issue https://github.com/apache/datafusion-comet/issues/540 | ||
// As Arrow Java doesn't support `offset` in C Data interface, we cannot correctly import | ||
// a slice of string from arrow-rs to Java Arrow and then export it to arrow-rs again. | ||
// So we add this hack to always take full length of data buffer by assuming the first offset | ||
// is always 0 which is true for Arrow Java and arrow-rs. | ||
final int len = end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the workaround added in Arrow Java side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A similar workaround is also added in the custom arrow-rs branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This workaround assumes that the first element in offset buffer is always 0. This is true for arrow-rs and Arrow Java although Arrow spec just recommends it but not enforces it. As we only use arrow-rs and Arrow Java, this workaround should work for us without correctness issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the changes in the custom arrow-rs branch different to the fix that went into arrow-rs master branch recently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix into arrow-rs repo is real fix that correctly reflecting offset of offset buffer into offset
field in C Data interface. But only the fix is not enough, we still need the offset
support in Arrow Java.
So the workaround here is different but just hacky solution. Instead filing correct offset
in C Data interface, we calculate correct data buffer length in a hacky way by assuming the first offset value in offset buffer is always 0 (this is held in both arrow-rs and Arrow Java). So we get correct data buffer length even offset buffer is a slice (i.e., offset-ed
in arrow-rs).
cc @andygrove This uses a temporary workaround to fix the issue. I'm not sure how long it will take to have the Arrow Java fix, so I think a workaround but not a real fix might be good for us to move forward. |
I also ran the benchmarks and can confirm that this fixes the issue. |
I proposed the workaround fix to arrow-rs at apache/arrow-rs#5958. If it could be merged in the upstream, we can just use official arrow-rs and DataFusion repos/releases. |
@andygrove Could you verify again if the latest change still resolves the issue? I changed it to use the commit proposed for arrow-rs apache/arrow-rs#5958 behind a feature flag. Thanks. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #584 +/- ##
============================================
+ Coverage 34.13% 34.31% +0.17%
Complexity 809 809
============================================
Files 106 108 +2
Lines 38586 38742 +156
Branches 8566 8568 +2
============================================
+ Hits 13172 13294 +122
- Misses 22674 22708 +34
Partials 2740 2740 ☔ View full report in Codecov by Sentry. |
I changed the arrow-rs repo to use the change at apache/arrow-rs#5964. See if everything in CI can pass. |
@viirya I confirmed that this fixes the issue when I run the benchmarks locally. |
Thank you, @andygrove. Then we can merge this and wait for new arrow-rs and DataFusion releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Hopefully we only need to use the forks for 2-3 weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can depend on tustvold
's repo because the branch could get deleted. Could you create a branch in your repo instead?
core/Cargo.toml
Outdated
arrow-string = { version = "52.0.0" } | ||
parquet = { version = "52.0.0", default-features = false, features = ["experimental"] } | ||
arrow = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139", features = ["prettyprint", "ffi", "chrono-tz"] } | ||
arrow-array = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the PR is merged, can we switch this to a revision of the official arrow-rs repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We can point to arrow-rs repo.
@andygrove This uses official arrow-rs repo now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya
Thanks @andygrove |
…#584) * fix: Fix range out of index error by using custom arrow-rs repo * Add custom Java Arrow classes * Add a hack * Update * Update * Update to use apache/arrow-rs#5958 * Use tustvold's branch * Use official arrow-rs repo
Which issue does this PR close?
Closes #540.
Rationale for this change
What changes are included in this PR?
How are these changes tested?