build: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow #239
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff            @@
##               main     #239   +/-   ##
=========================================
  Coverage     33.48%   33.48%
  Complexity      776      776
=========================================
  Files           108      108
  Lines         37178    37178
  Branches       8146     8146
=========================================
  Hits          12448    12448
  Misses        22107    22107
  Partials       2623     2623
With the arrow-rs and DataFusion forks pinned in this PR, which include a hacky workaround, I no longer see the error in TPCDSQuerySuite. There are still a few query failures, but they are unrelated to it.
parquet = { version = "~50.0.0", default-features = false, features = ["experimental"] }
arrow = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c", features = ["prettyprint", "ffi", "chrono-tz"] }
arrow-array = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
arrow-data = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
arrow-schema = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
arrow-string = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c" }
parquet = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c", default-features = false, features = ["experimental"] }
This patch switches to a specified branch in my forked repo, which adds a hacky workaround for the Java Arrow bug. Once the Java Arrow fix is merged and released, we can revert this.
datafusion-common = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940" }
datafusion = { default-features = false, git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940", features = ["unicode_expressions"] }
datafusion-functions = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940" }
datafusion-physical-expr = { git = "https://github.com/viirya/arrow-datafusion.git", rev = "111a940", default-features = false, features = ["unicode_expressions"] }
We also need DataFusion to use the same specified version of arrow-rs; otherwise there will be conflicts.
Apart from the arrow-rs and DataFusion crate changes, the other changes adapt to API changes in DataFusion, including the new built-in scalar resolution, the new ExecutionPlan API, etc.
I tried to follow all the discussions of the related issues/PRs, and want to make sure I understand the issues/situations correctly. Per my understanding, please correct me if I'm wrong:
If upstream arrow-rs is going to pick up the fix, is it possible for us to wait a bit and use upstream arrow-rs directly? I'm a bit worried about the stability/quality of using the main branch of arrow-rs directly.
Also, do you think it's a viable option to special-case the offset buffer on Comet's Java side, @viirya? From a quick look at Arrow's export code, I think it should be doable to define a special ArrayExporter for Java Arrow's C data exporter that writes a single-value buffer in place of an empty offset buffer.
Correct.
No. An empty offset buffer is invalid, so we are not going to merge the fix into arrow-rs. It is a temporary hack, needed only until the fix lands in Java Arrow.
We don't use the main branch of arrow-rs directly, but a specified branch in a forked repo with a temporary fix. The branch is frozen unless we update it.
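To make the invalid-offset-buffer point concrete, here is a minimal sketch of the invariant under discussion. It is not the code in the forked arrow-rs branch: `validate_offsets`, `normalize_offsets`, and `OffsetError` are hypothetical names, and the sketch uses plain `i32` slices rather than Arrow buffers. It assumes the Arrow columnar format rule that a variable-width array of length `len` carries `len + 1` offsets, so even a zero-length array needs one offset (`0`).

```rust
/// Hypothetical error type for this sketch; not part of arrow-rs.
#[derive(Debug, PartialEq)]
enum OffsetError {
    /// Fewer than `len + 1` offsets were provided.
    TooShort { expected: usize, actual: usize },
    /// The first offset is negative.
    Negative { index: usize },
    /// An offset is smaller than the one before it.
    NotMonotonic { index: usize },
}

/// Checks the layout rule for variable-width arrays: at least `len + 1`
/// offsets, starting non-negative and never decreasing.
fn validate_offsets(offsets: &[i32], len: usize) -> Result<(), OffsetError> {
    if offsets.len() < len + 1 {
        return Err(OffsetError::TooShort {
            expected: len + 1,
            actual: offsets.len(),
        });
    }
    let window = &offsets[..=len];
    if window[0] < 0 {
        return Err(OffsetError::Negative { index: 0 });
    }
    for i in 1..window.len() {
        if window[i] < window[i - 1] {
            return Err(OffsetError::NotMonotonic { index: i });
        }
    }
    Ok(())
}

/// Assumed shape of a workaround: a zero-length array that arrives with an
/// empty offsets buffer gets the single `0` offset the format expects.
/// All other inputs pass through unchanged.
fn normalize_offsets(offsets: Vec<i32>, len: usize) -> Vec<i32> {
    if len == 0 && offsets.is_empty() {
        vec![0]
    } else {
        offsets
    }
}

fn main() {
    // What an exporter with the bug would hand over for an empty string array.
    let from_java: Vec<i32> = Vec::new();
    assert!(validate_offsets(&from_java, 0).is_err());

    // After normalization the buffer satisfies the invariant.
    let fixed = normalize_offsets(from_java, 0);
    assert_eq!(validate_offsets(&fixed, 0), Ok(()));
    println!("normalized offsets: {:?}", fixed);
}
```

The same normalization could in principle live on either side of the FFI boundary: on the importing (Rust) side as above, or on the exporting (Java) side as suggested earlier in this thread. Either way it is a stopgap until the exporter itself writes a valid buffer.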
Looks like it is feasible to have a custom |
I see. Thanks for the clarification. I hadn't seen that conclusion and thought arrow-rs would allow empty buffers for compatibility reasons.
I know the specified branch is frozen unless we update it. But it might contain untested/unstable code, since it was cut directly from master at some point with the temporary fix added on top. If we go with that approach, do you think it would be better to create the specified branch from a released tag, such as
Of course, it's hacky too. The long-term fix should be fixing the export in Java Arrow's C data interface. Compared to a specified branch of arrow-rs, though, this fix is self-contained in the Comet repo, which might have a lower maintenance cost, and it doesn't depend on arrow-rs/DataFusion, which may iterate quickly to introduce new features/fixes. Anyway, I think the current approach is also a good way to iterate fast, as long as Java Arrow's C data API is fixed soon.
LGTM. Do we have an issue to track switching back to a versioned release of DataFusion/Arrow once Arrow Java 16 is released?
Created #248 to track it.
Merged. Thanks.
…fset buffers from Java Arrow (apache#239)
* feat: Use specified branch of arrow-rs with workaround to invalid offset buffers from Java Arrow
* Use FunctionRegistry
* Fix
* Update
* Restore config
* Restore plan stability
Which issue does this PR close?
Closes #236.
Rationale for this change
What changes are included in this PR?
How are these changes tested?