-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-4314: [Rust] Strongly-typed reading of Parquet data #3461
Conversation
Thanks @alecmocatta . Will spend sometime look at this in the next few days. Since this is a huge PR, to help the review, it will be great if you can break it up somehow, and/or strip out the unnecessary parts. For instance, the benchmarks can itself be in a different PR. |
Thanks @sunchao! Good idea – I've removed the benchmarks for their own PR. I also re-ran the benchmarks now that the bulk of this PR is complete. I changed Before:
After:
So for untyped -> untyped there's a 1.3-3.2x speedup; for untyped -> typed there's a 1.5-13x speedup. And the bigger the parquet file, the bigger the speedup. |
The benchmark result looks great @alecmocatta . It will be good if we can also separate the changes on typed reader into a different PR for easier review. You can stack that on top of the improvements on untyped reader. |
Also cc @sadikovi who might be interested in this. |
Just to confirm, by "untyped reader" you're meaning If that is the case (sorry if I'm misunderstanding you!) then I think what you've asked for will not really be viable, as what I've done is first implemented the typed readers, and then re-implemented the untyped readers on top of those typed readers. So it's not really possible to submit the former without the latter, does this make sense? |
I see. That's OK then. Let me take a pass on the PR first. |
@sunchao Just to let you know, I think this PR is pretty much there. Unfortunately rust-lang/rust#58011 is breaking Do let me know if you have any feedback regarding any aspect, particularly structure of the file layout and API. I wasn't sure how much to merge |
Could you add a bit more documentation for module, functions and inline comments, so people could easily follow and potentially extend/patch the code? I also noticed that there were changes for Int96 and ByteArray. Would it be possible to submit them first in a separate PR to reduce the amount of code to review? Thanks. |
78c9842
to
28b0fd5
Compare
Thanks for the feedback @sadikovi! I've gone ahead and postponed my changes to Int96 and ByteArray as you suggest, as well as some other changes not directly required by this PR. I've also documented every user-facing item (I think, unfortunately it's hard to check while rust-lang/rust#58011 is breaking In the meantime I've been testing this in my own project (i.e. "dogfooding") which has thrown up a couple of issues which are now fixed. Do let me know how else I can help to get this over the line and merged! |
Rebased branch on master. |
^ Thanks @xhochy. @sunchao and @sadikovi, have you had a chance to make progress reviewing this? Do let me know if there's anything I can do. I've worked around the aformentioned rust-lang/rust#58011 that was breaking cc @andygrove – in case this is relevant to your work with Arrow and/or DataFusion. I know Wes cc'd you on my related PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alecmocatta and sorry for the delay. I just got time to look at this. Left some early comments and I'll add more later.
_ => panic!("Cannot call get_physical_type() on a non-primitive type"), | ||
} | ||
} | ||
|
||
/// Gets the type length of this primitive type. | ||
/// Note that this will panic if called on a non-primitive type. | ||
pub fn get_type_length(&self) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please put these in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are used a bunch of times by src/record/types/value.rs
; I needed functions to avoid duplicating this logic many times and decided this was the best place for them. I could move them there, or make these pub(crate)
? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean to move these into a different place. Rather, may be you can put these few changes in a different (and much smaller) PR and have it reviewed/committed, then rebase this PR to use them?
|
||
/// This trait is implemented by Schemas so that they can be printed as Parquet schema | ||
/// strings. | ||
pub trait Schema: Debug { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this? only for displaying purpose? can we reuse Type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user uses the wrong type with file_reader.get_row_iter<T>()
then ideally the error message includes the actual schema of the file, as well as the schema dictated by their type. This lets the user see where they differ and thus what's wrong.
The Schema
trait exists so that a schema can be printed from just their type. This wasn't possible using Type
as that requires a full actual schema, which cannot be attained from just the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. It's a little confusing that the fmt
is a public interface for the Schema
type though - to someone who is new to this it may be hard to use (e.g., what value should I pass to the r
and name
parameters).
rust/parquet/src/record/reader.rs
Outdated
// ---------------------------------------------------------------------- | ||
// Implementations for "anonymous" sum types | ||
|
||
impl<A, B> Reader for sum::Sum2<A, B> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty confusing - why we implement reader on Sum2
and Sum3
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're familiar with bluss's Either
crate, then Sum2
is just the same – an enum with two generic variants. Sum3
is the same, but with three generic variants. Rather than creating a new enum every time you need "an enum that can contain either of two/three things", they let you reuse the generic Either<A, B>
/Sum2<A, B>
/Sum3<A, B, C>
.
src/record/types/time.rs
requires "an enum that can contain either of two things" and "an enum that can contain either of three things". Up until a couple of days ago when I implemented a temporary workaround to a bug in cargo doc
, src/record/types/decimal.rs
also required "an enum that can contain either of three things". Rather than create three new specific enums, I chose to reuse the generic enums Sum2
and Sum3
instead.
I hesitated over this as there are only three enums required (and two right now until the cargo doc
bug is resolved) so there's minimal gain. If however the codebase grows and more such enums are required, it makes more sense.
Let me know what you think. I'm going to go ahead and add in my rationale where you've commented, but I'm more than happy to replace them with seperate specific enums.
Thanks so much @sunchao! I'll respond to each comment shortly. The first 8 of them are on an old version of the PR, I'm not sure how that happened? GitHub is showing a yellow "Outdated" warning for me by your comments, so perhaps if you refresh at your end, or do the code review directly on GitHub if you're using a 3rd party tool? Thanks! |
Thanks @alecmocatta . The outdated comments were added some time ago and I forgot to remove them.. On the high level, one thing I'm thinking is how this can play nicely with the upcoming Arrow reader since it will be the standard reader in future. It would be great if this can align with that effort. For instance, is it possible to reuse Arrow types instead of defining a separate set of types here. Will leave more comments later. |
b5ce3c5
to
ee6ff1e
Compare
I'm not so familiar with Arrow so pardon any ignorance. I'll lay out how I see it, and then you and anyone else are welcome to feed into how my PR could better dovetail with the Arrow work. The types introduced in this PR are
What data types in the Arrow crate could be reused to this end? From a cursory look I can see the As a sidenote, I'm dealing with data stored in the Parquet format that might not fit in memory. For this reason, streaming rows from the Parquet file directly, rather than working with Arrow's in-memory materialization of it, was considered necessary. When the Arrow reader does become the standard reader, being able to stream rows without loading them into memory would be a highly desirable property to maintain. |
The supported types for Arrow is here. I think it covers most of the types you defined except
Yes this is definitely required for the Arrow reader as well. |
The type system prevents this, and the untyped test is invalid_map_type in record/schemas.rs
…that Group, Value etc are Send; Return Err on overflowing Timetamp math
Unfortunately I'm unable to dedicate the necessary time to this so closing for now. |
Thanks @alecmocatta -- I hope someone can pick this up in the future!! |
See the proposal I made on @sunchao's repository sunchao/parquet-rs#205 for more details.
This aims to let the user opt in to strong typing and substantial performance improvements (2x-7x, see here) by optionally specifying the type of the records that they are iterating over.
It is currently a work in progress. All pre-existing tests succeed, bar those in src/record/api.rs which are commented out as they require reworking. Where relevant, pre-existing tests and benchmarks have been duplicated to make new strongly-typed tests and benchmarks, which all also succeed. I've tried to maintain pre-existing APIs where possible. Some changes have been made to better align with prior art in the Rust ecosystem.
Any feedback while I continue working on it very welcome! Looking forward to hopefully seeing this merged when it's ready.
JIRA: https://issues.apache.org/jira/browse/ARROW-4314