
ARROW-4314: [Rust] Strongly-typed reading of Parquet data #3461

Closed
wants to merge 40 commits

Conversation

alecmocatta

See the proposal I made on @sunchao's repository sunchao/parquet-rs#205 for more details.

This aims to let the user opt in to strong typing and substantial performance improvements (2x-7x, see here) by optionally specifying the type of the records that they are iterating over.

It is currently a work in progress. All pre-existing tests succeed, bar those in src/record/api.rs which are commented out as they require reworking. Where relevant, pre-existing tests and benchmarks have been duplicated to make new strongly-typed tests and benchmarks, which all also succeed. I've tried to maintain pre-existing APIs where possible. Some changes have been made to better align with prior art in the Rust ecosystem.

Any feedback while I continue working on it is very welcome! Looking forward to hopefully seeing this merged when it's ready.

JIRA: https://issues.apache.org/jira/browse/ARROW-4314
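To illustrate the idea of opting in to strong typing, here is a minimal, self-contained sketch. It does not use the PR's actual code or the parquet crate's real API; the `Value`, `Row`, `FromRow`, `Stock`, and `typed_iter` names are all hypothetical, standing in for "the caller names a concrete record type and the iterator yields that type instead of dynamically-typed rows":

```rust
// Illustrative model only: a dynamically-typed row is a Vec of Value variants,
// and FromRow converts it into a concrete Rust type chosen by the caller.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Str(String),
}

type Row = Vec<Value>;

trait FromRow: Sized {
    fn from_row(row: &Row) -> Option<Self>;
}

#[derive(Debug, PartialEq)]
struct Stock {
    symbol: String,
    price: i64,
}

impl FromRow for Stock {
    fn from_row(row: &Row) -> Option<Self> {
        // Pattern-match the expected shape; anything else is a schema mismatch.
        match row.as_slice() {
            [Value::Str(symbol), Value::Long(price)] => Some(Stock {
                symbol: symbol.clone(),
                price: *price,
            }),
            _ => None,
        }
    }
}

/// The caller opts in to strong typing by naming T, analogous in spirit to
/// the PR's typed row iterator.
fn typed_iter<T: FromRow>(rows: Vec<Row>) -> impl Iterator<Item = Option<T>> {
    rows.into_iter().map(|row| T::from_row(&row))
}

fn main() {
    let rows = vec![vec![Value::Str("ACME".to_string()), Value::Long(42)]];
    let stocks: Vec<Option<Stock>> = typed_iter::<Stock>(rows).collect();
    assert_eq!(stocks[0], Some(Stock { symbol: "ACME".to_string(), price: 42 }));
}
```

The performance intuition is that converting straight into a concrete struct avoids building and then traversing a dynamically-typed intermediate representation for every record.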

@sunchao
Member

sunchao commented Jan 26, 2019

Thanks @alecmocatta. Will spend some time looking at this in the next few days.

Since this is a huge PR, it would be great if you could break it up somehow and/or strip out the unnecessary parts to help the review. For instance, the benchmarks could themselves be in a different PR.

@alecmocatta
Author

alecmocatta commented Jan 27, 2019

Thanks @sunchao! Good idea – I've split the benchmarks out into their own PR.

I also re-ran the benchmarks now that the bulk of this PR is complete. I changed .collect::<Vec<_>>() to .for_each(drop) (which is what the docs suggest to exhaust an iterator) as growing that Vec was dominating benchmark times.

Before:

test record_reader_10k                     ... bench:  17,110,653 ns/iter (+/- 1,098,487) = 39 MB/s
test record_reader_stock_simulated         ... bench: 122,829,389 ns/iter (+/- 10,006,685) = 10 MB/s
test record_reader_stock_simulated_column  ... bench:  10,915,527 ns/iter (+/- 723,523) = 118 MB/s

After:

test record_reader_10k                    ... bench:  13,136,380 ns/iter (+/- 468,750) = 50 MB/s
test record_reader_10k_typed              ... bench:  11,704,658 ns/iter (+/- 288,187) = 57 MB/s
test record_reader_stock_simulated        ... bench:  37,810,293 ns/iter (+/- 540,046) = 34 MB/s
test record_reader_stock_simulated_typed  ... bench:   9,508,126 ns/iter (+/- 168,147) = 135 MB/s
test record_reader_stock_simulated_column ... bench:   8,386,171 ns/iter (+/- 173,131) = 153 MB/s

So for untyped -> untyped there's a 1.3-3.2x speedup; for untyped -> typed there's a 1.5-13x speedup. And the bigger the parquet file, the bigger the speedup.
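The `.for_each(drop)` idiom mentioned above can be illustrated with a small self-contained sketch (the `exhaust` helper is hypothetical, not part of the PR): it drives an iterator to completion without retaining items, whereas `.collect::<Vec<_>>()` allocates and grows a buffer only to throw it away.

```rust
/// Drives an iterator to completion without retaining items, using the
/// `iter.for_each(drop)` idiom the std docs suggest for exhausting iterators.
fn exhaust<I: Iterator>(iter: I) {
    iter.for_each(drop);
}

fn main() {
    // Count how many items are actually produced while exhausting.
    let mut consumed = 0usize;
    exhaust((0..1_000).map(|i| i * 2).inspect(|_| consumed += 1));
    assert_eq!(consumed, 1_000);
    // Compare: collecting retains every item in a Vec before dropping it.
    let collected: Vec<i64> = (0..1_000).map(|i| i * 2).collect();
    assert_eq!(collected.len(), 1_000);
}
```

For a benchmark measuring only read throughput, the `Vec` growth in the `collect` version is pure overhead, which is why switching to `for_each(drop)` changed the numbers.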

@sunchao
Member

sunchao commented Jan 27, 2019

The benchmark result looks great @alecmocatta . It will be good if we can also separate the changes on typed reader into a different PR for easier review. You can stack that on top of the improvements on untyped reader.

@sunchao
Member

sunchao commented Jan 28, 2019

Also cc @sadikovi who might be interested in this.

@alecmocatta
Author

Just to confirm, by "untyped reader" you're meaning Reader and TreeBuilder?

If that is the case (sorry if I'm misunderstanding you!) then I don't think what you've asked for is really viable: I first implemented the typed readers, and then re-implemented the untyped readers on top of those typed readers. So it's not really possible to submit the untyped reader changes without the typed readers they're built on. Does this make sense?

@sunchao
Member

sunchao commented Jan 28, 2019

I see. That's OK then. Let me take a pass on the PR first.

@alecmocatta
Author

@sunchao Just to let you know, I think this PR is pretty much there.

Unfortunately rust-lang/rust#58011 is breaking cargo doc at the moment, which is making my review of documentation and API changes a little trickier.

Do let me know if you have any feedback regarding any aspect, particularly the file layout and the structure of the API. I wasn't sure how much to merge record::types::* with data_type, for example, so erred towards conservatism and minimal changes.

@sadikovi
Contributor

sadikovi commented Feb 1, 2019

Could you add a bit more documentation for module, functions and inline comments, so people could easily follow and potentially extend/patch the code?

I also noticed that there were changes for Int96 and ByteArray. Would it be possible to submit them first in a separate PR to reduce the amount of code to review? Thanks.

@alecmocatta
Author

Thanks for the feedback @sadikovi! I've gone ahead and postponed my changes to Int96 and ByteArray as you suggest, as well as some other changes not directly required by this PR.

I've also documented every user-facing item (I think; unfortunately it's hard to check while rust-lang/rust#58011 is breaking cargo doc!) as well as every file and internal method, as you suggest.

In the meantime I've been testing this in my own project (i.e. "dogfooding") which has thrown up a couple of issues which are now fixed.

Do let me know how else I can help to get this over the line and merged!

@alecmocatta alecmocatta changed the title [WIP] ARROW-4314: [Rust] Strongly-typed reading of Parquet data ARROW-4314: [Rust] Strongly-typed reading of Parquet data Feb 7, 2019
@xhochy
Member

xhochy commented Feb 8, 2019

Rebased branch on master.

@alecmocatta
Author

^ Thanks @xhochy.

@sunchao and @sadikovi, have you had a chance to make progress reviewing this? Do let me know if there's anything I can do.

I've worked around the aforementioned rust-lang/rust#58011 that was breaking cargo doc by temporarily removing existential types. Hopefully this makes review easier.

cc @andygrove – in case this is relevant to your work with Arrow and/or DataFusion. I know Wes cc'd you on my related PR.

Member

@sunchao sunchao left a comment

Thanks @alecmocatta and sorry for the delay. I just got time to look at this. Left some early comments and I'll add more later.

rust/parquet/src/data_type.rs
rust/parquet/src/encodings/encoding.rs
rust/parquet/src/encodings/rle.rs
_ => panic!("Cannot call get_physical_type() on a non-primitive type"),
}
}

/// Gets the type length of this primitive type.
/// Note that this will panic if called on a non-primitive type.
pub fn get_type_length(&self) -> i32 {
Member

please put these in a separate PR.

Author

These are used a bunch of times by src/record/types/value.rs; I needed functions to avoid duplicating this logic many times and decided this was the best place for them. I could move them there, or make these pub(crate)? Thoughts?

Member

I didn't mean to move these to a different place. Rather, maybe you can put these few changes in a different (and much smaller) PR and have it reviewed/committed, then rebase this PR to use them?

rust/parquet/src/record/mod.rs

/// This trait is implemented by Schemas so that they can be printed as Parquet schema
/// strings.
pub trait Schema: Debug {
Member

Why do we need this? Is it only for display purposes? Can we reuse Type?

Author

If the user uses the wrong type with file_reader.get_row_iter<T>() then ideally the error message includes the actual schema of the file, as well as the schema dictated by their type. This lets the user see where they differ and thus what's wrong.

The Schema trait exists so that a schema can be printed from just their type. This wasn't possible using Type as that requires a full actual schema, which cannot be attained from just the type.
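A minimal sketch of what such a trait might look like, assuming the `r` (repetition) and `name` parameters mentioned later in this thread; the actual signature in the PR may differ, and `I64Schema`/`DisplaySchema` here are purely hypothetical:

```rust
use std::fmt;
use std::marker::PhantomData;

#[derive(Debug, Clone, Copy)]
enum Repetition {
    Required,
    Optional,
}

/// Hypothetical shape of the Schema trait: print the Parquet schema implied
/// by a Rust type alone, without needing a file's actual schema.
trait Schema {
    fn fmt(f: &mut fmt::Formatter<'_>, r: Option<Repetition>, name: Option<&str>) -> fmt::Result;
}

/// Example implementation: an i64 column.
struct I64Schema;
impl Schema for I64Schema {
    fn fmt(f: &mut fmt::Formatter<'_>, r: Option<Repetition>, name: Option<&str>) -> fmt::Result {
        let rep = match r.unwrap_or(Repetition::Required) {
            Repetition::Required => "required",
            Repetition::Optional => "optional",
        };
        write!(f, "{} int64 {};", rep, name.unwrap_or("element"))
    }
}

/// Adapter so a Schema can be rendered with to_string().
struct DisplaySchema<S: Schema>(PhantomData<S>);
impl<S: Schema> fmt::Display for DisplaySchema<S> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        S::fmt(f, None, Some("id"))
    }
}

fn main() {
    let rendered = DisplaySchema::<I64Schema>(PhantomData).to_string();
    assert_eq!(rendered, "required int64 id;");
}
```

The key property is that `Schema::fmt` is an associated function with no `self`, so a schema string can be produced from the type parameter alone and compared against the file's actual schema in an error message.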

Member

OK. It's a little confusing that fmt is a public interface for the Schema type though - to someone who is new to this it may be hard to use (e.g., what values should I pass for the r and name parameters?).

// ----------------------------------------------------------------------
// Implementations for "anonymous" sum types

impl<A, B> Reader for sum::Sum2<A, B>
Member

This is pretty confusing - why do we implement Reader on Sum2 and Sum3?

Author

If you're familiar with bluss's Either crate, then Sum2 is just the same – an enum with two generic variants. Sum3 is the same, but with three generic variants. Rather than creating a new enum every time you need "an enum that can contain either of two/three things", they let you reuse the generic Either<A, B>/Sum2<A, B>/Sum3<A, B, C>.

src/record/types/time.rs requires "an enum that can contain either of two things" and "an enum that can contain either of three things". Up until a couple of days ago when I implemented a temporary workaround to a bug in cargo doc, src/record/types/decimal.rs also required "an enum that can contain either of three things". Rather than create three new specific enums, I chose to reuse the generic enums Sum2 and Sum3 instead.

I hesitated over this as there are only three enums required (and two right now until the cargo doc bug is resolved) so there's minimal gain. If however the codebase grows and more such enums are required, it makes more sense.

Let me know what you think. I'm going to go ahead and add my rationale where you've commented, but I'm more than happy to replace them with separate specific enums.
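The idea can be sketched in a few lines, assuming a two-variant sum type like the one described above (this is an illustrative stand-in for the PR's actual `sum` module, and the `numbers` helper is hypothetical):

```rust
/// A generic two-variant "anonymous" sum type, analogous to bluss's
/// Either<A, B>: an enum reusable wherever "either of two things" is needed.
enum Sum2<A, B> {
    A(A),
    B(B),
}

// Implementing a shared trait (here Iterator, standing in for the PR's
// Reader) lets either variant be used behind one concrete type.
impl<A: Iterator, B: Iterator<Item = A::Item>> Iterator for Sum2<A, B> {
    type Item = A::Item;
    fn next(&mut self) -> Option<Self::Item> {
        // Delegate to whichever variant this value holds.
        match self {
            Sum2::A(a) => a.next(),
            Sum2::B(b) => b.next(),
        }
    }
}

/// Both branches return the SAME type, despite wrapping different iterators.
fn numbers(from_range: bool) -> Sum2<std::ops::Range<i32>, std::vec::IntoIter<i32>> {
    if from_range {
        Sum2::A(0..3)
    } else {
        Sum2::B(vec![9, 9].into_iter())
    }
}

fn main() {
    assert_eq!(numbers(true).collect::<Vec<_>>(), vec![0, 1, 2]);
    assert_eq!(numbers(false).collect::<Vec<_>>(), vec![9, 9]);
}
```

This is why implementing Reader on Sum2/Sum3 is useful: a function can return one of several underlying readers through a single concrete type, without defining a bespoke enum at every call site.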

rust/parquet/src/record/reader.rs
@alecmocatta
Author

Thanks so much @sunchao! I'll respond to each comment shortly. The first 8 of them are on an old version of the PR; I'm not sure how that happened. GitHub is showing a yellow "Outdated" warning by your comments for me, so perhaps refresh at your end, or do the code review directly on GitHub if you're using a third-party tool? Thanks!

@sunchao
Member

sunchao commented Feb 18, 2019

Thanks @alecmocatta. The outdated comments were added some time ago and I forgot to remove them.

On the high level, one thing I'm thinking is how this can play nicely with the upcoming Arrow reader since it will be the standard reader in future. It would be great if this can align with that effort. For instance, is it possible to reuse Arrow types instead of defining a separate set of types here.

Will leave more comments later.

@alecmocatta
Author

On the high level, one thing I'm thinking is how this can play nicely with the upcoming Arrow reader since it will be the standard reader in future. It would be great if this can align with that effort. For instance, is it possible to reuse Arrow types instead of defining a separate set of types here.

I'm not so familiar with Arrow so pardon any ignorance. I'll lay out how I see it, and then you and anyone else are welcome to feed into how my PR could better dovetail with the Arrow work.

The types introduced in this PR are Date, Time, Timestamp, each of which can represent any Parquet-valid value for their respective logical type; and Enum, Json, Bson which simply wrap String/Vec<u8>. These types exist for two reasons:

  • Allow the user to describe the logical schema with appropriate types (i.e. allowing a column of logical type TIME_MILLIS/MICROS to be represented by the Rust type Time rather than i32/i64).
  • Convenience for correctly converting between appropriate types (e.g. Time <-> chrono::NaiveTime).

What data types in the Arrow crate could be reused to this end? From a cursory look I can see the DataType enum for describing logical types, but no types that can represent values of the aforementioned logical types?


As a sidenote, I'm dealing with data stored in the Parquet format that might not fit in memory. For this reason, streaming rows from the Parquet file directly, rather than working with Arrow's in-memory materialization of it, was considered necessary. When the Arrow reader does become the standard reader, being able to stream rows without loading them into memory would be a highly desirable property to maintain.
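The wrapper-type approach described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual `Time` definition: a TIME_MICROS column is physically an `i64` (microseconds since midnight), but exposing it as a dedicated type documents the logical meaning, enforces Parquet-valid ranges, and centralizes conversions:

```rust
/// Hypothetical logical-type wrapper for a Parquet TIME_MICROS value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Time {
    micros: i64, // microseconds since midnight
}

impl Time {
    /// A day has 86_400 seconds = 86_400_000_000 microseconds; reject
    /// anything outside that range rather than storing a raw i64 blindly.
    fn from_micros(micros: i64) -> Option<Self> {
        if (0..86_400_000_000).contains(&micros) {
            Some(Time { micros })
        } else {
            None
        }
    }

    /// Example conversion helper, e.g. toward a TIME_MILLIS representation.
    fn as_millis(&self) -> i64 {
        self.micros / 1_000
    }
}

fn main() {
    let t = Time::from_micros(1_500_000).unwrap(); // 1.5 s after midnight
    assert_eq!(t.as_millis(), 1_500);
    assert!(Time::from_micros(-1).is_none()); // invalid: before midnight
}
```

Compared with a bare `i64`, the wrapper makes the schema self-describing in Rust code (a column of logical type TIME_MICROS reads as `Time`, not `i64`) and gives a single place to hang conversions such as `Time` <-> `chrono::NaiveTime`.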

@sunchao
Member

sunchao commented Mar 3, 2019

What data types in the Arrow crate could be reused to this end? From a cursory look I can see the DataType enum for describing logical types, but no types that can represent values of the aforementioned logical types?

The supported types for Arrow are listed here. I think they cover most of the types you defined, except Enum, Json and Bson, which I think can just be mapped to strings. The Rust crate may not be complete yet.

As a sidenote, I'm dealing with data stored in the Parquet format that might not fit in memory ...

Yes this is definitely required for the Arrow reader as well.

@alecmocatta
Author

Unfortunately I'm unable to dedicate the necessary time to this, so I'm closing it for now.

@wesm
Member

wesm commented Aug 24, 2019

Thanks @alecmocatta -- I hope someone can pick this up in the future!!
