
ARROW-4314: [Rust] Strongly-typed reading of Parquet data #3461

Closed
wants to merge 40 commits

Conversation

alecmocatta

See the proposal I made on @sunchao's repository sunchao/parquet-rs#205 for more details.

This aims to let the user opt in to strong typing and substantial performance improvements (2x-7x, see here) by optionally specifying the type of the records that they are iterating over.

It is currently a work in progress. All pre-existing tests succeed, bar those in src/record/api.rs which are commented out as they require reworking. Where relevant, pre-existing tests and benchmarks have been duplicated to make new strongly-typed tests and benchmarks, which all also succeed. I've tried to maintain pre-existing APIs where possible. Some changes have been made to better align with prior art in the Rust ecosystem.

Any feedback while I continue working on it is very welcome! Looking forward to hopefully seeing this merged when it's ready.

JIRA: https://issues.apache.org/jira/browse/ARROW-4314
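To illustrate the idea of opting in to strong typing, here is a minimal, self-contained sketch. It does not use the PR's actual code or the parquet crate's real API; the `Value`, `Row`, `FromRow`, `Stock`, and `typed_iter` names are all hypothetical, standing in for "the caller names a concrete record type and the iterator yields that type instead of dynamically-typed rows":

```rust
// Illustrative model only: a dynamically-typed row is a Vec of Value variants,
// and FromRow converts it into a concrete Rust type chosen by the caller.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Str(String),
}

type Row = Vec<Value>;

trait FromRow: Sized {
    fn from_row(row: &Row) -> Option<Self>;
}

#[derive(Debug, PartialEq)]
struct Stock {
    symbol: String,
    price: i64,
}

impl FromRow for Stock {
    fn from_row(row: &Row) -> Option<Self> {
        // Pattern-match the expected shape; anything else is a schema mismatch.
        match row.as_slice() {
            [Value::Str(symbol), Value::Long(price)] => Some(Stock {
                symbol: symbol.clone(),
                price: *price,
            }),
            _ => None,
        }
    }
}

/// The caller opts in to strong typing by naming T, analogous in spirit to
/// the PR's typed row iterator.
fn typed_iter<T: FromRow>(rows: Vec<Row>) -> impl Iterator<Item = Option<T>> {
    rows.into_iter().map(|row| T::from_row(&row))
}

fn main() {
    let rows = vec![vec![Value::Str("ACME".to_string()), Value::Long(42)]];
    let stocks: Vec<Option<Stock>> = typed_iter::<Stock>(rows).collect();
    assert_eq!(stocks[0], Some(Stock { symbol: "ACME".to_string(), price: 42 }));
}
```

The performance intuition is that converting straight into a concrete struct avoids building and then traversing a dynamically-typed intermediate representation for every record.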

@sunchao
Member

sunchao commented Jan 26, 2019

Thanks @alecmocatta. Will spend some time looking at this in the next few days.

Since this is a huge PR, it would be great if you could break it up somehow and/or strip out the unnecessary parts to help the review. For instance, the benchmarks could themselves be in a different PR.

@alecmocatta
Author

alecmocatta commented Jan 27, 2019

Thanks @sunchao! Good idea – I've split the benchmarks out into their own PR.

I also re-ran the benchmarks now that the bulk of this PR is complete. I changed .collect::<Vec<_>>() to .for_each(drop) (which is what the docs suggest to exhaust an iterator) as growing that Vec was dominating benchmark times.

Before:

test record_reader_10k                     ... bench:  17,110,653 ns/iter (+/- 1,098,487) = 39 MB/s
test record_reader_stock_simulated         ... bench: 122,829,389 ns/iter (+/- 10,006,685) = 10 MB/s
test record_reader_stock_simulated_column  ... bench:  10,915,527 ns/iter (+/- 723,523) = 118 MB/s

After:

test record_reader_10k                    ... bench:  13,136,380 ns/iter (+/- 468,750) = 50 MB/s
test record_reader_10k_typed              ... bench:  11,704,658 ns/iter (+/- 288,187) = 57 MB/s
test record_reader_stock_simulated        ... bench:  37,810,293 ns/iter (+/- 540,046) = 34 MB/s
test record_reader_stock_simulated_typed  ... bench:   9,508,126 ns/iter (+/- 168,147) = 135 MB/s
test record_reader_stock_simulated_column ... bench:   8,386,171 ns/iter (+/- 173,131) = 153 MB/s

So for untyped -> untyped there's a 1.3-3.2x speedup; for untyped -> typed there's a 1.5-13x speedup. And the bigger the parquet file, the bigger the speedup.
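The `.for_each(drop)` idiom mentioned above can be illustrated with a small self-contained sketch (the `exhaust` helper is hypothetical, not part of the PR): it drives an iterator to completion without retaining items, whereas `.collect::<Vec<_>>()` allocates and grows a buffer only to throw it away.

```rust
/// Drives an iterator to completion without retaining items, using the
/// `iter.for_each(drop)` idiom the std docs suggest for exhausting iterators.
fn exhaust<I: Iterator>(iter: I) {
    iter.for_each(drop);
}

fn main() {
    // Count how many items are actually produced while exhausting.
    let mut consumed = 0usize;
    exhaust((0..1_000).map(|i| i * 2).inspect(|_| consumed += 1));
    assert_eq!(consumed, 1_000);
    // Compare: collecting retains every item in a Vec before dropping it.
    let collected: Vec<i64> = (0..1_000).map(|i| i * 2).collect();
    assert_eq!(collected.len(), 1_000);
}
```

For a benchmark measuring only read throughput, the `Vec` growth in the `collect` version is pure overhead, which is why switching to `for_each(drop)` changed the numbers.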

@sunchao
Member

sunchao commented Jan 27, 2019

The benchmark result looks great @alecmocatta . It will be good if we can also separate the changes on typed reader into a different PR for easier review. You can stack that on top of the improvements on untyped reader.

@sunchao
Member

sunchao commented Jan 28, 2019

Also cc @sadikovi who might be interested in this.

@alecmocatta
Author

Just to confirm, by "untyped reader" you're meaning Reader and TreeBuilder?

If that is the case (sorry if I'm misunderstanding you!) then I don't think what you've asked for is really viable: I first implemented the typed readers, and then re-implemented the untyped readers on top of those typed readers. So it's not really possible to submit the untyped reader changes without the typed readers they're built on. Does this make sense?

@sunchao
Member

sunchao commented Jan 28, 2019

I see. That's OK then. Let me take a pass on the PR first.

@alecmocatta
Author

@sunchao Just to let you know, I think this PR is pretty much there.

Unfortunately rust-lang/rust#58011 is breaking cargo doc at the moment, which is making my review of documentation and API changes a little trickier.

Do let me know if you have any feedback regarding any aspect, particularly the file layout and the structure of the API. I wasn't sure how much to merge record::types::* with data_type, for example, so erred towards conservatism and minimal changes.

@sadikovi
Contributor

sadikovi commented Feb 1, 2019

Could you add a bit more documentation for module, functions and inline comments, so people could easily follow and potentially extend/patch the code?

I also noticed that there were changes for Int96 and ByteArray. Would it be possible to submit them first in a separate PR to reduce the amount of code to review? Thanks.

@alecmocatta
Author

Thanks for the feedback @sadikovi! I've gone ahead and postponed my changes to Int96 and ByteArray as you suggest, as well as some other changes not directly required by this PR.

I've also documented every user-facing item (I think; unfortunately it's hard to check while rust-lang/rust#58011 is breaking cargo doc!) as well as every file and internal method, as you suggest.

In the meantime I've been testing this in my own project (i.e. "dogfooding") which has thrown up a couple of issues which are now fixed.

Do let me know how else I can help to get this over the line and merged!

@alecmocatta alecmocatta changed the title [WIP] ARROW-4314: [Rust] Strongly-typed reading of Parquet data ARROW-4314: [Rust] Strongly-typed reading of Parquet data Feb 7, 2019
@xhochy
Member

xhochy commented Feb 8, 2019

Rebased branch on master.

@alecmocatta
Author

^ Thanks @xhochy.

@sunchao and @sadikovi, have you had a chance to make progress reviewing this? Do let me know if there's anything I can do.

I've worked around the aforementioned rust-lang/rust#58011 that was breaking cargo doc by temporarily removing existential types. Hopefully this makes review easier.

cc @andygrove – in case this is relevant to your work with Arrow and/or DataFusion. I know Wes cc'd you on my related PR.

Member

@sunchao sunchao left a comment

Thanks @alecmocatta and sorry for the delay. I just got time to look at this. Left some early comments and I'll add more later.

rust/parquet/src/data_type.rs
rust/parquet/src/encodings/encoding.rs
rust/parquet/src/encodings/rle.rs
_ => panic!("Cannot call get_physical_type() on a non-primitive type"),
}
}

/// Gets the type length of this primitive type.
/// Note that this will panic if called on a non-primitive type.
pub fn get_type_length(&self) -> i32 {
Member

please put these in a separate PR.

Author

These are used a bunch of times by src/record/types/value.rs; I needed functions to avoid duplicating this logic many times and decided this was the best place for them. I could move them there, or make these pub(crate)? Thoughts?

Member

I didn't mean to move these to a different place. Rather, maybe you can put these few changes in a different (and much smaller) PR and have it reviewed/committed, then rebase this PR to use them?

rust/parquet/src/record/mod.rs

/// This trait is implemented by Schemas so that they can be printed as Parquet schema
/// strings.
pub trait Schema: Debug {
Member

Why do we need this? Is it only for display purposes? Can we reuse Type?

Author

If the user uses the wrong type with file_reader.get_row_iter<T>() then ideally the error message includes the actual schema of the file, as well as the schema dictated by their type. This lets the user see where they differ and thus what's wrong.

The Schema trait exists so that a schema can be printed from just their type. This wasn't possible using Type as that requires a full actual schema, which cannot be attained from just the type.
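A minimal sketch of what such a trait might look like, assuming the `r` (repetition) and `name` parameters mentioned later in this thread; the actual signature in the PR may differ, and `I64Schema`/`DisplaySchema` here are purely hypothetical:

```rust
use std::fmt;
use std::marker::PhantomData;

#[derive(Debug, Clone, Copy)]
enum Repetition {
    Required,
    Optional,
}

/// Hypothetical shape of the Schema trait: print the Parquet schema implied
/// by a Rust type alone, without needing a file's actual schema.
trait Schema {
    fn fmt(f: &mut fmt::Formatter<'_>, r: Option<Repetition>, name: Option<&str>) -> fmt::Result;
}

/// Example implementation: an i64 column.
struct I64Schema;
impl Schema for I64Schema {
    fn fmt(f: &mut fmt::Formatter<'_>, r: Option<Repetition>, name: Option<&str>) -> fmt::Result {
        let rep = match r.unwrap_or(Repetition::Required) {
            Repetition::Required => "required",
            Repetition::Optional => "optional",
        };
        write!(f, "{} int64 {};", rep, name.unwrap_or("element"))
    }
}

/// Adapter so a Schema can be rendered with to_string().
struct DisplaySchema<S: Schema>(PhantomData<S>);
impl<S: Schema> fmt::Display for DisplaySchema<S> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        S::fmt(f, None, Some("id"))
    }
}

fn main() {
    let rendered = DisplaySchema::<I64Schema>(PhantomData).to_string();
    assert_eq!(rendered, "required int64 id;");
}
```

The key property is that `Schema::fmt` is an associated function with no `self`, so a schema string can be produced from the type parameter alone and compared against the file's actual schema in an error message.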

Member

OK. It's a little confusing that fmt is a public interface for the Schema type though - to someone who is new to this it may be hard to use (e.g., what values should I pass for the r and name parameters?).

// ----------------------------------------------------------------------
// Implementations for "anonymous" sum types

impl<A, B> Reader for sum::Sum2<A, B>
Member

This is pretty confusing - why do we implement Reader on Sum2 and Sum3?

Author

If you're familiar with bluss's Either crate, then Sum2 is just the same – an enum with two generic variants. Sum3 is the same, but with three generic variants. Rather than creating a new enum every time you need "an enum that can contain either of two/three things", they let you reuse the generic Either<A, B>/Sum2<A, B>/Sum3<A, B, C>.

src/record/types/time.rs requires "an enum that can contain either of two things" and "an enum that can contain either of three things". Up until a couple of days ago when I implemented a temporary workaround to a bug in cargo doc, src/record/types/decimal.rs also required "an enum that can contain either of three things". Rather than create three new specific enums, I chose to reuse the generic enums Sum2 and Sum3 instead.

I hesitated over this as there are only three enums required (and two right now until the cargo doc bug is resolved) so there's minimal gain. If however the codebase grows and more such enums are required, it makes more sense.

Let me know what you think. I'm going to go ahead and add my rationale where you've commented, but I'm more than happy to replace them with separate specific enums.
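The idea can be sketched in a few lines, assuming a two-variant sum type like the one described above (this is an illustrative stand-in for the PR's actual `sum` module, and the `numbers` helper is hypothetical):

```rust
/// A generic two-variant "anonymous" sum type, analogous to bluss's
/// Either<A, B>: an enum reusable wherever "either of two things" is needed.
enum Sum2<A, B> {
    A(A),
    B(B),
}

// Implementing a shared trait (here Iterator, standing in for the PR's
// Reader) lets either variant be used behind one concrete type.
impl<A: Iterator, B: Iterator<Item = A::Item>> Iterator for Sum2<A, B> {
    type Item = A::Item;
    fn next(&mut self) -> Option<Self::Item> {
        // Delegate to whichever variant this value holds.
        match self {
            Sum2::A(a) => a.next(),
            Sum2::B(b) => b.next(),
        }
    }
}

/// Both branches return the SAME type, despite wrapping different iterators.
fn numbers(from_range: bool) -> Sum2<std::ops::Range<i32>, std::vec::IntoIter<i32>> {
    if from_range {
        Sum2::A(0..3)
    } else {
        Sum2::B(vec![9, 9].into_iter())
    }
}

fn main() {
    assert_eq!(numbers(true).collect::<Vec<_>>(), vec![0, 1, 2]);
    assert_eq!(numbers(false).collect::<Vec<_>>(), vec![9, 9]);
}
```

This is why implementing Reader on Sum2/Sum3 is useful: a function can return one of several underlying readers through a single concrete type, without defining a bespoke enum at every call site.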

rust/parquet/src/record/reader.rs
@alecmocatta
Author

Thanks so much @sunchao! I'll respond to each comment shortly. The first 8 of them are on an old version of the PR; I'm not sure how that happened. GitHub is showing a yellow "Outdated" warning by your comments for me, so perhaps refresh at your end, or do the code review directly on GitHub if you're using a third-party tool? Thanks!

@sunchao
Member

sunchao commented Feb 18, 2019

Thanks @alecmocatta. The outdated comments were added some time ago and I forgot to remove them.

On the high level, one thing I'm thinking is how this can play nicely with the upcoming Arrow reader since it will be the standard reader in future. It would be great if this can align with that effort. For instance, is it possible to reuse Arrow types instead of defining a separate set of types here.

Will leave more comments later.

@alecmocatta
Author

On the high level, one thing I'm thinking is how this can play nicely with the upcoming Arrow reader since it will be the standard reader in future. It would be great if this can align with that effort. For instance, is it possible to reuse Arrow types instead of defining a separate set of types here.

I'm not so familiar with Arrow so pardon any ignorance. I'll lay out how I see it, and then you and anyone else are welcome to feed into how my PR could better dovetail with the Arrow work.

The types introduced in this PR are Date, Time, Timestamp, each of which can represent any Parquet-valid value for their respective logical type; and Enum, Json, Bson which simply wrap String/Vec<u8>. These types exist for two reasons:

  • Allow the user to describe the logical schema with appropriate types (i.e. allowing a column of logical type TIME_MILLIS/MICROS to be represented by the Rust type Time rather than i32/i64).
  • Convenience for correctly converting between appropriate types (e.g. Time <-> chrono::NaiveTime).

What data types in the Arrow crate could be reused to this end? From a cursory look I can see the DataType enum for describing logical types, but no types that can represent values of the aforementioned logical types?


As a sidenote, I'm dealing with data stored in the Parquet format that might not fit in memory. For this reason, streaming rows from the Parquet file directly, rather than working with Arrow's in-memory materialization of it, was considered necessary. When the Arrow reader does become the standard reader, being able to stream rows without loading them into memory would be a highly desirable property to maintain.
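The wrapper-type approach described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual `Time` definition: a TIME_MICROS column is physically an `i64` (microseconds since midnight), but exposing it as a dedicated type documents the logical meaning, enforces Parquet-valid ranges, and centralizes conversions:

```rust
/// Hypothetical logical-type wrapper for a Parquet TIME_MICROS value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Time {
    micros: i64, // microseconds since midnight
}

impl Time {
    /// A day has 86_400 seconds = 86_400_000_000 microseconds; reject
    /// anything outside that range rather than storing a raw i64 blindly.
    fn from_micros(micros: i64) -> Option<Self> {
        if (0..86_400_000_000).contains(&micros) {
            Some(Time { micros })
        } else {
            None
        }
    }

    /// Example conversion helper, e.g. toward a TIME_MILLIS representation.
    fn as_millis(&self) -> i64 {
        self.micros / 1_000
    }
}

fn main() {
    let t = Time::from_micros(1_500_000).unwrap(); // 1.5 s after midnight
    assert_eq!(t.as_millis(), 1_500);
    assert!(Time::from_micros(-1).is_none()); // invalid: before midnight
}
```

Compared with a bare `i64`, the wrapper makes the schema self-describing in Rust code (a column of logical type TIME_MICROS reads as `Time`, not `i64`) and gives a single place to hang conversions such as `Time` <-> `chrono::NaiveTime`.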

@sunchao
Member

sunchao commented Mar 3, 2019

What data types in the Arrow crate could be reused to this end? From a cursory look I can see the DataType enum for describing logical types, but no types that can represent values of the aforementioned logical types?

The supported types for Arrow are listed here. I think they cover most of the types you defined, except Enum, Json and Bson, which I think can just be mapped to strings. The Rust crate may not be complete yet.

As a sidenote, I'm dealing with data stored in the Parquet format that might not fit in memory ...

Yes this is definitely required for the Arrow reader as well.

@alecmocatta
Author

Unfortunately I'm unable to dedicate the necessary time to this, so I'm closing it for now.

@wesm
Member

wesm commented Aug 24, 2019

Thanks @alecmocatta -- I hope someone can pick this up in the future!!
