Parameterize reading of rows with a type parameter, checked against the schema, and used to specialize the deserialization of rows #205
Thanks for the write-up and your interest. I have a few questions:
@sadikovi Thanks for your response! Good questions.
I haven't implemented this yet, but I intend to use the same approach as:

```rust
struct Record {
    #[parquet(rename = "¡field_name!")]
    field_name: u64,
}
```

such that any column name can be referred to, while keeping the struct field name valid.

I believe I've used all of the relevant work and workarounds that are currently in the codebase to handle this correctly, i.e. if the user provides
Of course! I'm in the middle of this work so there's still a way to go, but the early numbers show a 2-7x improvement. Before:
After:
32 is an arbitrary number; I can make it higher or lower if you think it's appropriate – it's a trade-off of convenience vs compile time. Groups can be deserialized to tuples, but also to structs annotated with `#[derive(ParquetDeserialize)]`.
I would like to see the code, because you are using the existing record assembly machinery and seeing these improvements. I am curious what changes you have made to improve the throughput? Thanks. How do you do projection on legacy parquet files then? Do you prune columns from the already-parsed schema?
@sadikovi An estimate based on my experimentation so far: 50% of the speedup is from avoiding allocation (and the resulting optimisations the compiler can make; allocation is an optimisation blocker), 25% is from specialisation, and 25% is from various other changes. I have further to go, so I'm hopeful I'll speed it up a bit more – in theory the bottleneck should be syscalls and decompression here rather than anything else, so that's what I'm aiming for. The first method on the new trait mentioned above ("produce a `Self::Schema` given the user-provided type and a `Type`") is what matches the user-provided type against the file's parsed schema. Projection occurs as a byproduct of this recursive "matching" process – if a struct omits a field that is in the actual schema, then its column is ignored, no reader is generated, and no value is read for it.
Thanks @alecmocatta! The performance improvement looks very impressive 👍! Looking forward to a PR on this 😄.
Allocation of what? How do you avoid allocation – you still need to return rows? Are you using a mutable row reference instead?
Hi @alecmocatta, just curious whether there's any update on this?
@sunchao I've been on holiday but will PR this next week. I intend to open a JIRA and PR against https://github.com/apache/arrow/tree/master/rust/parquet/src – is that the right thing to do?
Thanks. Yes, filing a JIRA against arrow is the right thing to do. Looking forward to it!
My fork is here: https://github.com/alecmocatta/parquet-rs – it currently triggers an ICE on usage (rust-lang/rust#53443), and much is currently commented out until I finish refactoring. I'll investigate, finish the refactor, clean up the code, and rebase on https://github.com/apache/arrow in the coming week or so.
Thanks @alecmocatta! Could you open a pull request in arrow? It's a pretty big change and I'll take some time to look at it.
Proposal
Parameterize reading (and potentially writing) of rows with a type parameter, which is checked against the file's schema, and used to specialize the deserialization (and potentially serialization) of rows.
Achieve this by adding a type parameter to `get_row_iter()` and `RowIter` for the user to specify the type of the returned rows. In cases where the type information is not known, a generic enum that can represent any valid type can be used, which would preserve the current dynamically-typed functionality. Type information is also leveraged to provide the projection.

What's currently done:
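A sketch of the current dynamically-typed usage (assuming the crate's `SerializedFileReader` and `get_row_iter` API; the file name is hypothetical):

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    // `None` means no projection: every column is read. Each `row` is a
    // dynamically-typed `Row`, assembled on the heap at runtime.
    for row in reader.get_row_iter(None)? {
        println!("{:?}", row);
    }
    Ok(())
}
```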
Which under this proposal becomes:
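A sketch of the proposed strongly-typed usage – the derive, trait name, and turbofish come from this proposal, while the struct, its fields, and the exact iterator signature are illustrative assumptions:

```rust
use parquet::file::reader::FileReader;

// Hypothetical row type; under this proposal, omitting one of the file's
// columns from the struct doubles as a projection.
#[derive(ParquetDeserialize)]
struct MyRow {
    id: u64,
    name: String,
}

fn read(reader: &impl FileReader) -> Result<(), Box<dyn std::error::Error>> {
    // Rows come back as `MyRow` values: no per-row heap allocation, and the
    // per-column decode logic can be specialised at compile time.
    for row in reader.get_row_iter::<MyRow>()? {
        println!("{} {}", row.id, row.name);
    }
    Ok(())
}
```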
Upsides
Currently, allocations are done for each row. This impacts performance (see #140). With a user-specified row type, no dynamic allocation needs to occur except for Lists/Maps within the row.
Currently, the decode logic is largely generic – i.e. there are lots of nested enums which branch on the type. User-specified row type information would enable the logic to be specialised and optimised by the compiler.
Together these would offer a substantial boost to performance.
Projections are typically written as text and parsed with `parse_message_type()`. The user-specified row type can instead be used as the projection, which saves having to keep the two in sync.

Downsides
More sophisticated API. The old behaviour would however still be available simply with `file_reader.get_row_iter::<Row>()`.

Breaking changes: implementation details like the precise API of the `Reader` enum are difficult to maintain exactly, and my current implementation doesn't attempt to. As such, I would suggest that if these proposed changes are accepted, a semver bump to 0.5 is made.

Prior art
Many/most Rust implementations of serialization/deserialization leverage type information to specialise the logic and avoid allocations. An example of leveraging an enum to enable the spectrum from untyped to strongly-typed (i.e. gradual typing) is `serde_json::Value`.

Implementation
A new trait (which I'm currently calling `ParquetDeserialize`), implemented on `u8`, `i8`, `u16`, `i16`, `u32`, `i32`, `u64`, `i64`, `f32`, `f64`, `Vec<u8>`, `String`, `Decimal`, `Timestamp`, `Row`, `List<T>`, `Map<K,V>`, as well as `Option<T>` of each of the aforementioned.

This trait has two associated types:

- `Schema` – the typed schema (akin to `Type`)
- `Reader` – the typed reader (akin to `Reader`)

And has methods to:

- produce a `Self::Schema` given the user-provided type and a `Type`; this returns a helpful error if they don't match
- read a `Self` from a `Self::Reader`

It is implemented on tuples (up to length 32), where it is valid for reading group types that exactly match the length and types of the tuple (i.e. ignoring names). This is intended as a convenience for reading group types without having to create a struct.
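A minimal sketch of the trait's shape; every name and signature below is an illustrative assumption rather than the final API:

```rust
use parquet::errors::ParquetError;
use parquet::schema::types::Type;

trait ParquetDeserialize: Sized {
    /// The typed schema (akin to `Type`).
    type Schema;
    /// The typed reader (akin to `Reader`).
    type Reader;

    /// Match the user-provided type against the file's schema, producing a
    /// typed schema, or a helpful error if the two don't line up.
    fn schema(file_schema: &Type) -> Result<Self::Schema, ParquetError>;

    /// Deserialize one value of `Self` from the typed reader.
    fn read(reader: &mut Self::Reader) -> Result<Self, ParquetError>;
}
```

With the tuple implementations, something like `reader.get_row_iter::<(u64, String)>()` would read a two-column group without declaring a struct.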
It can be derived on structs with `#[derive(ParquetDeserialize)]`, where it is valid for reading group types that have all of the field names as columns with matching types. Projection can be achieved by omitting fields from the struct.

Projection (avoiding unnecessary reading by specifying which columns you're interested in) would change from being given as a `Type` (which in practice is usually calculated from the text version of the schema) to being inferred directly from the user-specified type. The assumption here is that if the user has knowledge about the schema of the file to use as a projection, they should include that knowledge in the type in any case.

`List` and `Map` would become typed, i.e. `List<T>` and `Map<K,V>`. They can be dynamically-typed akin to the current implementation like so: `List<Value>` and `Map<Primitive,Value>` (`Primitive` for the key, as that is a restriction imposed by the format).
for the key as that is a restriction imposed by the format).A new generic
Value
enum that can represent any valid type, which preserves the current dynamically-typed functionality:as well as a
Primitive
enum that excludesList
,Map
andRow
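A sketch of what these two enums might contain, with variants assumed from the list of implemented types above (`Decimal`, `Timestamp`, `Row`, and the typed `List`/`Map` containers are taken as in scope):

```rust
// Primitive values only – these are the only valid map keys.
enum Primitive {
    U8(u8), I8(i8),
    U16(u16), I16(i16),
    U32(u32), I32(i32),
    U64(u64), I64(i64),
    F32(f32), F64(f64),
    ByteArray(Vec<u8>),
    String(String),
    Decimal(Decimal),
    Timestamp(Timestamp),
}

// Any valid value, including groups and nested containers.
enum Value {
    Primitive(Primitive),
    Row(Row),
    List(List<Value>),
    Map(Map<Primitive, Value>),
    Option(Option<Box<Value>>),
}
```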
Interaction with other features/work
I'm not so familiar with the Row writing mechanisms, so I'm currently unsure how they are impacted; #197 and #203 are relevant issues. There is potential for the exact schema that is written to be taken from the type (as per #203), though it needs to be overridable, as there are multiple schemas that map to a single type. For example, there are 6 valid schemas for a `List<T>`, so the ability to provide a custom schema, to specify which one if not the default, is necessary.
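For illustration, two of those valid representations, written as message-type text and parsed with `parse_message_type()` (column and element names hypothetical):

```rust
use parquet::errors::ParquetError;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::Type;

fn two_list_schemas() -> Result<(Type, Type), ParquetError> {
    // Three-level (modern) list encoding.
    let three_level = parse_message_type(
        "message schema {
            optional group my_list (LIST) {
                repeated group list {
                    optional int32 element;
                }
            }
        }",
    )?;
    // Two-level (legacy) encoding of the same logical list.
    let two_level = parse_message_type(
        "message schema {
            optional group my_list (LIST) {
                repeated int32 element;
            }
        }",
    )?;
    Ok((three_level, two_level))
}
```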
Status
I've implemented the bulk of this, and it's running successfully on all the test data. I'm looking for feedback as to how best to contribute my work back to this project. I'd like to make a PR later this week if it's ready (probably after Christmas if it's not), but wanted to let the community know that this is being actively worked on to avoid any duplication of effort, and garner any thoughts and feedback in the meantime.