This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Derive parquet schema from struct #203

Open
xrl opened this issue Dec 6, 2018 · 0 comments

xrl commented Dec 6, 2018

If users write to their parquet files through an intermediate struct, we can help them out by generating the parquet schema from the struct.

I like to model rows of my parquet file using structs, for example:

struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>
}

and then I have to manually track the schema, writing something by hand like:

lazy_static! {
    static ref purchase_orders_schema: &'static str = "message schema {
        REQUIRED INT32 id;
        OPTIONAL BINARY ad_po_number (UTF8);
    }";
}

and any time I make a change to the PurchaseOrderRecord I have to manually update purchase_orders_schema or else I get runtime errors.

We can avoid this whole situation by providing a derive procedural macro. I was thinking of something named ParquetSchema, to be used like this:

#[derive(ParquetSchema)]
struct PurchaseOrderRecord<'a> {
  ...
}

which would derive a value and an accessor trait. With the macro fully expanded you would get something like:

struct PurchaseOrderRecord<'a> {
  ...
}
lazy_static! {
  static ref purchase_order_schema: parquet::schema::types::Type = ...
}

What's interesting here is that I can build the concrete schema enum at compile time.
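To make the idea concrete, here is a rough, dependency-free sketch of the per-field mapping such a derive might rely on. The names `ParquetField`, `schema_line` and `message_schema` are hypothetical, not part of the parquet crate. The sketch emits the schema as a message-type string, whereas a real macro would presumably build a `parquet::schema::types::Type` instead.

```rust
/// Hypothetical trait (not from the parquet crate): maps a Rust field type
/// to the pieces of one line in a Parquet message schema.
trait ParquetField {
    const REPETITION: &'static str;
    const PHYSICAL: &'static str;
    const LOGICAL: Option<&'static str>;
}

impl ParquetField for i32 {
    const REPETITION: &'static str = "REQUIRED";
    const PHYSICAL: &'static str = "INT32";
    const LOGICAL: Option<&'static str> = None;
}

impl ParquetField for String {
    const REPETITION: &'static str = "REQUIRED";
    const PHYSICAL: &'static str = "BINARY";
    const LOGICAL: Option<&'static str> = Some("UTF8");
}

/// `Option<T>` keeps T's types but makes the column OPTIONAL.
impl<T: ParquetField> ParquetField for Option<T> {
    const REPETITION: &'static str = "OPTIONAL";
    const PHYSICAL: &'static str = T::PHYSICAL;
    const LOGICAL: Option<&'static str> = T::LOGICAL;
}

/// A reference to a field maps exactly like the field itself.
impl<'a, T: ParquetField> ParquetField for &'a T {
    const REPETITION: &'static str = T::REPETITION;
    const PHYSICAL: &'static str = T::PHYSICAL;
    const LOGICAL: Option<&'static str> = T::LOGICAL;
}

/// Renders one schema line, e.g. `REQUIRED INT32 id;`.
fn schema_line<T: ParquetField>(name: &str) -> String {
    match T::LOGICAL {
        Some(logical) => format!("{} {} {} ({});", T::REPETITION, T::PHYSICAL, name, logical),
        None => format!("{} {} {};", T::REPETITION, T::PHYSICAL, name),
    }
}

#[allow(dead_code)]
struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>,
}

// Roughly what the derive could generate: one `schema_line` call per field,
// using the field's declared type and name.
impl<'a> PurchaseOrderRecord<'a> {
    fn message_schema() -> String {
        let fields = [
            schema_line::<i32>("id"),
            schema_line::<&'a Option<String>>("ad_po_number"),
        ];
        format!("message schema {{\n  {}\n}}", fields.join("\n  "))
    }
}
```

With this in place, changing a field's type in the struct changes the generated schema line automatically, which is the drift the hand-written string cannot catch.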

This functionality would remove error-prone steps for writers/schemas. This is a big pain point for me 😄.

The dream would be to enable functionality like:

#[derive(ParquetSchema, ParquetRecordWriter)]
struct PurchaseOrderRecord<'a> {
  ...
}

and then users can focus on their data and the parquet stuff is taken care of!

Also, I glossed over it, but we may want some kind of schema accessor trait to map a struct type to the macro-generated static schema type enum:

trait Schema {
  fn schema() -> &'static parquet::schema::types::Type;
}

which would allow the user to access the schema anywhere with:

PurchaseOrderRecord::schema()
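
As a sketch of how that accessor could be wired up without the macro, here is a hand-written impl of the kind the derive might emit. `SchemaType` is a made-up stand-in for `parquet::schema::types::Type` so the example has no dependencies, and `std::sync::OnceLock` is used in place of `lazy_static!`. Both choices are assumptions for illustration only.

```rust
use std::sync::OnceLock;

/// Stand-in for `parquet::schema::types::Type` (hypothetical, simplified).
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct SchemaType {
    name: &'static str,
    /// (column name, "REPETITION PHYSICAL (LOGICAL)") pairs.
    fields: Vec<(&'static str, &'static str)>,
}

trait Schema {
    fn schema() -> &'static SchemaType;
}

#[allow(dead_code)]
struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>,
}

// What the derive might expand to: a lazily initialised static that is
// built once on first access and shared by every later call.
impl<'a> Schema for PurchaseOrderRecord<'a> {
    fn schema() -> &'static SchemaType {
        static SCHEMA: OnceLock<SchemaType> = OnceLock::new();
        SCHEMA.get_or_init(|| SchemaType {
            name: "schema",
            fields: vec![
                ("id", "REQUIRED INT32"),
                ("ad_po_number", "OPTIONAL BINARY (UTF8)"),
            ],
        })
    }
}
```

Every call to `PurchaseOrderRecord::schema()` then returns a reference to the same value, so writers can fetch the schema wherever they need it without passing it around.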