This repository has been archived by the owner on Jan 11, 2021. It is now read-only.
This issue has been mentioned in multiple tickets including #174. I'd like to have a tracking issue for the design of a Record Writer. I was playing around with a procedural macro design, something which could support #[derive(ParquetRecord)] or a similar name.
I'd like to support all the pointer/non-pointer values inside of the record struct: String, &String, &str, Option<&str>, Option<&String>, &Option<&str>, &Option<&String>, &Option<String>. These ownership styles came up often when loading data from Diesel: sometimes I had an owned string, sometimes a computed optional which yielded a borrowed string, etc.
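The original sample struct did not survive extraction; here is a sketch of what one might look like, using the field names mentioned below (the derive itself is hypothetical and does not exist yet, so it is shown commented out):

```rust
// Hypothetical derive from this proposal; no such macro exists yet.
// #[derive(ParquetRecord)]
struct SampleRecord<'a> {
    owned_val: String,
    borrowed_val: &'a str,
    computed_borrowed_val: Option<&'a str>,
}
```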
The macro would derive an implementation on the struct that writes those fields in the order they are defined: owned_val, borrowed_val, computed_borrowed_val.
Now we need a record writer method on a RowGroup:
```rust
let records = ...; // the user does all their work for this
let parquet_file = ...;
let mut row_group = parquet_file.next_row_group().unwrap();
for record in records {
    row_group.write_record(record);
}
```
where RowGroup#write_record is something like:
```rust
fn write_record(&mut self, r: ParquetRecord) {
    for (file_col, record_val) in self.columns.iter_mut().zip(r.values) {
        file_col.write(record_val);
    }
}
```
and file_col would implement the interface ColumnEasyWriter (these names are total stand-ins, btw). Then we build out all the implementations of ColumnEasyValue for the variations of String, &Option<&str>, etc.
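As a rough sketch of how these stand-in traits might fit together (all names are hypothetical, and the string-only writer is a deliberate simplification), a value knows how to hand itself to a column writer, and a blanket impl on references covers the borrowed variations:

```rust
// Stand-in trait for the column side: accepts an optional string value.
trait ColumnEasyWriter {
    fn write_str(&mut self, v: Option<&str>);
}

// Stand-in trait for the value side: knows how to write itself to a column.
trait ColumnEasyValue {
    fn write_to(&self, col: &mut dyn ColumnEasyWriter);
}

// A couple of the ownership variations listed above.
impl ColumnEasyValue for String {
    fn write_to(&self, col: &mut dyn ColumnEasyWriter) {
        col.write_str(Some(self.as_str()));
    }
}

impl ColumnEasyValue for Option<&str> {
    fn write_to(&self, col: &mut dyn ColumnEasyWriter) {
        col.write_str(*self);
    }
}

// One blanket impl picks up &String, &Option<&str>, etc. for free.
impl<'a, T: ColumnEasyValue + ?Sized> ColumnEasyValue for &'a T {
    fn write_to(&self, col: &mut dyn ColumnEasyWriter) {
        (*self).write_to(col);
    }
}
```

The blanket impl on `&T` is one way to keep the number of hand-written impls down, though it may interact awkwardly with coherence rules as more base types are added.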
Figuring out the responsibilities for enumerating columns, dispatching writes, and keeping the number of traits to a minimum sounds tough! This is the part I feel weakest about.
Other open questions:
Could we build the CoolDataForParquet from the schema string? Think a macro like parquet_record_writer!(schema message { REQUIRED BINARY owned_value (UTF8), ... })? Then the struct and schema are kept in sync and we get more type safety?
Could this be done with more Iterator<Item=...>-style code? Fewer intermediate vectors could be good.
@sadikovi what do you think of this design? You mentioned you were going to work on a high-level record writer, and I was curious if this design is in line with what you wanted.
Thanks for writing the comment. I have been snowed under with the current project, so apologies for that.
I quite like your idea; it is a bit different from mine. I would like to see it done, as it could be a performant solution.
I was going to reuse our Row API for values, mainly because we already have it.
We can add macros to help users write less code, including file creation.
When it comes to the actual value writing, I was planning to reuse the record reader technique for writes, with triplet iterators and value readers (well, in this case, value writers). But yes, I agree, it could get complicated.
Let me know what you think. I will try doing at least something this weekend, I have been a bit off the project for the last two weeks.