This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Parquet Output Filesize #211

Open
jcgomes90 opened this issue Jan 29, 2019 · 8 comments

@jcgomes90

I have a program that writes out to a Parquet file. Although Parquet is a columnar storage format, I am writing row by row. I can understand this being a performance hit, since the library allows columnar bulk writes.

My concern is the file size. When I view the Parquet file using the Apache reader, everything looks fine. But opening the file in a text editor, it looks like the column title is being written for every column for every row. Is there a configuration option or something that I am missing? The files are much bigger than other Parquet files I've seen that have many more rows with the same schema.

@jcgomes90
Author

The file size seems to come from the library writing the column header every time we write a row.

@sunchao
Owner

sunchao commented Feb 3, 2019

Sorry for the late reply. Have you resolved the issue? If not, can you share the code which does the writing? You should write multiple rows in each row group.

@jcgomes90
Author

I am reading rows as they come (real-time). My schema looks something like this:

let message_type = "message schema
{
     OPTIONAL BOOLEAN a;
     OPTIONAL INT64 b;
     OPTIONAL BOOLEAN c;
}

I call a write_data function for every row; it looks something like this:

let mut row_group_writer = serialized_writer.next_row_group().unwrap();

while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
    match col_writer {
        ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
            typed_writer.write_batch(...).unwrap();
        }
        ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
            // ... same pattern for BOOLEAN columns ...
        }
        ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
            // ... same pattern for BYTE_ARRAY columns ...
        }
        _ => {}
    }
    row_group_writer.close_column(col_writer).unwrap();
}
serialized_writer.close_row_group(row_group_writer).unwrap();

When I am done writing all the rows, I call a close_writing function, which simply calls:
serialized_writer.close();

So essentially, I am calling that write_data function for every row of data, which appears to be adding the column headers for every row.

@sunchao
Owner

sunchao commented Feb 4, 2019

Yes, it seems you are calling close_column and close_row_group for every row, which is not optimal. The latter writes the Parquet row group metadata to the file. Instead, you should keep writing and only close them once all rows (or a fixed number of rows, such as 1024) have been written.
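
A minimal sketch of that buffering pattern, assuming a hypothetical Row struct and a hypothetical flush_rows helper that writes one buffered batch as a single row group (flush_rows is sketched further down the thread); neither name comes from the parquet-rs API:

// Sketch only. `Row` and `flush_rows` are hypothetical helpers, not
// parquet-rs types. Assumed imports (paths as of parquet-rs ~2019):
//   use parquet::file::writer::{FileWriter, SerializedFileWriter};
//   use std::fs::File;
const BATCH_SIZE: usize = 1024;

struct BufferedWriter {
    serialized_writer: SerializedFileWriter<File>,
    buffer: Vec<Row>,
}

impl BufferedWriter {
    // Called once per incoming row; emits a row group only every
    // BATCH_SIZE rows instead of on every row.
    fn write_row(&mut self, row: Row) {
        self.buffer.push(row);
        if self.buffer.len() >= BATCH_SIZE {
            self.flush();
        }
    }

    // Writes all buffered rows as one row group.
    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            flush_rows(&mut self.serialized_writer, &self.buffer);
            self.buffer.clear();
        }
    }

    // Flushes any remaining rows, then finalizes the file footer.
    fn close(mut self) {
        self.flush();
        self.serialized_writer.close().unwrap();
    }
}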

@jcgomes90
Author

jcgomes90 commented Feb 4, 2019

Thanks for the reply. Is there any way to get the last open row group from the row group writer? Maybe I can try closing the row group from the close_writing function.

@jcgomes90
Author

If I am writing the Parquet output row by row, it doesn't seem possible to write multiple rows in one row group, since each row write iterates through all of the columns.
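
It does become possible once the rows are buffered and transposed. A rough sketch of that transposition, using the same API shape as the snippet above; Row and split are hypothetical helpers, and for OPTIONAL columns write_batch takes the non-null values plus definition levels (1 = value present, 0 = null):

// Hypothetical row type mirroring the schema (a: BOOLEAN, b: INT64, c: BOOLEAN).
struct Row {
    a: Option<bool>,
    b: Option<i64>,
    c: Option<bool>,
}

// Splits an optional column into its non-null values and definition levels.
fn split<T: Copy>(col: impl Iterator<Item = Option<T>>) -> (Vec<T>, Vec<i16>) {
    let mut vals = Vec::new();
    let mut defs = Vec::new();
    for v in col {
        match v {
            Some(x) => { vals.push(x); defs.push(1); }
            None => defs.push(0),
        }
    }
    (vals, defs)
}

// Writes one row group containing every buffered row.
fn flush_rows(serialized_writer: &mut SerializedFileWriter<File>, rows: &[Row]) {
    let (a_vals, a_defs) = split(rows.iter().map(|r| r.a));
    let (b_vals, b_defs) = split(rows.iter().map(|r| r.b));
    let (c_vals, c_defs) = split(rows.iter().map(|r| r.c));

    // Columns come back in schema order (a, b, c), so track the index
    // to tell the two BOOLEAN columns apart.
    let mut row_group_writer = serialized_writer.next_row_group().unwrap();
    let mut idx = 0;
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        match col_writer {
            ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
                let (vals, defs) = if idx == 0 {
                    (&a_vals, &a_defs)
                } else {
                    (&c_vals, &c_defs)
                };
                typed_writer.write_batch(vals, Some(defs), None).unwrap();
            }
            ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
                typed_writer.write_batch(&b_vals, Some(&b_defs), None).unwrap();
            }
            _ => unreachable!("schema only contains BOOLEAN and INT64 columns"),
        }
        row_group_writer.close_column(col_writer).unwrap();
        idx += 1;
    }
    serialized_writer.close_row_group(row_group_writer).unwrap();
}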

@brainstorm

Hi @jcgomes90, could you share a bit more of this parquet column-level write code? I'm about to write some code that does a "migration" from one parquet file to another with two extra columns, and would like to have some good working examples to base my work on.

I know that row-write support is not there yet in parquet-rs, but your code seems to be the closest to getting there... Performance is not a big issue in my case, since this is a one-time transformation.

/cc @chris-zen

@brainstorm

brainstorm commented Nov 11, 2019

Oh, never mind, I think I'll use parquet_derive (https://github.com/ccakes/parquet_derive) for now while apache/arrow#4140 gets merged/worked on by @xrl, @sunchao et al 👍
