This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Parquet Output Filesize #211

Open
jcgomes90 opened this issue Jan 29, 2019 · 8 comments

@jcgomes90

I have a program that writes out to a Parquet file. Although Parquet is a columnar storage format, I am writing row by row. I can understand this being a performance hit, since the library allows columnar bulk writes.

My concern is the file size. When I view the Parquet file using the Apache reader, everything looks fine. But opening the file in a text editor, it looks like the column title is being written for every column for every row. Is there a configuration option or something that I am missing? The files are much bigger than other Parquet files I've seen that have many more rows with the same schema.

@jcgomes90
Author

The file size seems to come from the library writing the column header every time we write a row.

@sunchao
Owner

sunchao commented Feb 3, 2019

Sorry for the late reply. Have you resolved the issue? If not, can you share the code which does the writing? You should write multiple rows in each row group.

@jcgomes90
Author

I am reading rows as they come (real-time). My schema looks something like this:

let message_type = "message schema
{
     OPTIONAL BOOLEAN a;
     OPTIONAL INT64 b;
     OPTIONAL BOOLEAN c;
}

I call a write_data function for every row; it looks something like this:

let mut row_group_writer = serialized_writer.next_row_group().unwrap();

while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
    match col_writer {
        ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
            typed_writer.write_batch(...).unwrap();
        }
        ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
            // ... same pattern for BOOLEAN columns ...
        }
        ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
            // ... same pattern for BYTE_ARRAY columns ...
        }
        _ => {}
    }
    row_group_writer.close_column(col_writer).unwrap();
}
serialized_writer.close_row_group(row_group_writer).unwrap();

When I am done writing all the rows, I call a close_writing function, which simply calls:
serialized_writer.close();

So essentially, I am calling that write_data function for every row of data, which appears to be adding the column headers for every row.

@sunchao
Owner

sunchao commented Feb 4, 2019

Yes, it seems you are calling close_column and close_row_group for every row, which is not optimal. The latter writes the Parquet row group metadata to the file. Instead, you should keep writing and only close them once all rows (or a fixed number of rows, such as 1024) have been written.
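
A minimal sketch of that buffering pattern, assuming a hypothetical Row struct and a hypothetical flush_rows helper that writes one buffered batch as a single row group (flush_rows is sketched further down the thread); neither name comes from the parquet-rs API:

// Sketch only. `Row` and `flush_rows` are hypothetical helpers, not
// parquet-rs types. Assumed imports (paths as of parquet-rs ~2019):
//   use parquet::file::writer::{FileWriter, SerializedFileWriter};
//   use std::fs::File;
const BATCH_SIZE: usize = 1024;

struct BufferedWriter {
    serialized_writer: SerializedFileWriter<File>,
    buffer: Vec<Row>,
}

impl BufferedWriter {
    // Called once per incoming row; emits a row group only every
    // BATCH_SIZE rows instead of on every row.
    fn write_row(&mut self, row: Row) {
        self.buffer.push(row);
        if self.buffer.len() >= BATCH_SIZE {
            self.flush();
        }
    }

    // Writes all buffered rows as one row group.
    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            flush_rows(&mut self.serialized_writer, &self.buffer);
            self.buffer.clear();
        }
    }

    // Flushes any remaining rows, then finalizes the file footer.
    fn close(mut self) {
        self.flush();
        self.serialized_writer.close().unwrap();
    }
}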

@jcgomes90
Author

jcgomes90 commented Feb 4, 2019

Thanks for the reply. Is there any way to get the last open row group from the row group writer? Maybe I can try closing the row group from the close_writing function.

@jcgomes90
Author

If I am writing the Parquet output row by row, it doesn't seem possible to write multiple rows in one row group, since each row write iterates through all of the columns.
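
It does become possible once the rows are buffered and transposed. A rough sketch of that transposition, using the same API shape as the snippet above; Row and split are hypothetical helpers, and for OPTIONAL columns write_batch takes the non-null values plus definition levels (1 = value present, 0 = null):

// Hypothetical row type mirroring the schema (a: BOOLEAN, b: INT64, c: BOOLEAN).
struct Row {
    a: Option<bool>,
    b: Option<i64>,
    c: Option<bool>,
}

// Splits an optional column into its non-null values and definition levels.
fn split<T: Copy>(col: impl Iterator<Item = Option<T>>) -> (Vec<T>, Vec<i16>) {
    let mut vals = Vec::new();
    let mut defs = Vec::new();
    for v in col {
        match v {
            Some(x) => { vals.push(x); defs.push(1); }
            None => defs.push(0),
        }
    }
    (vals, defs)
}

// Writes one row group containing every buffered row.
fn flush_rows(serialized_writer: &mut SerializedFileWriter<File>, rows: &[Row]) {
    let (a_vals, a_defs) = split(rows.iter().map(|r| r.a));
    let (b_vals, b_defs) = split(rows.iter().map(|r| r.b));
    let (c_vals, c_defs) = split(rows.iter().map(|r| r.c));

    // Columns come back in schema order (a, b, c), so track the index
    // to tell the two BOOLEAN columns apart.
    let mut row_group_writer = serialized_writer.next_row_group().unwrap();
    let mut idx = 0;
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        match col_writer {
            ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
                let (vals, defs) = if idx == 0 {
                    (&a_vals, &a_defs)
                } else {
                    (&c_vals, &c_defs)
                };
                typed_writer.write_batch(vals, Some(defs), None).unwrap();
            }
            ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
                typed_writer.write_batch(&b_vals, Some(&b_defs), None).unwrap();
            }
            _ => unreachable!("schema only contains BOOLEAN and INT64 columns"),
        }
        row_group_writer.close_column(col_writer).unwrap();
        idx += 1;
    }
    serialized_writer.close_row_group(row_group_writer).unwrap();
}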

@brainstorm

Hi @jcgomes90, could you share a bit more of this parquet column-level write code? I'm about to write some code that does a "migration" from one parquet file to another with two extra columns, and would like to have some good working examples to base my work on.

I know that row-write support is not there yet in parquet-rs, but your code seems to be the closest to getting there... Performance is not a big issue in my case, since this is a one-time transformation.

/cc @chris-zen

@brainstorm

brainstorm commented Nov 11, 2019

Oh, never mind, I think I'll use parquet_derive (https://github.com/ccakes/parquet_derive) for now while apache/arrow#4140 gets merged/worked on by @xrl, @sunchao et al 👍
