Parquet Output Filesize #211
Comments
The file size looks to be coming from the fact that the library is writing the column headers every time we write a row.
Sorry for the late reply. Have you resolved the issue? If not, can you share the code that does the writing? You should write multiple rows in each row group.
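For illustration, here is a minimal sketch of what "multiple rows per row group" looks like with the parquet crate's older trait-based writer API (SerializedFileWriter / FileWriter / RowGroupWriter). The two-column schema, file name, and values are made up, and newer releases of the crate restructure these types, so treat it as a sketch rather than the exact API:

```rust
use std::{fs::File, rc::Rc};

use parquet::column::writer::ColumnWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, RowGroupWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn main() {
    // Hypothetical two-column schema, purely for illustration.
    let schema = Rc::new(
        parse_message_type(
            "message example {
                REQUIRED INT64 ts;
                REQUIRED DOUBLE value;
            }",
        )
        .expect("valid schema"),
    );
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("out.parquet").expect("create file");
    let mut writer = SerializedFileWriter::new(file, schema, props).expect("create writer");

    // Many rows, batched per column: all of them go into ONE row group.
    let ts: Vec<i64> = (0..1000).collect();
    let values: Vec<f64> = (0..1000).map(|i| i as f64 * 0.5).collect();

    let mut row_group = writer.next_row_group().expect("new row group");
    while let Some(mut col_writer) = row_group.next_column().expect("next column") {
        match col_writer {
            ColumnWriter::Int64ColumnWriter(ref mut typed) => {
                typed.write_batch(&ts, None, None).expect("write ts");
            }
            ColumnWriter::DoubleColumnWriter(ref mut typed) => {
                typed.write_batch(&values, None, None).expect("write values");
            }
            _ => unreachable!("schema only has INT64 and DOUBLE columns"),
        }
        row_group.close_column(col_writer).expect("close column");
    }
    writer.close_row_group(row_group).expect("close row group");
    writer.close().expect("close writer");
}
```

The key point is that each column receives a single write_batch call covering all the rows in the batch, and close_row_group runs once per batch rather than once per row.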
I am reading rows as they come (real-time). My schema looks something like this:
I call a write_data function for every row, which looks something like this:
When I am done writing all the rows, I call a close_writing function which simply:

So essentially, I am calling that write_data function for every row of data, which looks to be adding the column headers for every row.
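The original schema and snippets were not preserved in this thread, but a hypothetical reconstruction of the per-row pattern being described might look like the following (same assumed two-column schema and older trait-based API as in the sketch above):

```rust
use parquet::column::writer::ColumnWriter;
use parquet::errors::Result;
use parquet::file::writer::{FileWriter, RowGroupWriter};

// Each call opens and closes a brand-new row group for a single row, so the
// per-column chunk metadata is written once per row; this is what inflates
// the file.
fn write_data(writer: &mut dyn FileWriter, ts: i64, value: f64) -> Result<()> {
    let mut row_group = writer.next_row_group()?; // new row group for ONE row
    while let Some(mut col_writer) = row_group.next_column()? {
        match col_writer {
            ColumnWriter::Int64ColumnWriter(ref mut typed) => {
                typed.write_batch(&[ts], None, None)?;
            }
            ColumnWriter::DoubleColumnWriter(ref mut typed) => {
                typed.write_batch(&[value], None, None)?;
            }
            _ => unreachable!(),
        }
        row_group.close_column(col_writer)?;
    }
    writer.close_row_group(row_group) // column chunk headers flushed here, per row
}

// The close step only finalises the file; the per-row metadata is already written.
fn close_writing(writer: &mut dyn FileWriter) -> Result<()> {
    writer.close()?;
    Ok(())
}
```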
Yes, it seems you are calling
Thanks for the reply. Is there any way to get the last open row group from the row group writer? Maybe I can try closing the row group from the close_writing function.
If I am writing the parquet output row by row, it doesn't seem possible to write multiple rows in one row group, since writing a row means iterating through each of the columns.
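One way around this, sketched below under the same assumptions as before (hypothetical Row struct, two-column schema, older trait-based API), is to buffer incoming rows in memory and, once the buffer is large enough or the stream ends, transpose them into per-column vectors and flush them as a single row group:

```rust
use parquet::column::writer::ColumnWriter;
use parquet::errors::Result;
use parquet::file::writer::{FileWriter, RowGroupWriter};

// Hypothetical row type matching the assumed two-column schema.
struct Row {
    ts: i64,
    value: f64,
}

// Flush a whole batch of buffered rows as ONE row group: transpose the rows
// into per-column vectors and issue a single write_batch call per column.
fn flush_rows(writer: &mut dyn FileWriter, rows: &[Row]) -> Result<()> {
    if rows.is_empty() {
        return Ok(());
    }
    let ts: Vec<i64> = rows.iter().map(|r| r.ts).collect();
    let values: Vec<f64> = rows.iter().map(|r| r.value).collect();

    let mut row_group = writer.next_row_group()?;
    while let Some(mut col_writer) = row_group.next_column()? {
        match col_writer {
            ColumnWriter::Int64ColumnWriter(ref mut typed) => {
                typed.write_batch(&ts, None, None)?;
            }
            ColumnWriter::DoubleColumnWriter(ref mut typed) => {
                typed.write_batch(&values, None, None)?;
            }
            _ => unreachable!(),
        }
        row_group.close_column(col_writer)?;
    }
    writer.close_row_group(row_group)
}

// In the real-time loop: buffer rows and only flush when the batch is big
// enough (or when the stream ends), instead of flushing on every row.
fn write_row(writer: &mut dyn FileWriter, buffer: &mut Vec<Row>, row: Row) -> Result<()> {
    const BATCH_SIZE: usize = 10_000; // tune to taste
    buffer.push(row);
    if buffer.len() >= BATCH_SIZE {
        flush_rows(writer, buffer)?;
        buffer.clear();
    }
    Ok(())
}
```

Each flush then produces one row group containing many rows, so the column chunk headers and metadata are amortised over the whole batch instead of repeated per row; a final flush_rows call in the close step handles whatever is left in the buffer.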
Hi @jcgomes90, could you share a bit more of this parquet column-level write code? I'm about to write some code that does a "migration" from one parquet file to another parquet file with two extra columns, and would like to have some good working examples to base my work on. I know that write row support is not there yet on
/cc @chris-zen
Oh, nevermind, I think I'll use |
I have a program that is writing out to a parquet file. Although Parquet is a columnar storage format, I am writing row by row. I can understand this being a hit on performance, since the library allows columnar bulk writes.
My concern is the file size. When I view the parquet file using the Apache reader, everything looks fine. But opening the file in a text editor, it looks like the column title is being written for every column, for every row. Is there a configuration option or something that I am missing? The files are much bigger compared to other parquet files I've seen that have far more rows than mine with the same schema.