Change default parquet compression format from Snappy to LZ4 #424
Conversation
Snappy's status as the default is maybe just due to history: Snappy had better Java support, and LZ4 wasn't always available in systems like Spark. Today Spark and other systems support LZ4 as well, and LZ4 generally performs a bit better, especially on decompression. This is a significant change, but I think the only reason not to make it is historical, and that may not be a good enough reason these days.
@rjzamora any concerns here? (no urgency during the holiday)
cuDF doesn't support LZ4 compression at the moment. However, LZ4 should also improve decompression performance on GPUs once we do have support (ref). I have a feeling that the lack of support is just because no one has asked for it yet (New Issue: rapidsai/cudf#14495). I'd obviously prefer this kind of change to come after there is LZ4 support in cuDF, but I can probably find short-term workarounds if you are eager to get this in immediately.
Just a side note that
It looks like DuckDB doesn't do this either, which is a little sad 😞
I'm not in a huge rush.
I think I care not at all what the spec says. I do care deeply about what the conventions and general support are among the popular libraries used to read and write Parquet.
Yeah, that makes sense. Perhaps I'll rephrase my comment a bit: it seems like there has been some commotion around the LZ4 Parquet codec recently. Since I don't personally understand this commotion very well, we should probably double-check with the pyarrow folks that using
This came out of this issue: apache/arrow#38389 (comment). To give more context, most of the Parquet deserialization cost seems to be SNAPPY decompression. The use of SNAPPY appears to be a significant performance bottleneck. You might find that issue of interest.
To be clear, I think the fact that DuckDB doesn't already support LZ4 makes this a problem.