Change default parquet compression format from Snappy to LZ4 #424
Conversation
Snappy's status as the default is maybe just due to history: Snappy had better Java support, and LZ4 wasn't always available in systems like Spark. Today Spark and other systems support LZ4 as well, and LZ4 generally performs a bit better, especially on decompression. This is a significant change, but I think the only reason not to make it is historical, and that may not be a good enough reason these days.
@rjzamora any concerns here? (no urgency during the holiday)
cuDF doesn't support LZ4 compression at the moment. However, LZ4 should also improve decompression performance on GPUs once we do have support (ref). I have a feeling that the lack of support is just because no one has asked for it yet (New Issue: rapidsai/cudf#14495). I'd obviously prefer this kind of change to come after there is LZ4 support in cuDF, but I can probably find short-term workarounds if you are eager to get this in immediately.
Just a side note that
It looks like DuckDB doesn't do this either, which is a little sad 😞
I'm not in a huge rush.
I think I care not at all what the spec says. I do care deeply about what the conventions and general support are among the popular libraries used to read and write Parquet.
Yeah, that makes sense. Perhaps I'll rephrase my comment a bit: it seems like there has been some commotion around the LZ4 Parquet codec recently. Since I don't personally understand this commotion very well, we should probably double-check with the pyarrow folks that using
This came out of this issue: apache/arrow#38389 (comment). To give more context, most of the Parquet deserialization cost seems to be SNAPPY decompression. The use of SNAPPY appears to be a significant performance bottleneck. You might find that issue of interest.
To be clear, I think the fact that DuckDB doesn't already support LZ4 makes this a problem.