feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192

Open
wants to merge 24 commits into
base: main

Member

@andygrove andygrove commented Dec 21, 2024

Which issue does this PR close?

Part of #1123

Closes #1178

Rationale for this change

We currently perform encoding + compression in native code and decoding + decompression in JVM code. There are some downsides to this approach:

  • We need compatible and efficient JVM and Rust compression libraries (this seems challenging for LZ4)
  • We need compatible and efficient JVM and Rust encoding libraries (we use Arrow IPC currently)
  • We cannot write unit tests for round-trip encoding and compression, only integration tests, which slows the development cycle
  • It is difficult to experiment with different encoding and compression libraries and techniques
  • We are potentially missing out on performance

What changes are included in this PR?

  • Call native code for decompression + decoding
  • Add metrics for decode + decompress
  • Add support for LZ4
  • Implement unit tests for round-trip
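
The round-trip unit tests mentioned above could look roughly like the following sketch. It is not Comet's actual code: `compress` and `decompress` are a toy run-length codec standing in for LZ4, Snappy, or ZSTD, and `write_block`, `read_block`, and the 8-byte length-prefixed layout are hypothetical names and framing chosen for illustration. The point is only that once both directions live in the same language, a round trip and a decode-time metric can be checked without a Spark integration test.

```rust
use std::time::{Duration, Instant};

/// Toy run-length encoder standing in for a real codec (assumption, not Comet's codec).
/// Output is a sequence of (run length, byte) pairs with run lengths in 1..=255.
fn compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Inverse of `compress`; returns `None` for malformed input.
fn decompress(input: &[u8]) -> Option<Vec<u8>> {
    if input.len() % 2 != 0 {
        return None;
    }
    let mut out = Vec::new();
    for pair in input.chunks_exact(2) {
        if pair[0] == 0 {
            return None;
        }
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    Some(out)
}

/// Hypothetical block layout: 8-byte little-endian compressed length, then the payload.
fn write_block(payload: &[u8], out: &mut Vec<u8>) {
    let compressed = compress(payload);
    out.extend_from_slice(&(compressed.len() as u64).to_le_bytes());
    out.extend_from_slice(&compressed);
}

/// Reads one block, returning the decoded payload and the unread remainder.
/// Time spent on a successful read is added to `decode_time`, mimicking a metric.
fn read_block<'a>(input: &'a [u8], decode_time: &mut Duration) -> Option<(Vec<u8>, &'a [u8])> {
    let start = Instant::now();
    if input.len() < 8 {
        return None;
    }
    let (header, rest) = input.split_at(8);
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(header);
    let len = u64::from_le_bytes(len_bytes) as usize;
    if rest.len() < len {
        return None;
    }
    let (body, remaining) = rest.split_at(len);
    let decoded = decompress(body)?;
    *decode_time += start.elapsed();
    Some((decoded, remaining))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn round_trip_two_blocks() {
        let mut buf = Vec::new();
        write_block(b"aaaabbbc", &mut buf);
        write_block(b"", &mut buf);

        let mut decode_time = Duration::default();
        let (first, rest) = read_block(&buf, &mut decode_time).unwrap();
        let (second, rest) = read_block(rest, &mut decode_time).unwrap();
        assert_eq!(first, b"aaaabbbc".to_vec());
        assert!(second.is_empty());
        assert!(rest.is_empty());
    }
}
```

Swapping the toy codec for a real one would leave the shape of the test unchanged, which is what makes codec experiments cheap in this arrangement.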

ZSTD

[Image: 2024-12-22_09-22]

LZ4

[Image: 2024-12-23_10-19]

Microbenchmarks

shuffle_writer/shuffle_writer: encode (no compression)
                        time:   [10.044 µs 10.371 µs 10.698 µs]
shuffle_writer/shuffle_writer: encode and compress (lz4)
                        time:   [132.60 µs 133.01 µs 133.51 µs]
shuffle_writer/shuffle_writer: encode and compress (zstd level 1)
                        time:   [217.81 µs 218.01 µs 218.25 µs]
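
To put the microbenchmark numbers in proportion, the sketch below computes two ratios from the median times printed above. The constant and function names are made up for this illustration; only the three timings come from the benchmark output.

```rust
// Median times in microseconds, copied from the Criterion output above.
const ENCODE_ONLY_US: f64 = 10.371;
const LZ4_US: f64 = 133.01;
const ZSTD_LEVEL_1_US: f64 = 218.01;

/// How many times longer `candidate_us` takes than `baseline_us`.
fn slowdown(candidate_us: f64, baseline_us: f64) -> f64 {
    candidate_us / baseline_us
}

/// Fraction of `slower_us` saved by switching to `faster_us`.
fn time_saved(faster_us: f64, slower_us: f64) -> f64 {
    (slower_us - faster_us) / slower_us
}

/// Human-readable comparison lines for the three measurements.
fn summary() -> Vec<String> {
    vec![
        format!("lz4 vs encode-only: {:.1}x", slowdown(LZ4_US, ENCODE_ONLY_US)),
        format!("zstd(1) vs encode-only: {:.1}x", slowdown(ZSTD_LEVEL_1_US, ENCODE_ONLY_US)),
        format!("lz4 saves {:.1}% vs zstd(1)", time_saved(LZ4_US, ZSTD_LEVEL_1_US) * 100.0),
    ]
}
```

On these medians, LZ4 costs about 12.8x the uncompressed encode, ZSTD level 1 about 21.0x, and LZ4 takes about 39% less time than ZSTD level 1. These are write-side microbenchmark figures only and say nothing about compression ratio or end-to-end query time.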

TPC-H

Edit: Results updated on 1/1/2025 after making compression configurable for columnar shuffle as well as native shuffle.

[Image: tpch_allqueries]

[Image: tpch_queries_compare]

How are these changes tested?

Existing tests

@andygrove andygrove marked this pull request as ready for review December 21, 2024 21:57
@andygrove andygrove marked this pull request as draft December 21, 2024 22:52
@codecov-commenter

codecov-commenter commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 77.67857% with 25 lines in your changes missing coverage. Please review.

Project coverage is 56.86%. Comparing base (103f82f) to head (1c08a4b).
Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
...execution/shuffle/NativeBatchDecoderIterator.scala    71.60%    12 Missing and 11 partials ⚠️
...t/execution/shuffle/CometShuffleExchangeExec.scala    66.66%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1192       +/-   ##
=============================================
+ Coverage     34.06%   56.86%   +22.79%     
- Complexity      925      933        +8     
=============================================
  Files           115      112        -3     
  Lines         43569    11004    -32565     
  Branches       9528     2122     -7406     
=============================================
- Hits          14843     6257     -8586     
+ Misses        25777     3637    -22140     
+ Partials       2949     1110     -1839     

@andygrove
Member Author

The status is that this is working, but I am running into some executor OOMs when trying to run complete benchmarks. I will pick this up again after the holidays.

@andygrove andygrove changed the title from "feat: Move shuffle block decompression and decoding to native code" to "feat: Move shuffle block decompression and decoding to native code and add LZ4 support" Dec 23, 2024
@andygrove andygrove marked this pull request as ready for review December 23, 2024 17:43
@andygrove
Member Author

@Dandandan you may be interested in the benchmark results

b.iter(|| {
    buffer.clear();
    let mut cursor = Cursor::new(&mut buffer);
    write_ipc_compressed(&batch, &mut cursor, &CompressionCodec::Zstd(1), &ipc_time)

@Dandandan Dandandan Dec 23, 2024

I think zstd also has faster negative levels (-4 or -5 might come close). It would be interesting to see how they compare. I am not sure whether they are available in the Rust bindings.

@andygrove andygrove changed the title from "feat: Move shuffle block decompression and decoding to native code and add LZ4 support" to "feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support" Dec 27, 2024
@andygrove
Member Author

andygrove commented Dec 27, 2024

Added Snappy.

shuffle_writer/shuffle_writer: encode and compress (snappy)
                        time:   [87.556 µs 88.328 µs 89.092 µs]

@viirya
Member

viirya commented Dec 28, 2024

Hmm, the difference between ZSTD and LZ4 does not look significant. Does the TPC-H benchmark shown in the description include native shuffle block decompression and decoding? Do we have numbers for the difference between current Comet (ZSTD) and this patch (ZSTD + native decompression and decoding)?

@andygrove
Member Author

andygrove commented Dec 28, 2024

> Hmm, the difference between ZSTD and LZ4 does not look significant. Does the TPC-H benchmark shown in the description include native shuffle block decompression and decoding? Do we have numbers for the difference between current Comet (ZSTD) and this patch (ZSTD + native decompression and decoding)?

Sure, here are fresh benchmarks, with a single run for each measurement.

Branch          Compression Codec   TPC-H Time (s)
0.4.0           ZSTD                364
main            ZSTD                364
native-decode   ZSTD                367
native-decode   Snappy              343
native-decode   LZ4                 340

I had hoped for a bigger win, but 340 s compared to 364 s is still roughly a 7% improvement overall.
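
For reference, the percentage for each row can be reproduced with a small sketch like the one below. The table values are taken from the results above; `RESULTS`, `improvement_pct`, and `improvements_vs_main` are names invented here, and using the `main` + ZSTD row as the baseline is an assumption about which comparison is intended.

```rust
/// Rows of the table above: (branch, codec, total TPC-H time in seconds).
const RESULTS: [(&str, &str, f64); 5] = [
    ("0.4.0", "ZSTD", 364.0),
    ("main", "ZSTD", 364.0),
    ("native-decode", "ZSTD", 367.0),
    ("native-decode", "Snappy", 343.0),
    ("native-decode", "LZ4", 340.0),
];

/// Percentage reduction in runtime relative to `baseline_secs`.
/// A negative result means the run was slower than the baseline.
fn improvement_pct(baseline_secs: f64, secs: f64) -> f64 {
    (baseline_secs - secs) / baseline_secs * 100.0
}

/// Improvement of every row relative to the `main` + ZSTD row.
fn improvements_vs_main() -> Vec<(&'static str, &'static str, f64)> {
    let baseline = RESULTS
        .iter()
        .find(|(branch, codec, _)| *branch == "main" && *codec == "ZSTD")
        .map(|(_, _, secs)| *secs)
        .expect("baseline row is present");
    RESULTS
        .iter()
        .map(|&(branch, codec, secs)| (branch, codec, improvement_pct(baseline, secs)))
        .collect()
}
```

With these inputs, LZ4 comes out at about 6.6%, Snappy at about 5.8%, and native-decode with ZSTD at about -0.8%. Since each measurement is a single run, differences of this size may be within noise.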

@andygrove
Member Author

I enabled the unsafe version of lz4_flex, and the time dropped from 340s to 336s.

Development

Successfully merging this pull request may close these issues.

Add support for lz4 compression in shuffle
4 participants