Stop encoding schema with each batch in shuffle writer #1186

andygrove · 2024-12-19T17:32:54Z

What is the problem the feature request solves?

We use Arrow IPC to write shuffle output. We create a new writer for each batch, which means that we serialize the schema for each batch.

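```rust
// A new StreamWriter (and zstd encoder) is created for every batch,
// so the schema message is written again each time.
let mut arrow_writer = StreamWriter::try_new(zstd::Encoder::new(output, 1)?, &batch.schema())?;
arrow_writer.write(batch)?;
arrow_writer.finish()?;
```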

The schema is guaranteed to be the same for every batch because the input is a DataFusion ExecutionPlan. We should therefore be able to use a single writer for all batches and avoid the cost of serializing the schema each time.

Based on one benchmark in #1180, I am seeing a 4x speedup in encoding time by re-using the writer.
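
The overhead described above can be illustrated with a toy model. This is only a sketch: `encode_schema`, `encode_batch`, `write_schema_per_batch`, and `write_schema_once` are hypothetical names, and the byte layout is a made-up length-prefixed format, not Arrow IPC. It only shows that repeating the schema costs one extra schema payload per additional batch; it does not model compression or encoding time.

```rust
// Toy model only: hypothetical helpers, NOT the Arrow IPC format.

/// Serialize field names and types as length-prefixed strings.
fn encode_schema(fields: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (name, ty) in fields {
        out.extend_from_slice(&(name.len() as u32).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&(ty.len() as u32).to_le_bytes());
        out.extend_from_slice(ty.as_bytes());
    }
    out
}

/// Serialize one batch of i64 values as a row count followed by raw values.
fn encode_batch(values: &[i64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + values.len() * 8);
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// One "writer" per batch: the schema bytes are repeated before every batch.
fn write_schema_per_batch(fields: &[(&str, &str)], batches: &[Vec<i64>]) -> Vec<u8> {
    let mut out = Vec::new();
    for batch in batches {
        out.extend(encode_schema(fields));
        out.extend(encode_batch(batch));
    }
    out
}

/// One shared "writer": the schema bytes are written once, up front.
fn write_schema_once(fields: &[(&str, &str)], batches: &[Vec<i64>]) -> Vec<u8> {
    let mut out = encode_schema(fields);
    for batch in batches {
        out.extend(encode_batch(batch));
    }
    out
}
```

In this model, the per-batch variant is larger than the shared variant by exactly `(number of batches - 1) * encode_schema(fields).len()` bytes, so the relative overhead grows as batches get smaller or schemas get wider.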

Describe the potential solution

No response

Additional context

No response


andygrove · 2024-12-19T20:14:40Z

This may not be possible because Spark shuffle is block-based and not streaming. The order in which blocks arrive in the reader is not guaranteed.

Perhaps we can consider encoding the raw buffers without the schema and then inferring the schema in the reader from the Spark schema, or otherwise providing the reader with the schema in advance.
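
One way to picture the schema-less idea is the sketch below. It is an assumption-laden toy, not Comet or Arrow code: `DataType`, `Column`, `encode_block`, and `decode_block` are hypothetical, only fixed-width 64-bit columns are handled, and nulls and compression are ignored. Each block carries only a row count and raw column buffers, and the reader is handed the schema separately, so blocks can be decoded in whatever order they arrive.

```rust
// Toy sketch only: hypothetical types and layout, NOT Comet or Arrow APIs.

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

/// Write a row count followed by raw column buffers; no field names or types.
/// Assumes every column has the same number of rows.
fn encode_block(columns: &[Column]) -> Vec<u8> {
    let rows = match columns.first() {
        Some(Column::Int64(v)) => v.len(),
        Some(Column::Float64(v)) => v.len(),
        None => 0,
    };
    let mut out = (rows as u32).to_le_bytes().to_vec();
    for col in columns {
        match col {
            Column::Int64(v) => v.iter().for_each(|x| out.extend_from_slice(&x.to_le_bytes())),
            Column::Float64(v) => v.iter().for_each(|x| out.extend_from_slice(&x.to_le_bytes())),
        }
    }
    out
}

/// Copy an 8-byte slice into a fixed-size array.
fn to_array8(chunk: &[u8]) -> [u8; 8] {
    let mut a = [0u8; 8];
    a.copy_from_slice(chunk);
    a
}

/// Decode one block using a schema the reader already holds.
/// Returns None if the block is too short for the given schema.
fn decode_block(schema: &[DataType], bytes: &[u8]) -> Option<Vec<Column>> {
    let head = bytes.get(..4)?;
    let rows = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    let mut pos = 4;
    let mut columns = Vec::with_capacity(schema.len());
    for dt in schema {
        let end = pos + rows * 8;
        let chunks = bytes.get(pos..end)?.chunks_exact(8).map(to_array8);
        columns.push(match dt {
            DataType::Int64 => Column::Int64(chunks.map(i64::from_le_bytes).collect()),
            DataType::Float64 => Column::Float64(chunks.map(f64::from_le_bytes).collect()),
        });
        pos = end;
    }
    Some(columns)
}
```

Because no block depends on state from a previous block, this layout would sidestep the ordering concern, at the cost of requiring the writer and reader to agree on the schema out of band.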

andygrove added the enhancement and performance labels Dec 19, 2024

andygrove added this to the 0.5.0 milestone Dec 19, 2024

andygrove self-assigned this Dec 19, 2024

This was referenced Dec 19, 2024

[EPIC] Improve shuffle performance #1123

Open

[do not review] experimental support for lz4 compression (not working) #1181

Draft
