Stop encoding schema with each batch in shuffle writer #1186

Open
andygrove opened this issue Dec 19, 2024 · 1 comment
Labels: enhancement, performance

Comments


andygrove commented Dec 19, 2024

What is the problem the feature request solves?

We use Arrow IPC to write shuffle output. We create a new writer for each batch, which means that we serialize the schema for each batch.

```rust
let mut arrow_writer = StreamWriter::try_new(zstd::Encoder::new(output, 1)?, &batch.schema())?;
arrow_writer.write(batch)?;
arrow_writer.finish()?;
```

The schema is guaranteed to be the same for every batch because the input is a DataFusion ExecutionPlan, so we should be able to use a single writer for all batches and avoid the cost of serializing the schema each time.

Based on one of the benchmarks in #1180, I am seeing a 4x speedup in encoding time by reusing the writer.
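To make the repeated work concrete, here is a minimal sketch that uses a toy framing format, not Arrow IPC. The `Field` type, the helper functions, and the byte layout are all hypothetical and exist only for illustration. It compares the bytes produced when a schema header is written before every batch with the bytes produced when it is written once.

```rust
/// Hypothetical column description: a name plus a one-byte type tag.
struct Field {
    name: &'static str,
    type_tag: u8,
}

/// Serialize the schema as: field count (u32), then for each field
/// the name length (u32), the name bytes, and the type tag.
fn encode_schema(schema: &[Field], out: &mut Vec<u8>) {
    out.extend_from_slice(&(schema.len() as u32).to_le_bytes());
    for f in schema {
        out.extend_from_slice(&(f.name.len() as u32).to_le_bytes());
        out.extend_from_slice(f.name.as_bytes());
        out.push(f.type_tag);
    }
}

/// Serialize one batch of i64 values: row count (u32), then raw values.
fn encode_batch(values: &[i64], out: &mut Vec<u8>) {
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// One "writer" per batch: the schema header is repeated every time.
fn write_schema_per_batch(schema: &[Field], batches: &[Vec<i64>]) -> Vec<u8> {
    let mut out = Vec::new();
    for b in batches {
        encode_schema(schema, &mut out);
        encode_batch(b, &mut out);
    }
    out
}

/// One "writer" for all batches: the schema header is written once.
fn write_schema_once(schema: &[Field], batches: &[Vec<i64>]) -> Vec<u8> {
    let mut out = Vec::new();
    encode_schema(schema, &mut out);
    for b in batches {
        encode_batch(b, &mut out);
    }
    out
}
```

With `n` batches, the per-batch variant calls `encode_schema` `n` times and the shared variant calls it once, so the extra cost grows linearly with the batch count. The toy format says nothing about how large a real Arrow IPC schema message is; it only shows the shape of the overhead.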

Describe the potential solution

No response

Additional context

No response


andygrove commented Dec 19, 2024

This may not be possible because Spark shuffle is block-based and not streaming. The order in which blocks arrive in the reader is not guaranteed.

Perhaps we can consider encoding the raw buffers without a schema and then inferring the schema in the reader from the Spark schema, or otherwise providing the reader with the schema in advance.
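As a rough illustration of that idea, the sketch below encodes each block as raw column values with no schema and decodes it with a schema the reader already holds. Because every block is self-contained, the order in which blocks arrive does not matter. The `ColType` and `Column` types and the byte layout are hypothetical, not the Arrow IPC or Comet format, and the encoder assumes all columns in a block have the same number of rows.

```rust
/// Hypothetical column types known to both writer and reader.
#[derive(Debug, PartialEq)]
enum ColType {
    Int32,
    Int64,
}

/// Column values for one block.
#[derive(Debug, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

/// Encode a block as: row count (u32), then each column's raw
/// little-endian values. No schema information is written.
/// Assumes every column has the same number of rows.
fn encode_block(columns: &[Column]) -> Vec<u8> {
    let rows = match columns.first() {
        Some(Column::Int32(v)) => v.len(),
        Some(Column::Int64(v)) => v.len(),
        None => 0,
    };
    let mut out = (rows as u32).to_le_bytes().to_vec();
    for col in columns {
        match col {
            Column::Int32(v) => {
                for x in v {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
            Column::Int64(v) => {
                for x in v {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
        }
    }
    out
}

/// Decode a block using a schema supplied out of band.
/// Returns None if the block is too short for that schema.
fn decode_block(schema: &[ColType], bytes: &[u8]) -> Option<Vec<Column>> {
    let mut header = [0u8; 4];
    header.copy_from_slice(bytes.get(..4)?);
    let rows = u32::from_le_bytes(header) as usize;

    let mut pos = 4;
    let mut columns = Vec::with_capacity(schema.len());
    for ty in schema {
        match ty {
            ColType::Int32 => {
                let raw = bytes.get(pos..pos + rows * 4)?;
                pos += rows * 4;
                let vals = raw
                    .chunks_exact(4)
                    .map(|c| {
                        let mut b = [0u8; 4];
                        b.copy_from_slice(c);
                        i32::from_le_bytes(b)
                    })
                    .collect();
                columns.push(Column::Int32(vals));
            }
            ColType::Int64 => {
                let raw = bytes.get(pos..pos + rows * 8)?;
                pos += rows * 8;
                let vals = raw
                    .chunks_exact(8)
                    .map(|c| {
                        let mut b = [0u8; 8];
                        b.copy_from_slice(c);
                        i64::from_le_bytes(b)
                    })
                    .collect();
                columns.push(Column::Int64(vals));
            }
        }
    }
    Some(columns)
}
```

The trade-off in this sketch is that the bytes are not self-describing: a reader holding the wrong schema would misinterpret the buffers without noticing, so the schema handed to the reader has to match the one the writer used.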
