What is the problem the feature request solves?

We use Arrow IPC to write shuffle output. We currently create a new writer for each batch, which means we serialize the schema for every batch. The schema is guaranteed to be the same for every batch because the input is a DataFusion ExecutionPlan, so we should be able to use a single writer for all batches and avoid the cost of serializing the schema each time.

Based on the benchmarks in #1180, I am seeing a 4x speedup in encoding time by re-using the writer.
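The proposed change can be sketched with a toy length-prefixed framing. This is not the real Arrow IPC format and the types below (`Schema`, `Batch`, `ReusableWriter`) are illustrative stand-ins, not Comet code; the real implementation would use arrow-rs's `StreamWriter`, which likewise writes the schema message once at stream start and then one message per batch.

```rust
// Toy sketch: serialize the schema once, then only batch payloads.
// Stand-ins for arrow's Schema and RecordBatch.
struct Schema {
    fields: Vec<String>,
}

struct Batch {
    rows: Vec<i64>,
}

/// Writes the schema header once on construction; subsequent calls
/// to `write_batch` append only row data.
struct ReusableWriter {
    out: Vec<u8>,
}

impl ReusableWriter {
    fn new(schema: &Schema) -> Self {
        let mut out = Vec::new();
        // Serialize the schema exactly once, as a length-prefixed block.
        let header = schema.fields.join(",").into_bytes();
        out.extend((header.len() as u32).to_le_bytes());
        out.extend(header);
        ReusableWriter { out }
    }

    fn write_batch(&mut self, batch: &Batch) {
        // Only the row data is serialized per batch; no schema repeat.
        let payload: Vec<u8> = batch.rows.iter().flat_map(|v| v.to_le_bytes()).collect();
        self.out.extend((payload.len() as u32).to_le_bytes());
        self.out.extend(payload);
    }
}

fn main() {
    let schema = Schema { fields: vec!["a".into(), "b".into()] };
    let batches = vec![Batch { rows: vec![1, 2] }, Batch { rows: vec![3, 4] }];

    // Current approach: a fresh writer per batch repeats the schema header.
    let per_batch_bytes: usize = batches
        .iter()
        .map(|b| {
            let mut w = ReusableWriter::new(&schema);
            w.write_batch(b);
            w.out.len()
        })
        .sum();

    // Proposed approach: one writer, schema written once.
    let mut shared = ReusableWriter::new(&schema);
    for b in &batches {
        shared.write_batch(b);
    }

    println!("per-batch: {per_batch_bytes} bytes, shared: {} bytes", shared.out.len());
    assert!(shared.out.len() < per_batch_bytes);
}
```

With many small batches per partition, the repeated schema header (and the per-writer setup cost) dominates, which is where the reported encoding speedup comes from.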
Describe the potential solution
No response
Additional context
No response
The text was updated successfully, but these errors were encountered:
This may not be possible because Spark shuffle is block-based and not streaming. The order in which blocks arrive in the reader is not guaranteed.
Perhaps we could encode the raw buffers without the schema and have the reader infer the schema from the Spark schema, or otherwise provide the reader with the schema in advance.
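The schema-out-of-band idea can be sketched as follows, again with a toy framing rather than real Arrow IPC. The names (`encode_batch`, `decode_batch`, `field_count`) are hypothetical, not Comet or arrow-rs APIs; the point is that schema-less blocks are self-contained and can be decoded in any arrival order, which fits Spark's block-based shuffle.

```rust
// Stand-in for a schema derived from the Spark plan and handed to the
// reader out of band, rather than read from the stream.
struct Schema {
    field_count: usize,
}

/// Encode one batch as a bare payload: no schema message, so blocks can
/// be decoded in whatever order they arrive.
fn encode_batch(rows: &[i64]) -> Vec<u8> {
    rows.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Decode a payload using the schema supplied in advance.
fn decode_batch(schema: &Schema, payload: &[u8]) -> Vec<i64> {
    assert!(schema.field_count > 0, "reader must be given a schema");
    payload
        .chunks_exact(8)
        .map(|c| i64::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    let schema = Schema { field_count: 1 };
    let block_a = encode_batch(&[1, 2, 3]);
    let block_b = encode_batch(&[4, 5]);
    // Blocks may arrive in any order; each decodes independently.
    assert_eq!(decode_batch(&schema, &block_b), vec![4, 5]);
    assert_eq!(decode_batch(&schema, &block_a), vec![1, 2, 3]);
}
```

This trades the per-batch schema cost for a requirement that the reader always has the schema before the first block, which the Spark plan should be able to guarantee.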