-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added documentation for SortMergeJoin #13469
Conversation
I would be adding this short video to make the dev/user familiar with SMJ concepts https://www.youtube.com/watch?v=jiWCPJtDE2c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @athultr1997 this is good start
@korowa @viirya do you have anything to add on it?
/// partitions. | ||
/// Join execution plan that executes equi-join predicates on multiple partitions using Sort-Merge | ||
/// join algorithm and applies an optional filter post join. Can be used to join arbitrarily large | ||
/// inputs where one or both of the inputs don't fit in the available memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where one or both of the inputs don't fit in the available memory.
Hmm, is this true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
streamed - Always taken one batch at a time.
buffered - Has spilling support.
Hence, the inputs don't have to fit in memory.
Also, I think this was the vision behind SMJ: #1599 (comment).
/// | ||
/// Buffered input is buffered for all record batches having the same value of join key. | ||
/// If the memory limit increases beyond the specified value and spilling is enabled, | ||
/// buffered batches could be spilled to disk. If spilling is disabled, the execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a config for spilling? Shall we mention it here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One has to enable the disk manager when creating the RuntimeEnv
during the creation of TaskContext
.
let runtime = RuntimeEnvBuilder::new()
.with_memory_limit(100, 1.0)
.with_disk_manager(DiskManagerConfig::NewOs)
.build_arc()?;
I am not sure if we should mention it here, since this is all the way over in RuntimeEnv
and there are multiple strategies to enable the disk manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can point to DiskManager
documentation here
7d803ed
to
ba6641a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @athultr1997
its a good start, however its many more things to document there
I changed the PR description to say "part of" rather than closing #10357 Thanks @athultr1997 @comphead and @viirya |
Which issue does this PR close?
Part of #10357
Rationale for this change
What changes are included in this PR?
Added documentation for SortMergeJoin
Are these changes tested?
No tests needed since only documentation is added.
Are there any user-facing changes?
No