POC for join reordering using simplify_up #809
base: main
Conversation
# This should be a more general cardinality estimate of the
# resulting join but for now just sort them by size
(_, left, _), (_, right, _) = join_tuple
return (right.npartitions, left.npartitions)
If divisions are known we could actually try to be smart here
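For illustration, a rough sketch of what using known divisions could look like; the `_division_overlap` helper and the way it scales the sort key are made up (and assume numeric divisions), not anything in dask-expr:

```python
# Hypothetical refinement of the sort key when divisions are known.
# Assumes `left`/`right` expose `known_divisions`/`divisions` like dask
# DataFrames do, and that divisions are numeric; the overlap heuristic
# below is a placeholder, not dask-expr API.

def _division_overlap(left_divisions, right_divisions):
    """Fraction of the left key range that overlaps the right key range."""
    lo = max(left_divisions[0], right_divisions[0])
    hi = min(left_divisions[-1], right_divisions[-1])
    span = left_divisions[-1] - left_divisions[0]
    if span == 0:
        return 1.0
    return max(0.0, (hi - lo) / span)


def _join_sort_key(join_tuple):
    (_, left, _), (_, right, _) = join_tuple
    if left.known_divisions and right.known_divisions:
        # A small overlap suggests a small join result, so schedule it earlier.
        overlap = _division_overlap(left.divisions, right.divisions)
        return (overlap * right.npartitions, overlap * left.npartitions)
    # Fall back to the plain partition-count heuristic from the POC.
    return (right.npartitions, left.npartitions)
```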
Cool. This seems like an easy thing to do. I'm excited to see if it has impact on TPC-H queries.
The implementation looks pleasant too
Just saying something for completeness (probably you're already thinking this) but in theory we'd get good at pushing what operations we could through a join (obviously not possible with all operations) so that we often get to the point where we have
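A toy illustration of that point in plain pandas (not dask-expr): once the projection between the two merges is pushed below the outer merge, the joins become directly nested and a reordering pass can treat them as one chain.

```python
import pandas as pd

a = pd.DataFrame({"k": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"k": [1, 2], "y": [30, 40]})
c = pd.DataFrame({"k": [1, 2], "z": [50, 60]})

# User code: a projection sits between the two merges, hiding the join chain.
result = a.merge(b, on="k")[["k", "x"]].merge(c, on="k")

# After pushing the projection below the outer merge, the merges are
# directly nested and eligible for reordering as a single chain.
equivalent = a[["k", "x"]].merge(b[["k"]], on="k").merge(c, on="k")

assert result.equals(equivalent)
```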
Yes, I thought so too. My sense is that even with this kind of optimization we could already catch a lot of stuff. My textbook/research knowledge for join order optimization is a little thin here. I honestly don't know if a "rearrange individual join tuples to get a local optimum" is a decent strategy or a horrible one ¯\_(ツ)_/¯
When I looked at join reordering strategies earlier in this process I was surprised to learn that most systems like Spark/Presto that don't store their own data tend to have pretty generic join reordering strategies on by default. They might not do a ton. My guess is that the fancy optimizers tend to all be in databases that hold onto their own data and so store cardinality estimates (postgres, snowflake). Maybe Spark+Delta achieves this though?
This might be quite useful. I can see quite a few use cases where we drop the join keys as soon as we've finished a join, which would inject a
The risk of running into a local optimum is often mentioned as a downside of hill-climbing or iterative improvement for join ordering, which is AFAIR why it's not widely used as opposed to Selinger-style optimization. Then again, our cost model has very little to work with and we're dealing with DAGs, both of which change the game.
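As a generic illustration of the "swap neighbours until nothing improves" flavour of iterative improvement (not the dask-expr code; the chain, sizes, and cost are stand-ins):

```python
# Toy iterative improvement on a left-deep join chain: swap adjacent
# relations while that lowers the stand-in cost (here simply "smaller
# relations first"). With a richer, non-monotone cost model this kind of
# local search can get stuck in a local optimum, which is the concern above.

def improve_join_chain(chain, size):
    """chain: relation names joined left-to-right; size: name -> estimated rows."""
    improved = True
    while improved:
        improved = False
        for i in range(len(chain) - 1):
            if size[chain[i + 1]] < size[chain[i]]:
                chain[i], chain[i + 1] = chain[i + 1], chain[i]
                improved = True
    return chain


print(improve_join_chain(
    ["orders", "lineitem", "nation"],
    {"orders": 1_500_000, "lineitem": 6_000_000, "nation": 25},
))
# ['nation', 'orders', 'lineitem']
```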
This is a POC for join reordering using simplify_up. There are plenty of cases where this'll break and I haven't implemented it fully, but for this specific example it shows that this kind of optimization works in principle.
So far, it only works if the merges are directly nested; it breaks as soon as there is another expression in between that cannot be pushed further down (e.g. a column rename/assign, which wouldn't be uncommon). I haven't explored this direction yet.
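To make the idea concrete without quoting the diff, here is a detached sketch of the rotation such a rule performs on nested merges. `Frame`, `Merge`, and the partition-count heuristic are all made up for illustration; this is not the actual dask-expr API or the code in this PR:

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    npartitions: int


@dataclass
class Merge:
    left: object
    right: object
    on: str

    @property
    def npartitions(self):
        return max(self.left.npartitions, self.right.npartitions)


def reorder(merge):
    """Rotate (A join B) join C into (A join C) join B when C looks smaller than B."""
    inner = merge.left
    if isinstance(inner, Merge) and inner.on == merge.on:
        if merge.right.npartitions < inner.right.npartitions:
            return Merge(Merge(inner.left, merge.right, merge.on), inner.right, merge.on)
    return merge


plan = Merge(Merge(Frame("a", 10), Frame("b", 100), "k"), Frame("c", 1), "k")
print(reorder(plan))
# The inner merge now pairs `a` with the tiny `c`; `b` is joined last.
```

In the real expression framework the rewrite would of course also have to check that the join keys, join types, and any expressions sitting between the merges actually make the rotation valid.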
I doubt we'll ever merge this, but I was playing with it and wanted to share it.
cc @hendrikmakait