
POC for join reordering using simplify_up #809

Draft
fjetter wants to merge 1 commit into main
Conversation

@fjetter (Member) commented Jan 26, 2024

This is a POC for join reordering using simplify_up. There are plenty of cases where this will break, and I haven't implemented it fully, but for this specific example it shows that this kind of optimization works in principle.

So far, it only works if the merges are directly nested; it breaks as soon as there is another expression in between that cannot be pushed further down (e.g. a column rename/assign, which wouldn't be uncommon). I haven't explored this direction yet.

I doubt we'll ever merge this, but I was playing with it and wanted to share it.

cc @hendrikmakait

Comment on lines +371 to +374
# This should be a more general cardinality estimate of the
# resulting join but for now just sort them by size
(_, left, _), (_, right, _) = join_tuple
return (right.npartitions, left.npartitions)
fjetter (Member Author):

If divisions are known we could actually try to be smart here
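A hedged sketch of what "being smart" with known divisions could look like — nothing here exists in dask-expr; `Side`, `estimate_overlap`, and `join_sort_key` are illustrative names. The idea: sorted divisions bound each side's key range, and the overlap of those ranges caps how many rows can actually match, which refines the plain partition-count sort key from the snippet above:

```python
from collections import namedtuple

# Illustrative stand-in for an expression with known divisions.
Side = namedtuple("Side", ["divisions", "npartitions"])

def estimate_overlap(left_divisions, right_divisions):
    """Fraction of the left key range that overlaps the right key range."""
    lo = max(left_divisions[0], right_divisions[0])
    hi = min(left_divisions[-1], right_divisions[-1])
    if hi <= lo:
        return 0.0  # disjoint key ranges: the join produces nothing
    span = left_divisions[-1] - left_divisions[0]
    return (hi - lo) / span if span else 1.0

def join_sort_key(left, right):
    """Like the npartitions sort key above, scaled by key-range overlap."""
    if left.divisions and right.divisions:
        overlap = estimate_overlap(left.divisions, right.divisions)
    else:
        overlap = 1.0  # divisions unknown: fall back to partition counts
    return (overlap * right.npartitions, overlap * left.npartitions)
```

With disjoint divisions the key collapses to zero, so an empty join would sort first regardless of partition counts.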

Member:

Cool. This seems like an easy thing to do. I'm excited to see if it has an impact on the TPC-H queries.

Member:

The implementation looks pleasant too

@fjetter fjetter changed the title POC for join reordering POC for join reordering using simplify_up Jan 26, 2024
@mrocklin (Member):
It only works if the merges are indeed nested and as soon as there is another expression in between that cannot be pushed further down (e.g. a column rename/assign wouldn't be uncommon).

Just saying something for completeness (probably you're already thinking this) but in theory we'd get good at pushing what operations we could through a join (obviously not possible with all operations) so that we often get to the point where we have Join(Join(...), ...). My guess is that various benchmark queries will start pointing out situations where we can push operations through and get to this point.
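As an illustration of that pushdown idea (toy classes, not dask-expr's actual expression types), a simplify_up-style rule can sink a projection below a merge, re-exposing the nested `Merge(Merge(...), ...)` shape the reordering pass looks for:

```python
# Toy expression tree. `Table`, `Merge`, `Projection`, and
# `push_projection` are illustrative names, not dask-expr API.

class Expr:
    pass

class Table(Expr):
    def __init__(self, name, columns):
        self.name, self.columns = name, set(columns)

class Merge(Expr):
    def __init__(self, left, right, on):
        self.left, self.right, self.on = left, right, on

class Projection(Expr):
    def __init__(self, frame, columns):
        self.frame, self.columns = frame, set(columns)

def push_projection(expr):
    """Rewrite Projection(Merge(l, r)) into Merge(Projection(l), Projection(r))."""
    if isinstance(expr, Projection) and isinstance(expr.frame, Merge):
        m = expr.frame
        needed = expr.columns | {m.on}  # the join key must survive on both sides
        return Merge(
            Projection(m.left, needed & m.left.columns),
            Projection(m.right, needed & m.right.columns),
            m.on,
        )
    return expr
```

After the rewrite, the root is a `Merge` again, so a chain of such rules could bubble projections below a stack of joins.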

fjetter commented Jan 26, 2024

Yes, I thought so too. My sense is that even with this kind of optimization we could already catch a lot of stuff.
I could even see us implementing similar logic for stuff like "Is not a merge but a series of blockwise things that lead up to a merge".
I think it would generally be good to know a little about the chain of operations on either side. For instance, if filters have already been applied on either side, that would be helpful to know for the cost estimation. We may end up with a global pass eventually... but for now I liked the simplicity of playing with this a little.

My textbook/research knowledge for join order optimization is a little thin here. I honestly don't know whether "rearrange individual join tuples to get a local optimum" is a decent strategy or a horrible one ¯\_(ツ)_/¯

@mrocklin (Member):
When I looked at join reordering strategies earlier in this process I was surprised to learn that most systems like Spark/Presto that don't store their own data tend to have pretty generic join reordering strategies on by default. They might not do a ton. My guess is that the fancy optimizers tend to all be in databases that hold onto their own data and so store cardinality estimates (postgres, snowflake). Maybe Spark+Delta achieves this though?

hendrikmakait commented Jan 26, 2024

Yes, I thought so too. My sense is that even with this kind of optimization we could already catch a lot of stuff.
I could even see us implementing similar logic for stuff like "Is not a merge but a series of blockwise things that lead up to a merge".

This might be quite useful. I can see quite a few use cases where we drop the join keys as soon as we have finished a join, which would inject a Projection between the Merges.

My textbook/research knowledge for join order optimization is a little thin here. I honestly don't know whether "rearrange individual join tuples to get a local optimum" is a decent strategy or a horrible one ¯\_(ツ)_/¯

The risk of running into a local optimum is often mentioned as a downside of hill-climbing or iterative improvement for join ordering, which is AFAIR why it's not widely used as opposed to Selinger-style optimization. Then again, our cost model has very little to work with and we're dealing with DAGs, both of which change the game.
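For concreteness, a minimal sketch of the greedy, locally optimal flavor of reordering being discussed — repeatedly join the pair with the smallest estimated output. The uniform `selectivity` is a made-up stand-in for a real cardinality estimate; none of this is dask-expr code:

```python
from itertools import combinations

def greedy_join_order(sizes, selectivity=0.1):
    """Greedily join the two cheapest relations until one plan remains.

    sizes: mapping of relation name -> estimated row count.
    selectivity: assumed fraction of the cross product surviving a join.
    Returns the join order as a nested tuple.
    """
    relations = dict(sizes)
    while len(relations) > 1:
        # locally optimal choice: the pair with the smallest estimated output
        pair = min(
            combinations(relations, 2),
            key=lambda p: relations[p[0]] * relations[p[1]],
        )
        relations[pair] = relations[pair[0]] * relations[pair[1]] * selectivity
        del relations[pair[0]], relations[pair[1]]
    return next(iter(relations))
```

A Selinger-style optimizer would instead enumerate join subsets with dynamic programming; the greedy pass is cheap but can get stuck in exactly the local optima mentioned above.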
