
POC for join reordering using simplify_up #809

Draft
fjetter wants to merge 1 commit into main
Conversation

@fjetter (Member) commented Jan 26, 2024

This is a POC for join reordering using simplify_up. There are plenty of cases where this will break, and I haven't implemented it fully, but for this specific example it shows that this kind of optimization works in principle.

So far, it only works if the merges are directly nested; it breaks as soon as there is another expression in between that cannot be pushed further down (e.g. a column rename/assign, which wouldn't be uncommon). I haven't explored this direction yet.

I doubt we'll ever merge this, but I was playing with it and wanted to share it.

cc @hendrikmakait

Comment on lines +371 to +374
# This should be a more general cardinality estimate of the
# resulting join but for now just sort them by size
(_, left, _), (_, right, _) = join_tuple
return (right.npartitions, left.npartitions)
fjetter (Member Author):

If divisions are known we could actually try to be smart here
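A hedged sketch of what "being smart" with known divisions could look like — nothing here exists in dask-expr; `Side`, `estimate_overlap`, and `join_sort_key` are illustrative names. The idea: sorted divisions bound each side's key range, and the overlap of those ranges caps how many rows can actually match, which refines the plain partition-count sort key from the snippet above:

```python
from collections import namedtuple

# Illustrative stand-in for an expression with known divisions.
Side = namedtuple("Side", ["divisions", "npartitions"])

def estimate_overlap(left_divisions, right_divisions):
    """Fraction of the left key range that overlaps the right key range."""
    lo = max(left_divisions[0], right_divisions[0])
    hi = min(left_divisions[-1], right_divisions[-1])
    if hi <= lo:
        return 0.0  # disjoint key ranges: the join produces nothing
    span = left_divisions[-1] - left_divisions[0]
    return (hi - lo) / span if span else 1.0

def join_sort_key(left, right):
    """Like the npartitions sort key above, scaled by key-range overlap."""
    if left.divisions and right.divisions:
        overlap = estimate_overlap(left.divisions, right.divisions)
    else:
        overlap = 1.0  # divisions unknown: fall back to partition counts
    return (overlap * right.npartitions, overlap * left.npartitions)
```

With disjoint divisions the key collapses to zero, so an empty join would sort first regardless of partition counts.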

Member:

Cool. This seems like an easy thing to do. I'm excited to see if it has an impact on the TPC-H queries.

Member:

The implementation looks pleasant too

@fjetter fjetter changed the title POC for join reordering POC for join reordering using simplify_up Jan 26, 2024
@mrocklin (Member):
It only works if the merges are indeed nested and as soon as there is another expression in between that cannot be pushed further down (e.g. a column rename/assign wouldn't be uncommon).

Just saying something for completeness (probably you're already thinking this) but in theory we'd get good at pushing what operations we could through a join (obviously not possible with all operations) so that we often get to the point where we have Join(Join(...), ...). My guess is that various benchmark queries will start pointing out situations where we can push operations through and get to this point.
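As an illustration of that pushdown idea (toy classes, not dask-expr's actual expression types), a simplify_up-style rule can sink a projection below a merge, re-exposing the nested `Merge(Merge(...), ...)` shape the reordering pass looks for:

```python
# Toy expression tree. `Table`, `Merge`, `Projection`, and
# `push_projection` are illustrative names, not dask-expr API.

class Expr:
    pass

class Table(Expr):
    def __init__(self, name, columns):
        self.name, self.columns = name, set(columns)

class Merge(Expr):
    def __init__(self, left, right, on):
        self.left, self.right, self.on = left, right, on

class Projection(Expr):
    def __init__(self, frame, columns):
        self.frame, self.columns = frame, set(columns)

def push_projection(expr):
    """Rewrite Projection(Merge(l, r)) into Merge(Projection(l), Projection(r))."""
    if isinstance(expr, Projection) and isinstance(expr.frame, Merge):
        m = expr.frame
        needed = expr.columns | {m.on}  # the join key must survive on both sides
        return Merge(
            Projection(m.left, needed & m.left.columns),
            Projection(m.right, needed & m.right.columns),
            m.on,
        )
    return expr
```

After the rewrite, the root is a `Merge` again, so a chain of such rules could bubble projections below a stack of joins.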

fjetter commented Jan 26, 2024

Yes, I thought so too. My sense is that even with this kind of optimization we could already catch a lot of stuff.
I could even see us implementing similar logic for stuff like "Is not a merge but a series of blockwise things that lead up to a merge".
I think it would generally be good to know a little about the chain of operations on either side. For instance, if filters have already been applied on either side, that would be helpful to know for the cost estimation. We may end up with a global pass eventually... but for now I liked the simplicity of playing with this a little.

My textbook/research knowledge for join order optimization is a little thin here. I honestly don't know whether "rearrange individual join tuples to get a local optimum" is a decent strategy or a horrible one ¯\_(ツ)_/¯

@mrocklin (Member):
When I looked at join reordering strategies earlier in this process I was surprised to learn that most systems like Spark/Presto that don't store their own data tend to have pretty generic join reordering strategies on by default. They might not do a ton. My guess is that the fancy optimizers tend to all be in databases that hold onto their own data and so store cardinality estimates (postgres, snowflake). Maybe Spark+Delta achieves this though?

hendrikmakait commented Jan 26, 2024

Yes, I thought so too. My sense is that even with this kind of optimization we could already catch a lot of stuff.
I could even see us implementing similar logic for stuff like "Is not a merge but a series of blockwise things that lead up to a merge".

This might be quite useful. I can see quite a few use cases where we drop the join keys as soon as we have finished a join, which would inject a Projection between the Merges.

My textbook/research knowledge for join order optimization is a little thin here. I honestly don't know whether "rearrange individual join tuples to get a local optimum" is a decent strategy or a horrible one ¯\_(ツ)_/¯

The risk of running into a local optimum is often mentioned as a downside of hill-climbing or iterative improvement for join ordering, which is AFAIR why it's not widely used as opposed to Selinger-style optimization. Then again, our cost model has very little to work with and we're dealing with DAGs, both of which change the game.
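For concreteness, a minimal sketch of the greedy, locally optimal flavor of reordering being discussed — repeatedly join the pair with the smallest estimated output. The uniform `selectivity` is a made-up stand-in for a real cardinality estimate; none of this is dask-expr code:

```python
from itertools import combinations

def greedy_join_order(sizes, selectivity=0.1):
    """Greedily join the two cheapest relations until one plan remains.

    sizes: mapping of relation name -> estimated row count.
    selectivity: assumed fraction of the cross product surviving a join.
    Returns the join order as a nested tuple.
    """
    relations = dict(sizes)
    while len(relations) > 1:
        # locally optimal choice: the pair with the smallest estimated output
        pair = min(
            combinations(relations, 2),
            key=lambda p: relations[p[0]] * relations[p[1]],
        )
        relations[pair] = relations[pair[0]] * relations[pair[1]] * selectivity
        del relations[pair[0]], relations[pair[1]]
    return next(iter(relations))
```

A Selinger-style optimizer would instead enumerate join subsets with dynamic programming; the greedy pass is cheap but can get stuck in exactly the local optima mentioned above.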
