-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use CTEs for metrics in multi-metric cases #1526
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss this one today?
I'm not sure that this does actually improve the readability of the SQL, especially in cases when it's actually adding a decent amount more SQL to the optimized query. I left several comments inline.
@@ -41,7 +41,7 @@ def find_common_branches(dataflow_plan: DataflowPlan) -> Sequence[DataflowPlanNo | |||
|
|||
@staticmethod | |||
def group_nodes_by_type(dataflow_plan: DataflowPlan) -> DataflowPlanNodeSet: | |||
"""Grouops dataflow plan nodes by type.""" | |||
"""Groups dataflow plan nodes by type.""" | |||
grouping_visitor = _GroupNodesByTypeVisitor() | |||
return dataflow_plan.sink_node.accept(grouping_visitor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a meta-question that's not specific to this PR - do we have any concerns about the performance of traversing the dataflow plan DAG so many times with all these new visitors? If so, I wonder if we could combine the functionality of some of these visitors that collect metadata about the DAG for optimization purposes into one visitor so that we only need to traverse the DAG once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fine since the number of nodes in a dataflow plan are generally small, e.g. < 100 nodes.
FROM ( | ||
-- Re-aggregate Metric via Group By | ||
-- Read From CTE For node_id=cm_5 | ||
WITH cm_4_cte AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker, but it might improve readability here if the CTE's had more readable aliases. Including the metric name in the alias would be great. But I'm not sure if that would be possible in a multi-metric scenario where we might have any number of metrics in one of these ComputeMetricsNodes
.
) subq_17 | ||
) | ||
|
||
, cm_5_cte AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting case where it actually seems like the CTE is making the SQL longer / harder to read. Since the derived metric only has one input metric and a simple expr
, but two ComputeMetricsNodes, we're adding a lot more SQL than we need. All we really do in the second CTE is add an alias.
Is there a way to better optimize for cases like this? If not, this makes me question if it is actually a good idea to always use CTEs for multi-metric cases (this whole PR).
@@ -40,4 +37,23 @@ FROM ( | |||
WHERE metric_time__martian_day = '2020-01-01' | |||
GROUP BY | |||
metric_time__day | |||
) subq_19 | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here re: longer / harder to read SQL
@@ -23,4 +20,24 @@ FROM ( | |||
DATE_TRUNC('day', bookings_source_src_28000.ds) = subq_14.ds | |||
GROUP BY | |||
subq_14.martian_day | |||
) subq_18 | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Because this change creates CTEs when CTEs aren't necessary, this will always produce more verbose SQL. This is more of a stylistic choice that provides a consistent form to the CTE with smaller parts. Analogous example is a long function that does a lot of things, or multiple smaller functions that each do one thing. @Jstein77 might have some thoughts here. |
Also to clarify, this was thought as a nice to have, so it's not blocking. |
booking__ds__martian_day | ||
, bookings_5_days_ago AS bookings_5_day_lag | ||
FROM ( | ||
-- Read From CTE For node_id=cm_5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment mean? Does this mean the final select statment is from cm_5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the ID of the node that produces the final result.
) subq_17 | ||
) | ||
|
||
SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to rename these columns at the end? We already did that step in cm_5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the result of following the rule, if multiple metrics are computed in a query (parts of a derived query are each considered a metric computation), have each metric computation is out into a CTE. We can change the rule to get different behavior.
, bookings_5_days_ago AS bookings_5_day_lag | ||
FROM ( | ||
-- Read From CTE For node_id=cm_5 | ||
WITH cm_4_cte AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does cm mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the internal ID prefix used for the dataflow node that computes metrics.
) subq_23 | ||
) | ||
|
||
, cm_5_cte AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to read from CTE cm_4 and compute metrics via expressions in one step?
i.e if I we're to write this by hand I would write
cm_5_cte as (
select
metric_time__day
, every_2_days_bookers_2_days_ago as every_2_days_bookers_2_days_ago
from cm_4_cte
)
SELECT
metric_time__day
, every_2_days_bookers_2_days_ago
from cm_5_cte
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but I think following the previous comments, it seems like we need to settle on a rule for when it helps to add these CTEs.
-- Compute Metrics via Expressions | ||
SELECT | ||
metric_time__day | ||
, every_2_days_bookers_2_days_ago AS every_2_days_bookers_2_days_ago |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the metric and measure name are the same do we need to try and alias the measure? In this case the alias doesn't do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, the rule is that we always have column aliases. We could change this, but it's existing behavior.
) | ||
|
||
SELECT | ||
metric_time__day AS metric_time__day |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above where the alias is redudant
) | ||
|
||
, cm_9_cte AS ( | ||
-- Compute Metrics via Expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seperation of logic here between cte 8 and 9 is really clear.
This PR updates the CTE generation logic to generate a CTE for each metric when computing multiple metrics in a query (counting parts of a derived metric as well). Although not necessary for performance as there might not be common computation, the generated SQL for multi-metric queries are easier to follow.