Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CTEs for metrics in multi-metric cases #1526

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Use CTEs for metrics in multi-metric cases #1526

wants to merge 8 commits into from

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Nov 13, 2024

This PR updates the CTE generation logic to generate a CTE for each metric when computing multiple metrics in a query (counting parts of a derived metric as well). Although not necessary for performance as there might not be common computation, the generated SQL for multi-metric queries are easier to follow.

@cla-bot cla-bot bot added the cla:yes label Nov 13, 2024
@plypaul plypaul marked this pull request as ready for review November 13, 2024 06:36
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this one today?
I'm not sure that this does actually improve the readability of the SQL, especially in cases when it's actually adding a decent amount more SQL to the optimized query. I left several comments inline.

@@ -41,7 +41,7 @@ def find_common_branches(dataflow_plan: DataflowPlan) -> Sequence[DataflowPlanNo

@staticmethod
def group_nodes_by_type(dataflow_plan: DataflowPlan) -> DataflowPlanNodeSet:
"""Grouops dataflow plan nodes by type."""
"""Groups dataflow plan nodes by type."""
grouping_visitor = _GroupNodesByTypeVisitor()
return dataflow_plan.sink_node.accept(grouping_visitor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a meta-question that's not specific to this PR - do we have any concerns about the performance of traversing the dataflow plan DAG so many times with all these new visitors? If so, I wonder if we could combine the functionality of some of these visitors that collect metadata about the DAG for optimization purposes into one visitor so that we only need to traverse the DAG once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fine since the number of nodes in a dataflow plan are generally small, e.g. < 100 nodes.

FROM (
-- Re-aggregate Metric via Group By
-- Read From CTE For node_id=cm_5
WITH cm_4_cte AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but it might improve readability here if the CTE's had more readable aliases. Including the metric name in the alias would be great. But I'm not sure if that would be possible in a multi-metric scenario where we might have any number of metrics in one of these ComputeMetricsNodes.

) subq_17
)

, cm_5_cte AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case where it actually seems like the CTE is making the SQL longer / harder to read. Since the derived metric only has one input metric and a simple expr, but two ComputeMetricsNodes, we're adding a lot more SQL than we need. All we really do in the second CTE is add an alias.
Is there a way to better optimize for cases like this? If not, this makes me question if it is actually a good idea to always use CTEs for multi-metric cases (this whole PR).

@@ -40,4 +37,23 @@ FROM (
WHERE metric_time__martian_day = '2020-01-01'
GROUP BY
metric_time__day
) subq_19
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here re: longer / harder to read SQL

@@ -23,4 +20,24 @@ FROM (
DATE_TRUNC('day', bookings_source_src_28000.ds) = subq_14.ds
GROUP BY
subq_14.martian_day
) subq_18
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@plypaul
Copy link
Contributor Author

plypaul commented Nov 13, 2024

I'm not sure that this does actually improve the readability of the SQL, especially in cases when it's actually adding a decent amount more SQL to the optimized query. I left several comments inline.

Because this change creates CTEs when CTEs aren't necessary, this will always produce more verbose SQL. This is more of a stylistic choice that provides a consistent form to the CTE with smaller parts. Analogous example is a long function that does a lot of things, or multiple smaller functions that each do one thing.

@Jstein77 might have some thoughts here.

@plypaul
Copy link
Contributor Author

plypaul commented Nov 13, 2024

Also to clarify, this was thought as a nice to have, so it's not blocking.

booking__ds__martian_day
, bookings_5_days_ago AS bookings_5_day_lag
FROM (
-- Read From CTE For node_id=cm_5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean? Does this mean the final select statment is from cm_5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the ID of the node that produces the final result.

) subq_17
)

SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to rename these columns at the end? We already did that step in cm_5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the result of following the rule, if multiple metrics are computed in a query (parts of a derived query are each considered a metric computation), have each metric computation is out into a CTE. We can change the rule to get different behavior.

, bookings_5_days_ago AS bookings_5_day_lag
FROM (
-- Read From CTE For node_id=cm_5
WITH cm_4_cte AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does cm mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the internal ID prefix used for the dataflow node that computes metrics.

) subq_23
)

, cm_5_cte AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to read from CTE cm_4 and compute metrics via expressions in one step?

i.e if I we're to write this by hand I would write

cm_5_cte as (
select
metric_time__day
, every_2_days_bookers_2_days_ago as every_2_days_bookers_2_days_ago
from cm_4_cte
)

SELECT
metric_time__day
, every_2_days_bookers_2_days_ago
from cm_5_cte

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I think following the previous comments, it seems like we need to settle on a rule for when it helps to add these CTEs.

-- Compute Metrics via Expressions
SELECT
metric_time__day
, every_2_days_bookers_2_days_ago AS every_2_days_bookers_2_days_ago
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the metric and measure name are the same do we need to try and alias the measure? In this case the alias doesn't do anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the rule is that we always have column aliases. We could change this, but it's existing behavior.

)

SELECT
metric_time__day AS metric_time__day
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above where the alias is redudant

)

, cm_9_cte AS (
-- Compute Metrics via Expressions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seperation of logic here between cte 8 and 9 is really clear.

Base automatically changed from p--cte--18 to main November 14, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants