Use CTEs for metrics in multi-metric cases #1526

plypaul · 2024-11-13T06:09:20Z

This PR updates the CTE generation logic to generate a CTE for each metric when a query computes multiple metrics (the parts of a derived metric each count as a metric). This is not necessary for performance, since there might not be any common computation, but the generated SQL for multi-metric queries is easier to follow.

courtneyholcomb

Let's discuss this one today?
I'm not sure that this does actually improve the readability of the SQL, especially in cases when it's actually adding a decent amount more SQL to the optimized query. I left several comments inline.

courtneyholcomb · 2024-11-13T14:48:35Z

metricflow/dataflow/dataflow_plan_analyzer.py

@@ -41,7 +41,7 @@ def find_common_branches(dataflow_plan: DataflowPlan) -> Sequence[DataflowPlanNo

    @staticmethod
    def group_nodes_by_type(dataflow_plan: DataflowPlan) -> DataflowPlanNodeSet:
-        """Grouops dataflow plan nodes by type."""
+        """Groups dataflow plan nodes by type."""
        grouping_visitor = _GroupNodesByTypeVisitor()
        return dataflow_plan.sink_node.accept(grouping_visitor)


This is just a meta-question that's not specific to this PR - do we have any concerns about the performance of traversing the dataflow plan DAG so many times with all these new visitors? If so, I wonder if we could combine the functionality of some of these visitors that collect metadata about the DAG for optimization purposes into one visitor so that we only need to traverse the DAG once.

This should be fine since the number of nodes in a dataflow plan is generally small, e.g. < 100 nodes.
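For illustration, a minimal sketch of the single-traversal idea raised above: one pass over the DAG that groups node IDs by type and visits shared nodes only once. The `Node` class, the type strings, and the `rss_0` ID are hypothetical stand-ins, not MetricFlow's actual classes or visitor API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """Hypothetical stand-in for a dataflow plan node."""

    node_id: str
    node_type: str
    parents: List["Node"] = field(default_factory=list)


def collect_metadata(sink: Node) -> Dict[str, List[str]]:
    """Group node IDs by type in one traversal, visiting shared nodes once."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    seen: Set[str] = set()
    stack = [sink]
    while stack:
        node = stack.pop()
        if node.node_id in seen:
            continue  # Reached through another branch of the DAG already.
        seen.add(node.node_id)
        by_type[node.node_type].append(node.node_id)
        stack.extend(node.parents)
    return dict(by_type)


# Two metric computations that share one (hypothetical) source node.
source = Node("rss_0", "read_source")
cm_4 = Node("cm_4", "compute_metrics", [source])
cm_5 = Node("cm_5", "compute_metrics", [cm_4, source])
groups = collect_metadata(cm_5)
```

With plans of under 100 nodes, the difference between one combined pass and several separate visitors is likely negligible, which matches the reply above.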

courtneyholcomb · 2024-11-13T14:53:27Z

...QueryPlan/DuckDB/test_derived_cumulative_metric_with_non_default_grains__plan0_optimized.sql

-FROM (
-  -- Re-aggregate Metric via Group By
+-- Read From CTE For node_id=cm_5
+WITH cm_4_cte AS (


Not a blocker, but it might improve readability here if the CTEs had more readable aliases. Including the metric name in the alias would be great. But I'm not sure if that would be possible in a multi-metric scenario where we might have any number of metrics in one of these ComputeMetricsNodes.
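One possible shape for such an alias, as a sketch only: embed the metric names when a node computes just a few metrics, and fall back to the current `<node_id>_cte` form otherwise. The function name, the `max_names` cutoff, and the double-underscore separator are assumptions made for illustration.

```python
import re
from typing import Sequence


def cte_alias(node_id: str, metric_names: Sequence[str], max_names: int = 2) -> str:
    """Build a CTE alias, embedding metric names only when there are few of them."""
    if 0 < len(metric_names) <= max_names:
        # Replace any character that is not a word character with an underscore.
        safe = [re.sub(r"\W", "_", name) for name in metric_names]
        return f"{node_id}__{'__'.join(safe)}_cte"
    # Too many metrics (or none known): keep the node-ID-only alias.
    return f"{node_id}_cte"
```

Keeping the node ID as a prefix means the alias stays unique even if two nodes compute the same metric.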

courtneyholcomb · 2024-11-13T14:59:33Z

...arity.py/SqlQueryPlan/DuckDB/test_offset_metric_with_custom_granularity__plan0_optimized.sql

-) subq_17
+)
+
+, cm_5_cte AS (


This is an interesting case where it actually seems like the CTE is making the SQL longer / harder to read. Since the derived metric only has one input metric and a simple expr, but two ComputeMetricsNodes, we're adding a lot more SQL than we need. All we really do in the second CTE is add an alias.
Is there a way to better optimize for cases like this? If not, this makes me question if it is actually a good idea to always use CTEs for multi-metric cases (this whole PR).
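As a rough sketch of one heuristic for this case: skip the CTE for a metric node that has a single parent and whose columns are all bare column references, so that it only renames. `MetricNode`, `needs_own_cte`, and the `SUM(bookings)` expression are hypothetical and not part of MetricFlow; `str.isidentifier()` is used here as a crude test for "bare column reference".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetricNode:
    """Hypothetical summary of a metric-computing node."""

    node_id: str
    parent_ids: Tuple[str, ...]
    # (expression, alias) pairs for the columns the node selects.
    columns: Tuple[Tuple[str, str], ...]


def needs_own_cte(node: MetricNode) -> bool:
    """Return False for nodes that only rename columns from a single parent."""
    only_renames = all(expr.isidentifier() for expr, _ in node.columns)
    return not (len(node.parent_ids) == 1 and only_renames)


cm_4 = MetricNode(
    "cm_4",
    (),
    (("booking__ds__martian_day", "booking__ds__martian_day"), ("SUM(bookings)", "bookings_5_days_ago")),
)
cm_5 = MetricNode(
    "cm_5",
    ("cm_4",),
    (("booking__ds__martian_day", "booking__ds__martian_day"), ("bookings_5_days_ago", "bookings_5_day_lag")),
)
```

Under this rule `cm_4` keeps its CTE and `cm_5` is folded into the final SELECT.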

courtneyholcomb · 2024-11-13T15:00:24Z

...uckDB/test_offset_metric_with_custom_granularity_filter_not_in_group_by__plan0_optimized.sql

@@ -40,4 +37,23 @@ FROM (
  WHERE metric_time__martian_day = '2020-01-01'
  GROUP BY
    metric_time__day
-) subq_19
+)


Same here re: longer / harder to read SQL

courtneyholcomb · 2024-11-13T15:01:20Z

...rity.py/SqlQueryPlan/DuckDB/test_derived_metric_with_custom_granularity__plan0_optimized.sql

@@ -23,4 +20,24 @@ FROM (
    DATE_TRUNC('day', bookings_source_src_28000.ds) = subq_14.ds
  GROUP BY
    subq_14.martian_day
-) subq_18
+)


plypaul · 2024-11-13T15:54:32Z

I'm not sure that this does actually improve the readability of the SQL, especially in cases when it's actually adding a decent amount more SQL to the optimized query. I left several comments inline.

Because this change creates CTEs when CTEs aren't necessary, it will always produce more verbose SQL. This is more of a stylistic choice that gives the query a consistent form made of smaller parts. An analogous example is one long function that does a lot of things versus multiple smaller functions that each do one thing.

@Jstein77 might have some thoughts here.

plypaul · 2024-11-13T19:16:44Z

Also, to clarify, this was intended as a nice-to-have, so it's not blocking.

Jstein77 · 2024-11-13T21:15:19Z

...arity.py/SqlQueryPlan/DuckDB/test_offset_metric_with_custom_granularity__plan0_optimized.sql

-  booking__ds__martian_day
-  , bookings_5_days_ago AS bookings_5_day_lag
-FROM (
+-- Read From CTE For node_id=cm_5


What does this comment mean? Does this mean the final select statement is from cm_5?

Yes, that's the ID of the node that produces the final result.

Jstein77 · 2024-11-13T21:17:37Z

...arity.py/SqlQueryPlan/DuckDB/test_offset_metric_with_custom_granularity__plan0_optimized.sql

+  ) subq_17
+)
+
+SELECT


Do we need to rename these columns at the end? We already did that step in cm_5

This is the result of following the rule: if multiple metrics are computed in a query (parts of a derived metric are each considered a metric computation), put each metric computation into a CTE. We can change the rule to get different behavior.
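A simplified sketch of that rule, assuming each metric computation has already been rendered to a SELECT statement. `render_query` and the table and column names in the usage are hypothetical; the real renderer selects explicit, aliased columns in the final statement rather than `SELECT *`.

```python
from typing import Dict


def render_query(metric_selects: Dict[str, str], final_node_id: str) -> str:
    """Wrap each metric computation in a CTE when there is more than one.

    metric_selects maps a node ID to its SELECT statement, in dependency order.
    """
    if len(metric_selects) <= 1:
        # Single metric computation: no CTE is generated.
        return metric_selects[final_node_id]

    ctes = [f"{node_id}_cte AS (\n{sql}\n)" for node_id, sql in metric_selects.items()]
    lines = [
        f"-- Read From CTE For node_id={final_node_id}",
        "WITH " + "\n\n, ".join(ctes),
        "",
        f"SELECT * FROM {final_node_id}_cte",
    ]
    return "\n".join(lines)


sql = render_query(
    {
        "cm_4": "  SELECT metric_time__day, SUM(bookings) AS bookings FROM bookings_source GROUP BY metric_time__day",
        "cm_5": "  SELECT metric_time__day, bookings AS bookings_alias FROM cm_4_cte",
    },
    final_node_id="cm_5",
)
```

Changing the rule would amount to replacing the `len(metric_selects) <= 1` check with a different condition.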

Jstein77 · 2024-11-13T21:17:49Z

...arity.py/SqlQueryPlan/DuckDB/test_offset_metric_with_custom_granularity__plan0_optimized.sql

-  , bookings_5_days_ago AS bookings_5_day_lag
-FROM (
+-- Read From CTE For node_id=cm_5
+WITH cm_4_cte AS (


What does cm mean?

That's the internal ID prefix used for the dataflow node that computes metrics.

Jstein77 · 2024-11-13T21:25:53Z

...ueryPlan/DuckDB/test_cumulative_time_offset_metric_with_time_constraint__plan0_optimized.sql

-) subq_23
+)
+
+, cm_5_cte AS (


Is it possible to read from CTE cm_4 and compute metrics via expressions in one step?

i.e., if I were to write this by hand, I would write:

cm_5_cte as (
select
metric_time__day
, every_2_days_bookers_2_days_ago as every_2_days_bookers_2_days_ago
from cm_4_cte
)

SELECT
metric_time__day
, every_2_days_bookers_2_days_ago
from cm_5_cte

Sure, but following the previous comments, it seems like we first need to settle on a rule for when it helps to add these CTEs.

Jstein77 · 2024-11-13T21:26:31Z

...ueryPlan/DuckDB/test_cumulative_time_offset_metric_with_time_constraint__plan0_optimized.sql

+  -- Compute Metrics via Expressions
+  SELECT
+    metric_time__day
+    , every_2_days_bookers_2_days_ago AS every_2_days_bookers_2_days_ago


If the metric and measure names are the same, do we need to alias the measure? In this case the alias doesn't do anything.

Right now, the rule is that we always have column aliases. We could change this, but it's existing behavior.
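For reference, a minimal sketch of what relaxing that rule could look like. `render_column` and its `drop_redundant_alias` flag are hypothetical; the default keeps the existing always-alias behavior.

```python
def render_column(expr: str, alias: str, drop_redundant_alias: bool = False) -> str:
    """Render one SELECT column, optionally omitting an alias equal to the expression."""
    if drop_redundant_alias and expr == alias:
        return expr
    return f"{expr} AS {alias}"


# Existing behavior: the alias is always emitted.
always = render_column("metric_time__day", "metric_time__day")
# Relaxed behavior: a no-op alias is dropped, a real rename is kept.
dropped = render_column("metric_time__day", "metric_time__day", drop_redundant_alias=True)
kept = render_column("bookings_5_days_ago", "bookings_5_day_lag", drop_redundant_alias=True)
```

A plain string comparison only catches bare column references; expressions that happen to share text with their alias would need a proper check against the SQL expression tree.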

Jstein77 · 2024-11-13T21:26:53Z

...ueryPlan/DuckDB/test_cumulative_time_offset_metric_with_time_constraint__plan0_optimized.sql

+)
+
+SELECT
+  metric_time__day AS metric_time__day


Same comment as above, where the alias is redundant.

Jstein77 · 2024-11-13T21:28:04Z

...est_derived_metric_rendering.py/SqlQueryPlan/DuckDB/test_derived_metric__plan0_optimized.sql

+)
+
+, cm_9_cte AS (
+  -- Compute Metrics via Expressions


The separation of logic here between CTEs 8 and 9 is really clear.

plypaul added the Skip Changelog label Nov 13, 2024

cla-bot bot added the cla:yes label Nov 13, 2024

plypaul force-pushed the p--cte--19 branch from 96d31c1 to 9d23bbc Compare November 13, 2024 06:30

plypaul marked this pull request as ready for review November 13, 2024 06:36

courtneyholcomb reviewed Nov 13, 2024

Jstein77 reviewed Nov 13, 2024

plypaul force-pushed the p--cte--18 branch from 11ec9f8 to e1901ab Compare November 14, 2024 17:27

plypaul force-pushed the p--cte--19 branch from 9d23bbc to 3c153b2 Compare November 14, 2024 17:27

plypaul added 3 commits November 14, 2024 09:35

/* PR_START p--cte 18 */ Update default optimization level to O5.

8434dc8

Update snapshots for DuckDB.

0eafcbe

Update other engine snapshots.

e90741b

plypaul force-pushed the p--cte--18 branch from e1901ab to 68f9c00 Compare November 14, 2024 17:35

plypaul force-pushed the p--cte--19 branch from 3c153b2 to 10af148 Compare November 14, 2024 17:35

plypaul added 5 commits November 14, 2024 09:38

Add change log for #1040.

fbca5cf

/* PR_START p--cte 19 */ Add method to group nodes by type.

ada9adc

Update to convert metric nodes to cte.

f0c6b9f

Update snapshots with DuckDB.

812a6cc

Update snapshots for other SQL engines.

144f1b4

plypaul force-pushed the p--cte--18 branch from 68f9c00 to fbca5cf Compare November 14, 2024 17:39

plypaul force-pushed the p--cte--19 branch from 10af148 to 144f1b4 Compare November 14, 2024 17:39

Base automatically changed from p--cte--18 to main November 14, 2024 17:44
