performance improvements #486
Conversation
Thanks @JamesSLogan. These look good to me, though I wonder if we can figure out more about what's going wrong with the row count summary.
Very clever
```
@@ -1,3 +1,6 @@
{{ config(
    snowflake_warehouse = get_snowflake_refresh_warehouse(small="XL", big="XL")
```
I'm still kind of shocked that this would require an XL warehouse. Counts should be dirt cheap with Snowflake in most circumstances, even on very large tables.
I suspect that the `distinct` keywords might be throwing off the performance here. Is there a way we could remove them from this query? I wonder if there's a linting rule we could enable to be like "are you sure you want to use `distinct`?"
I also see that partition pruning isn't happening in this model, even though the query selects the past 16 days of data, so it should be highly pruned.
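For illustration, here's the shape of filter that usually allows pruning versus one that doesn't; the table and column names below are hypothetical, not the actual model's:

```sql
-- Hypothetical table/column names, for illustration only.
-- Prunable: the raw date column is compared to an expression Snowflake can
-- evaluate at compile time, so micro-partitions outside the window are skipped.
select station_id, sample_date, sample_ct
from detector_agg_daily
where sample_date >= dateadd(day, -16, current_date);

-- Often not prunable: wrapping the column in a function can hide the
-- predicate from the partition metadata, forcing a scan of every partition.
-- where dateadd(day, 16, sample_date) >= current_date
```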
I would have to think about how we might remove/replace the distinct counts. In the current query profile, the count nodes account for a fraction of a percent of the cost compared to the I/O and group by operations.
It looks like the `quality__station_row_count_summary` model does select the past 16 days' worth of data. @mmmiah, do you think there is a need for the whole history of row counts in this `quality__row_count_summary` model?
Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts
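For reference, a minimal sketch of the pattern that doc describes, with hypothetical table and column names: deduplicate once at a fine grain into arrays, then roll up by combining the arrays instead of rescanning the raw rows.

```sql
-- Hypothetical names throughout; adapted from the linked Snowflake doc.
-- Step 1: materialize the distinct station ids observed each day.
create or replace table daily_station_ids as
select
    sample_date,
    array_unique_agg(station_id) as station_ids
from raw_detector_readings
group by sample_date;

-- Step 2: distinct counts at any coarser grain combine the small per-day
-- arrays; array_union_agg merges them and array_size counts the elements.
select
    date_trunc('month', sample_date) as sample_month,
    array_size(array_union_agg(station_ids)) as distinct_stations
from daily_station_ids
group by 1;
```

The presumed win is that the expensive deduplication happens once at the finest grain, and every subsequent rollup touches only the compact arrays.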
Yeah, I was off base about `distinct` here (though I generally think it's not great practice). The lack of partition pruning, however, does seem like a significant issue.
Thanks Mintu! That's helpful. I'm still not inclined to make it incremental, given the potential quality issues if imputation methods change and are backfilled.
@ian-r-rose this model is for all time; I think you were viewing the (similarly named) `quality__station_row_count_summary` model.
yep! I was looking at the wrong one. sigh...
If this model is not incremental, and for all time, then I think we want the `get_snowflake_warehouse()` macro vs. the `get_snowflake_refresh_warehouse()` one. The latter does different things depending on whether the `full_refresh` flag is set, but that's not relevant for non-incremental models.
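Something like this, though the exact signature of `get_snowflake_warehouse()` depends on how it's defined in this repo's macros, so treat the argument below as a guess:

```jinja
{# Hedged sketch; the size argument is assumed, not confirmed from the repo. #}
{{ config(
    snowflake_warehouse = get_snowflake_warehouse(size="XL")
) }}
```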
Good catch! This has been fixed.
> Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts
This is interesting; I'm unfamiliar with this technique. I don't think I quite understand it from the docs...
Yeah, I wish they explained or linked to documentation that expands on why this can improve performance, and I'm not finding anything useful elsewhere!
@JamesSLogan Looks good to me! Thanks for doing this!
Fixes #459
int_performance__detector_metrics_agg_daily
performance__station_metric_agg_recent_one_week
quality__row_count_summary