performance improvements #486

Merged: 5 commits into main, Nov 22, 2024

Conversation

@JamesSLogan (Contributor):

Fixes #459

  • int_performance__detector_metrics_agg_daily
    • Made this model incremental. This daily performance aggregation straddles the line between the 5-minute/hourly models, which are already incremental, and the weekly/monthly ones, which are not.
  • performance__station_metric_agg_recent_one_week
    • The second self-join in the model explodes to more than 4 billion records, even on one week's worth of data. I reworked it, and the query now takes under 30 seconds on an XS warehouse.
  • quality__row_count_summary
    • Bumped up the warehouse size. This model is meant to give an accurate view of row counts across a few different places. It could be made incremental, but that change would likely be forgotten in future full refreshes and compromise its accuracy. The XL warehouse will be faster and parallelizes well with this query.
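For context, the incremental conversion described in the first bullet typically follows dbt's standard incremental pattern. A minimal sketch, assuming hypothetical column names (`detector_id`, `sample_date`) and a hypothetical upstream 5-minute model, not the project's actual code:

```sql
-- Sketch only: model, column, and ref() names here are illustrative guesses,
-- not the project's real schema.
{{ config(
    materialized="incremental",
    unique_key=["detector_id", "sample_date"],
) }}

select
    detector_id,
    sample_date,
    sum(volume) as daily_volume,
    avg(occupancy) as avg_occupancy
from {{ ref("int_performance__detector_agg_five_minutes") }}
{% if is_incremental() %}
-- On incremental runs, only rebuild days newer than what already exists
where sample_date > (select max(sample_date) from {{ this }})
{% endif %}
group by detector_id, sample_date
```

On a `--full-refresh` run, `is_incremental()` is false and the model rebuilds from scratch, so the incremental filter only applies to routine runs.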

@ian-r-rose (Member) left a comment:

Thanks @JamesSLogan. These look good to me, though I wonder if we can dig more into what's going wrong with the row count summary.

Member:

Very clever

```diff
@@ -1,3 +1,6 @@
 {{ config(
+    snowflake_warehouse = get_snowflake_refresh_warehouse(small="XL", big="XL")
```
Member:

I'm still kind of shocked that this would require an XL warehouse. Counts should be dirt cheap with Snowflake in most circumstances, even on very large tables.

I suspect the distinct keywords might be throwing off performance here. Is there a way we could remove them from this query? I also wonder if there's a linting rule we could enable along the lines of "are you sure you want to use distinct?"

Member:

I also see that partition pruning isn't happening in this model, even though the query selects the past 16 days of data, so it should be highly pruned.
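For context (not from the PR itself): Snowflake prunes micro-partitions when a filter can be evaluated against per-partition min/max metadata, which generally requires a simple predicate on a well-clustered column. A sketch with placeholder table and column names:

```sql
-- Placeholder names; illustrates a predicate shape that allows pruning.
-- A plain range filter on a (clustered) date column can be checked against
-- micro-partition min/max metadata, so irrelevant partitions are skipped:
select count(*)
from raw_detector_counts
where sample_date >= dateadd(day, -16, current_date);

-- By contrast, wrapping the column in a function hides it from that metadata
-- check and can defeat pruning, e.g.:
-- where to_varchar(sample_date) >= '2024-11-06'
```

Checking the query profile's "partitions scanned" versus "partitions total" is the usual way to confirm whether pruning is actually happening.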

@JamesSLogan (author):

I would have to think about how we might remove or replace the distinct counts. In the current query profile, the count nodes amount to a fraction of a percent of the cost relative to the I/O and group-by operations.

It looks like the quality__station_row_count_summary model does select the past 16 days' worth of data. @mmmiah, do you think there is a need for the whole history of row counts in this quality__row_count_summary model?

Contributor:

Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts
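The linked Snowflake docs describe precomputing arrays of distinct values so that distinct counts become array-size lookups instead of repeated scans. A rough sketch of the idea with placeholder names (the array size limitation mentioned above applies to the aggregated arrays):

```sql
-- Placeholder table/column names; sketch of the array-based technique from
-- the linked docs, not the project's actual query.
-- Precompute each day's distinct values once:
create or replace table daily_distinct_detectors as
select
    sample_date,
    array_unique_agg(detector_id) as detector_ids
from raw_detector_counts
group by sample_date;

-- Distinct counts are then cheap array-size lookups:
select
    sample_date,
    array_size(detector_ids) as distinct_detector_count
from daily_distinct_detectors;
```

The payoff comes from amortizing the deduplication work: downstream queries read the small precomputed table rather than re-scanning and re-deduplicating the raw data.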

Member:

Yeah, I was off base about distinct here (though I generally think it's not great practice). The lack of partition pruning, however, does seem like a significant issue.

@JamesSLogan (author):

Thanks, Mintu! That's helpful. I'm still not inclined to make it incremental, though, because of potential quality issues if imputation methods change and are backfilled.

@ian-r-rose this model is for all time; I think you were viewing the (similarly named) quality__station_row_count_summary model.

Member:

yep! I was looking at the wrong one. sigh...

Member:

If this model is not incremental and covers all time, then I think we want the get_snowflake_warehouse() macro rather than get_snowflake_refresh_warehouse(). The latter does different things depending on whether the full_refresh flag is set, but that's not relevant for non-incremental models.

@JamesSLogan (author):

Good catch! This has been fixed.

Contributor:

> Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts

> This is interesting, I'm unfamiliar with this technique. I don't think I quite understand it from the docs...

Yeah, I wish they explained or linked to documentation that expands on why this can improve performance, and I'm not finding anything useful elsewhere!

@mmmiah (Contributor) left a comment:

@JamesSLogan Looks good to me! Thanks for doing this!

@ian-r-rose ian-r-rose merged commit de1f2b9 into main Nov 22, 2024
3 checks passed
@JamesSLogan JamesSLogan deleted the perf_fixes branch January 9, 2025 19:03
Successfully merging this pull request may close: Investigate slow QA/QC models (#459)