performance improvements #486
Conversation
Thanks @JamesSLogan. These look good to me, though I wonder if we can figure out more about what's going wrong with the row count summary.
Very clever
```
@@ -1,3 +1,6 @@
{{ config(
    snowflake_warehouse = get_snowflake_refresh_warehouse(small="XL", big="XL")
```
I'm still kind of shocked that this would require an XL warehouse. Counts should be dirt cheap with Snowflake in most circumstances, even on very large tables.
I suspect that the `distinct` keywords might be throwing off the performance here. Is there a way we could remove them from this query? I wonder if there's a linting rule we could enable to be like "are you sure you want to use `distinct`?"
I also see that partition pruning isn't happening in this model, even though the query selects the past 16 days of data, so it should be highly pruned.
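For illustration, here's the shape of filter that usually allows pruning versus one that doesn't; the table and column names below are hypothetical, not the actual model's:

```sql
-- Hypothetical table/column names, for illustration only.
-- Prunable: the raw date column is compared to an expression Snowflake can
-- evaluate at compile time, so micro-partitions outside the window are skipped.
select station_id, sample_date, sample_ct
from detector_agg_daily
where sample_date >= dateadd(day, -16, current_date);

-- Often not prunable: wrapping the column in a function can hide the
-- predicate from the partition metadata, forcing a scan of every partition.
-- where dateadd(day, 16, sample_date) >= current_date
```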
I would have to think about how we might remove/replace the distinct counts. In the current query profile, the count nodes account for a fraction of a percent of the cost compared to the I/O and group by operations.
It looks like the `quality__station_row_count_summary` model does select the past 16 days' worth of data. @mmmiah, do you think there is a need for the whole history of row counts in this `quality__row_count_summary` model?
Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts
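For reference, a minimal sketch of the pattern that doc describes, with hypothetical table and column names: deduplicate once at a fine grain into arrays, then roll up by combining the arrays instead of rescanning the raw rows.

```sql
-- Hypothetical names throughout; adapted from the linked Snowflake doc.
-- Step 1: materialize the distinct station ids observed each day.
create or replace table daily_station_ids as
select
    sample_date,
    array_unique_agg(station_id) as station_ids
from raw_detector_readings
group by sample_date;

-- Step 2: distinct counts at any coarser grain combine the small per-day
-- arrays; array_union_agg merges them and array_size counts the elements.
select
    date_trunc('month', sample_date) as sample_month,
    array_size(array_union_agg(station_ids)) as distinct_stations
from daily_station_ids
group by 1;
```

The presumed win is that the expensive deduplication happens once at the finest grain, and every subsequent rollup touches only the compact arrays.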
Yeah, I was off base about `distinct` here (though I generally think it's not great practice). The lack of partition pruning, however, does seem like a significant issue.
Thanks Mintu! That's helpful. I'm still not inclined to make it incremental, given the potential quality issues if imputation methods change and are backfilled.
@ian-r-rose this model is for all time; I think you were viewing the (similarly named) `quality__station_row_count_summary` model.
yep! I was looking at the wrong one. sigh...
If this model is not incremental, and for all time, then I think we want the `get_snowflake_warehouse()` macro vs. the `get_snowflake_refresh_warehouse()` one. The latter does different things depending on whether the `full_refresh` flag is set, but that's not relevant for non-incremental models.
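Something like this, though the exact signature of `get_snowflake_warehouse()` depends on how it's defined in this repo's macros, so treat the argument below as a guess:

```jinja
{# Hedged sketch; the size argument is assumed, not confirmed from the repo. #}
{{ config(
    snowflake_warehouse = get_snowflake_warehouse(size="XL")
) }}
```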
Good catch! This has been fixed.
> Have we tried using array size aggregate methods to see if this would improve performance? There are some size limitations to arrays in Snowflake, so I'm not sure if this is feasible here, but this might be interesting to test: https://docs.snowflake.com/en/user-guide/querying-arrays-for-distinct-counts
This is interesting; I'm unfamiliar with this technique. I don't think I quite understand it from the docs...
Yeah, I wish they explained or linked to documentation that expands on why this can improve performance, and I'm not finding anything useful elsewhere!
@JamesSLogan Looks good to me! Thanks for doing this!
Fixes #459
int_performance__detector_metrics_agg_daily
performance__station_metric_agg_recent_one_week
quality__row_count_summary