-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
developed imputation QC model that will monitor daily sample balance for different imputation methods #438
Conversation
…between observed and imputed and also monitor number of daily samples imputed by different imputation methods
…_agg_recent_one_week.sql
(occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved, | ||
|
||
-- Volume check: Sum of all volume percentages should equal 100 | ||
coalesce( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think adding these checks as tests instead of columns would be a better way of raising issues? That way if somebody in the future, for example, adds a new imputation method not captured here, the test(s) would automatically flag that this model needs changes as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that adding these as columns is the right way to go, it may be better to add them as tests only.
In general, it is very unreliable to do floating point math and expect strict equality checks to work. This is due to fundamental limitations in how floating point numbers are represented internally. Here is some recommendation from snowflake on the matter. The usual way to write this kind of test is to instead assert that the absolute value of the difference of the numbers is less than some small number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to drop those numerical test within the data frame. I have dropped all numerical sum test and will add separate yml test for that may be later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @JamesSLogan that moving the checks into tests would be more appropriate. I also recommend using a more precision-safe check for those checks.
(occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved, | ||
|
||
-- Volume check: Sum of all volume percentages should equal 100 | ||
coalesce( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that adding these as columns is the right way to go, it may be better to add them as tests only.
In general, it is very unreliable to do floating point math and expect strict equality checks to work. This is due to fundamental limitations in how floating point numbers are represented internally. Here is some recommendation from snowflake on the matter. The usual way to write this kind of test is to instead assert that the absolute value of the difference of the numbers is less than some small number.
@kengodleskidot , would you be able to add three test on this imputation summary table in your previous test file. One for speed, one for volume and another for occupancy. Observed, imputed and observed_unimputed sum for each metric should be 100 or super minor less |
fixed the typos
…aldata-mdsa-caltrans-pems.git; branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary
…-pems into imputation_daily_summary
…-pems into imputation_daily_summary
@JamesSLogan , Could you please look on this PR whenever you get time. It is failed due to dbt CI failure of 00:24:39 11 of 83 ERROR creating sql incremental model dbt_cloud_pr_15_438_diagnostics.int_diagnostics__constant_occupancy [ERROR in 1.70s]. I tried to drop the relevant one, but still failed. Thanks! |
…-pems into imputation_daily_summary
@JamesSLogan this error in the constant occupancy code is happening because Mintu is working off a version of the code before the change I applied to remove station id from its output. I'm assuming the best way to address this is to rebase Mintu's branch to the current version of the code then trying again. Is there a good way to do that in dbt Cloud? |
I learned the hard way that attempting to rebase with dbt Cloud is a recipe for heartache. If using cloud, it better to just merge from |
Woow, It seems that it passed the dbt failure! If you do not have any comments, please go ahead and approve this PR, so that we can merge it to main! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mintu! Has an issue been created for adding the tests discussed in this PR?
Ken is supposed to add test separately. No test in this PR |
This PR is built based on issue #437 which counted the daily sample size by different imputation and check the balance of sample size between the daily observed and imputation for QC purpose