Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

developed imputation QC model that will monitor daily sample balance for different imputation methods #438

Merged
merged 23 commits into from
Nov 21, 2024

Conversation

mmmiah
Copy link
Contributor

@mmmiah mmmiah commented Oct 21, 2024

This PR is built based on issue #437 which counted the daily sample size by different imputation and check the balance of sample size between the daily observed and imputation for QC purpose

…between observed and imputed and also monitor number of daily samples imputed by different imputation methods
@mmmiah mmmiah added this to the Data Quality Checks milestone Oct 21, 2024
@mmmiah mmmiah self-assigned this Oct 21, 2024
@mmmiah mmmiah marked this pull request as draft October 21, 2024 23:27
@mmmiah mmmiah marked this pull request as ready for review October 22, 2024 00:31
@jkarpen jkarpen linked an issue Oct 24, 2024 that may be closed by this pull request
@mmmiah mmmiah marked this pull request as draft October 24, 2024 23:41
@mmmiah mmmiah marked this pull request as ready for review October 25, 2024 00:01
(occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved,

-- Volume check: Sum of all volume percentages should equal 100
coalesce(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think adding these checks as tests instead of columns would be a better way of raising issues? That way if somebody in the future, for example, adds a new imputation method not captured here, the test(s) would automatically flag that this model needs changes as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that adding these as columns is the right way to go, it may be better to add them as tests only.

In general, it is very unreliable to do floating point math and expect strict equality checks to work. This is due to fundamental limitations in how floating point numbers are represented internally. Here is some recommendation from snowflake on the matter. The usual way to write this kind of test is to instead assert that the absolute value of the difference of the numbers is less than some small number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to drop those numerical test within the data frame. I have dropped all numerical sum test and will add separate yml test for that may be later

Copy link
Member

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @JamesSLogan that moving the checks into tests would be more appropriate. I also recommend using a more precision-safe check for those checks.

(occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved,

-- Volume check: Sum of all volume percentages should equal 100
coalesce(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that adding these as columns is the right way to go, it may be better to add them as tests only.

In general, it is very unreliable to do floating point math and expect strict equality checks to work. This is due to fundamental limitations in how floating point numbers are represented internally. Here is some recommendation from snowflake on the matter. The usual way to write this kind of test is to instead assert that the absolute value of the difference of the numbers is less than some small number.

@mmmiah
Copy link
Contributor Author

mmmiah commented Nov 6, 2024

@kengodleskidot , would you be able to add three test on this imputation summary table in your previous test file. One for speed, one for volume and another for occupancy. Observed, imputed and observed_unimputed sum for each metric should be 100 or super minor less
image

@mmmiah
Copy link
Contributor Author

mmmiah commented Nov 16, 2024

@JamesSLogan , Could you please look on this PR whenever you get time. It is failed due to dbt CI failure of 00:24:39 11 of 83 ERROR creating sql incremental model dbt_cloud_pr_15_438_diagnostics.int_diagnostics__constant_occupancy [ERROR in 1.70s]. I tried to drop the relevant one, but still failed. Thanks!
image

@mmmiah mmmiah requested a review from JamesSLogan November 16, 2024 01:06
@summer-mothwood
Copy link
Contributor

@JamesSLogan this error in the constant occupancy code is happening because Mintu is working off a version of the code before the change I applied to remove station id from its output. I'm assuming the best way to address this is to rebase Mintu's branch to the current version of the code then trying again. Is there a good way to do that in dbt Cloud?

@ian-r-rose
Copy link
Member

I learned the hard way that attempting to rebase with dbt Cloud is a recipe for heartache. If using cloud, it better to just merge from main.

@mmmiah
Copy link
Contributor Author

mmmiah commented Nov 21, 2024

Woow, It seems that it passed the dbt failure! If you do not have any comments, please go ahead and approve this PR, so that we can merge it to main!

Copy link
Contributor

@JamesSLogan JamesSLogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mintu! Has an issue been created for adding the tests discussed in this PR?

@mmmiah
Copy link
Contributor Author

mmmiah commented Nov 21, 2024

Thanks Mintu! Has an issue been created for adding the tests discussed in this PR?

Ken is supposed to add test separately. No test in this PR

@mmmiah mmmiah merged commit 296acd9 into main Nov 21, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Daily Imputed and Non imputed sample count by Imputation Methods
4 participants