developed imputation QC model that will monitor daily sample balance for different imputation methods #438

mmmiah · 2024-10-21T23:25:17Z

This PR is based on issue #437. It counts the daily sample size by imputation method and checks the balance of sample sizes between daily observed and imputed data for QC purposes.

JamesSLogan · 2024-10-28T21:40:31Z

transform/models/marts/quality/quality_imputation_daily_sample_count.sql

+        (occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved,
+
+        -- Volume check: Sum of all volume percentages should equal 100
+        coalesce(


Do you think adding these checks as tests instead of columns would be a better way of raising issues? That way if somebody in the future, for example, adds a new imputation method not captured here, the test(s) would automatically flag that this model needs changes as well.
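To illustrate how a test could flag an imputation method the model does not know about, here is a minimal Python sketch (not code from this PR). It assumes the model buckets samples into a fixed set of categories; the names `method_a`, `method_b`, `method_c`, `pct_covered`, and `balance_ok` are hypothetical. Once a new method appears, the known categories no longer cover 100% of the samples, so the sum check fails.

```python
import math

# Hypothetical category names, for illustration only.
KNOWN_METHODS = {"observed", "method_a", "method_b", "unobserved_unimputed"}


def pct_covered(counts_by_method, known_methods=KNOWN_METHODS):
    """Percentage of all samples that fall into a category the model knows."""
    total = sum(counts_by_method.values())
    if total == 0:
        return None  # same idea as nullif(sample_ct, 0): avoid dividing by zero
    known = sum(n for m, n in counts_by_method.items() if m in known_methods)
    return known / total * 100


def balance_ok(counts_by_method, tol=1e-6):
    """True when the known categories account for (approximately) 100%."""
    pct = pct_covered(counts_by_method)
    return pct is not None and math.isclose(pct, 100.0, abs_tol=tol)


today = {"observed": 700, "method_a": 200, "method_b": 100}
later = {**today, "method_c": 50}  # a new method nobody added to the model

print(balance_ok(today))  # every sample is in a known category
print(balance_ok(later))  # the 50 method_c samples are unaccounted for
```

The first call passes and the second fails, which is the signal that the model needs updating.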

mmmiah · 2024-11-06

I agree with dropping those numerical tests from the data frame. I have dropped all the numerical sum tests and will add separate yml tests for them, maybe later.

ian-r-rose

I agree with @JamesSLogan that moving the checks into tests would be more appropriate. I also recommend making those checks robust to floating-point precision.

ian-r-rose · 2024-10-31T23:22:04Z

transform/models/marts/quality/quality_imputation_daily_sample_count.sql

+        (occ_unobserved_unimputed / nullif(sample_ct, 0)) * 100 as pct_occ_unobserved,
+
+        -- Volume check: Sum of all volume percentages should equal 100
+        coalesce(


I'm not sure that adding these as columns is the right way to go; it may be better to add them as tests only.

In general, it is unreliable to do floating-point math and expect strict equality checks to work, because of fundamental limitations in how floating-point numbers are represented internally. Snowflake has some recommendations on the matter. The usual way to write this kind of test is to assert instead that the absolute value of the difference between the numbers is less than some small tolerance.
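A minimal Python sketch of the tolerance-based comparison described above. The helper name `sums_to` and the tolerance `1e-6` are assumptions chosen for illustration, not values from this PR; the same idea carries over to a SQL or dbt test.

```python
def sums_to(parts, target=100.0, tol=1e-6):
    """True when the parts add up to target within an absolute tolerance."""
    return abs(sum(parts) - target) < tol


# Strict equality is fragile with floating-point numbers.
print(0.1 + 0.2 == 0.3)  # False

# The tolerance-based check accepts the same values.
print(sums_to([0.1, 0.2], target=0.3))  # True

# Three equal percentage shares derived from an integer count.
sample_ct = 3
shares = [1 / sample_ct * 100] * 3
print(sums_to(shares))  # True, whether or not the sum is exactly 100.0
```

The tolerance should be small relative to the values being compared but comfortably larger than the expected rounding error.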

mmmiah · 2024-11-06T22:35:07Z

@kengodleskidot, would you be able to add three tests on this imputation summary table in your previous test file: one for speed, one for volume, and one for occupancy? For each metric, the observed, imputed, and unobserved_unimputed percentages should sum to 100 or very slightly less.
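One way the three requested checks might look, sketched in Python rather than as an actual dbt test. Only `pct_occ_unobserved` appears in the diff above; the other column names, the `failing_metrics` helper, and the allowed shortfall of `1e-4` are hypothetical.

```python
METRICS = ("speed", "volume", "occ")  # hypothetical column-name prefixes


def failing_metrics(row, max_shortfall=1e-4, float_eps=1e-9):
    """Return the metrics whose three shares do not sum to 100 or slightly less."""
    failures = []
    for metric in METRICS:
        total = (
            row[f"pct_{metric}_observed"]
            + row[f"pct_{metric}_imputed"]
            + row[f"pct_{metric}_unobserved"]
        )
        # Allow a small shortfall below 100 and only float noise above it.
        ok = (100.0 - max_shortfall) <= total <= (100.0 + float_eps)
        if not ok:
            failures.append(metric)
    return failures


row = {
    "pct_speed_observed": 70.0, "pct_speed_imputed": 25.0, "pct_speed_unobserved": 5.0,
    "pct_volume_observed": 60.0, "pct_volume_imputed": 30.0, "pct_volume_unobserved": 9.99995,
    "pct_occ_observed": 60.0, "pct_occ_imputed": 30.0, "pct_occ_unobserved": 5.0,
}

print(failing_metrics(row))  # ['occ']
```

Here speed sums to exactly 100, volume falls short by a negligible amount, and occupancy is flagged because 5% of its samples are unaccounted for.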

mmmiah · 2024-11-16T00:58:27Z

@JamesSLogan, could you please take a look at this PR whenever you get time? It is failing because of a dbt CI error: `00:24:39 11 of 83 ERROR creating sql incremental model dbt_cloud_pr_15_438_diagnostics.int_diagnostics__constant_occupancy [ERROR in 1.70s]`. I tried to drop the relevant one, but it still failed. Thanks!

summer-mothwood · 2024-11-18T20:09:18Z

@JamesSLogan this error in the constant occupancy code is happening because Mintu is working off a version of the code from before the change I applied to remove station id from its output. I'm assuming the best way to address this is to rebase Mintu's branch onto the current version of the code and then try again. Is there a good way to do that in dbt Cloud?

ian-r-rose · 2024-11-18T20:38:44Z

I learned the hard way that attempting to rebase with dbt Cloud is a recipe for heartache. If using Cloud, it's better to just merge from main.

mmmiah · 2024-11-21T01:17:49Z

Wow, it seems the dbt CI check passes now! If you do not have any further comments, please go ahead and approve this PR so that we can merge it into main!

JamesSLogan

Thanks Mintu! Has an issue been created for adding the tests discussed in this PR?

mmmiah · 2024-11-21T16:56:08Z

> Thanks Mintu! Has an issue been created for adding the tests discussed in this PR?

Ken is supposed to add the tests separately. There are no tests in this PR.

developed imputation QC model that will monitor daily sample balance between observed and imputed and also monitor number of daily samples imputed by different imputation methods

0adb928

mmmiah added the p3 - Low priority label Oct 21, 2024

mmmiah added this to the Data Quality Checks milestone Oct 21, 2024

mmmiah self-assigned this Oct 21, 2024

mmmiah requested review from ian-r-rose and kengodleskidot October 21, 2024 23:26

mmmiah marked this pull request as draft October 21, 2024 23:27

mmmiah added 2 commits October 21, 2024 23:30

fixed the wring label

1c6b765

added a new line at the end of model

82dc022

mmmiah marked this pull request as ready for review October 22, 2024 00:31

mmmiah added 2 commits October 22, 2024 23:16

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

5ccfab9

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

492ec5a

jkarpen linked an issue Oct 24, 2024 that may be closed by this pull request

Daily Imputed and Non imputed sample count by Imputation Methods #437

Closed

mmmiah added 4 commits October 24, 2024 16:31

use the coalesce function to avoid null issues that generated empty output

4cee094

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

55aa326

relabel the check

8caea28

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

b472ef8

mmmiah marked this pull request as draft October 24, 2024 23:41

re-structure the quality check

38742d1

mmmiah marked this pull request as ready for review October 25, 2024 00:01

mmmiah and others added 2 commits October 25, 2024 21:51

fixed the bugs

d4f8334

Delete transform/models/marts/performance/performance__station_metric_agg_recent_one_week.sql

43cfcf1

mmmiah requested review from thehanggit and JamesSLogan October 28, 2024 16:46

JamesSLogan reviewed Oct 28, 2024

ian-r-rose reviewed Oct 31, 2024

mmmiah added 2 commits November 6, 2024 19:39

resolved the issues without making any change

36afed0

added yml and removed the numerical test column

622532d

mmmiah requested review from ian-r-rose and JamesSLogan November 6, 2024 22:31

JamesSLogan reviewed Nov 7, 2024

mmmiah and others added 5 commits November 6, 2024 16:23

Update _quality.yml

4e332d2

fixed the typos

Update _quality.yml

381b4b6

Merge branch 'imputation_daily_summary' of https://github.com/cagov/caldata-mdsa-caltrans-pems.git; branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

22852c8

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

c1c0cb0

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

b11a3df

mmmiah requested a review from JamesSLogan November 16, 2024 01:06

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

0c4d550

summer-mothwood requested a review from tnrahim November 19, 2024 17:44

mmmiah removed the p3 - Low priority label Nov 19, 2024

mmmiah added 3 commits November 20, 2024 18:48

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

467ac68

rebuild all upstream and downstream model to pass dbt CI failure

2539e6a

Merge branch 'main' of https://github.com/cagov/caldata-mdsa-caltrans-pems into imputation_daily_summary

b467f46

JamesSLogan approved these changes Nov 21, 2024

mmmiah merged commit 296acd9 into main Nov 21, 2024
3 checks passed
