Standardize unique_key for incremental models #432
I agree James, there are several incremental models that have unique keys set that are not at the most unique / lowest level of grain, which leaves us open to potential CI issues with duplicates when detectors move to different lanes, etc.
Category 1: UNIQUE KEY = DETECTOR_ID + TIME -- no action needed
Category 2: UNIQUE KEY = STATION_ID + TIME but model is at the station level -- no action needed (the bottleneck models all seem to be based on station ID, not detector ID, so I would keep these as is):
note: this model has a unique key at the station_id, lane, sample_date level, and both upstream and downstream models are merged at the detector id level, so it's possible we want to expand this model to be at that level as well, but because of what it's doing, it didn't seem to me that this model needed a change -- curious what you think @JamesSLogan :
Category 3: UNIQUE KEY = STATION ID + TIME and model has upstream/downstream models with an incremental merge strategy that relies on detector ID -- UNIQUE KEY CONFIG NEEDS CHANGE
note: I think the whole imputation journey should be updated to merge uniquely on the detector_id level, just as the non-imputed journey is.
Category 4: UNIQUE KEY = STATION ID + LANE + ETC but code / output columns do not currently contain detector IDs BUT the data grain is at the detector level (using station id + lane) -- UNIQUE KEY SHOULD BE UPDATED but CODE NEEDS UPDATING FIRST
Category 5: MODELS ARE NOT INCREMENTAL but maybe they should be -- evaluate if unique key would be worth configuring
These are models that are not set to be incremental, probably because they are mostly aggregate models that we expect to be small enough that the unique key constraint isn't necessary. But if we are looking for performance gains in CI, it may be worth configuring these as incremental.
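To make the duplicate risk mentioned at the top of this comment concrete, here is a minimal Python sketch of an upsert keyed on a configurable set of columns. It is a simplified stand-in for an incremental merge, not dbt's actual implementation, and the function name, column values, and sample rows are all hypothetical. It shows a detector whose metadata moves it from lane 1 to lane 2: merging on station + lane inserts a second row for the same detector and date, while merging on detector ID updates the existing row.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def merge_incremental(
    target: List[Row], new_rows: Iterable[Row], unique_key: Tuple[str, ...]
) -> List[Row]:
    """Upsert new_rows into target, matching rows on the unique_key columns.

    Simplified model of a merge: a matching key overwrites the existing row,
    a non-matching key is inserted as a new row.
    """
    merged = {tuple(row[col] for col in unique_key): row for row in target}
    for row in new_rows:
        merged[tuple(row[col] for col in unique_key)] = row
    return list(merged.values())


# Hypothetical data: detector D1 was recorded in lane 1, then a metadata
# update places it in lane 2 for the same sample date.
existing = [
    {"detector_id": "D1", "station_id": "S1", "lane": 1,
     "sample_date": "2024-01-01", "volume": 100},
]
incoming = [
    {"detector_id": "D1", "station_id": "S1", "lane": 2,
     "sample_date": "2024-01-01", "volume": 105},
]

# Key on station + lane: no match, so the detector ends up with two rows.
by_lane = merge_incremental(existing, incoming, ("station_id", "lane", "sample_date"))

# Key on detector: the existing row is replaced, leaving one row.
by_detector = merge_incremental(existing, incoming, ("detector_id", "sample_date"))

print(len(by_lane), len(by_detector))
```

In this toy case the station + lane key leaves the table with two rows for `D1` on the same date, which is the kind of duplicate a detector-grain uniqueness test downstream would flag.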
Thanks for the excellent, thorough analysis @summer-mothwood! Regarding |
That all sounds correct to me @summer-mothwood. One note on your category 4: there is a bit of history here, where many of the basal tables start out at the station+lane level, and don't have the metadata included. We've gone back and forth a bit on when we should try to join them so that these extra data are attached (see #214 and linked). One of the reasons it's a bit tricky is that with the large tables the join is quite expensive, so it wasn't always obvious at what stage we absolutely needed to do it. All of this is to say, there may be good performance-related reasons to not join detector ID in for some of those large category 4 tables. Or maybe it's not a huge deal, but it's something to keep an eye out for.
This ticket is now done -- I split this work out into 3 PRs, mostly because several of these involved schema changes to models that had downstream dependencies, which made building and testing very annoying to do all together!
PR 1: #470
PR 2: #474
PR 3: #481
Many incremental models were developed before we had access to `detector_id`, so they were configured to use both `station_id` and `lane` to merge in new records. We now have some models using `detector_id` and some models using the prior method. This can lead to issues due to (potentially invalid) metadata updates. Wherever possible, these unique_key values should be updated to use `detector_id`. Part of this task is determining if this strategy will work correctly.

List of files under transform/models currently specifying any `unique_key`:
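On the question of determining whether a given key will work correctly, one rough check is to count how many rows share each candidate key value at the model's actual grain. The Python sketch below does this on in-memory rows; the helper name and sample data are hypothetical, and in practice the same check would more likely be run as a query or a uniqueness test against the warehouse.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def duplicate_keys(rows: Iterable[Row], unique_key: Tuple[str, ...]) -> Dict[tuple, int]:
    """Return each key value that appears on more than one row, with its count."""
    counts = Counter(tuple(row[col] for col in unique_key) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical detector-grain rows: two detectors at the same station.
sample_rows = [
    {"detector_id": "D1", "station_id": "S1", "lane": 1, "sample_date": "2024-01-01"},
    {"detector_id": "D2", "station_id": "S1", "lane": 2, "sample_date": "2024-01-01"},
]

# A station-level key is too coarse for detector-grain data: both rows collide.
station_level = duplicate_keys(sample_rows, ("station_id", "sample_date"))

# A detector-level key matches the grain: no collisions.
detector_level = duplicate_keys(sample_rows, ("detector_id", "sample_date"))

print(station_level, detector_level)
```

An empty result means the candidate key is unique for the rows checked; it does not prove the key stays unique after future metadata updates.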