-
Notifications
You must be signed in to change notification settings - Fork 3
Integrate county level data (WIP) #52
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking super great, what an impressive accomplishment to integrate all this on your own. Exciting to see all this. I left lots of comments throughout, but that's only because you added such a large volume of functionality. Looking great and I'll probably be a quick thumb on a second pass!
@@ -0,0 +1,383 @@ | |||
,locationID,slug,name,level,city,county,state,country,lat,long,population,aggregate,tz,cases,deaths,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love some good sample test data! So awesome. Small nit: should we create a new directory for this called test_fixtures
or something like that? We have some "real" / production data in /data
so it might be good to keep that clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, for the two new files that are in /covid
, we could maybe move those into the same folder. It looks like you haven't used them yet in tests but that would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reminds me -- we don't ever run these tests, haha. Just created #53 to track it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes. Tests in a dark room haha.
Agreed on fixtures folder
The csvs in /covid are an accident I think haha, but will move those too!
COUNTY_POSITIVITY_3DRA_FIELD = "COVID+ RATE (3DRA)" | ||
COUNTY_POSITIVITY_COLOR_FIELD = "COVID+ COLOR" | ||
_COUNTY_NUM_LAGS = 14 | ||
_, _COUNTY_NEW_CASES_LAG_FIELDS = generate_lag_column_name_formatter_and_column_names( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, there might be a better way to do this -- I wrote this but it still feels a little hacky to me, so I'm open to all other suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I adopted the "unless I'm refactoring everything or it's necessary, don't change pattern" mindset. Maybe we can track a few issues on refactors we know we should do?
covid/transform.py
Outdated
_GREEN = "Green" | ||
_YELLOW = "Yellow" | ||
_RED = "Red" | ||
_DEEP_SHIT = "Dark Red" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as fun as that is, maybe we do _DARK_RED
just since this is open source (and about a tragic topic)? But also I would be fine leaving it as is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good to change this before it's merged
_RED = "Red" | ||
_DEEP_SHIT = "Dark Red" | ||
|
||
# Define the upper bounds for each color for the new cases per million metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
|
||
# NaN rows where the 3DRA is NaN. | ||
# Note: this handles a common county-level case where a county simply does not have any testing data. | ||
county_df.loc[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised fit_and_predict_cubic_spline_in_r
doesn't automatically return NaNs
if its input data is NaN
-- we could make that part of the function? Or this works just fine!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It breaks if you have all-NaN I think. Like the fitting portion was breaking on me. I can investigate further if we want!
|
||
date_to_lookup = date_to_lookup - lag_timedelta | ||
|
||
lags_df = lags_df.reset_index() | ||
return lags_df | ||
|
||
|
||
def calculate_state_summary(transformed_df, columns): | ||
def compute_lagged_frame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you said this logic is definitely complex but I'm sure it's way faster than the terrible for loops I had running! Should we potentially replace the other lagging parts of transform
with this to also speed those up? (If we want to do that, we could potentially implement the text fixtures first so that we can tell if they worked successfully without changing the underlying data.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think the real improvement would be to apply transformations across states in parallel, which is maybe a larger refactor I didn't want to do here. I'm happy to add it in or just plug this function in on a state-by-state level
@@ -180,3 +296,13 @@ def calculate_consecutive_boolean_series(boolean_series): | |||
) | |||
|
|||
return consecutive_true_series, consecutive_false_series | |||
|
|||
|
|||
def get_color_series_from_range(series, color_range_dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
covidtracking_df = extract_covidtracking_historical_data() | ||
cdc_ili_df = extract_cdc_ili_data() | ||
covidatlas_df = extract_covid_atlas_data() | ||
# covidtracking_df = extract_covidtracking_historical_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these intentinoally commented out, or was that just to speed up local development of your new feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for local dev!
@@ -51,7 +55,7 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True): | |||
debugging of data processing | |||
|
|||
""" | |||
print("Starting to ETL...") | |||
logger.info("Starting to ETL...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hell yea! lol.
We have a bunch of other prints
we could remove as well in this PR or future work...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep for sure. I'll see if it's an easy copy -replace
main.py
Outdated
@@ -86,19 +90,31 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True): | |||
# credentials=credentials, | |||
# ) | |||
|
|||
covidtracking_df = extract_covidtracking_historical_data() | |||
cdc_ili_df = extract_cdc_ili_data() | |||
covidatlas_df = extract_covid_atlas_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't think we'll ever need to mix county-level and state-level data, I'm wondering if we want to actually put this in a separate function -- and maybe the function is invoked separately by main()
so it's easy to run one command that updates all data, but there's a bit cleaner separation of concerns.
I feel like extract_transform_and_load_covid_data
has gotten to be a beast and now might be a good time to refactor a bit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed it would be better, but the status quo here isn't much different. We don't do a lot where we merge all the state data into one state object and then do additional things with that. As far as main
is concerned, county-data is similar to state-level beds data. Happy to refactor, but perhaps we could save some of these for some "internal code cleanup" tasks once the county-level functionality is squared away
Closes #43.
Some caveats here:
Ok that's all I'll preface with haha, please rip to shreds!