Skip to content
This repository has been archived by the owner on Jan 26, 2021. It is now read-only.

Integrate county level data (WIP) #52

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

patricksheehan
Copy link
Contributor

Closes #43.

Some caveats here:

  • We pull this data from COVID Atlas (new connector)
  • Since there are thousands of counties, I could not use some of our existing utilities (was taking > 10min to transform), so there's a new utility to do the lagging
  • Many counties are missing testing data, so there's some handling of this to avoid 0-division and appropriately NaN data that could potentially be improved
  • There's an issue with pandas groupby-rolling with multi indexes, so that call may look inefficient/weird, but I swear it's quick :)
  • This is (I think) the first sheet where we're making all the data wrapper columns directly, so even though some of the names are similar, they are a bit different format wise, so I've made new constants for them

Ok that's all I'll preface with haha, please rip to shreds!

Copy link
Contributor

@lucasmbrown lucasmbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking super great, what an impressive accomplishment to integrate all this on your own. Exciting to see all this. I left lots of comments throughout, but that's only because you added such a large volume of functionality. Looking great and I'll probably be a quick thumb on a second pass!

@@ -0,0 +1,383 @@
,locationID,slug,name,level,city,county,state,country,lat,long,population,aggregate,tz,cases,deaths,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love some good sample test data! So awesome. Small nit: should we create a new directory for this called test_fixtures or something like that? We have some "real" / production data in /data so it might be good to keep that clean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for the two new files that are in /covid, we could maybe move those into the same folder. It looks like you haven't used them yet in tests but that would be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me -- we don't ever run these tests, haha. Just created #53 to track it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Tests in a dark room haha.

Agreed on fixtures folder

The csvs in /covid are an accident I think haha, but will move those too!

COUNTY_POSITIVITY_3DRA_FIELD = "COVID+ RATE (3DRA)"
COUNTY_POSITIVITY_COLOR_FIELD = "COVID+ COLOR"
_COUNTY_NUM_LAGS = 14
_, _COUNTY_NEW_CASES_LAG_FIELDS = generate_lag_column_name_formatter_and_column_names(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, there might be a better way to do this -- I wrote this but it still feels a little hacky to me, so I'm open to all other suggestions!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I adopted the "unless I'm refactoring everything or it's necessary, don't change pattern" mindset. Maybe we can track a few issues on refactors we know we should do?

_GREEN = "Green"
_YELLOW = "Yellow"
_RED = "Red"
_DEEP_SHIT = "Dark Red"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as fun as that is, maybe we do _DARK_RED just since this is open source (and about a tragic topic)? But also I would be fine leaving it as is!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good to change this before it's merged

_RED = "Red"
_DEEP_SHIT = "Dark Red"

# Define the upper bounds for each color for the new cases per million metric.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!


# NaN rows where the 3DRA is NaN.
# Note: this handles a common county-level case where a county simply does not have any testing data.
county_df.loc[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised fit_and_predict_cubic_spline_in_r doesn't automatically return NaNs if its input data is NaN -- we could make that part of the function? Or this works just fine!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It breaks if you have all-NaN I think. Like the fitting portion was breaking on me. I can investigate further if we want!


date_to_lookup = date_to_lookup - lag_timedelta

lags_df = lags_df.reset_index()
return lags_df


def calculate_state_summary(transformed_df, columns):
def compute_lagged_frame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said this logic is definitely complex but I'm sure it's way faster than the terrible for loops I had running! Should we potentially replace the other lagging parts of transform with this to also speed those up? (If we want to do that, we could potentially implement the text fixtures first so that we can tell if they worked successfully without changing the underlying data.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think the real improvement would be to apply transformations across states in parallel, which is maybe a larger refactor I didn't want to do here. I'm happy to add it in or just plug this function in on a state-by-state level

@@ -180,3 +296,13 @@ def calculate_consecutive_boolean_series(boolean_series):
)

return consecutive_true_series, consecutive_false_series


def get_color_series_from_range(series, color_range_dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

covidtracking_df = extract_covidtracking_historical_data()
cdc_ili_df = extract_cdc_ili_data()
covidatlas_df = extract_covid_atlas_data()
# covidtracking_df = extract_covidtracking_historical_data()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these intentinoally commented out, or was that just to speed up local development of your new feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for local dev!

@@ -51,7 +55,7 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True):
debugging of data processing

"""
print("Starting to ETL...")
logger.info("Starting to ETL...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hell yea! lol.

We have a bunch of other prints we could remove as well in this PR or future work...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep for sure. I'll see if it's an easy copy -replace

main.py Outdated
@@ -86,19 +90,31 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True):
# credentials=credentials,
# )

covidtracking_df = extract_covidtracking_historical_data()
cdc_ili_df = extract_cdc_ili_data()
covidatlas_df = extract_covid_atlas_data()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't think we'll ever need to mix county-level and state-level data, I'm wondering if we want to actually put this in a separate function -- and maybe the function is invoked separately by main() so it's easy to run one command that updates all data, but there's a bit cleaner separation of concerns.

I feel like extract_transform_and_load_covid_data has gotten to be a beast and now might be a good time to refactor a bit!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed it would be better, but the status quo here isn't much different. We don't do a lot where we merge all the state data into one state object and then do additional things with that. As far as main is concerned, county-data is similar to state-level beds data. Happy to refactor, but perhaps we could save some of these for some "internal code cleanup" tasks once the county-level functionality is squared away

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support county-level data
2 participants