Integrate county level data (WIP) #52

patricksheehan · 2020-08-10T18:21:38Z

Closes #43.

Some caveats here:

- We pull this data from COVID Atlas (new connector).
- Since there are thousands of counties, I could not use some of our existing utilities (the transform was taking > 10 min), so there's a new utility to do the lagging.
- Many counties are missing testing data, so there's some handling to avoid division by zero and to NaN out the affected data appropriately; this could potentially be improved.
- There's an issue with pandas groupby-rolling with multi-indexes, so that call may look inefficient/weird, but I swear it's quick :)
- This is (I think) the first sheet where we're making all the data wrapper columns directly, so even though some of the names are similar, their format is a bit different, so I've made new constants for them.
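
For readers following the caveats about missing testing data and groupby-rolling, here is a minimal sketch of one way to compute a per-county rolling positivity rate without dividing by zero. It is not the code in this PR: the function name and the column names (`county`, `date`, `new_cases`, `new_tests`) are assumptions for illustration, and it uses `groupby(...).transform(...)` simply because that keeps the original row index.

```python
import numpy as np
import pandas as pd


def positivity_rolling_mean(df, window=3):
    """Per-county rolling mean of new_cases / new_tests.

    Hypothetical helper: column names are assumptions, not the PR's schema.
    Rows with zero or missing tests produce NaN instead of inf or an error.
    """
    df = df.sort_values(["county", "date"]).copy()
    # Treat 0 tests as missing so the division yields NaN rather than inf.
    tests = df["new_tests"].replace(0, np.nan)
    df["positivity"] = df["new_cases"] / tests
    # transform() returns a result aligned to the original row index, so no
    # multi-index has to be unpacked afterwards.
    df["positivity_3dra"] = df.groupby("county")["positivity"].transform(
        lambda s: s.rolling(window, min_periods=window).mean()
    )
    return df
```

With `min_periods=window`, a county that has no testing data at all ends up with an all-NaN rolling column, which downstream steps can then detect and skip.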

Ok that's all I'll preface with haha, please rip to shreds!

wip wip

lucasmbrown

Overall looking super great, what an impressive accomplishment to integrate all this on your own. Exciting to see all this. I left lots of comments throughout, but that's only because you added such a large volume of functionality. Looking great and I'll probably be a quick thumb on a second pass!

lucasmbrown · 2020-08-12T00:43:29Z

covid/data/covidatlas_example_subset.csv

@@ -0,0 +1,383 @@
+,locationID,slug,name,level,city,county,state,country,lat,long,population,aggregate,tz,cases,deaths,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,date


Love some good sample test data! So awesome. Small nit: should we create a new directory for this called test_fixtures or something like that? We have some "real" / production data in /data so it might be good to keep that clean.

Also, for the two new files that are in /covid, we could maybe move those into the same folder. It looks like you haven't used them yet in tests but that would be great.

This reminds me -- we don't ever run these tests, haha. Just created #53 to track it.

Ah yes. Tests in a dark room haha.

Agreed on fixtures folder

The csvs in /covid are an accident I think haha, but will move those too!

lucasmbrown · 2020-08-12T00:50:23Z

covid/transform.py

+COUNTY_POSITIVITY_3DRA_FIELD = "COVID+ RATE (3DRA)"
+COUNTY_POSITIVITY_COLOR_FIELD = "COVID+ COLOR"
+_COUNTY_NUM_LAGS = 14
+_, _COUNTY_NEW_CASES_LAG_FIELDS = generate_lag_column_name_formatter_and_column_names(


FYI, there might be a better way to do this -- I wrote this but it still feels a little hacky to me, so I'm open to all other suggestions!

Yeah I think I adopted the "unless I'm refactoring everything or it's necessary, don't change pattern" mindset. Maybe we can track a few issues on refactors we know we should do?

lucasmbrown · 2020-08-12T00:50:39Z

covid/transform.py

+_GREEN = "Green"
+_YELLOW = "Yellow"
+_RED = "Red"
+_DEEP_SHIT = "Dark Red"


as fun as that is, maybe we do _DARK_RED just since this is open source (and about a tragic topic)? But also I would be fine leaving it as is!

Yeah good to change this before it's merged

lucasmbrown · 2020-08-12T00:51:58Z

covid/transform.py

+_RED = "Red"
+_DEEP_SHIT = "Dark Red"
+
+# Define the upper bounds for each color for the new cases per million metric.


lucasmbrown · 2020-08-12T00:57:29Z

covid/transform.py

+
+        # NaN rows where the 3DRA is NaN.
+        # Note: this handles a common county-level case where a county simply does not have any testing data.
+        county_df.loc[


I'm surprised fit_and_predict_cubic_spline_in_r doesn't automatically return NaNs if its input data is NaN -- we could make that part of the function? Or this works just fine!

It breaks if you have all-NaN I think. Like the fitting portion was breaking on me. I can investigate further if we want!
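
One way to fold the all-NaN case into the fitting step, as suggested above, is a small guard around the fit. This is only a sketch: `fit_or_nan` and its `fit_fn` parameter are hypothetical names, and `fit_fn` stands in for whatever fitting routine is actually used (the real `fit_and_predict_cubic_spline_in_r` signature is not shown in this thread).

```python
import numpy as np
import pandas as pd


def fit_or_nan(series, fit_fn):
    """Return an all-NaN series when there is nothing to fit.

    Hypothetical wrapper: fit_fn is any callable that takes a Series and
    returns a Series of fitted values with the same index.
    """
    if series.notna().sum() == 0:
        # No observations at all (e.g. a county with no testing data):
        # skip the fit and propagate NaN.
        return pd.Series(np.nan, index=series.index, dtype=float)
    return fit_fn(series)
```

Whether the guard lives inside the fitting function or at the call site is a design choice; keeping it in one wrapper means every caller gets the same behaviour.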

lucasmbrown · 2020-08-12T01:06:09Z

covid/transform_utils.py


            date_to_lookup = date_to_lookup - lag_timedelta

    lags_df = lags_df.reset_index()
    return lags_df


-def calculate_state_summary(transformed_df, columns):
+def compute_lagged_frame(


As you said this logic is definitely complex but I'm sure it's way faster than the terrible for loops I had running! Should we potentially replace the other lagging parts of transform with this to also speed those up? (If we want to do that, we could potentially implement the test fixtures first so that we can tell if they worked successfully without changing the underlying data.)

yeah I think the real improvement would be to apply transformations across states in parallel, which is maybe a larger refactor I didn't want to do here. I'm happy to add it in or just plug this function in on a state-by-state level
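
To illustrate why a vectorized approach tends to beat per-row loops for lagging, here is a minimal sketch using `groupby(...).shift(...)`. It is not the `compute_lagged_frame` implementation from this PR (its body is not shown here); `add_lag_columns` and its parameters are hypothetical, and it assumes one row per group per date with no gaps, because `shift` moves by row rather than by calendar day.

```python
import pandas as pd


def add_lag_columns(df, group_col, value_col, num_lags, date_col="date"):
    """Add value_col_lag_1 .. value_col_lag_N columns, lagged within each group.

    Hypothetical helper; assumes contiguous daily rows per group.
    """
    df = df.sort_values([group_col, date_col]).copy()
    for lag in range(1, num_lags + 1):
        # shift() is applied per group, so a county never picks up another
        # county's values at its first rows; those rows become NaN.
        df[f"{value_col}_lag_{lag}"] = df.groupby(group_col)[value_col].shift(lag)
    return df
```

The same function could be called once on a frame holding every state or county, or once per state, which is the trade-off discussed in the replies above.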

lucasmbrown · 2020-08-12T01:07:07Z

covid/transform_utils.py

@@ -180,3 +296,13 @@ def calculate_consecutive_boolean_series(boolean_series):
    )

    return consecutive_true_series, consecutive_false_series
+
+
+def get_color_series_from_range(series, color_range_dict):
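
The body of `get_color_series_from_range` is not shown in this hunk. As a rough sketch of the "upper bound per color" idea from the constants above, a mapping like the following could be built on `pd.cut`. The function name, the dict layout (color to inclusive upper bound, in ascending order), and the thresholds in the usage note are all assumptions for illustration, not values from the PR.

```python
import numpy as np
import pandas as pd


def color_series_from_upper_bounds(series, upper_bounds):
    """Map each value to the first color whose upper bound it does not exceed.

    Hypothetical sketch: upper_bounds is a dict of color -> inclusive upper
    bound, listed in ascending order. NaN inputs stay NaN.
    """
    colors = list(upper_bounds.keys())
    # Bin edges: (-inf, b1], (b1, b2], ... one interval per color.
    bins = [-np.inf] + list(upper_bounds.values())
    return pd.cut(series, bins=bins, labels=colors, right=True).astype(object)
```

For example, with made-up bounds `{"Green": 10, "Yellow": 50, "Red": 100, "Dark Red": np.inf}`, a value of 75 falls in the "Red" bin and a missing value remains NaN.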


lucasmbrown · 2020-08-12T01:07:58Z

main.py

-    covidtracking_df = extract_covidtracking_historical_data()
-    cdc_ili_df = extract_cdc_ili_data()
+    covidatlas_df = extract_covid_atlas_data()
+    # covidtracking_df = extract_covidtracking_historical_data()


Are these intentionally commented out, or was that just to speed up local development of your new feature?

just for local dev!

lucasmbrown · 2020-08-12T01:08:33Z

main.py

@@ -51,7 +55,7 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True):
            debugging of data processing

    """
-    print("Starting to ETL...")
+    logger.info("Starting to ETL...")


hell yea! lol.

We have a bunch of other prints we could remove as well in this PR or future work...

yep for sure. I'll see if it's an easy find-and-replace
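
For the remaining `print` calls, the usual standard-library pattern is a module-level logger plus one-time configuration at the entry point. This is a generic sketch, not the project's actual setup: `configure_logging` and `run_etl` are hypothetical names.

```python
import logging

# One logger per module; the name shows up in the formatted output.
logger = logging.getLogger(__name__)


def configure_logging(level=logging.INFO):
    """Configure root logging once, typically from the script's entry point."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def run_etl():
    """Hypothetical stand-in for the ETL entry point."""
    logger.info("Starting to ETL...")
```

Unlike `print`, these messages carry a level and timestamp and can be silenced or redirected without touching the call sites.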

lucasmbrown · 2020-08-12T01:10:26Z

main.py

@@ -86,19 +90,31 @@ def extract_transform_and_load_covid_data(post_to_google_sheets=True):
    #     credentials=credentials,
    # )

-    covidtracking_df = extract_covidtracking_historical_data()
-    cdc_ili_df = extract_cdc_ili_data()
+    covidatlas_df = extract_covid_atlas_data()


If we don't think we'll ever need to mix county-level and state-level data, I'm wondering if we want to actually put this in a separate function -- and maybe the function is invoked separately by main() so it's easy to run one command that updates all data, but there's a bit cleaner separation of concerns.

I feel like extract_transform_and_load_covid_data has gotten to be a beast and now might be a good time to refactor a bit!

agreed it would be better, but the status quo here isn't much different. We don't do a lot where we merge all the state data into one state object and then do additional things with that. As far as main is concerned, county-data is similar to state-level beds data. Happy to refactor, but perhaps we could save some of these for some "internal code cleanup" tasks once the county-level functionality is squared away
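
The separation discussed here could look roughly like the sketch below, where each level of data has its own pipeline function and `main` just runs them in order. All names are hypothetical placeholders and the bodies are stubs; nothing here reflects how the project was eventually refactored.

```python
def run_state_pipeline():
    """Placeholder for state-level extract, transform, and load."""
    return "state"


def run_county_pipeline():
    """Placeholder for county-level extract, transform, and load."""
    return "county"


def main(pipelines=(run_state_pipeline, run_county_pipeline)):
    """Run every pipeline with one command, keeping each one independent."""
    return [pipeline() for pipeline in pipelines]
```

Passing the pipelines in as an argument also makes it easy to run only one of them during local development instead of commenting code out.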

patricksheehan added 5 commits August 2, 2020 21:35

wip

2dc9948

wip

b0a0a31

wip

bd1e25e

wip wip

wip

2cfb481

wip

171afb0

patricksheehan requested a review from lucasmbrown August 10, 2020 18:21

patricksheehan added 3 commits August 10, 2020 20:25

wip

7ff2bed

wip

1b511e8

wip

7019b5c

lucasmbrown suggested changes Aug 12, 2020


patricksheehan added 7 commits August 11, 2020 18:49

wip

958cb58

wip

1943bca

wip

0e521b9

wip

4b8e076

wip

e6cefcb

wip

458f5be

wip

87ce354
