-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Staging #5
base: master
Are you sure you want to change the base?
Staging #5
Conversation
updated from 2018 to 2019
2018 -> 2019 overcrowding vars
2018 -> 2019
2018 -> 2019
2018 -> 2019
2018 - > 2019
2018 - > 2019
2018 - > 2019
2018 - > 2019
2018 - > 2019
Update acs jobloss abby
Updates CHAS data to 2017
2018 -> 2019
1) reran code with updated variables 2) starting line 420, updated year from 2018 to 2019. replaced all instances of "tracts_18" with "tracts_19"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! I made a few style comments, There were a few errors I fixed via a commit to code-review
. Those were:
- A timeout I was having with the
download.file()
fxn, probably due to my slow internet conenction. I increased the default timeout and used a differentmethod=
argument insidedownload.file()
- There was a slight error in the
perc_cost_burdened_under_35k
calculations. Previously when the denominator was 0, the value was NA and we manually converted those NA back to 0. But our code to check whether the dneominator was zero was incomplete and missing a few terms. This didn't affect any rows in the 2018 data, but did affect 835 rows in the 2019 data. - I've also added a few data checks with the
assert()
function that I used for testing. I've kept those in for future updates. - I've also slightly modified the
generate_index
function which now requires that we explicitly download and provide the tracts data (which we were/are already doing).
And below are things that still need to be changed:
- As a result of the above changes and the new 2019 data, the total number of all water tracts in 2019 is now 320 instead of 319 so we'll want to change that in the technical appendix. And we'll need to update the comments in the code near line 338 in script 1
- As a result of the above changes and the new 2019 data,there are currently no tracts where we have to use the national level means (like we did with the 9 tracts in New Mexico last time). So we'll want to change that in the technical appendix and update the comments in the code near line 344 in script 1
- There are a couple of places where we could be smarter about centralizing variables and changing variable names so that in the future, we only need to update the
year
in a few places to rerun the code. These are optional changes, but would definitely help if we foresee running another update
Finally it would be good to compare final data files to ensure the scripts are reproducible. If someone has ran the full pipeline on their computer (after pulling my changes) wants to send me their final outputs via email, I can double check that our outputs are identical.
@@ -5,25 +5,25 @@ library(tigris) | |||
|
|||
options(tigris_use_cache=FALSE) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're planning on updating this in the future, I would suggest defining a global variable at the top like year_used=2019
and then substituting in year_used
everywhere you manually typed in 2019 in the script. Then in future years, you only have to change the years in one spot!
@@ -5,25 +5,25 @@ library(tigris) | |||
|
|||
options(tigris_use_cache=FALSE) | |||
|
|||
state_2018 = tigris::states(year = 2018, class = "sf") %>% | |||
state_2019 = tigris::states(year = 2019, class = "sf") %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an add on the above comment, you probably want to give generic variable names here instead of having the variable names dependent on the year. So names this states
, and the other objects us_counties
, and us_counties_cb
. Then you won't have to update these names every year
@@ -12,7 +12,7 @@ library(urbnthemes) | |||
set_urbn_defaults() | |||
|
|||
# Load variable metadata from ACS 5 yr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're planning on updating this in the future, I would suggest defining a global variable at the top like year_used=2019
and then substituting in year_used
everywhere you manually typed in 2019
in the script. Then in future years, you only have to change the years in one spot!
@@ -425,7 +425,7 @@ row_sum_weighted = function(df, weight_vec){ | |||
generate_index = function(df, | |||
indicator_weights_df, | |||
index_weights_vec, | |||
tracts_18 = NA, | |||
tracts_19 = NA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename this argument to something more generic like tract_list
so that you don't have to update the names of the variable with each update. That means changing it here in the function definition and everywhere it appears in the function
@@ -519,15 +520,15 @@ generate_index = function(df, | |||
|
|||
###---Create Indexes--------- | |||
|
|||
tracts_18_job_loss = | |||
tracts_19_job_loss = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a comment, I would note the date that you pulled the data from this URL since this URL is from the covid job loss tool that updates monthly. Otherwise, you might have problems replicating the exact data in later months and not understand why
@@ -519,15 +520,15 @@ generate_index = function(df, | |||
|
|||
###---Create Indexes--------- | |||
|
|||
tracts_18_job_loss = | |||
tracts_19_job_loss = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above about renaming to something more generic like tract_list
# income on rent | ||
cb_stats <- cb_vars %>% | ||
# select ACS table variables w/ attached GEOID | ||
select( | ||
# These are all the peolpe making under 35k (denominator) | ||
B25074_002E, B25074_011E, B25074_020E, | ||
# These are all the people making under 35k who pay more than 50% of thier income on rent (numerator) | ||
# These are all the people making under 35k who pay more than 50% of their income on rent (numerator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes, the variable definitions may change across years. Has someone explicitly checked that each of the variables referenced in this script have remained the same across 2018 and 2019? If not someone should definitely spend some time to do this. You have to compare the var_list
objects for each of the variables referenced in this script, and create two versions of the var_list
object, one for 2018 and one for 2019
@@ -260,14 +261,13 @@ fb_stats <- fb_vars %>% | |||
|
|||
# Income Indicator | |||
# Download in Zip file from HUD website, unzip and rename | |||
download.file("https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip", | |||
download.file("https://www.huduser.gov/portal/datasets/cp/2013thru2017-140-csv.zip", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The download.file()
fxn spits out a timeout error for me because the file seems to be big. So you should increase the timeout to a few minutes (instead of default 1 min). You can do this by including the following line before:
options(timeout=180)
I would also set method="curl"
instead of method="libcurl"
as I've found that to be faster. I will probably make a commit that makes both of these changes for you
First two changes were incorporated into the technical appendix, and final data checks were manually run and approved |
Confirmed with the team that someone has double checked the variable definitions across 2018 and 2019 and they have not changed. |
Confirmed with Nicole that these changes have been made to the technical appendix |
Upload all files to s3 which Comms uses for feature and data catalog refrences
- Made March versions of S3 objects publicly accessible on S3 console and referenced them explicitly. Now code will produce same output data no matter when it's run instead of pulling the latest job loss numbers
ERAP 2019 update