Staging #5

Open · wants to merge 28 commits into master
Conversation

abwilliams13

ERAP 2019 update

ndubois-urban and others added 15 commits March 9, 2021 16:35
updated from 2018 to 2019
2018 -> 2019 overcrowding vars
2018 -> 2019
1) reran code with updated variables
2) starting line 420, updated year from 2018 to 2019. replaced all instances of "tracts_18" with "tracts_19"
@abwilliams13 abwilliams13 requested a review from ajjitn March 11, 2021 17:13
@ajjitn ajjitn self-assigned this Mar 19, 2021
@ajjitn ajjitn marked this pull request as ready for review March 19, 2021 14:24
@ajjitn (Contributor) left a comment:

Overall looks good! I made a few style comments. There were also a few errors I fixed via a commit to code-review:

  • A timeout I was hitting with the download.file() fxn, probably due to my slow internet connection. I increased the default timeout and used a different method= argument inside download.file().
  • A slight error in the perc_cost_burdened_under_35k calculation. Previously, when the denominator was 0 the value was NA, and we manually converted those NAs back to 0. But our code to check whether the denominator was zero was incomplete and missing a few terms. This didn't affect any rows in the 2018 data, but did affect 835 rows in the 2019 data.
  • I've also added a few data checks with the assert() function that I used for testing, and kept them in for future updates (a rough sketch of the denominator fix and one such check follows this list).
  • I've also slightly modified the generate_index function, which now requires that we explicitly download and provide the tracts data (which we were/are already doing).
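A rough sketch of what the denominator fix and an assert()-style check could look like, assuming assertr is the package behind assert(); only the three denominator terms visible in the diff are shown, and num_under_35k_over_50pct is a placeholder, not the script's actual column name:

```r
# Hypothetical sketch only; column names other than the B25074 terms and
# perc_cost_burdened_under_35k are placeholders, not the script's actual names.
library(dplyr)
library(assertr)

cb_stats <- cb_stats %>%
  mutate(
    # sum every denominator term, so the zero check can't miss any of them
    denom_under_35k = B25074_002E + B25074_011E + B25074_020E,
    perc_cost_burdened_under_35k = if_else(
      denom_under_35k == 0,
      0,  # keep the old convention of reporting 0 instead of NA
      num_under_35k_over_50pct / denom_under_35k
    )
  ) %>%
  # fail loudly if the share ever falls outside [0, 1]
  assert(within_bounds(0, 1), perc_cost_burdened_under_35k)
```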

And below are things that still need to be changed:

  • As a result of the above changes and the new 2019 data, the total number of all-water tracts in 2019 is now 320 instead of 319, so we'll want to change that in the technical appendix. We'll also need to update the comments in the code near line 338 in script 1.
  • As a result of the above changes and the new 2019 data, there are currently no tracts where we have to use the national-level means (like we did with the 9 tracts in New Mexico last time). So we'll want to change that in the technical appendix and update the comments in the code near line 344 in script 1.
  • There are a couple of places where we could be smarter about centralizing variables and choosing variable names so that, in the future, we only need to update the year in a few places to rerun the code. These are optional changes, but they would definitely help if we foresee running another update.

Finally, it would be good to compare final data files to ensure the scripts are reproducible. If someone who has run the full pipeline on their computer (after pulling my changes) wants to send me their final outputs via email, I can double-check that our outputs are identical.
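A minimal way to do that comparison, assuming CSV outputs (the file paths below are placeholders, not the pipeline's actual file names):

```r
# Hypothetical sketch: compare two runs of the final outputs by checksum.
# File paths are illustrative; substitute the pipeline's actual output files.
library(tools)

my_outputs    <- md5sum(c("data/final/housing_index_tract.csv",
                          "data/final/housing_index_county.csv"))
their_outputs <- md5sum(c("emailed/housing_index_tract.csv",
                          "emailed/housing_index_county.csv"))

# TRUE only if every file matches byte for byte
all(unname(my_outputs) == unname(their_outputs))
```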

@@ -5,25 +5,25 @@ library(tigris)

options(tigris_use_cache=FALSE)

ajjitn (Contributor) commented:

If you're planning on updating this in the future, I would suggest defining a global variable at the top like year_used=2019 and then substituting in year_used everywhere you manually typed in 2019 in the script. Then in future years, you only have to change the years in one spot!
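For example (a sketch only; year_used is the suggested name, not something already in the script, and the tigris call is copied from the diff below):

```r
# Hypothetical sketch: define the data year once at the top of the script...
year_used <- 2019

# ...and substitute it wherever 2019 was typed manually, e.g.
state_2019 <- tigris::states(year = year_used, class = "sf")
```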

@@ -5,25 +5,25 @@ library(tigris)

options(tigris_use_cache=FALSE)

- state_2018 = tigris::states(year = 2018, class = "sf") %>%
+ state_2019 = tigris::states(year = 2019, class = "sf") %>%
ajjitn (Contributor) commented:

As an add-on to the above comment, you probably want to give these generic variable names instead of having the names depend on the year. So name this one states, and the other objects us_counties and us_counties_cb. Then you won't have to update these names every year.
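Something like this sketch, building on the hypothetical year_used variable above (the counties() arguments are assumptions about how the script calls tigris):

```r
# Hypothetical sketch: year-agnostic object names, so only year_used changes next update
year_used <- 2019

states         <- tigris::states(year = year_used, class = "sf")
us_counties    <- tigris::counties(year = year_used, class = "sf")
us_counties_cb <- tigris::counties(year = year_used, cb = TRUE, class = "sf")
```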

@@ -12,7 +12,7 @@ library(urbnthemes)
set_urbn_defaults()

# Load variable metadata from ACS 5 yr
ajjitn (Contributor) commented:

If you're planning on updating this in the future, I would suggest defining a global variable at the top like year_used=2019 and then substituting in year_used everywhere you manually typed in 2019 in the script. Then in future years, you only have to change the years in one spot!

@@ -425,7 +425,7 @@ row_sum_weighted = function(df, weight_vec){
generate_index = function(df,
indicator_weights_df,
index_weights_vec,
- tracts_18 = NA,
+ tracts_19 = NA,
ajjitn (Contributor) commented:

I would rename this argument to something more generic like tract_list so that you don't have to update the variable name with each update. That means changing it here in the function definition and everywhere it appears inside the function.
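Roughly (a sketch of the signature only; the remaining arguments and the function body are omitted):

```r
# Hypothetical sketch of the renamed argument; body and other arguments elided
generate_index <- function(df,
                           indicator_weights_df,
                           index_weights_vec,
                           tract_list = NA) {
  # ...use tract_list everywhere tracts_19 currently appears...
  invisible(df)
}
```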

@@ -519,15 +520,15 @@ generate_index = function(df,

###---Create Indexes---------

- tracts_18_job_loss =
+ tracts_19_job_loss =
ajjitn (Contributor) commented:

In a comment, I would note the date that you pulled the data from this URL, since this URL points to the COVID job loss tool, which updates monthly. Otherwise, you might have trouble replicating the exact data in later months and not understand why.
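For example, a short provenance note above the download (the date shown is purely illustrative):

```r
# Job loss data pulled from the COVID-19 job loss tool on 2021-03-19 (illustrative date).
# The tool refreshes monthly, so downloads after that date may not reproduce these outputs.
```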

@@ -519,15 +520,15 @@ generate_index = function(df,

###---Create Indexes---------

- tracts_18_job_loss =
+ tracts_19_job_loss =
ajjitn (Contributor) commented:

Same comment as above about renaming to something more generic like tract_list

# income on rent
cb_stats <- cb_vars %>%
# select ACS table variables w/ attached GEOID
select(
# These are all the peolpe making under 35k (denominator)
B25074_002E, B25074_011E, B25074_020E,
- # These are all the people making under 35k who pay more than 50% of thier income on rent (numerator)
+ # These are all the people making under 35k who pay more than 50% of their income on rent (numerator)
ajjitn (Contributor) commented:

Sometimes, variable definitions change across years. Has someone explicitly checked that each of the variables referenced in this script has remained the same across 2018 and 2019? If not, someone should definitely spend some time doing this: create two versions of the var_list object, one for 2018 and one for 2019, and compare the entries for each variable referenced in this script.
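A sketch of that comparison, assuming the variable metadata comes from tidycensus::load_variables() (the vars_used vector lists only the names visible in this hunk, with the E suffix dropped to match load_variables() naming):

```r
# Hypothetical sketch: compare ACS variable labels between the 2018 and 2019 5-year releases
library(tidycensus)
library(dplyr)

vars_used <- c("B25074_002", "B25074_011", "B25074_020")  # only the names visible in this hunk

var_list_2018 <- load_variables(2018, "acs5", cache = TRUE)
var_list_2019 <- load_variables(2019, "acs5", cache = TRUE)

label_check <- var_list_2018 %>%
  filter(name %in% vars_used) %>%
  inner_join(filter(var_list_2019, name %in% vars_used),
             by = "name", suffix = c("_2018", "_2019")) %>%
  filter(label_2018 != label_2019 | concept_2018 != concept_2019)

# zero rows means the definitions match across years
label_check
```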

@@ -260,14 +261,13 @@ fb_stats <- fb_vars %>%

# Income Indicator
# Download in Zip file from HUD website, unzip and rename
download.file("https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip",
download.file("https://www.huduser.gov/portal/datasets/cp/2013thru2017-140-csv.zip",
ajjitn (Contributor) commented:

The download.file() fxn spits out a timeout error for me because the file seems to be big, so you should increase the timeout to a few minutes (instead of the default 1 min). You can do this by including the following line before the call:

options(timeout=180)

I would also set method="curl" instead of method="libcurl", as I've found that to be faster. I will probably make a commit that makes both of these changes for you.
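Putting both suggestions together (the destfile name is a placeholder; the URL is the one from the diff above):

```r
# Hypothetical sketch of the suggested download settings
options(timeout = 180)  # raise the default 60-second timeout for this large file

download.file(
  "https://www.huduser.gov/portal/datasets/cp/2013thru2017-140-csv.zip",
  destfile = "chas_2013_2017.zip",  # placeholder name; use the script's actual destfile
  method = "curl"
)
```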

@ajjitn (Contributor) commented Mar 24, 2021:

  • As a result of the above changes and the new 2019 data, the total number of all-water tracts in 2019 is now 320 instead of 319, so we'll want to change that in the technical appendix. We'll also need to update the comments in the code near line 338 in script 1.
  • As a result of the above changes and the new 2019 data, there are currently no tracts where we have to use the national-level means (like we did with the 9 tracts in New Mexico last time). So we'll want to change that in the technical appendix and update the comments in the code near line 344 in script 1.
    Finally, it would be good to compare final data files to ensure the scripts are reproducible. If someone who has run the full pipeline on their computer (after pulling my changes) wants to send me their final outputs via email, I can double-check that our outputs are identical.

The first two changes were incorporated into the technical appendix, and the final data checks were manually run and approved.

@ajjitn (Contributor) commented Apr 4, 2021:

Sometimes, variable definitions change across years. Has someone explicitly checked that each of the variables referenced in this script has remained the same across 2018 and 2019? If not, someone should definitely spend some time doing this: create two versions of the var_list object, one for 2018 and one for 2019, and compare the entries for each variable referenced in this script.

Confirmed with the team that someone has double checked the variable definitions across 2018 and 2019 and they have not changed.

@ajjitn (Contributor) commented Apr 4, 2021:

  • As a result of the above changes and the new 2019 data, the total number of all-water tracts in 2019 is now 320 instead of 319, so we'll want to change that in the technical appendix. We'll also need to update the comments in the code near line 338 in script 1.
  • As a result of the above changes and the new 2019 data, there are currently no tracts where we have to use the national-level means (like we did with the 9 tracts in New Mexico last time). So we'll want to change that in the technical appendix and update the comments in the code near line 344 in script 1.

Confirmed with Nicole that these changes have been made to the technical appendix.

ajjitn added 5 commits May 14, 2021 10:24
Upload all files to S3, which Comms uses for feature and data catalog references
- Made March versions of S3 objects publicly accessible on the S3 console and referenced them explicitly. Now the code will produce the same output data no matter when it's run, instead of pulling the latest job loss numbers