Data treatment/tidying #7

Open
ajstewartlang opened this issue Apr 18, 2019 · 1 comment
Comments

@ajstewartlang
Contributor

ajstewartlang commented Apr 18, 2019

It looks like we need to do a fair bit of data tidying at the start. For example, for the variables comp_int_bed_16 and comp_noint_bed_16 the only values in the dataset are "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting real missing data:

> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA "Yes"

> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA "Yes"

So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:

# Treat NA as "No" in both variables, then convert to 2-level factors
maps_synthetic_data$comp_int_bed_16[is.na(maps_synthetic_data$comp_int_bed_16)] <- "No"
maps_synthetic_data$comp_noint_bed_16[is.na(maps_synthetic_data$comp_noint_bed_16)] <- "No"
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16, levels = c("No", "Yes"))
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16, levels = c("No", "Yes"))
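The same two steps could also be done in one pass with dplyr. This is only a sketch on a made-up toy data frame (toy is hypothetical, not the real maps_synthetic_data), and it assumes dplyr 1.0.0 or later for across():

```r
library(dplyr)

# Hypothetical toy data standing in for maps_synthetic_data
toy <- data.frame(
  comp_int_bed_16   = c(NA, "Yes", NA),
  comp_noint_bed_16 = c("Yes", NA, NA),
  stringsAsFactors  = FALSE
)

# Replace NA with "No", then convert each column to a 2-level factor
toy_tidy <- toy %>%
  mutate(across(
    c(comp_int_bed_16, comp_noint_bed_16),
    ~ factor(coalesce(.x, "No"), levels = c("No", "Yes"))
  ))

str(toy_tidy)
```

Setting levels explicitly keeps "No" as the reference level even if one of the columns happened to contain only "Yes".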

For the anxiety measure at age 15 variable (anx_band_15), it looks like the NAs correspond to 0 rather than missing data:

> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA "~3%" "~15%" "~50%" "<0.1%"

So we might want to replace the NAs there with zeros.

# anx_band_15 is a character column, so the zero is stored as the string "0"
maps_synthetic_data$anx_band_15[is.na(maps_synthetic_data$anx_band_15)] <- "0"

But what about the other values? Should we make this an ordered factor or treat as numerical? If treating as numerical we could recode as:

# as.numeric rather than as.integer, because as.integer(".5") truncates to 0
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(anx_band_15 = as.numeric(recode(anx_band_15,
    "~0.5%" = ".5", "~3%" = "3", "~15%" = "15", "~50%" = "50", "<0.1%" = "0")))

Although that forces the <0.1% values to be 0. Would an ordered factor be better, do you think? @wjchulme, @jspickering, @OliJimbo
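For comparison, a numeric recoding can also be sketched in base R with a named lookup vector. The toy vector below just reuses the six values printed by unique() above, and mapping "<0.1%" to 0.1 (the upper bound of the band) is an assumption made here so that it stays distinct from the zeros, not something the data dictionary confirms:

```r
# Hypothetical lookup from band label to a numeric percentage
band_to_percent <- c("<0.1%" = 0.1, "~0.5%" = 0.5, "~3%" = 3,
                     "~15%" = 15, "~50%" = 50)

# Toy stand-in for maps_synthetic_data$anx_band_15
toy_bands <- c("~0.5%", NA, "~3%", "~15%", "~50%", "<0.1%")

# Look up each label; NA labels give NA, which is then set to 0
anx_numeric <- unname(band_to_percent[toy_bands])
anx_numeric[is.na(toy_bands)] <- 0

anx_numeric
# 0.5, 0, 3, 15, 50, 0.1
```

Note that any numeric version needs a double rather than an integer type, since converting 0.5 to integer truncates it to 0.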

@ajstewartlang
Contributor Author

Perhaps recoding as an ordered factor is better, as it seems to capture the difference between the discrete scores more faithfully:

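# Levels follow the order of the replacements; .ordered = TRUE makes the factor ordered
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(anx_band_15 = recode_factor(anx_band_15,
    "0" = "0", "<0.1%" = "0.1", "~0.5%" = ".5", "~3%" = "3", "~15%" = "15", "~50%" = "50",
    .ordered = TRUE))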
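As a rough illustration of what the ordered version buys us, here is a base R sketch on a toy vector (again just the six values from unique() above, with NA assumed to mean "0"). It keeps the original band labels as level names, which is a choice made for this example only:

```r
# Band labels from lowest to highest; "0" stands in for the former NAs
band_levels <- c("0", "<0.1%", "~0.5%", "~3%", "~15%", "~50%")

# Toy stand-in for maps_synthetic_data$anx_band_15
toy_bands <- c("~0.5%", NA, "~3%", "~15%", "~50%", "<0.1%")
toy_bands[is.na(toy_bands)] <- "0"

anx_ordered <- factor(toy_bands, levels = band_levels, ordered = TRUE)

# Ordered factors support rank comparisons without implying numeric distances
anx_ordered > "~3%"
# FALSE FALSE FALSE TRUE TRUE FALSE
```

This keeps "<0.1%" and "0" as separate levels, so nothing is forced to zero, while models that understand ordered factors can still use the ranking.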
