It looks like we need to do a fair bit of data tidying at the start. For example, the variables comp_int_bed_16 and comp_noint_bed_16 only take the values "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting genuinely missing data:
> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA "Yes"
> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA "Yes"
So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:
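# Index the column directly: the df[rows, ]$col form can error on a data.frame when no rows match
maps_synthetic_data$comp_int_bed_16[is.na(maps_synthetic_data$comp_int_bed_16)] <- "No"
maps_synthetic_data$comp_noint_bed_16[is.na(maps_synthetic_data$comp_noint_bed_16)] <- "No"
# Set the levels explicitly so "No" is always the reference level
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16, levels = c("No", "Yes"))
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16, levels = c("No", "Yes"))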
For the anxiety measure at age 15 variable, it looks like the NAs correspond to 0 rather than missing data:
> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA "~3%" "~15%" "~50%" "<0.1%"
So we might want to replace the NAs there with zeros.
maps_synthetic_data$anx_band_15[is.na(maps_synthetic_data$anx_band_15)] <- "0"
But what about the other values? Should we make this an ordered factor or treat as numerical? If treating as numerical we could recode as:
# as.numeric rather than as.integer, which would truncate 0.5 to 0
maps_synthetic_data <- maps_synthetic_data %>% mutate(anx_band_15 = as.numeric(recode(anx_band_15, "~0.5%" = "0.5", "~3%" = "3", "~15%" = "15", "~50%" = "50", "<0.1%" = "0")))
Although that forces the <0.1% values to be 0. Do you think an ordered factor would be better? @wjchulme, @jspickering, @OliJimbo
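For comparison, here is a rough sketch of what the ordered-factor option might look like. It runs on a toy vector built from the unique() output above rather than on the real data, and both the level order and the treatment of NA as a lowest "0" band are assumptions that would need checking:

```r
# Toy stand-in for maps_synthetic_data$anx_band_15 (values from unique() above)
anx <- c("~0.5%", NA, "~3%", "~15%", "~50%", "<0.1%")

# Assumption: NA means no anxiety band, i.e. the lowest category
anx[is.na(anx)] <- "0"

# Assumed ordering, lowest to highest
band_levels <- c("0", "<0.1%", "~0.5%", "~3%", "~15%", "~50%")
anx_ord <- factor(anx, levels = band_levels, ordered = TRUE)

# Ordered factors support comparisons against a level name
at_least_3 <- anx_ord >= "~3%"

# Integer ranks (1 = lowest band) if a numeric score is still needed
anx_rank <- as.integer(anx_ord)
```

This keeps <0.1% as its own band instead of collapsing it into 0, at the cost of treating the gaps between bands as unknown rather than numeric.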