-
Notifications
You must be signed in to change notification settings - Fork 2
/
intro_data_prob_project.Rmd
301 lines (143 loc) · 10.9 KB
/
intro_data_prob_project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
---
title: "Exploring the BRFSS data"
output:
html_document:
fig_height: 4
highlight: pygments
theme: spacelab
---
## Setup
### Load packages
```{r load-packages, message = FALSE}
library(ggplot2)
library(dplyr)
library(corrplot)
```
### Load data
```{r load-data}
load("brfss2013.RData")
```
* * *
## Part 1: Data
The Behavioral Risk Factor Surveillance System (BRFSS) is the system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.
# Sample Collection
The samples were collected by land-line telephonic surveys and cell-phones telephonic surveys.
In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household.
In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
Disproportionate stratified sampling was used for land-line telephones. And the cellular phone sample is randomly generated from the combination od area code and prefix.
# Inference
If we look at the sample collection methods, we would conclude that this is an observational study. It uses stratified sampling method based on random digit dialing methods.
# Generabizability
Using this type of random digit dialing and Disproportionate stratified sampling, we can canclude that this study can be generalized to the population of all non-institutionalized adults of 18 years and above
# Causality
As this is an observational study, so we can not be sure about causality. We can only be sure about associations.
* * *
## Part 2: Research questions
**Research quesion 1:**
How does the sleeping time (duration) affect the general health. In other words, is there any association between variable genhlth: General Health and sleptim1: How Much Time Do You Sleep. This question is of an interest because generally, good sleeping habits might be associated with general good health.
**Research quesion 2:**
Generally, people with reasonable consumption of vegetables and fruits per day are healthier.
It would be interesting to know that what is the association between vegetables and fruits comsumption with the general health. In other words, association between frutsum variable and vegesum variable with the variable genhlth.
**Research quesion 3:**
Is there any association between bmi5cat: Computed Body Mass Index catefory and the consumption of vegetables and fruits
It would be interesting to know about how the BMI category is affected by vegetable and fruits intake.
* * *
## Part 3: Exploratory data analysis
**Research quesion 1 :**
The mean and STD of sleeping hours for different general health categories are computed. The following code shows the summary statistics of sleeping hours for four general health categories.
```{r}
df1 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Poor') %>% filter(sleptim1 !='NA') %>% summarise(mean_poorhealth_sleep = mean(sleptim1), sd_poorhealth_sleep = sd(sleptim1), n = n())
df1
df2 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Good') %>% filter(sleptim1 !='NA') %>% summarise(mean_goodhealth_sleep = mean(sleptim1), sd_goodhealth_sleep = sd(sleptim1), n = n())
df2
df3 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Very good') %>% filter(sleptim1 !='NA') %>% summarise(mean_vgoodhealth_sleep = mean(sleptim1), sd_vgoodhealth_sleep = sd(sleptim1), n = n())
df3
df4 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Excellent') %>% filter(sleptim1 !='NA') %>% summarise(mean_excellenthealth_sleep = mean(sleptim1), sd_excellenthealth_sleep = sd(sleptim1), n = n())
df4
```
```{r}
mean_sleep_vector <- c(df1$mean_poorhealth_sleep,df2$mean_goodhealth_sleep,df3$mean_vgoodhealth_sleep,df4$mean_excellenthealth_sleep)
mean_sleep_vector
sd_sleep_vector <- c(df1$sd_poorhealth_sleep,df2$sd_goodhealth_sleep,df3$sd_vgoodhealth_sleep,df4$sd_excellenthealth_sleep)
sd_sleep_vector
names <- c("poorhealth", "goodhealth", "vgoodhealth", "excellenthealth")
barplot(sd_sleep_vector, names.arg=names, cex.names=.9, ylab = 'STD in hrs sleep', main='STD in hours for various health categories', xlab="Genral health category")
lines(sd_sleep_vector, col='Red')
barplot(mean_sleep_vector, names.arg=names, cex.names=.9, ylab = 'mean of hrs sleep', main='Mean of hours sleep for various health categories', xlab="Genral health category")
lines(mean_sleep_vector, col='Red')
```
For excellent, good and very good health category, STD for sleeping hours is relatively smaller than the poor general health category.
Similarly, the mean of sleeping hours for excellent, good and very good health category is more than the mean sleeping hours of poor general health category people.
From the summary statistics and the plots, it is shown the generally poor health is associated with less sleep. Though, this difference is not huge, yet general health is slightly associated with the sleep.
People with around 7 hours per day sleep are more healtheir than the people with less than 7 hours per day sleep.
**Research quesion 2:**
Below, we have computed the mean consumption of fruits and vegetables for all four general health categories.
```{r}
df5 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Poor') %>% filter(X_vegesum !='NA') %>% summarise(mean_poorhealth_veg = mean(X_vegesum))
df5
df6 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Good') %>% filter(X_vegesum !='NA') %>% summarise(mean_goodhealth_veg = mean(X_vegesum))
df6
df7 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Very good') %>% filter(X_vegesum !='NA') %>% summarise(mean_vgoodhealth_veg = mean(X_vegesum))
df7
df8 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Excellent') %>% filter(X_vegesum !='NA') %>% summarise(mean_excellenthealth_veg = mean(X_vegesum))
df8
df9 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Poor') %>% filter(X_frutsum !='NA') %>% summarise(mean_poorhealth_fruit = mean(X_frutsum))
df9
df10 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Good') %>% filter(X_frutsum !='NA') %>% summarise(mean_goodhealth_fruit = mean(X_frutsum))
df10
df11 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Very good') %>% filter(X_frutsum !='NA') %>% summarise(mean_vgoodhealth_fruit = mean(X_frutsum))
df11
df12 <-brfss2013 %>% filter(genhlth != "NA") %>% filter(genhlth == 'Excellent') %>% filter(X_frutsum !='NA') %>% summarise(mean_excellenthealth_fruit = mean(X_frutsum))
df12
```
It is shown that people who consumes more vegetable and fruits are generally healthier.
We also plot the vegetable and fruit mean consumption for various general helath categories
```{r}
mean_veg_vector <- c(df5$mean_poorhealth_veg,df6$mean_goodhealth_veg,df7$mean_vgoodhealth_veg,df8$mean_excellenthealth_veg)
mean_veg_vector
names <- c("poorhealth", "goodhealth", "vgoodhealth", "excellenthealth")
barplot(mean_veg_vector, names.arg=names, cex.names=.9, ylab = 'mean of veg consumption', main='Mean of veg consumption for various health categories', xlab="Genral health category")
lines(mean_veg_vector, col='Red')
mean_fruit_vector <- c(df9$mean_poorhealth_fruit,df10$mean_goodhealth_fruit,df11$mean_vgoodhealth_fruit,df12$mean_excellenthealth_fruit)
mean_fruit_vector
names <- c("poorhealth", "goodhealth", "vgoodhealth", "excellenthealth")
barplot(mean_fruit_vector, names.arg=names, cex.names=.9, ylab = 'mean of fruits consumption', main='Mean of fruits consumption for various health categories', xlab="Genral health category")
lines(mean_fruit_vector, col='Red')
```
In both the above plots, we see a clear association of better health and more consumption of fruits and vegetables.
**Research quesion 3:**
Here we provide the mean veg and fruits consumption for various BMI categories.
```{r}
df13 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Underweight') %>% filter(X_vegesum !='NA') %>% summarise(mean_Underweight_veg = mean(X_vegesum))
df13
df14 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Normal weight') %>% filter(X_vegesum !='NA') %>% summarise(mean_NormalWeight_veg = mean(X_vegesum))
df14
df15 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Overweight') %>% filter(X_vegesum !='NA') %>% summarise(mean_Overweight_veg = mean(X_vegesum))
df15
df16 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Obese') %>% filter(X_vegesum !='NA') %>% summarise(mean_Obese_veg = mean(X_vegesum))
df16
df17 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Underweight') %>% filter(X_frutsum !='NA') %>% summarise(mean_Underweight_fruit = mean(X_frutsum))
df17
df18 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Normal weight') %>% filter(X_frutsum !='NA') %>% summarise(mean_NormalWeight_fruit = mean(X_frutsum))
df18
df19 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Overweight') %>% filter(X_frutsum !='NA') %>% summarise(mean_Overweight_fruit = mean(X_frutsum))
df19
df20 <-brfss2013 %>% filter(X_bmi5cat != "NA") %>% filter(X_bmi5cat == 'Obese') %>% filter(X_frutsum !='NA') %>% summarise(mean_Obese_fruit = mean(X_frutsum))
df20
```
The mean fuits and vegetable consumption for normal weight people is higher.
Now we provide the plotting to see any pattern (if there is).
```{r}
mean_veg_vector_bmi <- c(df13$mean_Underweight_veg,df14$mean_NormalWeight_veg,df15$mean_Overweight_veg,df16$mean_Obese_veg)
mean_veg_vector_bmi
names <- c("Underweight", "NormalWeight", "Overweight", "Obese")
barplot(mean_veg_vector_bmi, names.arg=names, cex.names=.9, ylab = 'mean of veg consumption', main='Mean of veg consumption for various BMI categories', xlab="BMI category")
lines(mean_veg_vector_bmi, col='Red')
mean_fruit_vector_bmi <- c(df17$mean_Underweight_fruit,df18$mean_NormalWeight_fruit,df19$mean_Overweight_fruit,df20$mean_Obese_fruit)
mean_fruit_vector_bmi
names <- c("Underweight", "NormalWeight", "Overweight", "Obese")
barplot(mean_fruit_vector_bmi, names.arg=names, cex.names=.9, ylab = 'mean of fruits consumption', main='Mean of fruits consumption for various BMI categories', xlab="BMI category")
lines(mean_fruit_vector_bmi, col='Red')
```
It can be seen that normal weight BMI category people consume relatively more fruits and vegetables. On the average, obese people consume less vegetables and fruits.
This relation of fruit and vegetable consumption with the BMI category is associational.