-
Notifications
You must be signed in to change notification settings - Fork 3
/
Group_7_Analysis.Rmd
335 lines (239 loc) · 12.5 KB
/
Group_7_Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
---
title: "Group_7_Analysis"
author: "Group 7"
date: "23/03/2022"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(kableExtra)
library(skimr)
library(ggcorrplot)
library(gridExtra)
library(countrycode)
library(sjPlot)
library(knitr)
library(car)
library(moderndive)
library(jtools)
```
# Exploratory Data Analysis
```{r data, echo= FALSE, eval = TRUE}
# import and process data
coffee <- read.csv('dataset7.csv')
```
```{r process, echo= FALSE, warning=FALSE}
coffee <- as_tibble(coffee)
coffee <- coffee %>%
rename(altitude = altitude_mean_meters) %>%
rename(defects = category_two_defects) %>%
rename(country = country_of_origin)
summary(coffee)
# removing the two data errors for altitude
coffee$altitude <- ifelse(coffee$altitude > 8000, NA, coffee$altitude)
#removing na obs
coffee <- coffee %>% na.omit()
```
Initially, summary statistics reveal a couple of erroneous altitude datapoints. (The highest two values are many times the altitude of the peak of Mount Everest!)
```{r}
# removing the two data errors for altitude
coffee$altitude <- ifelse(coffee$altitude > 8000, NA, coffee$altitude)
#removing na obs
coffee <- coffee %>% na.omit()
```
Table of summary statistics for continuous variables:
```{r cont_summaries, warning = FALSE}
cont_summ <- coffee %>%
select(aroma, flavor, acidity, altitude, defects) %>%
summarise_each(funs(mean = round(mean(., na.rm = TRUE), 2),
sd = round(sd(., na.rm = TRUE), 2),
min = min(., na.rm = TRUE),
q25 = quantile(., 0.25, na.rm = TRUE),
median = median(., na.rm = TRUE),
q75 = quantile(., 0.75, na.rm = TRUE),
max = max(., na.rm = TRUE))) %>%
gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, mean, sd, min, q25, median, q75, max) %>%
slice(3, 5, 1, 2, 4) %>%
kable(caption = 'Summary statistics for coffee grades (1-10), mean altitude (metres) and defects (integer).')
cont_summ
```
Summary statistics reveal a couple of erroneous altitude datapoints. (The highest two values are many times the altitude of the peak of Mount Everest!)
A correlation plot of the numeric variables was generated:
```{r}
correlations <- cor(coffee[,2:6], use="pairwise.complete.obs") # ignore missing values
ggcorrplot(correlations,
hc.order = TRUE, type = "lower", lab = TRUE)
```
The three coffee scoring variables (aroma, flavour, acidity) seem strongly positive correlated, so we may not expect to see all in a final model.
Boxplots of the numerical variables were produced, split by quality classification:
```{r boxplots, warning = FALSE}
box1 <- coffee %>%
ggplot(aes(x = Qualityclass, y = aroma)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "Aroma grade") +
theme(axis.title = element_text(size = 9))
box2 <- coffee %>%
ggplot(aes(x = Qualityclass, y = flavor)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "Flavour grade") +
theme(axis.title = element_text(size = 9))
box3 <- coffee %>%
ggplot(aes(x = Qualityclass, y = acidity)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "Acidity grade") +
theme(axis.title = element_text(size = 9))
box4 <- coffee %>%
ggplot(aes(x = Qualityclass, y = altitude)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "Altitude (m)") +
theme(axis.title = element_text(size = 9))
box5 <- coffee %>%
ggplot(aes(x = Qualityclass, y = defects)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "# of defects") +
theme(axis.title = element_text(size = 9))
box6 <- coffee %>%
ggplot(aes(x = Qualityclass, y = harvested)) +
geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
labs(x = "Quality Class", y = "Year") +
theme(axis.title = element_text(size = 9))
lay <- cbind(c(1,1,2,2,3,3),
c(4,4,5,5,6,6))
grid.arrange(box1, box2, box3, box5, box4, box6, layout_matrix = lay)
```
There are too many categories by country, so we use the countrycode() function to divide each country into its corresponding continent.
```{r conti var, warnings = FALSE}
coffee <- as.data.frame(coffee)
coffee$continent = countrycode(sourcevar = coffee[,"country"],
origin="country.name",
destination = "continent")
# United States (Puerto Rico) were not matched unambiguously
coffee[c(374,408,764,787),"continent"] <- 'Americas'
coffee <- as_tibble(coffee)
```
Bar plots of quality by country. (add continent variable)
```{r bar charts, warnings = FALSE}
countries <- table(coffee$Qualityclass, coffee$country)
countries <- as.data.frame(countries)
colnames(countries)<- c('quality','country','count')
bar1 <- countries %>%
ggplot(aes(x = country, y = count)) +
geom_col(aes(fill = quality)) +
coord_flip()
bar1
continent <- table(coffee$Qualityclass, coffee$continent)
continent <- as.data.frame(continent)
colnames(continent)<- c('quality','continent','count')
#Easy to compare the quality of coffee beans within the same continent
plot_xtab(coffee$continent,coffee$Qualityclass, show.values = FALSE,
show.total = FALSE,
legend.title = "Quality")
```
# Formal Data Analysis
In formal data analysis, we use three link functions to fit generalized linear models, then use step-wise regression to select reasonable explanatory variables based on AIC values, and finally use chi-square values to judge the fitness of the model.
```{r model selection, echo= FALSE, warning=FALSE}
coffee1 <- as.data.frame(coffee)
coffee1[which(coffee1$Qualityclass=="Poor"),]$Qualityclass=0
coffee1[which(coffee1$Qualityclass=="Good"),]$Qualityclass=1
coffee1$Qualityclass = as.factor(coffee1$Qualityclass)
```
***GLM (logit link)***
```{r logit, echo= FALSE, warning=FALSE}
logit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "logit"))
both_logit = step(logit_model,direction="both")
summ(both_logit)
both_logit$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## Fitness of model
summary(both_logit)$null.deviance - summary(both_logit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```
- formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested
- AIC: 533.47, BIC = 562.23
- fit the model, we can reject the null hypothesis, and the terms are all significant.
***GLM (probit link)***
```{r probit, echo= FALSE, warning=FALSE}
# Probit link
probit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "probit"))
both_probit = step(probit_model,direction = "both")
summ(both_probit)
both_probit$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## Fittness of model
summary(both_probit)$null.deviance - summary(both_probit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```
- formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested
- AIC = 554.03, BIC = 582.79
- fit the model, we can reject the null hypothesis, and the terms are all significant.
***GLM (complementary log-log link)***
```{r cloglog, echo= FALSE, warning=FALSE}
clog_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "cloglog"))
both_clog = step(clog_model,direction = "both")
summ(both_clog)
both_clog$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## Fittness of model
summary(both_clog)$null.deviance - summary(both_clog)$deviance > qchisq(0.95,891-887)
## TRUE.
```
- formula = Qualityclass ~ aroma + flavor + acidity + harvested
- AIC = 636.96, BIC = 660.93
- fit the model, we can reject the null hypothesis, and the terms are all significant.
***Model Statement***
According to the stepwise regression results, the GLM explanatory variables of complementary log-log link are *aroma*, *flavor*, *acidity* and *harvested*, and the GLM explanatory variables of logit link and probit link are *aroma*, *flavor*, *acidity*, *altitude* and *harvested*.
The Pearson chi-squared statistics of three models are all greater than the 95th percentile of the $\chi^{2}(4)$ distribution. Therefore the models fit the data well and We need to choose the appropriate link function by comparing the information criteria.
# GLM Model Selection
The AIC and BIC values corresponding to each link function are as follows:
```{r aicbic, echo= FALSE, warning=FALSE}
summlogit = summ(both_logit)
summprobit = summ(both_probit)
summclog = summ(both_clog)
aic1 <- round(summlogit$model$aic, 3)
aic2 <- round(summprobit$model$aic, 3)
aic3 <- round(summclog$model$aic, 3)
```
Link | Link Function | AIC | BIC
:-----------------|:----------------------- |:----:|:----:
Logit link | $g\left(p_{i}\right)=\log \left(\frac{p_{i}}{1-p_{i}}\right)$ | 533.47 | 562.23
Probit link | $g\left(p_{i}\right)=\Phi^{-1}\left(p_{i}\right)=\beta_{0}+\beta_{1} x_{i}$ | 554.03 | 582.79
Complementary log-log link| $g\left(p_{i}\right)=\log \left[-\log \left(1-p_{i}\right)\right]=\beta_{0}+\beta_{1} x_{i}$ | 636.96 | 660.93
Based on the AIC and BIC values in the table above, the model using logit link fits best in three. So we finally choose the logit link function in GLM.
The GLM regression model of logit link is as follows:
$$
Y \sim B(m_i, p{(\text {Qualityclass = Good})}_i),
$$
$$
g\left(p{(\text {Qualityclass = Good})}_{i}\right)=\log \left(\frac{p{(\text {Qualityclass = Good})}_{i}}{1-p{(\text {Qualityclass = Good})}_{i}}\right),
$$
```{r logitformula, echo= FALSE, warning=FALSE}
logitsele = summary(both_logit)
Coefs <- round(coef(logitsele), 4)
```
$$
\log \left(\frac{p{(\text {Qualityclass = Good})}}{1-p{(\text {Qualityclass = Good})}}\right) = `r Coefs[1]` + `r Coefs[2]` \cdot aroma + `r Coefs[3]` \cdot flavor + `r Coefs[4]` \cdot acidity + `r Coefs[5]` \cdot altitude + `r Coefs[6]` \cdot harvested.
$$
Considering that the correlation coefficient of aroma, flavor and acidity in EDA is relatively large, we calculated the VIF value of the variables in the regression.
```{r VIF, echo= FALSE, warning=FALSE}
vif(both_logit) %>%
kable(caption = 'VIF of Variables', digits = 4)%>%
kable_styling(latex_options = "hold_position")
```
The results show that the VIF values are all small, excluding multicollinearity.
# Conclusion & Further work
As we can see from the predict model, where p=Prob(Good) and 1-p=Prob(Poor). The correlation coefficients of explanatory variables to quality score for the batch are all positive, which means that the log-odds of the qualityclass being good increases as the grade of aroma, flavor and acidity, the altitude of the grower farm and the year increase.
This provides us with a point estimate of how the log-odds changes with aroma grade, flavor grade, acidity grade, altitude meters and year. Therefore, if we want to improve the quality of coffee, we should focus on raising the level of the aroma, flavor and acidity grade, plant seeds at high elevations and extend the harvest year.
There is clearly future work to be done on exploring the influencing factors of coffee quality and the degree of how much they will influence coffee quality.
First, I want to approach this problem with tree based methods such as bagged tree. They are widely popular due to their ease of use and easy interpretation of the outcome. Tree models are trained on the training dataset and produce an inferred model that can be used to map to new datapoints to predict an outcome. The outcome can be a numeric value in the case of regression (e.g. quality score of a coffee), but we do not classify them to be 'Good' and 'Poor'. Except that, we may try to look at other factors that might affect the quality of coffee. For example, sunshine time, the average temperature, air wetness, farm income and so on. Further development of improving coffee quality should be emphasized in taking the holistic view to include both the natural environment and economic environment.