Group_7_Analysis.Rmd

---
title: "Group_7_Analysis"
author: "Group 7"
date: "23/03/2022"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(kableExtra)
library(skimr)
library(ggcorrplot)
library(gridExtra)
library(countrycode)
library(sjPlot)
library(knitr)
library(car)
library(moderndive)
library(jtools)

```

# Exploratory Data Analysis

```{r data, echo= FALSE, eval = TRUE}
# import and process data
coffee <- read.csv('dataset7.csv')
```

```{r process, echo= FALSE, warning=FALSE}
coffee <- as_tibble(coffee)
coffee <- coffee %>% 
  rename(altitude = altitude_mean_meters) %>% 
  rename(defects = category_two_defects) %>% 
  rename(country = country_of_origin)
summary(coffee)

# removing the two data errors for altitude
coffee$altitude <- ifelse(coffee$altitude > 8000, NA, coffee$altitude)
#removing na obs
coffee <- coffee %>% na.omit()
```

Initially, summary statistics reveal a couple of erroneous altitude datapoints. (The highest two values are many times the altitude of the peak of Mount Everest!)

```{r}
# removing the two data errors for altitude
coffee$altitude <- ifelse(coffee$altitude > 8000, NA, coffee$altitude)
#removing na obs
coffee <- coffee %>% na.omit()
```


Table of summary statistics for continuous variables:

```{r cont_summaries, warning = FALSE}
cont_summ <- coffee %>% 
  select(aroma, flavor, acidity, altitude, defects) %>% 
  summarise_each(funs(mean = round(mean(., na.rm = TRUE), 2),
                      sd = round(sd(., na.rm = TRUE), 2),
                      min = min(., na.rm = TRUE),
                      q25 = quantile(., 0.25, na.rm = TRUE), 
                      median = median(., na.rm = TRUE), 
                      q75 = quantile(., 0.75, na.rm = TRUE), 
                      max = max(., na.rm = TRUE))) %>% 
  gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, mean, sd, min, q25, median, q75, max) %>%
  slice(3, 5, 1, 2, 4) %>% 
  kable(caption = 'Summary statistics for coffee grades (1-10), mean altitude (metres) and defects (integer).')

cont_summ
```

Summary statistics reveal a couple of erroneous altitude datapoints. (The highest two values are many times the altitude of the peak of Mount Everest!)


A correlation plot of the numeric variables was generated:

```{r}
correlations <- cor(coffee[,2:6], use="pairwise.complete.obs") # ignore missing values
ggcorrplot(correlations, 
           hc.order = TRUE, type = "lower", lab = TRUE)
```

The three coffee scoring variables (aroma, flavour, acidity) seem strongly positive correlated, so we may not expect to see all in a final model.

Boxplots of the numerical variables were produced, split by quality classification:

```{r boxplots, warning = FALSE}
box1 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = aroma)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "Aroma grade") +
  theme(axis.title = element_text(size = 9))

box2 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = flavor)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "Flavour grade") +
  theme(axis.title = element_text(size = 9))

box3 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = acidity)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "Acidity grade") +
  theme(axis.title = element_text(size = 9))

box4 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = altitude)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "Altitude (m)") +
  theme(axis.title = element_text(size = 9))

box5 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = defects)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "# of defects") +
  theme(axis.title = element_text(size = 9))

box6 <- coffee %>%
  ggplot(aes(x = Qualityclass, y = harvested)) +
  geom_boxplot(aes(fill = Qualityclass), show.legend = FALSE) +
  labs(x = "Quality Class", y = "Year") +
  theme(axis.title = element_text(size = 9))

lay <- cbind(c(1,1,2,2,3,3),
             c(4,4,5,5,6,6))
grid.arrange(box1, box2, box3, box5, box4, box6, layout_matrix = lay)
```


There are too many categories by country, so we use the countrycode() function to divide each country into its corresponding continent.

```{r conti var, warnings = FALSE}

coffee <- as.data.frame(coffee)
coffee$continent = countrycode(sourcevar = coffee[,"country"],
                         origin="country.name",
                         destination = "continent")
# United States (Puerto Rico) were not matched unambiguously
coffee[c(374,408,764,787),"continent"] <- 'Americas'
coffee <- as_tibble(coffee)

```

Bar plots of quality by country. (add continent variable)

```{r bar charts, warnings = FALSE}

countries <- table(coffee$Qualityclass, coffee$country)
countries <- as.data.frame(countries)
colnames(countries)<- c('quality','country','count')

bar1 <- countries %>% 
  ggplot(aes(x = country, y = count)) +
  geom_col(aes(fill = quality)) +
  coord_flip()

bar1

continent <- table(coffee$Qualityclass, coffee$continent)
continent <- as.data.frame(continent)
colnames(continent)<- c('quality','continent','count')

#Easy to compare the quality of coffee beans within the same continent
plot_xtab(coffee$continent,coffee$Qualityclass, show.values = FALSE,
          show.total = FALSE,
          legend.title = "Quality")

```


# Formal Data Analysis 

In formal data analysis, we use three link functions to fit generalized linear models, then use step-wise regression to select reasonable explanatory variables based on AIC values, and finally use chi-square values to judge the fitness of the model.

```{r model selection, echo= FALSE, warning=FALSE}
coffee1 <- as.data.frame(coffee)

coffee1[which(coffee1$Qualityclass=="Poor"),]$Qualityclass=0
coffee1[which(coffee1$Qualityclass=="Good"),]$Qualityclass=1

coffee1$Qualityclass = as.factor(coffee1$Qualityclass)
```

***GLM (logit link)***

```{r logit, echo= FALSE, warning=FALSE}
logit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "logit"))
both_logit = step(logit_model,direction="both")
summ(both_logit)
both_logit$anova %>%
  select(Step, AIC) %>%
  kable() %>%
  kable_styling(font_size = 10, latex_options = "hold_position")


## Fitness of model
summary(both_logit)$null.deviance - summary(both_logit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```

- formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested

- AIC: 533.47, BIC = 562.23

- fit the model, we can reject the null hypothesis, and the terms are all significant.

***GLM (probit link)***

```{r probit, echo= FALSE, warning=FALSE}

# Probit link
probit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "probit"))
both_probit = step(probit_model,direction = "both")
summ(both_probit)
both_probit$anova %>%
  select(Step, AIC) %>%
  kable() %>%
  kable_styling(font_size = 10, latex_options = "hold_position")

## Fittness of model
summary(both_probit)$null.deviance - summary(both_probit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant

```

- formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested

- AIC = 554.03, BIC = 582.79

- fit the model, we can reject the null hypothesis, and the terms are all significant.

***GLM (complementary log-log link)***

```{r cloglog, echo= FALSE, warning=FALSE}

clog_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "cloglog"))
both_clog = step(clog_model,direction = "both")
summ(both_clog)
both_clog$anova %>%
  select(Step, AIC) %>%
  kable() %>%
  kable_styling(font_size = 10, latex_options = "hold_position")

## Fittness of model
summary(both_clog)$null.deviance - summary(both_clog)$deviance > qchisq(0.95,891-887)
## TRUE. 

```

- formula = Qualityclass ~ aroma + flavor + acidity + harvested

- AIC = 636.96, BIC = 660.93

- fit the model, we can reject the null hypothesis, and the terms are all significant.

***Model Statement***

According to the stepwise regression results, the GLM explanatory variables of complementary log-log link are *aroma*, *flavor*, *acidity* and *harvested*, and the GLM explanatory variables of logit link and probit link are *aroma*, *flavor*, *acidity*, *altitude* and *harvested*.

The Pearson chi-squared statistics of three models are all greater than the 95th percentile of the $\chi^{2}(4)$ distribution. Therefore the models fit the data well and We need to choose the appropriate link function by comparing the information criteria.

# GLM Model Selection

The AIC and BIC values corresponding to each link function are as follows:

```{r aicbic, echo= FALSE, warning=FALSE}
summlogit = summ(both_logit)
summprobit = summ(both_probit)
summclog = summ(both_clog)
aic1 <- round(summlogit$model$aic, 3)
aic2 <- round(summprobit$model$aic, 3)
aic3 <- round(summclog$model$aic, 3)
```
 
 Link              | Link Function           | AIC  | BIC   
 :-----------------|:----------------------- |:----:|:----:
 Logit link                | $g\left(p_{i}\right)=\log \left(\frac{p_{i}}{1-p_{i}}\right)$             | 533.47   | 562.23    
 Probit link               | $g\left(p_{i}\right)=\Phi^{-1}\left(p_{i}\right)=\beta_{0}+\beta_{1} x_{i}$             | 554.03   | 582.79   
 Complementary log-log link| $g\left(p_{i}\right)=\log \left[-\log \left(1-p_{i}\right)\right]=\beta_{0}+\beta_{1} x_{i}$             | 636.96    | 660.93  

Based on the AIC and BIC values in the table above, the model using logit link fits best in three. So we finally choose the logit link function in GLM.

The GLM regression model of logit link is as follows:

$$
Y \sim B(m_i, p{(\text {Qualityclass = Good})}_i),
$$


$$
g\left(p{(\text {Qualityclass = Good})}_{i}\right)=\log \left(\frac{p{(\text {Qualityclass = Good})}_{i}}{1-p{(\text {Qualityclass = Good})}_{i}}\right),
$$


```{r logitformula, echo= FALSE, warning=FALSE}
logitsele = summary(both_logit)
Coefs <- round(coef(logitsele), 4)
```

$$
\log \left(\frac{p{(\text {Qualityclass = Good})}}{1-p{(\text {Qualityclass = Good})}}\right) = `r Coefs[1]` + `r Coefs[2]` \cdot aroma + `r Coefs[3]` \cdot flavor + `r Coefs[4]` \cdot acidity + `r Coefs[5]` \cdot altitude + `r Coefs[6]` \cdot harvested.
$$


Considering that the correlation coefficient of aroma, flavor and acidity in EDA is relatively large, we calculated the VIF value of the variables in the regression. 

```{r VIF, echo= FALSE, warning=FALSE}
vif(both_logit) %>%
  kable(caption = 'VIF of Variables', digits = 4)%>%
  kable_styling(latex_options = "hold_position")
```

The results show that the VIF values are all small, excluding multicollinearity.

# Conclusion & Further work

As we can see from the predict model, where p=Prob(Good) and 1-p=Prob(Poor). The correlation coefficients of explanatory variables to quality score for the batch are all positive, which means that the log-odds of the qualityclass being good increases as the grade of aroma, flavor and acidity, the altitude of the grower farm and the year increase.

This provides us with a point estimate of how the log-odds changes with aroma grade, flavor grade, acidity grade, altitude meters and year. Therefore, if we want to improve the quality of coffee, we should focus on raising the level of the aroma, flavor and acidity grade, plant seeds at high elevations and extend the harvest year.


There is clearly future work to be done on exploring the influencing factors of coffee quality and the degree of how much they will influence coffee quality. 

First, I want to approach this problem with tree based methods such as bagged tree. They are widely popular due to their ease of use and easy interpretation of the outcome. Tree models are trained on the training dataset and produce an inferred model that can be used to map to new datapoints to predict an outcome. The outcome can be a numeric value in the case of regression (e.g. quality score of a coffee), but we do not classify them to be 'Good' and 'Poor'. Except that, we may try to look at other factors that might affect the quality of coffee. For example, sunshine time, the average temperature, air wetness, farm income and so on. Further development of improving coffee quality should be emphasized in taking the holistic view to include both the natural environment and economic environment.