understanding_models.qmd

# Understanding the Model {#sec-knowing}

![](img/chapter_gp_plots/gp_plot_2.svg){width=75%}


<!-- TODO: add conformal prediction discussion? This could be added here, as a general 'predictive' uncertainty, esp around pp check or in estimation as a comparison to bayes/boot -->

<!-- krantz will still number any panel tabset, even when expressly denoted unnumbered. Furthermore, it can't seem to decide whether to base the number on the previous higher level e.g. 4.3.1.1 or do a zero 4.3.0.1; it seems that if there is a 4.3.1, then it will do 4.3.1.1 even though it's not 4th level, otherwise, it will do 4.3.0.1 if there is no 4.3.1 header -->


```{r}
#| label: setup-knowing-r-load-objects
#| include: false
# load("data/models/knowing_models/model_lr_train.RData")
# load("data/models/knowing_models/model_class_train.RData")
# load("data/models/knowing_models/model_lr_3feat.RData")

load("data/models/knowing_models/knowing_model_data.RData")
df_reviews = read_csv("data/movie_reviews.csv")
```

```{python}
#| label: setup-knowing-py
#| include: false

import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.smpickle import load_pickle

np.set_printoptions(formatter={'float': lambda x: "{0:0.4f}".format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)

df_train = pd.read_csv("data/models/knowing_models/knowing_model_data_train.csv")
df_test = pd.read_csv("data/models/knowing_models/knowing_model_data_test.csv")

# model_lr_train = load_pickle("data/models/knowing_models/model_lr_train.pkl")
# model_class_train = load_pickle("data/models/knowing_models/model_class_train.pkl")
# model_lr_3feat = load_pickle("data/models/knowing_models/model_lr_3feat.pkl")

df_reviews = pd.read_csv("data/movie_reviews.csv")


# misc helpful objects
features = [
    "review_year_0", 
    "release_year_0",
    "age_sc", 
    "length_minutes_sc", 
    "total_reviews_sc", 
    "word_count_sc", 
    "genre", 
]
```


```{python}
#| echo: false
#| label: py-display-setup
import numpy as np
import pandas as pd
np.set_printoptions(precision = 4, suppress=True, floatmode = 'maxprec') # suppress is for whether to use scientific notation for otherwise rounded to zero numbers; apparently is either ignored or quarto doesn't pass
pd.set_option('display.precision', 4) # number of digits of precision for floating point output

np.set_printoptions(formatter={'float': lambda x: "{0:0.4f}".format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
```

In addition to giving the world one of the greatest television show theme songs -- Quincy Jones' *The Streetbeater* --  *Sanford & Son* gave us an insightful quote for offering criticism: "You big dummy." While we don't advocate for swearing at or denigrating your model, how do you know if your model is performing up to your expectations? It is easy to look at your coefficients, *t*-values, and an $R^2$, and say, "Wow! Look at this great model!" Your friends will be envious of such terrific *p*-values, and all of the strangers that you see at social functions will be impressed. What happens if that model falls apart on new data, though? What if a stakeholder wants to know exactly how a prediction was made for a specific business decision? Sadly, all of the stars that you gleefully pointed towards in your console will not offer you any real answers. 

Instead of falling in immediate love with your model, you should ask some hard questions of it. How does it perform on different slices of data? Do predictions make sense? Is your classification cut-point appropriate? In other words, you should criticize your model before you decide it can be used for its intended purposes. Remember that it is *data modeling*, not *data truthing*. In other words, you should always be prepared to call your model a "big dummy". 


## Key Ideas {#sec-knowing-key}

- Metrics can help you assess how well your model is performing, and they can also help you compare different models.
- Different metrics can be used depending on the goals of your model.
- Visualizations can further help us understand model performance in a variety of ways.


### Why this matters {#sec-knowing-why}

It's never good enough to simply get model results. You need to know how well your model is performing and how it is making predictions. You also should be comparing your model to other alternatives. Doing so provides more confidence in your model and helps you to understand how it is working, and just as importantly, where it fails. This is actionable knowledge.

### Helpful context {#sec-knowing-good}

This takes some of the things we see in other chapters on linear models and machine learning, so we'd suggest having the linear model basics down pretty well.


## Model Metrics {#sec-knowing-model-metrics}

A first step in understanding our model can be done with summary statistics, typically called **metrics**. Regression and classification have different metrics for assessing model performance. We want to give you a sample of some of the more common ones, but we also want to acknowledge that there are many more that you can use, and any might be useful. We would always recommend looking at a few different metrics for any given model to get a better sense of how your model is performing.

@tbl-performance-metrics illustrates some of the most commonly used performance metrics. Just because these are popular or applicable for your situation, doesn't mean they are the only ones you can or even should use. Nothing keeps you from using more than one metric for assessment, and in fact, it is often a good idea to do so. In general though, you should have a working knowledge of these, as you will likely come across them even if you don't use them all the time.


{{< pagebreak >}}

<!-- 
this option appears to make the whole book do left-right margining.
\KOMAoptions{paper=landscape,pagesize}
\recalctypearea

-->

\scriptsize

\blandscape

```{r}
#| echo: false
#| label: tbl-performance-metrics
#| tbl-cap: Commonly Used Performance Metrics in Machine Learning.


performance_metrics = tribble(
    ~Problem_Type, ~Metric, ~Description, ~`Other Names/Notes`,
    "Regression", "RMSE", "Root mean squared error", 'MSE (before square root)',
    "Regression", "MAE", "Mean absolute error", '',
    "Regression", "MAPE", "Mean absolute percentage error",  '',
    "Regression", "RMSLE", "Root mean squared log error", '',
    "Regression", "R-squared", "Amount of variance shared by predictions and target", 'Coefficient of determination',
    "Regression", "Deviance/AIC", "Generalization of sum of squared error ", 'Also "deviance explained" for similar R-sq interpretation',
    "Classification", "Accuracy", "Proportion correct", 'Error rate is 1 - Accuracy',
    "Classification", "Precision", "Proportion of positive predictions that are correct", 'Positive Predictive Value',
    "Classification", "Recall", "Proportion of positive samples that are predicted correctly", 'Sensitivity, True Positive Rate',
    "Classification", "Specificity", "Proportion of negative samples that are predicted correctly", 'True Negative Rate',
    'Classification', 'Negative Predictive Value', 'Proportion of negative predictions that are correct', '',
    "Classification", "F1", "Harmonic mean of precision and recall", 'F-Beta',
    "Classification", "AUC", "Area under the ROC curve", '',
    "Classification", 'False Positive Rate', 'Proportion of negative samples that are predicted incorrectly', 'Type I Error, alpha',
    "Classification", 'False Negative Rate', 'Proportion of positive samples that are predicted incorrectly', 'Type II Error, beta, Power is 1 - beta',
    "Classification", 'Phi', 'Correlation between predicted and actual', "Matthews Correlation",
    "Classification", "Log loss", "Negative log likelihood of the predicted probabilities", ''
)

performance_metrics |>  
    group_by(Problem_Type) |> 
    gt() |>  
    tab_header(
        title = "",
        subtitle = ""
    ) |> 
    tab_footnote(
        footnote = "Beta = 1 for F1",
        locations = cells_body(columns = vars(`Other Names/Notes`), rows = Metric == 'F1'),
        # placement = 'left'
    )  |> 
    # tab_footnote(
    #   footnote = "Not sure which fields refer to Matthews Correlation outside of CS and bioinformatics, since 'phi' had already been widely used for over 60 years before Matthews forgot to cite it in his paper, and phi was literally the first correlation coefficient devised by Pearson.[phi == mcc](https://en.wikipedia.org/wiki/Phi_coefficient#Machine_learning)",locations = cells_body(columns = vars(`Other Names/Notes`), rows = Metric == 'Phi') 
    # )  |> 
    tab_options(
        footnotes.font.size = 10,
    )

```

\elandscape

{{< pagebreak >}}

<!-- \KOMAoptions{paper=portrait,pagesize}
\recalctypearea -->


\normalsize


### Regression metrics {#sec-knowing-reg-metrics}

The primary goal of our endeavor is to come up with a predictive model. The closer our model predictions are to the observed target values, the better our model is performing. As we saw in @tbl-performance-metrics, when we have a numeric target there are quite a few metrics that help us understand prediction-target correspondence, so let's look at some of those.

But before we create a model to get us started, we are going to read in our data and then create two different splits within our data: a **training** set and a **test** set. In other words, we are going to **partition** our data so that we can train a model and then see how well that model performs with data it hasn't seen[^pyvr-splits].  For more on this process and the reasons why we do it see @sec-ml-generalization and @sec-ml-cv. For now, we just need to know that assessing prediction error on the test set will give us a better estimate of our metric of choice.


[^pyvr-splits]: For anyone comparing Python to R results, the data splits are not the same so outputs likewise will not be identical, though they should be very similar. We could have forced them to use the same data, but we feel you should get used to the randomness of the process.

:::{.callout-note title='Splitting Data' collapse='true'}
This basic split is the foundation of **cross-validation**. Cross-validation is a method for partitioning data into training and non-training sets in a way that allows you to better understand the model's performance. You'll find more explicit demonstration of how to do this in the machine learning chapter @sec-ml-common-models.
:::


:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-import-split

# all data found on github repo
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')

set.seed(123) # ensure reproducibility

initial_split = sample(
    x = 1:nrow(df_reviews), 
    size = nrow(df_reviews) * .75, 
    replace = FALSE
)

df_train = df_reviews[initial_split, ]

df_test = df_reviews[-initial_split, ]
```

##### Python

```{python}
#| label: python-regression-import-split
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

# all data found on github repo
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')

df_train, df_test = train_test_split(
    df_reviews, 
    test_size = 0.25, 
    random_state = 123
)
```

:::

```{r}
#| label: r-knowing-save-model-data
#| echo: false
#| eval: false

# only run as needed

save(
    df_train, 
    df_test, 
    file = "data/models/knowing_models/knowing_model_data.RData"
)
```


```{python}
#| label: python-knowing-save-model-data
#| echo: false
#| eval: false

# only run as needed

df_train.to_csv("data/models/knowing_models/knowing_model_data_train.csv", index = False)
df_test.to_csv("data/models/knowing_models/knowing_model_data_test.csv", index = False)
```

You'll notice that we created training data with 75% of our data and we will use the other 25% to test our model. This is an arbitrary but common split. With training data in hand, let's produce a model to predict review rating. We'll use the standardized (scaled `_sc`) versions of several features, and use the 'year' features starting at year 0, which represents the earlies year observed in our data[^whyzero]. Finally, we also include the genre of the movie as a categorical feature.

[^whyzero]: See @sec-data-time-features for more on why we like to start year features at 0.

:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-model-train
model_lr_train = lm(
    rating ~ 
        review_year_0 
        + release_year_0 
        + age_sc 
        + length_minutes_sc 
        + total_reviews_sc 
        + word_count_sc 
        + genre 
        ,
        df_train
)
```

##### Python

```{python}
#| label: python-regression-model-train
import statsmodels.api as sm
import statsmodels.formula.api as smf

# we'll use 'features' later also
features = [
    "review_year_0", 
    "release_year_0",
    "age_sc", 
    "length_minutes_sc", 
    "total_reviews_sc", 
    "word_count_sc", 
    "genre", 
]

model =  'rating ~ ' + " + ".join(features)

model_lr_train = smf.ols(formula = model, data = df_train).fit()
```

:::

```{r}
#| label: r-regression-model-save
#| echo: false
#| eval: false

# only run as needed
save(
    model_lr_train, 
    file = "data/models/knowing_models/model_lr_train.RData"
)
```
    

```{python}
#| label: python-regression-model-save
#| echo: false
#| eval: false

# only run as needed
model_lr_train.save("data/models/knowing_models/model_lr_train.pkl")
``` 

Now that we have a model from our training data, we can use it to make predictions on our test data:

:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-model-predict

predictions = predict(model_lr_train, newdata = df_test)
```

##### Python

```{python}
#| label: python-regression-model-predict

predictions = model_lr_train.predict(df_test)
```

:::

The goal now is to find out how close our predictions match reality. Let's look at them first:

:::{.content-visible when-format='html'}
```{r}
#| label: fig-pred-vs-obs
#| fig-cap: Observed vs. Predicted Ratings
#| echo: false
#| out-width: 100%

p_dat = tibble(
    observed = df_test$rating, 
    predicted = predictions
) 

p1 = p_dat |> 
    ggplot(aes(predicted, observed)) +
    geom_point(alpha = 0.36, size = 2, position = position_jitter(width = .1)) +
    geom_abline(
        intercept = 0,
        slope = 1,
        color = okabe_ito[['darkblue']],
        linewidth = 1
    ) +
    lims(x = c(1, 5), y = c(1, 5)) +
    labs(
        y = 'Observed',
        x = 'Predicted',
        subtitle = 'Simple scatter plot of observed and predicted ratings on test set.\nThe regression line is shown in blue.'
        # title = 'Observed vs. Predicted Ratings'
    )
# https://github.com/tidyverse/ggplot2/issues/5669 when did they introduce this feature?
p2 = p_dat |>
    select(observed, predicted) |>
    pivot_longer(everything()) |>
    ggplot(aes(x = value)) +
    # geom_density(aes(color = name, fill = name), alpha = 0.15) +
    stat_ecdf(aes(color = name, linetype = name), geom = 'borderline', linewidth = 1) +
    scale_color_manual(
        values = c(okabe_ito[['orange']], okabe_ito[['darkblue']]),
        aesthetics = c('fill', 'color')
    ) +
    labs(
        y = 'Cumulative Density',
        x = 'Rating',
        subtitle = 'Cumulative density plot of observed and predicted ratings.'
    ) +
    theme(
        legend.key.size = unit(1.5, 'cm'),
    )


p1 + p2

ggsave('img/knowing-pred_vs_obs.svg', width = 8, height = 6)
```
:::

:::{.content-visible when-format='pdf'}
![Observed vs. Predicted Ratings](img/knowing-pred_vs_obs.svg){#fig-pred-vs-obs}
:::

Obviously, our points do not make a perfect line on the left, which would indicate perfect prediction. Also, the distribution of our values suggests we're over-predicting on the lower end, as none of our predictions even go below two. We're also under-predicting on the higher end of the target's range. More generally, we're not capturing the range of the observed values very well. But we'd like to determine how far off we are in a general sense. There are a number of metrics that can be used to measure this. We'll go through a few of them here. In each case, we'll demystify the calculations to make sure we understand what's going on.


#### R-squared {.unnumbered #sec-knowing-metrics-r2}

Anyone that has done linear regression has come across the $R^2$ value. It is a measure of how well the model explains the variance in the target. One way to calculate it is as follows:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$ {#eq-r2}


where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values. The $R^2$ value is basically a proportion of how much variance in the target (the denominator) is attributable to the model features in the form of the predictions (numerator). When applied to the training or observed data, it is a value between 0 and 1, with 1 indicating that the model explains all of the variance in the target. 

Alternatively, $R^2$  can be calculated as the squared correlation between the target and predicted values, which may be more conceptually intuitive. In that sense it can almost always be useful as a *descriptive* measure, just like we use means and standard deviations in exploratory data analysis. However, it is not so great at telling us about predictive quality. Why? Take the predictions from our rating model, and add 10 to them, or make them all negative. In both cases, if you calculate it as the squared correlation of your predictions and target, even though your predictions would be ridiculous, your $R^2$ will be the same. If you use the formula shown, you could even get negative values! Another problem is that for training data, $R^2$ will always increase as you add more features to your model, whether they are useful or pure noise! This is why we use other metrics to assess predictive quality.

:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-r2
#| results: hold

residual_ss = sum((df_test$rating - predictions)^2)
total_ss = sum((df_test$rating - mean(df_test$rating))^2)

1 - residual_ss / total_ss

yardstick::rsq_trad_vec(df_test$rating, predictions)

# conceptually identical, but slight difference due
# to how internal calculations are done (not shown)
# cor(df_test$rating, predictions)^2 
# yardstick::rsq_vec(df_test$rating, predictions)

# exercise
# cor(df_test$rating, predictions)^2  
# cor(df_test$rating, predictions + 1)^2  # same
# yardstick::rsq_trad_vec(df_test$rating, predictions + 1) # negative!
```

##### Python

```{python}
#| label: python-regression-r2
#| results: hold
from sklearn.metrics import r2_score

residual_ss = np.sum((df_test.rating - predictions)**2)
total_ss = np.sum((df_test.rating - np.mean(df_test.rating))**2)

1 - residual_ss / total_ss

r2_score(df_test.rating, predictions)

# conceptually identical, but slight difference due to
# how calculations are done (not shown)
# np.corrcoef(df_test.rating, predictions)[0, 1]**2

# exercise
# np.corrcoef(df_test.rating, predictions)[0, 1]**2
# np.corrcoef(df_test.rating, predictions + 1)[0, 1]**2  # same
# r2_score(df_test.rating, predictions + 1)  # negative!
```

:::

:::{.callout-note title='R-squared variants' collapse='true'}
There are different versions of R-squared. 'Adjusted' R-squared is a common one, and it penalizes the model for adding features that don't really explain the target variance. This is a nice sentiment, but its difference versus the standard R-squared would only be noticeable for very small datasets. Some have also attempted to come up with R-squared values that are more appropriate for GLMs for count, binary and other models. Unfortunately, these 'pseudo-R-squared' values are not as interpretable as the original R-squared, and generally suffer several issues.
:::


#### Mean squared error {.unnumbered #sec-knowing-metrics-mse}

One of the most common *performance* metrics for numeric targets is the mean squared error (MSE) and its square root, root mean squared error (RMSE). The MSE is the average of the squared differences between the predicted and actual values. It is calculated as follows:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$ {#eq-mse}

MSE penalizes large errors more- since errors are squared, the larger the error, the larger the penalty. As mentioned, RMSE is just the square root of the MSE. We also saw that it as a similar metric reported in typical linear regression output (@sec-lm-interpretation-model). Like MSE, RMSE also penalizes large errors, but if you want a metric that is in the same units as the original target data, RMSE is the metric for you. It is calculated as follows:

$$
\text{RMSE} = \sqrt{\text{MSE}}
$$ {#eq-rmse}


:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-mse
#| results: hold

mse = mean((df_test$rating - predictions)^2)

mse

yardstick::rmse_vec(df_test$rating, predictions)^2
```

```{r}
#| label: r-regression-mse-2
#| results: hold
sqrt(mse)

yardstick::rmse_vec(df_test$rating, predictions)
```

##### Python

```{python}
#| label: python-regression-mse
#| results: hold
from sklearn.metrics import mean_squared_error, root_mean_squared_error

mse = np.mean((df_test.rating - predictions)**2)

mse

mean_squared_error(df_test.rating, predictions)
```

```{python}
#| label: python-regression-mse-2
#| results: hold
np.sqrt(mse)

root_mean_squared_error(df_test.rating, predictions)
```

:::

#### Mean absolute error {.unnumbered #sec-knowing-metrics-mae}

The mean absolute error (MAE) is the average of the absolute differences between the predicted and observed values. It is calculated as follows:

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|
$$ {#eq-mae}

MAE is a great metric when all you really want to know is how far off your predictions typically are from the observed values. It is not as sensitive to large errors as the MSE.

:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-mae
#| results: hold
mean(abs(df_test$rating - predictions))

yardstick::mae_vec(df_test$rating, predictions)
```

##### Python

```{python}
#| label: python-regression-mae
#| results: hold
from sklearn.metrics import mean_absolute_error

np.mean(abs(df_test.rating - predictions))

mean_absolute_error(df_test.rating, predictions)
```

:::


#### Mean absolute percentage error {.unnumbered #sec-knowing-metrics-mape}

The mean absolute percentage error (MAPE) is the average of the absolute differences between the predicted and observed values, expressed as a percentage of the observed values. It is calculated as follows:

$$
MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{y_i} 
$$ {#eq-mape}


:::{.panel-tabset}

MAPE is a great metric when you want to know how far off your predictions are from the observed values, but you want to express that difference as a percentage of the actual value. It is not as sensitive to large errors as the MSE.

##### R

```{r}
#| label: r-regression-mape
#| results: hold
mean(
    abs(df_test$rating - predictions) / 
    df_test$rating
) * 100

yardstick::mape_vec(df_test$rating, predictions)
```

##### Python

```{python}
#| label: python-regression-mape
#| results: hold
from sklearn.metrics import mean_absolute_percentage_error

np.mean(
    abs(df_test.rating - predictions) / 
    df_test.rating
) * 100

mean_absolute_percentage_error(df_test.rating, predictions) * 100
```

:::


#### Which regression metric should I use? {.unnumbered #sec-knowing-which-reg-metrics}

In the end, it won't hurt to look at a few of these metrics to get a better idea of how well your model is performing. You will *always* be using at least one of these metrics to compare different models, so use a few of them to get a better sense of how well your models are performing relative to one another. Does adding a feature help drive down RMSE, indicating that the feature helps to reduce large errors? In other words, does adding complexity to your model provide a big reduction in error? If adding features doesn't help reduce error, do you really need to include them in your model?


### Classification metrics {#sec-knowing-class-metrics}

Whenever we are classifying outcomes in a binary case, we don't have the same ability to compare a predicted score to an observed score. Instead, we typically use the predicted probability of an outcome, establish a cut-point for that probability, convert everything below that cut-point to 0, and then convert everything at or above that cut-point to 1. We can then compare a table predicted versus target **classes**, typically called a **confusion matrix**[^whysoconfused].

[^whysoconfused]: The origin of the term "confusion matrix" is a bit muddled, and it's not clear why it's not just called a classification table/matrix (as it actually is from time to time). If you call it a classification table, probably everyone will know exactly what you mean, but if you call it a confusion matrix, probably few outside of data science (or domains that use it) will know what you're talking about.

Let's start with a model to predict whether a review is "good" or "bad". We will use the same training and testing data that we created above. Explore the summary output if desired (not shown), but we will focus on the predictions and metrics.

:::{.panel-tabset}

##### R

```{r}
#| label: r-classification-model-train
#| output: false
model_class_train = glm(
    rating_good ~ 
        review_year_0 
        + release_year_0 
        + age_sc 
        + length_minutes_sc 
        + total_reviews_sc 
        + word_count_sc 
        + genre     
        , 
        df_train, 
        family = binomial
)

summary(model_class_train)

# a numeric version to use later
y_target_testing_bin = ifelse(df_test$rating_good == "good", 1, 0)
```


##### Python

```{python}
#| label: python-classification-model-train
#| output: false
import statsmodels.api as sm
import statsmodels.formula.api as smf

model =  'rating_good ~ ' + " + ".join(features)

model_class_train = smf.glm(
    formula = model,
    data = df_train,
    family = sm.families.Binomial()
).fit()


# model_class_train.summary() 
```

\small

```{python}
#| label: python-classification-model-train-show
#| echo: false
# later changed to not show the output
# model_class_train.summary()
```

\normalsize

:::

```{r}
#| label: r-classification-model-save
#| echo: false
#| eval: false

# only run as needed
save(
    model_class_train, 
    file = "data/models/knowing_models/model_class_train.RData"
)
```

```{python}
#| label: python-classification-model-save
#| echo: false
#| eval: false

# only run as needed
model_class_train.save("data/models/knowing_models/model_class_train.pkl")
```

Now that we have our model trained, we can use it to get the predicted probabilities for each observation[^predict_proba].

[^predict_proba]: Machine learning libraries in Python based on the [scikit-learn]{.pack} API will have a `predict_proba` method that will give you the probability of each class, while the `predict` method will give you the predicted class. The latter typically makes the assumption that the cut-point for classification is .5.

:::{.panel-tabset}

##### R

```{r}
#| label: r-classification-model-predict
predicted_prob = predict(
    model_class_train,
    newdata = df_test,
    type = "response"
)
```

##### Python

```{python}
#| label: python-classification-model-predict
predicted_prob = model_class_train.predict(df_test)
```

:::

We are going to take those probability values and make a decision to convert everything above .5 to the positive class (a "good" review), which we can do simply by rounding, or with an if-else type of approach. It is a bold assumption, but one that we will make at first!

:::{.panel-tabset}

##### R

```{r}
#| label: r-classification-predictions
predicted_class = round(predicted_prob)

predicted_class = ifelse(predicted_prob > .5, 1, 0)
```


##### Python

```{python}
#| label: python-classification-predictions
predicted_class = predicted_prob.round().astype(int)
```

:::

#### Confusion matrix {.unnumbered #sec-knowing-metrics-confusion}

The confusion matrix is a table that shows the number of correct and incorrect predictions made by the model. It's easy enough to get one from scratch, but we recommend using a function that will give you a nice table, and possibly all of the metrics you need along with it. To get us started, we can use a package function that will take our predictions and observed target as input to create the basic table.


:::{.panel-tabset}

##### R

We use [mlr3verse]{.pack} in the machine learning chapters, so we'll use it here for our confusion matrix. Though our predictions are 0/1, we need to convert it to a factor for this function.

```{r}
#| echo: true
#| label: r-classification-confusion-matrix
# we'll use the mlr3verse in the machine learning demos also
rating_cm = mlr3measures::confusion_matrix(
    factor(df_test$rating_good),  # requires factor
    factor(predicted_class),
    positive = "1"                # class 1 is 'good'
)
```

##### Python

We can get an extremely rudimentary confusion matrix by using the `confusion_matrix` function from `sklearn.metrics`. We'll use it here to get the basic table, but we'll use a more advanced function later.

```{python}
#| echo: true
#| label: python-classification-confusion-matrix
from sklearn.metrics import confusion_matrix

rating_cm = confusion_matrix(df_test.rating_good, predicted_class)
```

:::

```{r}
#| echo: false
#| label: tbl-pretty-confmat
#| tbl-cap: Example of a Confusion Matrix
new_cm = rating_cm$matrix

new_cm['0','0'] = paste0("TN: ", new_cm['0','0'])
new_cm['1','1'] = paste0("TP: ", new_cm['1','1'])
new_cm['0', '1'] = paste0("FN: ", new_cm['0', '1'])
new_cm['1', '0'] = paste0("FP: ", new_cm['1', '0'])

# new_cm = new_cm[2:1,2:1]

class(new_cm) = "matrix" # strip table class

new_cm  |> 
    as.data.frame() |> 
    rownames_to_column(var = " ") |> 
    rename(`True 0` = `0`, `True 1` = `1`)  |> 
    mutate(` ` = glue('Predicted {` `}')) |>
    gt()
```

-   **TP**: A True Positive is an outcome where the model correctly predicts the positive class -- the model correctly predicted that the review was good.

-   **TN**: A True Negative is an outcome where the model correctly predicts the negative class -- the model correctly predicted that the review was not good.

-   **FP**: A False Positive is an outcome where the model incorrectly predicts the positive class -- the model incorrectly predicted that the review was good when it was bad

-   **FN**: A False Negative is an outcome where the model incorrectly predicts the negative class -- the model  predicted that the review was bad when it was good.


In an ideal world, we would have all of our observations fitting nicely in the diagonal of that table.Unfortunately, we don't live in that world, and the greater amount we have in the off diagonal (i.e., in the FN and FP spots), the worse our model is at classifying outcomes. 

Let's look at some metrics that will help to see if we've got a suitable model or not. We'll describe each, then show them all after.


#### Accuracy {.unnumbered #sec-knowing-metrics-accuracy}

Accuracy's allure is in its simplicity, and because we use it for so many things in our everyday affairs. It is simply the proportion of correct predictions made by the model. But accuracy can also easily mislead you into believing a model is doing better than it is. If you have any **class imbalance**, where one class has far more observations than the other, you can get a high accuracy by simply predicting the majority class all of the time! To get around the false sense of confidence that accuracy alone can promote, we can look at a few other metrics.

:::{.callout-note title='Accuracy is not enough' collapse='true'}
Accuracy is the first thing you see and the last thing that you trust! Seriously, accuracy alone should not be your sole performance metric unless you have a perfectly even split in the target! If you find yourself in a meeting where people are presenting their classification models and they only talk about accuracy, you should be wary. This is especially true when those accuracy values seem too good to be true. At the very least, always be ready to compare it to the baseline rate, or prevalence of the majority class.
:::

#### Sensitivity/Recall/True positive rate {.unnumbered #sec-knowing-metrics-sensitivity}

**Sensitivity**, also known as **recall** or the **true positive rate**, is the proportion of observed positives that are correctly predicted by the model. If your focus is on the positive class above all else, sensitivity is the metric for you. 


#### Specificity/True negative rate {.unnumbered #sec-knowing-metrics-specificity}

**Specificity**, also known as the **true negative rate**, is the proportion of observed negatives that are correctly predicted as such. If you want to know how well your model will work with the negative class, specificity is a great metric. 

#### Precision/Positive predictive value {.unnumbered #sec-knowing-metrics-precision}

The **precision** is the proportion of positive predictions that are correct, and is often a key metric in many business use cases. While similar to sensitivity, precision focuses on positive predictions, while sensitivity focuses on observed positive cases[^precrec].

#### Negative predictive value {.unnumbered #sec-knowing-metrics-npv}

The **negative predictive value** is the proportion of negative predictions that are correct, and is the complement to precision.

[^precrec]: It's not clear why people started using things like recall/sensitivity and specificity, neither of them as clear as true positive rate and true negative rate. Why anyone thought 'precision' was a good name for any metric is beyond us, given how many metrics could generically be called as such.

Now let's demystify this a bit and see how we'd do this ourselves. Starting with a basic confusion matrix of counts, we'll then extract the values to create the metrics we need.

:::{.panel-tabset}

##### R

```{r}
#| label: r-cm-basic
our_cm = rating_cm$matrix

TN = our_cm[2, 2]
TP = our_cm[1, 1]
FN = our_cm[2, 1]
FP = our_cm[1, 2]

acc = (TP + TN) / sum(our_cm)  # accuracy
tpr = TP / (TP + FN)           # true positive rate, sensitivity, recall
tnr = TN / (TN + FP)           # true negative rate, specificity 
ppv = TP / (TP + FP)           # positive predictive value, precision
npv = TN / (TN + FN)           # negative predictive value
```

##### Python

```{python}
#| label: python-cm-basic
our_cm = rating_cm

TN = our_cm[0, 0]
TP = our_cm[1, 1]
FN = our_cm[1, 0]
FP = our_cm[0, 1]

acc = (TP + TN) / np.sum(our_cm)  # accuracy
tpr = TP / (TP + FN)              # true positive rate, sensitivity, recall
tnr = TN / (TN + FP)              # true negative rate, specificity 
ppv = TP / (TP + FP)              # positive predictive value, precision
npv = TN / (TN + FN)              # negative predictive value
```

:::


Now that we have a sense of some metrics, let's get a confusion matrix and stats using packages that will give us a lot of these metrics at once. In both cases we have an 0/1 integer where 0 is a rating of "bad" and 1 is "good".

:::{.panel-tabset}

##### R

```{r}
#| label: r-cm-summary
#| output: false
tibble(
    metric = c('ACC', 'TPR', 'TNR', 'PPV', 'NPV'),
    ours = c(acc, tpr, tnr, ppv, npv),
    package = rating_cm$measures[c('acc', 'tpr', 'tnr', 'ppv', 'npv')]
)
```

##### Python

We find [pycm]{.pack} to be a great package for this purpose, as practically every metric based on a confusion matrix you can think of is available. You can also use [sklearn.metrics]{.pack} and its corresponding `classification_report` function.

```{python}
#| label: python-cm-summary
#| output: false

from pycm import ConfusionMatrix

rating_cm = ConfusionMatrix(
    df_test.rating_good.to_numpy(), 
    predicted_class.to_numpy(), 
    digit = 3
)

# print(rating_cm) # lots of stats!

package_result = [
    rating_cm.class_stat[stat][1] # get results specific to class 1
    for stat in  ['ACC', 'TPR', 'TNR', 'PPV', 'NPV']
]

pd.DataFrame({
    'metric':['ACC', 'TPR', 'TNR', 'PPV', 'NPV'],
    'ours': [acc, tpr, tnr, ppv, npv],
    'package': package_result
})
```

:::

So now we have demystified some classification metrics as well! Your results may be slightly different due to the random nature of the data splits, but they should be very similar to these, and should match the package results regardless.

```{r}
#| echo: false
#| label: tbl-class-metrics
#| tbl-cap: Classification Results (based on R data/model)

tibble(
    metric = c('ACC', 'TPR', 'TNR', 'PPV', 'NPV'),
    ours = c(acc, tpr, tnr, ppv, npv),
    package = rating_cm$measures[c('acc', 'tpr', 'tnr', 'ppv', 'npv')]
) |> 
    gt(decimals = 3)
```


#### Ideal decision points for classification {.unnumbered #sec-knowing-ideal-cut-point}

<!-- putting a ton of ### works in creating a correct cross-reference to this point for an unnumbered section in pdf
without actually showing any text, or in html also, although the reference in pdf will be cut to 4th level (fine, and actually correct here and/or for the most part), while the html will be 7th level max
######### {#sec-knowing-ideal-cut-point}
-->

Earlier when we obtained the predicted class, and subsequently all the metrics based on it, we used a predicted probability value of 0.5 as a cutoff for a 'good' vs. a 'bad' rating, and this is usually the default if we don't specify it explicitly. Assuming that this is the best for a given situation is actually a bold assumption on our part, and we should probably make sure that the cut-off value we choose is going to offer us the best result given the modeling context.

But what is the *best* result? That's going to depend on the situation. If we are predicting whether a patient has a disease, we might want to minimize false negatives, since if we miss the diagnosis, the patient could be in serious trouble. Meanwhile if we are predicting whether a transaction is fraudulent, we might want to minimize false positives, since if we flag a transaction as fraudulent when it isn't, we could be causing a lot of trouble for the customer, and add cost the company to deal with it. In other words, we might want to maximize the true positive or true negative rates, respectively.

Whatever we decide, we ultimately are just shifting the metrics around relative to one another. As an easy example, if we were to classify all of our observations as 'good', we would have a sensitivity of 1 because all good ratings would be classified correctly. However, our positive predictive value would not be 1, and we'd have a specificity of 0. No matter which cutpoint we choose, we are going to have to make a tradeoff.

Where this comes into play is with **model selection**, where we choose a model based on a particular metric, and something we will talk about very soon. If we are comparing models based on accuracy, we might choose a different model than if we are comparing based on sensitivity. And given a particular threshold, we might choose a different model based on the same metric, than we would have with a different threshold.

To help us with the task of choosing a threshold, we will start by creating what's called a **Receiver Operating Characteristic** (ROC) curve. This curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The **area under the curve** (AUC) is a measure of how well the model is able to distinguish between the two classes. The closer the AUC is to 1, the better the model is at distinguishing between the two classes. The AUC is a very popular metric because it is not sensitive to our threshold, and actually concerns two metrics we are often interested in[^prerecallcurve].


[^prerecallcurve]: The **precision-recall curve** is a very similar approach which visualizes the tradeoff between precision and recall. The area under the precision-recall curve is its corresponding metric.


:::{.panel-tabset}

##### R

```{r}
#| results: hide
#| label: r-roc-auc
#| fig-show: 'hide'
roc = performance::performance_roc(model_class_train, new_data = df_test)
roc

# requires the 'see' package
plot(roc) 
```

##### Python

```{python}
#| results: hide
#| label: python-roc-auc
#| fig-show: 'hide'
from sklearn.metrics import roc_curve, auc, RocCurveDisplay

fpr, tpr, thresholds = roc_curve(
    df_test.rating_good, 
    predicted_prob
)

auc(fpr, tpr)

RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
```
:::

:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-pretty-roc-auc
#| fig-cap: ROC curve and AUC value

auc_value = bayestestR::area_under_curve(
    roc$Specificity, 
    roc$Sensitivity
)

roc |>
    as_tibble() |> 
    ggplot(aes(1-Specificity, Sensitivity)) +
    geom_path(linewidth = 1.25) +
    geom_abline(intercept = 1, slope = 1, linetype = "dashed", color = 'gray75') +
    annotate(
        "text", 
        x = 0.5, 
        y = 0.75, 
        label = paste("AUC:", round(auc_value, 2)),
        size = 5,
        fontface = "bold",
        color = 'gray33'
    ) +
    scale_x_reverse() +
    labs(
        x = "1 - True Negative Rate (False Positive Rate)",
        y = "True Positive Rate",
        title = "ROC curve"
    ) 

ggsave("img/knowing-roc_auc.svg", width = 8, height = 6)
```

:::

:::{.content-visible when-format='pdf'}
![ROC curve and AUC value](img/knowing-roc_auc.svg){#fig-pretty-roc-auc}
:::

With ROC curves and AUC values in hand, now we can find the ideal cut-point for balancing the TPR and FPR. There are different ways to do this, but one common way is to use the Youden's J statistic, which we do here.

:::{.panel-tabset}

##### R


```{r}
#| label: r-ideal-cut-point

# produces the same value as before
roc_ = pROC::roc(df_test$rating_good, predicted_prob)
threshold = pROC::coords(roc_, "best", ret = "threshold")

predictions = ifelse(
    predict(model_class_train, df_test, type='response') >= threshold$threshold, 
    1, 
    0
)

cm_new = mlr3measures::confusion_matrix(
    factor(df_test$rating_good), 
    factor(predictions), 
    positive = "1"
)

tibble(
    threshold = threshold,
    TPR = cm_new$measures['tpr'],
    TNR = cm_new$measures['tnr']
)
```

##### Python

```{python}
#| label: python-ideal-cut-point
cut = thresholds[np.argmax(tpr - fpr)]

pd.DataFrame({
    'threshold': [cut],
    'TPR': [tpr[np.argmax(tpr - fpr)]],
    'TNR': [1 - fpr[np.argmax(tpr - fpr)]],
})
```   


:::

The result is a "best" decision cut-point for converting our predicted probabilities to classes, though again, there are different and equally valid ways of going about this. The take home point is that instead of being naive about setting our probability to .5, this will provide a cut-point that will lead to a more balanced result that recognizes other metrics that are important beyond accuracy. We will leave it to you to take that ideal cut-point value and update your metrics to see how much of a difference it will make[^stillpt5]. 

[^stillpt5]: It is not lost on us that our R model actually chose ~ .5! 😆 But even then you can see the sight difference in TPR/TNR.

Note again that this only changes the metric values relative to one another, not the overall performance of the model - the actual predicted probabilities don't change after all. For example, accuracy may go down while recall increases. You'll need to match these metrics to your use case to see if the change is worth it.  Whether it is a meager, modest, or meaningful improvement is going to vary from situation to situation, as will how you determine if your model is "good" or "bad". Is this a good model? Are you more interested in correctly identifying the positive class, or the negative class? Are you more interested in avoiding false positives/negatives? These are all questions that you will need to answer depending on the modeling context.


:::{.callout-note title='Multiclass Classification' collapse='true'}
The metrics we've discussed here are for binary classification, but there are also metrics for multiclass classification. The initial conversion from raw output to probability is done by a softmax function, which extends the logistic/sigmoid function to multiple classes. So we get probabilities for each class, and can classify each observation to the class with the highest probability.

The confusion matrix is extended to show the number of correct and incorrect predictions for each class, and the metrics are extended to show how well the model is able to distinguish between the classes. All metrics discussed have a multiclass variant, typically by averaging the metric across all classes, or by focusing on a particular class versus all other classes.
:::

## Model Selection & Comparison {#sec-knowing-model-compare}


Another important way to understand our model is by looking at how it compares to other models in terms of performance, however we choose to define that. One common way we can do this is by comparing models based on the metric(s) of our choice, for example, with RMSE or AUC. Let's see this in action for our regression model. Here we will compare three models: our original model, one with a subset of three features, and the three feature model that includes interactions with genre.  Our goal will be to see how these perform on the test set based on RMSE.

:::{.panel-tabset}

##### R

```{r}
#| label: r-regression-compare
#| eval: true

# create the models
model_lr_3feat = lm(
    rating ~ review_year_0 + release_year_0 + age_sc,
    df_train
)

model_lr_interact = lm(
    rating ~ review_year_0 * genre + release_year_0 * genre + age_sc * genre, 
    df_train
)

model_lr_train = lm(
    rating ~ 
        review_year_0 
        + release_year_0 
        + age_sc 
        + length_minutes_sc 
        + total_reviews_sc 
        + word_count_sc 
        + genre 
        , 
        df_train
)

# get the predictions, calculate RMSE
result = map(
    list(model_lr_3feat, model_lr_train, model_lr_interact), 
    ~ predict(.x, newdata = df_test)
    ) |> 
    map_dbl(
        ~ yardstick::rmse_vec(df_test$rating, .)
    )
```

##### Python

```{python}
#| label: python-regression-compare
#| eval: false

import statsmodels.formula.api as smf
from sklearn.metrics import root_mean_squared_error

# create the models
model_lr_3feat = smf.ols(
    formula='rating ~ review_year_0 + release_year_0 + age_sc',
    data=df_train
).fit()

model_lr_interact = smf.ols(
    formula='rating ~ review_year_0 * genre + release_year_0 * genre + age_sc * genre',
    data=df_train
).fit()

model_lr_train = smf.ols(
    formula=
    '''
    rating ~ 
    review_year_0
    + release_year_0
    + age_sc
    + length_minutes_sc
    + total_reviews_sc
    + word_count_sc
    + genre
    ''' 
    ,
    data=df_train
).fit()

models = [model_lr_3feat, model_lr_train, model_lr_interact]

# get the predictions, calculate RMSE
result = pd.DataFrame({
    'model': ['3 features', 'original', '3 feat+interact'],
    'rmse': [
        root_mean_squared_error(
            df_test.rating, 
            model.predict(df_test[features])
        )
        for model in models
    ]
})
```

:::

```{r}
#| echo: false
#| label: tbl-regression-compare
#| tbl-cap: RMSE for Different Models

tibble(
    model = c('3 features', 'original', '3 feat+interact'),
    rmse = result,
    `% Δ RMSE` = scales::label_percent()(-(result - max(result)) / max(result))
) |> 
    arrange(rmse) |>
    gt() |> 
    tab_footnote(
        '% Δ is the percentage drop relative to the largest value.',
        locations = cells_column_labels(columns = vars(`% Δ RMSE`))
    )
```

In this case, the three feature model does worst, but adding interactions of those features with genre improves the model. However, we see that our original model with 7 features has the lowest RMSE, indicating that it is the best model *under these circumstances*. This suggests that the additional features have more to add to the model. This is a simple example, but it is a very typical way to compare models that you would use frequently.  The same approach would work for classification models, just using an appropriate metric like AUC or F1.

Another thing to consider is that even with a single model, the model fitting procedure is always comparing a model with the current parameter estimates, or more generally the current objective function value, to a previous one with different parameter estimates. In this case, our goal is **model selection**, or how we choose the best result from a single model. While this is an automatic process, the details of how this actually happens is the focus of @sec-estimation. In other cases, we are selecting models through the process of cross-validation (@sec-ml-cv), but the idea is largely the same in that we are comparing our current parameter estimates to other possibilities. We are always doing model selection and comparison, and as such we'll be demonstrating these often.


## Model Visualization {#sec-knowing-model-vis}

A key method for understanding how our model is performing is through visualization. You'll recall that we started out way back by look at the predicted values against the observed values to see if there was any correspondence (@fig-my-first-model-predictions-plot), but another key way to understand our model is to look at the residuals, or errors in prediction, which again is the difference in our prediction versus the observed value. Here are a couple plots that can help us understand our model:

```{r}
#| echo: false
#| label: fig-regression-visualize
#| fig-cap: Residuals vs. Fitted and PP Check plots for our regression model

# ?see:::plot.see_performance_pp_check
p_pp = plot(
    performance::check_predictions(model_lr_train), 
    # colors = okabe_ito,
    use_theme = FALSE,
    size_line = c(1),
    line_alpha = 0.1,
    style = theme_clean()
) +
    scale_color_manual(values = c(okabe_ito[['darkblue']], okabe_ito[['orange']])) +
    labs(
        x = 'Rating',
        y = 'Density',
        caption = 'Model-predicted lines (thinner) should resemble observed data line (thicker)'
    ) +
    theme(
        axis.title.x = element_text(size = 10, face='bold'),
        axis.title.y = element_text(size = 10, face='bold'),
        plot.title = element_text(size = 10, hjust =0),
        plot.subtitle = element_text(size = 8, hjust =0),
    )

p_res = plot(
        performance::check_heteroscedasticity(model_lr_train), 
        color=c(okabe_ito[['darkblue']], scales::alpha(okabe_ito[['orange']], .15)), # ignores alpha somehow 
        alpha = 0.1
    ) + 
    labs(
        x = "Predicted"
    ) +
    theme_minimal() + # remove previous theme to get rid of title issues
    theme_clean() +
    theme(
        axis.ticks.y = element_line(),
        axis.title.x = element_text(size = 10, face='bold'),
        axis.title.y = element_text(size = 10, face='bold'),
        plot.title = element_text(size = 10, hjust = 0),
        plot.subtitle = element_text(size = 8, hjust =0),
    )


# p_pp +
#     p_res 

# see:::.plot_diag_homogeneity
```


-   **Residuals vs. Fitted**: This type of plot shows predicted values vs. the residuals (or some variant of the residuals, like their square root). If you see a pattern, that potentially means your model is not capturing something in the data. For example, if you see a funnel shape, that would suggest that you are systematically having worse predictions for some part of the data. For some plots, patterns may suggest an underlying nonlinear relationship in the data is yet to be uncovered. For our main regression model, we don't see any patterns that would indicate that the model has a notable issue.

:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-model-select-residuals
#| fig-cap: Residuals vs. Fitted Plot
p_res

ggsave("img/knowing-model-select-residuals.svg", width = 8, height = 6)
```

:::

:::{.content-visible when-format='pdf'}
![Residuals vs. Fitted and PP Check plots for our regression model](img/knowing-model-select-residuals.svg){#fig-model-select-residuals}
:::

- **Training/Test Performance**: For iterative approaches like deep learning, we may want to see how our model is performing across iterations, typically called epochs. We can look at the training and testing performance to see if our model is overfitting or underfitting. We can actually do this with standard models as well if the estimation approach is iterative, but it's not as common. We can also visualize performance across samples of the data, such as in cross-validation. The following shows performance for a model similar to the multilayer perceptron (MLP) demonstrated later (@sec-ml-common-dl-nn), and using the same features as our other models. Here we see it get to a relatively low objective function value after just a few epochs.


```{python}
#| echo: false
#| label: py-mlp
#| results: hide
#| fig-show: 'hide'
#| include: false

from sklearn.neural_network import MLPRegressor

X_train = df_train[
    [
        'review_year_0',
        'release_year_0',
        'age_sc',
        'length_minutes_sc',
        'total_reviews_sc',
        'word_count_sc',
    ]
]
X_test = df_test[
    [
        'review_year_0',
        'release_year_0',
        'age_sc',
        'length_minutes_sc',
        'total_reviews_sc',
        'word_count_sc',
    ]
]
y_train = df_train['rating'].to_numpy()
y_test = df_test['rating'].to_numpy()

mlp = MLPRegressor(
    hidden_layer_sizes=(100,),
    max_iter=500,
    random_state=42,
    early_stopping=True, # only way to save validation
    validation_fraction=0.2,
    verbose=False,
)

mlp.fit(X_train, y_train)

mlp_loss_train = np.sqrt(mlp.loss_curve_)         # not sure any docs say what this is; supposedly mse 
mlp_loss_val   = mlp.validation_scores_  # this is supposedly R2 but doesn't make any sense if so

# import matplotlib.pyplot as plt
# # Plot the error across epochs for training and test data
# plt.plot(mlp.loss_curve_)
# plt.xlabel('Epoch')
# plt.ylabel('Error')
# plt.title('Error across 25 epochs')
# plt.show()

```

:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-model-select-training-loss
#| fig-cap: RMSE Loss over 25 epochs in an MLP model

tibble(
    epoch = 1:length(py$mlp_loss_train),
    loss = unlist(py$mlp_loss_train),
    val_loss = -unlist(py$mlp_loss_val) # completely uninterpretable/useless
) |> 
    ggplot(aes(epoch, loss)) +
    geom_line() +
    # geom_line(aes(y = val_loss), color = okabe_ito[['orange']]) + 
    coord_cartesian(xlim = c(1, 25)) +
    labs(
        title = 'RMSE Loss over 25 epochs in an MLP model',
        x = 'Epoch',
        y = 'Loss',
        caption = 'Multilayer Perceptron Model for review rating'
    )

ggsave("img/knowing-model-select-training-loss.svg", width = 8, height = 6)
```
:::

:::{.content-visible when-format='pdf'}
![MSE Loss over 25 epochs in an MLP model](img/knowing-model-select-training-loss.svg){#fig-model-select-training-loss}
:::


- **Predictive Check**: This is a basic comparison of predicted vs. observed target values. In the simplest case you can just examine your predictions vs. the observed values, and that's plenty for a quick assessment. 

```{r}
#| echo: false
#| eval: false
#| label: r-regression-pp-check-simple

p_pp_simple = tibble(
    Predicted = fitted(model_lr_train),
    Observed = df_train$rating
) |>
    pivot_longer(cols = c(Predicted, Observed), names_to = "Type", values_to = "Value") |>
    ggplot(aes(x = Value, fill = Type)) +
    geom_density(alpha = 0.5, color = NA) +
    scale_fill_manual(values = c(okabe_ito[['darkblue']], okabe_ito[['orange']])) +
    labs(
        x = "Rating",
        y = "Density",
        title = "Simple Predictive Check for a regression model"
    )

ggsave("img/knowing-model-pp-check-simple.svg", p_pp_simple, width = 8, height = 6)
```

![Simple Predictive Check for a regression model](img/knowing-model-pp-check-simple.svg){#fig-model-pp-check-simple}

We may be getting ahead of ourselves to understand this completely yet, but it's worth knowing about ***posterior* predictive checks**, which are typically used with Bayesian models, but are not restricted to that case.  A proper posterior predictive check is a bit more involved, but there are packages that make it straightforward[^ppcpack]. The basic idea is that we simulate the target based on the model parameter estimates and their uncertainty.  And with that distribution of estimates (e.g. regression coefficients), we can then simulate random draws of predicted values. A step-by-step approach is as follows:

1. Simulate the parameters following the assumed distribution, e.g., a normal distribution for the regression coefficients.
2. For each simulated parameter set, make a model prediction.
3. Repeat this process many times to get a distribution of predictions.
4. Compare this distribution to the observed target distribution.


In our final step, we compare that distribution of predictions to the observed target distribution. If the two distributions are similar, then the model is doing a good job of capturing the target distribution. 

This plot is ubiquitous in Bayesian modeling, but can be used for any model that has uncertainty estimates or is otherwise able to generate random draws of the target distribution. For our regression model, our predictions match the target distribution well.

[^ppcpack]: The [performance]{.pack} package in R has a `check_predictions` function that can do this for you, and we used it here. Base R has a `simulate` function that can be used to generate random draws of the predictive distribution for `lm/glm`  models. For Python you'll need to use a package like [arviz](https://arviz-devs.github.io/arviz/) for Bayesian models, or write your own function for other models.

```{r}
#| echo: false
#| eval: false
#| label: fig-model-select-pp-check
#| fig-cap: Posterior Predictive Check for a regression model
#| out-width: 100%

p_pp

ggsave("img/knowing-model-select-pp-check.svg", width = 8, height = 6)
```

![Posterior Predictive Check for a regression model](img/knowing-model-select-pp-check.svg){#fig-model-select-pp-check}


- **Other Plots**: Other plots may look at the distribution of residuals, check for extreme values, see if there is an overabundance of zero values, and other issues, some of which may be specific to the type of model you are using.

The following shows how to get a residuals vs. fitted plot and a posterior predictive check.  


:::{.panel-tabset}

##### R {.unnumbered}

```{r}
#| eval: false
#| label: r-regression-visualize

performance::check_model(model_lr_train, check = c('linearity', 'pp_check'))
```

##### Python {.unnumbered}

```{python}
#| label: fig-python-regression-visualize
#| eval: false

import seaborn as sns
import matplotlib.pyplot as plt

## Residual Plot
sns.residplot(
    x = model_lr_train.fittedvalues, 
    y = df_train.rating, 
    lowess = True, 
    line_kws={'color': 'red', 'lw': 1}
)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted')
plt.show()


## Posterior Predictive Check
# get the model parameters
pp = model_lr_train.model.get_distribution(
    params = model_lr_train.params, 
    scale  = model_lr_train.scale, 
    exog   = model_lr_train.model.exog
)

# Generate 10 simulated predictive distributions
pp_samples = [pp.rvs() for _ in range(10)]

# Plot the distribution of pp_samples
for sample in pp_samples:
    sns.kdeplot(sample, label='pp.rvs()', alpha=0.25)

# Overlay the density plot of df_train.rating
sns.kdeplot(
    df_train.rating.to_numpy(), 
    label='df_train.rating', 
    linewidth=2
)

plt.xlabel('Rating')
plt.ylabel('Density')
plt.title('Distribution of predictions vs. observed rating')
plt.show()
```

:::


:::{.callout-note title='Tests of Assumptions' collapse='true'}
For standard GLM models there are an abundance of statistical tests available for some of these checks, for example heterogeneity of variance, or whether your residuals are normally distributed. On a practical level, these are not usually helpful, and often misguided. For example, if you have a large sample size, you will almost always reject the null hypothesis that your residuals are normally distributed. It also starts the [turtles all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down) problem of whether you need to check the assumptions of your test of assumptions! We prefer the 'seeing is believing' approach. It is often pretty clear when there are model and data issues, and if it isn't, it's probably not a big deal.
:::


## Wrapping Up {#sec-knowing-wrap}

It is easy to get caught up in the excitement of creating a model and then using it to make predictions. It is also easy to get caught up in the excitement of seeing a model perform well on a test set. It is much harder to take a step back and ask yourself, "Is this model really doing what I want it to do?"  It takes a lot of work to trust what a model is telling you.

### The common thread {#sec-knowing-thread}

Much of what you've seen in this chapter can be applied to any model. From linear regression to deep learning, we often use similar metrics to help select and compare models.


### Choose your own adventure {#sec-knowing-choose}

Now that you've gotten a grasp of how to understand the model from a general perspective, you can now focus on understanding more about how features relate to the target and make predictions. Head to @sec-knowing-feature to find out more!

### Additional resources {#sec-knowing-resources}

If this chapter has piqued your curiosity, we would encourage you to check out the following resources. 

Even though we did not use the `mlr3` package in this chapter, the **Evaluation and Benchmarking** chapter of the associated book, [Applied Machine Learning Using mlr3 in R](https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html), offers a great conceptual take on model metrics and evaluation. 

For a more Pythonic look at model evaluation, we would highly recommend going through the sci-kit learn documentation on [Model Evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html). It has you absolutely covered with code examples and concepts.