---
editor: 
  markdown: 
    wrap: 80
---

# A data-driven approach to predict GPP from VIs through machine learning methods

## Introduction

The field of Earth Science has witnessed a transformative shift with the
integration of Machine Learning (ML) methods, which has led to a deeper
understanding of our planet's complex ecosystems and processes
[@reichstein_deep_2019]. ML methods are now well established in environmental
sciences [@lary_machine_2016], including their use within studies aimed at
mapping and quantifying vegetation characteristics and ecological processes such
as vegetation cover, structure, and disturbances, among others
[@lehnert_retrieval_2015; @verrelst_machine_2012].

The increasing number of eddy covariance (EC) sites
[@tramontana_predicting_2016], coupled with a continuously growing volume of
Earth system data that now surpasses dozens of petabytes
[@reichstein_deep_2019], has led to the emergence of purely data-driven
methodologies for quantifying ecosystem status and fluxes. These approaches have
shown promise for quantifying global terrestrial photosynthesis
[@jung_global_2011; @tramontana_predicting_2016] and have advanced the
estimation of biogeophysical parameters from remotely sensed reflectance data
at both local and global scales [@coops_prediction_2003;
@verrelst_retrieval_2012].

Furthermore, these data-driven approaches have contributed significantly to the
scientific community by providing spatial, seasonal, and interannual variations
in predicted fluxes. These predictions, generated through machine learning
methodologies, are now serving as important benchmarks for evaluating the
performance of physical land-surface and climate models [@jung_recent_2010;
@bonan_improving_2011; @anav_spatiotemporal_2015].

In contrast to process-based models, data-driven models are inherently
observational: functional relationships, such as those between in-situ measured
fluxes and the explanatory variables, emerge from the patterns found in the data
rather than being predefined [@tramontana_predicting_2016]. For example, the
application of spatially explicit global data-driven methods has unveiled
discrepancies in the estimation of photosynthesis within tropical rainforests
when compared to climate models [@beer_terrestrial_2010]. This overestimation
has led to new hypotheses for a better understanding of radiative transfer in
vegetation canopies [@bonan_improving_2011], which can result in better
photosynthesis estimates. This paradigm shift toward data-driven modeling to
extract patterns represents an opportunity to explore novel ideas and question
established theories in Earth system models [@reichstein_deep_2019].

Predicting dynamics in the biosphere is challenging due to biologically mediated
processes [@reichstein_deep_2019]. The term "prediction" should not be confused
with forecasting, as most models do not aim to predict into the future.
Instead, the focus of these data-driven models is to improve historical
estimates or to use reflectance values to predict GPP where no in-situ data are
currently available [@meyer_improving_2018].

These predictions can include many forms of uncertainty [@reichstein_deep_2019].
One form is that individual ML methods can respond differently, especially when
the models are applied beyond the conditions represented in the training
dataset [@jung_towards_2009; @papale_effect_2015]. Another form is related to
the explanatory variables derived from satellite remote sensing, which provide
only partial information about the vegetation state
[@tramontana_predicting_2016]. Consequently, they lack the information required
to explain the complete variability in fluxes [@tramontana_uncertainty_2015].
For instance, if a model estimates GPP from reflectance data alone, without
meteorological data, phenomena such as drought could lead to predicted values
with large errors: stomatal closure during water deficit has an immediate
effect on the fluxes, but reflectance values only register it later, once the
stress conditions persist [@tramontana_uncertainty_2015].

This chapter evaluates data-driven approaches for estimating GPP from MODIS
surface reflectance data for temperate broadleaf forests in North America. Two
state-of-the-art approaches previously used for GPP estimation are tested,
regression random forests [@sohil_introduction_2022] and AutoML
[@ledell2020h2o], to understand how the ML algorithm impacts prediction
uncertainty. Both approaches were tested locally and using pooled data from
three sites to quantify their ability to be applied over larger spatial extents.

\newpage

## Methods

```{r libraries and sources}
#| echo: false
#| message: false
#| warning: false

# Libraries
library(ggplot2)
library(cowplot)
library(lubridate)
library(purrr)
library(broom)
library(gt)
library(tidymodels)
library(usemodels)
library(vip)
library(h2o)
library(stringr)
library(DALEX)
library(DALEXtra)
library(forcats)
library(ranger)
# Source files
# Source the objects created for the complete GPP trends.
# This file will source the code and load objects to memory.
source("scripts/trend_plots.R")

# Source the data preparation for the models
source("scripts/models_data_preparation.R")

# Source file with functions to plot rf predictions
source("R/plot_exploratory.R")
```

<!--  - Sites -->

<!--  - ONEFlux for GPP -->

<!--  - Satellite imagery -->

<!--  - indices calculation -->

<!--  - Here I'm using all the bands available -->

<!--  - Cleaning satellite data -->

<!--  - Filter -->

<!--  - scaling -->

<!--  - join -->

### Eddy Covariance sites

For this study, we selected three deciduous broadleaf forest sites: the
University of Michigan Biological Station in northern Michigan, USA
(45°35′ N, 84°43′ W), Bartlett Experimental Forest in New Hampshire, USA (44°06
N, 71°3 W), and the Borden Forest Research Station (44°19′ N, 79°56′ W) in
Ontario, Canada. These sites were selected because they represent a single
ecosystem type with shared environmental features. This approach allowed us to
treat the dataset as representative of one specific type of terrestrial
ecosystem.

In-situ GPP was obtained from the ONEFlux processing pipeline as distributed by
AmeriFlux. We selected the GPP estimates produced by the daytime partitioning
method [@pastorello2020fluxnet2015] on a daily, weekly, and monthly basis.

To capture seasonal variations and long-term trends, GPP data covered a minimum
of 2 years at each site. Specifically, data from the University of Michigan
Biological Station span January 2015 to January 2018, data from Bartlett
Experimental Forest span January 2015 to December 2018, and data from the Borden
Forest Research Station span January 2015 to January 2022.

<!-- From ameriflux: -->

<!--  https://ameriflux.lbl.gov/sites/siteinfo/US-Bar -->

<!--  https://ameriflux.lbl.gov/sites/siteinfo/CA-Cbo -->

<!--  https://ameriflux.lbl.gov/sites/siteinfo/US-UMB -->

<!-- All sites are Dfb -->

<!-- The code "Dfb" in the Köppen system refers to a specific climate type, which can be described as follows: -->

<!-- "D" stands for the warm-summer continental or hemiboreal climate. -->

<!-- "f" indicates that this climate has significant precipitation in all seasons. -->

<!-- "b" indicates that the warmest month has an average temperature between 22°C and 28°C. -->

### Satellite imagery

We used Google Earth Engine (GEE) to retrieve the Terra Moderate Resolution
Imaging Spectroradiometer (MODIS) MOD09GA Version 6.1 daily surface reflectance
product at 500 m resolution (MODIS/Terra Surface Reflectance Daily L2G Global
1 km and 500 m SIN Grid). A square polygon with an area of 3 km surrounding the
EC tower was defined for each of the study sites, and all daily pixel values
within this polygon were extracted for analysis.

MOD09GA contains the surface spectral reflectance from bands 1 through 7
(@tbl-MODIS_500_complete_bands) with a spatial resolution of 500 m, with
corrections for atmospheric conditions such as aerosols, gases, and Rayleigh
scattering [@vermote2021modis].

We selected the highest quality pixels according to the 1km Reflectance Data
State QA (`state_1km`) (@tbl-state_1km_bitstrings) and Surface Reflectance 500m
Quality Assurance (`qc_500m`) (@tbl-qc_scan_bit_strings) variables. Once we had
just the highest quality pixels, all the band values were scaled by a factor of
0.0001. If any value fell outside the range of 0 to 1 after the scaling, it was
discarded.
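
The scaling and range check can be illustrated with a minimal R sketch. The
digital numbers below are hypothetical values chosen for illustration, not
pixels from the MOD09GA extraction, and the quality filtering on `state_1km`
and `qc_500m` is assumed to have been applied beforehand.

```r
# Hypothetical raw digital numbers for one band, after quality filtering.
raw_dn <- c(450L, 3200L, -100L, 12000L, 9800L)

# Apply the 0.0001 scale factor to obtain surface reflectance.
scaled <- raw_dn * 0.0001

# Discard values outside the valid 0-1 range by setting them to NA.
scaled[scaled < 0 | scaled > 1] <- NA_real_

scaled
```

Setting out-of-range values to `NA` rather than dropping them keeps the vector
aligned with the other bands of the same pixel, which is one possible design
choice for this step.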

Once all the band values were scaled, we calculated 4 Vegetation Indices: `NDVI`
(@eq-ndvi), `NIRv` (@eq-nirv), `EVI` (@eq-evi), and `CCI` (@eq-cci). Then all
the MODIS bands values and VIs were summarized on a daily, weekly, and monthly
basis to be merged with the GPP values from ONEFlux.
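
As a hedged illustration of the index calculation, the sketch below uses
commonly published formulations of NDVI, NIRv, and EVI, which may differ in
detail from @eq-ndvi, @eq-nirv, and @eq-evi. NIRv is assumed here to be NDVI
multiplied by NIR reflectance without an offset, and CCI is left out because
its band combination is not restated in this section. The reflectance values
are hypothetical.

```r
# Inputs are scaled surface reflectances in the 0-1 range.
ndvi <- function(nir, red) (nir - red) / (nir + red)

# Assumed form of NIRv: NDVI times NIR reflectance (no offset).
nirv <- function(nir, red) ndvi(nir, red) * nir

# EVI with the standard MODIS coefficients (G = 2.5, C1 = 6, C2 = 7.5, L = 1).
evi <- function(nir, red, blue) {
  2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1)
}

# Hypothetical reflectances for a green canopy pixel.
red <- 0.05
nir <- 0.45
blue <- 0.03

vis <- c(NDVI = ndvi(nir, red),
         NIRv = nirv(nir, red),
         EVI  = evi(nir, red, blue))
vis
```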

| **Name**     | **Description** | **Resolution** | **Wavelength** |
|--------------|-----------------|----------------|----------------|
| sur_refl_b01 | Red             | 500 m          | 620-670 nm     |
| sur_refl_b02 | NIR             | 500 m          | 841-876 nm     |
| sur_refl_b03 | Blue            | 500 m          | 459-479 nm     |
| sur_refl_b04 | Green           | 500 m          | 545-565 nm     |
| sur_refl_b05 | NIR 2           | 500 m          | 1230-1250 nm   |
| sur_refl_b06 | SWIR 1          | 500 m          | 1628-1652 nm   |
| sur_refl_b07 | SWIR 2          | 500 m          | 2105-2155 nm   |

: MODIS (MOD09GA.061 product) bands used for ML methods
{#tbl-MODIS_500_complete_bands}

### Random Forests

A random forest is an ensemble learning technique that combines multiple
decision trees to improve predictive accuracy and robustness. Each tree, whether
for regression or classification, is built from a random subset of both the
training data and the features [@sohil_introduction_2022]. At each split, the
best predictor from the random feature subset is selected to partition the
data. The final prediction aggregates the individual tree predictions; in a
regression random forest, they are averaged to produce a continuous numeric
estimate [@sohil_introduction_2022; @meyer_improving_2018].

Regression random forests were used as an approach to predict GPP as a function
of all available MOD09GA bands values (from B01 to B07) and the calculated VIs.
Three distinct models were developed, each one tailored for a specific time
scale (daily, weekly, and monthly). This approach allowed us to assess the
prediction performance of GPP at different time scales.

Each model was calibrated using a random data splitting procedure, dividing the
data into a training set with 75% of the observations and a test set with the
remaining 25%. Because the number of observations varies between sites, we used
a split stratified by site so that each site is proportionally represented in
both sets and the training dataset is not imbalanced. To ensure
reproducibility, we set fixed random seeds throughout the process.
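
The idea behind a split stratified by site can be sketched in base R. This is
an illustration only: the chapter's own code relies on
`rsample::initial_split(..., strata = site)`, whereas the helper
`stratified_split()`, the simulated data frame, and the value `prop = 0.75`
below are hypothetical.

```r
set.seed(752)  # fixed seed so the split is reproducible

# Hypothetical pooled data with an unbalanced number of rows per site.
pooled <- data.frame(
  site = rep(c("borden", "bartlett", "michigan"), times = c(200, 120, 80)),
  gpp  = runif(400, min = 0, max = 20)
)

# Sample the same proportion of rows within each stratum.
# Assumes every stratum has more than one row.
stratified_split <- function(data, strata, prop) {
  idx_by_group <- split(seq_len(nrow(data)), data[[strata]])
  train_idx <- unlist(lapply(idx_by_group, function(i) {
    sample(i, size = floor(prop * length(i)))
  }), use.names = FALSE)
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

parts <- stratified_split(pooled, strata = "site", prop = 0.75)

# The share of each site is the same in the training and test sets.
prop.table(table(parts$train$site))
prop.table(table(parts$test$site))
```

Sampling within each site, rather than over the pooled rows, is what keeps a
site with few observations from being under-represented in either set.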

To implement the RF models, we used the `ranger` package in R
[@wright_ranger_2017] with 1000 trees in each forest. The hyperparameters
`mtry` and `min_n` were tuned on 100 bootstrap resamples of the training set,
which gives a more robust estimate of predictive performance than a single
split.

We calculated variable importance (VIP) to understand which MODIS bands or VIs
drive the predictions in each of the regression random forest models. To
measure the influence of each feature on the model's predictive performance, we
quantified how much this performance deteriorates when the values of that
variable are permuted while all others are left unchanged.
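
Permutation importance can be sketched in a few lines of base R. To keep the
example free of dependencies, a linear model stands in for the random forest,
and the simulated predictors `x1`, `x2`, and `x3` are hypothetical. Note also
that `ranger` evaluates permutation importance on out-of-bag samples, whereas
this sketch uses the full data set for brevity.

```r
set.seed(42)

# Hypothetical data: y depends strongly on x1, weakly on x2, and not on x3.
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 * dat$x1 + 0.5 * dat$x2 + rnorm(n, sd = 0.5)

# Stand-in model (the chapter uses a random forest instead).
fit <- lm(y ~ x1 + x2 + x3, data = dat)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseline <- rmse(dat$y, predict(fit, dat))

# Permute one predictor at a time and record the increase in RMSE.
perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(dat$y, predict(fit, shuffled)) - baseline
})

sort(perm_importance, decreasing = TRUE)
```

A larger increase in error after shuffling indicates that the model relies more
heavily on that variable.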

To understand which features contributed the most on average to a particular GPP
prediction in different coalitions [@molnar2020interpretable], we calculated the
Shapley values [@lundberg_unified_nodate]. These values were computed with the
DALEX package in R [@biecek2018dalex] for both low GPP and high GPP scenarios
within each of the temporally aggregated models.
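
To make the notion of coalitions concrete, the following base R sketch computes
exact Shapley values for a toy model with three features by averaging marginal
contributions over all orderings. The model `f`, the background data, and the
instance are hypothetical. To our understanding, DALEX approximates the same
quantity by sampling a limited number of random orderings rather than
enumerating all of them.

```r
set.seed(1)

# Hypothetical background data and a simple nonlinear stand-in model.
background <- data.frame(a = runif(50), b = runif(50), c = runif(50))
f <- function(d) 10 * d$a + 5 * d$b * d$c

# Value of a coalition: mean prediction when the coalition's features are
# fixed to the instance's values and the rest come from the background data.
coalition_value <- function(coalition, instance) {
  d <- background
  for (v in coalition) d[[v]] <- instance[[v]]
  mean(f(d))
}

features <- names(background)

# All 3! = 6 orderings in which features can join the coalition.
orderings <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                  c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

shapley <- function(instance) {
  phi <- setNames(numeric(length(features)), features)
  for (ord in orderings) {
    joined <- character(0)
    for (j in ord) {
      v <- features[j]
      gain <- coalition_value(c(joined, v), instance) -
        coalition_value(joined, instance)
      phi[v] <- phi[v] + gain / length(orderings)
      joined <- c(joined, v)
    }
  }
  phi
}

instance <- data.frame(a = 0.9, b = 0.8, c = 0.7)
phi <- shapley(instance)
phi

# Efficiency check: contributions sum to prediction minus mean prediction.
efficiency_gap <- sum(phi) - (f(instance) - mean(f(background)))
```

A positive value means the feature pushes the prediction above the average
prediction, and a negative value pulls it below.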

### AutoML

The AutoML approach is designed to identify the best ML pipeline for a specific
problem and the available training data by evaluating different combinations of
data processing steps, ML models, and hyperparameter settings
[@gaber2023using]. To create a stacked ensemble of models and evaluate its
performance, we used the H2O AutoML framework [@ledell2020h2o].

For training and evaluation, we randomly split the dataset into a training and
a test set, allocating 80% and 20% of the observations, respectively. We
excluded the `total observations` and `site` variables from both datasets,
while the remaining features (bands and VIs) were used as predictors. The
response variable for prediction was GPP. We transformed the data into an H2O
frame to ensure compatibility with the AutoML process.

We limited AutoML training to a maximum of 2 minutes, during which the
framework generated candidate models and hyperparameter configurations. The
best performing models, those with the lowest prediction errors, were then used
to form an ensemble. This trained ensemble model was used to generate
predictions on the test set. Model performance was evaluated using metrics such
as the coefficient of determination (R²) and the root mean square error (RMSE).
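
For reference, both metrics can be computed by hand, as in the sketch below.
The observed and predicted values are hypothetical, and R² is taken here as
one minus the ratio of residual to total sum of squares, which is one common
definition; some implementations report the squared correlation instead.

```r
# Root mean square error, in the units of the response.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Coefficient of determination as 1 - SS_res / SS_tot (assumed definition).
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed and predicted GPP values (gC m-2 d-1).
obs  <- c(1.2, 3.4, 7.8, 10.1, 12.5)
pred <- c(1.6, 3.0, 7.0, 11.0, 12.0)

c(RMSE = rmse(obs, pred), R2 = r_squared(obs, pred))
```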

To assess variable importance, we determined the influence of each variable on
the model's predictions. The results were visualized as heatmaps covering each
of the models within the ensemble.

\newpage

## Results

### Data-Driven GPP Prediction: A Regression Random Forest Approach

The best performing model, based on R² and RMSE, was the monthly aggregated
model, achieving an R² of 0.81 and an RMSE of
$2.03 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ (see @fig-monthly_500_rf). The daily
model had the second best performance, with an RMSE of
$3.20 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and an R² of 0.69 (see
@fig-daily_500_rf). The weekly model performed worst, with an R² of 0.56 and an
RMSE of $3.23 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ (see @fig-weekly_500_rf).

Moreover, in assessing the importance of predictor variables within each model,
VIs held the top positions across all models, surpassing the importance of any
individual spectral band. Specifically, CCI contributed most to the predictive
performance in two of the models, the weekly and monthly models
(@fig-vip_weekly_500_rf and @fig-vip_monthly_500_rf). In the daily model, CCI
was the second most influential predictor variable (@fig-vip_daily_500_rf).

Despite NDVI being ranked as the second variable in importance for the
monthly model, it was the fifth variable in importance for the daily model and
the fourth one for the weekly model. Meanwhile, the NIRv consistently held the
third most influential position across all models. EVI was found to be most
important for the daily model, the second most important variable in the weekly
model, but fourth in the monthly model.

To evaluate how the predictor variables affected the model's predictions, we
used Shapley values. For each model, we calculated the contribution of each
variable to the predicted GPP in two cases: an observation with a known low GPP
value and one with a known high GPP value, both taken from the training data
set.

In the case of the daily model, we chose a low GPP value of
$0.01 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and obtained a model prediction of
$0.60 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. The Shapley value analysis showed
that EVI, CCI, and NIRv were the most influential attributes affecting the
model prediction (see @fig-shap_daily_500_rf **A**). These VIs contributed
negatively to the prediction, reducing the predicted GPP.

Conversely, the high GPP value selected
($16.21 \, \mathrm{gC \, m^{-2} \, d^{-1}}$) had a model prediction of
$12.70 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. In this case, the Shapley value
analysis indicated that the most influential variables were CCI and NIRv (see
@fig-shap_daily_500_rf **B**). These variables contributed positively to the
prediction.

For the weekly model, we selected low and high GPP values of
$1.21 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and
$13.17 \, \mathrm{gC \, m^{-2} \, d^{-1}}$, respectively, resulting in
predictions of $1.60 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and
$11.0 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. For the low GPP value, the most
influential variables were CCI, EVI, NIRv, and NDVI, all contributing negatively
to the prediction (see @fig-shap_weekly_500_rf **A**). Conversely, for the high
GPP value, CCI, EVI, NIRv, B02, and NDVI were the most influential, all
contributing positively to the prediction (see @fig-shap_weekly_500_rf **B**).

In the case of the monthly model, the low GPP value selected was
$1.86 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and the high GPP value was
$11.87 \, \mathrm{gC \, m^{-2} \, d^{-1}}$, leading to predictions of
$3.31 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ and
$10.60 \, \mathrm{gC \, m^{-2} \, d^{-1}}$, respectively. In both scenarios,
CCI, NDVI, and NIRv emerged as the most influential variables, contributing
negatively to the low GPP value prediction (see @fig-shap_monthly_500_rf **A**)
and positively to the high GPP value prediction (see @fig-shap_monthly_500_rf
**B**).

```{r data_preparation_rf}
#| echo: false
#| message: false
#| warning: false

# 500
# Dataset to use: daily_500 for all sites
bor <- borden_daily_500 %>% 
  select(ends_with(c("_mean")),
         gpp_dt_vut_ref, total_obs) %>% 
  mutate(site = "borden")

bar <- bartlett_daily_500 %>% 
  select(ends_with(c("_mean")),
         gpp_dt_vut_ref, total_obs) %>% 
  mutate(site = "bartlett")

mich <- michigan_daily_500 %>% 
  select(ends_with(c("_mean")),
         gpp_dt_vut_ref, total_obs) %>% 
  mutate(site = "michigan")

daily_500_rf <- bind_rows(bor, bar, mich) %>% 
  select(-kndvi_mean)

# Dataset to use: weekly_500 for all sites
bor <- borden_weekly_500 %>% 
  select(ends_with(c("_mean")),
         gpp_dt_vut_ref, total_obs) %>% 
  mutate(site = "borden")

variables <- names(bor)

bar <- bartlett_weekly_500 %>% 
  mutate(site = "bartlett") %>% 
  select(all_of(variables))

mich <- michigan_weekly_500 %>% 
  mutate(site = "michigan") %>% 
  select(all_of(variables))

weekly_500_rf <- bind_rows(bor, bar, mich) %>% 
  select(-kndvi_mean)

## Dataset to use: monthly_500 for all sites
bor <- borden_monthly_500 %>% 
  select(ends_with(c("_mean")),
         gpp_dt_vut_ref, total_obs) %>% 
  mutate(site = "borden")

variables <- names(bor)

bar <- bartlett_monthly_500 %>% 
  mutate(site = "bartlett") %>% 
  select(all_of(variables))

mich <- michigan_monthly_500 %>% 
  mutate(site = "michigan") %>% 
  select(all_of(variables))

monthly_500_rf <- bind_rows(bor, bar, mich) %>% 
  select(-kndvi_mean)
```

<!-- #### Daily 500 -->

```{r xgboost_daily}
#| echo: false
#| message: false
#| warning: false
# set.seed(123)
# daily_500_split <- initial_split(daily_500_rf, strata = site)
# daily_500_train <- training(daily_500_split)
# daily_500_test <- testing(daily_500_split)
# 
# set.seed(234)
# # daily_500_folds 
# daily_500_folds <- bootstraps(daily_500_train,
#                               times = 100,
#                               strata = gpp_dt_vut_ref)
# 
# bb_recipe <- 
#   recipe(formula = gpp_dt_vut_ref ~ ., data = daily_500_train) %>% 
#   step_select(-site, -total_obs)
# 
# xgb_spec <-
#   boost_tree(
#     trees = tune(),
#     min_n = tune(),
#     mtry = tune(),
#     learn_rate = 0.01
#   ) %>%
#   set_engine("xgboost") %>%
#   set_mode("regression")
# 
# xgb_wf <- workflow(bb_recipe, xgb_spec)
# 
# set.seed(3156)
# 
# library(finetune)
# doParallel::registerDoParallel()
# 
# set.seed(345)
# xgb_rs <- tune_race_anova(
#   xgb_wf,
#   resamples = daily_500_folds,
#   grid = 15,
#   # metrics = metric_set(mn_log_loss),
#   control = control_race(verbose_elim = TRUE)
# )
# 
# plot_race(xgb_rs)
# 
# show_best(xgb_rs)
# 
# xgb_last <- xgb_wf %>%
#   finalize_workflow(select_best(xgb_rs)) %>%
#   last_fit(daily_500_split)
# 
# xgb_last
# metrics <- collect_metrics(xgb_last)
# 
# plot_predictions_rf(xgb_last, metrics, 4, 4, 23.5, 22) 
```

```{r daily_500_rf}
#| echo: false
#| message: false
#| warning: false

## Make sure that sourcing "R/plot_exploratory.R" was successful

set.seed(752)
daily_500_split <- initial_split(daily_500_rf, strata = site)
daily_500_train <- training(daily_500_split)
daily_500_test <- testing(daily_500_split)

set.seed(234)
# daily_500_folds 
daily_500_folds <- bootstraps(daily_500_train,
                              times = 100,
                              strata = gpp_dt_vut_ref)

ranger_recipe <- 
  recipe(formula = gpp_dt_vut_ref ~ ., data = daily_500_train) %>% 
  step_select(-site, -total_obs, skip = TRUE)

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

# Re-run the model only if no artifact was saved before.
if (fs::file_exists("models/daily_500_fit_site.rds") &&
    fs::file_exists("models/daily_500_site_ranger_tune.rds")) {
  
  daily_500_fit <- readRDS("models/daily_500_fit_site.rds")
  ranger_tune <- readRDS("models/daily_500_site_ranger_tune.rds")
  
} else {  
  doParallel::registerDoParallel()
  set.seed(6578)
  ranger_tune <-
    tune_grid(ranger_workflow, 
              resamples = daily_500_folds, 
              grid = 12)
  
  # Final fit
  final_rf <- ranger_workflow %>% 
    finalize_workflow(select_best(ranger_tune))
  
  daily_500_fit <- last_fit(final_rf, daily_500_split)
  
  ## last_fit is saved if no model has been trained and saved.
  saveRDS(ranger_tune, "models/daily_500_site_ranger_tune.rds")
  saveRDS(daily_500_fit, "models/daily_500_fit_site.rds")
}
```

```{r predictions_plot_daily_500_rf}
#| label: fig-daily_500_rf
#| fig-cap: "GPP observed and predicted values from the Random Forest model for all the sites on a daily basis. The red line represents a 1:1 relation. Metrics units are gC m⁻² d⁻¹"
#| fig-width: 6
#| fig-height: 4
#| echo: false
#| message: false
#| warning: false

# Explore RF results
## Check the metrics
metrics <- collect_metrics(daily_500_fit) 

## Collect predictions
plot_predictions_rf(daily_500_fit, metrics, 4, 4, 23.5, 22) 
```

```{r vip_plot_daily_500_rf}
#| label: fig-vip_daily_500_rf
#| fig-cap: "Variable importance derived from the random forest model for the daily values at 500 m spatial resolution."
#| echo: false
#| message: false
#| warning: false

## Feature importance
imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune)) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(daily_500_train) %>%
  extract_fit_parsnip() %>% 
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  theme_light(base_size = 12) +
  scale_x_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

```{r shapley_values_daily}
#| label: fig-shap_daily_500_rf
#| fig-cap: "Shapley values derived from the random forest model for the daily values at 500 m spatial resolution. The predicted value is 0.59 for the selected low GPP value (A) and 12.7 for the selected high GPP value (B)."
#| echo: false
#| message: false
#| warning: false

# Extract workflow to obtain shapley values (or run predict)
daily_gpp_model <- extract_workflow(daily_500_fit)

# Create an explainer for a regression model
explainer_rf <- explain_tidymodels(
  daily_gpp_model,
  data = daily_500_train,
  y = daily_500_train$gpp_dt_vut_ref,
  label = "rf",
  verbose = FALSE
)

# Take a low gpp value
low_gpp <- daily_500_train[71, ] 
low_gpp_pred <- predict(daily_gpp_model, low_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = low_gpp,
                              type = "shap")
# rf_breakdown
low_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>% 
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

# Take a high gpp value
high_gpp <- daily_500_train[578, ] 
high_gpp_pred <- predict(daily_gpp_model, high_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = high_gpp,
                              type = "shap")
# rf_breakdown
high_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>%
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

plot_grid(low_value,
          high_value,
          nrow = 1,
          labels = c('A', 'B'),
          vjust = 1)
```

<!-- #### Weekly 500 -->

```{r weekly_500_rf}
#| echo: false
#| message: false
#| warning: false
set.seed(125)

weekly_500_split <- initial_split(weekly_500_rf, strata = site)
weekly_500_train <- training(weekly_500_split)
weekly_500_test <- testing(weekly_500_split)

set.seed(2389)
weekly_500_folds <- bootstraps(weekly_500_train,
                              times = 100,
                              strata = gpp_dt_vut_ref)
ranger_recipe <- 
  recipe(formula = gpp_dt_vut_ref ~ ., data = weekly_500_train) %>% 
  step_select(-site, -total_obs, skip = TRUE)

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

# Re-run the model only if no artifact was saved before.
if (fs::file_exists("models/weekly_500_fit_site.rds") &&
    fs::file_exists("models/weekly_500_site_ranger_tune.rds")) {
  weekly_500_fit <- readRDS("models/weekly_500_fit_site.rds")
  ranger_tune <- readRDS("models/weekly_500_site_ranger_tune.rds")
} else {
  doParallel::registerDoParallel()
  set.seed(1297)
  ranger_tune <-
    tune_grid(ranger_workflow, 
              resamples = weekly_500_folds, 
              grid = 12)
  
  # Final fit
  final_rf <- ranger_workflow %>% 
    finalize_workflow(select_best(ranger_tune))
  
  weekly_500_fit <- last_fit(final_rf, weekly_500_split)
  
  ## last_fit is saved if no model has been trained and saved.
  saveRDS(ranger_tune, "models/weekly_500_site_ranger_tune.rds")
  saveRDS(weekly_500_fit, "models/weekly_500_fit_site.rds")
}
```

```{r predictions_plot_weekly_500_rf}
#| label: fig-weekly_500_rf
#| fig-cap: "GPP observed and predicted values from the Random Forest model for all the sites on a weekly basis. The red line represents a 1:1 relation. Metrics units are gC m⁻² d⁻¹"
#| fig-width: 6
#| fig-height: 4
#| echo: false
#| message: false
#| warning: false

# Explore RF results
## Check the metrics
metrics <- collect_metrics(weekly_500_fit) 

## Collect predictions
plot_predictions_rf(weekly_500_fit, metrics, 5, 5, 18, 19)
```

```{r vip_plot_weekly_500_rf}
#| label: fig-vip_weekly_500_rf
#| fig-cap: "Variable importance derived from the random forest model for the weekly values at 500 m spatial resolution."
#| echo: false
#| message: false
#| warning: false

## Feature importance
imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune, metric = "rmse")) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(weekly_500_train) %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  theme_classic(base_size = 12) +
  scale_x_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

```{r shapley_values_weekly}
#| label: fig-shap_weekly_500_rf
#| fig-cap: "Shapley values derived from the random forest model for weekly values at 500 m spatial resolution. The predicted value is 1.59 for the selected low-GPP observation (A) and 11.0 for the selected high-GPP observation (B)."
#| echo: false
#| message: false
#| warning: false

# Extract workflow to obtain shapley values (or run predict)
weekly_gpp_model <- extract_workflow(weekly_500_fit)

# Create an explainer for a regression model
explainer_rf <- explain_tidymodels(
  weekly_gpp_model,
  data = weekly_500_train,
  y = weekly_500_train$gpp_dt_vut_ref,
  label = "rf",
  verbose = FALSE
)

# Take a low gpp value
low_gpp <- weekly_500_train[16, ] 
low_gpp_pred <- predict(weekly_gpp_model, low_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = low_gpp,
                              type = "shap")
# rf_breakdown
low_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>%
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

# Take a high gpp value
high_gpp <- weekly_500_train[167, ] 
high_gpp_pred <- predict(weekly_gpp_model, high_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = high_gpp,
                              type = "shap")
# rf_breakdown
high_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>%
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

plot_grid(low_value,
          high_value,
          nrow = 1,
          labels = c('A', 'B'),
          vjust = 1)
```

<!-- #### Monthly -->

```{r monthly_500_rf}
#| echo: false
#| message: false
#| warning: false
set.seed(973)

monthly_500_split <- initial_split(monthly_500_rf, strata = site)
monthly_500_train <- training(monthly_500_split)
monthly_500_test <- testing(monthly_500_split)

# monthly_500_folds 
set.seed(365)
monthly_500_folds <- bootstraps(monthly_500_train,
                                times = 100,
                                strata = gpp_dt_vut_ref)

ranger_recipe <- 
  recipe(formula = gpp_dt_vut_ref ~ ., data = monthly_500_train) %>% 
  step_select(-site, -total_obs, skip = TRUE)

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

# Re-run the model only if no artifact was saved before.
if (fs::file_exists("models/monthly_500_fit_site.rds")) {
  monthly_500_fit <- readRDS("models/monthly_500_fit_site.rds")
  ranger_tune <- readRDS("models/monthly_500_site_ranger_tune.rds")
} else {
  set.seed(3159)
  
  doParallel::registerDoParallel()
  ranger_tune <-
    tune_grid(ranger_workflow, 
              resamples = monthly_500_folds, 
              grid = 12)
  
  # Final fit, using the hyperparameters with the lowest resampled RMSE
  final_rf <- ranger_workflow %>% 
    finalize_workflow(select_best(ranger_tune, metric = "rmse"))
  
  monthly_500_fit <- last_fit(final_rf, monthly_500_split)
  
  ## last_fit is saved if no model has been trained and saved.
  saveRDS(ranger_tune, "models/monthly_500_site_ranger_tune.rds")
  saveRDS(monthly_500_fit, "models/monthly_500_fit_site.rds")
}
```

```{r predictions_plot_monthly_500_rf}
#| label: fig-monthly_500_rf
#| fig-cap: "Observed and predicted GPP from the random forest model for all sites at a monthly time step. The red line represents the 1:1 relationship. RMSE is in gC m⁻² d⁻¹."
#| fig-width: 6
#| fig-height: 4
#| echo: false
#| message: false
#| warning: false

# Explore RF results
## Check the metrics
metrics <- collect_metrics(monthly_500_fit) 

## Collect predictions
plot_predictions_rf(monthly_500_fit, metrics, 5, 5, 13, 14)
```

```{r vip_plot_monthly_500_rf}
#| label: fig-vip_monthly_500_rf
#| fig-cap: "Variable importance derived from the random forest model for monthly values at 500 m spatial resolution."
#| echo: false
#| message: false
#| warning: false

## Feature importance
imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune, metric = "rmse")) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(monthly_500_train) %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) +
  theme_light(base_size = 12) +
  scale_x_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

```{r shapley_values_monthly}
#| label: fig-shap_monthly_500_rf
#| fig-cap: "Shapley values derived from the random forest model for monthly values at 500 m spatial resolution. The predicted value is 3.31 for the selected low-GPP observation (A) and 10.6 for the selected high-GPP observation (B)."
#| echo: false
#| message: false
#| warning: false

# Extract workflow to obtain shapley values (or run predict)
monthly_gpp_model <- extract_workflow(monthly_500_fit)

# Create an explainer for a regression model
explainer_rf <- explain_tidymodels(
  monthly_gpp_model,
  data = monthly_500_train,
  y = monthly_500_train$gpp_dt_vut_ref,
  label = "rf",
  verbose = FALSE
)

# Take a low gpp value
low_gpp <- monthly_500_train[45, ] 
low_gpp_pred <- predict(monthly_gpp_model, low_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = low_gpp,
                              type = "shap")
# rf_breakdown
low_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>%
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

# Take a high gpp value
high_gpp <- monthly_500_train[6, ] 
high_gpp_pred <- predict(monthly_gpp_model, high_gpp)

rf_breakdown <- predict_parts(explainer = explainer_rf, 
                              new_observation = high_gpp,
                              type = "shap")
# rf_breakdown
high_value <- rf_breakdown %>%
  group_by(variable) %>%
  mutate(mean_val = mean(contribution)) %>%
  ungroup() %>%
  mutate(variable = case_when(
    str_detect(variable, "ndvi_mean") ~ str_replace(variable, "ndvi_mean", "NDVI"),
    str_detect(variable, "nirv_mean") ~ str_replace(variable, "nirv_mean", "NIRv"),
    str_detect(variable, "evi_mean" ) ~ str_replace(variable, "evi_mean", "EVI" ),
    str_detect(variable, "cci_mean" ) ~ str_replace(variable, "cci_mean", "CCI" ),
    str_detect(variable, "sur_refl_b01_mean") ~ str_replace(variable, "sur_refl_b01_mean", "B01"),
    str_detect(variable, "sur_refl_b02_mean") ~ str_replace(variable, "sur_refl_b02_mean", "B02"),
    str_detect(variable, "sur_refl_b03_mean") ~ str_replace(variable, "sur_refl_b03_mean", "B03"),
    str_detect(variable, "sur_refl_b04_mean") ~ str_replace(variable, "sur_refl_b04_mean", "B04"),
    str_detect(variable, "sur_refl_b05_mean") ~ str_replace(variable, "sur_refl_b05_mean", "B05"),
    str_detect(variable, "sur_refl_b06_mean") ~ str_replace(variable, "sur_refl_b06_mean", "B06"),
    str_detect(variable, "sur_refl_b07_mean") ~ str_replace(variable, "sur_refl_b07_mean", "B07"),
    str_detect(variable, "total_obs") ~ str_replace(variable, "total_obs", "Total obs."),
    str_detect(variable, "site") ~ str_replace(variable, "site", "Site"),
    str_detect(variable, "gpp_dt_vut_ref") ~ str_replace(variable, "gpp_dt_vut_ref", "GPP"),
    .default = variable
  )) %>%
  mutate(variable = fct_reorder(variable, abs(mean_val))) %>%
  ggplot(aes(contribution, variable, fill = mean_val > 0)) +
  geom_col(data = ~distinct(., variable, mean_val), 
           aes(mean_val, variable), 
           alpha = 0.5) +
  geom_boxplot(width = 0.5) +
  theme_light() +
  theme(legend.position = "none") +
  scale_fill_viridis_d() +
  labs(y = NULL)

plot_grid(low_value,
          high_value,
          nrow = 1,
          labels = c('A', 'B'),
          vjust = 1)
```

\newpage

### The potential of AutoML approaches for GPP predictions

The AutoML approach yielded varying performance across temporal scales, as
measured by R² and RMSE. The monthly model
(@fig-predictions_automl_monthly_500) performed best, with an R² of 0.76 and the
lowest RMSE, $1.84 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. The weekly model
(@fig-predictions_automl_weekly_500) had the second-highest explanatory power
(R² = 0.72), but with a considerably higher RMSE of
$3.08 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. The daily model
(@fig-predictions_automl_500) performed worst, with an R² of 0.67 and the
highest RMSE, $3.11 \, \mathrm{gC \, m^{-2} \, d^{-1}}$.

An examination of variable importance in the AutoML model revealed which
variables contributed most to GPP prediction at each temporal scale. In the
daily model, EVI and CCI were the most important variables
(@fig-vip_daily_500_automl). At the weekly scale, EVI and CCI remained
important, joined by Band 02 (the NIR band) (@fig-vip_daily_500_automl_weekly).
In the monthly model, EVI again ranked first, accompanied by Band 02 and NIRv
(@fig-vip_monthly_500_automl_monthly). EVI was thus among the most important
variables at all three temporal scales.

Additionally, a comparison between the regression random forest model and the
AutoML model, both applied to the same datasets, revealed differences in their
predictive performance (@tbl-summary_ml_metrics). The regression random forest
model had higher R² values for the daily and monthly predictions, indicating
that it explained more of the variance, whereas the AutoML model had lower RMSE
at those scales. For the weekly predictions, the AutoML model outperformed the
random forest in both metrics. These findings underscore the importance of
considering multiple metrics and temporal scales when evaluating and selecting
models for GPP predictions.

| Temporal scale | RF R² | RF RMSE | AutoML R² | AutoML RMSE |
|----------------|-------|---------|-----------|-------------|
| Daily          | 0.70  | 3.17    | 0.67      | 3.11        |
| Weekly         | 0.55  | 3.23    | 0.72      | 3.08        |
| Monthly        | 0.81  | 2.03    | 0.76      | 1.84        |

: Summary of ML metrics. R² is unitless; RMSE is in gC m^-2^ d^-1^. {#tbl-summary_ml_metrics}
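
The differences between the two methods can be recomputed directly from the
rounded values in the summary table. The following is a minimal sketch in base
R; the data frame and its column names are introduced here for illustration
only and are not objects used elsewhere in this chapter.

```r
# Values copied from the summary table (R² is unitless; RMSE in gC m^-2 d^-1)
ml_metrics <- data.frame(
  scale       = c("Daily", "Weekly", "Monthly"),
  rf_r2       = c(0.70, 0.55, 0.81),
  rf_rmse     = c(3.17, 3.23, 2.03),
  automl_r2   = c(0.67, 0.72, 0.76),
  automl_rmse = c(3.11, 3.08, 1.84)
)

# Positive values mean that AutoML has the higher R² or the lower RMSE
ml_metrics$r2_gain_automl   <- round(ml_metrics$automl_r2 - ml_metrics$rf_r2, 2)
ml_metrics$rmse_drop_automl <- round(ml_metrics$rf_rmse - ml_metrics$automl_rmse, 2)

ml_metrics[, c("scale", "r2_gain_automl", "rmse_drop_automl")]
```

With these rounded inputs, AutoML gains 0.17 in R² at the weekly scale and is
0.03 (daily) and 0.05 (monthly) below the random forest, while its RMSE is lower
at every scale. Differences computed from unrounded metrics may deviate slightly
in the last digit.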

<!-- #### Daily autoML -->

```{r}
#| label: fig-predictions_automl_500
#| fig-cap: "Observed and predicted GPP from the AutoML model for all sites at a daily time step. The red line represents the 1:1 relationship. RMSE is in gC m⁻² d⁻¹."
#| echo: false
#| message: false
#| warning: false

predictions_automl <- readRDS("models/predictions_automl.rds")
perf <- readRDS("models/performance_automl.rds")

rsq <- h2o.r2(perf)
rmse <- h2o.rmse(perf)

plot_predictions_automl(predictions_automl, rmse, rsq, 3, 3, 18, 20)
```

```{r automl_importance_variable}
#| label: fig-vip_daily_500_automl
#| fig-cap: "Variable importance derived from the AutoML model for daily values at 500 m spatial resolution."
#| fig-width: 7
#| fig-height: 5
#| echo: false
#| message: false
#| warning: false

readRDS("models/automl_va_plot.rds") +
  theme_classic(base_size = 12) +
  scale_fill_viridis_c(direction = -1) +
  labs(title = NULL, x = "Model ID") +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  scale_y_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

<!-- #### Weekly autoML -->

```{r}
#| label: fig-predictions_automl_weekly_500
#| fig-cap: "Observed and predicted GPP from the AutoML model for all sites at a weekly time step. The red line represents the 1:1 relationship. RMSE is in gC m⁻² d⁻¹."
#| echo: false
#| message: false
#| warning: false

predictions_automl <- readRDS("models/predictions_automl_weekly.rds")
perf <- readRDS("models/performance_automl_weekly.rds")

rsq <- h2o.r2(perf)
rmse <- h2o.rmse(perf)

plot_predictions_automl(predictions_automl, rmse, rsq, 4, 4, 18, 20)
```

```{r automl_importance_variable_weekly}
#| label: fig-vip_daily_500_automl_weekly
#| fig-cap: "Variable importance derived from the AutoML model for weekly values at 500 m spatial resolution."
#| fig-width: 7
#| fig-height: 5
#| echo: false
#| message: false
#| warning: false

readRDS("models/automl_va_plot_weekly.rds") +
  theme_classic(base_size = 12) +
  scale_fill_viridis_c(direction = -1) +
  labs(title = NULL, x = "Model ID") +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  scale_y_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

<!-- #### Monthly autoML -->

```{r}
#| label: fig-predictions_automl_monthly_500
#| fig-cap: "Observed and predicted GPP from the AutoML model for all sites at a monthly time step. The red line represents the 1:1 relationship. RMSE is in gC m⁻² d⁻¹."
#| echo: false
#| message: false
#| warning: false
predictions_automl <- readRDS("models/predictions_automl_monthly.rds")
perf <- readRDS("models/performance_automl_monthly.rds")

rsq <- h2o.r2(perf)
rmse <- h2o.rmse(perf)

plot_predictions_automl(predictions_automl, rmse, rsq, 5, 5, 12, 13)
```

```{r automl_importance_variable_monthly}
#| label: fig-vip_monthly_500_automl_monthly
#| fig-cap: "Variable importance derived from the AutoML model for monthly values at 500 m spatial resolution."
#| fig-width: 7
#| fig-height: 5
#| echo: false
#| message: false
#| warning: false
readRDS("models/automl_va_plot_monthly.rds") +
  theme_classic(base_size = 12) +
  scale_fill_viridis_c(direction = -1) +
  labs(title = NULL, x = "Model ID") +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  scale_y_discrete(labels = c("ndvi_mean" = "NDVI",
                              "nirv_mean" = "NIRv",
                              "evi_mean" = "EVI",
                              "cci_mean" = "CCI",
                              "sur_refl_b01_mean" = "B01",
                              "sur_refl_b02_mean" = "B02",
                              "sur_refl_b03_mean" = "B03",
                              "sur_refl_b04_mean" = "B04",
                              "sur_refl_b05_mean" = "B05",
                              "sur_refl_b06_mean" = "B06",
                              "sur_refl_b07_mean" = "B07"))
```

\newpage

## Discussion

These results underscore the temporal variations in model performance, with the
monthly models having the best performance metrics for both the regression
random forest and AutoML. For the monthly predictions, the lower RMSE of the
AutoML model indicates a smaller typical prediction error than RF, whereas the
higher R² of the regression RF indicates that it captures a larger proportion
of the variance in GPP. This discrepancy underscores the importance of
considering multiple evaluation metrics when assessing model performance. The
choice between these models may depend on the specific goals of the analysis,
weighing explained variability against the magnitude of the prediction errors.
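
To illustrate why R² and RMSE can rank models differently, the sketch below
uses hypothetical observed values and hypothetical predictions that are
compressed toward the mean. The helper functions are written here for
illustration and are not part of the modelling pipeline. Note that the `rsq`
metric returned by default for regression in tidymodels is the squared
correlation, which is insensitive to this kind of compression; which
definition other libraries report should be checked in their documentation.

```r
# Hypothetical observed GPP values (illustrative, not data from this study)
observed <- c(1, 3, 5, 8, 12, 16, 20, 24)

# Hypothetical predictions compressed toward the mean: low values are
# overestimated and high values are underestimated
predicted <- mean(observed) + 0.6 * (observed - mean(observed))

# Root mean squared error, in the units of the response
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R² as the squared Pearson correlation
rsq_cor <- function(obs, pred) cor(obs, pred)^2

# R² as 1 - SSE / SST (coefficient of determination)
rsq_trad <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

c(rmse     = rmse(observed, predicted),
  rsq_cor  = rsq_cor(observed, predicted),
  rsq_trad = rsq_trad(observed, predicted))
```

In this toy case the squared correlation equals 1 even though every prediction
away from the mean is wrong, while the RMSE and the 1 - SSE/SST form both
register the error. This is one reason to report an error metric alongside R².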

Examining the six individual predictions analyzed through Shapley values for
the Random Forest models across all time frames showed that the selected high
values of Gross Primary Productivity (GPP) were consistently underestimated,
while the selected low GPP values were overestimated. These predictions and
Shapley values are specific to each instance and do not represent all possible
predictions. However, the plots of predicted against observed GPP show the same
pattern. In the daily model in particular, the maximum predicted values are
around $16 \, \mathrm{gC \, m^{-2} \, d^{-1}}$, whereas observed values reach up
to $24 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ (see @fig-daily_500_rf). These high
observed values all belong to the Borden site, where the range of observed GPP
is notably higher than at the other two sites.
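
The instance-specific nature of Shapley values can be seen in a minimal sketch
with two predictors, where the exact values can be computed by averaging over
both possible variable orderings. The background data, the `toy_model` function
and the helper `value()` are hypothetical and serve only as an illustration;
they are not the data, model or explainer used in this chapter.

```r
# Hypothetical background data and model (not the data or model of this study)
background <- data.frame(
  evi = c(0.2, 0.4, 0.6, 0.8),
  cci = c(-0.1, 0.0, 0.1, 0.2)
)
toy_model <- function(d) 2 + 10 * d$evi + 5 * d$cci + 8 * d$evi * d$cci

# Observation to explain
instance <- data.frame(evi = 0.7, cci = 0.15)

# Mean prediction when the features in `known` are fixed at the instance
# values and the remaining features keep their background values
value <- function(known) {
  d <- background
  for (v in known) d[[v]] <- instance[[v]]
  mean(toy_model(d))
}

# Exact Shapley values for two features: average the marginal contribution
# of each feature over both possible orderings
phi_evi <- 0.5 * (value("evi") - value(character(0))) +
  0.5 * (value(c("evi", "cci")) - value("cci"))
phi_cci <- 0.5 * (value("cci") - value(character(0))) +
  0.5 * (value(c("evi", "cci")) - value("evi"))

baseline   <- value(character(0))  # mean prediction over the background data
prediction <- toy_model(instance)  # prediction for the explained observation

c(baseline = baseline, phi_evi = phi_evi, phi_cci = phi_cci,
  prediction = prediction)
```

The two contributions add up to the difference between the prediction for this
instance and the mean prediction, so a different instance yields different
contributions. With more predictors, enumerating all orderings becomes
impractical and implementations typically average over a sample of orderings,
which would explain a spread of contributions per variable.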

The discrepancy between the observed and predicted values may arise from a lag
between changes in photosynthetic rate and changes in the concentration of
photosynthetic pigments, particularly chlorophyll. Because the constructed
models predict GPP solely from reflectance values, the main GPP-related changes
they can capture are APAR and, to some extent, chlorophyll concentration
[@pabon-moreno_potential_2022]. Given that photosynthesis can change rapidly
without significant alterations in pigment concentrations, the predicted values
may overestimate the GPP estimated from EC measurements at the site. Under
stress conditions, GPP dynamics could shift without being detectable from
satellite reflectance values alone.

Conversely, underestimation may be attributed to the well-known phenomenon of
saturation. When indices based on the NIR band are used for GPP estimation,
challenges arise where vegetation biomass increases substantially, as happens
during summer at the study sites. In such situations, the dense biomass
increases the scattering and reflection of radiation. While the NIR band is
sensitive to changes in vegetation structure and density, this sensitivity
declines as the amount of biomass increases. The result is saturation, where
the sensor reaches its maximum capacity to detect changes in reflectance
values.

Saturation occurs because the dense vegetation causes a greater proportion of
the incoming radiation to be scattered or reflected, particularly in the NIR
spectrum [@camps-valls_unified_2021-2]. As a consequence, the sensor becomes
saturated, meaning that further increases in biomass or productivity do not
translate proportionally into higher measured reflectance values. This limits
the sensor's ability to capture and differentiate changes in the productivity of
vegetation beyond a certain threshold.
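
A related loss of sensitivity can be illustrated at the level of a ratio-based
index, rather than the sensor itself, using the standard NDVI formula
$(\mathrm{NIR} - \mathrm{red}) / (\mathrm{NIR} + \mathrm{red})$. The reflectance
values below are hypothetical and only meant to mimic a canopy that becomes
progressively denser; they are not MODIS observations from the study sites.

```r
# Hypothetical surface reflectances for an increasingly dense canopy
# (illustrative values only, not data from the study sites)
red <- c(0.08, 0.05, 0.03, 0.02)  # red reflectance (MODIS band 1)
nir <- c(0.25, 0.35, 0.45, 0.55)  # NIR reflectance (MODIS band 2)

# Normalized difference vegetation index
ndvi <- (nir - red) / (nir + red)

# Change in NDVI between consecutive canopy states
ndvi_step <- c(NA, diff(ndvi))

data.frame(red, nir, ndvi = round(ndvi, 3), ndvi_step = round(ndvi_step, 3))
```

Under these assumed values, NDVI keeps increasing but each step adds less than
the previous one as the index approaches its upper bound of 1, so equal
increments in NIR reflectance translate into progressively smaller changes in
the index.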

For the monthly model, AutoML explains an acceptable share of the variability
(R² = 0.76), with Random Forest surpassing it by 0.05. For the weekly and daily
models, however, the indices and MODIS bands used here appear insufficient to
capture the inherent variability in the data. Additional predictor variables
may be needed to enhance predictive capability.

Regarding whether a Random Forest (RF) model or an AutoML model is preferable,
the results of this study do not show a marked advantage for either. The
performances are comparable; the only notable difference is in the weekly
model, where AutoML improves R² by 0.17. In the monthly model, RF has an R²
0.05 higher than AutoML, although AutoML reduces the RMSE by about
$0.2 \, \mathrm{gC \, m^{-2} \, d^{-1}}$. These differences suggest that the
choice between RF and AutoML may depend on specific considerations, emphasizing
the importance of assessing both explanatory power and error metrics for
comprehensive model evaluation.

\newpage

## Conclusions

In summary, the monthly models generated using both methods (RF and AutoML)
performed better than those based on weekly or daily temporal aggregations. For
the monthly models, RF had a slightly higher R² than AutoML (by 0.05), whereas
AutoML had a lower RMSE, indicating a smaller average prediction error. When VIs
were included alongside all bands in the GPP prediction models (both RF and
AutoML), the VIs were consistently the most influential variables in the
predictions, with CCI and EVI playing pivotal roles. Among the bands, B02 (NIR)
was the most important for predictions, surpassing the remaining bands.

The Shapley values for the RF model indicated a tendency to underestimate high
GPP values and overestimate low GPP values. This tendency was also visible in
the scatterplots of predicted versus observed values for the different temporal
aggregations. When comparing both methods (RF and AutoML), neither is clearly
superior: the differences are small, except for the weekly models, where AutoML
achieved an R² higher by 0.17 and a lower RMSE.

\newpage

## References

::: {#refs}
:::