Commit

Merge branch 'develop' into siteID-refactor

Sweetdevil144 authored Aug 16, 2024
2 parents 903efc9 + bb2cda9 commit 14950aa
Showing 3 changed files with 200 additions and 67 deletions.
45 changes: 45 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,45 @@
# Contributor Covenant Code of Conduct

**Our Pledge**

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

**Our Standards**

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting



**Our Responsibilities**

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

**Scope**

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

**Enforcement**

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at pecanproj[at]gmail.com. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

**Attribution**

This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org/) version 1.4, available at [http://contributor-covenant.org/version/1/4](http://contributor-covenant.org/version/1/4/).
15 changes: 7 additions & 8 deletions DEV-INTRO.md
@@ -78,7 +78,6 @@ You can copy the [`docker/env.example`](docker/env.example) file as .env in your
cp docker/env.example .env
```


The variables we want to modify are:

- `COMPOSE_PROJECT_NAME`, the prefix for all containers. Set this to "pecan".
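
A minimal sketch of the corresponding `.env` line (other variables from [`docker/env.example`](docker/env.example) follow the same `KEY=value` pattern):

```bash
# .env -- sketch; only the variable named above is shown
COMPOSE_PROJECT_NAME=pecan
```
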
@@ -181,13 +180,13 @@ Next copy the R packages from a container to volume `pecan_lib`. This is not rea

You can copy all the data using the following command. This will copy all compiled packages to your local machine.

```
```bash
docker run -ti --rm -v pecan_R_library:/rlib pecan/base:develop cp -a /usr/local/lib/R/site-library/. /rlib/
```

If you have set a custom UID or GID in your `.env`, change ownership of these files as described above for the data volume. E.g. if you use the same UID in the containers as on your host machine, run:

```
```bash
docker run -ti --rm -v pecan_R_library:/rlib pecan/base:develop chown -R "$(id -u):$(id -g)" /rlib/
```

@@ -210,7 +209,7 @@ For Windows
copy docker\web\config.docker.php web\config.php
```

## PEcAn Development
## PEcAn Development Setup

To begin development we first have to bring up the full PEcAn stack. This assumes you have already completed the steps above. You don't need to stop any running containers; you can use the following command to start all containers. At this point you have PEcAn running in Docker.
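
A minimal sketch of that startup command, assuming the compose file at the repository root and the `.env` created earlier (so `COMPOSE_PROJECT_NAME` is picked up automatically):

```bash
# Start all PEcAn containers in the background;
# the project name is read from COMPOSE_PROJECT_NAME in .env
docker compose up -d
```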

@@ -239,13 +238,13 @@ R CMD ../web/workflow.R --settings docker.sipnet.xml

A better way of doing this was developed as part of GSoC: you can leverage the RESTful interface it defines, or use the new R PEcAn API package.

# PEcAn URLs
## PEcAn URLs

You can check the RabbitMQ server used by PEcAn at <https://rabbitmq.pecan.localhost> on the same server that the docker stack is running on. You can use RStudio either at <http://server/rstudio> or at <http://rstudio.pecan.localhost>. To check the traefik dashboard you can use <http://traefik.pecan.localhost>.

If the stack is running on a remote machine, you can use ssh port forwarding to connect to the server. For example, `ssh -L 8000:localhost:80` will allow you to use <http://rabbitmq.pecan.localhost:8000/> in your browser to connect to RabbitMQ on the remote PEcAn server.
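
A complete form of that command might look as follows; `user@remote-host` is a placeholder for your own server:

```bash
# Forward local port 8000 to port 80 on the remote docker host
ssh -L 8000:localhost:80 user@remote-host
```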

# Directory Structure
## Directory Structure

Following are the main folders inside the pecan repository.

@@ -281,9 +280,9 @@ Some of the docker build files. The Dockerfiles for each model are placed in the

Small scripts that are used as part of the development and installation of PEcAn.

# Advanced Development Options
## Advanced Development Options

## Reset all containers/database
### Reset all containers/database

If you want to start from scratch and remove all old data, but keep your pecan checked-out folder, you can remove the folders where you have written the data (see `folders` below). You will also need to remove any of the docker-managed volumes. To see all volumes you can do `docker volume ls -q -f name=pecan`. If you are sure, you can either remove them one by one, or remove them all at once using the command below. **THIS DESTROYS ALL DATA IN DOCKER MANAGED VOLUMES.**
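
A one-line sketch of that bulk removal, built from the `docker volume ls -q -f name=pecan` filter above:

```bash
# DESTRUCTIVE: removes every docker-managed volume whose name matches "pecan"
docker volume rm $(docker volume ls -q -f name=pecan)
```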

207 changes: 148 additions & 59 deletions modules/assim.sequential/R/downscale_function.R
@@ -62,6 +62,38 @@ SDA_downscale_preprocess <- function(data_path, coords_path, date, carbon_pool)
return(list(input_data = input_data, site_coordinates = site_coordinates, carbon_data = carbon_data))
}

##' @title Create folds function
##' @name create_folds
##' @author Sambhav Dixit
##'
##' @param y Vector. A vector of outcome data or indices.
##' @param k Numeric. The number of folds to create.
##' @param list Logical. If TRUE, returns a list of fold indices. If FALSE, returns a vector.
##' @param returnTrain Logical. If TRUE, returns indices for training sets. If FALSE, returns indices for test sets.
##' @details This function creates k-fold indices for cross-validation. It can return either training or test set indices, and the output can be in list or vector format.
##'
##' @description This function generates k-fold indices for cross-validation, allowing for flexible output formats.
##'
##' @return A list of k elements (if list = TRUE), each containing indices for a fold, or a vector of indices (if list = FALSE).

create_folds <- function(y, k, list = TRUE, returnTrain = FALSE) {
n <- length(y)
indices <- seq_len(n)
folds <- split(indices, cut(seq_len(n), breaks = k, labels = FALSE))

if (returnTrain) {
folds <- lapply(folds, function(x) indices[-x]) # return training indices instead
} # otherwise folds already hold the test indices

if (!list) {
folds <- unlist(folds)
}

return(folds)
}
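
A quick, hypothetical illustration of what `create_folds` returns for ten observations and five folds:

```r
folds <- create_folds(y = seq_len(10), k = 5, list = TRUE, returnTrain = FALSE)
str(folds)
#> List of 5
#>  $ 1: int [1:2] 1 2
#>  $ 2: int [1:2] 3 4
#>  $ 3: int [1:2] 5 6
#>  $ 4: int [1:2] 7 8
#>  $ 5: int [1:2] 9 10
```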

##' @title SDA Downscale Function
##' @name SDA_downscale
##' @author Joshua Ploshay, Sambhav Dixit
@@ -140,84 +172,141 @@ SDA_downscale <- function(preprocessed, date, carbon_pool, covariates, model_typ
predictions[[i]] <- stats::predict(models[[i]], test_data)
}
} else if (model_type == "cnn") {
# Number of cross-validation folds and of bagged models per fold;
# k_folds * num_bags CNN models are trained for each ensemble member
k_folds <- 5
num_bags <- 5

# Reshape input data for CNN
x_train <- keras3::array_reshape(x_train, c(nrow(x_train), 1, ncol(x_train)))
x_test <- keras3::array_reshape(x_test, c(nrow(x_test), 1, ncol(x_test)))

for (i in seq_along(carbon_data)) {
# Define the CNN model architecture
# Used dual batch normalization and dropout as the first set of batch normalization and dropout operates on the lower-level features extracted by the convolutional layer, the second set works on the higher-level features learned by the dense layer.
model <- keras3::keras_model_sequential() |>
# 1D Convolutional layer: Extracts local features from input data
keras3::layer_conv_1d(filters = 64, kernel_size = 1, activation = 'relu', input_shape = c(1, length(covariate_names))) |>
# Batch normalization: Normalizes layer inputs, stabilizes learning, reduces internal covariate shift
keras3::layer_batch_normalization() |>
# Dropout: Randomly sets some of inputs to 0, reducing overfitting and improving generalization
keras3::layer_dropout(rate = 0.3) |>
# Flatten: Converts 3D output to 1D for dense layer input
keras3::layer_flatten() |>
# Dense layer: Learns complex combinations of features
keras3::layer_dense(units = 64, activation = 'relu') |>
# Second batch normalization: Further stabilizes learning in deeper layers
keras3::layer_batch_normalization() |>
# Second dropout: Additional regularization to prevent overfitting in final layers
keras3::layer_dropout(rate = 0.3) |>
# Output layer: Single neuron for regression prediction
keras3::layer_dense(units = 1)
all_models <- list()

# Learning rate scheduler
lr_schedule <- keras3::learning_rate_schedule_exponential_decay(
initial_learning_rate = 0.001,
decay_steps = 1000,
decay_rate = 0.9
)
# Create k-fold indices
fold_indices <- create_folds(y = seq_len(nrow(x_train)), k = k_folds, list = TRUE, returnTrain = FALSE)

# Compile the model
model |> keras3::compile(
loss = 'mean_squared_error',
optimizer = keras3::optimizer_adam(learning_rate = lr_schedule),
metrics = c('mean_absolute_error')
)

# Early stopping callback
early_stopping <- keras3::callback_early_stopping(
monitor = 'val_loss',
patience = 10,
restore_best_weights = TRUE
)
# Initialise operations for each fold
for (fold in 1:k_folds) {
cat(sprintf("Processing ensemble %d, fold %d of %d\n", i, fold, k_folds))

# Split data into training and validation sets for this fold
train_indices <- setdiff(seq_len(nrow(x_train)), fold_indices[[fold]])
val_indices <- fold_indices[[fold]]

x_train_fold <- x_train[train_indices, , drop = FALSE]
y_train_fold <- y_train[train_indices, i]
x_val_fold <- x_train[val_indices, , drop = FALSE]
y_val_fold <- y_train[val_indices, i]

# Create bagged models for this fold
fold_models <- list()
for (bag in 1:num_bags) {
# Create bootstrap sample
bootstrap_indices <- sample(1:nrow(x_train_fold), size = nrow(x_train_fold), replace = TRUE)
x_train_bag <- x_train_fold[bootstrap_indices, ]
y_train_bag <- y_train_fold[bootstrap_indices]

# Define the CNN model architecture
# Dual batch normalization and dropout are used: the first set operates on the lower-level features extracted by the convolutional layer, the second on the higher-level features learned by the dense layers.
model <- keras3::keras_model_sequential() |>
# Reshape: recast each row of covariates as a (features, 1, 1) "image" for the convolutional layer
keras3::layer_reshape(target_shape = c(ncol(x_train), 1, 1), input_shape = ncol(x_train)) |>
# 2D convolutional layer with a (3, 1) kernel, effectively a 1D convolution extracting local features across covariates
keras3::layer_conv_2d(
filters = 32,
kernel_size = c(3, 1),
activation = 'relu',
padding = 'same'
) |>
# Flatten: Converts 3D output to 1D for dense layer input
keras3::layer_flatten() |>
# Dense layer: Learns complex combinations of features
keras3::layer_dense(
units = 64,
activation = 'relu',
kernel_regularizer = keras3::regularizer_l2(0.01)
) |>
# Batch normalization: Normalizes layer inputs, stabilizes learning, reduces internal covariate shift
keras3::layer_batch_normalization() |>
# Dropout: Randomly sets some of inputs to 0, reducing overfitting and improving generalization
keras3::layer_dropout(rate = 0.3) |>
# Dense layer: Learns complex combinations of features
keras3::layer_dense(
units = 32,
activation = 'relu',
kernel_regularizer = keras3::regularizer_l2(0.01)
) |>
# Batch normalization: Further stabilizes learning in deeper layers
keras3::layer_batch_normalization() |>
# Dropout: Additional regularization to prevent overfitting in final layer
keras3::layer_dropout(rate = 0.3) |>
# Output layer: Single neuron for regression prediction
keras3::layer_dense(
units = 1,
kernel_regularizer = keras3::regularizer_l2(0.01)
)

# Learning rate scheduler
lr_schedule <- keras3::learning_rate_schedule_exponential_decay(
initial_learning_rate = 0.001,
decay_steps = 1000,
decay_rate = 0.9
)

# Early stopping callback; monitors training loss, since the bagged fits below use no validation split
early_stopping <- keras3::callback_early_stopping(
monitor = 'loss',
patience = 10,
restore_best_weights = TRUE
)

# Train the model
model |> keras3::fit(
x = x_train,
y = y_train[, i],
epochs = 500, # Increased max epochs
batch_size = 32,
validation_split = 0.2,
callbacks = list(early_stopping),
verbose = 0
)
# Compile the model
model |> keras3::compile(
loss = 'mean_squared_error',
optimizer = keras3::optimizer_adam(learning_rate = lr_schedule),
metrics = c('mean_absolute_error')
)

# Store the trained model
models[[i]] <- model
# Train the model
model |> keras3::fit(
x = x_train_bag,
y = y_train_bag,
epochs = 500,
batch_size = 32,
callbacks = list(early_stopping),
verbose = 0
)

# CNN predictions
cnn_predict <- function(model, newdata, scaling_params) {
# Store the trained model for this bag in the fold_models list
fold_models[[bag]] <- model
}

# Add fold models to all_models list
all_models <- c(all_models, fold_models)
}

# Store all models for this ensemble
models[[i]] <- all_models

# Use all models for predictions
cnn_ensemble_predict <- function(models, newdata, scaling_params) {
newdata <- scale(newdata, center = scaling_params$mean, scale = scaling_params$sd)
newdata <- keras3::array_reshape(newdata, c(nrow(newdata), 1, ncol(newdata)))
predictions <- stats::predict(model, newdata)
return(as.vector(predictions))
predictions <- sapply(models, function(m) stats::predict(m, newdata))
return(rowMeans(predictions))
}
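
# Note: cnn_ensemble_predict averages the point predictions of all
# k_folds * num_bags (here 5 * 5 = 25) models via rowMeans(), i.e. a simple
# unweighted bagging ensemble over the fold/bootstrap models.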

# Create a prediction raster from covariates
prediction_rast <- terra::rast(covariates)

# Generate spatial predictions using the trained model ensemble
maps[[i]] <- terra::predict(prediction_rast, model = models[[i]],
fun = cnn_predict,
fun = cnn_ensemble_predict,
scaling_params = scaling_params)

# Make predictions on held-out test data
predictions[[i]] <- cnn_predict(models[[i]], x_data[-sample, ], scaling_params)
predictions[[i]] <- cnn_ensemble_predict(models[[i]], x_data[-sample, ], scaling_params)

}
} else {
stop("Invalid model_type. Please choose either 'rf' for Random Forest or 'cnn' for Convolutional Neural Network.")
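
For orientation, a hypothetical end-to-end call of the two functions this file defines; all paths and argument values below are illustrative placeholders:

```r
# Hypothetical usage sketch: paths, date, and pool name are placeholders
preprocessed <- SDA_downscale_preprocess(
  data_path = "path/to/sda_output.Rdata",
  coords_path = "path/to/site_coordinates.csv",
  date = "2021-07-15",
  carbon_pool = "AbvGrndWood"
)
result <- SDA_downscale(
  preprocessed,
  date = "2021-07-15",
  carbon_pool = "AbvGrndWood",
  covariates = terra::rast("path/to/covariates.tif"),
  model_type = "cnn"  # or "rf" for the random-forest branch
)
```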
