Code and infrastructure for building temporally-specific (pre- and post-2000) species distribution models for a variety of butterfly species. The project is linked to work and data in a long-term transect of butterfly observations across CA and NV, with the ultimate goal of providing infrastructure to make comparisons of distributions of all butterfly species in North America between the two time periods.
The project can be split into several components:
- Automated extraction of species occurrence records from iNaturalist and GBIF
- Pre-modeling data preparation (combining environmental rasters, cleaning, background point generation, splitting data into training and test sets)
- Model building
- Model evaluation
- Predictions, maps and visualizations
The steps outlined below are automated in the `build_sdm()` function, which outputs a master list of objects that serves as a complete step-wise record of the following functions. Minimal sketches of each step follow the list.
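For illustration, a hypothetical top-level call might look like the following; the argument names here are assumptions, not the function's actual signature:

```r
library(raster)

# Environmental predictors and occurrence records (paths are placeholders).
env <- raster::stack(list.files("data/env_rasters", pattern = "\\.tif$",
                                full.names = TRUE))
occs <- read.csv("data/occurrences/example_species.csv")

# Hypothetical call: returns a master list containing the output of
# every step described below.
sdm_out <- build_sdm(occ_data = occs, env_rasters = env)
```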
- The `prep_data()` function prepares individual occurrence data for further analysis by splitting it into the two time periods (T1, T2), generating 10,000 background points, cropping the environmental rasters to the extent of the occurrence data, and generating a SpatialPointsDataFrame object that is used in later steps (sketched below).
- The `run_block_cv()` function makes use of the blockCV package developed by R. Valavi. Here, we build a spatial grid across the area of interest and split the data into training (blocks 1-4) and test data (block 5) that we use to evaluate models (sketched below).
- The `prep_data_2()` function does another round of data manipulation, this time extracting the environmental data for the occurrence and background points, filtering out unwanted columns, and dropping missing data and bad values (-Inf) (sketched below).
- The `train_test_split()` function takes occurrence data and parses it into training and test sets, based on the spatial blocks set up by `run_block_cv()` (sketched below).
- The `model_func()` function is the main model-building and tuning function. It relies heavily on the ENMeval package developed by Bob Muscarella (via its `ENMevaluate()` function). The function first attempts to use the maxnet package framework to build a set of MaxEnt models using 5-fold cross-validation. If this fails (which can happen with the complex hinge models that are often generated from these occurrence data), the function falls back to the older maxent.jar framework, which works better with these models. Both approaches use parallel processing to speed up this part of the workflow (sketched below).
- The `best_mod()` function examines the models object generated by `model_func()` and extracts the 'best' model: the one whose hyperparameters produced the highest average AUC score on test data within the 5-fold cross-validation tuning procedure in `model_func()` (sketched below).
- The `evaluate_models()` function uses the 'best' model from the tuning procedure on training data to make predictions on the out-of-sample test data generated by `run_block_cv()`, which are evaluated using AUC scores. This gives a secondary measure of model performance on spatially-blocked data that was not used in model building, and provides an unbiased estimate of model performance (sketched below).
- Finally, a 'full' model is constructed using the `full_model()` function to generate predictions for use in maps and other downstream analyses. This full model takes on the hyperparameter values tuned in `model_func()` but is optimized on the full dataset (training + test) (sketched below).
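The sketches below illustrate what each step might look like internally; they are minimal, hedged examples (building on the hypothetical `env` and `occs` objects above), not the project's actual code. First, preparation in the spirit of `prep_data()`, assuming `year`, `longitude`, and `latitude` columns and a 1-degree crop margin:

```r
library(raster)
library(dismo)
library(sp)

# Split records into the two time periods on an assumed 'year' column.
occ_t1 <- occs[occs$year < 2000, ]
occ_t2 <- occs[occs$year >= 2000, ]

# Crop the environmental rasters to the occurrence extent plus a margin.
ext <- extent(min(occs$longitude) - 1, max(occs$longitude) + 1,
              min(occs$latitude) - 1, max(occs$latitude) + 1)
env_crop <- crop(env, ext)

# Generate 10,000 background points from the cropped rasters.
bg <- randomPoints(env_crop[[1]], n = 10000)

# Bundle presences into a SpatialPointsDataFrame for downstream steps.
occ_sp <- SpatialPointsDataFrame(coords = occ_t1[, c("longitude", "latitude")],
                                 data = occ_t1,
                                 proj4string = crs(env_crop))
```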
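The spatial blocking in `run_block_cv()` might look roughly like this, using blockCV's legacy `spatialBlock()` interface (newer blockCV versions replace it with `cv_spatial()`); the block size is an assumption:

```r
library(blockCV)

# Build a spatial grid over the study area and assign points to k = 5 folds.
sb <- spatialBlock(speciesData = occ_sp,          # from the sketch above
                   rasterLayer = env_crop[[1]],
                   theRange = 200000,             # block size in meters (assumed)
                   k = 5,
                   selection = "random",
                   progress = FALSE)

# sb$foldID holds the fold (block) assignment for each point.
head(sb$foldID)
```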
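The second round of preparation in `prep_data_2()` amounts to extracting raster values at each point and keeping only complete, finite rows; a sketch:

```r
# Extract environmental values at presence and background coordinates.
occ_env <- raster::extract(env_crop, occ_t1[, c("longitude", "latitude")])
bg_env  <- raster::extract(env_crop, bg)

# Combine with a presence/background label, then drop NA and -Inf rows.
dat <- rbind(data.frame(pres = 1, occ_env),
             data.frame(pres = 0, bg_env))
dat <- dat[complete.cases(dat) & apply(dat, 1, function(x) all(is.finite(x))), ]
```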
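`train_test_split()` then partitions rows by block membership; a sketch, assuming the fold IDs from `run_block_cv()` have been attached to the data as a `foldID` column:

```r
# Blocks 1-4 train the model; block 5 is held out for evaluation.
train <- dat[dat$foldID != 5, ]
test  <- dat[dat$foldID == 5, ]
```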
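The tuning step in `model_func()` is built around `ENMeval::ENMevaluate()`; a minimal sketch using the ENMeval 2.x interface (the feature classes and regularization multipliers tried here are assumptions):

```r
library(ENMeval)

# Background coordinates as a data frame matching the occurrence columns.
bg_df <- as.data.frame(bg)
names(bg_df) <- c("longitude", "latitude")

# Tune MaxEnt models (maxnet implementation) with 5-fold cross-validation
# over a grid of feature classes (fc) and regularization multipliers (rm).
e <- ENMevaluate(occs = occ_t1[, c("longitude", "latitude")],
                 envs = env_crop,
                 bg = bg_df,
                 algorithm = "maxnet",
                 partitions = "randomkfold",
                 partition.settings = list(kfolds = 5),
                 tune.args = list(fc = c("L", "LQ", "LQH", "H"),
                                  rm = 1:4),
                 parallel = TRUE)
```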
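Selecting the 'best' model, as `best_mod()` does, can be sketched with the ENMeval 2.x accessors:

```r
# Pull the tuning results and find the settings with the highest
# average validation AUC across the 5 folds.
res <- eval.results(e)
best_row <- res[which.max(res$auc.val.avg), ]

# Retrieve the corresponding fitted model from the models list.
best_model <- eval.models(e)[[as.character(best_row$tune.args)]]
```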
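A sketch of the holdout evaluation in `evaluate_models()`: score the best model's predictions on the withheld block-5 presences and background with AUC (here via `dismo::evaluate()`; the project may compute AUC differently):

```r
# Cloglog predictions from the maxnet model on the withheld test rows.
p_test <- predict(best_model, test[test$pres == 1, ], type = "cloglog")
a_test <- predict(best_model, test[test$pres == 0, ], type = "cloglog")

# AUC on spatially-blocked, out-of-sample data.
ev <- dismo::evaluate(p = as.vector(p_test), a = as.vector(a_test))
ev@auc
```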
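Finally, the refit in `full_model()` amounts to fitting maxnet on the full dataset with the tuned settings; a sketch in which the feature classes ("lqh") and regularization multiplier (2) stand in for whatever values the tuning selected:

```r
library(maxnet)

# Refit on the full dataset (training + test) with the tuned settings.
all_dat   <- rbind(train, test)
pred_cols <- setdiff(names(all_dat), c("pres", "foldID"))

full_mod <- maxnet(p = all_dat$pres,
                   data = all_dat[, pred_cols],
                   f = maxnet.formula(all_dat$pres, all_dat[, pred_cols],
                                      classes = "lqh"),
                   regmult = 2)
```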
There are assorted scripts for producing continuous-probability and threshold maps in the analysis_mapping directory, PBS scripts and frameworks for running this material on the University of Arizona HPC in the pbs_scripts directory, and an exploratory set of tools that work through a Monte Carlo procedure for comparing time periods in the monte_carlo_methods directory.
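For example, a continuous cloglog suitability surface and a simple thresholded map can be produced along these lines (the minimum-training-presence threshold used here is an assumption; the project's scripts may use a different rule):

```r
# Continuous habitat-suitability map from the full model.
suit <- raster::predict(env_crop, full_mod, type = "cloglog")

# Binary map: threshold at the lowest predicted value among presences.
thr    <- min(predict(full_mod, dat[dat$pres == 1, pred_cols], type = "cloglog"))
binary <- suit >= thr

raster::plot(suit)
raster::plot(binary)
```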
For questions regarding this project and its content, please feel free to get in touch with Keaton Wilson, Katy Prudic, or Jeff Oliver.
This work is licensed under a BSD 2-Clause License.