Prepare Data for Model Calibration
prepare_data.Rmd
- Description
- Getting ready
- Prepare data
- Prepare user pre-processed data
- Saving a prepared_data object
Description
Before starting the ENM process, data must be formatted in a specific
structure required by functions in kuenm2. This
vignette guides users through the steps necessary to prepare occurrence
data and environmental predictors using built-in tools. It covers the
use of prepare_data()
and prepare_user_data()
to generate standardized objects, which are essential for model
calibration. The vignette also demonstrates options for applying PCA,
incorporating sampling bias, and saving prepared data for later use.
Getting ready
If kuenm2 has not been installed yet, please do so. See the Main guide for installation instructions. See the basic data cleaning guide for some steps on cleaning data.
Use the following lines of code to load kuenm2 and any other required packages, and define a working directory (if needed). In general, setting a working directory in R is considered good practice, as it provides better control over where files are read from or saved to. If users are not working within an R project, we recommend setting a working directory, since at least one file will be saved at later stages of this guide.
# Load packages
library(kuenm2)
library(terra)
# Current directory
getwd()
# Define new directory
#setwd("YOUR/DIRECTORY") # uncomment this line if setting a new directory
Prepare data
Import data
We will use occurrence records provided within the
kuenm2 package. Most example data in the package is
derived from Trindade &
Marques (2024). The occ_data
object contains 51
occurrences of Myrcia hatschbachii, a tree endemic to Southern
Brazil. Although this example data set has three columns (species, x,
and y), users’ input data only requires two numeric columns with
longitude and latitude coordinates.
# Import occurrences
data(occ_data, package = "kuenm2")
# Check data structure
str(occ_data)
#> 'data.frame': 51 obs. of 3 variables:
#> $ species: chr "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" ...
#> $ x : num -51.3 -50.6 -49.3 -49.8 -50.2 ...
#> $ y : num -29 -27.6 -27.8 -26.9 -28.2 ...
As predictor variables, we will use other data included in the package. This data set comprises four bioclimatic variables from WorldClim 2.1 at 10 arc-minute resolution, and a categorical variable (SoilType) from SoilGrids resampled to 10 arc-minutes. All variables have been masked using a polygon that delimits the area for model calibration, which was generated by drawing a minimum convex polygon around the records with a 300 km buffer.
# Import raster layers
var <- rast(system.file("extdata", "Current_variables.tif", package = "kuenm2"))
# Check variables
plot(var)
Visualize occurrence records in geographic space:
# Visualize occurrences on one variable
plot(var[["bio_1"]], main = "Bio 1")
points(occ_data[, c("x", "y")], col = "black")
First steps in preparing data
The function prepare_data()
is central to getting data
ready for model calibration. It handles several key steps:
- Defining the algorithm: Users can choose between maxnet or glm.
- Generating background points: Background points are sampled from the raster layers, unless provided by the user. These points serve as a reference to contrast presence records.
- Principal component analysis (PCA): An optional step that summarizes the set of continuous predictors into principal components.
- Preparing calibration data: Presence records and background points are associated with predictor values and put together in a data.frame to be used in the ENM.
- Data partitioning: The function divides your data using k-folds to prepare training and testing sets via a cross-validation process.
- Defining the grid of model parameters: This sets up combinations of feature classes (FCs), regularization multiplier (RM) values (for maxnet), and sets of predictor variables. For an explanation of the roles of RMs and FCs in Maxent models, see Merow et al. 2013. A short sketch of the resulting grid arithmetic follows this list.
As with any function, we recommend consulting the documentation for
more detailed explanations (e.g., help(prepare_data)
). Now,
let’s prepare our data for model calibration using
prepare_data()
:
# Prepare data for maxnet model
d <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.
The prepare_data()
function returns a
prepared_data
object, which is a list containing various
essential pieces of information for model calibration. Below is an
example of how the object is printed to summarize its components.
print(d)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 1008
#> - Presence: 51
#> - Background: 957
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: maxnet
#> - Number of candidate models: 610
#> - Features classes (responses): l, q, p, lq, lqp
#> - Regularization multipliers: 0.1, 1, 2, 3, 5
The parts of the prepared_data
object can be explored in
further detail by indexing them as in the following example.
# See first rows of calibration data
head(d$calibration_data)
#> pr_bg bio_1 bio_7 bio_12 bio_15 SoilType
#> 1 1 16.49046 18.66075 1778 12.96107 19
#> 2 1 15.46644 19.65775 1560 14.14697 19
#> 3 1 15.70560 17.99450 1652 23.27548 6
#> 4 1 17.78899 19.55600 1597 18.91694 1
#> 5 1 15.50116 18.30750 1497 15.39440 19
#> 7 1 17.42421 17.25875 1760 34.17664 6
# See first rows of formula grid
head(d$formula_grid)
#> ID Formulas R_multiplier Features
#> 1 1 ~bio_1 + bio_7 -1 0.1 l
#> 2 2 ~bio_1 + bio_7 -1 1.0 l
#> 3 3 ~bio_1 + bio_7 -1 2.0 l
#> 4 4 ~bio_1 + bio_7 -1 3.0 l
#> 5 5 ~bio_1 + bio_7 -1 5.0 l
#> 6 6 ~bio_1 + bio_12 -1 5.0 l
By default, prepare_data()
prepares an object for
fitting models using maxnet
. However, this can be changed
to use GLM instead. When using GLM, it is not necessary to set
regularization multipliers, as this algorithm does not utilize them.
d_glm <- prepare_data(algorithm = "glm",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 300,
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = NULL) #Not necessary with glms
#> Warning in handle_missing_data(occ_bg, weights): 8 rows were excluded from
#> database because NAs were found.
#Print object
d_glm
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 343
#> - Presence: 51
#> - Background: 292
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: glm
#> - Number of candidate models: 122
#> - Features classes (responses): l, q, p, lq, lqp
By default, prepare_data()
extracts background points
from the entire extent of the provided raster variables. If you have a
species-specific polygon defining the calibration area, you can use it
to mask the raster variables and delimit the calibration area. For
example, let’s create a 100 km buffer around the occurrence records:
#Convert dataframe with occurrences to SpatVector
pts <- vect(occ_data, geom = c(x = "x", y = "y"), crs = "EPSG:4326")
#Create buffer
b <- buffer(x = pts, width = 100000) #Width in meters
#Aggregate buffers
b <- terra::aggregate(b)
#Plot buffer
plot(b)
points(occ_data[, c("x", "y")], col = "black")
Now, let’s use this new polygon to mask the variables within
prepare_data()
. Let’s also increase the number of
background points to 1,000:
d_buffer <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
mask = b, #Polygon to mask variables
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> 'n_background' >= initial number of points, using all points.
#> Warning in handle_missing_data(occ_bg, weights): 25 rows were excluded from
#> database because NAs were found.
Note the warning message: “'n_background' >= initial number of points, using all points”. This occurs because, after masking the variables, the total number of pixels within the buffer is less than the specified n_background (1,000). In such cases, all available points within the calibration area will be used as background points.
In the following examples, we’ll use the object
d
, which was prepared for the maxnet algorithm
without a mask. However, all functions are compatible with objects
prepared for GLM as well.
Exploring calibration data
Users can visualize the distribution of predictor values for
occurrence records, background points, and the entire calibration area
using histograms. An example is presented below. See full documentation
with help(explore_calibration_hist)
and
help(plot_explore_calibration)
.
# Prepare histogram data
calib_hist <- explore_calibration_hist(data = d, raster_variables = var,
include_m = TRUE)
# Plot histograms
plot_explore_calibration(explore_calibration = calib_hist)
The gray bars represent values across the entire calibration area. Blue bars show values for the background, while green bars display values at presence points (magnified by a factor of 2 for improved visualization). You can customize both the colors and the magnification factor.
Additionally, users can explore the geographic distribution of
occurrence and background points. See full documentation with
help(explore_calibration_geo)
.
pbg <- explore_calibration_geo(data = d, raster_variables = var[[1]],
plot = TRUE)
Note that, by default, background points are selected randomly within the calibration area. However, users can influence the spatial distribution of background points, increasing or decreasing the probability of selection in certain regions, by providing a bias file (as demonstrated in the next section).
Using a bias file
A bias file is a SpatRaster
object that contains values
that will influence the selection of background points within the
calibration area. This can be particularly useful for mitigating
sampling bias, for instance, by incorporating the density of records
from a target group (as discussed in Ponder et
al. 2001, Anderson et
al. 2003, and Barber et
al. 2020).
The bias file must have the same extent, resolution, and number of cells as your raster variables, unless a mask is supplied. If a mask is used, the extent of the bias file should encompass or be larger than the mask extent.
Let’s illustrate this with an example bias file included in the
package. This SpatRaster
has lower values in the center and
higher values towards the borders:
# Import a bias file
bias <- rast(system.file("extdata", "bias_file.tif", package = "kuenm2"))
plot(bias)
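Before passing a bias file to prepare_data(), it can be useful to confirm that it aligns with the predictor layers. Below is a minimal sketch using terra::compareGeom(), a general terra utility (not part of the kuenm2 workflow) that returns TRUE when the geometries match:
# Check that extent, dimensions (rows/columns), and CRS match the predictors
compareGeom(bias, var, stopOnError = FALSE)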
We will now use this bias file to prepare two new datasets: one where the bias effect is “direct” (higher probability in regions with higher bias values) and another where the effect is “inverse” (higher probability in regions with lower bias values):
# Using bias as a direct effect in sampling
d_bias_direct <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
bias_file = bias, bias_effect = "direct", # bias parameters
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 57 rows were excluded from
#> database because NAs were found.
# Using bias as an inverse effect in sampling
d_bias_inverse <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
bias_file = bias, bias_effect = "inverse", # bias parameters
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 45 rows were excluded from
#> database because NAs were found.
# Compare background points generated randomly versus with bias effects
## Saving original plotting parameters
original_par <- par(no.readonly = TRUE)
## Adjusting plotting grid
par(mfrow = c(2,2))
## The plots to show sampling bias effects
plot(bias, main = "Bias file")
explore_calibration_geo(d, raster_variables = var[[1]],
main = "Random Background")
explore_calibration_geo(d_bias_direct, raster_variables = var[[1]],
main = "Direct Bias Effect")
explore_calibration_geo(d_bias_inverse, raster_variables = var[[1]],
main = "Inverse Bias Effect")
par(original_par) # Reset grid
Note that when the bias effect is “direct”, the majority of the background points are sampled from the borders of the calibration area, corresponding to higher bias values. Conversely, with an “inverse” bias effect, most background points are selected from the center, where bias values are lower.
PCA for variables
A common approach in ENM involves summarizing the information from a set of predictor variables into a smaller set of uncorrelated variables using Principal Component Analysis (PCA) (see Cruz-Cárdenas et al. 2014 for an example). In kuenm2, users can perform a PCA internally or use variables that have been externally prepared as PCs.
Internal PCA
kuenm2 can perform all PCA transformations
internally, eliminating the need to prepare the new PC variables for
each scenario of projection. This is particularly advantageous when
projecting model results across multiple time scenarios (e.g., various
Global Climate Models for different future periods). By performing PCA
internally, you only need to store the raw environmental variables
(e.g., bio_1
, bio_2
, etc.) in your directory,
and the functions will handle the PCA transformation as needed.
Let’s explore how to implement this:
# Prepare data for maxnet models using PCA parameters
d_pca <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = var,
do_pca = TRUE, center = TRUE, scale = TRUE, # PCA parameters
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.
print(d_pca)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 1008
#> - Presence: 51
#> - Background: 957
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information:
#> - Variables included: bio_1, bio_7, bio_12, bio_15
#> - Number of PCA components: 4
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: maxnet
#> - Number of candidate models: 610
#> - Features classes (responses): l, q, p, lq, lqp
#> - Regularization multipliers: 0.1, 1, 2, 3, 5
The calibration data and formula grid elements have now been generated considering the principal components (PCs). By default, all continuous variables were included in the PCA, while categorical variables (e.g., “SoilType”) were excluded. Regarding the number of PCs, the default settings retain the axes that collectively explain 95% of the total variance, and then further filter these, keeping only the axes that individually explain at least 5% of the variance. These parameters can be changed using other arguments of prepare_data().
# Check calibration data
head(d_pca$calibration_data)
#> pr_bg PC1 PC2 PC3 PC4 SoilType
#> 1 1 1.48690341 1.01252697 0.1180156 -0.09119257 19
#> 2 1 1.46028074 0.17701144 1.1573461 -0.12326796 19
#> 3 1 0.82676494 1.21965795 0.8145129 -0.67588891 6
#> 4 1 0.62680441 0.03967459 0.1525997 0.18784282 1
#> 5 1 0.94584897 0.93302089 1.4382424 -0.03192094 19
#> 7 1 -0.07597437 1.55268331 -0.2007953 -0.98153204 6
# Check formula grid
head(d_pca$formula_grid)
#> ID Formulas R_multiplier Features
#> 1 1 ~PC1 + PC2 -1 0.1 l
#> 2 2 ~PC1 + PC2 -1 1.0 l
#> 3 3 ~PC1 + PC2 -1 2.0 l
#> 4 4 ~PC1 + PC2 -1 3.0 l
#> 5 5 ~PC1 + PC2 -1 5.0 l
#> 6 6 ~PC1 + PC3 -1 5.0 l
# Explore variables distribution
calib_hist_pca <- explore_calibration_hist(data = d_pca, raster_variables = var,
include_m = TRUE, breaks = 7)
plot_explore_calibration(explore_calibration = calib_hist_pca)
As the PCA was performed internally, the prepared_data
object contains all the necessary information to transform the raw
environmental variables into the required PCs. This means that
when predicting or projecting models, users should provide the raw raster
variables, and the PCs will be computed internally by the
function.
External PCA
Alternatively, users can perform a PCA on their data using the
perform_pca()
function, or another tool of their preference. See
full documentation with help(perform_pca). See an example
with perform_pca()
below:
pca_var <- perform_pca(raster_variables = var, exclude_from_pca = "SoilType",
center = TRUE, scale = TRUE)
# Plot
plot(pca_var$env)
Now, let’s use the PCs generated by perform_pca()
to
prepare the data:
# Prepare data for maxnet model using PCA variables
d_pca_extern <- prepare_data(algorithm = "maxnet",
occ = occ_data,
x = "x", y = "y",
raster_variables = pca_var$env, # Output of perform_pca()
do_pca = FALSE, # Set to FALSE because variables are PCs
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
n_background = 1000,
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.
print(d_pca_extern)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 1008
#> - Presence: 51
#> - Background: 957
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - PC1, PC2, PC3, PC4
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: maxnet
#> - Number of candidate models: 610
#> - Features classes (responses): l, q, p, lq, lqp
#> - Regularization multipliers: 0.1, 1, 2, 3, 5
Note that since PCA was performed externally,
do_pca = FALSE
is set within the prepare_data
function. This is crucial because setting it to TRUE
would
incorrectly apply PCA to variables that are already PCs.
Consequently, the prepared_data
object in this scenario
does not store any PCA-related information. This means that when users
predict or project models, they must provide the
PCs instead of the raw raster variables.
# Check calibration data
head(d_pca_extern$calibration_data)
#> pr_bg PC1 PC2 PC3 PC4 SoilType
#> 1 1 1.48690341 1.01252697 0.1180156 -0.09119257 19
#> 2 1 1.46028074 0.17701144 1.1573461 -0.12326796 19
#> 3 1 0.82676494 1.21965795 0.8145129 -0.67588891 6
#> 4 1 0.62680441 0.03967459 0.1525997 0.18784282 1
#> 5 1 0.94584897 0.93302089 1.4382424 -0.03192094 19
#> 7 1 -0.07597437 1.55268331 -0.2007953 -0.98153204 6
# Check formula grid
head(d_pca_extern$formula_grid)
#> ID Formulas R_multiplier Features
#> 1 1 ~PC1 + PC2 -1 0.1 l
#> 2 2 ~PC1 + PC2 -1 1.0 l
#> 3 3 ~PC1 + PC2 -1 2.0 l
#> 4 4 ~PC1 + PC2 -1 3.0 l
#> 5 5 ~PC1 + PC2 -1 5.0 l
#> 6 6 ~PC1 + PC3 -1 5.0 l
Prepare user pre-processed data
If users already have data that has been prepared for calibration,
they can use the prepare_user_data()
function to create the
object required for model calibration. User-prepared calibration data
must be a data.frame
that includes a column indicating
presence (1) and background (0)
records, along with columns containing the values of each predictor
variables. The package includes an example of such a
data.frame
for reference. See an example of its use
below:
data("user_data", package = "kuenm2")
head(user_data)
#> pr_bg bio_1 bio_7 bio_12 bio_15 SoilType
#> 1 1 16.49046 18.66075 1778 12.96107 19
#> 2 1 15.46644 19.65775 1560 14.14697 19
#> 3 1 15.70560 17.99450 1652 23.27548 6
#> 4 1 17.78899 19.55600 1597 18.91694 1
#> 5 1 15.50116 18.30750 1497 15.39440 19
#> 7 1 17.42421 17.25875 1760 34.17664 6
The prepare_user_data()
function operates similarly to
prepare_data()
, but with a key difference: instead of
requiring a data.frame
of occurrence coordinates and a
SpatRaster
of predictor variables, it takes your already
prepared user data.frame
(see below). See full
documentation with help(prepare_user_data)
.
# Prepare data for maxnet model
data_user <- prepare_user_data(algorithm = "maxnet",
user_data = user_data, # user-prepared data.frame
pr_bg = "pr_bg",
species = "Myrcia hatschbachii",
categorical_variables = "SoilType",
features = c("l", "q", "p", "lq", "lqp"),
r_multiplier = c(0.1, 1, 2, 3, 5))
data_user
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 527
#> - Presence: 51
#> - Background: 476
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: maxnet
#> - Number of candidate models: 610
#> - Features classes (responses): l, q, p, lq, lqp
#> - Regularization multipliers: 0.1, 1, 2, 3, 5
This function also allows you to provide a list of folds for
cross-validation to be used during model calibration. If
user_folds
is NULL
, the function will
automatically split your data based on the number of folds specified by
the kfolds
argument. Internal PCA for variables is also
available with this function.
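For instance, below is a minimal sketch that changes the number of folds via the kfolds argument mentioned above, assuming the same settings as the earlier call:
# Same call as before, but splitting the data into 5 folds
data_user_k5 <- prepare_user_data(algorithm = "maxnet",
                                  user_data = user_data,
                                  pr_bg = "pr_bg",
                                  species = "Myrcia hatschbachii",
                                  categorical_variables = "SoilType",
                                  kfolds = 5, # number of cross-validation folds
                                  features = c("l", "q", "p", "lq", "lqp"),
                                  r_multiplier = c(0.1, 1, 2, 3, 5))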
Saving a prepared_data object
The prepared_data
object is crucial for the next step in
the ENM workflow in kuenm2, model calibration. As this
object is essentially a list, users can save it to a local directory
using saveRDS()
. Saving the object facilitates loading it
back into your R session later using readRDS()
. See an
example below:
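# Save the prepared_data object (the file name is illustrative)
saveRDS(d, file = "prepared_data.rds")

# Load it back into a future R session
d <- readRDS("prepared_data.rds")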