Skip to contents

Description

Before starting the ENM process, data must be formatted in a specific structure required by functions in kuenm2. This vignette guides users through the steps necessary to prepare occurrence data and environmental predictors using built-in tools. It covers the use of prepare_data() and prepare_user_data() to generate standardized objects, which are essential for model calibration. The vignette also demonstrates options for applying PCA, incorporating sampling bias, and saving prepared data for later use.


Getting ready

If kuenm2 has not been installed yet, please do so. See the Main guide for installation instructions. See the basic data cleaning guide for some steps on cleaning data.

Use the following lines of code to load kuenm2 and any other required packages, and define a working directory (if needed). In general, setting a working directory in R is considered good practice, as it provides better control over where files are read from or saved to. If users are not working within an R project, we recommend setting a working directory, since at least one file will be saved at later stages of this guide.

# Load packages
library(kuenm2)
library(terra)

# Current directory
getwd()

# Define new directory
#setwd("YOUR/DIRECTORY")  # uncomment this line if setting a new directory


Prepare data

Import data

We will use occurrence records provided within the kuenm2 package. Most example data in the package is derived from Trindade & Marques (2024). The occ_data object contains 51 occurrences of Myrcia hatschbachii, a tree endemic to Southern Brazil. Although this example data set has three columns (species, x, and y), users’ input data only requires two numeric columns with longitude and latitude coordinates.

# Import occurrences
data(occ_data, package = "kuenm2")

# Check data structure
str(occ_data)
#> 'data.frame':    51 obs. of  3 variables:
#>  $ species: chr  "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" ...
#>  $ x      : num  -51.3 -50.6 -49.3 -49.8 -50.2 ...
#>  $ y      : num  -29 -27.6 -27.8 -26.9 -28.2 ...


As predictor variables, we will use other data included in the package. This data set comprises four bioclimatic variables from WorldClim 2.1 at 10 arc-minute resolution, and a categorical variable (SoilType) from SoilGrids resampled to 10 arc-minutes. All variables have been masked using a polygon that delimits the area for model calibration, which was generated by drawing a minimum convex polygon around the records with a 300 km buffer.

# Import raster layers
var <- rast(system.file("extdata", "Current_variables.tif", package = "kuenm2"))

# Check variables
plot(var)


Visualize occurrences records in geography:

# Visualize occurrences on one variable
plot(var[["bio_1"]], main = "Bio 1")

points(occ_data[, c("x", "y")], col = "black")


First steps in preparing data

The function prepare_data() is central to getting data ready for model calibration. It handles several key steps:

  • Defining the algorithm: Users can choose between maxnet or glm.
  • Generating background points: Background points area sampled from raster layers, unless provided by the user. These points serve as a reference to contrast presence records.
  • Principal component analysis (PCA): An optional step that can be applied to the set of predictors in PCA.
  • Preparing calibration data: Presence records and background points are associate with predictor values and put together in a data.frame to be used in the ENM.
  • Data partitioning: The function divides your data using k-folds to prepare training and testing sets via a cross-validation process.
  • Defining grid of model parameters: This helps setting up combinations of feature classes (FCs), regularization multiplier (RM) values (for Maxnet), and sets of predictor variables. An explanation of the roles of RMs and FCs in Maxent models see Merow et al. 2013.

As with any function, we recommend consulting the documentation for more detailed explanations (e.g., help(prepare_data)). Now, let’s prepare our data for model calibration using prepare_data():

# Prepare data for maxnet model
d <- prepare_data(algorithm = "maxnet",
                  occ = occ_data,
                  x = "x", y = "y",
                  raster_variables = var,
                  species = "Myrcia hatschbachii",
                  categorical_variables = "SoilType",
                  n_background = 1000,
                  features = c("l", "q", "p", "lq", "lqp"),
                  r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.


The prepare_data() function returns a prepared_data object, which is a list containing various essential pieces of information fro model calibration. Below is an example of how the object is printed to summarize its components.

print(d)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii 
#> Number of Records: 1008 
#>   - Presence: 51 
#>   - Background: 957 
#> Training-Testing Method:
#>   - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#>   - bio_1, bio_7, bio_12, bio_15 
#> Categorical Variables:
#>   - SoilType 
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#>   - Algorithm: maxnet 
#>   - Number of candidate models: 610 
#>   - Features classes (responses): l, q, p, lq, lqp 
#>   - Regularization multipliers: 0.1, 1, 2, 3, 5


The parts of the prepared_data object can be explored in further detail by indexing them as in the following example.

# See first rows of calibration data
head(d$calibration_data)
#>   pr_bg    bio_1    bio_7 bio_12   bio_15 SoilType
#> 1     1 16.49046 18.66075   1778 12.96107       19
#> 2     1 15.46644 19.65775   1560 14.14697       19
#> 3     1 15.70560 17.99450   1652 23.27548        6
#> 4     1 17.78899 19.55600   1597 18.91694        1
#> 5     1 15.50116 18.30750   1497 15.39440       19
#> 7     1 17.42421 17.25875   1760 34.17664        6

# See first rows of formula grid
head(d$formula_grid)
#>   ID           Formulas R_multiplier Features
#> 1  1  ~bio_1 + bio_7 -1          0.1        l
#> 2  2  ~bio_1 + bio_7 -1          1.0        l
#> 3  3  ~bio_1 + bio_7 -1          2.0        l
#> 4  4  ~bio_1 + bio_7 -1          3.0        l
#> 5  5  ~bio_1 + bio_7 -1          5.0        l
#> 6  6 ~bio_1 + bio_12 -1          5.0        l


By default, prepare_data() prepares an object for fitting models using maxnet. However, this can be changed to use GLM instead. When using GLM, it is not necessary to set regularization multipliers, as this algorithm does not utilize them.

d_glm <- prepare_data(algorithm = "glm",
                      occ = occ_data,
                      x = "x", y = "y",
                      raster_variables = var,
                      species = "Myrcia hatschbachii",
                      categorical_variables = "SoilType",
                      n_background = 300,
                      features = c("l", "q", "p", "lq", "lqp"),
                      r_multiplier = NULL) #Not necessary with glms
#> Warning in handle_missing_data(occ_bg, weights): 8 rows were excluded from
#> database because NAs were found.
#Print object
d_glm
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii 
#> Number of Records: 343 
#>   - Presence: 51 
#>   - Background: 292 
#> Training-Testing Method:
#>   - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#>   - bio_1, bio_7, bio_12, bio_15 
#> Categorical Variables:
#>   - SoilType 
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#>   - Algorithm: glm 
#>   - Number of candidate models: 122 
#>   - Features classes (responses): l, q, p, lq, lqp


By default, prepare_data() extracts background points from the entire extent of the provided raster variables. If you have a species-specific polygon defining the calibration area, you can use it to mask the raster variables and delimit the calibration area. For example, let’s create a 100km buffer around the occurrence records:

#Convert dataframe with occurrences to SpatVector
pts <- vect(occ_data, geom = c(x = "x", y = "y"), crs = "EPSG:4326")
#Create buffer
b <- buffer(x = pts, width = 100000) #Width in meters
#Aggregate buffers
b <- terra::aggregate(b)
#Plot buffer
plot(b)
points(occ_data[, c("x", "y")], col = "black")

Now, let’s use this new polygon to mask the variables within prepare_data(). Let’s also increase the number of background points to 1,000:

d_buffer <- prepare_data(algorithm = "maxnet",
                         occ = occ_data,
                         x = "x", y = "y",
                         raster_variables = var,
                         mask = b, #Polygon to mask variables
                         species = "Myrcia hatschbachii",
                         categorical_variables = "SoilType",
                         n_background = 1000,
                         features = c("l", "q", "p", "lq", "lqp"),
                         r_multiplier = c(0.1, 1, 2, 3, 5))
#> 'n_background' >= initial number of points, using all points.
#> Warning in handle_missing_data(occ_bg, weights): 25 rows were excluded from
#> database because NAs were found.

Note the warning message: “n_background’ >= initial number of points, using all points”. This occurs because, after masking the variables, the total number of pixels within the buffer is less than the specified n_background (1,000). In such cases, all available points within the calibration area will be used as background points.

In the following examples, we’ll use the object d_maxnet, which was prepared for the maxnet algorithm without a mask. However, all functions are compatible with objects prepared for GLM as well.


Exploring calibration data

Users can visualize the distribution of predictor values for occurrence records, background points, and the entire calibration area using histograms. An example is presented below. See full documentation with help(explore_calibration_hist) and help(plot_explore_calibration).

# Prepare histogram data
calib_hist <- explore_calibration_hist(data = d, raster_variables = var,
                                       include_m = TRUE)

# Plot histograms
plot_explore_calibration(explore_calibration = calib_hist)

The gray bars represent values across the entire calibration area. Blue bars show values for the background, while green bars display values at presence points (magnified by a factor of 2 for improved visualization). You can customize both the colors and the magnification factor.


Additionally, users can explore the geographic distribution of occurrence and background points. See full documentation with help(explore_calibration_geo).

pbg <- explore_calibration_geo(data = d, raster_variables = var[[1]],
                               plot = TRUE)


Note that, by default, background points are selected randomly within the calibration area. However, users can influence the spatial distribution of background, increasing or decreasing the probability of selection in certain regions, by providing a bias file (as demonstrated in the next section).


Using a bias file

A bias file is a SpatRaster object that contains values that will influence the selection of background points within the calibration area. This can be particularly useful for mitigating sampling bias, for instance, by incorporating the density of records from a target group (as discussed in Ponder et al. 2001, Anderson et al. 2003, and Barber et al. 2020).

The bias file must have the same extent, resolution, and number of cells as your raster variables, unless a mask is supplied. If a mask is used, the extent of the bias file should encompass or be larger than the mask extent.

Let’s illustrate this with an example bias file included in the package. This SpatRaster has lower values in the center and higher values towards the borders:

# Import a bias file
bias <- rast(system.file("extdata", "bias_file.tif", package = "kuenm2"))

plot(bias)


We will now use this bias file to prepare two new datasets: one where the bias effect is “direct” (higher probability in regions with higher bias values) and another where the effect is “inverse” (higher probability in regions with lower bias values):

# Using bias as a direct effect in sampling
d_bias_direct <- prepare_data(algorithm = "maxnet",
                              occ = occ_data,
                              x = "x", y = "y",
                              raster_variables = var,
                              species = "Myrcia hatschbachii",
                              categorical_variables = "SoilType",
                              n_background = 1000,
                              bias_file = bias, bias_effect = "direct",  # bias parameters
                              features = c("l", "q", "p", "lq", "lqp"),
                              r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 57 rows were excluded from
#> database because NAs were found.

# Using bias as an indirect effect in sampling
d_bias_inverse <- prepare_data(algorithm = "maxnet",
                               occ = occ_data,
                               x = "x", y = "y",
                               raster_variables = var,
                               species = "Myrcia hatschbachii",
                               categorical_variables = "SoilType",
                               n_background = 1000,
                               bias_file = bias, bias_effect = "inverse",   # bias parameters
                               features = c("l", "q", "p", "lq", "lqp"),
                               r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 45 rows were excluded from
#> database because NAs were found.

# Compare background points generated randomly versus with bias effects
## Saving original plotting parameters
original_par <- par(no.readonly = TRUE)

## Adjusting plotting grid
par(mfrow = c(2,2))  

## The plots to show sampling bias effects
plot(bias, main = "Bias file")
explore_calibration_geo(d, raster_variables = var[[1]],
                        main = "Random Background")
explore_calibration_geo(d_bias_direct, raster_variables = var[[1]],
                        main = "Direct Bias Effect")
explore_calibration_geo(d_bias_inverse, raster_variables = var[[1]],
                        main = "Inverse Bias Effect")


par(original_par)  # Reset grid

Note that when the bias effect is “direct”, the majority of the background points are sampled from the borders of the calibration area, corresponding to higher bias values. Conversely, with an “inverse” bias effect, most background points are selected from the center, where bias values are lower.


PCA for variables

A common approach in ENM involves summarizing the information from a set of predictor variables into a smaller set of uncorrelated variables using Principal Component Analysis (PCA) (see Cruz-Cardenaz et al. 2014 for an example). In kuenm2 users can perform a PCA internally or use variables that have been externally prepared as PCs.

Internal PCA

kuenm2 can perform all PCA transformations internally, eliminating the need to prepare the new PC variables for each scenario of projection. This is particularly advantageous when projecting model results across multiple time scenarios (e.g., various Global Climate Models for different future periods). By performing PCA internally, you only need to store the raw environmental variables (e.g., bio_1, bio_2, etc.) on your directory, and the functions will handle the PCA transformation as needed.

Let’s explore how to implement this:

# Prepare data for maxnet models using PCA parameters
d_pca <- prepare_data(algorithm = "maxnet",
                      occ = occ_data,
                      x = "x", y = "y",
                      raster_variables = var, 
                      do_pca = TRUE, center = TRUE, scale = TRUE,  # PCA parameters
                      species = "Myrcia hatschbachii",
                      categorical_variables = "SoilType",
                      n_background = 1000,
                      features = c("l", "q", "p", "lq", "lqp"),
                      r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.

print(d_pca)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii 
#> Number of Records: 1008 
#>   - Presence: 51 
#>   - Background: 957 
#> Training-Testing Method:
#>   - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#>   - bio_1, bio_7, bio_12, bio_15 
#> Categorical Variables:
#>   - SoilType 
#> PCA Information:
#>   - Variables included: bio_1, bio_7, bio_12, bio_15 
#>   - Number of PCA components: 4 
#> Weights: No weights provided
#> Calibration Parameters:
#>   - Algorithm: maxnet 
#>   - Number of candidate models: 610 
#>   - Features classes (responses): l, q, p, lq, lqp 
#>   - Regularization multipliers: 0.1, 1, 2, 3, 5


The elements calibration data and formula grid have now been generated considering the principal components (PCs). By default, all continuous variables were included in the PCA, while categorical variables (e.g., “SoilType”) were excluded. The default settings for the number of PCs selected retain the axes that collectively explain 95% of the total variance, and then further filter these, keeping only those axes that individually explain at least 5% of the variance. These parameters can be changed using other arguments in the function prepare_data

# Check calibration data
head(d_pca$calibration_data)
#>   pr_bg         PC1        PC2        PC3         PC4 SoilType
#> 1     1  1.48690341 1.01252697  0.1180156 -0.09119257       19
#> 2     1  1.46028074 0.17701144  1.1573461 -0.12326796       19
#> 3     1  0.82676494 1.21965795  0.8145129 -0.67588891        6
#> 4     1  0.62680441 0.03967459  0.1525997  0.18784282        1
#> 5     1  0.94584897 0.93302089  1.4382424 -0.03192094       19
#> 7     1 -0.07597437 1.55268331 -0.2007953 -0.98153204        6

# Check formula grid
head(d_pca$formula_grid)
#>   ID      Formulas R_multiplier Features
#> 1  1 ~PC1 + PC2 -1          0.1        l
#> 2  2 ~PC1 + PC2 -1          1.0        l
#> 3  3 ~PC1 + PC2 -1          2.0        l
#> 4  4 ~PC1 + PC2 -1          3.0        l
#> 5  5 ~PC1 + PC2 -1          5.0        l
#> 6  6 ~PC1 + PC3 -1          5.0        l

# Explore variables distribution
calib_hist_pca <- explore_calibration_hist(data = d_pca, raster_variables = var,
                                           include_m = TRUE, breaks = 7)

plot_explore_calibration(explore_calibration = calib_hist_pca)

As the PCA was performed internally, the prepared_data object contains all the necessary information to transform the raw environmental variables into the required PCs This means that when predicting or projecting models, users should provide raw raster variables, and the PCs will be obtained internally in the function.


External PCA

Alternatively, users can perform a PCA with their data by using the perform_pca() function, or one of their preference. See full documentation with help(perform_pca). Se an example with perform_pca() below:

pca_var <- perform_pca(raster_variables = var, exclude_from_pca = "SoilType",
                       center = TRUE, scale = TRUE)

# Plot
plot(pca_var$env)


Now, let’s use the PCs generated by perform_pca() to prepare the data:

# Prepare data for maxnet model using PCA variables
d_pca_extern <- prepare_data(algorithm = "maxnet",
                             occ = occ_data,
                             x = "x", y = "y",
                             raster_variables = pca_var$env,  # Output of perform_pca()
                             do_pca = FALSE,  # Set to FALSE because variables are PCs
                             species = "Myrcia hatschbachii",
                             categorical_variables = "SoilType",
                             n_background = 1000,
                             features = c("l", "q", "p", "lq", "lqp"),
                             r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning in handle_missing_data(occ_bg, weights): 43 rows were excluded from
#> database because NAs were found.

print(d_pca_extern)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii 
#> Number of Records: 1008 
#>   - Presence: 51 
#>   - Background: 957 
#> Training-Testing Method:
#>   - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#>   - PC1, PC2, PC3, PC4 
#> Categorical Variables:
#>   - SoilType 
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#>   - Algorithm: maxnet 
#>   - Number of candidate models: 610 
#>   - Features classes (responses): l, q, p, lq, lqp 
#>   - Regularization multipliers: 0.1, 1, 2, 3, 5


Note that since PCA was performed externally, do_pca = FALSE is set within the prepare_data function. This is crucial because setting it to TRUE would incorrectly apply PCA to variables that are already PCs. Consequently, the prepared_data object in this scenario does not store any PCA-related information. This means that when users predict or project models, they must must provide the PCs instead of the raw raster variables.

# Check calibration data
head(d_pca_extern$calibration_data)
#>   pr_bg         PC1        PC2        PC3         PC4 SoilType
#> 1     1  1.48690341 1.01252697  0.1180156 -0.09119257       19
#> 2     1  1.46028074 0.17701144  1.1573461 -0.12326796       19
#> 3     1  0.82676494 1.21965795  0.8145129 -0.67588891        6
#> 4     1  0.62680441 0.03967459  0.1525997  0.18784282        1
#> 5     1  0.94584897 0.93302089  1.4382424 -0.03192094       19
#> 7     1 -0.07597437 1.55268331 -0.2007953 -0.98153204        6

# Check formula grid
head(d_pca_extern$formula_grid)
#>   ID      Formulas R_multiplier Features
#> 1  1 ~PC1 + PC2 -1          0.1        l
#> 2  2 ~PC1 + PC2 -1          1.0        l
#> 3  3 ~PC1 + PC2 -1          2.0        l
#> 4  4 ~PC1 + PC2 -1          3.0        l
#> 5  5 ~PC1 + PC2 -1          5.0        l
#> 6  6 ~PC1 + PC3 -1          5.0        l


Prepare user pre-processed data

If users already have data that has been prepared for calibration, they can use the prepare_user_data() function to create the object required for model calibration. User-prepared calibration data must be a data.frame that includes a column indicating presence (1) and background (0) records, along with columns with values for each of your variables. The package includes an example of such a data.frame for reference. See an example of its use below:

data("user_data", package = "kuenm2")

head(user_data)
#>   pr_bg    bio_1    bio_7 bio_12   bio_15 SoilType
#> 1     1 16.49046 18.66075   1778 12.96107       19
#> 2     1 15.46644 19.65775   1560 14.14697       19
#> 3     1 15.70560 17.99450   1652 23.27548        6
#> 4     1 17.78899 19.55600   1597 18.91694        1
#> 5     1 15.50116 18.30750   1497 15.39440       19
#> 7     1 17.42421 17.25875   1760 34.17664        6


The prepare_user_data() function operates similarly to prepare_data(), but with a key difference; instead of requiring a data.frame of occurrence coordinates and a SpatRaster of predictor variables, it takes your already prepared user data.frame (see below). See full documentation with help(prepare_user_data).

# Prepare data for maxnet model
data_user <- prepare_user_data(algorithm = "maxnet",
                               user_data = user_data,  # user-prepared data.frame
                               pr_bg = "pr_bg",
                               species = "Myrcia hatschbachii",
                               categorical_variables = "SoilType",
                               features = c("l", "q", "p", "lq", "lqp"),
                               r_multiplier = c(0.1, 1, 2, 3, 5))

data_user 
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii 
#> Number of Records: 527 
#>   - Presence: 51 
#>   - Background: 476 
#> Training-Testing Method:
#>   - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#>   - bio_1, bio_7, bio_12, bio_15 
#> Categorical Variables:
#>   - SoilType 
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#>   - Algorithm: maxnet 
#>   - Number of candidate models: 610 
#>   - Features classes (responses): l, q, p, lq, lqp 
#>   - Regularization multipliers: 0.1, 1, 2, 3, 5

This function also allows you to provide a list of folds for cross-validation to be used during model calibration. If user_folds is NULL, the function will automatically split your data based on the number of folds specified by the kfolds argument. Internal PCA for variables is also available with this function.


Saving a prepared_data object

The prepared_data object is crucial for the next step in the ENM workflow in kuenm2, model calibration. As this object is essentially a list, users can save it to a local directory using saveRDS(). Saving the object facilitates loading it back into your R session later using readRDS(). See an example below:

# Set directory to save (here, in a temporary directory)
dir_to_save <- file.path(tempdir())

# Save the data
saveRDS(d, file.path(dir_to_save, "Data.rds"))

# Import data
d <- readRDS(file.path(dir_to_save, "Data.rds"))