Prepare data for model calibration
This function prepares data for model calibration. It handles optional PCA, background point generation, and k-fold partitioning, and it creates a grid of parameter combinations built from regularization multiplier values, feature classes, and sets of environmental variables.
Usage
prepare_data(algorithm, occ, x, y, raster_variables, species = NULL,
mask = NULL, n_background = 1000, features = c("lq", "lqp"),
r_multiplier = c(0.1, 0.5, 1, 2, 3), kfolds = 4,
categorical_variables = NULL, do_pca = FALSE, center = TRUE,
scale = TRUE, exclude_from_pca = NULL, variance_explained = 95,
min_explained = 5, min_number = 2, min_continuous = NULL,
bias_file = NULL, bias_effect = NULL, weights = NULL,
include_xy = TRUE, write_pca = FALSE, pca_directory = NULL,
write_file = FALSE, file_name = NULL, seed = 1)
Arguments
- algorithm
(character) modeling algorithm, either "glm" or "maxnet".
- occ
(data frame) a data.frame containing the coordinates (longitude and latitude) of the occurrence records.
- x
(character) a string specifying the name of the column in occ that contains the longitude values.
- y
(character) a string specifying the name of the column in occ that contains the latitude values.
- raster_variables
(SpatRaster) predictor variables used to calibrate the models.
- species
(character) string specifying the species name (optional). Default is NULL.
- mask
spatial object used to mask the variables to the area where the model will be calibrated. Mask must be of one of the following classes: SpatRaster, SpatVector, or SpatExtent. Default is NULL.
- n_background
(numeric) number of points to represent the background for the model. Default is 1000.
- features
(character) a vector of feature classes. Default is c("lq", "lqp").
- r_multiplier
(numeric) a vector of regularization multiplier values for maxnet. Default is c(0.1, 0.5, 1, 2, 3).
- kfolds
(numeric) the number of groups (folds) the occurrence data will be split into for cross-validation. Default is 4.
- categorical_variables
(character) names of the variables that are categorical. Default is NULL.
- do_pca
(logical) whether to perform a principal component analysis (PCA) with the set of variables. Default is FALSE.
- center
(logical) whether the variables should be zero-centered. Default is TRUE.
- scale
(logical) whether the variables should be scaled to have unit variance before the analysis takes place. Default is TRUE.
- exclude_from_pca
(character) variable names within raster_variables that should not be included in the PCA transformation. Instead, these variables will be added directly to the final set of output variables without being modified. The default is NULL, meaning all variables will be used unless specified otherwise.
- variance_explained
(numeric) the cumulative percentage of total variance that must be explained by the selected principal components. Default is 95.
- min_explained
(numeric) the minimum percentage of total variance that a principal component must explain to be retained. Default is 5.
- min_number
(numeric) the minimum number of variables to be included in the model formulas to be generated. Default is 2.
- min_continuous
(numeric) the minimum number of continuous variables required in a combination. Default is NULL.
- bias_file
(SpatRaster) a raster containing bias values (probability weights) that influence the selection of background points. It must have the same extent, resolution, and number of cells as the raster variables, unless a mask is provided. Default is NULL.
- bias_effect
(character) a string specifying how the values in the bias_file should be interpreted. Options are "direct" or "inverse". If "direct", higher values in the bias file increase the likelihood of selecting background points. If "inverse", higher values decrease the likelihood. Default is NULL. Must be defined if bias_file is provided.
- weights
(numeric) a numeric vector specifying weights for the occurrence records. Default is NULL.
- include_xy
(logical) whether to include the coordinates (longitude and latitude) in the results from preparing data. Columns containing coordinates will be renamed as "x" and "y". Default is TRUE.
- write_pca
(logical) whether to save the PCA-derived raster layers (principal components) to disk. Default is FALSE.
- pca_directory
(character) the path or name of the folder where the PC raster layers will be saved. This is only applicable if write_pca = TRUE. Default is NULL.
- write_file
(logical) whether to write the resulting prepared_data list in a local directory. Default is FALSE.
- file_name
(character) the path or name of the folder where the resulting list will be saved. This is only applicable if write_file = TRUE. Default is NULL.
- seed
(numeric) integer value to specify an initial seed to split the data and extract background. Default is 1.
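The PCA arguments (center, scale, variance_explained, min_explained) can be easier to read with a small worked example. The sketch below uses base R's prcomp on simulated data. It is not the package's internal code, and the way the two thresholds are combined here (cumulative threshold first, then the per-component filter) is an assumption made for illustration only.

```r
# Illustrative sketch only: base R on simulated data, not kuenm2 internals.
set.seed(1)
n <- 200

# Five simulated "environmental" variables driven by two latent gradients
g1 <- rnorm(n)
g2 <- rnorm(n)
env <- data.frame(
  v1 = g1 + rnorm(n, sd = 0.1),
  v2 = g1 + rnorm(n, sd = 0.1),
  v3 = g2 + rnorm(n, sd = 0.1),
  v4 = g2 + rnorm(n, sd = 0.1),
  v5 = rnorm(n)
)

# center and scale. mirror the center and scale arguments described above
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Percentage of total variance explained by each component
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

variance_explained <- 95  # cumulative threshold (percent)
min_explained <- 5        # per-component threshold (percent)

# Smallest number of leading components whose cumulative variance
# reaches the cumulative threshold
n_cum <- which(cumsum(pct) >= variance_explained)[1]

# Assumed rule: among those, keep components that individually
# explain at least min_explained percent
keep <- which(pct[seq_len(n_cum)] >= min_explained)

# Scores of the retained components
pcs <- pca$x[, keep, drop = FALSE]
```

Inspecting pct and keep shows which components pass each threshold for a given data set.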
Value
An object of class prepared_data containing all elements to run a model calibration routine. The elements include: species, calibration data, a grid of model parameters, indices of k-folds for cross-validation, xy coordinates, names of continuous and categorical variables, weights, results from PCA, and modeling algorithm.
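To make the "grid of model parameters" and "indices of k-folds" more concrete, here is a minimal base R sketch of how such objects could be built. It is an assumption-laden illustration, not the package's implementation: the object names (var_sets, grid, fold_id) are hypothetical, and the package may filter out combinations, so real grids can be smaller than the full cross shown here.

```r
# Illustrative sketch only: base R, not kuenm2 internals.
features <- c("lq", "lqp")
r_multiplier <- c(0.1, 0.5, 1, 2, 3)
variables <- c("bio_1", "bio_7", "bio_12")  # example variable names
min_number <- 2

# All variable sets containing at least min_number variables
sizes <- min_number:length(variables)
var_sets <- unlist(
  lapply(sizes, function(k) combn(variables, k, paste, collapse = " + ")),
  use.names = FALSE
)

# Full cross of variable sets, feature classes, and regularization multipliers
grid <- expand.grid(variables = var_sets, features = features,
                    r_multiplier = r_multiplier, stringsAsFactors = FALSE)

# k-fold assignment: balanced fold labels, shuffled reproducibly with a seed
set.seed(1)
kfolds <- 4
n_records <- 51
fold_id <- sample(rep(seq_len(kfolds), length.out = n_records))

nrow(grid)      # number of candidate combinations in this toy grid
table(fold_id)  # number of records assigned to each fold
```

Fixing the seed makes the fold assignment repeatable, which is the role the seed argument plays for data splitting and background extraction.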
Examples
# Import occurrences
data(occ_data, package = "kuenm2")
# Import raster layers
var <- terra::rast(system.file("extdata", "Current_variables.tif",
                               package = "kuenm2"))
# Import a bias file
bias <- terra::rast(system.file("extdata", "bias_file.tif",
                                package = "kuenm2"))
# Prepare data for maxnet model
sp_swd <- prepare_data(algorithm = "maxnet", occ = occ_data,
                       x = "x", y = "y",
                       raster_variables = var,
                       species = occ_data[1, 1],
                       categorical_variables = "SoilType",
                       n_background = 500, bias_file = bias,
                       bias_effect = "direct",
                       features = c("l", "q", "p", "lq", "lqp"),
                       r_multiplier = c(0.1, 1, 2, 3, 5))
#> Warning: 27 rows were excluded from database because NAs were found.
print(sp_swd)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 524
#> - Presence: 51
#> - Background: 473
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: maxnet
#> - Number of candidate models: 610
#> - Features classes (responses): l, q, p, lq, lqp
#> - Regularization multipliers: 0.1, 1, 2, 3, 5
# Prepare data for glm model
sp_swd_glm <- prepare_data(algorithm = "glm", occ = occ_data,
                           x = "x", y = "y",
                           raster_variables = var,
                           species = occ_data[1, 1],
                           categorical_variables = "SoilType",
                           n_background = 500, bias_file = bias,
                           bias_effect = "direct",
                           features = c("l", "q", "p", "lq", "lqp"))
#> Warning: 27 rows were excluded from database because NAs were found.
print(sp_swd_glm)
#> prepared_data object summary
#> ============================
#> Species: Myrcia hatschbachii
#> Number of Records: 524
#> - Presence: 51
#> - Background: 473
#> Training-Testing Method:
#> - k-fold Cross-validation: 4 folds
#> Continuous Variables:
#> - bio_1, bio_7, bio_12, bio_15
#> Categorical Variables:
#> - SoilType
#> PCA Information: PCA not performed
#> Weights: No weights provided
#> Calibration Parameters:
#> - Algorithm: glm
#> - Number of candidate models: 122
#> - Features classes (responses): l, q, p, lq, lqp
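Both examples above use bias_effect = "direct". The base R sketch below illustrates the difference between "direct" and "inverse" in terms of sampling probabilities. It is a toy illustration, not the package's code: the bias values are made up, the inversion rule (1 minus the value, which assumes values between 0 and 1) is an assumption, and sampling with replacement is used only to keep the example short.

```r
# Illustrative sketch only: base R, not kuenm2 internals.
set.seed(1)
bias_values <- c(0.1, 0.2, 0.3, 0.4)  # hypothetical bias values for four cells

# "direct": sampling probability proportional to the bias value
p_direct <- bias_values / sum(bias_values)

# "inverse": assumed inversion for illustration (values must lie in [0, 1])
inv <- 1 - bias_values
p_inverse <- inv / sum(inv)

# Draw background cells using each set of probabilities
n_background <- 1000
cells_direct <- sample(seq_along(bias_values), n_background,
                       replace = TRUE, prob = p_direct)
cells_inverse <- sample(seq_along(bias_values), n_background,
                        replace = TRUE, prob = p_inverse)

table(cells_direct)   # cell 4 (highest bias value) is expected most often
table(cells_inverse)  # cell 1 (lowest bias value) is expected most often
```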