Introduction

After the best-performing models have been selected, users need to fit these models (using fit_selected()) in order to explore their characteristics and continue with the next steps. Fitted models can then be used to assess variable importance and to explore variable response curves. We can also evaluate the selected models using independent presence records that were not used during model calibration.

To fit the selected models, we need a calibration_results object. For more details on model calibration, please refer to the vignette Model Calibration. The calibration_results object generated in that vignette is available as a data example in the package. Let’s load it.

#Load packages
library(kuenm2)
library(terra)
#> terra 1.8.60

#Import calib_results_maxnet
data("calib_results_maxnet", package = "kuenm2")
#Print calibration result
calib_results_maxnet
#> calibration_results object summary (maxnet)
#> =============================================================
#> Species: Myrcia hatschbachii 
#> Number of candidate models: 300 
#>   - Models removed because they failed to fit: 0 
#>   - Models identified with concave curves: 39 
#>   - Model with concave curves not removed 
#>   - Models removed with non-significant values of pROC: 0 
#>   - Models removed with omission error > 10%: 165 
#>   - Models removed with delta AIC > 2: 133 
#> Selected models: 2 
#>   - Up to 5 printed here:
#>      ID
#> 192 192
#> 219 219
#>                                                                                      Formulas
#> 192                        ~bio_1 + bio_7 + bio_15 + I(bio_1^2) + I(bio_7^2) + I(bio_15^2) -1
#> 219 ~bio_1 + bio_7 + bio_12 + bio_15 + I(bio_1^2) + I(bio_7^2) + I(bio_12^2) + I(bio_15^2) -1
#>     Features R_multiplier pval_pROC_at_10.mean Omission_rate_at_10.mean
#> 192       lq          0.1                    0                   0.0769
#> 219       lq          0.1                    0                   0.0962
#>         dAIC Parameters
#> 192 0.000000          6
#> 219 1.179293          7

This object contains the results of candidate models calibrated using the maxnet algorithm. The package also provides a similar example using the glm algorithm, which works in exactly the same way.

#Import calib_results_glm
data("calib_results_glm", package = "kuenm2")
#Print calibration result
calib_results_glm
#> calibration_results object summary (glm)
#> =============================================================
#> Species: Myrcia hatschbachii 
#> Number of candidate models: 122 
#>   - Models removed because they failed to fit: 0 
#>   - Models identified with concave curves: 18 
#>   - Model with concave curves not removed 
#>   - Models removed with non-significant values of pROC: 0 
#>   - Models removed with omission error > 10%: 101 
#>   - Models removed with delta AIC > 2: 20 
#> Selected models: 1 
#>   - Up to 5 printed here:
#>    ID                                                        Formulas Features
#> 85 85 ~bio_1 + bio_7 + bio_12 + I(bio_1^2) + I(bio_7^2) + I(bio_12^2)       lq
#>    pval_pROC_at_10.mean Omission_rate_at_10.mean dAIC Parameters
#> 85                    0                   0.0904    0          6

Note that the calibration_results object stores only the information related to the calibration process and model evaluation—it does not include the fitted maxnet (or glm) models themselves.

To obtain the final fitted models, we need to use the fit_selected() function. By default, this function fits a full model (i.e., without partitions and without splitting the data into training and testing sets). However, you can configure it to fit final models with partitions if desired.
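
For reference, a minimal sketch of this default behavior (only the full models, no partitions; the object name fm_full is arbitrary) would be:

# Fit only the full models, using the function defaults (sketch for illustration)
fm_full <- fit_selected(calibration_results = calib_results_maxnet)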

In this example, we’ll fit the final models using the same partition settings (4-fold cross-validation) as used in the Model Calibration vignette.

# Fit selected models using calibration results
fm <- fit_selected(calibration_results = calib_results_maxnet, 
                   partition_method = "kfolds", n_partitions = 4)
# Fitting partitions...
#   |========================================================================| 100%
# Fitting full models...
#   |========================================================================| 100%


The fit_selected() function returns a fitted_models object, a list that contains essential information from the fitted models, which is required for the subsequent steps.

You can explore the contents of the fitted_models object by indexing its elements. For example, the fitted maxnet (or glm) model objects are stored within the Models element. Note that Models is a nested list: for each selected model (in this case, models 192 and 219), it includes both the partitions (if fitted with partitions) and the full model.

#See names of selected models
names(fm$Models)
#> [1] "Model_192" "Model_219"

#See models inside Model 192
names(fm$Models$Model_192)
#> [1] "Partition_1" "Partition_2" "Partition_3" "Partition_4" "Full_model"

The fitted_models object also stores the thresholds that can be used to binarize the models into suitable and unsuitable areas. These thresholds correspond to the omission error used during model selection (e.g., 5% or 10%).

You can access the omission error used to calculate the thresholds directly from the object:

#Get omission error used to select models and calculate the thresholds
fm$omission_rate
#> [1] 10

The omission error used to calculate the thresholds was 10%, meaning that when the predictions are binarized, approximately 10% of the presence records used to calibrate the models will fall into cells with predicted values below the threshold.

The thresholds are summarized in two ways: the mean and median across partitions for each model, and the consensus mean and median across all selected models (when more than one model is selected):

fm$thresholds
#> $Model_192
#> $Model_192$mean
#> [1] 0.241469
#> 
#> $Model_192$median
#> [1] 0.2535061
#> 
#> 
#> $Model_219
#> $Model_219$mean
#> [1] 0.2573623
#> 
#> $Model_219$median
#> [1] 0.2597515
#> 
#> 
#> $consensus
#> $consensus$mean
#> [1] 0.2494156
#> 
#> $consensus$median
#> [1] 0.2566288
#> 
#> 
#> $type
#> [1] "cloglog"
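
As a quick illustration of how these thresholds are applied (a minimal sketch using hypothetical suitability values, not package output), predictions below the threshold are classified as unsuitable (0) and the rest as suitable (1):

# Hypothetical cloglog suitability values, for illustration only
suit <- c(0.10, 0.30, 0.60)
# Binarize them with the consensus mean threshold stored in the object
as.numeric(suit >= fm$thresholds$consensus$mean)
#> [1] 0 1 1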

Now, we can use the fitted_models object to generate response curves and compute variable importance.


Response curve

The response curves illustrate how each environmental variable influences the predicted suitability, while keeping all other variables constant.

By default, the curves are generated with all other variables set to their mean values (or the mode for categorical variables), calculated from the combined set of presence and background localities (averages_from = "pr_bg"). You can change this behavior to use only the presence localities by setting averages_from = "pr".
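
For example, a sketch of a call that fixes the non-focal variables at presence-only averages:

# Response curve with the other variables fixed at presence-only averages
response_curve(models = fm, variable = "bio_1", averages_from = "pr")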

Let’s check which variables are available to plot by examining the coefficients of the full models:

#Get variables with non-zero coefficients in the models
fm$Models[[1]]$Full_model$betas #From the first model selected
#>        bio_1        bio_7       bio_15   I(bio_1^2)   I(bio_7^2)  I(bio_15^2) 
#> 11.572321659  0.215970079  0.369077970 -0.356605446 -0.020306099 -0.006200151
fm$Models[[2]]$Full_model$betas #From the second model selected
#>         bio_1        bio_12        bio_15    I(bio_1^2)    I(bio_7^2) 
#>  1.178814e+01  1.638570e-02  3.406100e-01 -3.625405e-01 -1.440450e-02 
#>   I(bio_12^2)   I(bio_15^2) 
#> -5.261860e-06 -5.623987e-03

The variables bio_1, bio_7, bio_12 and bio_15 have non-zero coefficient values, which means they contribute to the models and are available for generating response curves.

By default, response curves are computed using all selected models. The resulting plots include a line for the mean response, along with a shaded area representing the 95% confidence interval.

par(mfrow = c(2, 2)) #Set grid of plot
response_curve(models = fm, variable = "bio_1")
response_curve(models = fm, variable = "bio_7")
response_curve(models = fm, variable = "bio_12")
response_curve(models = fm, variable = "bio_15")

par(mfrow = c(1, 1)) # Reset plotting grid

We can also specify which of the selected models should be used to generate the response curves:

par(mfrow = c(2, 2)) #Set grid of plot
response_curve(models = fm, variable = "bio_1",
               modelID = "Model_192", main = "Model_192")
response_curve(models = fm, variable = "bio_1", 
               modelID = "Model_219", main = "Model_219")
response_curve(models = fm, variable = "bio_7", 
               modelID = "Model_192", main = "Model_192")
response_curve(models = fm, variable = "bio_7", 
               modelID = "Model_219", main = "Model_219")

par(mfrow = c(1, 1)) # Reset plotting grid

The dashed lines indicate the range of the variable within the calibration data. By default, the plot extends beyond these limits based on the variable’s minimum and maximum values and the extrapolation_factor (when extrapolation = TRUE). The default extrapolation is set to 10% of the variable’s range (i.e., range × 0.1).

If extrapolation = FALSE, no extrapolation occurs, and the plot limits match the calibration data range exactly.
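
For instance, a minimal sketch restricting the curve to the calibration range:

# Response curve limited to the range of the calibration data
response_curve(models = fm, variable = "bio_1", extrapolation = FALSE)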

We can increase the extrapolation factor to allow a broader range beyond the observed data. Below is the response curve plotted with an extrapolation factor of 2:

par(mfrow = c(2, 2)) #Set grid of plot
response_curve(models = fm, variable = "bio_1", extrapolation_factor = 2)
response_curve(models = fm, variable = "bio_7", extrapolation_factor = 2)
response_curve(models = fm, variable = "bio_12", extrapolation_factor = 2)
response_curve(models = fm, variable = "bio_15", extrapolation_factor = 2)

par(mfrow = c(1, 1)) # Reset plotting grid

Note that the response curve now extends further beyond the observed data range (indicated by the dashed lines).

Optionally, we can manually set the lower and upper limits of the variables. For example, since bio_12 represents annual precipitation and negative values are not meaningful, we can set its lower limit to 0:

response_curve(models = fm, variable = "bio_12", 
               extrapolation_factor = 0.1, 
               l_limit = 0)

Now, the lower limit of the plot for bio_12 is set to 0. Since we did not specify an upper limit, the plot uses the extrapolation factor (here, 0.1) to define the upper limit.

Optionally, we can add the original presence and background points to the plot by setting add_points = TRUE:

response_curve(models = fm, variable = "bio_1", 
               add_points = TRUE)


Variable importance

The relative importance of predictor variables can be calculated using explained deviance through the variable_importance() function. This process starts by fitting the full model (maxnet or glm), which includes all predictor variables. Then, the function fits separate models, excluding one variable at a time, to assess how each removal affects model performance.

By systematically evaluating the impact of each predictor’s exclusion, the function provides insights into the individual contribution of each variable to the model’s overall performance and explanatory power.
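
As a conceptual sketch (not the kuenm2 internals), the same idea can be illustrated with a plain glm, assuming a hypothetical data frame d with a binary pr_bg column and the predictors:

# Full model vs. a model without bio_1 (d is a hypothetical data frame)
full_m  <- glm(pr_bg ~ bio_1 + bio_7, data = d, family = binomial)
no_bio1 <- glm(pr_bg ~ bio_7, data = d, family = binomial)
# The increase in deviance when bio_1 is dropped reflects its contribution
deviance(no_bio1) - deviance(full_m)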

By default, the function runs on a single core. You can enable parallel processing by setting parallel = TRUE and specifying the number of cores with ncores. Note that parallelization only speeds up the computation when there are many variables (more than 7) and a large calibration dataset (over 15,000 presence and background points).
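
A sketch of such a parallel call (here with 2 cores) would be:

# Variable importance computed in parallel (worthwhile mainly for large datasets)
imp_par <- variable_importance(models = fm, parallel = TRUE, ncores = 2)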

By default, variable importance is computed for all selected models:

# Calculate variable importance
imp <- variable_importance(models = fm)

# Calculating variable contribution for model 1 of 2
#   |======================================================================| 100%
# Calculating variable contribution for model 2 of 2
#   |======================================================================| 100%

The function returns a data.frame with the relative contribution of each variable. If multiple models are included in the fitted object, an additional column identifies each distinct model.

imp
#>      predictor contribution    Models
#> 1   I(bio_1^2)  0.330546900 Model_192
#> 2        bio_1  0.295001412 Model_192
#> 3       bio_15  0.209668355 Model_192
#> 4  I(bio_15^2)  0.156926301 Model_192
#> 5   I(bio_7^2)  0.006279520 Model_192
#> 6        bio_7  0.001577512 Model_192
#> 7   I(bio_1^2)  0.368476226 Model_219
#> 8        bio_1  0.332782286 Model_219
#> 9       bio_15  0.155708606 Model_219
#> 10 I(bio_15^2)  0.095186390 Model_219
#> 11 I(bio_12^2)  0.022722557 Model_219
#> 12      bio_12  0.021793617 Model_219
#> 13  I(bio_7^2)  0.003330318 Model_219

We can visualize variable importance using the plot_importance() function. When the fitted_models object contains more than one selected model, the plot displays a boxplot of contributions, along with the mean contribution and the number (N) of fitted models.
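
For example, using the importance values computed above for the two selected models:

# Boxplot of variable contributions across the selected models
plot_importance(imp)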

If variable importance is computed for a single model, the plot displays a barplot instead of a boxplot:

# Calculate variable importance for a specific selected Model
imp_192 <- variable_importance(models = fm, modelID = "Model_192",
                               progress_bar = FALSE)
#Plot variable contribution for model 192
plot_importance(imp_192, main = "Variable importance - Model 192")


Evaluate models with independent data

We can evaluate the selected models using an independent set of presence records that were not used during model calibration. This approach is especially useful when new records become available or when working with invasive species. In the latter case, models are first calibrated using presence data from the native area and subsequently evaluated with data from the invaded area.

The independent_evaluation() function computes the omission rate - i.e., the proportion of independent records that fall in unsuitable areas, where “unsuitable” is defined according to the omission threshold used during model calibration and selection (e.g., 10%) - as well as the partial ROC. It also assesses whether the environmental conditions in the independent data are analogous (i.e., within the range) to those in the calibration data.

As an example of independent data, let’s use the new_occ data example provided in the package. It contains coordinates of Myrcia hatschbachii records sourced from NeotropicTree (Oliveira-Filho, 2017) that were not part of the occ_data used to train the models.

# Import independent records
data("new_occ", package = "kuenm2")
# See data structure
str(new_occ)
#> Classes 'data.table' and 'data.frame':   82 obs. of  3 variables:
#>  $ species: chr  "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" "Myrcia hatschbachii" ...
#>  $ x      : num  -48.3 -49.1 -49.9 -49.4 -49.9 ...
#>  $ y      : num  -25.2 -25 -24.5 -24.5 -24.8 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

To predict the models at this new set of occurrence records, we need to extract the environmental conditions at their locations. Let’s import the variables used to fit the models and extract their values at the new_occ coordinates:

# Import raster layers
var <- terra::rast(system.file("extdata", "Current_variables.tif",
                               package = "kuenm2"))

# Extract variables to occurrences
new_data <- extract_occurrence_variables(occ = new_occ, x = "x", y = "y",
                                         raster_variables = var)
# See data structure
str(new_data)
#> 'data.frame':    82 obs. of  8 variables:
#>  $ pr_bg   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ x       : num  -48.3 -49.1 -49.9 -49.4 -49.9 ...
#>  $ y       : num  -25.2 -25 -24.5 -24.5 -24.8 ...
#>  $ bio_1   : num  20.2 18 16.6 17.8 16.7 ...
#>  $ bio_7   : num  16.7 18.2 19.9 19.4 20.1 ...
#>  $ bio_12  : num  2015 1456 1526 1414 1578 ...
#>  $ bio_15  : num  43.8 33.7 29.1 32.2 26.5 ...
#>  $ SoilType: num  6 6 6 6 6 10 10 10 10 10 ...

Especially when using independent records that fall outside the calibration area (e.g., in an invaded region on another continent), it is common for the environmental conditions in these new records to be non-analogous (i.e., outside the range) to those in the calibration data.

To better illustrate this case, let’s add three fake records, in which some variables have non-analogous values, either higher than the upper limit or lower than the lower limit observed in the calibration data.

#Add some fake data beyond the limits of calibration ranges
fake_data <- data.frame("pr_bg" = c(1, 1, 1),
                        "x" = c(NA, NA, NA),
                        "y" = c(NA, NA, NA),
                        "bio_1" = c(10, 15, 23),
                        "bio_7" = c(12, 16, 20),
                        "bio_12" = c(2300, 2000, 1000),
                        "bio_15" = c(30, 40, 50),
                        "SoilType" = c(1, 1, 1))
# Bind data
new_data <- rbind(new_data, fake_data)

Now, let’s evaluate this independent dataset (keep in mind that the last three records are fake):

# Evaluate models with independent data
res_ind <- independent_evaluation(fitted_models = fm,
                                  new_data = new_data)

The output is a list with three elements. The first one, evaluation, presents the evaluation metrics (omission rate and pROC) for each selected model, as well as for the overall consensus. We can see that all selected models have significant pROC values, but show higher omission rates (around 40% of the independent records fall in areas predicted as unsuitable) compared to the threshold specified during model calibration (10%).

res_ind$evaluation
#>               Model consensus Omission_rate_at_10 Mean_AUC_ratio pval_pROC
#> 1         Model_192      mean           0.4352941       1.166789         0
#> 2         Model_192    median           0.4000000       1.139045         0
#> 3         Model_219      mean           0.4235294       1.190714         0
#> 4         Model_219    median           0.4000000       1.138885         0
#> 5 General_consensus    median           0.4000000       1.136853         0
#> 6 General_consensus      mean           0.4352941       1.180920         0

When perform_mop is set to TRUE, the function also returns mop_results, a list containing the output of the mop::mop() function. The main results of the MOP analysis are appended to the predictions element of the output list.
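
For clarity, a sketch of the same call with the options discussed in this section made explicit (perform_mop, return_predictions, and return_binary, as referenced in the text):

# Explicitly request the MOP analysis and the continuous and binary predictions
res_ind <- independent_evaluation(fitted_models = fm,
                                  new_data = new_data,
                                  perform_mop = TRUE,
                                  return_predictions = TRUE,
                                  return_binary = TRUE)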

For each independent record, the following information is provided:

  • mop_distance: the distance (i.e., dissimilarity) between the environmental conditions in calibration data and those at the location of the independent record.
  • inside_range: whether the environmental conditions at the location of the independent record fall within the calibration range.
  • n_var_out: the number of variables at the location of the independent record that are non-analogous (i.e., fall outside the calibration range).
  • towards_low: the names of variables with values lower than the minimum observed in the calibration data.
  • towards_high: the names of variables with values higher than the maximum observed in the calibration data.

# Show the mop results for the last 5 independent records
res_ind$predictions$continuous[81:85, c("mop_distance", "inside_range", "n_var_out",
                                        "towards_low", "towards_high")]
#>    mop_distance inside_range n_var_out  towards_low towards_high
#> 81     7.753558         TRUE         0         <NA>         <NA>
#> 82     6.622770         TRUE         0         <NA>         <NA>
#> 83   157.782905        FALSE         3 bio_7, bio_1       bio_12
#> 84    21.045482         TRUE         0         <NA>         <NA>
#> 85   183.905420        FALSE         2       bio_12        bio_1

Note that two of the three fake records we added to new_data have non-analogous environmental conditions. One of them falls in a location where bio_7 and bio_1 have values lower than the minimum observed in the calibration data, while bio_12 has a value higher than the maximum. Another record is in a location where bio_12 is below the calibration range, and bio_1 exceeds the upper limit.

When we set return_predictions = TRUE, the function also returns the continuous predictions for each selected model and for the general consensus:

# Show the continuous predictions for the last 5 independent records
# Round to two decimal places
round(res_ind$predictions$continuous[81:85, 1:6], 2)
#>    Model_192.mean Model_192.median Model_219.mean Model_219.median
#> 81           0.57             0.63           0.58             0.65
#> 82           0.54             0.60           0.55             0.62
#> 83           1.00             1.00           0.96             1.00
#> 84           1.00             1.00           1.00             1.00
#> 85           0.00             0.00           0.00             0.00
#>    General_consensus.median General_consensus.mean
#> 81                     0.63                   0.58
#> 82                     0.61                   0.54
#> 83                     1.00                   0.98
#> 84                     1.00                   1.00
#> 85                     0.00                   0.00

If we set return_binary = TRUE, the function also returns the binary predictions, with values classified as suitable (1) or unsuitable (0), based on the threshold calculated using the omission rate applied during model evaluation and selection:

# Show the binary predictions for the last 5 independent records
res_ind$predictions$binary[81:85, 1:6]
#>    Model_192.mean Model_192.median Model_219.mean Model_219.median
#> 81              1                1              1                1
#> 82              1                1              1                1
#> 83              1                1              1                1
#> 84              1                1              1                1
#> 85              0                0              0                0
#>    General_consensus.mean General_consensus.median
#> 81                      1                        1
#> 82                      1                        1
#> 83                      1                        1
#> 84                      1                        1
#> 85                      0                        0

These results help determine whether the independent data should be incorporated into the calibration dataset to re-run the models. If omission rates for the independent records are too high, pROC values are non-significant, or most of the records fall in locations with non-analogous environmental conditions, it may be advisable to re-run the models, this time including the independent data.

Saving a fitted_models object

After fitting the best-performing models with fit_selected(), we can proceed to predict the models for single or multiple scenarios. Because this object is essentially a list, it can be saved to a local directory with saveRDS() and loaded back into a later R session with readRDS(). See an example below:

# Set directory to save (here, in a temporary directory)
dir_to_save <- file.path(tempdir())

# Save the data:
saveRDS(fm, file.path(dir_to_save, "fitted_models.rds"))

# Import data
fm <- readRDS(file.path(dir_to_save, "fitted_models.rds"))