Fit and Explore Selected Models
model_exploration.Rmd
Introduction
After the best performing models have been selected, users need to
fit this models (using fit_selected()
) in order to explore
their characteristics and continue with the next steps. Fitted models
can then be used to assess variable importance in models, as well as to
explore variable response curves.
To fit the selected models, we need a
calibration_results
object. For more details in model
calibration, please refer to the vignette Model Calibration. The
calibration_results
object generated in this vignette is
available as a data example in the package. Let’s load it.
#Load packages
library(kuenm2)
library(terra)
#> terra 1.8.54
#Import calib_results_maxnet
data("calib_results_maxnet", package = "kuenm2")
#Print calibration result
calib_results_maxnet
#> calibration_results object summary (maxnet)
#> =============================================================
#> Species: Myrcia hatschbachii
#> Number of candidate models: 300
#> - Models removed because they failed to fit: 0
#> - Models identified with concave curves: 35
#> - Model with concave curves not removed
#> - Models removed with non-significant values of pROC: 0
#> - Models removed with omission error > 5%: 147
#> - Models removed with delta AIC > 2: 151
#> Selected models: 2
#> - Up to 5 printed here:
#> ID Formulas
#> 159 159 ~bio_1 + bio_7 + I(bio_1^2) + I(bio_7^2) -1
#> 189 189 ~bio_1 + bio_7 + bio_12 + I(bio_1^2) + I(bio_7^2) + I(bio_12^2) -1
#> Features R_multiplier pval_pROC_at_5.mean Omission_rate_at_5.mean dAIC
#> 159 lq 0.1 0 0.0192 0.4418045
#> 189 lq 0.1 0 0.0192 0.0000000
#> Parameters
#> 159 3
#> 189 5
This object contains the results of candidate models calibrated using
the maxnet
algorithm. The package also provides a similar
example using the glm
algorithm, which works in exactly the
same way.
#Import calib_results_glm
data("calib_results_glm", package = "kuenm2")
#Print calibration result
calib_results_glm
#> calibration_results object summary (glm)
#> =============================================================
#> Species: Myrcia hatschbachii
#> Number of candidate models: 100
#> - Models removed because they failed to fit: 0
#> - Models identified with concave curves: 18
#> - Model with concave curves not removed
#> - Models removed with non-significant values of pROC: 0
#> - Models removed with omission error > 5%: 50
#> - Models removed with delta AIC > 2: 49
#> Selected models: 1
#> - Up to 5 printed here:
#> ID
#> 86 86
#> Formulas
#> 86 ~bio_1 + bio_7 + bio_15 + I(bio_1^2) + I(bio_7^2) + I(bio_15^2) + bio_1:bio_7 + bio_1:bio_15 + bio_7:bio_15
#> Features pval_pROC_at_5.mean Omission_rate_at_5.mean dAIC Parameters
#> 86 lqp 0 0.0192 0 9
Note that the calibration_results
object stores only the
information related to the calibration process and model evaluation—it
does not include the fitted maxnet
(or glm
)
models themselves.
To obtain the final fitted models, we need to use the
fit_selected()
function. By default, this function fits a
full model (i.e., without replicates and without splitting the data into
training and testing sets). However, you can configure it to fit final
models with replicates if desired.
In this example, we’ll fit the final models using the same replication settings (4-fold cross-validation) as used in the Model Calibration vignette.
# Fit selected models using calibration results
fm <- fit_selected(calibration_results = calib_results_maxnet, #or calib_results_glm
n_replicates = 4)
# Fitting replicates...
# |========================================================================| 100%
# Fitting full models...
# |========================================================================| 100%
The fit_selected()
function returns a
fitted_models
object, a list that contains essential
information from the fitted models, which is required for the subsequent
steps.
You can explore the contents of the fitted_models
object
by indexing its elements. For example, the fitted maxnet
(or glm
) model objects are stored within the
Models
element. Note that Models
is a nested
list: for each selected model (in this case, models 159 and 189), it
includes both the replicates (if fitted with replicates) and the full
model.
#See names of selected models
names(fm$Models)
#> [1] "Model_159" "Model_189"
#See models inside Model 189
names(fm$Models$Model_189)
#> [1] "Rep_1" "Rep_2" "Rep_3" "Rep_4" "Full_model"
The fitted_models
object also stores the thresholds that
can be used to binarize the models into suitable and unsuitable areas.
These thresholds correspond to the omission error used during model
selection (e.g., 5% or 10%).
You can access the omission error used to calculate the thresholds directly from the object:
#Get omission error used to select models and calculate the thesholds
fm$omission_rate
#> [1] 5
The omission error used to calculate the thresholds was 5%, meaning that when the predictions are binarized, approximately 5% of the presence records used to calibrate the models will fall into cells with predicted values below the threshold.
The thresholds are summarized in two ways: the mean and median across replicates for each model, and the consensus mean and median across all selected models (when more than one model is selected):
fm$thresholds
#> $Model_159
#> $Model_159$mean
#> [1] 0.2976193
#>
#> $Model_159$median
#> [1] 0.314841
#>
#>
#> $Model_189
#> $Model_189$mean
#> [1] 0.3298963
#>
#> $Model_189$median
#> [1] 0.3491305
#>
#>
#> $consensus
#> $consensus$mean
#> [1] 0.315051
#>
#> $consensus$median
#> [1] 0.336047
Now, we can use the fitted_models
object to generate
response curves and compute variable importance.
Response curve
The response curves illustrate how each environmental variable influences the predicted suitability, while keeping all other variables constant.
By default, the curves are generated with all other variables set to
their mean values (or the mode for categorical variables), calculated
from the combined set of presence and background localities
(averages_from = "pr_bg"
). You can change this behavior to
use only the presence localities by setting
averages_from = "pr"
.
Let’s check which variables are available to plot by examining the coefficients of the full models:
#Get variables with non-zero coefficients in the models
fm$Models[[1]]$Full_model$betas #From the first model selected
#> bio_1 I(bio_1^2) I(bio_7^2)
#> 15.52377637 -0.46828240 -0.01261841
fm$Models[[2]]$Full_model$betas #From the second model selected
#> bio_1 bio_12 I(bio_1^2) I(bio_7^2) I(bio_12^2)
#> 1.462835e+01 3.592615e-02 -4.407921e-01 -1.208573e-02 -1.152889e-05
The variables bio_1
, bio_7
, and
bio_12
have non-zero coefficient values, which means they
contribute to the model and are available for generating response
curves.
By default, response curves are computed using all selected models. The resulting plots include a line for the mean response, along with a shaded area representing the 95% confidence interval.
par(mfrow = c(1, 3)) #Set grid of plot
response_curve(models = fm, variable = "bio_1")
response_curve(models = fm, variable = "bio_7")
response_curve(models = fm, variable = "bio_12")
on.exit() #Reinitiate grid
We can also specify which of the selected models should be used to generate the response curves:
par(mfrow = c(2, 2)) #Set grid of plot
response_curve(models = fm, variable = "bio_1",
modelID = "Model_159", main = "Model_159")
response_curve(models = fm, variable = "bio_1",
modelID = "Model_189", main = "Model_189")
response_curve(models = fm, variable = "bio_7",
modelID = "Model_159", main = "Model_159")
response_curve(models = fm, variable = "bio_7",
modelID = "Model_189", main = "Model_189")
on.exit() #Reinitiate grid
The dashed lines indicate the range of the variable within the
calibration data. By default, the plot extends beyond these limits based
on the variable’s minimum and maximum values and the
extrapolation_factor
(when
extrapolation = TRUE
). The default extrapolation is set to
10% of the variable’s range (i.e., range × 0.1).
If extrapolation = FALSE
, no extrapolation occurs, and
the plot limits match the calibration data range exactly.
We can increase the extrapolation factor to allow a broader range beyond the observed data. Below is the response curve plotted with an extrapolation factor of 2:
par(mfrow = c(1, 3)) #Set grid of plot
response_curve(models = fm, variable = "bio_1", extrapolation_factor = 2)
response_curve(models = fm, variable = "bio_7", extrapolation_factor = 2)
response_curve(models = fm, variable = "bio_12", extrapolation_factor = 2)
on.exit() #Reinitiate grid
Note that the response curve now extends further beyond the observed data range (indicated by the dashed lines).
Optionally, we can manually set the lower and upper limits of the
variables. For example, since bio_12
represents annual
precipitation and negative values are not meaningful, we can set its
lower limit to 0:
response_curve(models = fm, variable = "bio_12",
extrapolation_factor = 0.1,
l_limit = 0)
Now, the lower limit of the plot for bio_12
is set to 0.
Since we did not specify an upper limit, the plot uses the extrapolation
factor (here, 0.1) to define the upper limit.
Optionally, we can add the original presence and background points to
the plot by setting add_point = TRUE
:
response_curve(models = fm, variable = "bio_1",
add_points = TRUE)
Variable importance
The relative importance of predictor variables can be calculated
using explained deviance through the var_importance()
function. This process starts by fitting the full model
(maxnet
or glm
), which includes all predictor
variables. Then, the function fits separate models excluding one
variable at a time, assessing how the removal affects model
performance.
By systematically evaluating the impact of each predictor’s exclusion, the function provides insights into the individual contribution of each variable to the model’s overall performance and explanatory power.
By default, the function runs on a single core. You can enable
parallel processing by setting parallel = TRUE
and
specifying the number of cores with ncores
. Note that
parallelization only speeds up the computation when there are many
variables (more than 7) and a large calibration dataset (over 15,000
presence and background points).
By default, variable importance is computed for all selected models:
# Calculate variable importance
imp <- variable_importance(models = fm)
# Calculating variable contribution for model 1 of 2
# |======================================================================| 100%
# Calculating variable contribution for model 2 of 2
# |======================================================================| 100%
The function returns a data.frame
with the relative
contribution of each variable. If multiple models are included in the
fitted
object, an additional column identifies each
distinct model.
imp
#> predictor contribution Models
#> 1 I(bio_1^2) 0.515387743 Model_159
#> 2 bio_1 0.482802883 Model_159
#> 3 I(bio_7^2) 0.001809375 Model_159
#> 4 I(bio_1^2) 0.444939057 Model_189
#> 5 bio_1 0.425438027 Model_189
#> 6 I(bio_12^2) 0.065575776 Model_189
#> 7 bio_12 0.062618167 Model_189
#> 8 I(bio_7^2) 0.001428973 Model_189
We can visualize variable importance using the
plot_importance()
function. When the
fitted_models
object contains more than one selected model,
the plot displays a boxplot of contributions, along with the mean
contribution and the number (N) of fitted models.
plot_importance(imp)
If variable importance is computed for a single model, the plot displays a barplot instead of a boxplot:
# Calculate variable importance for a specific selected Model
imp_189 <- variable_importance(models = fm, modelID = "Model_189",
progress_bar = FALSE)
#Plot variable contribution for model 189
plot_importance(imp_189, main = "Variable importance - Model 189")
Saving a fitted_models object
After fitting the best-performing models with
fit_selected()
, we can proceed to predict the models for
single or multiple scenarios. As this object is essentially a list,
users can save it to a local directory using saveRDS()
.
Saving the object facilitates loading it back into your R session later
using readRDS()
. See an example below: