Lab 1d. Species Distribution Modeling - Evaluate Models

Learning Objectives

Now you’ll complete the modeling workflow with the steps to evaluate model performance and calibrate model parameters.

Full model workflow with calibrate and evaluate steps emphasized.

Figure 1: Full model workflow with calibrate and evaluate steps emphasized.

1 Setup

# global knitr chunk options
knitr::opts_chunk$set(
  warning = FALSE, 
  message = FALSE)

# load packages
librarian::shelf(
  dismo, # species distribution modeling: maxent(), predict(), evaluate(), 
  dplyr, ggplot2, GGally, here, maptools, readr, 
  raster, readr, rsample, sf,
  usdm)  # uncertainty analysis for species distribution models: vifcor()
select = dplyr::select

# options
set.seed(42)
options(
  scipen = 999,
  readr.show_col_types = F)
ggplot2::theme_set(ggplot2::theme_light())

# paths
dir_data      <- here("data/sdm")
pts_geo       <- file.path(dir_data, "pts.geojson")
env_stack_grd <- file.path(dir_data, "env_stack.grd")
mdl_maxv_rds  <- file.path(dir_data, "mdl_maxent_vif.rds")

# read points of observation: presence (1) and absence (0)
pts <- read_sf(pts_geo)

# read raster stack of environment
env_stack <- raster::stack(env_stack_grd)

1.1 Split observations into training and testing

# create training set with 80% of full data
pts_split  <- rsample::initial_split(
  pts, prop = 0.8, strata = "present")
pts_train  <- rsample::training(pts_split)
pts_test   <- rsample::testing(pts_split)

pts_train_p <- pts_train %>% 
  filter(present == 1) %>% 
  as_Spatial()
pts_train_a <- pts_train %>% 
  filter(present == 0) %>% 
  as_Spatial()

2 Calibrate: Model Selection

# show pairs plot before multicollinearity reduction with vifcor()
pairs(env_stack)

# calculate variance inflation factor per predictor, a metric of multicollinearity between variables
vif(env_stack)
   Variables      VIF
1     WC_alt 8.254995
2    WC_bio1 6.948011
3    WC_bio2 1.140249
4     ER_tri 5.066341
5 ER_topoWet 4.059576
# stepwise reduce predictors, based on a max correlation of 0.7 (max 1)
v <- vifcor(env_stack, th=0.7) 
v
2 variables from the 5 input variables have collinearity problem: 
 
WC_alt ER_tri 

After excluding the collinear variables, the linear correlation coefficients ranges between: 
min correlation ( ER_topoWet ~ WC_bio2 ):  -0.06974524 
max correlation ( ER_topoWet ~ WC_bio1 ):  0.6474975 

---------- VIFs of the remained variables -------- 
   Variables      VIF
1    WC_bio1 1.785410
2    WC_bio2 1.041940
3 ER_topoWet 1.730377
# reduce enviromental raster stack by 
env_stack_v <- usdm::exclude(env_stack, v)

# show pairs plot after multicollinearity reduction with vifcor()
pairs(env_stack_v)

# fit a maximum entropy model
if (!file.exists(mdl_maxv_rds)){
  mdl_maxv <- maxent(env_stack_v, sf::as_Spatial(pts_train))
  readr::write_rds(mdl_maxv, mdl_maxv_rds)
}
mdl_maxv <- read_rds(mdl_maxv_rds)

# plot variable contributions per predictor
plot(mdl_maxv)
# plot term plots
response(mdl_maxv)
# predict
y_maxv <- predict(env_stack, mdl_maxv) #, ext=ext, progress='')

plot(y_maxv, main='Maxent, raw prediction')
data(wrld_simpl, package="maptools")
plot(wrld_simpl, add=TRUE, border='dark grey')

3 Evaluate: Model Performance

3.1 Area Under the Curve (AUC), Reciever Operater Characteristic (ROC) Curve and Confusion Matrix

pts_test_p <- pts_test %>% 
  filter(present == 1) %>% 
  as_Spatial()
pts_test_a <- pts_test %>% 
  filter(present == 0) %>% 
  as_Spatial()

y_maxv <- predict(mdl_maxv, env_stack)
#plot(y_maxv)

e <- dismo::evaluate(
  p     = pts_test_p,
  a     = pts_test_a, 
  model = mdl_maxv,
  x     = env_stack)
e
class          : ModelEvaluation 
n presences    : 98 
n absences     : 100 
AUC            : 0.8598469 
cor            : 0.6240488 
max TPR+TNR at : 0.6251154 
plot(e, 'ROC')

thr <- threshold(e)[['spec_sens']]
thr
[1] 0.6251154
p_true <- na.omit(raster::extract(y_maxv, pts_test_p) >= thr)
a_true <- na.omit(raster::extract(y_maxv, pts_test_a) < thr)

# (t)rue/(f)alse (p)ositive/(n)egative rates
tpr <- sum(p_true)/length(p_true)
fnr <- sum(!p_true)/length(p_true)
fpr <- sum(!a_true)/length(a_true)
tnr <- sum(a_true)/length(a_true)

matrix(
  c(tpr, fnr,
    fpr, tnr), 
  nrow=2, dimnames = list(
    c("present_obs", "absent_obs"),
    c("present_pred", "absent_pred")))
            present_pred absent_pred
present_obs    0.8265306        0.15
absent_obs     0.1734694        0.85
# add point to ROC plot
points(fpr, tpr, pch=23, bg="blue")
plot(y_maxv > thr)

4 Lab1 Submission

To submit Lab 1, please submit the path on taylor.bren.ucsb.edu to your single consolidated Rmarkdown document (i.e. combined from separate parts) that you successfully knitted here:

The path should start /Users/ and end in .Rmd. Please be sure to have succesfully knitted to html, as in I should see the .html file there by the same name.

In your lab, please be sure to include:

# Lab Item Points
1 1a.Explore Name _Genus species_ used to search GBIF for observations. (0.5 pts) 0.5
2 1a.Explore Image of your species. I like to see wildlife. 0.5
3 1a.Explore Map of distribution of points 0.5
4 1a.Explore Question: How many observations total are in GBIF for your species? 0.5
5 1a.Explore Question: Did you have to perform any data cleaning steps? 0.5
6 1a.Explore Question: What environmental layers did you choose as predictors? Can you find any support for these in the literature? 1
7 1a.Explore Raster plot of environmental rasters clipped to species range 0.5
8 1a.Explore Map with pseudo-absence points 0.5
9 1a.Explore Term plots 0.5
10 1b.Regress Plot of ggpairs 0.4
11 1b.Regress Linear model output with range of values 0.4
12 1b.Regress Generalized Linear Model (GLM) output 0.4
13 1b.Regress GLM term plots 0.3
14 1b.Regress Generalized Additive Model (GAM) output 0.3
15 1b.Regress GAM term plots 0.3
16 1b.Regress Question: Which GAM environmental variables, and even range of values, seem to contribute most towards presence (above 0 response) versus absence (below 0 response)? 1
17 1b.Regress Maxent model output 0.3
18 1b.Regress Maxent variable contribution plot 0.3
19 1b.Regress Maxent term plots 0.3
20 1b.Regress Question: Which Maxent environmental variables, and even range of values, seem to contribute most towards presence (closer to 1 response) and how might this differ from the GAM results? 1
21 1c.Trees Tabular counts of 1 vs 0 before and after split 0.4
22 1c.Trees Rpart model output and plot, depth=1 0.4
23 1c.Trees Rpart model output and plot, depth=default 0.3
24 1c.Trees Rpart complexity parameter plot 0.3
25 1c.Trees Question: Based on the complexity plot threshold, what size of tree is recommended? 1
26 1c.Trees Rpart variable importance plot 0.3
27 1c.Trees Question: what are the top 3 most important variables of your model? 1
28 1c.Trees RandomForest variable importance 0.3
29 1c.Trees Question: How might variable importance differ between rpart and RandomForest in your model outputs? 1
30 1d.Evaluate Plot of pairs from environmental stack 0.3
31 1d.Evaluate VIF per variable 0.3
32 1d.Evaluate Variables after VIF collinearity removal 0.3
33 1d.Evaluate Plof of pairs after VIF collinearity removal 0.3
34 1d.Evaluate Plot of variable contribution 0.3
35 1d.Evaluate Question: Which variables were removed due to multicollinearity and what is the rank of most to least important remaining variables in your model? 1
36 1d.Evaluate Maxent term plots 0.4
37 1d.Evaluate Map of Maxent prediction 0.4
38 1d.Evaluate ROC threshold value maximizing specificity and sensitivity 0.4
39 1d.Evaluate Confusion matrix with percentages 0.5
40 1d.Evaluate AUC plot 0.4
41 1d.Evaluate Map of binary habitat 0.4