Chapter 6 Introduction to Species Distribution Modeling
Hammed A. Akande
This chapter briefly introduces Species Distribution Modeling (hereinafter, SDM) in R. In SDM, we relate species occurrence data (e.g. in presence-absence format) to environmental data (e.g. climatic variables) to predict the probability of a species occurring in an area, or its habitat suitability. In this tutorial, I assume you have basic knowledge of ecology/wildlife/conservation biology and statistics. I encourage you to watch my presentation video and read more online, especially if you are not familiar with ecology or the ecological factors that can affect species distributions.
Now, let’s start the modeling exercise.
To run this exercise, you need to install and load the required packages. Again, I assume you know how to install and load packages; if not, refer to my Day 1 slides and video (or check the introductory section of this book). N.B. If you have installed these packages before, there is no need to install them again - just load the libraries.
#install.packages("dismo")
#install.packages("maptools")
#install.packages("maps")
#install.packages("mapdata")
#install.packages("dplyr")
#install.packages("CoordinateCleaner")
#install.packages("raster")
#install.packages("ggplot2")
#install.packages("scales")
#install.packages("corrplot")
library(dismo)
library(maptools)
library(maps)
library(mapdata)
library(dplyr)
library(CoordinateCleaner)
library(rgbif)
library(corrplot)
library(raster)
Today, we are using GBIF to download species data (you can of course load and use your own data if you have any). We shall use the Mona monkey (Cercopithecus mona) as our study species. Again, please read about the ecology of the Mona monkey, and if you don't know GBIF (I explained it in class, though), read more online about the organization.
6.0.1 Downloading the Species Data
Mona <- occ_search(scientificName = "Cercopithecus mona", hasCoordinate = TRUE)
The occ_search() function queries GBIF for records of the species and checks whether data are available. If so, the records are downloaded - and only those with coordinates, because I set hasCoordinate to TRUE (after all, you want to spatially model data that have coordinates).
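Note that occ_search() returns only a limited number of records per query by default (500 at the time of writing). If you want more, you can raise the limit argument - a hedged sketch, not run here:
# Not run: same query, but requesting up to 2000 records
# Mona <- occ_search(scientificName = "Cercopithecus mona",
#                    hasCoordinate = TRUE, limit = 2000)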
The output is stored as "Mona" (in the form of a list), but we only need the data element of it, so retain that alone.
Mona <- Mona$data
View(Mona) # you can view the data to see how it is structured
head(Mona) # to see the first six observations
## # A tibble: 6 x 142
## key scientificName decimalLatitude decimalLongitude issues datasetKey publishingOrgKey installationKey
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 25739~ Cercopithecus ~ 12.1 -61.7 gass84 50c9509d-~ 28eb1a3f-1c15-4~ 997448a8-f762-~
## 2 28234~ Cercopithecus ~ 12.1 -61.7 cdroun~ 50c9509d-~ 28eb1a3f-1c15-4~ 997448a8-f762-~
## 3 33113~ Cercopithecus ~ 7.04 2.00 cdroun~ 70d19075-~ 071d63b5-ae0b-4~ 842b6a28-624a-~
## 4 24451~ Cercopithecus ~ 0.0459 6.53 cdroun~ 50c9509d-~ 28eb1a3f-1c15-4~ 997448a8-f762-~
## 5 26112~ Cercopithecus ~ 6.96 2.10 cdroun~ 50c9509d-~ 28eb1a3f-1c15-4~ 997448a8-f762-~
## 6 29885~ Cercopithecus ~ 12.1 -61.7 cdroun~ 50c9509d-~ 28eb1a3f-1c15-4~ 997448a8-f762-~
## # ... with 134 more variables: publishingCountry <chr>, protocol <chr>, lastCrawled <chr>, lastParsed <chr>,
## # crawlId <int>, hostingOrganizationKey <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
## # taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
## # genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>,
## # phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>, genericName <chr>,
## # specificEpithet <chr>, taxonRank <chr>, taxonomicStatus <chr>, iucnRedListCategory <chr>,
## # dateIdentified <chr>, stateProvince <chr>, year <int>, month <int>, day <int>, eventDate <chr>,
## # modified <chr>, lastInterpreted <chr>, references <chr>, license <chr>, identifiers <chr>, facts <chr>,
## # relations <chr>, isInCluster <lgl>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
## # recordedByIDs <chr>, identifiedByIDs <chr>, country <chr>, rightsHolder <chr>, identifier <chr>,
## # http...unknown.org.nick <chr>, verbatimEventDate <chr>, datasetName <chr>, verbatimLocality <chr>,
## # gbifID <chr>, collectionCode <chr>, occurrenceID <chr>, taxonID <chr>, catalogNumber <chr>,
## # recordedBy <chr>, http...unknown.org.occurrenceDetails <chr>, institutionCode <chr>, rights <chr>,
## # eventTime <chr>, identifiedBy <chr>, identificationID <chr>, name <chr>,
## # coordinateUncertaintyInMeters <dbl>, projectId <chr>, continent <chr>, habitat <chr>, locality <chr>,
## # samplingProtocol <chr>, occurrenceRemarks <chr>, individualCount <int>, lifeStage <chr>,
## # vernacularName <chr>, higherClassification <chr>, collectionID <chr>, programmeAcronym <chr>,
## # organismQuantityType <chr>, institutionID <chr>, county <chr>,
## # http...rs.tdwg.org.dwc.terms.organismQuantityType <chr>, collectionKey <chr>, institutionKey <chr>,
## # identifiedByIDs.type <chr>, identifiedByIDs.value <chr>, sex <chr>,
## # X.99d66b6c.9087.452f.a9d4.f15f2c2d0e7e. <chr>, establishmentMeans <chr>, higherGeography <chr>,
## # nomenclaturalCode <chr>, endDayOfYear <chr>, georeferenceVerificationStatus <chr>, language <chr>,
## # type <chr>, startDayOfYear <chr>, accessRights <chr>, ...
Next, let's define the geographic extent of the records, i.e. the minimum and maximum longitude and latitude of the species occurrences.
max.lat <- ceiling(max(Mona$decimalLatitude))
min.lat <- floor(min(Mona$decimalLatitude))
max.lon <- ceiling(max(Mona$decimalLongitude))
min.lon <- floor(min(Mona$decimalLongitude))
geographic.extent <- extent(x = c(min.lon, max.lon, min.lat, max.lat))

geographic.extent
## class : Extent
## xmin : -62
## xmax : 30
## ymin : -5
## ymax : 49
Now, let's plot the records on a map to see where our species is located in space.
data(wrld_simpl)
# Base map
plot(wrld_simpl,
xlim = c(min.lon, max.lon),
ylim = c(min.lat, max.lat),
axes = TRUE,
col = "grey95")
# Individual obs points
points(x = Mona$decimalLongitude,
y = Mona$decimalLatitude,
col = "red",
pch = 20,
cex = 0.75)
box()
Voilà! That's better - seeing the records on a map makes them much easier to interpret.
6.0.2 Cleaning the coordinates and checking for outliers
Now you know where the records are located - but are you sure all the points are accurate? Do they all look correct? Could there be errors from specimen collection or data entry, or bias of any kind? The best way to be sure is to run some data-quality checks.
Let's use the CoordinateCleaner package for this. Here I test for duplicates, country centroids, geographic outliers, and records falling close to biodiversity institutions. Run ?CoordinateCleaner to learn more about the available tests.
clean_Mona <- clean_coordinates(Mona, lon = "decimalLongitude", lat = "decimalLatitude",
                                tests = c("centroids", "outliers", "duplicates", "institutions"),
                                inst_rad = 10000)
## Testing coordinate validity
## Flagged 0 records.
## Testing country centroids
## Flagged 1 records.
## Testing geographic outliers
## Flagged 23 records.
## Testing biodiversity institutions
## Flagged 14 records.
## Testing duplicates
## Flagged 185 records.
## Flagged 206 of 373 records, EQ = 0.55.
Wow - 206 of 373 records flagged. See why you need to clean your data? Otherwise, your model(s) will be biased.
Let's subset the data to keep only the clean records.
clean_Mona <- clean_Mona[clean_Mona$.summary, ]
If you inspect clean_Mona, you will see it still contains many variables. We don't need all of them for the analysis, so let's retain only the ones we need.
= clean_Mona[, c("species", "decimalLatitude", "decimalLongitude")]
clean_Mona
# or through the dplyr package. Of course, you will get the same result
<-clean_Mona %>%
clean_Mona ::select(species, decimalLatitude, decimalLongitude) dplyr
Remember, the data currently contain just the species name. We need a presence/absence format (as I explained in class), so let's code each record as 1 (for presence).
Mona_P <- data.frame(clean_Mona, occ = 1)
head(Mona_P)
## species decimalLatitude decimalLongitude occ
## 3 Cercopithecus mona 7.038812 1.999688 1
## 4 Cercopithecus mona 0.045902 6.528708 1
## 5 Cercopithecus mona 6.956431 2.097323 1
## 7 Cercopithecus mona 7.721062 -1.701965 1
## 8 Cercopithecus mona 6.438185 3.534619 1
## 9 Cercopithecus mona 6.591946 3.948594 1
# You see a new column has been added, called "occ" (short for occurrence)
# If you wish to export this clean data (e.g. to csv), for further analysis, you can do that now.
#write.csv(clean_Mona, "Mona_cleaned.csv", row.names = FALSE)
Don't forget that in (correlative) SDM, we relate species occurrences to their environment to understand the factors affecting them. So, let's get the climate data.
6.0.3 Download Climate data
You can get climate data from WorldClim, CHELSA, and PaleoClim, among others.
# You may want to create a directory to store the data
if (!dir.exists("bioclim_data")) {
  dir.create("bioclim_data", recursive = TRUE)
}

clim_data <- getData(name = "worldclim",
                     var = "bio",
                     res = 5,
                     path = "bioclim_data",
                     download = TRUE)
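A caveat: raster::getData() depends on download servers that have changed over time, so the call above may fail on some setups. As a fallback, the newer geodata package offers an equivalent download - a hedged sketch under that assumption (geodata returns a terra SpatRaster, so you would need to convert it to a raster stack for the raster-based code used in the rest of this chapter):
# Alternative download (assumes the geodata package is installed; not run):
# library(geodata)
# clim_data <- worldclim_global(var = "bio", res = 5, path = "bioclim_data")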
In SDM, every presence needs to be contrasted with absences. As you may know, true absence data are often unavailable, so we generate background (or pseudo-absence) points instead.
6.0.4 Generate Background data
Let's generate background data, using the climate data we just downloaded to define the sampling resolution.
bil.files <- list.files(path = "bioclim_data/wc5",
                        pattern = "*.bil$",
                        full.names = TRUE)

# Let's just use one of the .bil files to mask the background data; we don't really need all of them
mask <- raster(bil.files[1])
# Use the randomPoints function to sample points at random. We sample the same
# number of points as our observations, extending the sampling extent by a
# factor of 1.25. Matching the number of occurrence points (with a little room
# for extension) keeps us conservative and reduces bias.
background <- randomPoints(mask = mask, n = nrow(Mona_P), ext = geographic.extent, extf = 1.25)
Now, let's plot both on the map (presences and pseudo-absences).
plot(wrld_simpl,
xlim = c(min.lon, max.lon),
ylim = c(min.lat, max.lat),
axes = TRUE,
col = "grey35",
main = "Presence and pseudo-absence points")
# Add the background points
points(background, col = "green", pch = 1, cex = 0.75)
# Add the observations
points(x = Mona_P$decimalLongitude,
y = Mona_P$decimalLatitude,
col = "red",
pch = 20,
cex = 0.75)
box()
Now we can join the presence and background data together.
= Mona_P[, c("decimalLongitude", "decimalLatitude", "occ")] # since we don't need the column "species" again, we can remove it.
Mona_P
<- data.frame(background) # put it in dataframe
background_dat summary(background_dat)
## x y
## Min. :-72.792 Min. :-11.542
## 1st Qu.: -2.708 1st Qu.: 1.458
## Median : 12.958 Median : 14.792
## Mean : 4.572 Mean : 17.108
## 3rd Qu.: 29.167 3rd Qu.: 29.792
## Max. : 40.625 Max. : 55.042
# Rename the columns from "x" and "y" to match the presence data
names(background_dat) <- c('decimalLongitude', 'decimalLatitude')

background_dat$occ <- 0  # code absences as 0 (remember, we coded presences as 1)
summary(background_dat)
## decimalLongitude decimalLatitude occ
## Min. :-72.792 Min. :-11.542 Min. :0
## 1st Qu.: -2.708 1st Qu.: 1.458 1st Qu.:0
## Median : 12.958 Median : 14.792 Median :0
## Mean : 4.572 Mean : 17.108 Mean :0
## 3rd Qu.: 29.167 3rd Qu.: 29.792 3rd Qu.:0
## Max. : 40.625 Max. : 55.042 Max. :0
Mona_PA <- rbind(Mona_P, background_dat)  # use rbind() to row-bind them
summary(Mona_PA)
## decimalLongitude decimalLatitude occ
## Min. :-72.792 Min. :-11.542 Min. :0.0
## 1st Qu.: 1.738 1st Qu.: 5.896 1st Qu.:0.0
## Median : 2.686 Median : 6.909 Median :0.5
## Mean : 3.857 Mean : 11.830 Mean :0.5
## 3rd Qu.: 14.417 3rd Qu.: 14.792 3rd Qu.:1.0
## Max. : 40.625 Max. : 55.042 Max. :1.0
Mona_PA <- data.frame(Mona_PA)

dplyr::sample_n(Mona_PA, 10)  # randomly check 10 observations
## decimalLongitude decimalLatitude occ
## 92 2.512721 6.904417 1
## 240 9.400000 5.833300 1
## 321 -65.875000 -7.541667 0
## 110 2.427500 8.490000 1
## 541 -68.291667 8.291667 0
## 179 -13.125000 11.625000 0
## 259 18.000000 9.000000 1
## 125 2.166667 6.916667 1
## 751 14.041667 28.625000 0
## 1410 38.375000 14.625000 0
6.0.5 Extract the environmental data for the Mona coordinates
Mona_PA <- cbind(Mona_PA,
                 raster::extract(x = clim_data,
                                 y = data.frame(Mona_PA[, c('decimalLongitude', 'decimalLatitude')]),
                                 cellnumbers = TRUE))
# Check if there are duplicated cells
duplicated(Mona_PA$cells)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [18] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [35] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [52] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [69] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [86] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [103] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [120] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [154] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [171] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [222] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [239] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [256] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [273] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [290] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [307] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [324] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
You can see some duplicated cells, right? So, let's retain only the non-duplicated cells (you don't want duplicated cells in the analysis).
Mona_PA <- Mona_PA[!duplicated(Mona_PA$cells), ]
# Now check again if there are duplicated cells (I am certain it will all be FALSE now)
duplicated(Mona_PA$cells)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [154] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [171] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [222] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [239] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [256] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [273] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [290] FALSE
Check for missing values (NA)
any(is.na(Mona_PA)) # Check for NA
## [1] TRUE
Clear enough, right? We have missing values, so let's remove them.
Mona_PA <- na.omit(Mona_PA)  # remove rows with NA
# check again. This time, it should be FALSE
any(is.na(Mona_PA))
## [1] FALSE
That's it - we can start the process of model fitting.
Before we do, it's a good idea to test for multicollinearity, to be sure we don't feed the model highly correlated variables. I explained in class why this is a problem - watch the video or read more online.
6.0.6 Test for Multicollinearity
Build a correlation matrix
cor_mat <- cor(Mona_PA[, -c(1:6)], method = 'spearman')

corrplot.mixed(cor_mat, tl.pos = 'd', tl.cex = 0.6, number.cex = 0.5, addCoefasPercent = TRUE)
We can use a function called select07() to remove highly correlated variables (pairwise correlation greater than 0.7, i.e. 70%); see Dormann et al. (2013).
library(devtools)
#devtools::install_git("https://gitup.uni-potsdam.de/macroecology/mecofun.git")
library(mecofun)
# Run select07()
var_sel <- select07(X = Mona_PA[, -c(1:4)],
                    y = Mona_PA$occ,
                    threshold = 0.7)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Check out the structure of the resulting object:
str(var_sel)
## List of 3
## $ AIC : Named num [1:19] 197 197 213 241 257 ...
## ..- attr(*, "names")= chr [1:19] "bio4" "bio6" "bio3" "bio7" ...
## $ cor_mat : num [1:19, 1:19] 1 -0.108 0.365 -0.277 0.494 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:19] "bio1" "bio2" "bio3" "bio4" ...
## .. ..$ : chr [1:19] "bio1" "bio2" "bio3" "bio4" ...
## $ pred_sel: chr [1:9] "bio4" "bio6" "bio15" "bio19" ...
# Extract the names of the weakly correlated predictors in order of their AIC:
pred_sel <- var_sel$pred_sel
pred_sel
## [1] "bio4" "bio6" "bio15" "bio19" "bio9" "bio10" "bio2" "bio8" "bio18"
These are the retained predictors, listed in order of importance.
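If you cannot install mecofun (it lives on a university Git server that may not always be reachable), the caret package offers a comparable correlation filter - a sketch assuming caret is installed; note that, unlike select07(), it does not rank the variables by AIC:
# library(caret)
# high_cor <- findCorrelation(cor_mat, cutoff = 0.7)  # indices of columns to drop
# pred_sel_alt <- colnames(cor_mat)[-high_cor]        # weakly correlated predictors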
6.0.7 Model selection
We can fit regression models of different functional forms to predict our species: linear, quadratic, or higher-order polynomial. We can then use AIC (or VIF) to determine which form works best. For the sake of this exercise, I will continue with a linear relationship, but let's first compare it against a quadratic one.
# Take any bioclim variable and fit a GLM assuming a linear relationship:
model_linear <- glm(occ ~ bio19, family = binomial(link = logit), data = Mona_PA)
summary(model_linear)
##
## Call:
## glm(formula = occ ~ bio19, family = binomial(link = logit), data = Mona_PA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0272 -0.7494 -0.6247 1.0343 1.6464
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5350688 0.2064067 -7.437 1.03e-13 ***
## bio19 0.0038553 0.0005399 7.141 9.26e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 393.60 on 288 degrees of freedom
## Residual deviance: 310.23 on 287 degrees of freedom
## AIC: 314.23
##
## Number of Fisher Scoring iterations: 4
Okay, let's fit a quadratic relationship with the same bioclim variable used above:
model_quad <- glm(occ ~ bio19 + I(bio19^2), family = binomial(link = logit), data = Mona_PA)
summary(model_quad)
##
## Call:
## glm(formula = occ ~ bio19 + I(bio19^2), family = binomial(link = logit),
## data = Mona_PA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1023 -0.6376 -0.3866 0.7762 1.7656
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.557e+00 3.115e-01 -8.207 2.27e-16 ***
## bio19 1.071e-02 1.363e-03 7.860 3.83e-15 ***
## I(bio19^2) -6.134e-06 1.003e-06 -6.112 9.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 393.60 on 288 degrees of freedom
## Residual deviance: 271.78 on 286 degrees of freedom
## AIC: 277.78
##
## Number of Fisher Scoring iterations: 4
We can now use AIC, which is based on the maximized likelihood (penalized for the number of parameters), to select the better model and fit the SDM. N.B. the lower the AIC, the better: the model with the lowest AIC is the one to select.
AIC(model_linear)
## [1] 314.2324
AIC(model_quad)
## [1] 277.7795
Voilà! Including the quadratic term clearly makes more sense for this model. However, as I said earlier, for the sake of this exercise I will continue with the linear model to demonstrate the workflow. If you want to do more (include the quadratic term, or anything else), go ahead using the same model formula as above, or reach out to me if you have issues or concerns.
6.0.8 Fitting the model
Now that we know which model to fit, we can select the model and then evaluate the prediction.
# Stepwise selection on the regression model
model <- step(glm(occ ~ bio4 + bio6 + bio15 + bio19, family = binomial(link = logit), data = Mona_PA))
## Start: AIC=183.44
## occ ~ bio4 + bio6 + bio15 + bio19
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - bio15 1 173.53 181.53
## <none> 173.44 183.44
## - bio4 1 179.20 187.20
## - bio19 1 187.58 195.58
## - bio6 1 263.93 271.93
##
## Step: AIC=181.53
## occ ~ bio4 + bio6 + bio19
##
## Df Deviance AIC
## <none> 173.53 181.53
## - bio4 1 181.99 187.99
## - bio19 1 187.62 193.62
## - bio6 1 265.54 271.54
summary(model)
##
## Call:
## glm(formula = occ ~ bio4 + bio6 + bio19, family = binomial(link = logit),
## data = Mona_PA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10852 -0.30152 -0.00401 0.49153 1.94663
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.584e+01 2.342e+00 -6.762 1.36e-11 ***
## bio4 8.983e-04 2.862e-04 3.139 0.001697 **
## bio6 7.347e-02 1.034e-02 7.105 1.20e-12 ***
## bio19 2.205e-03 6.460e-04 3.413 0.000642 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 393.60 on 288 degrees of freedom
## Residual deviance: 173.53 on 285 degrees of freedom
## AIC: 181.53
##
## Number of Fisher Scoring iterations: 7
# Let's see the plot of occurrence
my_preds <- c('bio4', 'bio6', 'bio15', 'bio19')

bio_clim_df1 <- data.frame(rasterToPoints(clim_data[[my_preds]]))
any(is.na(bio_clim_df1))
## [1] TRUE
bio_clim_df1 <- na.omit(bio_clim_df1)

Model_glm_pred <- rasterFromXYZ(cbind(bio_clim_df1[, 1:2],
                                      predict(model, bio_clim_df1, type = 'response')))

plot(Model_glm_pred,
     xlim = c(min(Mona_PA$decimalLongitude), max(Mona_PA$decimalLongitude)),
     ylim = c(min(Mona_PA$decimalLatitude), max(Mona_PA$decimalLatitude)),
     main = 'Probability of Occurrence', axes = FALSE)
Good - you can see the habitat suitability, i.e. the probability of occurrence for the Mona monkey. How about we zoom in on Africa for a closer look? Run the code below to zoom into Africa.
plot(Model_glm_pred,
     xlim = c(-25, 50),
     ylim = c(-40, 40),
     main = 'Probability of Occurrence - Mona Monkey', axes = FALSE)
You may want to assess the goodness of fit
# Explained deviance:
expl_deviance(obs = Mona_PA$occ,
              pred = model$fitted)
## [1] 0.5591346
The predictors explain 55.9% of the deviance in the model.
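As a sanity check, expl_deviance() should agree with the textbook definition of explained deviance, D² = 1 - residual deviance / null deviance, which we can compute directly from the fitted model:
# Explained deviance by hand; compare with the expl_deviance() output above
1 - model$deviance / model$null.deviance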
Okay, but a continuous probability map is not all we want from an SDM here. Let's convert the probability of occurrence into a binary prediction.
6.0.9 Model evaluation and validation
Because we need to evaluate the prediction (just as with an exam: you want to know how well you performed), we need to set up an evaluation dataset. The standard approach (as in remote sensing) is to randomly divide the data into training and testing sets. So, let's set aside 70% of our Mona monkey data for training and keep the remaining 30% for testing later. And since we selected the linear function above, we are good to go and can fit different algorithms now.
Split and train the model
# Use 70% for training data (of course you can change it and use 60% or 80%, it's up to you)
train_data <- sample(seq_len(nrow(Mona_PA)), size = round(0.7 * nrow(Mona_PA)))

# Okay, let's subset the training & testing data
Mona_train <- Mona_PA[train_data, ]
Mona_test <- Mona_PA[-train_data, ]
# If you want to store the split information for later use, use this code:
#write(train_data, file = "Mona_traindata.txt")
# (remember, I said we can store files in formats other than csv, right?)
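One caveat worth knowing: sample() draws a different random split every run, so your numbers below may differ slightly from mine. If you want a reproducible split, set a seed before sampling (the seed value here is an arbitrary choice):
# set.seed(2022)  # any fixed integer makes the 70/30 split reproducible
# train_data <- sample(seq_len(nrow(Mona_PA)), size = round(0.7 * nrow(Mona_PA)))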
Now let's refit our GLM (this time on the training data) to evaluate how well it performs.
model_glm <- step(glm(occ ~ bio4 + bio6 + bio15 + bio19, family = binomial(link = logit), data = Mona_train))
## Start: AIC=133.16
## occ ~ bio4 + bio6 + bio15 + bio19
##
## Df Deviance AIC
## - bio15 1 123.17 131.17
## <none> 123.16 133.16
## - bio4 1 128.90 136.90
## - bio19 1 130.47 138.47
## - bio6 1 184.25 192.25
##
## Step: AIC=131.17
## occ ~ bio4 + bio6 + bio19
##
## Df Deviance AIC
## <none> 123.17 131.17
## - bio4 1 129.74 135.74
## - bio19 1 130.60 136.60
## - bio6 1 185.95 191.95
summary(model_glm)
##
## Call:
## glm(formula = occ ~ bio4 + bio6 + bio19, family = binomial(link = logit),
## data = Mona_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.95547 -0.29929 -0.00348 0.48366 2.02285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.574e+01 2.877e+00 -5.471 4.49e-08 ***
## bio4 9.193e-04 3.352e-04 2.742 0.0061 **
## bio6 7.370e-02 1.284e-02 5.740 9.48e-09 ***
## bio19 1.799e-03 7.148e-04 2.517 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 277.17 on 201 degrees of freedom
## Residual deviance: 123.17 on 198 degrees of freedom
## AIC: 131.17
##
## Number of Fisher Scoring iterations: 7
You may want to cross-validate the model and compare the cross-validated predictions against the fitted values.
= c("bio4", "bio6", "bio15", "bio19")
my_preds
<- crossvalSDM(model_glm, traindat = Mona_train, colname_species = 'occ', colname_pred = my_preds)
preds_cv
plot(model_glm$fitted.values, preds_cv, xlab='Fitted values', ylab='Predicted values from CV')
abline(0,1,col='red',lwd=2)
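If you do want actual response curves (how the predicted occurrence probability changes along each environmental gradient), a minimal sketch is to vary one predictor across its observed range while holding the others at their medians - here for bio6, assuming model_glm and Mona_train exist as above:
# Response curve for bio6, with the other predictors held at their medians
newdat <- data.frame(bio4  = median(Mona_train$bio4),
                     bio6  = seq(min(Mona_train$bio6), max(Mona_train$bio6), length.out = 100),
                     bio15 = median(Mona_train$bio15),
                     bio19 = median(Mona_train$bio19))
plot(newdat$bio6, predict(model_glm, newdat, type = 'response'), type = 'l',
     xlab = 'bio6', ylab = 'Predicted probability of occurrence')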
Before we map the prediction, let's threshold the data (and also check a threshold-independent metric, the AUC).
Thresholding
library(PresenceAbsence)
# Cross-validated predictions:
threshold_data <- data.frame(ID = seq_len(nrow(Mona_train)), obs = Mona_train$occ, pred = preds_cv)

# Get the optimal thresholds:
(threshold_optimal <- PresenceAbsence::optimal.thresholds(DATA = threshold_data))
## Warning in PresenceAbsence::optimal.thresholds(DATA = threshold_data): req.sens defaults to 0.85
## Warning in PresenceAbsence::optimal.thresholds(DATA = threshold_data): req.spec defaults to 0.85
## Warning in PresenceAbsence::optimal.thresholds(DATA = threshold_data): costs assumed to be equal
## Method pred
## 1 Default 0.5000000
## 2 Sens=Spec 0.4850000
## 3 MaxSens+Spec 0.3700000
## 4 MaxKappa 0.3700000
## 5 MaxPCC 0.3700000
## 6 PredPrev=Obs 0.5350000
## 7 ObsPrev 0.4405941
## 8 MeanProb 0.4402461
## 9 MinROCdist 0.3700000
## 10 ReqSens 0.5000000
## 11 ReqSpec 0.4800000
## 12 Cost 0.3700000
Good. You can now use any of the threshold values above to classify your predictions into "presence" and "absence".
# Threshold using MaxSens+Spec and print the confusion matrix
(cmx_maxSSS <- PresenceAbsence::cmx(DATA = threshold_data, threshold = threshold_optimal[3, 2]))
## observed
## predicted 1 0
## 1 85 18
## 0 4 95
Let’s compute AUC
library(AUC)

# Let's have a look at the ROC curve:
roc_cv <- roc(preds_cv, as.factor(Mona_train$occ))
plot(roc_cv, col = "grey70", lwd = 2)
Compute the AUC and other evaluation metrics:
(evaluation_metrics <- evalSDM(Mona_train$occ, preds_cv, thresh.method = "MaxSens+Spec"))
## AUC TSS Kappa Sens Spec PCC D2 thresh
## 1 0.9281098 0.7957641 0.7826895 0.9550562 0.840708 0.8910891 0.5246774 0.37
We can now validate the model performance on the test data
(performance_glm <- evalSDM(Mona_test$occ, predict(model_glm, Mona_test[, my_preds], type = 'response'), thresh.method = "MaxSens+Spec"))
## AUC TSS Kappa Sens Spec PCC D2 thresh
## 1 0.9393939 0.7441077 0.7529813 0.8181818 0.9259259 0.8850575 0.5591624 0.49
Please note:

- Sensitivity = true positive rate
- Specificity = true negative rate
- PCC = proportion of correctly classified observations

We can judge whether the model is good using the TSS (true skill statistic) or Kappa. You can also check the AUC (area under the curve). Which curve, you may ask? The ROC curve - the receiver operating characteristic.
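To make these definitions concrete, here is a minimal sketch computing the metrics by hand from the confusion matrix cmx_maxSSS obtained earlier (in PresenceAbsence::cmx, rows are predictions and columns observations, with 1 before 0); the results should match the evalSDM() output on the training data above:
TP <- cmx_maxSSS[1, 1]; FP <- cmx_maxSSS[1, 2]  # predicted presences: true/false positives
FN <- cmx_maxSSS[2, 1]; TN <- cmx_maxSSS[2, 2]  # predicted absences: false/true negatives

sens <- TP / (TP + FN)                # sensitivity: true positive rate
spec <- TN / (TN + FP)                # specificity: true negative rate
pcc  <- (TP + TN) / sum(cmx_maxSSS)   # proportion correctly classified
tss  <- sens + spec - 1               # true skill statistic
round(c(Sensitivity = sens, Specificity = spec, PCC = pcc, TSS = tss), 3)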
6.0.10 Map prediction
Now, let's map the prediction by plotting the binary map alongside the continuous probability map.
bio_clim_df_2 <- data.frame(rasterToPoints(clim_data[[my_preds]]))
any(is.na(bio_clim_df_2))
## [1] TRUE
bio_clim_df_2 <- na.omit(bio_clim_df_2)

binary_glm <- predicted_glm <- rasterFromXYZ(cbind(bio_clim_df_2[, 1:2],
                                                   predict(model_glm, bio_clim_df_2, type = 'response')))
values(binary_glm) <- ifelse(values(predicted_glm) >= performance_glm$thresh, 1, 0)
plot(stack(predicted_glm, binary_glm),
     xlim = c(-25, 50),
     ylim = c(-40, 40),
     main = c('Probability of Occurrence - Mona', 'Binary Prediction - Mona'), axes = FALSE)
Now you can see the binary prediction for the Mona monkey across Africa.
Great! We stopped here in class. I will update the book later as time permits (check back soon):
- Transfer this prediction to future (2050 or 2070)
- Use different model algorithms (random forest, boosted regression trees, etc)
- Ensemble the models (to account for model uncertainty) and lots more.
If you have questions, feel free to email me or message me on Slack.