Due: 5pm Mon, Feb 23 2015 via GauchoSpace
In this lab, you’ll predict the distribution of a species using the software package Maxent, short for “maximum entropy” – the statistical technique used to fit a model differentiating species observations from the background environmental data. You’ll use “presence” points from the Global Biodiversity Information Facility (GBIF), and bioclimatic environmental predictors from WorldClim.org.
Here’s a figure of the overall process:
Choose a species from this Species List and enter your name in the student column so we all do different species. This list was derived from the species listed in tables 24 and 27 from Santa Barbara County’s Burton Mesa Ecological Reserve Final Land Management Plan and Environmental Impact Report where at least 100 occurence records are found for Santa Barbara County’s extent (longitude: -121 to -119; latitude: 34 to 36) in the Global Biodiversity Information Facility GBIF.org.
Set your working directory (wd
) and scientific species name (sp_scientific
) of your chosen species (Genus species only) in the set_vars
R chunk below.
wd = 'H:/esm215/lab6_species'
sp_scientific = 'Amphispiza belli'
The next R chunk fetches the first 1000 occurrences from GBIF using the rgbif R package.
You’ll want to restrict observations to those in the Americas, so we can later predict to Santa Barbara County. To do this, you’re provided with an interactive map to draw a bounding box extent around the points of interest.
For the interactive drawing to work, you’ll want to go to upper right RStudio menu Chunks -> Run All so the R code is run from the Console. (Note: Knitting the R Markdown document is not interactive.)
The bounding box extent is saved to spp/*_extent.csv
which is read in next time the code runs. (The interactive drawing of bounding box is presented if the file’s not found, or defaults to global extent if not run interactively.).
If only a few points are within the Americas (and most in other continents), please update Species List and the set_vars
R chunk with a new species selection, delete the spp
folder, go to menu Chunks -> Run All again.
The next R chunk:
Filters the GBIF observation points based on the drawn extent from the previous step, and
Partitions these points randomly into:
train for model fitting (80% of filtered points), and
test for model evaluation (20%).
Question: Were there any GBIF observations for your species that you spatially filtered to achieve a study area restricted to the Americas? If observations were filtered, do you speculate that the populations elsewhere are distinct (ie have no demographic interactions) with the one(s) in the Americas?
For this lab, you’ll be using environmental predictors from the WorldClim.org database. This database provides current, past and future climatic variables such as monthly mean/min/max temperature and precipitation (which are multiplied by 10 in order to store as integers). These monthly variables have been combined into 19 biologically relevant climatic variables (bioclim) plus altitude:
We’ll use the variables calculated for the following time periods:
Current: climatic averages representative of the period 1950-2000
Future: climatic forecast for 2070, average for 2061-2080, from IPPC Fifth Assessment’s GFDL-CM3 based on the representative concentration pathway 8.5 with 2 deg C projected temperature increase by 2050 (3.7 deg C by 2090).
You’ll proceed with the following scenarios which distinguish between the time (current vs future) and space (global, cropped or Santa Barbara) of chosen environmental predictors (all or top) in which the model is fitted versus predicted (sequentially novel aspect of scenario bolded):
scenario | time, fitted | space, fitted | predictors | time, predicted | space, predicted |
---|---|---|---|---|---|
scenario_01 | current | global | all | current | global |
scenario_02 | current | cropped | all | current | cropped |
scenario_03 | current | cropped | top | current | cropped |
scenario_04 | current | cropped | top | current | Santa Barbara |
scenario_05 | current | cropped | top | future | Santa Barbara |
The “global” predictors you’ll be using have a cell size of 10 minutes (18.55 km at equator), from which “cropped” is limited to the geographic extent of the point observations. Then you’ll pick the top 3 environmental predictors plus altitude (“top 3 bio + alt”) based on percent contribution. Next, you’ll use this fitted model for predicting to a “Santa Barbara” extent using WorldClim predictors having a cell size of 0.5 minutes (0.93 km at equator).
Although we could theoretically use the R function getData from the raster library to fetch all these WorldClim data on the fly, the server is slow and inconsistent with responding. So these data have been provided for you.
The following R chunk creates the Maxent output scenario folders.
The next chunk of R code crops the global data to the extent of train and test data points, leaving you with the following predictor combinations of time and space in the env
folder that you’ll use in the Maxent scenarios in the following order:
current_10min_global
current_10min_cropped
current_0.5min_sb
future_0.5min_sb
In order to capture the maximum possible extent of the species, you’ll start with the global predictors.
Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by commenting out (ie precede the line with a #
character) the plot command in the R chunk above.
Now that you have your biological response data (in spp
folder) and environmental predictor data (in env
folder), you’re ready to fit, predict and evaluate a species distribution model with Maxent.
Double click on software\maxent.bat
to launch Maxent’s graphical user interface. This batch file simply opens the Maxent Java archive (*.jar) with the Java engine using 6000 MB of memory: java -mx6000m -jar maxent.jar
. There’s a non-fatal warning based on a slight mismatch between Java versions that you can safely ignore (“WARNING: Could not open/create prefs root node Softwareat root 0 x80000002. Windows RegCreateKeyEx(…) returned error code 5.”)
Configure scenario_01 with the following basic settings (where *
represents your species):
Samples: browse to spp\*_train.csv
. This sets the input species presence points for which Maxent extracts the environmental layer data and differentiates from the “background” (ie all other environmental data in the environmental layer extent.)
Environmental layers: browse to env\current_10min_global
. This sets the folder of environmental data and automatically selects all valid rasters in the folder, which must have the same resolution and extent as each other.
Output directory: browse to scenario_01
. This is where Maxent places all the output results from fitting, predicting and evaluating the model.
Output file type: bil. The band interleaved by line (.bil) format is simply a more compact in file size than the default ASCII text raster format (.asc).
Create response curves tick on. These response curves describe the statistical relationship between environmental predictor and species response.
Settings button to launch interface to more parameters
in the Basic tab, Test sample file: browse to spp\*_test.csv
. This adds an extra set of evaluation diagnostics on species presence points which were partitioned from the original dataset.
When you click on the Run button, here’s the software/tutorial.doc
description of what happens next…
A progress monitor describes the steps being taken. After the environmental layers are loaded and some initialization is done, progress towards training of the maxent model is shown like this:
The gain is closely related to deviance, a measure of goodness of fit used in generalized additive and generalized linear models. It starts at 0 and increases towards an asymptote during the run. During this process, Maxent is generating a probability distribution over pixels in the grid, starting from the uniform distribution and repeatedly improving the fit to the data. The gain is defined as the average log probability of the presence samples, minus a constant that makes the uniform distribution have zero gain. At the end of the run, the gain indicates how closely the model is concentrated around the presence samples; for example, if the gain is 2, it means that the average likelihood of the presence samples is exp(2) â 7.4 times higher than that of a random background pixel. Note that Maxent isnât directly calculating âprobability of occurrenceâ. The probability it assigns to each pixel is typically very small, as the values must sum to 1 over all the pixels in the grid.
Once the progress meter disappears, Maxent is finished producing the results. At the global scale, this may take several minutes.
Open the summary of results scenario_01/*.html
. Note the sections:
Analysis of omission/commission
Figure: Omission and Predicted Area. The Cumulative threshold on the x-axis refers to the application of a threshold, above which continuously ranging values (0 to 100) are converted to binary, predicting presence and absence below. We can then look at teh fraction of total area predicted (red). So at a 0 threshold all cells are predicted as “present” hence fraction of 1, whereas at 100 no cells are predicted as “present” hence fraction of 0. Along this continuum of thresholds, we can evaluate how many points are missed from the input training (blue) and test (turquoise) points are missed, ie “omitted” versus the predicted random rate (black).
Figure: Sensitivity vs. 1 - Specificity. This plot shows the Receiver Operating Curve (ROC) for both training and test data. The area under the ROC curve (AUC) shown in the legend indicates how well the model performs along a range of thresholds (1 is perfect, 0.5 is random; the further to the upper left the blue and red lines to the upper left corner, the better the model).
Table: Logistic thresholds. Because we are using the default Output format parameter “Logistic”, ie ranging from 0 to 1, the Logistic threshold would be applied for converting the output from continuous (0 to 1) to binary (present vs absent). Of the many Descriptions available, “Maximum training sensitivity plus specificity” optimizes the tradeoff between sensitivity ( mimizing false presences, ie errors of comission) and specificity (mimizing false absences, ie errors of omission) using the training data. We’ll use this later to convert the continuous probability of encounter to a binary surface (present vs absent) for delineating species habitat “patches”.
Pictures of the model. The map here shows the continuous probability of occurrence with square dots showing train (white) and test (purple) locations. (The Explain tool, although nifty to explore, only seems to rarely work. You can try it out by using a new output directory, eg scenario_01-explain, untick Auto running the model again Although you could untick Auto features and Product features which disables predictor interactions. Turning off Product features is required to use the Explain tool, which might then work by double clicking the batch file *_explain.bat.)
Response curves Two sets of response curves are generated:
Marginal species response (0 to 1 in y axis) to predictor (range of values in x axis) when all other predictors are averaged. These term plots can be hard to interpret when variables are correlated with each other.
Dedicated species response to predictor for model using only that predictor. You can much more clearly see the “niche” of this species by the given environmental variable.
Analysis of variable contributions. The variables in this table are sorted by Percent contribution of the given variable to the overall gain of the fitted model. The Permutation importance describes the fractional difference in AUC score if the given variable is excluded. So Percent contribution describes overall model fit, and Permution importance relates to how much more accurate (ie sensitivity vs specificity) the model is by the variable’s inclusion.
Raw data outputs and control parameters Links to raw data file outputs and various statistics are included here.
For more details on Maxent input and output, check out the help and tutorial docs in the software folder.
Question: What are the top 3 predictors for Percent contribution versus top 3 for Permutation importance?
Question: Are there any other “hotspots” outside the extent of occurrence (test and train points), particularly outside the Americas, which represent similar habitats?
Next, we’ll limit the environmental background to the extent of occurrences to create a more refined fitted model.
Use the same settings as scenario_01, except:
Environmental layers: browse to env\current_10min_cropped
Output directory: browse to scenario_02
Here are the environmental predictors cropped to the extent of occurrence.
Note that these cropped environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by commenting out (ie precede the line with a #
character) the plot command in the R chunk above.
After you Run the model, inspect the results scenario_02/*.html
.
Question: Now, what are the top 3 predictors for Percent contribution versus top 3 for Permutation importance?
Question: Compare the following dedicated altitude response curves between the global [scenario_01] and cropped [scenario_02] models? Briefly explain why these might be different.
To simplify the prediction and interpretation of the model, choose the combination of “top” predictors that comprise:
altitude (alt)
top 3 variables by Percent contribution
top 3 variables by Permutation importance
Question: What are these “top” predictors?
Use the same Marxan settings as scenario_02, except:
Environmental layers: continuing with env\current_10min_cropped
, Deselect all and individually tick the “top”" predictors identified above.
Output directory: browse to scenario_03
Question: By dropping the other predictor variables, how much predictive performance did we lose? In technical terms, report the difference in Area Under the Curve of the Test data going from scenario_02 to scenario_03.
Here are the environmental predictors for Santa Barbara County’s current climate.
Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by comment commenting out (ie precede the line with a #
character) the plot command in the R chunk above.
We’ll continue to use the model fitted to the cropped extent of occurrence, but predict to Santa Barbara County at a much higher resolution (0.5 minutes). You can fit a model and predict to a different set of environmental rasters by using the parameter “Projection layers directory/file.” Given that we already fitted the model, we can directly predict to a different set of environmental data with the following command (per the software/tutorial.doc
:
java -cp maxent.jar density.Project lambdaFile gridDir outFile
This command gets written to the batch file scenario_04/maxent_predict.bat
, similar to software/maxent.bat
, by the R chunk below. You’ll need to Knit or Run All chunks in order for this code to generate the scenario_04 file. Then simply double-click on scenario_04/maxent_predict.bat
to predict the scenario_03 fitted model to the current Santa Barbara half minute environment env/current_0.5min_sb
and output to the scenario_04
folder. You should see a black command window flicker up and the folder populate with 4 raster files (*_current_0.5min_sb*
).
Here’s the continuous probability of encounter for Amphispiza belli in Santa Barbara County’s current climate:
Here are the environmental predictors for Santa Barbara County’s future climate.
Note that these global environmental predictor plots above are to be excluded from final writeup, so once you’ve seen them you can speed up the knitting of your document and turn them off by comment commenting out (ie precede the line with a #
character) the plot command in the R chunk above.
Similar to the previous scenario, you’ll predict to the future Santa Barbara climate with the automatically generated batch file scenario_05/maxent_predict.bat
which will populate the folder with 4 raster files (*_future_0.5min_sb*
).
Here’s the continuous probability of encounter for Amphispiza belli in Santa Barbara County’s future climate:
Next, we’ll find the threshold that converts 20% of the landscape into species habitat, so you have a reasonable quantity of patches for conducting the Connectivity lab. Then we’ll apply that same threshold to the future scenario to return the: lost (red; -1), core (blue; 0), and novel (green; 1) habitats.
Threshold above which converts 20% of the landscape to habitat for SB current climate: 0.6755.
zone | area_km2 | pct |
---|---|---|
-1 | 1838.0 | 22.696 |
0 | 243.2 | 3.004 |
1 | 6017.2 | 74.300 |
Question. Compare the threshold above that obtains 20% of the landscape as habitat with that of the “Maximum training sensitivity plus specificity” found in the “Logistic threshold” table scenario_03 model results. Would using the threshold “Maximum training sensitivity plus specificity” result in more or less habitat for the current climate?
For this assignment, you’re are expected to generate all 5 scenario outputs, knit this lab6.Rmd document to have all outputs included, and respond to all the questions above as a seperate dedicated Word document lab6_answers.docx
directly in the lab6_species folder (be sure to include the original question). Since the environmental predictor plots will be the same regardless of species chosen, please exclude them from your knitted document by commenting out (ie precede the line with a #
character) the plot command from within the following R chunks:
To evaluate your responses, I’d like to see your writeup and all scenario outputs in a zip file that you submit. The easiest way to do this is from Windows Explorer, right-click on the lab6_species folder -> 7-Zip -> Add to lab6_species.zip. Double click on this file to open it in 7-Zip, navigate into the lab6_species folder, select and delete the following subfolders: env, software, figures. After you close 7-Zip, submit this lab6_species.zip file into the GauchoSpace lab6 assignment.
You might be interested in perusing resources on species distribution modeling from week 4 of last quarter’s Advanced GIS course which used a different modeling technique, the generalized linear model (vs Maxent) and pseudo-absence points (vs background).