Due: 5pm Fri, Mar 6 2015 via GauchoSpace
In this lab you’ll explore differences between sites to characterize “communities” based on dissimilarity in species composition and environment. Whittaker (1960) was first to describe species diversity between sites (beta; \(\beta\) diversity), which is in relation to diversity of the individual (alpha; \(\alpha\) diversity) and overall region (gamma; \(\gamma\) diversity).
You will quantify how “different” sites are from each other based on differences in species composition. This ecological distance between sites then forms the input matrix for an ordination technique that oders the sites along one or more axis so as to maximize the spread of ecological distances. You’ll then cluster these sites into similar groups.
Next, you’ll look at environmental variation between sites in a few ways. First, you’ll overlay environmental vectors in the “species” ordination space to show general trends. Then you’ll add environmental contour plots for a couple dominant variables to see how the range of environmental values trends across the sites in species space. Next you’ll conduct recursive partitioning to determine how environmental variables predict the groups.
Set your working directory in the R chunk below.
You’ll use the tree survey data from Sequoia National Park described by Urban et al (2002) from your assigned readings. The survey files in the data
folder are:
You can find a more detailed description in data\sequoia_data.pdf
.
Let’s start by mapping the study plots (see also Figure 2 from Urban et al, 2002).
Per the approach of Urban et al (2002), you’ll limit your analysis to the top 11 dominant species by average density (stems per hectare). Inclusion of very rare species can make it difficult for the ordination technique to converge on a solution. The following R chunk calculates the average species density (stems per hectare) and spits out the original species data now limited to the top 11 species into data/spp11.csv
.
rank | species | avg_density |
---|---|---|
1 | SEgi | 22.1326263 |
2 | ABco | 18.1081818 |
3 | ABma | 13.0607071 |
4 | PImo | 9.7373737 |
5 | PIla | 7.1006061 |
6 | PIje | 5.3846465 |
7 | CAde | 4.2610101 |
8 | PIco | 3.0318182 |
9 | PIpo | 2.5250505 |
10 | QUke | 1.4663636 |
11 | COnu | 0.0505051 |
12 | ARvi | 0.0169697 |
13 | CEin | 0.0156566 |
14 | QUch | 0.0125253 |
15 | UMca | 0.0111111 |
16 | ACma | 0.0076768 |
17 | TOca | 0.0021212 |
Question: What are the most and least abundant species by common name for the subset of 11 most dominant species to be used in this analysis? (You can find the lookup table to go from Code to Common Name in the data\sequoia_data.pdf
.)
The first step in this community analysis is to define the ecological distance between sites \(\beta\) (beta diversity) based on species composition.
The best known index of beta diversity is based on the ratio of total number of species in a collection of sites \(S\) (ie \(\gamma\) diversity) and the average richness per one site \(\bar \alpha\):
\[ \beta = S/\bar \alpha - 1 \ \]
Subtraction of one means that \(\beta = 0\) when there are no excess species or no heterogeneity between sites. For this study area we calculate it with the following:
spp17 = read.csv('data/spp17.csv', row.names=1)
alpha17 = mean(specnumber(spp17))
beta17 = ncol(spp17)/alpha17 - 1
spp11 = read.csv('data/spp11.csv', row.names=1)
alpha11 = mean(specnumber(spp11))
beta11 = ncol(spp11)/alpha11 - 1
We find the values are different for all 17 species (\(\alpha_{17}\) = 2.8484848; \(\beta_{17}\) = 4.9680851) than for the top 11 dominant species (\(\alpha_{11}\) = 2.7171717; \(\beta_{11}\) = 3.0483271).
Question: In simple terms, why the difference in \(\beta\) and \(\alpha\) terms between the dataset with all 17 species versus the most dominant 11?
This simple calculation is problematic though since \(S\) increases with the number of sites even when sites are all subsets of the same community. Whittaker (1960) noticed this, and suggested the index to be found from pairwise comparison of sites. The most commonly used metric for this is the Bray-Curtis dissimilarity (also known as the Sorenson index of dissimilarity):
\[ \beta = \frac{a+b+c}{(2a+b+c)/2} - 1 = \frac{b+c}{2a+b+c} \]
where the number of shared species in two sites is \(a\), and the numbers of species unique to each site are \(b\) and \(c\), then \(\bar \alpha = (2a + b + c)/2\) and \(S = a + b + c\). The Bray-Curtis dissimilarity is bound between 0 and 1, where 0 means the two sites have the same composition (ie they share all the same species), and 1 means the two sites do not share any species. This metric can be calculated between all sites using the vegan function vegdist
:
# Bray-Curtis as binary (presence/absence) of species per equation above
spp11_bray_binary = vegdist(spp11, method='bray', binary=T)
# Bray-Curtis as amount (density) of species
spp11_bray = vegdist(spp11, method='bray')
# transformed Bray-Curtis as amount (density) of species
spp11_bray_transformed = vegdist(wisconsin(sqrt(spp11)), method='bray')
# write to data folder
write.csv(as.matrix(spp11_bray_binary), 'data/spp11_bray_binary.csv')
write.csv(as.matrix(spp11_bray), 'data/spp11_bray.csv')
write.csv(as.matrix(spp11_bray_transformed), 'data/spp11_bray_transformed.csv')
Note that in the R chunk above you calculated Bray Curtis using the binary (presence/absence) of species as well as using the density. You generally want to transform the input species composition data so that the inputs are normalized between the species and sites. All the ecological distance measures are output to the data folder.
Let’s look at how these values are distributed.
# show just the lower triangle of the matrix
i = lower.tri(as.matrix(spp11_bray_binary))
hist(as.matrix(spp11_bray_binary)[i] ,
xlab='ecological distance', main='spp11_bray_binary')
hist(as.matrix(spp11_bray)[i] ,
xlab='ecological distance', main='spp11_bray')
hist(as.matrix(spp11_bray_transformed)[i],
xlab='ecological distance', main='spp11_bray_transformed')
In order to solidify subsequent processes in your mind, let’s look values of a few individual sites that are closest, furthest and some random pair in between. When you crack open data/spp11_bray_transformed.csv
you’ll see 0’s along the diagonals corresponding to zero distance from a site to itself. Let’s not include those as potential closest sites. Also, you’ll notice not just a single max value (1), but many.
m = as.matrix(spp11_bray_transformed)
diag(m) = NA
i_max = which(m == max(m, na.rm=T), arr.ind=T)
# ... TO BE CONTINUED
Question: What are the dimensions [# rows X # columns] of the output ecological distance matrices in terms of sites/species/or…?
Question: Lookup the help for the vegdist
function and name at least three other ecological distance measures.
Question: In the R chunk above, how were the species composition data transformed to create spp11_bray_transformed
before feeding into the vegdist
function? Lookup the functions used to describe this transformation.
In multivariate analysis, ordination (eg gradient analysis) orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on one or more axes, are then characterized numerically and/or graphically. Here’s a dichotomus key for choosing the appropriate ordination technique from Dean Urban’s Multivariate class:
Question | Choice |
---|---|
1a. Ancillary data (ENV) not available, or not used to constrain the ordination | 2 |
1b. Ancillary data (ENV) available and used in ordination | 4 |
2a. Species response assumed linear | PCA (FA) |
2b. Species response not linear, or unknown | 3 |
3a. Species response assumed nonlinear and unimodal | DCA (RA) |
3b. Species response not linear, nor nonlinear and unimodal, or unknown | NMS (PCoA, BC) |
4a. Species response assumed linear | RDA |
4b. Species response not linear, or unknown | 5 |
5a. Species response nonlinear and unimodal | 6 |
5b. Species response not linear, nor nonlinear and unimodal, or unknown | dbRDA |
6a. A single (or few) environmental factors known (or required) as constraints | DO |
6b. Several factors suspected | CCA |
In the key, the lead technique is the best or default choice; alternatives in parentheses are options appropriate to particular data sets (see Applications Checklist, below). The techniques:
Since nonmetric multidimensional scaling has the fewest assumptions of the data, we’ll proceed with that technique.
spp11_mds = metaMDS(spp11, k=3, trace=F, trymax=100)
spp11_mds
##
## Call:
## metaMDS(comm = spp11, k = 3, trymax = 100, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(spp11))
## Distance: bray
##
## Dimensions: 3
## Stress: 0.07271247
## Stress type 1, weak ties
## Two convergent solutions found after 97 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(spp11))'
Question Note in the R chunk above that the data input for the metaMDS
function is the original species composition data sp11
, but how did the argument k=3
and default values for arguments distance
and autotransform
affect the results?
Let’s look at the first 2 axis of ordination with site numbers in black and species in red.
plot(spp11_mds, type='t')
There’s a third axis which you can see in this plot.
ordiplot3d(spp11_mds, type='h')
You can also interact with a 3D plot using the following commands in the Console which opens up a new window (that you need to find in the Windows bar). Click hold and rotate the axes. First, by sites:
if (interactive()){
ordirgl(spp11_mds, size=2) # default sites
}
or by species:
if (interactive()){
ordirgl(spp11_mds, display='species', type='t') # species
}
Now hierarchically cluster the sites using the same transformed Bray-Curtis species dissimilarity input as the ordination.
# hierarchically cluster sites
clu = hclust(spp11_bray_transformed, 'average')
plot(clu)
Now cutoff the hierarchical cluster into three groups.
# cutoff clusters to 3 groups
plot(clu)
rect.hclust(clu, k=3) # or by height of similarity h=0.5
grp = cutree(clu, k=3)
Plot the groups in ordination space.
# plot convex hull of groups: overlap
plot(spp11_mds, display='sites')
ordihull(spp11_mds, grp, lty=2, col='red', label=T)
Which you can also do interactively in 3D.
if (interactive()){
# plot 3D to see 3rd NMS axis
ordirgl(spp11_mds, display='sites') #, type='t')
orglspider(spp11_mds, grp)
}
Next, you’ll plot the environment vectors fitted into the “species” space.
env21 = read.csv('data/env21.csv', row.names=1)
ef = envfit(spp11_mds, env21, permu = 999)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## Elev 0.99995 -0.01038 0.8823 0.001 ***
## Slope -0.62043 0.78426 0.0530 0.090 .
## Tasp 0.71012 -0.70408 0.0053 0.753
## TSI 0.84736 -0.53101 0.0138 0.515
## xLitter -0.49325 0.86989 0.0503 0.068 .
## xDepth -0.60080 0.79940 0.4012 0.001 ***
## sDepth 0.72877 0.68476 0.0089 0.668
## pH -0.55792 0.82989 0.5088 0.001 ***
## C 0.58942 -0.80783 0.1307 0.001 ***
## N 0.33071 -0.94373 0.0704 0.026 *
## CN 0.99981 -0.01968 0.2649 0.001 ***
## P -0.09770 0.99522 0.0556 0.066 .
## Ca -0.86190 0.50709 0.2417 0.001 ***
## Mg -0.99148 -0.13023 0.2255 0.001 ***
## K -0.93975 0.34186 0.2126 0.001 ***
## Acid 0.76256 -0.64691 0.5098 0.001 ***
## ECEC -0.93377 0.35787 0.1705 0.001 ***
## BS -0.85626 0.51654 0.5999 0.001 ***
## Clay -0.35055 -0.93654 0.0267 0.259
## Silt -0.25389 -0.96723 0.0030 0.863
## Sand 0.26003 0.96560 0.0048 0.785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
plot(spp11_mds, display = 'sites')
plot(ef, p.max = 0.1)
# TODO: highlight special sites and extract env values
Now choose two environment variables to create contours.
# pick two variables: pH and Elev
ef2 = envfit(spp11_mds ~ pH + Elev, env21)
plot(spp11_mds, display = 'sites')
plot(ef2)
tmp = with(env21, ordisurf(spp11_mds, pH, add=T))
with(env21, ordisurf(spp11_mds, Elev, add=T, col="green4"))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## <environment: 0x7fd8533f6120>
##
## Estimated degrees of freedom:
## 8.1 total = 9.1
##
## REML score: 620.2007
Get the difference in environmental values between groups (from hierarchical clustering above).
boxplot(pH ~ grp, data=env21, notch = TRUE, xlab='group', ylab='pH')
## Warning in bxp(structure(list(stats = structure(c(4.92, 5.24, 5.45, 5.73,
## : some notches went outside hinges ('box'): maybe set notch=FALSE
boxplot(Elev ~ grp, data=env21, notch = TRUE, xlab='group', ylab='Elev')
## Warning in bxp(structure(list(stats = structure(c(1545, 1673, 1923, 2093,
## : some notches went outside hinges ('box'): maybe set notch=FALSE
Now you’ll use recursive partitioning to see how the environmental data can be used to predict the different groups based on site species dissimmilarity.
tree1 = rpart(factor(grp) ~ ., data=env21)
tree1_p = as.party(tree1)
tree1_p
##
## Model formula:
## factor(grp) ~ Elev + Slope + Tasp + TSI + xLitter + xDepth +
## sDepth + pH + C + N + CN + P + Ca + Mg + K + Acid + ECEC +
## BS + Clay + Silt + Sand
##
## Fitted party:
## [1] root
## | [2] Elev < 2395
## | | [3] pH >= 4.845
## | | | [4] Elev < 2162.5: 1 (n = 54, err = 0.0%)
## | | | [5] Elev >= 2162.5: 3 (n = 8, err = 50.0%)
## | | [6] pH < 4.845: 3 (n = 11, err = 0.0%)
## | [7] Elev >= 2395: 2 (n = 26, err = 0.0%)
##
## Number of inner nodes: 3
## Number of terminal nodes: 4
plot(tree1_p)
Let’s plot some other diversity indices for each site and see how they relate to each other.
# diversity indices for 17 species
Shannon17 = diversity(spp17, index='shannon')
Simpson17 = diversity(spp17, index='simpson')
InvSimpson17 = diversity(spp17, 'inv')
Rare17 = rarefy(round(spp17), 2) - 1
Alpha17 = fisher.alpha(round(spp17))
pairs(cbind(Shannon17, Simpson17, InvSimpson17, Rare17, Alpha17), pch='+', col='blue')
# diversity indices for 17 species
Shannon11 = diversity(spp11, index='shannon')
Simpson11 = diversity(spp11, index='simpson')
InvSimpson11 = diversity(spp11, 'inv')
Rare11 = rarefy(round(spp11), 2) - 1
Alpha11 = fisher.alpha(round(spp11))
pairs(cbind(Shannon11, Simpson11, InvSimpson11, Rare11, Alpha11), pch='+', col='blue')
Just for kicks and giggles, let’s look at species accumulation curve, or rarefaction curve, which shows us the number of species we expect to find as we add plots.
# species accumulation curves
sac <- specaccum(spp17)
plot(sac, ci.type='polygon', ci.col='gray')
Besides the questions above, your assignment is to describe the species and environment associations between sites. Please reference species by “common name (code)”, such as “giant sequoia (SEgi).” Be explicit about the relationship between species and at least 3 environmental gradients as well as the 3 groups identified by the cluster analysis. You can simply refer to which sites are found in relatively high versus mid and low range environmental values. Treat this like a Results and Discussion section of a paper describing the community analysis of this Sequoia National Park study area. You might also wish to recall Figure 1.4 from the textbook (slide 18 of first lecture) by Whittaker (1956) showing the different species communities and environmental gradients of the Great Smoky Mountains.
To help you identify these differences choose Chunks -> Run All which should prompt you to click on sites (open circles). Click on sites at the environmental extremes and near the group’s center. Click ESC to exit the interactive selection of sites.
Once you’ve identified the sites, knit this document to generate the following 3 figures and 2 tables which you should include and reference by caption number in your Word document writeup. Any of the other results above you can similarly include, such as the recursive partitioning plot under “Use Environment to Predict Cluster Groups” by adding the figure, a figure number and caption.
Figure 1. Sites (circles), including selected sites (closed circles with numeric id), enclosed by group membership convex hull (red dashed line; red numeric label), associated species ordination centers (red character text), and environmental gradients (blue vector lines).
Table 1. Species density (stems per hectare) original (non-normalized) values per selected site.
ABco | ABma | CAde | COnu | PIco | PIje | PIla | PImo | PIpo | QUke | SEgi | |
---|---|---|---|---|---|---|---|---|---|---|---|
35 | 43.97 | 0.00 | 9.03 | 0 | 0.00 | 0.00 | 6.35 | 0.00 | 8.65 | 0.00 | 0.00 |
37 | 0.55 | 0.00 | 2.87 | 0 | 0.00 | 0.00 | 4.27 | 0.00 | 1.66 | 27.74 | 0.00 |
46 | 0.00 | 34.42 | 0.00 | 0 | 0.00 | 0.00 | 0.00 | 21.27 | 0.00 | 0.00 | 0.00 |
64 | 5.66 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 129.56 | 0.00 | 0.00 | 0.00 | 179.55 |
77 | 0.00 | 0.00 | 0.00 | 0 | 29.76 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
93 | 6.76 | 0.00 | 0.00 | 0 | 0.00 | 61.25 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
96 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 24.99 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Table 2. Environmental original (non-normalized) values per selected site.
Elev | Slope | Tasp | TSI | xLitter | xDepth | sDepth | pH | C | N | CN | P | Ca | Mg | K | Acid | ECEC | BS | Clay | Silt | Sand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | 1802 | 24 | -0.15643 | 0.01871 | 6.3 | 96.52 | 11.88 | 5.56 | 3.450 | 0.130 | 26.09 | 123.23 | 7.02 | 0.817 | 0.457 | 0.077 | 8.37 | 99.07 | 0.6 | 19.1 | 80.3 |
37 | 1765 | 30 | -0.84805 | 0.07265 | 4.9 | 27.27 | 33.07 | 5.35 | 3.073 | 0.160 | 19.21 | 145.95 | 5.78 | 0.677 | 0.347 | 0.123 | 6.92 | 97.94 | 0.2 | 18.2 | 81.6 |
46 | 2687 | 17 | -0.22495 | 0.06968 | 2.7 | 54.60 | 36.30 | 4.15 | 6.373 | 0.200 | 34.20 | 76.35 | 2.01 | 0.267 | 0.397 | 2.327 | 5.00 | 54.78 | 1.6 | 29.0 | 69.3 |
64 | 2125 | 13 | -0.70711 | 0.00439 | 9.7 | 91.13 | 23.14 | 6.37 | 1.607 | 0.067 | 23.83 | 27.57 | 6.67 | 0.403 | 0.450 | -0.003 | 7.53 | 100.00 | 0.2 | 20.6 | 79.2 |
77 | 2822 | 5 | 0.60181 | 0.00436 | 2.7 | 17.30 | 10.26 | 4.27 | 2.177 | 0.090 | 24.40 | 238.43 | 0.32 | 0.067 | 0.130 | 1.060 | 1.57 | 33.27 | 0.0 | 20.5 | 79.6 |
93 | 2218 | 4 | -0.19082 | -0.01310 | 2.7 | 41.30 | 18.64 | 5.16 | 1.753 | 0.077 | 23.19 | 89.77 | 2.86 | 0.377 | 0.357 | 0.097 | 3.70 | 96.87 | 0.0 | 19.0 | 81.3 |
96 | 2017 | 10 | 0.97437 | 0.04057 | 1.3 | 4.07 | 4.87 | 3.90 | 6.330 | 0.283 | 22.27 | 81.60 | 1.56 | 0.237 | 0.200 | 2.337 | 4.33 | 46.60 | 2.1 | 22.3 | 75.6 |
Figure 2. Normalized species densities with with selected sites shown in rank order (blue text).
Figure 3. Normalized environmental densities with selected sites shown in rank order (blue text).
The data set and overall concepts were pulled from Dean Urban’s Multivariate Analysis class at Duke University. The specific R commands have been updated to use the Vegan R package as described by these instructive vignettes:
Other helpful lecture materials can be found here: