Jari Oksanen
processed with vegan 2.3-0 in R Under development (unstable) (2015-05-25 r68412) on May 26, 2015
Contents

1 Parallel processing
  1.1 User interface
    1.1.1 Using parallel processing as default
    1.1.2 Setting up socket clusters
    1.1.3 Random number generation
    1.1.4 Does it pay off?
  1.2 Internals for developers

2 Nestedness and Null models
  2.1 Matrix temperature
  2.2 Backtracking

3 Scaling in redundancy analysis

4 Weighted average and linear combination scores
  4.1 LC Scores are Linear Combinations
  4.2 Factor constraints
  4.3 Conclusion

1 Parallel processing

Several vegan functions can perform parallel processing using the standard R package parallel. The parallel package in R implements the functionality of the earlier contributed packages multicore and snow: the multicore functionality forks the analysis to multiple cores, and the snow functionality sets up socket clusters of workers.

1.1 User interface

The functions that are capable of parallel processing have an argument parallel. The normal default is parallel = 1, which means that no parallel processing is performed. It is possible to set parallel processing as the default in vegan (see 1.1.1).

For parallel processing, the parallel argument can be either

1. An integer, in which case the given number of parallel processes will be launched (value 1 launches non-parallel processing). In unix-like systems (e.g., MacOS, Linux) these will be forked multicore processes. In Windows, socket clusters will be set up, initialized and closed.

2. A previously created socket cluster. This saves time, as the cluster is not set up and closed in the function. If the argument is a socket cluster, it will also be used in unix-like systems. Setting up a socket cluster is discussed in 1.1.2.
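The two calling conventions can be sketched outside R. The following is an illustrative Python stand-in, not vegan's actual R code: the names run_parallel and analyse are invented for this sketch, and a multiprocessing.Pool plays the role of a pre-created socket cluster.

```python
from multiprocessing import Pool

def analyse(x):
    # stand-in for one unit of work (e.g., one simulated community)
    return x * x

def run_parallel(data, parallel=1):
    # Integer argument: launch that many workers, use them, close them.
    # Value 1 means no parallel processing at all.
    if isinstance(parallel, int):
        if parallel <= 1:
            return [analyse(x) for x in data]
        with Pool(parallel) as pool:   # set up, initialize and close
            return pool.map(analyse, data)
    # Otherwise: a previously created pool; reuse it and leave it open.
    return parallel.map(analyse, data)
```

With a pre-created pool (for example `with Pool(2) as p: run_parallel(data, p)`), repeated calls avoid the start-up and shut-down cost, which is the point of passing a ready-made cluster.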
1.1.1 Using parallel processing as default

If the user sets the option mc.cores, its value will be used as the default value of the parallel argument in vegan functions. The following command will set up parallel processing in all subsequent vegan commands:

> options(mc.cores = 2)

The mc.cores option is defined in the parallel package, but it is usually unset, in which case vegan will default to non-parallel computation. The mc.cores option can be set by the environmental variable MC_CORES when the parallel package is loaded. R allows setting up a default socket cluster (setDefaultCluster), but this will not be used in vegan.

1.1.2 Setting up socket clusters

Socket clusters are used for parallel processing in Windows, but you do not need to pre-define the socket cluster in oecosimu if you only need vegan commands. However, if you need some other contributed packages, you must pre-define the socket cluster also in Windows with appropriate clusterEvalQ calls. If you pre-set the cluster, you can also use snow-style socket clusters in unix-like systems.

1.1.3 Random number generation
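Random number generation interacts with parallel processing: results are reproducible only if each worker draws from its own deterministically seeded stream rather than from a shared one. A minimal Python sketch of the idea (illustration only; this is not the mechanism R's parallel package uses, and worker_stream is an invented name):

```python
import random

def worker_stream(seed, worker_id, n=3):
    # Each worker derives its own generator from (seed, worker id),
    # so results are reproducible regardless of worker scheduling.
    rng = random.Random(seed * 7919 + worker_id)  # 7919: arbitrary prime offset
    return [rng.random() for _ in range(n)]
```

The same (seed, worker) pair always yields the same stream, while different workers get different streams.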
1.2 Internals for developers

The implementation of the parallel processing should accord with the description of the user interface above (1.1). Function oecosimu can be used as a reference implementation, and a similar interpretation and order of interpretation of arguments should be followed. All future implementations should be consistent, and all must be changed if the call heuristic changes.

The value of the parallel argument can be either an integer or a socket cluster (see 1.1).
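The order of interpretation of arguments described in the user interface can be sketched as pseudocode. This is a Python stand-in with invented names (resolve_parallel, an options dict standing in for R's options()), not vegan's implementation:

```python
import os

def resolve_parallel(parallel=None, options=None):
    # Order of interpretation, following the user interface description:
    # an explicit argument wins; otherwise the mc.cores option (which the
    # parallel package may itself take from the MC_CORES environment
    # variable); otherwise non-parallel processing (1).
    if parallel is not None:
        return parallel          # an integer, or a pre-created cluster object
    options = options or {}
    if "mc.cores" in options:
        return options["mc.cores"]
    if "MC_CORES" in os.environ:
        return int(os.environ["MC_CORES"])
    return 1
```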
2.1 Matrix temperature

Rodríguez-Gironés and Santamaría (2006) have seen the original code and reveal more details of the calculations, and their explanation is the basis of the implementation in vegan. However, there are still some open issues, and probably the vegan function nestedtemp will never exactly reproduce results from other programs, although it is based on the same general principles.² I try to give the main computation details in this document; all details can be seen in the source code of nestedtemp.

- Species and sites are put into a unit square (Rodríguez-Gironés and Santamaría, 2006). The row and column coordinates will be (k - 0.5)/n for k = 1 ... n, so that there are no points in the corners or the margins of the unit square, and a diagonal line can be drawn through any point. I do not know how the rows and columns are converted to the unit square in other software, and this may be a considerable source of differences among implementations.

- Species and sites are ordered alternately using indices (Rodríguez-Gironés and Santamaría, 2006):

    s_j = \sum_{i : x_{ij} = 1} i^2,   t_j = \sum_{i : x_{ij} = 0} (n - i + 1)^2    (1)

  Here x is the data matrix, where 1 is presence and 0 is absence, i and j are row and column indices, and n is the number of rows. The equations give the indices for columns, but the indices can be reversed for corresponding row indexing. Ordering by s packs presences to the top left corner, and ordering by t packs zeros away from the top left corner. The final sorting should be a compromise (Rodríguez-Gironés and Santamaría, 2006) between these scores, and vegan uses s + t. The result should be "cool", but the packing does not try to minimize the temperature (Rodríguez-Gironés and Santamaría, 2006). I do not know how the compromise is defined, and this can cause some differences to other implementations.

- The following function is used to define the fill line:

    y = (1 - (1 - x)^p)^{1/p}    (2)

  This is similar to the equation suggested by Rodríguez-Gironés and Santamaría (2006, eq. 4), but omits all terms dependent on the numbers of species or sites, because I could not understand why they were needed. The differences are visible only in small data sets. The y and x are the coordinates in the unit square, and the parameter p is selected so that the curve covers the same area as the proportion of presences (Fig. 2.1). The parameter p is found numerically using the R functions integrate and uniroot. The fill line used in the original matrix temperature software (Atmar and Patterson, 1993) is supposed to be similar (Rodríguez-Gironés and Santamaría, 2006). Small details in the fill line, combined with differences in scores used in the unit square (especially in the corners), can cause large differences in the results.

- A line with slope = -1 is drawn through the point, and the x coordinate of the intersection of this line and the fill line is found using function uniroot. The difference of this intersection and the row coordinate gives the argument d of matrix temperature (Fig. 2.1).

- In other software, duplicated species occurring on every site are removed, as well as empty sites and species after reordering (Rodríguez-Gironés and Santamaría, 2006). This is not done in vegan.

² Function nestedness in the bipartite package is a direct port of the original BINMATNEST program of Rodríguez-Gironés and Santamaría (2006).

³ "Knight" is "Springer" in German, which is very appropriate as Springer was the publisher of the paper on the knight's tour.

2.2 Backtracking

Gotelli's and Entsminger's seminal paper (Gotelli and Entsminger, 2001) on filling algorithms is somewhat confusing: it explicitly deals with the "knight's tour", which is quite a different problem than the one we face with null models. The chess piece knight³ has a history: a piece in a certain position could only have entered from some candidate squares. The filling of an incidence matrix has no history: if we know that the item last added was
in a certain row and column, we have no information to guess which of the filled items was entered previously. A consequence of dealing with a different problem is that Gotelli and Entsminger (2001) do not give many hints on implementing a fill algorithm as a community null model.

The backtracking is implemented in two stages in vegan: filling and backtracking.

1. The matrix is filled in the order given by the marginal probabilities. In this way the matrix will look similar to the final matrix at all stages of filling. Equal filling probabilities are not used, since that is ineffective and produces strange fill patterns: the rows and columns with one or a couple of presences are filled first, and the process is cornered to columns and rows with many presences. As a consequence, the process tries harder to fill that corner, and the result is a more tightly packed quadratic fill pattern than with other methods.

2. The filling stage stops when no new points can be added without exceeding row or column totals. Backtracking means removing random points and seeing if this allows adding new points to the plot. No record of history is kept (and there is no reason to keep a record of history), but random points are removed and filled again. The number of removed points increases from one to four points. A new configuration is kept if it is at least as good as the previous one, and the number of removed points is reduced back to one if the new configuration is better than the old one. Because there is no record of history, this does not sound like backtracking, but it still fits the general definition of backtracking: try something, and if it fails, try something else (Sedgewick, 1990).

3 Scaling in redundancy analysis

This chapter discusses the scaling of scores (results) in redundancy analysis and principal component analysis performed by function rda in the vegan library.

Principal component analysis, and hence redundancy analysis, is a case of singular value decomposition (svd). Functions rda and prcomp even use svd internally in their algorithms. In svd a centred data matrix X = {x_{ij}} is decomposed into orthogonal components so that x_{ij} = \sum_k \sigma_k u_{ik} v_{jk}, where u_{ik} and v_{jk} are orthonormal coefficient matrices and \sigma_k are singular values. Orthonormality means that the sums of squared columns are one and their cross-products are zero, or \sum_i u_{ik}^2 = \sum_j v_{jk}^2 = 1, and \sum_i u_{ik} u_{il} = \sum_j v_{jk} v_{jl} = 0 for k \ne l. This is a decomposition, and the original matrix is found exactly from the singular vectors and corresponding singular values, and the first two singular components give the rank = 2 least squares estimate of the original matrix.

Principal component analysis is often presented (and performed in legacy software) as an eigenanalysis of covariance matrices. Instead of a data matrix, we analyse a matrix of covariances and variances S. The result is an orthonormal coefficient matrix U and eigenvalues Λ. The coefficients u_{ik} are identical to svd (except for possible sign changes), and the eigenvalues \lambda_k are related to the corresponding singular values by \lambda_k = \sigma_k^2 / (n - 1). With classical definitions, the sum of all eigenvalues equals the sum of the variances of species, or \sum_k \lambda_k = \sum_j s_j^2, and it is often said that the first axes explain a certain proportion of the total variance in the data. The orthonormal matrix V of svd can be found indirectly as well, so that we have the same components in both methods.

The coefficients u_{ik} and v_{jk} are scaled to unit length for all axes k. Singular values \sigma_k or eigenvalues \lambda_k give the information on the importance of axes, or the axis lengths. Instead of the orthonormal coefficients, or equal length axes, it is customary to scale species (column) or site (row) scores, or both, by eigenvalues to display the importance of axes and to describe the true configuration of points. Table 1 shows some alternative scalings. These alternatives apply to principal components analysis in all cases, and in redundancy analysis they apply to species scores and constraints or linear combination scores; weighted averaging scores have a somewhat wider dispersion.

In community ecology, it is common to plot both species and sites in the same graph. If this graph is a graphical display of svd, or a graphical, low-dimensional approximation of the data, the graph is called a biplot. The graph is a biplot if the transformed scores satisfy x_{ij} = c \sum_k u^*_{ik} v^*_{jk}, where c is
a scaling constant. In functions princomp, prcomp and rda, c = 1, and the plotted scores are a biplot so that the singular values (or eigenvalues) are expressed for sites, and species are left unscaled.

Table 1: Alternative scalings for rda as used in the functions prcomp and princomp, in the vegan function rda, and in the proprietary software Canoco: scores in terms of orthonormal species (v_{ik}) and site (u_{jk}) scores, eigenvalues (\lambda_k), number of sites (n) and species standard deviations (s_j). In rda, const = ((n - 1) \sum_k \lambda_k)^{1/4}. The corresponding negative scaling in vegan is derived by dividing each species by its standard deviation s_j (possibly with some additional constant multiplier).

There is no natural way of scaling species and site scores to each other. The eigenvalues in redundancy and principal components analysis are scale-dependent and change when the data are multiplied by a constant. If we have percent cover data, the eigenvalues are typically very high, and the scores scaled by eigenvalues will have a much wider dispersion than the orthonormal set. If we express the percentages as proportions and divide the matrix by 100, the eigenvalues will be reduced by a factor of 100^2, and the scores scaled by eigenvalues will have a narrower dispersion. For graphical biplots we should be able to fix the relations of row and column scores to be invariant against the scaling of the data.

The solution in the R standard function biplot is to scale site and species scores independently, and typically very differently, but plot each independently to fill the graph area. The solution in Canoco and rda is to use proportional eigenvalues \lambda_k / \sum_k \lambda_k instead of the original eigenvalues. These proportions are invariant with scale changes, and typically they have a nice range for plotting two data sets in the same graph.

The vegan package uses the scaling constant c = ((n - 1) \sum_k \lambda_k)^{1/4} in order to be able to use scaling by proportional eigenvalues (like in Canoco) and still be able to have a biplot scaling. Because of this, the scaling of rda scores is non-standard. However, the scores function lets you set the scaling constant to any desired value. It is also possible to have two separate scaling constants: the first for the species, and the second for sites and friends; this allows getting the scores of other software or R functions (Table 2).

In this chapter, I always used centred data matrices. In principle svd could be done with original, non-centred data, but there is no option for this in rda, because I think that non-centred analysis is dubious and I do not want to encourage its use (if you think you need it, you are certainly so good in programming that you can change that one line in rda.default). I do think that the arguments for non-centred analysis are often twisted, and the method is not very good for its intended purpose, but there are better methods for finding fuzzy classes. Normal, centred analysis moves the origin to the average of all species, and the dimensions describe differences from this average. Non-centred analysis leaves the origin in the empty site with no species, and the first axis usually runs from the empty site to the average site. The second and third non-centred components are often very similar to the first and second (etc.) centred components, and the best way to use non-centred analysis is to discard the first component and use only the rest. This is better done with directly centred analysis.

4 Weighted average and linear combination scores

Constrained ordination methods such as Constrained Correspondence Analysis (CCA) and Redundancy Analysis (RDA) produce two kinds of site scores: LC (linear combination) scores and WA (weighted average) scores.
Table 2: Values of the const argument in vegan to get scores that are equal to those from other functions and software. The number of sites (rows) is n, the number of species (columns) is m, and the sum of all eigenvalues is \sum_k \lambda_k (this is saved as the item tot.chi in the rda result).

                      scaling       species const    site const
  prcomp, princomp    1             1                \sqrt{(n - 1) \sum_k \lambda_k}
  Canoco v3           -1, -2, -3    \sqrt{n - 1}     \sqrt{n}
  Canoco v4           -1, -2, -3    \sqrt{m}         \sqrt{n}
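As a sanity check of the prcomp/princomp row: assuming vegan's scaled site scores take the form const × u_{ik} × sqrt(λ_k / Σλ_k) (proportional-eigenvalue scaling, as described in section 3), the site constant sqrt((n − 1) Σλ_k) reproduces prcomp-style scores u_{ik} × sqrt((n − 1) λ_k). A small numeric sketch with made-up eigenvalues (not vegan output):

```python
import math

lams = [0.8, 0.3, 0.1]   # made-up eigenvalues
u = 0.5                  # one orthonormal site coefficient u_ik
n = 10                   # number of sites
tot = sum(lams)
const = math.sqrt((n - 1) * tot)   # site const for prcomp/princomp (Table 2)

for lam in lams:
    vegan_style = const * u * math.sqrt(lam / tot)   # proportional-eigenvalue scaling
    prcomp_style = u * math.sqrt((n - 1) * lam)      # prcomp-style site score
    assert abs(vegan_style - prcomp_style) < 1e-12
```

The equivalence is exact algebra: const × sqrt(λ_k/Σλ_k) = sqrt((n − 1) Σλ_k) × sqrt(λ_k/Σλ_k) = sqrt((n − 1) λ_k).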
Many computer programs for constrained ordinations give only LC scores following the recommendation of Palmer (1993). However, functions cca and rda in the vegan package use primarily WA scores.

1. LC scores are linear combinations, so they give us only the (scaled) environmental variables. This means that they are independent of vegetation and cannot be found from the species composition. Moreover, identical combinations of environmental variables give identical LC scores irrespective of vegetation.

2. McCune (1997) has demonstrated that noisy environmental variables result in deteriorated LC scores, whereas WA scores tolerate some errors in environmental variables. All environmental measurements contain some errors, and therefore it is safer to use WA scores.

This article studies mainly the first point. The users of vegan have a choice of either LC or WA (default) scores, but after reading this article, I believe that most of them do not want to use LC scores, because they are not what they were looking for in ordination.

4.1 LC Scores are Linear Combinations

Let us perform a simple CCA analysis using only two environmental variables, so that we can see the constrained solution completely in two dimensions:

> library(vegan)
> data(varespec)
> data(varechem)
> orig <- cca(varespec ~ Al + K, varechem)

Function cca in vegan uses WA scores as default, so we must specifically ask for LC scores (Fig. 2):

> plot(orig, dis=c("lc","bp"))

Figure 2: LC scores in CCA of the original data.

What would happen to LC scores if we shuffle the ordering of sites in the species
data? Function sample() below shuffles the indices:

> i <- sample(nrow(varespec))
> shuff <- cca(varespec[i,] ~ Al + K, varechem)

Figure 3: LC scores of shuffled species data.

It seems that the site scores are fairly similar, but oriented differently (Fig. 3). We can use Procrustes rotation to compare the two solutions (Fig. 4):

> plot(procrustes(scores(orig, dis="lc"), scores(shuff, dis="lc")))

Figure 4: Procrustes rotation of LC scores from CCA of original and shuffled data.

There is a small difference, but this will disappear:

> tmp1 <- rda(varespec ~ Al + K, varechem)
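Procrustes rotation itself is simple enough to sketch outside R. The following pure-Python function is an illustration (not vegan's procrustes): it centres two 2-D configurations and finds the least-squares rotation; the residual distances are the "Procrustes errors" plotted above.

```python
import math

def procrustes_rotation(target, scores):
    # Centre both 2-D point sets, then rotate `scores` onto `target`
    # in the least-squares sense (rotation only: no scaling, no reflection).
    def centre(pts):
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        return [(x - mx, y - my) for x, y in pts]
    A, B = centre(target), centre(scores)
    # Optimal rotation angle from the cross-covariance of the two sets.
    P = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A, B))
    Q = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(A, B))
    t = math.atan2(Q, P)
    c, s = math.cos(t), math.sin(t)
    rotated = [(c * x - s * y, s * x + c * y) for x, y in B]
    # Procrustes errors: residual distances after the rotation.
    errors = [math.hypot(ax - rx, ay - ry)
              for (ax, ay), (rx, ry) in zip(A, rotated)]
    return rotated, errors
```

If the second configuration is an exact rotation of the first, the recovered angle matches and the errors vanish; for the shuffled-data comparison above, the errors measure how far the two ordinations really are apart.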
The original data and shuffled data differ in their goodness of fit:

> orig

Call: cca(formula = varespec ~ Al + K, data = varechem)

              Inertia Proportion Rank
Total          2.0832     1.0000
Constrained    0.4760     0.2285    2
Unconstrained  1.6072     0.7715   21
Inertia is mean squared contingency coefficient

Eigenvalues for constrained axes:
  CCA1   CCA2
0.3608 0.1152

Eigenvalues for unconstrained axes:
   CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8
0.3748 0.2404 0.1970 0.1782 0.1521 0.1184 0.0836 0.0757
(Showed only 8 of all 21 unconstrained eigenvalues)

> shuff

Call: cca(formula = varespec[i, ] ~ Al + K, data = varechem)

Figure 6: Procrustes rotation of WA scores of CCA with the original and shuffled data.
4.2 Factor constraints

Figure 7: LC scores of the dune meadow data using only one factor as a constraint.

Figure 8: A "spider" plot connecting WA scores to corresponding LC scores. The shorter the web segments, the better the ordination.
References

Rodríguez-Gironés, M.A. and Santamaría, L. (2006). A new algorithm to calculate the nestedness temperature of presence-absence matrices. Journal of Biogeography, 33, 921-935.