ABSTRACT
We present here a framework for the study of molecular variation within a single species. Information on DNA haplotype divergence is incorporated into an analysis of variance format, derived from a matrix of squared-distances among all pairs of haplotypes. This analysis of molecular variance (AMOVA) produces estimates of variance components and F-statistic analogs, designated here as Φ-statistics, reflecting the correlation of haplotypic diversity at different levels of hierarchical subdivision. The method is flexible enough to accommodate several alternative input matrices, corresponding to different types of molecular data, as well as different types of evolutionary assumptions, without modifying the basic structure of the analysis. The significance of the variance components and Φ-statistics is tested using a permutational approach, eliminating the normality assumption that is conventional for analysis of variance but inappropriate for molecular data. Application of AMOVA to human mitochondrial DNA haplotype data shows that population subdivisions are better resolved when some measure of molecular differences among haplotypes is introduced into the analysis. At the intraspecific level, however, the additional information provided by knowing the exact phylogenetic relations among haplotypes or by a nonlinear translation of restriction-site change into nucleotide diversity does not significantly modify the inferred population genetic structure. Monte Carlo studies show that site sampling does not fundamentally affect the significance of the molecular variance components. The AMOVA treatment is easily extended in several different directions and it constitutes a coherent and flexible framework for the statistical analysis of molecular data.
[Figure 1 (residue): haplotypes h1 to h5 scored for three restriction sites a, b, c, with vectors p1 = [0 0 1], p2 = [0 1 1], p3 = [0 1 0], p4 = [1 1 1], p5 = [1 1 0]; two further four-site vectors, [0 0 0 0] and [0 0 1 0], also appear in the figure.]

ways met nor generally verifiable. We need a more general methodology that does not depend so critically on the specific assumptions.

Our purpose here is to design an alternative methodology that makes use of the available molecular information gathered in population surveys, while remaining flexible enough to accommodate different types of assumptions about the evolution of the genetic system. We propose to extend the work of COCKERHAM (1973), LONG (1986) and LONG, SMOUSE and WOOD (1987) on allelic correlations among demes to a comparable analysis of haplotypic diversity. Using the fact that a conventional sum of squares (SS) may be written as the sum of squared differences between all pairs of observations (LI 1976), we construct a
tion-site differences. This Euclidean metric is commonly employed for population differences (NEI and TAJIMA 1981), but it may be used just as easily for differences between single haplotypes. In the case where W is diagonal, $W = \mathrm{diag}\{w_s^2\}$, weighting sites differentially but treating them as independent, Equation 3a can be rewritten as

$$\delta_{jk}^2 = \sum_{s=1}^{S} w_s^2 (p_{sj} - p_{sk})^2. \quad (3b)$$

The rest of the analysis does not depend on which particular form of W has been chosen; we will assume that the weight matrix has been set in advance, returning to the definition of metrics and the choice of W for the human illustration.

Evolutionary distances between haplotypes: The DNA haplotypes can sometimes be related mutationally and arranged into a network (see Figure 1). We may then use the number of mutations along the network as a measure of evolutionary divergence between any two haplotypes. Network distance does not

where the elements of the block-diagonal submatrices $D_{ii}^2$ contain pairwise squared-distances ($\delta_{jk}^2$) between individuals of the same (ith) population, and those of the off-diagonal matrix blocks $D_{ii'}^2$ contain pairwise squared-distances between individuals, one from the ith and the other from the i'th population. Individuals may also be grouped at higher levels, according to such non-genetic criteria as geography, ecological environment, or language.

A conventional sum of squares [SS(Total)] may be written, barring a constant (2N), as the sum of squared differences between all pairs of N items (LI 1976). In the multidimensional case, using vectors instead of scalars, the conventional sum of squares becomes a sum of squared deviations (SSD) from the centroid of a multidimensional space. Thus,

$$SSD_{(Total)} = \sum_{j=1}^{N} (x_j - \bar{x})' W (x_j - \bar{x}) \quad (5a)$$

$$= \frac{1}{N} \sum_{j=1}^{N-1} \sum_{k>j}^{N} (x_j - x_k)' W (x_j - x_k), \quad (5b)$$
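The equivalence between the centroid form (5a) and the pairwise form (5b) can be checked numerically. The following is only a minimal sketch, assuming a diagonal weight matrix as in Equation 3b; the function names and the five toy haplotypes are illustrative and are not taken from the paper's data.

```python
import numpy as np

def squared_distances(P, w2=None):
    """Pairwise squared distances in the form of Equation 3b.

    P  : (N, S) array, one row of 0/1 restriction-site states per individual.
    w2 : length-S array of squared site weights (w_s^2); all ones if omitted.
    """
    P = np.asarray(P, dtype=float)
    if w2 is None:
        w2 = np.ones(P.shape[1])
    diff = P[:, None, :] - P[None, :, :]      # shape (N, N, S)
    return (w2 * diff ** 2).sum(axis=2)       # shape (N, N)

def ssd_from_centroid(P, w2=None):
    """Sum of squared deviations from the centroid (form 5a, diagonal W)."""
    P = np.asarray(P, dtype=float)
    if w2 is None:
        w2 = np.ones(P.shape[1])
    dev = P - P.mean(axis=0)
    return float((w2 * dev ** 2).sum())

def ssd_from_pairs(D2):
    """Same quantity from the squared-distance matrix alone (form 5b)."""
    N = D2.shape[0]
    return float(np.triu(D2, k=1).sum() / N)  # each pair j < k counted once

# Hypothetical toy haplotypes (rows) over three sites (columns).
toy = np.array([[0, 0, 1],
                [0, 1, 1],
                [0, 1, 0],
                [1, 1, 1],
                [1, 1, 0]])
D2 = squared_distances(toy)
print(ssd_from_centroid(toy), ssd_from_pairs(D2))
```

The point of the second function is that the haplotype vectors are no longer needed once the squared-distance matrix is available, which is what lets alternative distance matrices be substituted without changing the rest of the analysis.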
individuals into strata, we can write

$$SSD(Total) = SSD(Among\ Strata) + SSD(Within\ Strata), \quad (7)$$
482 L. Excoffier, P. E. Smouse and J. M. Quattro
placing us in traditional analysis of variance framework, designated here as Analysis of Molecular Variance, AMOVA (Table 1). For illustration, we shall partition the total sum of squared deviations, SSD(Total), into components for variation within populations, SSD(WP), variation among populations within regional groups, SSD(AP/WG), and variation among regional groups, SSD(AG)

$$SSD(WP) = \sum_{g=1}^{G} \sum_{i=1}^{I_g} \frac{\sum_{j=1}^{N_{ig}} \sum_{k=1}^{N_{ig}} \delta_{jk}^2}{2 N_{ig}} \quad (8a)$$

SSD(AP/WG) = [remainder of the equation not recoverable from the source]

1969, 1973), but it allows for the haploid transmission of mitochondrial genomes. It may also be useful to employ an analogous array of haplotypic correlation measures, which we shall term Φ-statistics to avoid confusion. Following COCKERHAM's lead, we have

$$\sigma_c^2 = (1 - \Phi_{ST})\sigma^2,$$
$$\sigma_b^2 = (\Phi_{ST} - \Phi_{CT})\sigma^2, \quad (10a)$$
$$\sigma_a^2 = \Phi_{CT}\sigma^2,$$

where $\sigma^2 = \sigma_a^2 + \sigma_b^2 + \sigma_c^2$: $\Phi_{ST}$ is viewed as the correlation of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the whole species; $\Phi_{CT}$ as the correlation of random haplotypes within a group of populations, relative to that of random pairs of haplotypes drawn from the whole species, and $\Phi_{SC}$ as the correlation of the molecular diversity of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the region. Still following the analogy, we rewrite the equations (10a) in terms of the Φ-statistics

[Table 1 (fragment): Total, d.f. = N - 1]
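Two of the quantities above lend themselves to a short numerical sketch. The first function computes SSD(WP) population by population from a squared-distance matrix, using the pairwise form of Equation 5. The second inverts Equations 10a to obtain the Φ-statistics from the three variance components; the expression used for $\Phi_{SC}$, $\sigma_b^2 / (\sigma_b^2 + \sigma_c^2)$, is the usual hierarchical form and is an assumption here, since it cannot be read from Equations 10a alone. Function names are illustrative. The variance components passed in the example are those reported in Table 3 for the haplotypic matrix.

```python
import numpy as np

def ssd_within_populations(D2, pop):
    """SSD(WP): sum over populations of (sum of squared distances) / (2 * n)."""
    D2 = np.asarray(D2, dtype=float)
    pop = np.asarray(pop)
    total = 0.0
    for label in np.unique(pop):
        idx = np.flatnonzero(pop == label)
        block = D2[np.ix_(idx, idx)]          # within-population block
        total += block.sum() / (2.0 * len(idx))
    return total

def phi_statistics(var_a, var_b, var_c):
    """Phi-statistics from variance components (a: among groups,
    b: among populations within groups, c: within populations)."""
    total = var_a + var_b + var_c
    return {
        "phi_ct": var_a / total,
        "phi_sc": var_b / (var_b + var_c),    # assumed hierarchical form
        "phi_st": (var_a + var_b) / total,
    }

# Variance components as reported in Table 3 (haplotypic distance matrix).
phi = phi_statistics(0.134, 0.022, 0.478)
print({name: round(value, 3) for name, value in phi.items()})
```

With these inputs the rounded values agree with the Φ-statistics listed in Table 3 for the same matrix, which is a useful consistency check on the reading of Equations 10a.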
cate each individual to a randomly chosen population, while holding sample sizes constant at the realized values. This amounts to random permutation of the rows (and corresponding columns) of the squared-distance matrix (MANTEL 1967). The variance components are estimated from each of a large number (say 1000) of permuted matrices. We use this procedure to obtain the null distribution and to test for the significance of $\Phi_{ST}$ and $\sigma_c^2$.

Two other permutation schemes are useful. The first assumes that the regions are real but that the populations within them are not, permuting individuals within regional groups without regard to population, a procedure used to obtain the null distributions of $\Phi_{SC}$ and $\sigma_b^2$. The second assumes that while the populations are real, the regional groupings are artificial, permuting whole populations across groups. In this case, the sizes of the groups (but not those of the populations) vary with each permutational run. This randomization scheme is used to obtain the null distribution of $\Phi_{CT}$ and $\sigma_a^2$.

Restriction site sampling: The sampling of nucleotides has been shown to be a major source of variability for estimates of molecular diversity (LYNCH and CREASE 1990). One can legitimately ask whether the results of our study depend on the particular array of restriction sites employed. We examine the influence of site sampling on the genetic structure of the populations, using a site resampling plan similar to the bootstrap used by EFRON (1982). Under the assumption that the observed 62 sites are representative of all potential mtDNA sites, we obtain the distribution of the variance components and associated Φ-statistics by Monte Carlo simulation, using 500 random collections of sites. For each collection, the procedure is as follows: (a) Draw a given number of sites from the observed array of 62 sites, at random and with replacement. Given the choice of sites, the haplotype of each individual is then taken as the combination of the original states of those randomly chosen sites; (b) compute interhaplotypic distances on the basis of the newly defined haplotypes and perform an AMOVA analysis. The distances are simply computed from Equation 3b, with all $w_s^2$ equal to 1; and (c) permute the matrix 500 times, and test the significance of the different statistics with the previously described procedures.

ILLUSTRATION WITH HUMAN mtDNA HAPLOTYPES

Due to its high relative mutation rate (BROWN, GEORGE and WILSON 1979; BROWN et al. 1982), mtDNA presents many distinct haplotypes in different demes. Prevailing maternal transmission in mammals (GILES et al. 1980; GYLLENSTEN et al. 1991) favors higher levels of population subdivision than is true for nuclear DNA markers (BIRKY, MARUYAMA and FUERST 1983; BIRKY, FUERST and MARUYAMA 1989). Barring migration, these two effects should produce increasingly non-overlapping sets of restriction haplotypes as divergence time between populations increases (WATTERSON 1985). Both of these features are evident in human mtDNA, which is small (16,569 bp, ANDERSON et al. 1981), rapidly evolving, and apparently free of recombination.

Restriction haplotypes of human mtDNA have been sampled from a substantial number of populations (for a review of the two main data sets, see EXCOFFIER 1990; STONEKING et al. 1990). Our purpose is to illustrate the methodology described above, rather than to reopen the question of human origins raised elsewhere (CANN, STONEKING and WILSON 1987; EXCOFFIER and LANGANEY 1989; EXCOFFIER 1990; STONEKING et al. 1990). We consider here ten populations for which ample data are available in the literature (Table 2). These particular populations were chosen to represent five "regional groups" of two populations each (Figure 2). The samples have also been analyzed for polymorphism with the five restriction enzymes most commonly used in human studies, BamHI: GGATCC, HpaI: GTTAAC, HaeII: (A/G)GCGC(T/C), AvaII: GG(T/A)CC, and MspI: CCGG. Among the 672 mtDNAs assayed from these ten populations, 34 of 62 recognizable sites were found to be polymorphic.

In a sample of 672, we cannot expect to see all $2^{34}$ possible haplotypes, but sample size considerations aside, the absence of recombination practically guarantees large amounts of disequilibrium among the 34
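Steps (a) and (b) of the site-resampling plan described under "Restriction site sampling" can be sketched as follows. This is only an illustration under stated assumptions: the 0/1 site matrix is randomly generated rather than taken from the paper's data, the function names are hypothetical, and the AMOVA computation and permutation test of step (c) are left out.

```python
import numpy as np

def resample_sites(P, n_sites, rng):
    """Step (a): draw n_sites columns at random, with replacement.

    P is an (N, S) array of 0/1 site states; returns the resampled
    haplotypes and the indices of the chosen sites.
    """
    P = np.asarray(P)
    cols = rng.integers(0, P.shape[1], size=n_sites)
    return P[:, cols], cols

def squared_distances(P):
    """Step (b): Equation 3b with all squared weights equal to 1."""
    P = np.asarray(P, dtype=float)
    diff = P[:, None, :] - P[None, :, :]
    return (diff ** 2).sum(axis=2)

rng = np.random.default_rng(0)
sites = rng.integers(0, 2, size=(6, 62))    # hypothetical data: 6 individuals, 62 sites
resampled, cols = resample_sites(sites, 20, rng)
D2 = squared_distances(resampled)           # input to one AMOVA replicate
```

Because sites are drawn with replacement, a site chosen twice simply contributes twice to the squared distance, which is how the bootstrap reweights the original sites from one replicate to the next.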
TABLE 2
Haplotypic composition of the population samples by region

Sample No.   Sample name     Reference                                      Sample size
             Haplotype numbers and haplotype frequencies (a)

Asia
 1           Tharu           BREGA et al. (1986)                            91
             Haplotypes:  1, 8, 9, 13, 28, 47, 48, 49, 50, 51, 52, 53, 54
             Frequencies: 4 8 2 5 2 3 2 2 2 1 1 1 2 1 1
 2           Oriental        JOHNSON et al. (1983)                          46
             Haplotypes:  9 8 61 12 13 27 28 29
             Frequencies: 3 2 1 2 4 2 2 1 1 1

West Africa
 3           Wolof           SCOZZARI et al. (1988)                         110
             Haplotypes:  1, 2, 7, 10, 27, 39, 52, 64, 65, 66, 67, 68, 71
             Frequencies: 23 3 9 2 9 2 2 5 2 2 1 1 1 2 1
 4           Peul            SCOZZARI et al. (1988)                         47
             Haplotypes:  1, 2, 6, 8, 34, 39, 69
             Frequencies: 11 1 9 2 2 1 1 1

America
 5           Pima            WALLACE, GARRISON and KNOWLER (1985)           63
             Haplotypes:  1, 6, 39, 46
             Frequencies: 59 2 1 1
 6           Maya            SCHURR et al. (1990)                           37
             Haplotypes:  1, 47, 95
             Frequencies: 30 4 3

Europe
 7           Finnish         VILKKI, SAVONTAUS and NIKOSKELAINEN (1988)     110
             Haplotypes:  1, 6, 11, 18, 21, 38, 47, 82, 83
             Frequencies: 8 7 2 4 3 8 2 2 1 1
 8           Sicilian        SEMINO et al. (1989)                           90
             Haplotypes:  1, 2, 6, 18, 21, 23, 34, 42, 47, 56, 57, 72, 73, 75, 76, 77
             Frequencies: 5 0 3 9 1 1 1 1 1 1 5 1 2 1 1 1 1 1

Middle-East
 9           Israeli Jews    BONNÉ-TAMIR et al. (1986)                      39
             Haplotypes:  1, 6, 11, 17, 22, 36, 37, 38, 39
             Frequencies: 1514 1 1 4 1 1 1 1
10           Israeli Arabs   BONNÉ-TAMIR et al. (1986)                      39
             Haplotypes:  1, 2, 6, 7, 22, 31, 40, 41, 42, 43, 44, 45
             Frequencies: 2 2 1 1 1 6 2 1 1 1 1 1 1

Total                                                                       672

(a) For each population, haplotype numbers are reported on the first line and their absolute frequencies are shown in italic on the second line.
TABLE 3
Hierarchical analysis of variance on four different square matrices of distances between haplotypes

D1 (haplotypic)
Variance component                            Variance        % of total   P (a)      Φ-statistic
Among regions ($\sigma_a^2$)                  0.134           21.12        0.002      $\Phi_{CT}$ = 0.211
Among populations/regions ($\sigma_b^2$)      0.022           3.49         < 0.0001   $\Phi_{SC}$ = 0.044
Within populations ($\sigma_c^2$)             0.478           75.39        < 0.0001   $\Phi_{ST}$ = 0.246

D2 (multiallelic)
Among regions ($\sigma_a^2$)                  0.055           15.73        0.008      $\Phi_{CT}$ = 0.157
Among populations/regions ($\sigma_b^2$)      0.013           3.59         < 0.0001   $\Phi_{SC}$ = 0.043
Within populations ($\sigma_c^2$)             0.281           80.68        < 0.0001   $\Phi_{ST}$ = 0.193

D3 (Prim network)
Among regions ($\sigma_a^2$)                  0.142           21.99        0.002      $\Phi_{CT}$ = 0.220
Among populations/regions ($\sigma_b^2$)      0.021           3.29         < 0.0001   $\Phi_{SC}$ = 0.042
Within populations ($\sigma_c^2$)             0.484           74.72        < 0.0001   $\Phi_{ST}$ = 0.253

D4 (nonlinear)
Among regions ($\sigma_a^2$)                  0.127 × 10^-?   21.30        0.002      $\Phi_{CT}$ = 0.213
Among populations/regions ($\sigma_b^2$)      0.020           3.31         < 0.0001   $\Phi_{SC}$ = 0.042
Within populations ($\sigma_c^2$)             0.449 × 10^-?   75.39        < 0.0001   $\Phi_{ST}$ = 0.246

(a) Probability of having a more extreme variance component and Φ-statistic than the observed values by chance alone. $\Phi_{CT}$ and $\sigma_a^2$ are tested under random permutations of whole populations across regions. $\Phi_{SC}$ and $\sigma_b^2$ are tested under random permutation of individuals across populations but within the same region. $\Phi_{ST}$ and $\sigma_c^2$ are tested under random permutation of individuals across populations without regard to either their original populations or regions.
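The three permutation schemes named in the footnote can be sketched as operations on labels; shuffling the population labels of individuals is equivalent to permuting the rows and corresponding columns of the squared-distance matrix while leaving the labels fixed. The sketch below is only illustrative: the function names, the label encoding, and the convention used for the tail probability (the proportion of permuted values at least as large as the observed one) are assumptions, not details taken from the paper.

```python
import numpy as np

def permute_individuals(pop, rng):
    """Shuffle individuals across all populations; sample sizes are preserved."""
    return rng.permutation(np.asarray(pop))

def permute_within_groups(pop, group_of_pop, rng):
    """Shuffle individuals across populations, but only within their own group."""
    pop = np.asarray(pop)
    groups = np.array([group_of_pop[p] for p in pop.tolist()])
    out = pop.copy()
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        out[idx] = rng.permutation(pop[idx])
    return out

def permute_populations(group_of_pop, rng):
    """Reassign whole populations to groups; group sizes in individuals may change."""
    pops = sorted(group_of_pop)
    labels = rng.permutation([group_of_pop[p] for p in pops]).tolist()
    return dict(zip(pops, labels))

def permutation_p_value(observed, null_values):
    """Proportion of permuted statistics at least as large as the observed one."""
    null_values = np.asarray(null_values, dtype=float)
    return float(np.mean(null_values >= observed))
```

In use, one of the three functions would be called once per permutation (say 1000 times), the statistic of interest recomputed from the unchanged distance matrix with the shuffled labels, and the collected values passed to `permutation_p_value`.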
between distinguishable haplotypes are assumed to be unknown, a standard treatment for allozymes or other protein systems (see, however, RICHARDSON and SMOUSE 1976; RICHARDSON, SMOUSE and RICHARDSON 1977). This treatment is also applicable to antigenic systems, or even to molecular fingerprint analysis, where the banding pattern of two individuals either matches or does not. The Φ-statistics become the usual multiallelic F-statistics (LONG 1986). The results of our hierarchical analysis are presented in Table 3 under D2. Most of the haplotype diversity (80.68%) is found within each population, but an appreciable amount still (15.73%) separates regions. The differences among populations within regions are small (3.59%). For the two procedures that involve permutation of individuals across populations, testing $\sigma_c^2$, $\sigma_b^2$, $\Phi_{ST}$ and $\Phi_{SC}$, our observed variance components showed extreme values in all cases. Seven permutations of whole populations across regions were found to yield greater $\sigma_a^2$ (and $\Phi_{CT}$) than our observed value. Although the result is still significant, we clearly lose geographic resolution with this metric.

D3: Our third matrix is based on a distance metric computed along the evolutionarily parsimonious network shown in Figure 3. When several connections of equal length are possible for a particular haplotype, two additional rules are used to make a choice (EXCOFFIER and LANGANEY 1989). The first is a probability criterion; a link between two rare (< 5%) haplotypes is less likely than a link between rare and frequent (> 5%) haplotypes. The second criterion is geographic; links between haplotypes that are found
Analysis of Molecular Variance 487
ENS, SPIELMAN and HARRIS 1981; NEI and TAJIMA 1981, 1983; KAPLAN 1983; NEI and MILLER 1990). For simplicity, we have used Equation 4 from NEI and MILLER (1990), which yields results very close to the maximum-likelihood estimates of NEI and TAJIMA (1983). For each adjacent pair of haplotypes x and y on the network, we estimate the nucleotide diversity $d_{xy}$ by

$$d_{xy} = \frac{\sum_{e=1}^{E} s_e r_e d_{xy(e)}}{\sum_{e=1}^{E} s_e r_e}, \quad (11)$$

[Figure residue: two panels annotated "Observed value = 0.131" and "Observed value = 0.022".]

adjacent haplotypes on the network are virtually identical, except for the two cases where more than one restriction-site change is involved.

Genetic structure and DNA site sampling: We evaluated the sensitivity to site sampling by examining the D1 partition for a random sample of sites, with the number of sites ranging from 5 to 62. We report the percentages of significant values (α < 0.05) for the variance components in Figure 6. These three power curves are indistinguishable from those for the Φ-statistics, which are suppressed. As anticipated, the percentage of significant results increases with the number of sampled sites for all statistics; $\sigma_c^2$ and $\Phi_{ST}$ approach 100% significant outcomes when as few as 40 sites are taken into account. When 62 sites are randomly sampled, $\sigma_b^2$ and $\Phi_{SC}$ are significant in 99.8% of all replicates, whereas $\sigma_a^2$ and $\Phi_{CT}$ are significant 94.8% of the time. The component of molecular variance among regions exhibits least power and requires the largest number of restriction sites, suggesting that differences among regions are due to specific sites and mutations. On the whole, however, these high levels of significance show that the inferred genetic structure of our sampled populations is not a sampling artifact and that reliable inference does not require an inordinately large number of sites. We have not carried the analysis to more than 62 sites, because an increase in the number of sampled sites would mean the occurrence of new haplotypes, the distribution of which among populations is unknown from our data.

FIGURE 6.-Percentage of significant variance components as a function of haplotype size (in number of restriction sites; horizontal axis: number of restriction sites sampled). A given number of sites is drawn at random with replacement from the original 62 restriction sites and variance component significance is tested by 500 permutations of the original matrix of squared interindividual distances (see text). This process is repeated 500 times to find the percentage of significant outcomes at the level α = 0.05. Φ-statistics curves are almost identical to corresponding variance components and are not reported on the graph.

That conclusion is subject, however, to the assumption that the 62 sites observed are representative of all sites of the mtDNA molecule. Our sites, sampled from an empirical set, are, however, not entirely random. As a practical matter, restriction enzymes that do not generate restriction site variation are usually discarded from the assay battery. The enzymes used here are used routinely in human work precisely because they do exhibit substantial polymorphism. They almost surely do not provide a random representation of the human mtDNA genome, and our collection of sites is certainly biased towards excess polymorphism. The fact that the variation encountered is also geographically structured was not used as a criterion of choice. Indeed, a recent work (STONEKING et al. 1990) using additional enzymes revealing even greater polymorphism shows as much geographic structure as we have demonstrated here. It seems probable that a truly random sample of sites (or nucleotides), a larger fraction of which would be monomorphic, would be required to demonstrate the same level of intra-specific structure we have described here. The question of whether our chosen genetic markers are a representative set is one more often dealt with by assumption than proof. Empirically, we see no alternative but testing of the data we have.

LYNCH and CREASE (1990) studied nucleotide sampling analytically, showing that it constituted a major source of variance in estimating diversity at the nucleotide level. Our results are somewhat at odds with theirs. In our case, however, the unit studied for its diversity is not the nucleotide but the haplotype, which is itself a collection of sites. The variance of haplotypic diversity due to site sampling appears to be lower than the variance of nucleotide diversity due to the same sampling process. When the number of sites per haplotype is reduced, site sampling becomes increasingly important, as shown in Figure 6. For a haplotype with only 5 sites, $\sigma_c^2$ is significant in 73% of all replicates, $\sigma_b^2$ in 44.4%, and $\sigma_a^2$ in only 30.8%, showing the importance of site sampling in this case.

DISCUSSION

Human population radiation: Hierarchical analysis of human mtDNA variability shows substantial subdivision among human populations, but with a large fraction of the variation found within populations (> 74%). A similar value (69%) has been derived using a $G_{ST}$ approach on another human mtDNA data set (STONEKING et al. 1990). Our rather contrived regional groups exhibit a high level of divergence. Populations within regions were shown to be significantly (but minimally) differentiated. Our results suggest that extensive studies within each of the regions are needed to determine whether the much greater divergence observed "among regions" than "among populations/within regions" is an artifact of our arbi-
trary choice of populations, a sampling consequence of isolation-by-distance, or whether there are steep boundary zones of limited genetic exchange between regions. Such zones have come under increasing scrutiny of late (BARBUJANI, ODEN and SOKAL 1989; BARBUJANI and SOKAL 1990, 1991), and a generic answer to the "boundary question" will only be available from a study of more evenly spaced samples.

Regional differentiation is more apparent when the degree of difference between haplotypes is taken into account, in keeping with the observation that molecular distances are larger for pairs of haplotypes drawn from different regions than from the same region (Figure 4). This suggests that a substantial fraction of the mtDNA variability among regions is due to divergent arrays of haplotypes, ultimately attributable to the occurrence of new mutations along the path to regional radiation. It is initially surprising that computing distances along the network only slightly enhances the regional differences in our data set. On further reflection, however, the results make sense. Homoplasies due to recurrent mutations mainly affect low frequency haplotypes that are located at the tips of the network. Both the low frequency of such haplotypes and their network placement will minimally affect the hierarchical partition of variation. The computation of evolutionary distances along a network should yield greater additional resolution for taxonomic assemblages of greater internal radiation, where extinction of intermediates would lead to homoplasic mutations of higher frequency and of more central position.

Nonlinear transformation of restriction-site differences into estimates of nucleotide diversity between haplotypes also does not substantially affect the haplotypic variance partition. We attribute this result to the low divergence between adjacent haplotypes on the network. As most of the links between adjacent haplotypes involve unique restriction-site changes, taking into account the fact that a particular site involves four-, five- or six-base recognition sequences does not matter much here. Thus, the additional assumptions involved in the nonlinear translation, such as uniform substitution rates at different sites and identical substitution probabilities for the four nucleotides, may not be necessary in delineating the internal genetic structure of a single species. However, such nonlinear transformations could be useful if the analysis included individuals from different species with larger interhaplotypic differences.

These conclusions may depend on the choice of the network presented in Figure 3, which was built before the AMOVA analyses were performed. Its basic structure had already been determined in previous publications (JOHNSON et al. 1983; EXCOFFIER and LANGANEY 1989). When a high level of homoplasy is present in the data, as it is here, the parsimony criterion does not lead to a unique network, as is also the case for most phylogeny reconstruction algorithms, and a large number of equally parsimonious networks could have been imposed. The question of how to choose among equally parsimonious networks (or trees) is a problem that cannot be settled here. Our contention is merely that given a minimum spanning (parsimonious) network, buttressed by frequency and geographic criteria, an eminently "sensible" network, one can use the methods developed here for a useful partition of the variation. For the example at hand, the additional wrinkle of measuring distance along the network does not provide any additional resolution. Whether we could do better with a different network, and how to choose such a network, we will leave for a later paper.

Our analysis of regional differences shows that the geographic criterion used to define regional groups is quite reasonable as a first approximation. Slightly greater regional divergence was found with an alternative partition of the populations. The European region contains the most internal diversity, whereas the Amerindian region contains the least. The two "alternative" regions Sicily + Maya and Pima + Finland present intermediate "within region" diversities, which slightly lower the total "within region" variability and increase the "among region" variance component. One might consider that this component could itself be useful as a criterion for defining supra-population groups. This situation also shows that we need to examine more closely the extent to which each region or each population contributes to the total molecular diversity, as the variance components or Φ-statistics do not bring us much detail of the patterning of the species variability. As has already been done for the multiallelic case (LONG, SMOUSE and WOOD 1987), our analysis framework could be extended to a partitioning of the among-population variability into pairwise population distance components.

Methodological considerations: We have introduced an analytical method for studying the genetic structure of populations that permits use of as much (or as little) of the available information on the molecular nature of DNA haplotypes as is desired. It extends procedures that explicitly use an analysis of variance format (COCKERHAM 1969, 1973; WEIR and COCKERHAM 1984; LONG 1986; LONG, SMOUSE and WOOD 1987) to estimate the degree of intra-specific genetic subdivision. If we can legitimately assume that populations become differentiated by drift alone, then we can expect a linear relation between divergence time and allelic correlation for short periods (REYNOLDS, WEIR and COCKERHAM 1983). In our case, population differences in restriction pattern have clearly arisen from genetic drift of existing variants, from the intro-
duction of new mutations, and from some degree of gene flow, so we will not extrapolate our results as far as a divergence-time interpretation.

The point of the current exercise is neither to estimate unknown population parameters from our variance components nor to define exactly how or at what rate these population differences have developed. Our purpose here is to demonstrate how to delineate the extent of genetic differentiation within and among populations. The approach is general enough to deal with any organism and to study any type of structure (hierarchical or otherwise) that one might wish to consider. The underlying (distance matrix) structure of the analysis permits flexible exploration of a given data set. Several different distance matrices, one for each particular set of assumptions, may be taken as alternate inputs and their influence on the outcome evaluated. The relation to F-statistics is straightforward, though subject to the usual limitations. More important is the realization that the whole array of least-squares methods (analysis of variance, analysis of covariance, regression, correlation, principal coordinates analysis, factor analysis, etc.) is accessible from this same distance matrix. We have tapped only a small portion of the available repertoire here.

Significance testing with permutation procedures is both easy and essentially assumption free; in particular, we are freed from the testing limitations of normal theory, so useful in analysis of variance but so inappropriate here. We can address several questions with the same data set. We might even wish to test the difference between outcomes formally, based on different squared-distance matrices. As the computation of the variance components involves only manipulation of the original input distance metrics, the outcome will only be as different as the inputs. Squared-distance matrices may be compared using a normalized Mantel test (SMOUSE, LONG and SOKAL 1986).

If one wishes to translate restriction site differences into estimates of the fraction of nucleotide differences between pairs of haplotypes ($\pi_{jk}$), several procedures are available (ENGELS 1981; EWENS, SPIELMAN and HARRIS 1981; NEI and TAJIMA 1981, 1983; KAPLAN 1983; NEI and MILLER 1990), any one of which can be used to modify the interhaplotypic squared distances in our technique. Additional translation may permit linearization of these estimates with divergence time. Such transformations have the additional advantage of being independent of the number of restriction sites surveyed. We have seen, however, that this process does not fundamentally alter our estimates of the variance components. Extension of this methodology to DNA sequence data is straightforward and can be achieved through a redefinition of the interchromosomal distance metric. As several methods are already available for this purpose in the literature (SWOFFORD and OLSEN 1990), one is free to choose. We will content ourselves here with the observation that the use of a Euclidean metric has some natural advantages, not the least of which is that a matrix of such distances can be used for other purposes than phylogenetic analysis. The considerable variety of data types made available by molecular biology needs a statistical analysis framework that is coherent but also sufficiently flexible to accommodate the different types of questions inherent in each particular situation. The AMOVA treatment presented here is intended to serve as the beginning of just such a framework.

The authors thank OSCAR GAGGIOTTI and ANDRE LANGANEY for their comments on the manuscript, as well as MICHAEL LYNCH and another (anonymous) reviewer for their suggestions. L.E. was funded by FNRS Switzerland 32-28784.90 and 32-27845.89, and INSERM France 900 814, P.E.S. by NJAES/USDA-32102, J.M.Q. by the Roosevelt Fund, American Museum of Natural History and by the Leathem-Steinetz-Stauber Fund, Rutgers University. An analysis of molecular variance program (AMOVA), including the permutational testing procedures, is available on request from L.E.

LITERATURE CITED

ANDERSON, S., A. T. BANKIER, B. G. BARREL, M. H. L. DE BRUIJN, A. R. COULSON, J. DROUIN, I. C. EPERON, D. P. NIERLICH, B. A. ROE, F. SANGER, P. H. SCHREIER, A. J. H. SMITH, R. STADEN and I. G. YOUNG, 1981 Sequence and organization of the human mitochondrial genome. Nature 290: 457-465.
BARBUJANI, G., N. L. ODEN and R. R. SOKAL, 1989 Detecting areas of abrupt change in maps of biological variables. Syst. Zool. 38: 376-389.
BARBUJANI, G., and R. R. SOKAL, 1990 The zones of sharp genetic change in Europe are also language boundaries. Proc. Natl. Acad. Sci. USA 87: 1816-1819.
BARBUJANI, G., and R. R. SOKAL, 1991 Genetic population structure of Italy. II. Physical and cultural barriers to gene flow. Am. J. Hum. Genet. 48: 398-411.
BIRKY, C. W., P. FUERST and T. MARUYAMA, 1989 Organelle gene diversity under migration, and drift: equilibrium expectations, approach to equilibrium, effects of heteroplasmic cells, and comparison to nuclear genes. Genetics 121: 613-627.
BIRKY, C. W., T. MARUYAMA and P. FUERST, 1983 An approach to population and evolutionary genetic theory for genes in mitochondria and chloroplasts, and some results. Genetics 103: 513-527.
BONNÉ-TAMIR, B., M. J. JOHNSON, A. NATALI, D. C. WALLACE and L. L. CAVALLI-SFORZA, 1986 Human mitochondrial DNA types in two Israeli populations-a comparative study at the DNA level. Am. J. Hum. Genet. 38: 341-351.
BREGA, A., R. GARDELLA, O. SEMINO, G. MORPURGO, G. B. ASTALDI RICOTTI, D. C. WALLACE and A. S. SANTACHIARA-BERENECETTI, 1986 Genetic studies on the Tharu population of Nepal: restriction endonuclease polymorphisms of mitochondrial DNA. Am. J. Hum. Genet. 39: 502-512.
BROWN, W. M., M. GEORGE, JR. and A. C. WILSON, 1979 Rapid evolution of animal mitochondrial DNA. Proc. Natl. Acad. Sci. USA 76: 1967-1971.
BROWN, W. M., E. M. PRAGER, A. WANG and A. C. WILSON, 1982 Mitochondrial DNA sequences of primates: tempo and mode of evolution. J. Mol. Evol. 18: 225-239.
CANN, R. L., M. STONEKING and A. C. WILSON, 1987 Mitochondrial DNA and human evolution. Nature 325: 31-36.
COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-84.
COCKERHAM, C. C., 1973 Analyses of gene frequencies. Genetics 74: 679-700.
EFRON, B., 1982 The Jackknife, the Bootstrap and Other Resampling Plans. Regional Conference Series in Applied Mathematics, Vol 38. Society for Industrial and Applied Mathematics, Philadelphia.
ENGELS, W. R., 1981 Estimating genetic divergence and genetic variation with restriction endonucleases. Proc. Natl. Acad. Sci. USA 78: 6329-6333.
EWENS, W. J., R. S. SPIELMAN and H. HARRIS, 1981 Estimation of genetic variation at the DNA level from restriction endonuclease data. Proc. Natl. Acad. Sci. USA 78: 3748-3750.
EXCOFFIER, L., 1990 Evolution of human mitochondrial DNA: evidence for departure from a pure neutral model of populations at equilibrium. J. Mol. Evol. 30: 125-139.
EXCOFFIER, L., and A. LANGANEY, 1989 Origin and differentiation of human mitochondrial DNA. Am. J. Hum. Genet. 44: 73-85.
FARRIS, J. S., 1970 Methods for computing Wagner trees. Syst. Zool. 19: 83-92.
FELSENSTEIN, J., 1988 Phylogenies from molecular sequences:
of the coancestry coefficient: Basis for a short term genetic distance. Genetics 105: 767-779.
RICHARDSON, R. R., and P. E. SMOUSE, 1976 Patterns of electrophoretic mobility. I. Interspecific comparisons in the Drosophila mulleri complex. Biochem. Genet. 14: 447-466.
RICHARDSON, R. R., P. E. SMOUSE and M. E. RICHARDSON, 1977 Patterns of molecular variation. II. Associations of electrophoretic mobility and larval substrate within species of the Drosophila mulleri complex. Genetics 85: 14 - 1 154.
ROHLF, F. J., 1990 NTSYS. Numerical Taxonomy and Multivariate Analysis System. Ver. 1.60. Exeter Publ. Ltd., Setauket, N.Y.
SCHURR, T. G., S. W. BALLINGER, Y.-Y. GAN, J. A. HODGE, D. A. MERRIWEATHER, D. N. LAWRENCE, W. C. KNOWLER, K. M. WEISS and D. C. WALLACE, 1990 Amerindian mitochondrial DNAs have rare Asian mutations at high frequencies, suggesting they derived from four primary maternal lineages. Am. J. Hum. Genet. 46: 613-623.
SCOZZARI, R., A. TORRONI, O. SEMINO, G. SIRUGO, A. BREGA and A. S. SANTACHIARA-BERENECETTI, 1988 Genetic studies on the Senegal population. I. Mitochondrial DNA polymorphisms. Am. J. Hum. Genet. 43: 534-544.
inference and reliability. Annu. Rev. Genet. 22: 521-565. SEMINO, O., A. TORRONI, R. SCOZZARI, A. BREGA,G. DE BENEDIC-
GILES,R. E., H. BLANC,H. M. CANNand D. C. WALLACE, TIS and A. S. SANTACHIARABENERECETTI,1989
1980 Maternal inheritance of human mitochondrial DNA. Mitochondrial DNA polymorphisms in Italy. 111. Population
Proc. Natl. Acad. Sci. USA 77: 671 5-6719. data from Sicily: a possible quantitation of African ancestry.
GYLLENSTEN, U., D. WHARTON, A. JOSEFSSON and A. C. WILSON, Ann. Hum. Biol. 53: 193-202.
1991 Paternal inheritance of mitochondrial DNAinmice. SLATKIN, M., 1987 The average number of sites separating DNA
Nature 3 5 2 255-257. sequences drawn from a subdivided population. Theor. Popul.
JOHNSON, M. J., D. C. WALLACE, S. D. FERRIS,M. C. RATTAZZI and Biol. 32: 42-49.
L. L. CAVALLI-SFORZA, 1983 Radiation of human mitochon- SMOUSE,P. E., J. C. LONGand R. R. SOKAL,1986 Multiple
drial DNA types analyzed by restriction endonuclease cleavage regression and correlation extensions of the Mantel test of
matrix correspondence. Syst. Zool. 3 5 627-632.
patterns. J. Mol. Evol. 1 9 255-271.
STONEKING, M.,L.B. JORDE,K. BHATIAand A.C. WILSON,
KAPLAN,N., 1983 Statistical analysis of restriction enzyme map 1990 Geographic variation in human mitochondrial DNA
data and nucleotide sequence data, pp. 75-106 in Statistical from Papua New Guinea. Genetics 124 717-733.
Analysis of DNA Sequence Data, edited byB. S. WEIR.Marcel SWOFFORD, D.L., and G. J. OLSEN, 1990 Phylogeny reconstruc-
Dekker, New York. tion, pp. 41 1-501 in MolecularSystematics, edited byD. M.
LI, C. C., 1976 Population Genetics. Boxwood, Pacific Grove, Calif. HILLISand C. MORITZ. Sinauer Associates, New York.
LONG,J. C., 1986 The allelic correlation structure of Gainj- and TAKAHATA, N., and S. R. PALUMBI, 1985 Extranuclear differen-
Kalam-speaking people. I. The estimation and interpretation tiation and gene flow in the finite island model. Genetics 1 0 9
of Wright’s F-statistics. Genetics 112: 629-647. 441-457.
LONG,J. C., P. E. SMOUSE and J. W. WOOD,1987 The allelic VILKKI,J., M.-L. SAVONTAUS and E.V. NIKOSKELAINEN, 1988
correlation structure of Gainj- and Kalam-speaking people. 11. Human mitochondrial types in Finland. Hum. Genet. 8 0 3 17-
The genetic distance between population subdivisions.Genetics 321.
117: 273-283. WALLACE,D. C., K. GARRISON and W. C. KNOWLER,1985
LYNCH,M., and T. J. CREASE,1990 The analysis of population Dramatic founder effect in Amerindian mitochondrial DNAs.
survey data on DNA sequence variation. Mol.Biol.Evol. 7: Am. J. Phys. Anthrop. 6 8 149-155.
377-394. WATTERSON, G. A., 1975 On the number of segregating sites in
MANTEL,N.,1967 The detection of disease clustering and a genetical models without recombination. Theor. Popul. Biol.
7: 256-276.
generalized regression approach. Cancer Res. 27: 209-220.
WATTERSON, G. A., 1985 The genetic divergence of two popula-
NEI, M., 1973 Analysis of gene diversity in subdivided popula- tions. Theor. Popul. Biol. 27: 298-317.
tions. Proc. Natl. Acad. Sci. USA 7 0 3321-3323. WEIR,B. S., and C.C. COCKERHAM, 1984 Estimating F-statistics
NEI, M., 1977 F-statistics andthe analysis of gene diversity in for the analysis of population structure. Evolution 38: 1358-
subdivided populations. Ann. Hum. Genet. 41: 225-233. 1370.
NEI, M., and J. C. MILLER,1990 A simple method for estimating WRIGHT,S., 1951 The genetical structure of populations. Ann.
average number of nucleotide substitutions within and between Eugen. 1: 323-334.
populations from restriction data. Genetics 125: 873-879. WRIGHT,S., 1965 The interpretation of population structure by
NEI, M., and F. TAJIMA, 1981 DNA polymorphism detectable by F-statistics with specialregards to systems of mating. Evolution
restriction endonucleases. Genetics 97: 145-163. 1 9 395-420.
NEI, M., and F. TAJIMA,1983 Maximum likelihood estimation of ZHIVOTOVSKY, L. A. 1988 Some methods of analysis of correlated
the number of nucleotide substitutions from restriction sites characters, pp. 423-432 in Proceedings of the II International
data. Genetics 1 0 5 207-2 17. Conference onQuantitativeGenetics, edited byB. S. WEIR,G.
PRIM,R. C., 1957 Shortest connection networks and some gen- EISEN,M. M. GOODMAN, and G. NAMKOONG. Sinauer Associ-
eralizations. Bell Syst. Tech. J. 3 6 1389-1401. ates, Sunderland, Mass.
REYNOLDS, J., B. S. WEIRand C. C. COCKERHAM, 1983 Estimation Communicating editor: E. THOMPSON