
Syst. Biol.

49(4):652–670, 2000

Likelihood-Based Tests of Topologies in Phylogenetics

NICK GOLDMAN,1 JON P. ANDERSON,2 AND ALLEN G. RODRIGO 3


1 University Museum of Zoology, Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK;
E-mail: N.Goldman@zoo.cam.ac.uk
2 Department of Molecular Biotechnology, University of Washington, Seattle, Washington USA
3 School of Biological Sciences, University of Auckland, Auckland, New Zealand

Abstract. Likelihood-based statistical tests of competing evolutionary hypotheses (tree topologies) have been available for approximately a decade. By far the most commonly used is the Kishino–Hasegawa test. However, the assumptions that have to be made to ensure the validity of the Kishino–Hasegawa test place important restrictions on its applicability. In particular, it is only valid when the topologies being compared are specified a priori. Unfortunately, this means that the Kishino–Hasegawa test may be severely biased in many cases in which it is now commonly used: for example, in any case in which one of the competing topologies has been selected for testing because it is the maximum likelihood topology for the data set at hand. We review the theory of the Kishino–Hasegawa test and contend that for the majority of popular applications this test should not be used. Previously published results from invalid applications of the Kishino–Hasegawa test should be treated extremely cautiously, and future applications should use appropriate alternative tests instead. We review such alternative tests, both nonparametric and parametric, and give two examples which illustrate the importance of our contentions. [Kishino–Hasegawa test; maximum likelihood; phylogeny; Shimodaira–Hasegawa test; statistical tests; tree topology.]

Hasegawa and Kishino (1989) and Kishino and Hasegawa (1989) developed methods for estimating the standard error and confidence intervals for the difference in log-likelihoods between two topologically distinct phylogenetic trees representing hypotheses that might explain particular aligned sequence data sets. The method initially was introduced to compute confidence intervals on posterior probabilities for topologies in a Bayesian analysis (Hasegawa and Kishino, 1989; Kishino and Hasegawa, 1989). Attention quickly turned to using the same ideas to perform nonparametric likelihood ratio tests (LRTs) of the statistical significance of topologies (Kishino and Hasegawa, 1989), as currently implemented in the popular PHYLIP (Felsenstein, 1995), MOLPHY (Adachi and Hasegawa, 1996), PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997), and PAUP* (Swofford, 1998) software packages. Tests based on the ideas of Kishino and Hasegawa (hereafter referred to as KH-tests) could overcome some of the peculiar properties of phylogenetic estimation (see, e.g., Yang et al., 1995), for example, that different topologies do not share the same sets of parameters and do not generally represent nested hypotheses (one hypothesis being a special case of another).

Kishino and Hasegawa originally devised and applied their methods for trees that were specified a priori, that is, trees that corresponded to phylogenetic hypotheses derived independently of the data at hand. However, since then, the KH-test has in the majority of cases been used to compare the maximum likelihood (ML) tree derived from the data to one or more a priori–specified trees or to one or more a posteriori–specified trees (e.g., the trees with second-, third-, and so forth highest likelihoods). Such applications are facilitated, even encouraged, by various software packages. We contend that these latter applications of the KH-test are incorrect. We are not the first to note this: Swofford et al. (1996) observed that the KH-test should be applied only when the trees in question are specified a priori. Shimodaira and Hasegawa (1999) recently made the same point and described a correct nonparametric test for the case in which the ML tree is one of the tested topologies.

In this paper, we review the published literature on likelihood-based tests of topologies in phylogenetics. We wish to draw attention to the inapplicability of the KH-test for many common situations, despite the fact that it has become the most widely used and generally accepted way of testing alternative


hypotheses of evolutionary relationship. We first look in detail at the KH-test, in particular reviewing the conditions for its correct application and considering various methods that can be used to generate the null hypothesis distribution of the KH-test statistic. We then consider alternative applications of the KH-test, which represent the majority of published applications, and discuss why these are incorrect. We contend that the bias inherent in the KH-test when applied incorrectly warrants that the results of such applications be treated extremely cautiously. Next, we describe both the nonparametric test introduced by Shimodaira and Hasegawa (1999) and a parametric LRT of topologies, described briefly by Swofford et al. (1996) and based on the parametric bootstrap or Monte Carlo simulation approach to hypothesis testing using likelihood ratio statistics advocated by Goldman and others (e.g., Goldman, 1993; Hillis et al., 1996; Huelsenbeck and Crandall, 1997; Huelsenbeck and Rannala, 1997).

All of the methods described in this paper are equally relevant to ML analyses of DNA and amino acid (aa) data. We provide two analyses comparing the results of different tests, one using DNA sequences from HIV-1 isolates and one using aa sequences from mammalian mitochondrial (mt) proteins. These examples illustrate the importance of understanding precisely what hypotheses are being tested, the importance of performing statistically valid tests, and the differing power of parametric and nonparametric tests. We conclude with a discussion of the importance of our and others' recent work on the theory of tests of topologies.

The likelihood-based test developed from the work of Hasegawa and Kishino (1989) and Kishino and Hasegawa (1989) is quite closely related to a parsimony-based test described by Templeton (1983). Kishino and Hasegawa (1989:177) also discussed parsimony-based analogs of their likelihood methods. Such parsimony-based tests are subject to precisely the same misuses described in this paper for likelihood-based tests. In their unmodified forms, they too should not be used with phylogenies that are not selected a priori. Derivation of their corrected forms could follow the same general approaches described below (Buckley et al., unpubl.).

METHODS

Terminology and Notation

For the purposes of this paper, we are interested in the topologies of phylogenetic trees, not in the lengths of the branches on those trees. We use $T_j$ to denote the topology of the $j$th prespecified tree, and $T_{ML}$ for the topology of the ML tree for a given data set. Implicitly, because we focus on likelihood-based models, we assume that some model of evolution that allows us to define the probability of character-state changes can be specified before tree reconstruction. The vector of parameters for branch lengths (plus any other free parameters, e.g., in the model of nucleotide substitution being used) is written $\theta_x$ for topology $T_x$. $H_0$ and $H_A$ are, respectively, null and alternative hypotheses for statistical testing. $L_x$ is the log-likelihood of $T_x$ or $H_x$ for a given data set, generally maximized over all possible values of $\theta_x$, and $L_x(k)$ is the sitewise log-likelihood for site $k$ out of a total of $S$ sites, so $L_x = \sum_{k=1}^{S} L_x(k)$. For parametric or nonparametric bootstrapped data, $L_x^{(i)}$ is the log-likelihood of $T_x$ for the $i$th replicate data set, and $L_x^{(i)}(k)$ is the corresponding sitewise log-likelihood. We use $\delta$ to denote the difference in log-likelihoods between topologies ($\delta^{(i)}$ for the value from the $i$th replicate data set), $E[X]$ to denote the expectation of a statistic $X$, and $N(\mu, \sigma^2)$ to indicate a normal (gaussian) distribution with mean $\mu$ and variance $\sigma^2$.

Fundamentals of the KH-Test

Suppose we have two hypotheses (tree topologies) $T_1$ and $T_2$, selected a priori, and we want to test whether they are equally well supported by a data set. Intuitively, for any one data set we would expect that stochasticity and sampling would ensure $L_1 \neq L_2$, even when the null hypothesis is true. However, if we were somehow able to obtain several data sets, we would expect that "on average" $L_1 = L_2$ when the null hypothesis is true. Writing $\delta \equiv L_1 - L_2$, this intuition corresponds to $E[\delta] = 0$. In terms of a statistical test, our hypotheses are:

$$H_0: E[\delta] = 0$$
$$H_A: E[\delta] \neq 0.$$
In this section we describe a nonparametric approach to performing this comparison. This method was essentially given by Hasegawa and Kishino (1989), although they did not use the procedure for significance testing.

To perform a test of $H_0$ versus $H_A$, we need to know the distribution of $\delta$ under the null hypothesis. Working nonparametrically, we cannot derive the exact distribution of $\delta$ because $H_0$ does not specify a distribution for the data from which it is calculated. We are able, however, to implement the nonparametric bootstrap (Efron, 1982; Felsenstein 1985; Efron and Tibshirani, 1986), the use of which requires the assumption that the data are a representative and independent sample from the true distribution of data.

Table 1 summarizes all of the statistical tests discussed in this paper and explains the mnemonic naming system we have adopted. Table 2 relates this naming system to various published tests and software implementations (all described below). The procedure for the fundamental KH-test is now as follows:

Test priNPfcd:
• Calculate the test statistic $\delta \equiv L_1 - L_2$.
• Resample data (repeated nonparametric bootstrap data sets $i$).
• Reestimate any free parameters $\theta_1$, $\theta_2$ (branch lengths, and so forth) to get maximized log-likelihoods $L_1^{(i)}$ and $L_2^{(i)}$ under $T_1$ and $T_2$, respectively.
• Hence calculate bootstrap values of $\delta^{(i)} \equiv L_1^{(i)} - L_2^{(i)}$.
• For the values of $\delta^{(i)}$ to conform to $H_0$, we require $E[\delta^{(i)}] = 0$; to ensure this, replace the $\delta^{(i)}$ by $\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$, where $\bar{\delta}^{(i)}$ is the mean over bootstrap replicates $i$ of $\delta^{(i)}$. This procedure is known as "centering" (see below), and the resulting set of values $\tilde{\delta}^{(i)}$ gives an estimate of the distribution of $\delta$ under $H_0$.
• Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the distribution of the $\tilde{\delta}^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta]$, given, for example, by the 2.5% and 97.5% points of the ranked list of the $\tilde{\delta}^{(i)}$. A two-sided test is appropriate (because we have no a priori expectation as to whether $T_1$ or $T_2$ should be preferred); in this example, a 5% significance level is being used.

Hope (1968) and Marriott (1979) consider the amount of resampling that is needed for reliable statistical properties. It seems reasonable to hope that 100 data sets will give a sufficiently accurate estimate of the distribution of $\delta$ for testing at the 5% level, for both nonparametric (resampled data) and parametric (simulated data; see below) tests. More stringent tests will require more replicate data sets.

Hall and Wilson (1991) and Westfall and Young (1993:35) explain more fully the need for the centering procedure ($\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$; Table 1, level 4, option c) to ensure conformity to the null hypothesis in this and other nonparametric tests. Comparing the observed value $\delta$ with the distribution of the $\delta^{(i)}$ (as is appropriate in parametric tests; see below and Table 1, level 4, option u) would give an invalid test. A test comparing the expected value of $\delta$ (i.e., 0) with the (uncentered) distribution of the $\delta^{(i)}$ is less inappropriate but is equivalent to comparing $\bar{\delta}^{(i)}$ with the distribution of the $\tilde{\delta}^{(i)}$. This in turn is equivalent to using $\bar{\delta}^{(i)}$ as an estimate of the test statistic $\delta$, which is inefficient and without any redeeming advantage (see also Efron et al., 1996). We note that under the more powerful normal approximations discussed below (Table 1, level 5, options a and s) a test comparing 0 with the distribution of the $\delta^{(i)}$ becomes equivalent to the test that compares $\delta$ with the distribution of the $\tilde{\delta}^{(i)}$.

The first stages of this test (up to the calculation of the $\delta^{(i)}$) form the procedure at the heart of the work of Hasegawa and Kishino (1989). In that paper, however, significance testing of phylogenies is not contemplated (there is no mention of the hypothesis $E[\delta] = 0$) and instead the estimated distribution of $\delta$ is used to compute confidence intervals on posterior probabilities of different (a priori) topologies in a Bayesian analysis. The idea of using the distribution of the $\delta^{(i)}$ to perform a significance test of phylogenies based on $E[\delta] = 0$ was introduced by Kishino and Hasegawa (1989), with some methods being partially described in Hasegawa et al. (1988). To our knowledge, the above form (priNPfcd) of the KH-test has never been implemented. At the time of this test's introduction, one
TABLE 1. Components of statistical tests described in this paper. Each test is composed of one option chosen at each of the five levels. Not all combinations are valid, as indicated. Mnemonic codes are derived by concatenating the italicized letters labeling each option; the derivation of these mnemonics is indicated by underlining. See text for further details.

Level 1: choice of trees to test
  pri: trees chosen a priori
  pos: includes tree(s) selected a posteriori, from analysis of data to be used for testing

Level 2: statistical approach
  NP: nonparametric
  P: parametric

Level 3: optimization method for bootstrapped data
  f: full: all parameters estimated from data (and optimization over multiple topologies with pos option at level 1)
  p: partial: some parameters fixed at values estimated from data
  n: no optimization for bootstrapped data (gives rise to RELL methods with d or n options at level 5)

Level 4: test statistic and distribution against which it is compared
  c: centered: attained $\delta$ vs. distribution of centered nonparametric estimates $\tilde{\delta}^{(i)} \equiv \delta^{(i)} - \bar{\delta}^{(i)}$ (only with NP option at level 2)
  u: uncentered: attained $\delta$ vs. distribution of parametric estimates $\delta^{(i)}$ (only with P option at level 2)

Level 5: how test is performed or confidence intervals are generated
  d: comparison of test statistic directly with its estimated distribution
  n: assumption of normal distribution for test statistic (applicable only with pri option at level 1)
  a: additional normal assumption (variance of $\delta$ estimated from variance of sitewise $\delta(k)$; applicable only with both pri and n options at levels 1 and 3, respectively)
  s: stronger assumption of normal distribution for sitewise $\delta(k)$ (applicable only in combination with both pri and n options at levels 1 and 3, respectively)
TABLE 2. Relationships of statistical tests previously described and popular software implementations with tests described in this paper, with additional notes.

In literature/computer implementation | In notation of this paper | Notes
Kishino–Hasegawa test, fundamental concept | priNPfcd | Nonparametric test; never before published/implemented in this form (a)
Hasegawa and Kishino (1989) | priNPf•• (b) | Distribution of $\delta$ derived in a different context; no statistical test specified
Kishino and Hasegawa (1989) | priNPn•• (b) | RELL and normal approximations introduced; statistical tests only briefly discussed
PHYLIP (Felsenstein, 1995) and PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997) implementations | priNPnca | Use additional normal assumption and perform (two-sided) z-test
MOLPHY implementation (Adachi and Hasegawa, 1996) | priNPnca | Uses additional normal assumption; estimates variance of $\delta$ but performs no test
PAUP* implementation (Swofford, 1998) | priNPncs | Uses stronger normal assumption and performs (two-sided) paired t-test
Shimodaira and Hasegawa (1999) implementation of Kishino–Hasegawa test | priNPncd | Used in an example (with RELL and a one-sided test), for comparison with posNPncd
Shimodaira–Hasegawa test, fundamental concept | posNPfcd | Nonparametric test; never before implemented in this form (a)
Shimodaira and Hasegawa (1999) | posNPncd | Uses RELL; test described and used in an example
SOWH-test, fundamental concept | posPfud | Parametric test; originally described by Swofford et al. (1996)
SOWH-test, alternative implementation in this paper | posPpud | Used in an example (some approximations under $H_A$); see text for details

(a) These tests, amongst others, are implemented in this paper for the HIV-1 data set.
(b) Dots represent components of testing procedures that were not specified in the corresponding publications.
main reason for this was probably the computation time required for the step in which free parameters $\theta_1$, $\theta_2$ are reestimated for each bootstrap data set. (Nowadays this computational demand would not be problematic but this form of the test is still not used, probably because interest is typically in phylogenies, at least one of which has been chosen a posteriori; see below.) Accordingly, methods were devised to reduce the computational burden.

The KH-Test: Time-Saving Approximations

The fundamental KH-test (priNPfcd above) has the disadvantage that likelihood maximization is performed for each bootstrap replicate data set $i$. Although no maximization over topologies is required, because in this case only two a priori–specified topologies are being considered, Kishino and Hasegawa (1989) were concerned about the computation time needed. To reduce the computational burden, they developed a resampling estimated log-likelihood (RELL) technique (Table 1, level 3, option n; see also Kishino et al., 1990). In brief, they showed that instead of performing time-consuming likelihood optimizations for each bootstrap data set, one could use values of $\delta^{(i)}$ calculated with the optimized parameter values ($\hat{\theta}_1$, $\hat{\theta}_2$) from the original data set. Certain asymptotic conditions (correctly specified evolutionary models and sufficiently large amounts of data) are required for this approximation to be valid; the RELL method has been shown to perform well in the estimation of bootstrap probabilities of phylogenies (Hasegawa and Kishino, 1994; see also below for discussion of the possible effects of approximations to log-likelihood scores in LRTs). The necessary likelihood calculations now require no optimization after the initial analysis of the original data, which saves a large amount of computational effort. By using a prime symbol ($'$) to denote this form of approximation where parameters are not re-optimized for replicate data sets, the resulting test can be described as follows:

Test priNPncd:
• Calculate the test statistic $\delta \equiv L_1 - L_2$.
• Resample data (repeated bootstrap data sets $i$).
• Using the ML estimates of any free parameters $\hat{\theta}_1$, $\hat{\theta}_2$ (branch lengths and so forth) derived from the original data set, compute log-likelihoods $L_1'^{(i)}$ and $L_2'^{(i)}$ under $T_1$ and $T_2$, respectively (the resampling is effectively being made from the sitewise log-likelihoods $L_1(k)$ and $L_2(k)$ estimated under $H_0$; hence the RELL mnemonic).
• Calculate bootstrap values of $\delta'^{(i)} \equiv L_1'^{(i)} - L_2'^{(i)}$.
• As before, perform the centering procedure $\tilde{\delta}'^{(i)} \equiv \delta'^{(i)} - \bar{\delta}'^{(i)}$.
• Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the distribution of the $\tilde{\delta}'^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta]$ given by appropriate points of the ranked list of the $\tilde{\delta}'^{(i)}$ (two-sided test).

Kishino and Hasegawa (1989) further showed that the difference ($\delta$) in log-likelihoods between two topologies specified a priori would follow a normal distribution, the mean and variance of which could be specified in terms of the differences in log-likelihoods ($\delta^{(i)}$) calculated for nonparametric bootstrap data sets $i$. This approximation, based on the Central Limit Theorem, requires the same asymptotic conditions as the RELL method for its validity. An alternative to using the direct comparison of the attained $\delta$ with the distribution of the $\tilde{\delta}'^{(i)}$, as in the last step of test priNPncd, is then to utilize this normal approximation for $\delta$ (Table 1, level 5, option n):

Test priNPncn:
• Proceed as in test priNPncd above, but replace the final step with the following:
• Compute the variance of the $\tilde{\delta}'^{(i)}$ (denote this by $\nu^2$) and test whether the attained value of $\delta$ (from the original data) is a plausible sample from a $N(0, \nu^2)$ distribution, by seeing if it falls within the confidence interval for $E[\delta]$ given, for example, by $0 \pm 1.96\nu$ (two-sided test; 5% significance level in this example).

In practice, these tests have rarely been implemented, to our knowledge. More usually, an additional assumption is also made: that the variance of $\delta$ can be estimated from the variance (over sites $k = 1, 2, \ldots, S$) of the sitewise log-likelihood differences $\delta(k) \equiv$
$L_1(k) - L_2(k)$ (Table 1, level 5, option a; Kishino and Hasegawa, 1989). In this case a test can be made without any resampling, thus giving an even greater saving in time:

Test priNPnca:
• Calculate the test statistic $\delta \equiv (L_1 - L_2)$.
• Using the ML estimates of any free parameters $\hat{\theta}_1$, $\hat{\theta}_2$ (branch lengths, and so forth) derived from the original data set, compute sitewise log-likelihoods $L_1(k)$ and $L_2(k)$, under $T_1$ and $T_2$, respectively, for the sites $k$ of the original data set.
• Calculate the values $\delta(k) \equiv L_1(k) - L_2(k)$ and hence the centered values $\tilde{\delta}(k) \equiv \delta(k) - \bar{\delta}(k)$ and an estimate of their variance $\nu^2 = \sum_k (\tilde{\delta}(k))^2 / (S - 1)$ (clearly, the variances of the $\delta(k)$ and the $\tilde{\delta}(k)$ are identical). Because $\delta \equiv \sum_k \tilde{\delta}(k) + S\bar{\delta}(k)$, and $\bar{\delta} = 0$ under $H_0$, $S\nu^2$ is an estimate of the variance of $\delta$.
• Test whether the attained value of $\delta$ calculated from the original data is a plausible sample from a $N(0, S\nu^2)$ distribution, for example, by comparing it with the confidence interval $0 \pm 1.96\sqrt{S}\,\nu$ (two-sided test; 5% significance level in this example).

This is the method implemented in various programs of the PHYLIP (Felsenstein, 1995), MOLPHY (Adachi and Hasegawa, 1996), and PUZZLE (Strimmer and von Haeseler, 1996; Strimmer et al., 1997) packages. (Programs in MOLPHY compute $\sqrt{S}\,\nu$ but leave statistical testing to the user.)

In Swofford's (1998) PAUP* program, a stronger assumption is made (D. Swofford, pers. comm.): that the sitewise log-likelihood differences $\delta(k)$ are themselves normally distributed (Table 1, level 5, option s). This assumption, if accurate, guarantees the accuracy of the normal approximations described above (Table 1, level 5, options n and a) and permits a different test to be performed directly on the sitewise log-likelihoods:

Test priNPncs:
• Proceed as in test priNPnca above, but replace the final two steps with the following:
• Perform a paired t-test of the $L_1(k)$ and $L_2(k)$ (pairs $\{L_1(1), L_2(1)\}$, $\{L_1(2), L_2(2)\}$, ..., $\{L_1(S), L_2(S)\}$) to determine if the means of the $\{L_1(k)\}$ and $\{L_2(k)\}$ are equal (two-sided test).

We know of no theoretical justification for this additional assumption, and it does not give rise to any significant saving in computation time. However, we would expect that it will only make the smallest of differences in real applications (with a large number of sites $S$), and the significance levels reported by PAUP* and by the DNAML program of PHYLIP are invariably very similar (J. Felsenstein, pers. comm.; D. Swofford, pers. comm.; J.P. Anderson, unpubl.). Note that both the priNPnca and priNPncs tests require no resampling and so are even faster than the priNPncd and priNPncn tests, which use RELL methods.

Incorrect Usage of the KH-Test

Many of the arguments in the derivations of the statistical tests above are strongly dependent on the topologies $T_1$ and $T_2$ having been selected independently of any analysis of the data used for the testing. In particular, this assumption is necessary to justify the fundamental hypothesis $H_0: E[\delta] = 0$. Unfortunately, when the selection of topologies has been made with reference to the data, especially if they have been selected by a criterion linked to their likelihood scores, this expectation is no longer justified. Two particular cases in which it is not reasonable to expect $E[\delta] = 0$ are (1) the comparison of the tree found to have the maximal likelihood, $T_{ML}$, with an a priori tree, $T_1$, and (2) the comparison of $T_{ML}$ with a tree selected for having the second- (or third-, and so forth) highest likelihood. In fact, in both of these cases $E[\delta] > 0$. $T_{ML}$ is selected exactly because its likelihood for the data at hand is greater than that of any other tree: In other words, it is guaranteed that $L_{ML}$ will be at least as large as $L_1$ or the log-likelihood of any other topology.

This is not a minor discrepancy: no result suggests that $E[\delta]$ will be even near to 0 in these cases. Further, given that necessarily $E[\delta] > 0$, two-sided tests are no longer appropriate: we are interested in assessing deviations only in one "direction" from expectation, and one-sided tests are required. In our experience, however, situations such as those just described in which trees are selected a posteriori are precisely those for
which the KH-test has most often been used. Indeed, even Hasegawa and coworkers appear to omit consideration of whether $E[\delta] = 0$ is a valid assumption (Hasegawa et al., 1988:8; Kishino and Hasegawa, 1989:175), even though Kishino and Hasegawa (1989:177, equation 20 and following) recognize that $E[\delta]$ depends on how the topologies being assessed were chosen.

We believe that the results of all such analyses using the KH-test are invalid and require recomputation by methods such as those described later in this paper. We stress that this is not a minor, obscure, or purely hypothetical point, of interest only to theoretical statisticians. Although the KH-test is suitable for the questions it was designed to answer, it is entirely inappropriate for use in testing the significance of trees selected by ML from the data which are to be used for the testing. We can find only one possible adjustment applicable to the results of incorrect applications of KH-tests that may render them useful; this is discussed below.

To help illustrate why we can no longer base tests on the hypothesis $E[\delta] = 0$, we use an analogy of running races. The coach of a running squad is interested to know whether two runners, Matthew and Mark, differ significantly in their average running times for the 100 m sprint. To determine this, he times the two runners when they participate in several races. For each race, the coach calculates the difference in running times $t$ between Matthew and Mark: $\delta(\mathrm{Matthew}, \mathrm{Mark}) \equiv t(\mathrm{Matthew}) - t(\mathrm{Mark})$. Note that $\delta(\mathrm{Matthew}, \mathrm{Mark})$ can sometimes be positive and sometimes be negative, depending on who runs faster in any given race; in fact, if Matthew and Mark are equally good at the 100 m sprint, then the average value of $\delta(\mathrm{Matthew}, \mathrm{Mark})$ over many races will tend to 0. In fact, as the team statistician approvingly explains, the data the coach has collected can be used to estimate the variance of $\delta(\mathrm{Matthew}, \mathrm{Mark})$ and, consequently, to test the following hypothesis:

$$H_0: E[\delta(\mathrm{Matthew}, \mathrm{Mark})] = 0$$
$$H_A: E[\delta(\mathrm{Matthew}, \mathrm{Mark})] \neq 0.$$

This is analogous to the KH-test for phylogenies, with Matthew and Mark corresponding to two a priori topologies $T_1$ and $T_2$, each race equivalent to a sample of data, $t(\mathrm{Matthew})$ and $t(\mathrm{Mark})$ equivalent to the log-likelihoods $L_1$ and $L_2$, respectively, and $\delta(\mathrm{Matthew}, \mathrm{Mark})$ corresponding to the difference in log-likelihoods $\delta$.

The coach is also interested in another runner, Luke, whom he believes to be the fastest runner on his squad. Given his success at collecting data for the earlier hypothesis test, the coach decides to do something similar with Luke. He obtains running times for Luke, $t(\mathrm{Luke})$, over several races amongst the squad members, as well as the fastest time for each race, $t(\mathrm{fastest})$. For each race he computes the difference between these times, $\delta(\mathrm{Luke}, \mathrm{fastest}) \equiv t(\mathrm{Luke}) - t(\mathrm{fastest})$, arguing that if Luke truly is the fastest then, over many races, the average of $\delta(\mathrm{Luke}, \mathrm{fastest})$ will be zero. However, as the team statistician points out, this assumption is necessarily false. The reason for this is simple. If Luke truly is the fastest, then we may expect that in the majority of the races he participates in, his time is the fastest time, i.e., $t(\mathrm{Luke}) = t(\mathrm{fastest})$ and $\delta(\mathrm{Luke}, \mathrm{fastest}) = 0$. However, we also expect that some other squad members will manage to win some races, if only infrequently, so that $\delta(\mathrm{Luke}, \mathrm{fastest}) > 0$. Note that it is never possible that $\delta(\mathrm{Luke}, \mathrm{fastest}) < 0$, and consequently the average of $\delta(\mathrm{Luke}, \mathrm{fastest})$ over several races (the majority with $\delta(\mathrm{Luke}, \mathrm{fastest}) = 0$ and some with $\delta(\mathrm{Luke}, \mathrm{fastest}) > 0$) must necessarily be $> 0$. Variations in the squad members' performances in different races ensure that even if none is systematically faster than Luke, there is a chance that someone of equal or lower ability will appear to outperform Luke in any one race. In fact, the bigger the squad, the more likely this is, and the greater the level of outperformance one can expect. The statistical test used should reflect this fact and cannot be based on $E[\delta(\mathrm{Luke}, \mathrm{fastest})] \equiv E[t(\mathrm{Luke}) - t(\mathrm{fastest})] = 0$.

This example is analogous to the common but incorrect application of the KH-test, when $T_{ML}$ and $T_1$ are identified with the fastest runner and Luke, respectively; $L_{ML}$ and $L_1$ with $t(\mathrm{fastest})$ and $t(\mathrm{Luke})$; and $\delta$ with $\delta(\mathrm{Luke}, \mathrm{fastest})$. It would be possible, but of little interest here, to devise a correct test based on runners selected a posteriori. Instead, we will revert to discussing phylogenetic examples and will describe two tests that can be used in place of the KH-test when it is not applicable.
660 S YSTEMATIC BIOLOGY VOL. 49

The Shimodaira–Hasegawa Test: A Corrected Nonparametric Test of Topologies

Although it has been noted in the past that the KH-test is suitable only for cases where $E[\delta] = 0$ (Swofford et al., 1996), it appears that Shimodaira and Hasegawa (1999) are the first to publish a full explanation of why the test is not appropriate when one or more topologies under test were selected with reference to the same data being used for testing. Our arguments in the preceding section are essentially an extended version of those of Shimodaira and Hasegawa (1999). Based on earlier work by Shimodaira (1993, 1998), Shimodaira and Hasegawa (1999) have proposed a nonparametric test similar to the KH-test but making the appropriate allowance for the method by which topologies are usually selected for statistical comparison. The Shimodaira–Hasegawa test (SH-test) simultaneously compares all topologies in some set $M$ and makes appropriate allowance for these multiple comparisons. It is necessary that $M$ contains every topology that can possibly be entertained as the true topology, to ensure that the true topology is always "available" to be the ML topology for any bootstrap data set; if this condition is not met, the significance levels computed will be inaccurate (Westfall and Young, 1993:48). In addition, selection of topologies for the set $M$ should be made a priori and not with reference to the observed data; otherwise, significance levels will again be inaccurate. Choosing $M$ to be the set of all possible topologies is always safe, if conservative.

We cannot write the null hypothesis as $E[\delta] = 0$ for this test; instead, the hypotheses tested are as follows:

$H_0$: all $T_x \in M$ (including $T_{ML}$, the ML tree) are equally good explanations of the data
$H_A$: some or all $T_x \in M$ are not equally good explanations of the data

and the test may proceed as follows:

Test posNPfcd:
• Calculate a test statistic $\delta_x$ for each topology $T_x \in M$: $\delta_x$ is the attained value of $L_{ML} - L_x$.
• Generate nonparametric bootstrap replicate data sets $i$ and for each one maximize likelihoods over parameters $\theta_x$ for each permitted topology $T_x$, giving optimal log-likelihood values $L_x^{(i)}$.
• For each topology $T_x$, form the adjusted log-likelihood $\tilde{L}_x^{(i)} \equiv L_x^{(i)} - \bar{L}_x^{(i)}$ by subtracting $\bar{L}_x^{(i)}$, the mean over replicates $i$ of $L_x^{(i)}$, from each value of $L_x^{(i)}$. This is the centering method devised by Shimodaira (1998), which is appropriate for enforcing that the resampled data conform to $H_0$ for this a posteriori test.
• For each replicate $i$, find $\tilde{L}_{ML}^{(i)}$, the maximum over topologies $T_x$ of the adjusted log-likelihoods $\tilde{L}_x^{(i)}$, and form bootstrap replicate statistics $\delta_x^{(i)} \equiv \tilde{L}_{ML}^{(i)} - \tilde{L}_x^{(i)}$; this allows for the a posteriori selection of $T_{ML}$.
• For each topology $T_x$, test whether the attained $\delta_x$ is a plausible sample from the distribution (over replicates $i$) of the $\delta_x^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta_x]$ given, for example, by the interval between 0 and the 95% point of the ranked list of the $\delta_x^{(i)}$. Such a one-sided test is appropriate, because we know that only $\tilde{L}_{ML}^{(i)} \geq \tilde{L}_x^{(i)}$ is possible; in this example, a 5% significance level is being used.

(We have used some notation different from that of Shimodaira and Hasegawa [1999], to maintain a consistent style within this paper. Shimodaira and Hasegawa [1999] use $T_a$, $a = 1, 2, \ldots, M$; $\tilde{L}_{a \cdot i}$; $\tilde{R}_{a \cdot i}$; and $\tilde{S}_{a \cdot i}$, where we have used $T_x \in M$; $L_x^{(i)}$; $\tilde{L}_x^{(i)}$; and $\delta_x^{(i)}$, respectively.)

Time-saving approximations are also possible with this test. Shimodaira and Hasegawa (1999) propose the use of the RELL method for finding approximate values of $L_x^{(i)}$ without having to re-optimize $\theta_x$ for each replicate data set. This test, implemented by Shimodaira and Hasegawa (1999) and Buckley et al. (unpubl.), can be described as follows:

Test posNPncd:
• Calculate a test statistic $\delta_x$ for each topology $T_x \in M$: $\delta_x$ is the attained value of $L_{ML} - L_x$.
• Generate nonparametric bootstrap replicate data sets $i$; for each one, and for each tree $T_x$, use the ML estimates $\hat{\theta}_x$ of any free parameters derived for each tree

$T_x$ from the original data set to compute log-likelihoods $L_x'^{(i)}$, which approximate the optimized values $L_x^{(i)}$ in test posNPfcd above.
• For each topology $T_x$, form the adjusted log-likelihood $\tilde{L}_x'^{(i)} \equiv L_x'^{(i)} - \bar{L}_x'^{(i)}$ (centering).
• For each replicate $i$, find $\tilde{L}_{ML}'^{(i)}$, the maximum over topologies $T_x$ of the adjusted log-likelihoods $\tilde{L}_x'^{(i)}$, and form bootstrap replicate statistics $\delta_x'^{(i)} \equiv \tilde{L}_{ML}'^{(i)} - \tilde{L}_x'^{(i)}$.
• For each topology $T_x$, test whether the attained $\delta_x$ is a plausible sample from the distribution (over replicates $i$) of the $\delta_x'^{(i)}$ by seeing if it falls within the confidence interval for $E[\delta_x]$ given, for example, by 0 and the 95% point of the ranked list of the $\delta_x'^{(i)}$ (one-sided test; 5% significance level used in this example).

Note that the SH-test simultaneously assesses the significance level for each of the topologies $T_x \in M$. It immediately reduces to a version of the KH-test, modified for the comparison of a priori–selected topology $T_1$ and a posteriori–selected topology $T_{ML}$, when attention is restricted to the significance level computed for $\delta_1$ from the distribution of the $\delta_1^{(i)}$ or $\delta_1'^{(i)}$. Note, however, that the set $M$ of all plausible topologies still has to be considered to compute this distribution.

The effect of the new centering procedure introduced in this method is to decrease the significance accorded to the difference $\delta_x$ in log-likelihoods between each topology $T_x$ and the ML topology $T_{ML}$ (Shimodaira and Hasegawa, 1999), in comparison with the significance indicated by the corresponding but inappropriate KH-test. Intuitively, this is because the attained value of $\delta_x$ should be attributed to two components: one (necessarily positive) being a consequence of the selection of $T_{ML}$ precisely because it has the highest likelihood, and another (of unknown sign) attributable to the difference in the abilities of $T_x$ and $T_{ML}$ to explain the observed data. Whereas the SH-test correctly compares $T_x$ and $T_{ML}$ on the basis of the second component alone, making an appropriate allowance for the first component, the incorrectly applied KH-test assesses both components combined as though they were only the second component. The fact that the first component is necessarily > 0 acts to make the new test more conservative (i.e., less likely to reject the null hypothesis). However, the SH-test correctly uses a one-sided test, and this acts to increase the significance of results.

Is It Possible to Salvage the Results of Incorrect Applications of the KH-Test?

We can find only one possible adjustment that might render some previously published results from incorrect applications of the KH-test useful in the light of the theoretical advances described in this paper. It is straightforward to convert the significance level of a two-sided test to that of a one-sided test: the P-value should simply be halved. If the P-value obtained from an incorrectly applied KH-test is $p$, then the P-value that would be obtained in the SH-test is necessarily $\geq p/2$. Therefore, if the adjusted value $p/2$ is large enough to indicate no rejection of the null hypothesis (a priori tree $T_1$), e.g., $p/2 > 0.05$ for a 5% significance level, we can be certain that using the SH-test would give the same conclusion.

However, in all other cases (where the adjusted value $p/2$ is sufficiently small to indicate rejection of the null hypothesis in favor of the ML tree, e.g., $p/2 < 0.05$ for a 5% significance level) we cannot assume that this result would hold under an SH-test, which must give a P-value that would exceed $p/2$ by an unknown amount. Note that this will necessarily be the case whenever $p$ indicated rejection of the null hypothesis in the incorrectly applied KH-test (e.g., $p < 0.05$ implies $p/2 < 0.025$), and in some instances when $p$ did not indicate rejection of the null hypothesis (e.g., $0.05 < p < 0.1$ implies $0.025 < p/2 < 0.05$).

In summary, if the P-value obtained from an incorrectly applied KH-test is greater than twice the value required to indicate no rejection of the null hypothesis, then that conclusion would hold under the SH-test. If the P-value is less than this, we cannot determine from the KH-test result what the result would be for any test making proper allowance for a posteriori selection of hypotheses, and a full reanalysis of the original data by appropriate tests is necessary.

The SOWH-Test: A Correct Parametric Test of Topologies

Parametric statistical testing of hypotheses is becoming increasingly popular in

phylogenetics, based on the same models of sequence evolution being used for making phylogenetic inferences. These tests are generally based on parametric bootstrapping techniques (also known as Monte Carlo simulation), the theory of which is described in more detail by Goldman (1993), Huelsenbeck and Crandall (1997), and Huelsenbeck and Rannala (1997). Whereas in nonparametric bootstrap methods, replicate data sets are created by resampling with replacement from the original data set, in parametric bootstrapping, replicate data sets that conform precisely to the assumptions of a parametric null hypothesis are created by simulating data through the use of those assumptions in conjunction with parameter values estimated under the null hypothesis from the original data set (Goldman, 1993). Subsequent analysis of these data sets by the same methods as used for the original data gives replicate values of any required statistic. These are guaranteed to be drawn from the distribution induced by the null hypothesis, and their distribution therefore is a parametric estimate of the null hypothesis distribution of that statistic.

Parametric tests offer the possibility of more power than nonparametric tests, because the knowledge they can use of the form of the distribution giving rise to the data is unavailable to nonparametric tests. The cost of this power is an increased reliance on the models they assume (which could be inaccurate and thus lead to misinterpretation of the data). Typically, LRTs in phylogenetics (including those performed by using the parametric bootstrap) have been found to be powerful and are quite robust to deviations from the assumed model so long as a reasonable effort is made to use a model sufficiently complex to encompass the major features of the distribution of the data (e.g., Goldman, 1993; Yang et al., 1994, 1995; Hillis et al., 1996; Huelsenbeck and Crandall, 1997; Huelsenbeck and Rannala, 1997; Cunningham et al., 1998; Zhang, 1999).

It is possible to create a parametric bootstrap LRT to assess whether an a priori selected topology $T_1$ is supported by a sequence data set or should be rejected in favor of another topology. The following test is a straightforward application of the parametric bootstrap, yet it appears to be little-known or -used. The only previous (and brief) description we know of is due to Swofford et al. (1996), and the only published application is by Hillis et al. (1996). We refer to it as the SOWH-test, after the authors who originally described it. The hypotheses compared are these:

$H_0$: $T_1$ is the true topology
$H_A$: some other topology is true

and the parametric bootstrap statistical test of these hypotheses is as follows:

Test posPfud:
• Calculate the test statistic $\delta \equiv L_{ML} - L_1$.
• Simulate data sets $i$ by parametric bootstrapping, based on the null hypothesis topology $T_1$ and the ML estimates of any free parameters, $\hat{\theta}_1$, derived for $T_1$ from the original data set.
• Use $T_1$ and reestimate free parameters $\theta_1$ to get maximized log-likelihoods $L_1^{(i)}$ under $H_0$.
• Maximize likelihood over all topologies $T_x$ and their respective parameters $\theta_x$ to get log-likelihoods $L_{ML}^{(i)}$.
• Calculate values of $\delta^{(i)} \equiv L_{ML}^{(i)} - L_1^{(i)}$, the set of these giving an estimate of the distribution (under $H_0$) of $\delta$.
• Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the estimated distribution of $\delta$ given by the set of the $\delta^{(i)}$ by seeing if it falls below the 95% point (for example) of the ranked list of the $\delta^{(i)}$. Such a one-sided test is appropriate because we know that $\delta$ must be > 0; in this example, a 5% significance level is being used.

Notice that the test statistic $\delta$ is the same as in the KH- and SH-tests. The use of $T_{ML}$ in the test means that the assumption $E[\delta] = 0$ cannot be made, but the use of parametric bootstrapping to generate data conforming to the null hypothesis means that this presents no problem: the repeated analysis of parametric bootstrap data sets guarantees the appropriate statistical properties. (Indeed, $E[\delta]$ may be estimated by $\bar{\delta}^{(i)}$, the mean of the $\delta^{(i)}$.) The fact that the data necessarily conform to the null hypothesis is also the reason that no centering procedure is necessary for this test (Table 1, level 4, option u).

This test has a substantial time penalty, however, caused by the need to repeatedly

maximize likelihoods over topologies under the hypothesis $H_A$. The same penalty exists in the basic SH-test (posNPfcd) above but is avoided by using the RELL method (test posNPncd above). Although we do not have theoretical results to justify applying all the most useful approximations described above for nonparametric tests, we have several suggestions for possible ways to reduce the computational burden of the above parametric test.

The first suggestion is to use RELL-like methods applied only to the a priori–specified null hypothesis topology $T_1$ (Table 1, level 3, option p):

Test posPpud (approximation under $H_0$):
• Calculate the test statistic $\delta \equiv L_{ML} - L_1$.
• Simulate data sets $i$ by parametric bootstrapping, based on the null hypothesis topology $T_1$ and the ML estimates of any free parameters, $\hat{\theta}_1$, derived for $T_1$ from the original data set.
• Use $T_1$ and the ML estimates of parameters $\hat{\theta}_1$ to get log-likelihoods $L_1'^{(i)}$ under $H_0$.
• Maximize likelihood over all topologies and their respective $\theta_x$ to get maximized log-likelihoods $L_{ML}^{(i)}$ under $H_A$.
• Calculate values of $\delta'^{(i)} \equiv L_{ML}^{(i)} - L_1'^{(i)}$.
• Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the estimated distribution of $\delta$ given by the set of the $\delta'^{(i)}$ by seeing if it falls below the 95% point of the ranked list of the $\delta'^{(i)}$ (one-sided test; 5% significance level used in this example).

This saves on the maximization of parameters under the fixed topology $T_1$ and results in a small saving in computation time. It does not address the more difficult problem of repeated maximizations over topologies in the alternative hypothesis. As with other tests described above, the approximation under $H_0$ can be trusted and the significance levels taken at face value. Alternatively, note that necessarily $L_1'^{(i)} \leq L_1^{(i)}$ and so $\delta'^{(i)} \geq \delta^{(i)}$. Therefore, if this test rejects $H_0$ (the attained $\delta$ is too big), then this result is certain (the approximation cannot have changed the result). But if the test does not reject $H_0$ (the attained $\delta$ is sufficiently small), we do not know whether it would have been rejected if the exact $\delta^{(i)}$ had been used in place of the $\delta'^{(i)}$, and the test does not give us a definitive result.

A similar approach applied to the alternative hypothesis will be less effective (e.g., for a test denoted posPnud). Although it provides a much greater scope for saving time under $H_A$, because searches are performed over topologies, using fixed values of $T_{ML}$ and $\hat{\theta}_{ML}$ from the ML analysis (over all trees) of the original data to assess the replicate data set is not sensible. The original $T_{ML}$ and corresponding $\hat{\theta}_{ML}$ will probably be far from the optimal values for replicate data sets (which were simulated using the original $T_1$ and $\hat{\theta}_1$), so the difference between $L_{ML}^{(i)}$ and its possible approximation $L_{ML}'^{(i)}$ may be large.

However, ML estimates of some parameters of nucleotide substitution models are known to be quite stable over different topologies (e.g., Yang et al., 1994, 1998; Sullivan et al., 1996; Yang, 1997). Examples of such parameters are base frequencies ($\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$), the transition/transversion rate ratio $\kappa$, and the shape parameter $\alpha$ of the gamma distribution widely used to model among-sites rate variation. Therefore, we think it reasonable to use fixed values of these parameters estimated under $H_0$ from each bootstrap data set $i$ (i.e., the components of $\hat{\theta}_1^{(i)}$ that are not the lengths of branches of $T_1$ and are thus common to all topologies $T_x$) when assessing that data set under $H_A$. This gives the following test:

Test posPpud (approximation under $H_A$):
• Calculate the test statistic $\delta \equiv L_{ML} - L_1$.
• Simulate data sets $i$ by parametric bootstrapping, based on the null hypothesis topology $T_1$ and the ML estimates of any free parameters, $\hat{\theta}_1$, derived for $T_1$ from the original data set.
• Use $T_1$ and reestimate free parameters $\theta_1$ to get maximized log-likelihoods $L_1^{(i)}$ under $H_0$ (and respective optimal values of $\hat{\theta}_1^{(i)}$).
• Maximize likelihood over all topologies $T_x$ to get log-likelihoods $L_{ML}'^{(i)}$ under $H_A$: these maximizations all fix the values of substitution process parameters to be equal to $\hat{\theta}_1^{(i)}$, but the maximization is performed over topologies $T_x$ and their respective branch length parameters.

• Calculate values of $\delta'^{(i)} \equiv L_{ML}'^{(i)} - L_1^{(i)}$.
• Test whether the attained value of $\delta$ (from the original data) is a plausible sample from the estimated distribution of $\delta$ given by the set of the $\delta'^{(i)}$ by seeing if it falls below the 95% point of the ranked list of the $\delta'^{(i)}$ (one-sided test; 5% significance level used in this example).

(Note that the two preceding tests both receive the mnemonic posPpud because they vary only in the form of the approximation used in their likelihood maximizations [Table 1, level 3]. A more complex naming system that would assign different mnemonics to these tests seems unwarranted in this paper.) Now, some substantial saving of time is made as the substitution process parameter values are fixed during the likelihood optimizations under $H_A$. The greater problem of optimizing over topologies is still not addressed. For this test, we know that necessarily $L_{ML}'^{(i)} \leq L_{ML}^{(i)}$ and so $\delta'^{(i)} \leq \delta^{(i)}$. Therefore, if this test fails to reject $H_0$ (attained $\delta$ not excessively large relative to the null hypothesis distribution of the $\delta'^{(i)}$), then this result is certain. If this test does reject $H_0$, this could in principle be a consequence solely of the approximation. However, we expect the effect of this approximation to be small; in the example given below, it is insignificant and has no bearing on the conclusions reached.

If approximations are made under both hypotheses, for example, by some combination of the two posPpud tests above (and as in tests priNPncd, priNPncn, priNPnca, priNPncs, and posNPncd above), it is no longer possible to make general statements about the direction of the bias that they produce in the $\delta^{(i)}$. The precise effects of the combination of such approximations in a posteriori parametric tests require further investigation. Approximations based on assumptions of a normal distribution for $\delta$ (Table 1, level 5, options n, a, and s) seem unlikely to be useful in tests designed for hypotheses chosen a posteriori, given that the necessary condition of $\delta \geq 0$ indicates a truncation of the distribution of $\delta$, which precludes normality.

Other SOWH-Like Tests

It is also straightforward to devise a parametric bootstrap test of the following hypotheses, which are akin to those of the fundamental KH-test for a priori–specified trees $T_1$ and $T_2$:

$H_0$: $T_1$ is the true topology
$H_A$: $T_2$ is the true topology

This fundamental version of such a test would be denoted priPfud (see Table 1). Denoting the topology with second highest likelihood by $T_{ML2}$, we could also devise a parametric bootstrap test of the hypotheses:

$H_0$: $T_{ML2}$ is the true topology
$H_A$: $T_{ML}$ is the true topology

This test would be based on modifications of posPfud by using the test statistic $\delta \equiv L_{ML} - L_{ML2}$, data simulated by using $T_{ML2}$ and $\hat{\theta}_{ML2}$; ML analysis of simulated data sets to find the distinct topologies $T_{ML}^{(i)}$ and $T_{ML2}^{(i)}$ which give the greatest and second-greatest likelihoods, respectively; and $\delta^{(i)} \equiv L_{ML}^{(i)} - L_{ML2}^{(i)}$. Having introduced the general principles of such tests, we will not go into further details here. We also draw readers' attention to the related parametric bootstrap test of monophyly described by Huelsenbeck et al. (1996), which compares partially constrained topologies (chosen a priori) with the ML topology (chosen a posteriori).

EXAMPLES

HIV-1 Subtypes A, B, D, and E gag and pol Nucleotide Sequences

Six homologous sequences, each consisting of 2,000 base pairs (bp) from the gag and pol genes, were selected from isolates of HIV-1 subtypes A (two sequences, A1 and A2), B (one sequence), D (one sequence), and E (two sequences, E1 and E2). The sequences were easily aligned by eye. The conventional phylogeny for these subtypes would group the two subtype A sequences and also the two subtype E sequences, that is, $T_1$ = ((A1, A2), (B, D), (E1, E2)), for which the optimal log-likelihood is $L_1 = -5{,}073.75$. For our sequences, however, the ML phylogeny indicated that the subtype A sequences did not cluster together; that is, $T_{ML}$ = (A1, (B, D), (A2, (E1, E2))) with $L_{ML} = -5{,}069.85$. In this example, all ML calculations were performed with the general time reversible model of nucleotide substitution, using a

TABLE 3. Results of statistical tests of topologies for HIV-1 gag and pol gene nucleotide data set.

Test code   Notes                                                                                                            P-value (a)
priNPfcd    KH-test (incorrect application); full optimization; direct estimation of P-value                                0.38 (0.19)
priNPfcn    KH-test (incorrect application); full optimization; normal approximation for distribution of $\delta$           0.41 (0.20)
priNPncs    KH-test (incorrect application); RELL approximation; stronger normal approximation for distribution of $\delta^{(k)}$   0.38 (0.19)
posNPfcd    SH-test; full optimization; direct estimation of P-value                                                         0.26
posPfud     SOWH-test; full optimization; direct estimation of P-value                                                       0.002
posPpud     SOWH-test; partial optimization under $H_A$; direct estimation of P-value                                        0.002

(a) First value is from a two-sided test, as widely used to date; second value, when present, is for the more appropriate one-sided test.
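The paired KH-test values in Table 3 follow the halving rule described earlier for converting a two-sided P-value to a one-sided one, and the same rule decides whether an incorrectly applied KH-test result can be salvaged. The sketch below illustrates that rule only; the function name and the wording of the verdict strings are our own, and the 5% level is the example level used in the text.

```python
def salvage_kh_result(p_two_sided, alpha=0.05):
    """Return (p/2, verdict) for a two-sided KH-test P-value obtained a posteriori."""
    # Halving gives the one-sided value, a lower bound on the SH-test P-value.
    p_one_sided = p_two_sided / 2.0
    if p_one_sided > alpha:
        # The SH-test P-value can only be larger, so non-rejection is preserved.
        verdict = "no rejection; SH-test must agree"
    else:
        # The SH-test P-value exceeds p/2 by an unknown amount.
        verdict = "undetermined; reanalysis needed"
    return p_one_sided, verdict


# Two-sided KH-test values from Table 3.
for code, p in [("priNPfcd", 0.38), ("priNPfcn", 0.41), ("priNPncs", 0.38)]:
    print(code, salvage_kh_result(p))
```

For all three KH-test variants in Table 3 the halved value is well above 0.05, which is consistent with the SH-test value of 0.26 reported in the same table.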

gamma distribution to model rate heterogeneity among sites (REV+$\Gamma$; Yang, 1994, 1996, 1997); the parameters $\theta_x$ for topology $T_x$ are branch lengths, base frequencies, parameters describing the relative rates of substitution between each nucleotide pair, and the shape parameter $\alpha$ of the gamma distribution.

We illustrate some of the statistical tests described above by investigating whether or not the data provide significant evidence that $T_{ML}$ is to be preferred over $T_1$. For all the tests we performed, the test statistic $\delta = L_{ML} - L_1 = -5{,}069.85 - (-5{,}073.75) = 3.90$. Because $T_{ML}$ has been selected for testing a posteriori, that is, as a consequence of having the highest likelihood, the KH-test is inappropriate (but was performed for comparative purposes), and the SH- or SOWH-tests should be used. These tests were performed as described above, with 1,000 replicates used whenever parametric or nonparametric bootstraps were performed. The results are summarized in Table 3.

We performed three versions of the KH-test, two using full likelihood optimizations and computing the tests' P-values either directly (test priNPfcd) or by assuming a normal distribution for $\delta$ (test priNPfcn), and one using the strongest assumption of normality of the sitewise $\delta^{(k)}$ (test priNPncs). In all cases, both a two-sided test and a one-sided test were performed. The two-sided test is inappropriate for this a posteriori test (as indeed is the entire KH-test) but represents the computation performed in the most widely available implementations of the KH-test (PHYLIP, Felsenstein, 1995; PUZZLE, Strimmer and von Haeseler, 1996; Strimmer et al., 1997; and PAUP*, Swofford, 1998). The one-sided test is more suitable and, as described above, at least has the possibility of giving a statistically interpretable result. Indeed, that is the case in this example: the one-sided P-values of $\approx 0.2$ indicate no rejection of the null hypothesis, and as explained above, this conclusion must necessarily be maintained by the correction inherent in the SH-test. We also note the good agreement between the P-values calculated by the three variants of the KH-test.

Our application of the SH-test used full likelihood optimizations (test posNPfcd) and permitted the consideration of three topologies as possibly true: $T_1$, $T_{ML}$, and the topology (A2, (B, D), (A1, (E1, E2))). We report only the significance level for the test of $T_1$ against $T_{ML}$. This test, with its allowance for the a posteriori selection of one topology, must give a higher P-value (i.e., a less significant result), and this is confirmed in Table 3. There seems no way to draw any general conclusions about the size of the difference in the P-values (0.26 vs. 0.19–0.20 for the KH-tests). Figure 1 shows the distribution of the 1,000 replicate values $\delta_1^{(i)}$ against which the attained value $\delta_1 = 3.90$ is compared. We conclude that the SH-test indicates no significant difference between $T_1$ and $T_{ML}$; therefore, we do not reject $T_1$ in favor of $T_{ML}$ for these data. We also note from Figure 1 that the minimum value of $\delta_1$ that would indicate rejection of the null hypothesis ($T_1$) in this example is $\approx 8.8$ (the value of $\delta$ for which the SH-test distribution reaches a cumulative frequency of 0.95).

The results of the SOWH-test are very different. We performed two versions of this test, one using full likelihood optimizations

FIGURE 1. Test distributions for SH-test (nonparametric bootstrap) and SOWH-tests (parametric bootstrap) of topologies for HIV-1 gag and pol gene nucleotide data set. The histogram (right-hand y-axis; note the break used on this scale) shows the distribution over 1,000 replicates $i$ of the $\delta_1^{(i)}$ (SH-test, code posNPfcd; wide dark-gray bars), $\delta^{(i)}$ (SOWH-test, code posPfud; narrow white bars), and $\delta'^{(i)}$ (SOWH-test, code posPpud; light-gray bars). The curves (left-hand y-axis) show the cumulative frequency distributions of the $\delta_1^{(i)}$ (SH-test; solid line) and $\delta^{(i)}$ (SOWH-test posPfud; dashed line). The cumulative frequency distribution of the $\delta'^{(i)}$ (SOWH-test posPpud) is indistinguishable from that of the $\delta^{(i)}$. The points at which the horizontal line (dashed gray) at a cumulative frequency of 0.95 crosses these curves indicate the values of $\delta$ that must be exceeded for a significant result at the 5% level. Given the attained value of $\delta = 3.90$, the SH-test is not significant and does not reject $T_1$, but the SOWH-tests are highly significant and reject $T_1$ in favor of $T_{ML}$ (see text for further details).
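Reading Figure 1 amounts to two operations on a ranked list of replicate statistics: locating the 95% point, and counting how many replicates equal or exceed the attained value. The following sketch shows both under stated assumptions: the replicate values are an evenly spaced hypothetical list, not the HIV-1 results, and the rank convention (the smallest value with at least 95% of replicates at or below it) is one reasonable choice rather than the one necessarily used for the figure.

```python
def upper_point(replicates, percent=95):
    """Return the value at the given percent point of the ranked replicate list."""
    ranked = sorted(replicates)
    # Integer arithmetic computes ceil(n * percent / 100) without rounding error.
    rank = (len(ranked) * percent + 99) // 100
    return ranked[rank - 1]


def one_sided_p_value(attained, replicates):
    """Proportion of replicate statistics that equal or exceed the attained value."""
    return sum(1 for d in replicates if d >= attained) / len(replicates)


# Hypothetical replicate statistics for illustration: 0.0, 0.1, ..., 9.9.
replicates = [k / 10 for k in range(100)]
critical = upper_point(replicates)
p_value = one_sided_p_value(3.90, replicates)
print(critical, p_value, 3.90 > critical)
```

The null hypothesis is rejected at the 5% level only when the attained statistic exceeds the 95% point, which is the comparison made for each curve in Figure 1.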

(test posPfud) and one using the approximation described above as test posPpud (approximation under $H_A$). The differences between these two tests were negligible in this example, and both indicated a P-value of 0.002 (Fig. 1; Table 3). For these data, this test strongly rejects topology $T_1$ in favor of $T_{ML}$. As Figure 1 shows, any observed value of $\delta$ exceeding $\approx 1.2$ would have resulted in rejection of $T_1$ at the 5% significance level. The attained value is 7.7 standard deviations above the mean of the SOWH-test statistic distribution.

One explanation of the difference between the significance levels for the SH-test and the SOWH-test is the different forms of their null hypotheses. In this example the SH-test considers whether the three competing topologies are equally good explanations of the data, whereas the SOWH-test considers whether other topologies are better than the single topology $T_1$. As a consequence, the SH-test is permitting more a priori possible topologies in its null hypothesis, which will generally lead to more conservative results, an effect of the allowances made for the multiple statistical comparisons being made between the ML and all other permitted hypotheses.

Another factor affecting the significance levels of the different tests may simply be the increase in power expected for a parametric test over a nonparametric test. We also recall the reliance of parametric tests on the models that they assume. Although we may hope for robustness of tests to inaccuracy of models, this has generally been left untested in phylogenetics. To examine whether the REV+$\Gamma$ model fits the data well in the present example, we performed ML analyses under a variety of nucleotide substitution models to compare those models (Goldman, 1993; Yang et al., 1994). The results of these analyses are shown in Table 4. It is immediately evident that the REV+$\Gamma$ model fits these data significantly better than any of the other models considered, in agreement with the good performance of both the REV and $\Gamma$ components reported over a variety of data sets (see Yang, 1994, 1996; Arvestad and Bruno, 1997; and references therein). We conclude that in this example all reasonable steps have been taken to exclude any effects on the SOWH-test attributable to model inaccuracy.
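The model comparison behind Table 4 can be sketched as a likelihood ratio test between nested models. The example below uses the log-likelihood deficits and parameter counts reported in Table 4 for the +$\Gamma$ models, with twice the log-likelihood difference referred to a $\chi^2$ distribution. It is an illustration only: the dictionary keys and function names are ours, and the tail probability uses a closed form that holds for even degrees of freedom, so only nested pairs with an even parameter difference are shown.

```python
import math

# Log-likelihood deficit relative to REV+Gamma, and number of free parameters,
# as reported in Table 4 for the +Gamma models.
TABLE4 = {
    "JC69+G": (349.93, 1),
    "K80+G": (131.32, 2),
    "F81+G": (231.43, 4),
    "HKY85+G": (12.79, 5),
    "REV+G": (0.0, 9),
}


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2.0
    term, total = 1.0, 1.0
    # exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def lrt(simple, general):
    """Return (statistic, df, P-value) for a nested pair of +Gamma models."""
    deficit_s, k_s = TABLE4[simple]
    deficit_g, k_g = TABLE4[general]
    stat = 2.0 * (deficit_s - deficit_g)  # twice the log-likelihood difference
    df = k_g - k_s
    return stat, df, chi2_sf_even_df(stat, df)


print(lrt("HKY85+G", "REV+G"))
print(lrt("JC69+G", "HKY85+G"))
```

In both comparisons the simpler model is rejected decisively, in line with the statement that REV+$\Gamma$ fits these data significantly better than the other models considered.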

TABLE 4. Maximum likelihood scores for HIV-1 gag and pol gene data set under various models of nucleotide substitution. Models JC69 (Jukes and Cantor, 1969), K80 (Kimura, 1980), F81 (Felsenstein, 1981), HKY85 (Hasegawa et al., 1985), and REV (Yang, 1994) were implemented, each without (no $\Gamma$) and with (+$\Gamma$) a gamma distribution to model rate heterogeneity amongst sites (Yang, 1996, 1997). All calculations were performed with the topology referred to in the text as $T_{ML}$. Numbers given represent the log-likelihood value by which each model is worse than the best value, attained under the REV+$\Gamma$ model, $-5{,}069.85$. Also shown, in parentheses, are the numbers of free parameters in each substitution model. Pairs of nested models can be compared by using a test statistic that is twice the log-likelihood difference between those models, assessed with either a $\chi^2$ distribution (models compared are both of the no '$\Gamma$' form or both of the + '$\Gamma$' form) or a $\bar{\chi}^2$ distribution (exactly one of the models compared is of the + '$\Gamma$' form); degrees of freedom are given by the difference in the numbers of free parameters. For full details of these tests see, for example, Yang et al. (1994) and Goldman and Whelan (2000).

Substitution model   No $\Gamma$    +$\Gamma$
JC69                 395.08 (0)     349.93 (1)
K80                  190.28 (1)     131.32 (2)
F81                  280.19 (3)     231.43 (4)
HKY85                81.29 (4)      12.79 (5)
REV                  65.09 (8)      0 (9)

Mammalian Mitochondrial Protein Amino Acid Sequences

Shimodaira and Hasegawa (1999) illustrated the SH-test with a data set consisting of aligned mt protein sequences, each comprising 3,414 aa, from six mammals: human, harbor seal, cow, rabbit, mouse, and opossum. The grouping (harbor seal, cow) was assumed to be true, which left 15 candidate topologies to be evaluated. Shimodaira and Hasegawa (1999) applied the SH-test to this data set, comparing all 15 candidate topologies simultaneously, and concluded that seven topologies could not be rejected. To illustrate the SOWH-test, we selected (a priori) the topology $T_1$ = ((human, ((harbor seal, cow), rabbit)), mouse, opossum), called topology a = 2 by Shimodaira and Hasegawa (1999), to test against the ML topology, which for these data is $T_{ML}$ = (((human, (harbor seal, cow)), rabbit), mouse, opossum), topology a = 1 of Shimodaira and Hasegawa (1999). We compare our results from the SOWH-test with Shimodaira and Hasegawa's (1999) results for analogous comparisons from KH- and SH-tests. In this example, as in Shimodaira and Hasegawa (1999), all ML calculations were performed using a model of mammalian mt aa replacement described by Yang et al. (1998), with aa frequencies estimated from the data set being analyzed and using a gamma distribution to model rate heterogeneity amongst sites (mtmam+F+$\Gamma$; see also Yang, 1997). For this model, the optimal log-likelihoods for these topologies were $L_1 = -21{,}727.26$ and $L_{ML} = -21{,}724.60$; therefore, the test statistic for all the tests of topologies considered below was $\delta = L_{ML} - L_1 = -21{,}724.60 - (-21{,}727.26) = 2.66$. Table 5 summarizes the results of the KH-, SH-, and SOWH-tests for these data. The KH- and SH-test results are taken from Shimodaira and Hasegawa (1999) and were calculated by using RELL approximations and estimating the tests' P-values directly (without a normal approximation). The SOWH-test was performed by using full likelihood optimizations.

The P-value obtained from a one-sided comparison in the KH-test, as given by Shimodaira and Hasegawa (1999), was 0.36, which suggests that $T_1$ cannot be rejected in favor of $T_{ML}$. As explained above, this conclusion must be maintained by the SH-test and we see from Table 5 that this is so (P = 0.81). Notice that the difference between the P-values for the (one-sided) KH- and SH-tests (0.36 and 0.81, respectively) is considerably greater than in the HIV-1 example ($\approx 0.20$ and 0.26, respectively).

The SOWH-test again gives very different results. The P-value from this test, for 1,000 replicate parametric bootstrap data sets, is estimated to be < 0.001; in other words, in none of the 1,000 replicates $i$ did the value

TABLE 5. Results of statistical tests of topologies for mammalian mitochondrial protein amino acid data set.

Test code Notes P-value


priNPncd KH-test (incorrect application); RELL approximation; direct estimation of P-value 0.36 a
posNPncd SH-test; RELL approximation; direct estimation of P-value 0.81 a
posPfud SOWH-test; full optimization; direct estimation of P-value < 0.001
a From Shimodaira and Hasegawa (1999).

of $\delta^{(i)}$ equal or exceed the value $\delta$ = 2.66 observed for the real data. (The attained value $\delta$ = 2.66 lies 27.8 standard deviations above the mean of the $\delta^{(i)}$.) Thus, topology T1 is very strongly rejected in favor of TML.

As with the HIV-1 example above, there is no obvious single reason for the contrasting results of the SH- and SOWH-tests for these mt protein sequences. The SOWH-test considers only one a priori topology, T1, and of the 1,000 replicate data sets generated by using T1 only 7 resulted in topologies different from T1 when analyzed by ML. Evidently, if this topology, its corresponding parameter values $\hat{\theta}_1$ (as estimated by ML from the original mt protein data set), and the mtmam+F+Γ model of aa replacement were all adequate, then we would expect to retrieve the correct topology from a data set of 3,414 aa with high probability (1,000 − 7)/1,000 ≈ 0.99; consequently, our finding that for the original data TML and not T1 is optimal seems unreconcilable with the hypothesis that T1 is true. In contrast, the SH-test has 15 topologies considered equally plausible a priori in its null hypothesis and therefore the significance level assigned to a particular one of these, e.g., T1, is reduced. The effects of differences between parametric and nonparametric tests and the possibility that the mtmam+F+Γ model is inadequate have not been assessed.

DISCUSSION

We want to emphasize once more that the problems described above with typical applications of the KH-test are very real and will have practical consequences in many applications. We contend that all future applications must use new methods such as the SH-test and the SOWH-test (above). Assessment of the results of published analyses based on incorrect applications of the KH-test must be made with extreme caution. The sole correction to these results that we have been able to derive (see above) will often generate inconclusive results, demanding reanalysis of data.

Evidently, it is vital that researchers think carefully about what phylogenetic hypotheses they wish to test. A priori hypotheses and a posteriori hypotheses can be quite different, as can the statistical distributions required to test them. It serves no scientific purpose to “cheat” and represent an a posteriori hypothesis as an a priori one simply for the expediency of a more readily available or faster test.

The SOWH-test, as described above, tests a single a priori hypothesis of topology. If such tests are used repeatedly, to assess the significance of multiple trees, the issue of correcting significance values for multiple tests arises. This might occur with data sets for which large numbers of tree topologies are considered plausible a priori. Bar-Hen and Kishino (in press) describe a novel parametric likelihood-based test for computing simultaneous significance values for multiple topologies.

The SH-test simultaneously compares all members of a set M of topologies. The inclusion in M of all a priori possible topologies is important. Even topologies with low bootstrap replicate likelihoods ($L_x^{(i)}$) can readily affect the significance levels of other topologies, because these are based only on variations in likelihoods over bootstrap replicates ($\tilde{L}_x^{(i)} \equiv L_x^{(i)} - \bar{L}_x^{(i)}$). A posteriori selection of topologies for inclusion in or exclusion from M based on their likelihoods may thus bias all significance levels recorded, analogous to performing multiple comparisons tests on only a subset of a larger number of comparisons, selected (for example) to be the most (or least) significant. Decreasing the number of comparisons performed this way will unjustifiably increase the apparent significance levels of the results. In the HIV-1 example above, if the SH-test (posNPfcd) is applied by considering that all 105 topologies for six sequences are possibly true, the P-value for T1 is increased from 0.26 to 0.90. Considering all 105 possible topologies in the SH-test applied in the mt protein sequence example above increases the P-value for T1 from 0.81 to 0.93 (RELL approximation, posNPncd); for the topologies called a = 9–15 by Shimodaira and Hasegawa (1999), P-values are increased from significant values (< 0.05) to nonsignificant values (> 0.05). Clearly, the honest choice of a priori hypothesis topologies may be crucial to the conclusions ultimately drawn.

The claim is sometimes made (when phylogenies of different genes are compared, for instance) that no a priori topologies can be constructed. In such cases, however, one can usually recast the hypothesis and its statistical test differently. To use the example above, when comparing the evolutionary histories of different genes, we may restate this as a test of whether the two (or more) trees

are sample estimates of the same phylogeny (Rodrigo et al., 1993).

As illustrated in this paper, the results of parametric tests (e.g., the SOWH-test) and nonparametric tests (e.g., the SH-test) can appear to be very different. The SH-test may often appear to be more conservative than the SOWH-test. As we have explained, this may be due to some or all of the following phenomena: different forms of null hypotheses; increased power of parametric tests; and greater reliance of parametric tests on models of sequence evolution. The relative consequences of these and other effects require further investigation in the future.

PROGRAM AND DATA AVAILABILITY

Notes on the use of PAUP* (Swofford, 1998) to perform SOWH-tests, and details of the HIV-1 nucleotide sequences (6 sequences, each 2,000 bp) and mammalian mt protein aa sequences (6 sequences, each 3,414 aa) used in the examples above, can be obtained from the authors at http://www.zoo.cam.ac.uk/zoostaff/goldman/tests and ‘downstream’ Web pages. A computer program, shtests, to perform SH-tests by using the RELL approximation (posNPncd above) can be obtained from Andrew Rambaut at http://evolve.zoo.ox.ac.uk/software/shtests. Versions of PHYLIP and PAUP* package programs (Felsenstein, 1995; Swofford, 1998) implementing the SH-test are currently under development (J. Felsenstein, pers. comm.; D. Swofford, pers. comm.).

ACKNOWLEDGMENTS

Work by N.G. and A.G.R. on this topic was partially supported by the Isaac Newton Institute for the Mathematical Sciences programme on “Biomolecular Function and Evolution in the Context of the Genome Project” (July–December 1998). N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. J.P.A. is supported by a NIH Institutional NRSA Interdisciplinary Training in Genomic Sciences Fellowship and by the University of Washington Center for AIDS Research (CFAR). We are very grateful for the assistance given to us by Hidetoshi Shimodaira, Joe Felsenstein, David Swofford, Hirohisa Kishino, and Andrew Rambaut throughout the preparation of this paper; for prepublication versions of papers provided by Hidetoshi Shimodaira, Thomas Buckley, and Hirohisa Kishino; and for critical readings of draft versions of the paper by Edward Holmes, Ann Oakenfull, Tim Massingham, Martin Embley, and Andrew Rambaut. Andrew Rambaut and Korbinian Strimmer provided all of the 105-topology SH-test P-value calculations given in the Discussion.

REFERENCES

ADACHI, J., AND M. HASEGAWA. 1996. MOLPHY: Programs for molecular phylogenetics based on maximum likelihood, vers. 2.3. Institute of Statistical Mathematics, Tokyo.
ARVESTAD, L., AND W. J. BRUNO. 1997. Estimation of reversible substitution matrices from multiple pairs of sequences. J. Mol. Evol. 45:696–703.
BAR-HEN, A., AND H. KISHINO. In press. Comparing the likelihood functions of phylogenetic trees. Ann. Inst. Stat. Math.
CUNNINGHAM, C. W., H. ZHU, AND D. M. HILLIS. 1998. Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution 52:978–987.
EFRON, B. 1982. The jackknife, the bootstrap and other resampling plans. CBMS-NSF regional conference series in applied mathematics, volume 38. Society for Industrial and Applied Mathematics, Philadelphia.
EFRON, B., AND R. TIBSHIRANI. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1:54–77.
EFRON, B., E. HALLORAN, AND S. HOLMES. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:13429–13434.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376.
FELSENSTEIN, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
FELSENSTEIN, J. 1995. PHYLIP (Phylogenetic inference package), version 3.57. Univ. of Washington, Seattle.
GOLDMAN, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.
GOLDMAN, N., AND S. WHELAN. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975–978.
HALL, P., AND S. R. WILSON. 1991. Two guidelines for bootstrap hypothesis testing. Biometrics 47:757–762.
HASEGAWA, M., AND H. KISHINO. 1989. Confidence limits on the maximum-likelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43:672–677.
HASEGAWA, M., AND H. KISHINO. 1994. Accuracies of the simple methods for estimating the bootstrap probability of a maximum-likelihood tree. Mol. Biol. Evol. 11:142–145.
HASEGAWA, M., H. KISHINO, AND T. YANO. 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
HASEGAWA, M., H. KISHINO, AND T. YANO. 1988. Phylogenetic inference from DNA sequence data. Pages 1–13 in Statistical theory and data analysis II (K. Mutusita, ed.). Elsevier, Amsterdam.
HILLIS, D. M., B. K. MABLE, AND C. MORITZ. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pages 515–543 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.
HOPE, A. C. A. 1968. A simplified Monte Carlo significance test procedure. J. R. Statist. Soc. B 30:582–598.
HUELSENBECK, J. P., AND K. A. CRANDALL. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.
HUELSENBECK, J. P., AND B. RANNALA. 1997. Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science 276:227–232.

HUELSENBECK, J. P., D. M. HILLIS, AND R. NIELSEN. 1996. A likelihood-ratio test of monophyly. Syst. Biol. 45:546–558.
JUKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
KISHINO, H., AND M. HASEGAWA. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.
KISHINO, H., T. MIYATA, AND M. HASEGAWA. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151–160.
MARRIOTT, F. H. C. 1979. Barnard’s Monte Carlo tests: how many simulations? Appl. Statist. 28:75–77.
RODRIGO, A. G., M. KELLY-BORGES, P. R. BERGQUIST, AND P. L. BERGQUIST. 1993. A randomization test of the null hypothesis that two cladograms are sample estimates of a parametric phylogenetic tree. N.Z. J. Bot. 31:257–268.
SHIMODAIRA, H. 1993. A model search technique based on confidence set and map of models. Proc. Inst. Stat. Math. 41:131–147 (in Japanese).
SHIMODAIRA, H. 1998. An application of multiple comparison techniques to model selection. Ann. Inst. Stat. Math. 50:1–13.
SHIMODAIRA, H., AND M. HASEGAWA. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.
STRIMMER, K., AND A. VON HAESELER. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
STRIMMER, K., N. GOLDMAN, AND A. VON HAESELER. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211.
SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1996. The effect of topology on estimates of among-site rate variation. J. Mol. Evol. 42:308–312.
SWOFFORD, D. L. 1998. PAUP* 4.00: *Phylogenetic analysis using parsimony (and other methods). Sinauer, Sunderland, Massachusetts.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.
TEMPLETON, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221–244.
WESTFALL, P. H., AND S. S. YOUNG. 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons, New York.
YANG, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111.
YANG, Z. 1996. Among-site variation and its impact on phylogenetic analysis. TREE 11:367–372.
YANG, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.
YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324.
YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384–399.
YANG, Z., R. NIELSEN, AND M. HASEGAWA. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600–1611.
ZHANG, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868–875.

Received 9 November 1999; accepted 17 December 1999
Associate Editor: R. Olmstead
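
As a rough illustration of why the composition of the set M matters in the SH-test, the centring of bootstrap replicate log-likelihoods and the direct estimation of P-values with the RELL approximation might be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the shtests program or any published implementation: the function name sh_test_rell, its arguments, and the input layout (one list of per-site log-likelihoods per topology, each evaluated at that topology's ML parameter estimates) are hypothetical choices made for illustration only.

```python
import random


def sh_test_rell(site_lnl, n_boot=1000, seed=0):
    """Sketch of SH-test P-values using the RELL approximation.

    site_lnl[x][s] is assumed to be the log-likelihood of site s under
    topology x, evaluated at that topology's ML parameter estimates.
    Returns one P-value per topology, in input order.
    """
    rng = random.Random(seed)
    n_top = len(site_lnl)
    n_sites = len(site_lnl[0])

    # Observed statistics: delta_x = L_ML - L_x for the real data.
    totals = [sum(row) for row in site_lnl]
    best = max(totals)
    observed = [best - t for t in totals]

    # RELL: resample sites with replacement and re-sum the fixed
    # per-site log-likelihoods (no re-optimization of parameters).
    boot = [[0.0] * n_boot for _ in range(n_top)]
    for i in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        for x in range(n_top):
            row = site_lnl[x]
            boot[x][i] = sum(row[s] for s in idx)

    # Centring: subtract each topology's mean over replicates.
    centred = []
    for x in range(n_top):
        mean_x = sum(boot[x]) / n_boot
        centred.append([v - mean_x for v in boot[x]])

    # Replicate statistics: distance from the replicate maximum over M.
    counts = [0] * n_top
    for i in range(n_boot):
        top = max(centred[x][i] for x in range(n_top))
        for x in range(n_top):
            if top - centred[x][i] >= observed[x]:
                counts[x] += 1

    # Direct (one-sided) P-value estimate: proportion of replicates whose
    # statistic equals or exceeds the observed one.
    return [c / n_boot for c in counts]
```

Because the replicate maximum is taken over every member of M, adding a topology that is not itself the new ML topology, while keeping the same resampled sites, can only leave the P-values of the other topologies unchanged or raise them. This is the direction of the changes reported in the Discussion when all 105 topologies are considered.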
