DAMBE: Software Package for Data Analysis in Molecular Biology and Evolution

X. Xia and Z. Xie

DAMBE (data analysis in molecular biology and evolution) is an integrated software package for converting, manipulating, statistically and graphically describing, and analyzing molecular sequence data with a user-friendly Windows 95/98/2000/NT interface. DAMBE is free and can be downloaded from http://web.hku.hk/~xxia/software/software.htm. The current version is 4.0.36.

DAMBE (data analysis in molecular biology and evolution) is an integrated computer program for descriptive and comparative analysis of molecular data (including nucleotide and amino acid sequence data, as well as allele frequency and distance matrix data). It has features either not available or poorly implemented in other programs. These features are grouped into (1) sequence format conversion and manipulation supporting 20 commonly used molecular data formats; (2) descriptive statistics such as nucleotide, amino acid, and codon frequencies, dinucleotide and diamino acid frequencies, and analysis of codon usage and amino acid usage bias; and (3) comparative sequence analysis such as phylogenetic reconstruction of trees and ancestral sequences with distance, maximum parsimony, and maximum-likelihood methods, bootstrapping and jackknifing, significance tests of molecular clock and alternative phylogenetic hypotheses (i.e., how much better is the best tree compared to alternative trees), and fitting statistical distributions to substitution data over sites, including Poisson, negative binomial, and gamma distributions. DAMBE features a user-friendly Windows interface with extensive on-line help.

Sequence Input and Sequence Format Conversion

DAMBE can read and convert almost all commonly used molecular data formats (Table 1). In particular, DAMBE can take advantage of the rich information contained in the FEATURES table of GenBank files and extract specific segments such as CDS, exons, introns, rRNA, etc., with a few mouse clicks. The user can also use "custom splicing" to extract sequence segments that are not specified in the GenBank sequences.

Table 1. Common data file formats used in DAMBE

Sequence format      Read in   Convert to
PHYLIP               +         +
PAUP                 +         +
MEGA                 +         +
CLUSTAL              +         -
FASTA                +         +
GenBank              +         +
GCG                  +         +
MSF                  +         +
DNA strider          +         +
PAML                 +         +
RST,MP (a)           +         -
PHYLTEST             -         +
IG/Stanford          +         +
NBRF                 +         +
EMBL                 +         +
FITCH                +         +
PIR/CODATA           +         +
Plain text (b)       +         +
Allele frequency     +         -
Distance matrix      +         +

(a) Sequence formats for storing original sequences and reconstructed ancestral sequences from the original sequences.
(b) The one-sequence-per-file text format from programs such as Sequence Navigator and DNA Star.

DAMBE can read sequence files directly from a networked computer, such as a remote UNIX workstation, in the same way as one would read a file from a local hard disk. Extensive network functions have also been implemented for retrieving sequences directly from GenBank, either by LOCUS name or accession number, or by keyword search.

DAMBE features a color-coded sequence editor for either sequence input or visual alignment. Sequences can be as long as 32,768 bp.

Sequence Manipulation

We will only highlight two of the many sequence manipulation features in DAMBE.

Sequence Alignment

DAMBE can align nucleotide and amino acid sequences as most other alignment programs do. However, one particular feature that is not available in most other alignment programs is the ability to align protein-coding nucleotide sequences against aligned amino acid sequences. Other programs often introduce frame-shifting indels in the aligned protein-coding sequences, even if the protein genes are known to be functional and do not have these frame-shifting indels. In other words, the introduced frame-shifting indels in the aligned sequences are alignment artifacts, and the correctly aligned sequences should have complete codons, not one or two nucleotides, inserted or deleted. DAMBE solves this problem by aligning the protein-coding nucleotide sequences against aligned amino acid sequences. One can read in the protein-coding nucleotide sequences, translate them into amino acid sequences, align the amino acid sequences, and then align the original nucleotide sequences against the aligned amino acid sequences.

Translation

DAMBE implements all 12 different genetic codes and can therefore translate protein-coding nucleotide sequences from any organism to amino acid sequences. The implementation of these genetic codes greatly facilitates amino acid-based and codon-based analyses.

Descriptive Sequence Analysis

This includes nucleotide, amino acid, and codon usage analysis, compositional analysis based on dinucleotide and diamino acid frequencies, quantification of the effect of GC and TpA frequencies on exon and CDS lengths, and the methylation effect on codon usage bias.

A substitution model used in comparative sequence analysis, such as phylogenetic reconstruction using the maximum-likelihood method, typically has two categories of parameters, the frequency parameters and the rate ratio parameters. The descriptive sequence analysis helps to understand the factors affecting the frequency parameters and to select which substitution model to use in phylogenetic reconstruction.

Comparative Sequence Analysis

Quantification of Substitution Patterns

A substitution model is characterized by frequency parameters and rate ratio parameters, and it is important to know the empirical substitution patterns in order to decide which substitution model to use in analyzing sequences. The quantification of empirical substitution patterns requires pairwise comparisons. However, when the comparison is done between all possible sequence pairs, the resulting substitution pattern may be biased because the comparisons are not independent (Felsenstein 1992; Nee et al. 1996; Xia et al. 1996). For example, if there is one species that has recently experienced a large number of A→G transitions and few other substitutions, then all pairwise comparisons between this species and the other species will each contribute one data point with a large A→G transition bias. One way to avoid such a problem of nonindependence is to reconstruct ancestral states of DNA sequences and estimate the number of substitutions between neighboring nodes along the phylogenetic tree (Tamura and Nei 1993; Xia 1998; Xia and Li 1998). DAMBE automates this process.

Phylogenetic Reconstruction

DAMBE implements most commonly used phylogenetic methods such as distance-based (UPGMA, Fitch-Margoliash, and neighbor-joining methods), maximum parsimony, and maximum-likelihood methods. A variety of genetic distances are implemented. Nucleotide-based distances include the one-parameter (Jukes and Cantor 1969) and two-parameter (Kimura 1980) distances, the paralinear distance (Lake 1994), as well as distances based on the F84 model (Felsenstein 1993) and TN93 model (Tamura and Nei 1993). Codon-based distances include Li's (1993) synonymous and nonsynonymous distances. The UPGMA and neighbor-joining methods can handle tied values in the matrix and generate all possible alternative trees. Most other distance-based programs output only one of the possible trees.

All phylogenetic analyses for protein-coding genes can be performed on individual codon positions or combinations of codon positions, for example, the first and second codon positions when the third codon position has experienced substitution saturation. Alternatively, one can also perform analysis on translated amino acid sequences. DAMBE can translate nucleotide sequences from any organism into amino acid sequences because it implements all known genetic codes.

Phylogenetic Analysis Involving Bootstrapping and Jackknifing

Bootstrapping and delete-half jackknifing are implemented in DAMBE in conjunction with the phylogenetic methods mentioned above. Resampling can be nucleotide based, amino acid based, or codon based. The last is necessary for doing bootstrapping and jackknifing with codon-based methods such as Li's (1993) synonymous and nonsynonymous distances. Consensus trees are displayed with bootstrapping values at internal nodes. The branch lengths of a consensus tree can be evaluated.

Testing Alternative Phylogenetic Hypotheses

It is often necessary to evaluate the relative statistical support for alternative phylogenetic hypotheses such as alternative phylogenetic trees. Such hypothesis tests can be carried out in DAMBE with the distance, maximum parsimony, or maximum-likelihood methods. The significance tests make proper multiple comparisons involving multiple trees.

For the distance methods, the test is similar to that detailed in Xia (2000), except that the following equation

$$\operatorname{var} E = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (x_{ij} - y_{ij})^{2}}{n(n-1)/2 - m - 1} \qquad (1)$$

replaces equation (21.2) in Xia (2000).

For the maximum parsimony method, a rooted tree is required to represent alternative topologies. DNAPARS in PHYLIP already provides a significance test if you include user trees in the input file. In short, DNAPARS computes the number of steps (changes in character states) for each topology, the difference in the number of steps between the best and each alternative topology, and the associated (large sample) variance of the differences. The z score is computed and declared significant if it is larger than 1.96 (Felsenstein 1985). The main problem with this test is that the result can be interpreted probabilistically only when there are just two topologies, and it is not appropriate for multiple comparisons. DAMBE takes the same approach but uses the Newman–Keuls test, which is better suited to multiple comparisons.

For the maximum-likelihood method, the Kishino–Hasegawa test (Kishino and Hasegawa 1989), which is also called the RELL test, is implemented as in PAML (Yang 2000), from which part of the code was taken. The Kishino–Hasegawa test, as practiced in the literature, is analogous to the test in DNAPARS mentioned above, except that the test is based on the likelihood values rather than on the number of steps. In short, one calculates the log-likelihood for each topology, the difference in log-likelihood between the best tree and each of the alternative topologies, and the variance of the differences estimated by resampling methods such as bootstrapping. The z score is then calculated and declared significant if it is larger than 1.96. Again, such an interpretation is heuristic and is not appropriate probabilistically if more than two topologies are being compared. DAMBE does the same computation but uses the Newman–Keuls test, which is more appropriate for multiple comparisons.

Phylogenetic Tree Viewing and Manipulation

DAMBE can graph and print publication-quality trees. The tree-displaying window is scrollable and therefore can accommodate very large trees. The displayed tree can also be copied to presentation programs such as Microsoft PowerPoint.

Fitting Statistical Distributions to Substitutions Over Sites

It is important to know if the substitution rates vary among sites, because such rate heterogeneity, according to a comparative study based on simulated data (Kuhner and Felsenstein 1994), results in failure to recover the true phylogenetic relationships in virtually all commonly used phylogenetic programs (or algorithms), including the maximum-likelihood method (e.g., PHYLIP), maximum parsimony (e.g., PAUP), or neighbor-joining (e.g., MEGA) methods. DAMBE can fit the Poisson, negative binomial, and gamma distributions to substitution data over sites. The maximum-likelihood estimator for k in the negative binomial distribution is from Johnson et al. (1992), and that for the shape parameter in the gamma distribution is from Evans et al. (1993).

Graphics

In addition to graphically displaying and printing trees, DAMBE also produces a variety of graphic outputs including plots of one or more amino acid properties along amino acid sequences (e.g., polarity plot), saturation plots (i.e., transitions and transversions over divergence), variability-over-site plots, substitution-over-site plots, etc.

In short, DAMBE is a user-friendly program for the Windows platform that offers a suite of unique features as well as the capability of performing most routine data analyses in molecular biology, ecology, and evolution.

From the Bioinformatics Laboratory, Department of Ecology and Biodiversity, University of Hong Kong, Pokfulam Road, Hong Kong. DAMBE has incorporated codes from PHYLIP with permission from J. Felsenstein, and from the BASEML program in PAML with permission from Z. Yang. Part of the codes for sequence alignment in DAMBE are taken from the program CLUSTAL by D. Higgins, J. Thompson, and T. Gibson. DAMBE development is supported by a CRCG grant from the University of Hong Kong (10203043/27662, 10203435/27662) and an RGC grant from Hong Kong Research Grant Council (HKU7265/00M; to X.X.). We thank W. H. Li for using DAMBE in his bioinformatics course and H. Kong for assistance. Address correspondence to Xuhua Xia at the address above or e-mail: xxia@hkusua.hku.hk.

© 2001 The American Genetic Association

References

Evans M, Hastings N, and Peacock B, 1993. Statistical distributions. New York: John Wiley & Sons.

Felsenstein J, 1985. Confidence limits on phylogenies with a molecular clock. Syst Zool 34:152–161.

Felsenstein J, 1992. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet Res 59:139–147.

Felsenstein J, 1993. PHYLIP 3.5 (phylogeny inference package). Seattle: Department of Genetics, University of Washington.

Johnson NL, Kotz S, and Kemp AW, 1992. Univariate discrete distributions. New York: John Wiley & Sons.

Jukes TH and Cantor CR, 1969. Evolution of protein molecules. In: Mammalian protein metabolism (Munro HN, ed). New York: Academic Press; 21–123.

Kimura M, 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120.

Kishino H and Hasegawa M, 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J Mol Evol 29:170–179.

Kuhner MK and Felsenstein J, 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:459–468 [published erratum appears in Mol Biol Evol 1995;12:525].

Lake JA, 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA 91:1455–1459.

Li W-H, 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36:96–99.

Nee S, Holmes EC, Rambaut A, and Harvey PH, 1996. Inferring population history from molecular phylogenies. In: New uses for new phylogenies (Harvey PH, Brown AJL, Maynard Smith J, Nee S, eds). Oxford: Oxford University Press; 66–80.

Tamura K and Nei M, 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10:512–526.

Xia X, 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Mol Biol Evol 15:336–344.

Xia X, 2000. Data analysis in molecular biology and evolution. Boston: Kluwer Academic.

Xia X, Hafner MS, and Sudman PD, 1996. On transition bias in mitochondrial genes of pocket gophers. J Mol Evol 43:32–40.

Xia X and Li W-H, 1998. What amino acid properties affect protein evolution? J Mol Evol 47:557–564.

Yang Z, 2000. Phylogenetic analysis by maximum likelihood (PAML). London: University College.

Received July 30, 2000
Accepted February 14, 2001

Corresponding Editor: Sudhir Kumar

Computer Notes. The Journal of Heredity 2001:92(4):371–373
Downloaded from jhered.oxfordjournals.org at Fuzhou University on May 31, 2011
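The alignment of protein-coding nucleotide sequences against aligned amino acid sequences, described under Sequence Alignment, can be illustrated with a short Python sketch. This is not DAMBE's own code. The function name `thread_codons` and the example sequences are hypothetical, and the sketch assumes that the amino acid alignment has already been produced by some alignment program and that each nucleotide sequence contains exactly one codon per residue (no trailing stop codon).

```python
def thread_codons(nuc_seq, aligned_protein, gap="-"):
    """Lay the codons of an unaligned coding sequence onto its aligned protein.

    Each residue in the aligned protein receives the next codon, and each
    gap character receives three gap characters, so every indel in the
    nucleotide alignment is a whole codon and the reading frame is preserved.
    """
    nuc_seq = nuc_seq.upper()
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nuc_seq) != 3 * n_residues:
        raise ValueError("nucleotide length must be 3 x number of residues")

    # Cut the coding sequence into consecutive codons.
    codons = [nuc_seq[i:i + 3] for i in range(0, len(nuc_seq), 3)]

    aligned = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == gap:
            aligned.append(gap * 3)          # codon-sized gap
        else:
            aligned.append(codons[next_codon])
            next_codon += 1
    return "".join(aligned)


# Hypothetical example: the second protein has one extra residue (T).
protein_alignment = {"seq1": "MK-L", "seq2": "MKTL"}
coding_sequences = {"seq1": "ATGAAACTG", "seq2": "ATGAAAACCCTG"}

for name, protein in protein_alignment.items():
    print(name, thread_codons(coding_sequences[name], protein))
```

Because gaps are only ever inserted in units of three nucleotides, the resulting nucleotide alignment cannot contain the frame-shifting indels that the text describes as alignment artifacts. The sketch does not check that the codons actually translate to the given residues; a fuller version would verify that under the appropriate genetic code.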

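Equation (1) for the distance-based test is a sum of squared differences over all sequence pairs divided by a degrees-of-freedom term, n(n − 1)/2 − m − 1. The Python sketch below evaluates it for small matrices. The article does not define the symbols here, so the readings used in the code are assumptions: x and y are taken to be two symmetric n × n matrices of pairwise distances (for instance observed distances and distances implied by a tree), and m is taken to be a count of estimated parameters. Xia (2000) remains the authority on their exact meaning; the function name `var_e` and the numbers are hypothetical.

```python
def var_e(x, y, m):
    """Evaluate equation (1): sum over i < j of (x_ij - y_ij)^2,
    divided by n(n - 1)/2 - m - 1.

    x, y : symmetric n x n matrices (lists of lists) of pairwise values.
    m    : integer appearing in the denominator of equation (1); its
           interpretation as a parameter count is an assumption.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same dimensions")

    # Only the upper triangle (i < j) enters the double sum.
    sum_sq = sum(
        (x[i][j] - y[i][j]) ** 2
        for i in range(n - 1)
        for j in range(i + 1, n)
    )

    denom = n * (n - 1) / 2 - m - 1
    if denom <= 0:
        raise ValueError("n(n - 1)/2 - m - 1 must be positive")
    return sum_sq / denom


# Hypothetical 4-taxon example: two of the six pairs differ by 1.
x = [[0, 2, 4, 6],
     [2, 0, 4, 6],
     [4, 4, 0, 6],
     [6, 6, 6, 0]]
y = [[0, 3, 4, 6],
     [3, 0, 4, 6],
     [4, 4, 0, 5],
     [6, 6, 5, 0]]

print(var_e(x, y, m=3))   # (1 + 1) / (6 - 3 - 1)
```

With four taxa there are six pairs, so choosing m = 3 leaves a denominator of 2; the guard against a non-positive denominator reflects that the formula is only meaningful when there are more pairs than m + 1.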

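The codon-based resampling mentioned under bootstrapping and jackknifing treats each aligned codon, rather than each nucleotide site, as the resampling unit, so that every replicate can still be analyzed with codon-based distances. The following Python sketch shows one way to draw a bootstrap or delete-half jackknife replicate from a codon alignment. It is an illustration under stated assumptions, not DAMBE's implementation: the names `split_codons` and `resample_codons` are hypothetical, the alignment is assumed to be in frame with a length divisible by three, and for an odd number of codons the jackknife here deletes the smaller half.

```python
import random


def split_codons(seq):
    """Split an in-frame sequence into a list of codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def resample_codons(alignment, method="bootstrap", rng=None):
    """Return one resampled replicate of a codon alignment.

    alignment : dict mapping sequence name -> aligned coding sequence.
    method    : "bootstrap" draws codon columns with replacement;
                "jackknife" keeps a random half of the columns
                (delete-half), preserving their original order.
    rng       : optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    names = list(alignment)
    if len({len(alignment[name]) for name in names}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    # rows[r][c] is codon c of sequence r.
    rows = [split_codons(alignment[name]) for name in names]
    n_cols = len(rows[0])

    if method == "bootstrap":
        picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    elif method == "jackknife":
        n_keep = n_cols - n_cols // 2
        picks = sorted(rng.sample(range(n_cols), n_keep))
    else:
        raise ValueError("method must be 'bootstrap' or 'jackknife'")

    # The same column picks are applied to every sequence.
    return {
        name: "".join(rows[r][c] for c in picks)
        for r, name in enumerate(names)
    }


# Hypothetical two-sequence alignment of four codons.
alignment = {"a": "ATGAAACTGGGG", "b": "ATGAAGCTAGGA"}
rng = random.Random(1)
print(resample_codons(alignment, "bootstrap", rng))
print(resample_codons(alignment, "jackknife", rng))
```

Applying the same column picks to every sequence keeps homologous codons together, which is what allows synonymous and nonsynonymous distances to be recomputed on each replicate.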