The Journal of Heredity 2001:92(4)
Computer Notes
Downloaded from jhered.oxfordjournals.org at Fuzhou University on May 31, 2011

From the Department of Applied Economics and Statistics 204, University of Nevada, Reno, NV 89557. Address correspondence to George C. J. Fernandez at the address above.
© 2001 The American Genetic Association
Received September 7, 2000
Accepted April 30, 2001
Corresponding Editor: Bruce S. Weir

DAMBE: Software Package for Data Analysis in Molecular Biology and Evolution

X. Xia and Z. Xie

DAMBE (data analysis in molecular biology and evolution) is an integrated software package for converting, manipulating, statistically and graphically describing, and analyzing molecular sequence data with a user-friendly Windows 95/98/2000/NT interface. DAMBE is free and can be downloaded from http://web.hku.hk/~xxia/software/software.htm. The current version is 4.0.36.

DAMBE (data analysis in molecular biology and evolution) is an integrated computer program for descriptive and comparative analysis of molecular data (including nucleotide and amino acid sequence data, as well as allele frequency and distance matrix data). It has features either not available or poorly implemented in other programs. These features are grouped into (1) sequence format conversion and manipulation supporting 20 commonly used molecular data formats; (2) descriptive statistics such as nucleotide, amino acid, and codon frequencies, dinucleotide and diamino acid frequencies, and analysis of codon usage and amino acid usage bias; and (3) comparative sequence analysis such as phylogenetic reconstruction of trees and ancestral sequences with distance, maximum parsimony, and maximum-likelihood methods, bootstrapping and jackknifing, significance tests of the molecular clock and of alternative phylogenetic hypotheses (i.e., how much better the best tree is than alternative trees), and fitting statistical distributions to substitution data over sites, including the Poisson, negative binomial, and gamma distributions. DAMBE features a user-friendly Windows interface with extensive on-line help.

Sequence Input and Sequence Format Conversion

DAMBE can read and convert almost all commonly used molecular data formats (Table 1). In particular, DAMBE can take advantage of the rich information contained in the FEATURES table of GenBank files and extract specific segments such as CDS, exons, introns, rRNA, etc., with a few mouse clicks. The user can also use "custom splicing" to extract sequence segments that are not specified in the GenBank sequences.

Table 1. Common data file formats used in DAMBE

Sequence format      Read in   Convert to
PHYLIP               +         +
PAUP                 +         +
MEGA                 +         +
CLUSTAL              +         –
FASTA                +         +
GenBank              +         +
GCG                  +         +
MSF                  +         +
DNA strider          +         +
PAML                 +         +
RST, MP (a)          +         –
PHYLTEST             –         +
IG/Stanford          +         +
NBRF                 +         +
EMBL                 +         +
FITCH                +         +
PIR/CODATA           +         +
Plain text (b)       +         +
Allele frequency     +         –
Distance matrix      +         +

(a) Sequence formats for storing original sequences and reconstructed ancestral sequences from the original sequences.
(b) The one-sequence-per-file text format from programs such as Sequence Navigator and DNA Star.

DAMBE can read sequence files directly from a networked computer, such as a remote UNIX workstation, in the same way as one would read a file from a local hard disk. Extensive network functions have also been implemented for retrieving sequences directly from GenBank, either by LOCUS name or accession number, or by keyword search.

DAMBE features a color-coded sequence editor for either sequence input or visual alignment. Sequences can be as long as 32,768 bp.

Sequence Manipulation

We will highlight only two of the many sequence manipulation features in DAMBE.

Sequence Alignment

DAMBE can align nucleotide and amino acid sequences as most other alignment programs do. However, one particular feature that is not available in most other alignment programs is the ability to align protein-coding nucleotide sequences against aligned amino acid sequences. Other programs often introduce frame-shifting indels in the aligned protein-coding sequences, even if the protein genes are known to be functional and do not have these frame-shifting indels. In other words, the introduced frame-shifting indels in the aligned sequences are alignment artifacts, and the correctly aligned sequences should have complete codons, not one or two nucleotides, inserted or deleted. DAMBE solves this problem by aligning the protein-coding nucleotide sequences against aligned amino acid sequences. One can read in the protein-coding nucleotide sequences, translate them into amino acid sequences, align the amino acid sequences, and then align the original nucleotide sequences against the aligned amino acid sequences.

Translation

DAMBE implements all 12 different genetic codes and can therefore translate protein-coding nucleotide sequences from any organism to amino acid sequences. The implementation of these genetic codes greatly facilitates amino acid-based and codon-based analyses.

Descriptive Sequence Analysis

This includes nucleotide, amino acid, and codon usage analysis, compositional analysis based on dinucleotide and diamino acid frequencies, quantification of the effect of GC and TpA frequencies on exon and CDS lengths, and the methylation effect on codon usage bias.

A substitution model used in comparative sequence analysis, such as phylogenetic reconstruction using the maximum-likelihood method, typically has two categories of parameters, the frequency parameters and the rate ratio parameters. The descriptive sequence analysis helps the user to understand the factors affecting the frequency parameters and to select which substitution model to use in phylogenetic reconstruction.

Comparative Sequence Analysis

Quantification of Substitution Patterns

A substitution model is characterized by frequency parameters and rate ratio parameters, and it is important to know the empirical substitution patterns in order to decide which substitution model to use in analyzing sequences. The quantification of empirical substitution patterns requires pairwise comparisons. However, when the comparison is done between all possible sequence pairs, the resulting substitution pattern may be biased because the comparisons are not independent (Felsenstein 1992; Nee et al. 1996; Xia et al. 1996). For example, if there is one species that has recently experienced a large number of A→G transitions and few other substitutions, then all pairwise comparisons between this species and the other species will each contribute one data point with a large A→G transition bias. One way to avoid such a problem of nonindependence is to reconstruct ancestral states of DNA sequences and estimate the number of substitutions between neighboring nodes along the phylogenetic tree (Tamura and Nei 1993; Xia 1998; Xia and Li 1998). DAMBE automates this process.

Phylogenetic Reconstruction

DAMBE implements most commonly used phylogenetic methods such as distance-based (UPGMA, Fitch-Margoliash, and neighbor-joining methods), maximum parsimony, and maximum-likelihood methods. A variety of genetic distances are implemented. Nucleotide-based distances include the one-parameter (Jukes and Cantor 1969) and two-parameter (Kimura 1980) distances, the paralinear distance (Lake 1994), as well as distances based on the F84 model (Felsenstein 1993) and TN93 model (Tamura and Nei 1993). Codon-based distances include Li's (1993) synonymous and nonsynonymous distances. The UPGMA and neighbor-joining methods can handle tied values in the matrix and generate all possible alternative trees. Most other distance-based programs output only one of the possible trees.

All phylogenetic analyses for protein-coding genes can be performed on individual codon positions or combinations of codon positions, for example, the first and second codon positions when the third codon position has experienced substitution saturation. Alternatively, one can also perform analysis on translated amino acid sequences. DAMBE can translate nucleotide sequences from any organism into amino acid sequences because it implements all known genetic codes.

Phylogenetic Analysis Involving Bootstrapping and Jackknifing

Bootstrapping and delete-half jackknifing are implemented in DAMBE in conjunction with the phylogenetic methods mentioned above. Resampling can be nucleotide based, amino acid based, or codon based. The last is necessary for doing bootstrapping and jackknifing with codon-based methods such as Li's (1993) synonymous and nonsynonymous distances. Consensus trees are displayed with bootstrapping values at internal nodes. The branch lengths of a consensus tree can be evaluated.

Testing Alternative Phylogenetic Hypotheses

It is often necessary to evaluate the relative statistical support for alternative phylogenetic hypotheses such as alternative phylogenetic trees. Such hypothesis tests can be carried out in DAMBE with the distance, maximum parsimony, or maximum-likelihood methods. The significance tests make proper multiple comparisons involving multiple trees.

For the distance methods, the test is similar to that detailed in Xia (2000), except that the following equation

\[
\operatorname{var} E = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (x_{ij} - y_{ij})^{2}}{n(n-1)/2 - m - 1} \tag{1}
\]

replaces equation (21.2) in Xia (2000).

For the maximum parsimony method, a rooted tree is required to represent alternative topologies. DNAPARS in PHYLIP already provides a significance test if you include user trees in the input file. In short, DNAPARS computes the number of steps (changes in character states) for each topology, the difference in the number of steps between the best and each alternative topology, and the associated (large sample) variance of the differences. The z score is computed and declared significant if it is larger than 1.96 (Felsenstein 1985). The main problem with this test is that the result can be interpreted probabilistically only when there are just two topologies; it is not appropriate for multiple comparisons. DAMBE takes the same approach but uses the Newman–Keuls test, which is better suited to multiple comparisons.

For the maximum-likelihood method, the Kishino–Hasegawa test (Kishino and Hasegawa 1989), which is also called the RELL test, is implemented as in PAML (Yang 2000), from which I have taken part of the code. The Kishino–Hasegawa test, as practiced in the literature, is analogous to the test in DNAPARS mentioned above, except that the test is based on the likelihood values rather than on the number of steps. In short, one calculates the log-likelihood for each topology, the difference in log-likelihood between the best tree and each of the alternative topologies, and the variance of the differences estimated by resampling methods such as bootstrapping. The z score is then calculated and declared significant if it is larger than 1.96. Again, such an interpretation is heuristic and is not appropriate probabilistically if more than two topologies are being compared. DAMBE does the same computation but uses the Newman–Keuls test, which is more appropriate for multiple comparisons.

Phylogenetic Tree Viewing and Manipulation

DAMBE can graph and print publication-quality trees. The tree-displaying window is scrollable and therefore can accommodate very large trees. The displayed tree can also be copied to presentation programs such as Microsoft PowerPoint.

Fitting Statistical Distributions to Substitutions Over Sites

It is important to know if the substitution rates vary among sites, because such rate heterogeneity, according to a comparative study based on simulated data (Kuhner and Felsenstein 1994), results in failure to recover the true phylogenetic relationships in virtually all commonly used phylogenetic programs (or algorithms), including the maximum-likelihood method (e.g., PHYLIP), maximum parsimony (e.g., PAUP), or neighbor-joining (e.g., MEGA) methods. DAMBE can fit the Poisson, negative binomial, and gamma distributions to substitution data over sites. The maximum-likelihood estimator for k in the negative binomial distribution is from Johnson et al. (1992), and that for the shape parameter in the gamma distribution is from Evans et al. (1993).

Graphics

In addition to graphically displaying and printing trees, DAMBE also produces a variety of graphic outputs including plots of one or more amino acid properties along amino acid sequences (e.g., polarity plot), saturation plots (i.e., transitions and transversions over divergence), variability-over-site plots, substitution-over-site plots, etc.

In short, DAMBE is a user-friendly program for the Windows platform that offers a suite of unique features as well as the capability of performing most routine data analyses in molecular biology, ecology, and evolution.

From the Bioinformatics Laboratory, Department of Ecology and Biodiversity, University of Hong Kong, Pokfulam Road, Hong Kong. DAMBE has incorporated codes from PHYLIP with permission from J. Felsenstein, and from the BASEML program in PAML with permission from Z. Yang. Part of the codes for sequence alignment in DAMBE are taken from the program CLUSTAL by D. Higgins, J. Thompson, and T. Gibson. DAMBE development is supported by a CRCG grant from the University of Hong Kong (10203043/27662, 10203435/27662) and an RGC grant from Hong Kong Research Grant Council (HKU7265/00M; to X.X.). We thank W. H. Li for using DAMBE in his bioinformatics course and H. Kong for assistance. Address correspondence to Xuhua Xia at the address above or e-mail: xxia@hkusua.hku.hk.
© 2001 The American Genetic Association

References

Evans M, Hastings N, and Peacock B, 1993. Statistical distributions. New York: John Wiley & Sons.
Felsenstein J, 1985. Confidence limits on phylogenies with a molecular clock. Syst Zool 34:152–161.
Felsenstein J, 1992. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet Res 59:139–147.
Felsenstein J, 1993. PHYLIP 3.5 (phylogeny inference package). Seattle: Department of Genetics, University of Washington.
Johnson NL, Kotz S, and Kemp AW, 1992. Univariate discrete distributions. New York: John Wiley & Sons.
Jukes TH and Cantor CR, 1969. Evolution of protein molecules. In: Mammalian protein metabolism (Munro HN, ed). New York: Academic Press; 21–123.
Kimura M, 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120.
Kishino H and Hasegawa M, 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J Mol Evol 29:170–179.
Kuhner MK and Felsenstein J, 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:459–468 [published erratum appears in Mol Biol Evol 1995;12:525].
Lake JA, 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA 91:1455–1459.
Li W-H, 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36:96–99.
Nee S, Holmes EC, Rambaut A, and Harvey PH, 1996. Inferring population history from molecular phylogenies. In: New uses for new phylogenies (Harvey PH, Brown AJL, Maynard Smith J, Nee S, eds). Oxford: Oxford University Press; 66–80.
Tamura K and Nei M, 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10:512–526.
Xia X, 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Mol Biol Evol 15:336–344.
Xia X, 2000. Data analysis in molecular biology and evolution. Boston: Kluwer Academic.
Xia X, Hafner MS, and Sudman PD, 1996. On transition bias in mitochondrial genes of pocket gophers. J Mol Evol 43:32–40.
Xia X and Li W-H, 1998. What amino acid properties affect protein evolution? J Mol Evol 47:557–564.
Yang Z, 2000. Phylogenetic analysis by maximum likelihood (PAML). London: University College.

Received July 30, 2000
Accepted February 14, 2001
Corresponding Editor: Sudhir Kumar
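The last step of the procedure described under Sequence Alignment (aligning the original nucleotide sequences against the aligned amino acid sequences) can be illustrated with a short sketch. This is not DAMBE's code: `thread_codons` is a hypothetical helper name, and the sketch assumes that the coding sequence is in frame, that its length is a multiple of three, and that gaps in the aligned protein are written as "-".

```python
def thread_codons(nuc_seq, aligned_protein, gap="-"):
    """Lay an unaligned coding sequence onto its aligned protein sequence.

    Every residue in the aligned protein consumes one codon of nuc_seq,
    and every gap becomes a three-character gap, so indels always span
    complete codons. (Hypothetical helper, for illustration only.)
    """
    if len(nuc_seq) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [nuc_seq[i:i + 3] for i in range(0, len(nuc_seq), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues != len(codons):
        raise ValueError("number of codons and residues differ")
    remaining = iter(codons)
    return "".join(gap * 3 if aa == gap else next(remaining)
                   for aa in aligned_protein)


# Two coding sequences whose translations were aligned as MA-K and MAGK.
aligned_nuc = [
    thread_codons("ATGGCTAAA", "MA-K"),     # gives "ATGGCT---AAA"
    thread_codons("ATGGCTGGTAAA", "MAGK"),  # unchanged, no gaps
]
```

Because gaps are only ever inserted as whole triplets, the resulting nucleotide alignment cannot contain the frame-shifting indels that the text identifies as alignment artifacts.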
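Equation (1) under Testing Alternative Phylogenetic Hypotheses can be evaluated directly once its symbols are fixed. The article does not define them (the definitions are in Xia 2000), so the sketch below assumes that x_ij are the observed pairwise distances, y_ij are the corresponding distances implied by the tree, n is the number of sequences, and m is the number of quantities estimated when fitting the tree. The function name `var_e` is hypothetical.

```python
def var_e(x, y, m):
    """Evaluate equation (1): the sum of squared differences over all
    pairs i < j, divided by n(n - 1)/2 - m - 1.

    x, y: square matrices (lists of lists) of equal size n.
    m:    assumed here to be the number of fitted quantities.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same dimension")
    sum_sq = sum((x[i][j] - y[i][j]) ** 2
                 for i in range(n - 1)
                 for j in range(i + 1, n))
    denom = n * (n - 1) / 2 - m - 1
    if denom <= 0:
        raise ValueError("n(n - 1)/2 - m - 1 must be positive")
    return sum_sq / denom


# Four sequences (six pairs); three pairs differ by exactly 1.
x = [[0, 3, 4, 5], [3, 0, 6, 7], [4, 6, 0, 8], [5, 7, 8, 0]]
y = [[0, 2, 3, 4], [2, 0, 6, 7], [3, 6, 0, 8], [4, 7, 8, 0]]
result = var_e(x, y, m=2)  # 3 / (6 - 2 - 1)
```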
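The two-topology z test described for DNAPARS and for the Kishino–Hasegawa test can be sketched as follows. This is an illustration under stated assumptions, not the PHYLIP, PAML, or DAMBE implementation: it takes per-site scores (log-likelihoods, or negated step counts) for two topologies and estimates the variance of the summed difference from the per-site differences, a large-sample formula used here in place of the bootstrap resampling mentioned in the text. It covers a single pairwise comparison only and does not implement the Newman–Keuls procedure. The name `paired_site_z` is hypothetical.

```python
import math


def paired_site_z(scores_best, scores_alt):
    """Return (total difference, z score) for two topologies.

    scores_best, scores_alt: per-site scores for the best and the
    alternative topology, in the same site order.
    """
    if len(scores_best) != len(scores_alt):
        raise ValueError("both topologies need the same number of sites")
    diffs = [b - a for b, a in zip(scores_best, scores_alt)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two sites are required")
    total = sum(diffs)
    mean = total / n
    # Large-sample variance of the summed per-site differences.
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_total == 0:
        raise ValueError("per-site differences have zero variance")
    return total, total / math.sqrt(var_total)


total, z = paired_site_z([-1.0, -2.0, -1.0, -2.0],
                         [-2.0, -2.0, -2.0, -2.0])
significant = z > 1.96  # the threshold quoted in the text
```

As the text notes, comparing z with 1.96 has a probabilistic interpretation only when exactly two topologies are compared.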
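Codon-based resampling, which the section on bootstrapping and jackknifing describes as necessary for codon-based methods, amounts to drawing whole codon columns rather than single sites. The sketch below is a minimal illustration, not DAMBE's code; `resample_codon_columns` is a hypothetical name, and the alignment is assumed to be in frame with every sequence of the same length.

```python
import random


def resample_codon_columns(alignment, rng, jackknife=False):
    """Resample an in-frame alignment by whole codon columns.

    alignment: list of equal-length strings, length a multiple of 3.
    rng:       a random.Random instance, passed in for reproducibility.
    jackknife: if True, keep a random half of the codon columns without
               replacement (delete-half jackknife); otherwise draw as
               many columns as the original, with replacement.
    """
    length = len(alignment[0])
    if length % 3 != 0 or any(len(s) != length for s in alignment):
        raise ValueError("alignment must be rectangular and in frame")
    n_codons = length // 3
    if jackknife:
        picks = sorted(rng.sample(range(n_codons), n_codons // 2))
    else:
        picks = [rng.randrange(n_codons) for _ in range(n_codons)]
    return ["".join(seq[3 * c:3 * c + 3] for c in picks)
            for seq in alignment]


rng = random.Random(0)
alignment = ["ATGGCTAAATTT", "ATGGCAAAGTTC"]
bootstrap_replicate = resample_codon_columns(alignment, rng)
jackknife_replicate = resample_codon_columns(alignment, rng, jackknife=True)
```

Each replicate keeps codons intact, so synonymous and nonsynonymous distances can still be computed on it.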