Data For Phylogenetics

Data for phylogenetic analysis
The data that are used to estimate the phylogeny of a set of tips are the characteristics of
those tips. Therefore the success of phylogenetic inference depends in large measure on
the choice of trait data, the accuracy of those data, and the quantity of data obtained.
Whether or not you plan to carry out phylogenetic analyses yourself, it is important to
know the kinds of data that are typically used and to understand
how those data are organized to permit analysis.
In this chapter I introduce the concept of a data matrix, and focus initially on the two
kinds of data that are most important to know about: DNA sequence data and
morphological data. I end by surveying the full range of data types that have at one time
or another been used for phylogenetic analysis.
A character-state data matrix
The first step in a phylogenetic analysis is to decide on the organisms or species that will
serve as tips. The tips scored for traits in a phylogenetic analysis are usually called taxa
(singular: taxon). [This usage of the term taxon/taxa is similar but not identical to the
“formally named group or clade” sense used in the context of biological classification.]
Enough taxa need to be included in a phylogenetic analysis so that the tree will resolve
the phylogenetic questions that motivate the study. This will almost always mean
including multiple taxa from within the group of interest (the ingroup) and at least one,
but perhaps many, outgroups: taxa that are known not to be within the ingroup. The
reason for including such outgroups was discussed briefly in chapter 6, and will be
covered in more detail in chapter 9. The choice of taxa is also heavily influenced by
practical issues, such as the availability of material for study.
Imagine for example that you are studying the phylogeny of the Carnivora, a group of
mammals, which share a number of dental and skeletal traits that support their
monophyly. Your ingroup would likely include at least one representative of each of the
previously identified clades from within Carnivora: dogs (Canidae), cats (Felidae),
hyaenas (Hyaenidae), otters (Mustelidae), raccoons (Procyonidae), bears (Ursidae), civets
(Viverridae), and seals (Pinnipedia). Additionally, individual taxa whose relationships are
uncertain, in this case the giant and red pandas, might be included in the study.
As outgroups, you could theoretically include any taxon that is not a member of
Carnivora. However, the best outgroups are usually ones that are reasonably closely
related to the ingroup so that they can be compared directly for many traits. For example,
you might use non-Carnivora mammals that have previously been found to be closely
related to Carnivora, such as members of the horse/rhino/tapir clade.
Having selected the set of taxa to include in a phylogenetic study, the next step is to
collect information on the traits of those taxa. In order to keep track of the similarity or
differences of the taxa, the data are organized into a data matrix. Two kinds of data
matrix are important to know about.
A character-state matrix is a list of taxa and the state that they manifest for each of a set
of characters. A character-state matrix has one entry for each taxon for each character
scored. Thus, for T taxa and C characters, the total number of data points is T x C.
A distance matrix is a listing of the overall dissimilarity (or more rarely similarity) of
each pair of taxa. A distance matrix has (T x (T-1))/2 entries. Given that we typically
score many more characters than taxa, distance matrices usually include less total
information than character-state matrices.
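The two entry counts are easy to compare numerically. The following is a minimal sketch; the function names and the study sizes are assumptions chosen only for illustration.

```python
def character_matrix_entries(num_taxa, num_characters):
    """Entries in a character-state matrix: one per taxon per character (T x C)."""
    return num_taxa * num_characters


def distance_matrix_entries(num_taxa):
    """Entries in a distance matrix: one per unordered pair of taxa, T x (T-1) / 2."""
    return num_taxa * (num_taxa - 1) // 2


# Hypothetical study sizes (taxa, characters), for illustration only.
for taxa, characters in [(6, 10), (11, 1000)]:
    print(taxa, "taxa,", characters, "characters:",
          character_matrix_entries(taxa, characters), "character-state entries,",
          distance_matrix_entries(taxa), "pairwise distances")
```

With six taxa and ten characters the character-state matrix holds 60 entries while the distance matrix holds 15, and the gap widens as more characters are scored.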
A character-state matrix lists the state of each taxon for a number of characters. In the
case of morphological data, a character is a measurable attribute of an organism that has
the potential to be present in distinct states. Some examples of characters and character
states are given in Table 1. Note that the characters include hard and soft body parts and
behavioral traits. Further, some characters are presence/absence, some show discrete
variation (e.g., number of teeth), and others show truly continuous variation (e.g., color, size)
that may be broken up into discrete states based on patterns of variation seen among the
study taxa. Character scoring issues will be discussed further below.
Table 1. Examples of characters and character states.

Character                 States: actual                       States: numerical
Hair color                White, brown, black                  0, 1, 2
Number of lower molars    1, 2, 3                              0, 1, 2
External ear (pinna)      Present, absent                      0, 1
Life habit                Terrestrial, amphibious, marine      0, 1, 2
Adult body mass           < 50g, 50-500g, 500-2000g, >2000g    0, 1, 2, 3
Having decided on a list of characters and character states you can then score them for
each taxon. The scoring is usually conducted on one or a limited number of individual
organisms that are considered to represent the taxon of interest (e.g., a species). In the
case that all the individuals examined for a taxon have the same state, that state is scored
unambiguously. If there is variation among the individuals of one taxon for a trait, the
trait will be scored so as to keep track of the fact that there is variation, or polymorphism,
within this taxon.
Normally scoring is facilitated by assigning numerical values to each state of each
character, conventionally starting with ‘0’. The numerical values assigned are arbitrary,
except that for characters where there is an intrinsic ordering among the states (for
example, molar number and body mass in the above example) one preserves the ordering
of states, although it does not matter whether they are ordered from high to low or
low to high. In the Hennigian Method, ancestral states or states occurring in the
outgroup would usually be assigned state ‘0’, but this convention is no longer followed.
An example of a part of a typical morphological data matrix is given. To the left the taxa
are listed. Following each taxon are the values for each of the ten characters in this small
data matrix. Notice that some taxa may be scored as unknown for certain characters
(conventionally represented with ‘?’), either because we are ignorant as to the proper
scoring or because it is impossible to score (e.g., toe number in snakes). Also, a taxon
can be scored as polymorphic by listing multiple states within a cell.
Taxon list          Character-state scoring
(character number)  1   2   3   4   5   6   7   8   9   10
Horse (outgroup)    0   1   0   0   1   1   0   0   0   0
Cat                 0   1   1   1   1   1   0   1   1   1
Dog                 0   0   1   1   0   1   0   1   1   1
Bear                1   0   1   12  1   2   0   2   1   1
Otter               0   0   2   2   1   2   0   1   1   01
Seal                1   ?   2   2   1   0   0   2   1   ?
[Does this matrix make sense?!]
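One way to hold such a matrix in a program is to store each cell as a set of states, so that polymorphic and unknown scores are kept distinct from ordinary ones. The sketch below uses the example matrix; the function names are my own, and it assumes that every state is a single digit so that a token such as ‘12’ means "states 1 and 2".

```python
def parse_cell(token):
    """Turn a token such as '1', '12', or '?' into a set of states, or None if unknown."""
    if token == "?":
        return None          # unknown or inapplicable score
    return set(token)        # '12' -> {'1', '2'}: polymorphic for states 1 and 2


def parse_matrix(rows):
    """Parse {taxon: 'tok tok ...'} into {taxon: [cell, cell, ...]}."""
    return {taxon: [parse_cell(tok) for tok in line.split()]
            for taxon, line in rows.items()}


MATRIX = parse_matrix({
    "Horse": "0 1 0 0 1 1 0 0 0 0",
    "Cat":   "0 1 1 1 1 1 0 1 1 1",
    "Dog":   "0 0 1 1 0 1 0 1 1 1",
    "Bear":  "1 0 1 12 1 2 0 2 1 1",
    "Otter": "0 0 2 2 1 2 0 1 1 01",
    "Seal":  "1 ? 2 2 1 0 0 2 1 ?",
})

# Report which characters (numbered from 1) are unknown or polymorphic per taxon.
for taxon, cells in MATRIX.items():
    unknown = [i + 1 for i, c in enumerate(cells) if c is None]
    polymorphic = [i + 1 for i, c in enumerate(cells) if c is not None and len(c) > 1]
    print(f"{taxon}: unknown {unknown}, polymorphic {polymorphic}")
```

Keeping unknown cells as None rather than as an extra state makes it harder to treat ignorance as if it were an observed character state by accident.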
In the case of DNA sequence data, a character-state matrix looks very similar. The taxa
correspond to individual organisms from the taxon whose DNA was extracted and
sequenced. The characters in this case correspond to a particular position in a gene’s
sequence. Therefore, unlike morphological data, the character number is not arbitrary but
indicates the relative position of the nucleotide within the region that has been sequenced.
The four states correspond to the four different bases that can occupy each position. As
with morphological data, you can be ignorant as to the identity of a base in a particular
position (‘?’) or you can find multiple alternative states within the taxon. Additionally,
you might find that some taxa have a portion of the sequence missing entirely due to
insertions or deletions of DNA in the course of evolution. These are generally indicated
with a gap code, ‘-’, but are most commonly treated as equivalent to missing data. The
phylogenetic analysis of gaps and the determination of where insertion/deletion events
have occurred raises a number of complex issues that will be touched on when we cover
sequence alignment.
Taxon list           Sequence data
(sequence position)  1   2   3   4   5   6   7   8   9   10
Horse (outgroup)     A   T   G   G   A   T   C   A   A   C
Cat                  A   T   ?   G   A   C   C   A   A   C
Dog                  A   T   G   G   A   C   C   A   T   CT
Bear                 A   T   G   A   A   C   C   A   T   C
Otter                A   T   G   G   A   C   C   A   T   C
Seal                 A   T   G   A   A   C   -   -   -   C
A distance matrix
A distance matrix differs from a character-state data matrix in that instead of including
entries for T taxa and C characters, it includes a single value that summarizes the
evolutionary difference (dissimilarity), or its converse (similarity), of each pair of taxa.
For T taxa, there are (T x (T-1))/2 pairwise distances. An example is shown:
taxon     sealion  walrus   seal     bear     raccoon  weasel   dog      civet    hyaena   cat
horse     0.290    0.289    0.297    0.270    0.293    0.299    0.288    0.250    0.274    0.250
sealion   -        0.028    0.058    0.134    0.156    0.148    0.187    0.196    0.207    0.202
walrus             -        0.055    0.135    0.155    0.147    0.188    0.194    0.209    0.197
seal                        -        0.134    0.161    0.154    0.181    0.198    0.205    0.199
bear                                 -        0.139    0.156    0.185    0.179    0.193    0.181
raccoon                                       -        0.130    0.205    0.214    0.221    0.214
weasel                                                 -        0.205    0.208    0.219    0.213
dog                                                             -        0.210    0.217    0.202
civet                                                                    -        0.092    0.081
hyaena                                                                            -        0.092
In this case the matrix lists the difference (dissimilarity) of each pair of taxa, meaning
that the higher the number the greater the distance between two taxa. Sometimes a
matrix will show pairwise similarity, in which case the higher the number the shorter the
distance between the two taxa.
Immunological cross-reactivity and DNA-DNA hybridization data, which are
rarely collected nowadays, are obtained directly in the form of pairwise
similarity or difference. More commonly, a distance matrix is derived from a
character-state matrix.
The simplest way to convert a character-state matrix into a distance matrix is to calculate
the proportion of characters for which a pair of taxa differ in state (and repeat this for all
pairs of taxa). For example, as shown below, the dog and cat in the example matrix differ at two
out of 10 morphological characters, representing a dissimilarity of 0.2 (or a similarity of
0.8). There are many complications in calculating distances, a few of which will be
discussed in Chapter 11. It is worth noting that because many character state matrices
can yield the same distance matrix you cannot infer a character state matrix from a
distance matrix.
Cat 0 1 1 1 1 1 0 1 1 1
Dog 0 0 1 1 0 1 0 1 1 1
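The calculation just described can be sketched in a few lines. This is a minimal illustration rather than a recommended distance method: the function name is my own, and it deliberately ignores the complications of unknown and polymorphic scores, so it is applied only to three fully scored taxa from the example matrix.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of characters at which two equally long lists of states differ."""
    if len(a) != len(b):
        raise ValueError("both taxa must be scored for the same characters")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


scores = {
    "Horse": "0 1 0 0 1 1 0 0 0 0".split(),
    "Cat":   "0 1 1 1 1 1 0 1 1 1".split(),
    "Dog":   "0 0 1 1 0 1 0 1 1 1".split(),
}

# One distance per unordered pair of taxa.
distances = {(a, b): p_distance(scores[a], scores[b])
             for a, b in combinations(scores, 2)}

for (a, b), d in distances.items():
    print(f"{a} vs {b}: dissimilarity {d:.1f}, similarity {1 - d:.1f}")
```

The cat and dog rows differ at characters 2 and 5, giving the dissimilarity of 0.2 quoted above. Going the other way is impossible: the value 0.2 does not say which two characters differ.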
Generating an aligned DNA sequence data matrix
Most published phylogenetic results are derived from analysis of DNA sequence data.
To provide some context, it will be useful to briefly review the experimental methods that
are used to collect DNA sequence data.
Sequence databases such as GenBank contain a phenomenal amount of sequence data
from a very large number of different species. As a result, it is sometimes possible to
conduct a phylogenetic study without collecting any new sequence data. All that is
required is to download a particular annotated gene from a set of taxa of interest.
However, because the best phylogenetic research utilizes large quantities of data (many
genes) for a representative set of taxa, such work still usually requires the generation of
new DNA sequence data for the specific purpose of the phylogenetic study.
Usually phylogenetic research begins with a choice of a gene to study. The aim is to pick
a gene with a suitable level of variation: too little variation and the research is not
deploying resources efficiently; too much variation and the data are likely to be messy (for
reasons that will be discussed). Additionally, it is desirable to select a gene that is present
as a single copy in all the organisms being studied, because extra copies can confound
interpretation of the resulting trees (Chapter 13). Lastly, practical issues, such as the
amount of preexisting data and the ease with which the gene can be isolated, always
influence the choice.
Appendix 7.1.i provides a brief summary of the molecular methods used to obtain DNA
sequences for each taxon in a data matrix. The sequences are composed of the letters, A,
C, G, and T corresponding to the four bases, adenine, cytosine, guanine, and thymine.
There may also be additional letters such as R, Y, S, W, M, and K, which are used to
indicate uncertainty in the identity of a base at a certain position (for example, R means
that either an A or G is present, Y refers to C or T, etc.). ‘N’ indicates that an unknown
nucleotide is present, whereas ‘?’ is used to indicate a position where the editor was
uncertain if there was or was not a base at all, for example if the data are ambiguous as to
whether there are 9 or 10 A’s in a row in the sequence.
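The ambiguity codes can be written down as a small lookup table. The sketch below covers only the two-base codes named above plus N (the standard nomenclature also defines three-base codes, omitted here); the function name is my own. Because ‘?’ marks doubt about whether a base exists at all, it is left out of the table and simply reported.

```python
# Each symbol maps to the set of bases it may stand for.
AMBIGUITY_CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"},   # purine
    "Y": {"C", "T"},   # pyrimidine
    "S": {"C", "G"},
    "W": {"A", "T"},
    "M": {"A", "C"},
    "K": {"G", "T"},
    "N": {"A", "C", "G", "T"},   # any base
}


def ambiguous_positions(sequence):
    """Return (position, symbol) pairs, numbered from 1, for every symbol that is not A, C, G, or T."""
    return [(i, s) for i, s in enumerate(sequence.upper(), start=1)
            if s not in ("A", "C", "G", "T")]


# A short fragment of the kind of raw sequence shown in this chapter.
fragment = "TAC?TTTTGTCTTTAAASACTTT"
for position, symbol in ambiguous_positions(fragment):
    meaning = AMBIGUITY_CODES.get(symbol, "base may or may not be present")
    print(position, symbol, meaning)
```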
Here are two sequences as they might be obtained.
CGTTTATGGTGACGGAGCCGGGGGAGGTAGCACGTGGCAAAAAGAACGGCCTCGATTATCTCTTCCATCTT
TACGAACAGTGCCGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAGCGCGGCGAAAAATGCCCCAC
CAAGGTAACAATAGAAACAAATCTATTTTTAATGTTTCTTAAGTAAAATTTTGAATTCAAGCTCCGTAAAT
GAATGAAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGATAGTTTT
TGCATGGTGCACGAGGTTTGACACGGGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCTTAGAACT
GGACCAGCCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGGGAGGAAT
GGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAAAAGGTTG
TAC?TTTTGTCTTTAAASACTTTACTGTCTTCCTTTCTGAAGCCTCGTTTTCCCTGTCCGGTTTAGCTGAG
GTGGCGTGACCCTAATACGACAGCTCCACCAYTTTTGGATCCTAATCTTATTGCTTATACAGGTGACCAAC
CAAGTTTTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAARCCCAAAATGMGCCATTACGTCGG
CAGGA
TCACCCACGACCGTTCATGGTGACGGAGCCGGGGGAGGTAGCAAGTGGCAAAAAGAACGGCCTCGATTATC
TCTTCCATCTTTACGAGCAGTGCAGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAACGCGGCGAA
AAATGCCCCACGAAGGTAACAATAGAAACAAATCTATTTTTAATGATTCTTAAGTAAAATTTTGAATTCAA
GCGTAAATGAATGAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGA
TAGTTTTTGCATGGTGCACGAGGTTTGACACGTGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCT
TAGAACTGGACCAACCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGG
GAGGAATGGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAA
AAGGTTGTACTTTTTGTCTTTAAACACTTTACTGTCCTCCTTTCTGAAGCCTCGTTTTCCCCTGTCCGGTT
GAGCTGAGGTGGCGTGACCCTAATACGACAGCTCCATTGGATCCTAACCTTGTTACTTATACAGGTGACCA
ACCAAGTCCTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAAACCCAAAATGCGCCACTATGTC
You will see that they are of slightly different lengths, that is, they contain different numbers of bases.
In order to use these sequences for phylogenetic analysis, the first step is to
align them to one another so that a nucleotide position in one is matched up
with the homologous position in all the other sequences.
A data matrix is composed of characters that are shared by the taxa but which potentially
differ in state: for example, hair (the character) may be white, brown, or black (the
character states). For DNA sequences, the character is the nucleotide position (numbered
1, 2, 3 etc.) and the states are the nucleotides (A, C, G, and T). It is critical, therefore,
that nucleotides in each taxon be assigned to the correct positions. This process, called
sequence alignment, involves sliding the sequences over one another and inserting gaps,
guided by the sequences themselves. Sequence alignment, done properly, poses severe
computational challenges and has become a very technical subject. Here, I will just
summarize the underlying issues and point to some additional resources.
A DNA strand is a physical structure with nucleotides in a specific linear order. In the
simple case where the only kind of mutations are base-substitutions, each nucleotide
position in one taxon would be homologous to a nucleotide position at the same place in
the sequence of another taxon: position 1 in taxon A will be homologous to position 1 in
taxon B, position 2 to position 2, and so on. If we write out the two sequences the
homologous positions are aligned above one another.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T A A C C A T T G T C T A G C A A
If the only kind of mutation were base substitution, then having found the homologous genes
you would merely need to line up one homologous position and the rest of the
alignment would be trivial. However, sequences are subject to additional kinds of mutation:
deletions, insertions, inversions and translocations.
A deletion involves the removal of one or multiple contiguous bases. Deletions may be
due to errors during DNA replication, but can also happen during the “life” of a DNA
molecule due to imperfect DNA repair following chemical or radiation induced damage,
unequal crossing-over during recombination, or due to the actions of mobile genetic
elements. Deletions can be very short (even one base pair) or very long (entire genes).
When deletions happen, nucleotide positions in the parent strand lack homologs in the
daughter strand: they have gone extinct.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T - - - - - T T G T C T A G C A A
The same mechanisms that cause deletions (errors during replication and recombination,
DNA damage, and mobile genetic elements) can cause the insertion of DNA sequences
into a strand. In some cases inserted sequences are duplicates of nucleotide positions in
the parent strand. In that case, bases in the insertion have homologous positions in the
parent strand (but one parental position may be homologous to two daughter positions).
In other cases the inserted sequence will be novel and will lack identifiable homologous
positions – in effect a new nucleotide position has been created from scratch.
Parent:    G T A T T G A C C - - - A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T A A C C A C C A T T G T C T A G C A A
Inversion events occur when a piece of DNA is effectively cut out and then replaced in
the opposite orientation.
| | | | | | | | | | | | | |
Offspring:G C A T T A G T C A C C A T C T A G C A A
A translocation involves a piece of sequence being moved to a different position within
the sequence.
| | | | | | | | | | | | |
Offspring:G C A T C A T T T A A C G T C T A G C A A
Inversions and translocations both yield cases in which there is a single nucleotide
position in a daughter strand that is homologous to each position in a parent strand, but
the sequences are not in the same linear order. While inversions and translocations are
commonly identified, the major focus of most alignment programs is on inserting gaps in
sequences to capture the history of insertions and deletions, collectively indels.
The process of sequence alignment aims to align homologous positions based on the true
history of sequence evolution. Alignment is, thus, properly viewed as a problem of
historical inference. Furthermore, because base substitution, indels, and other structural
mutations occurred along the branches of the true gene tree, sequence alignment and tree
inference are really two aspects of the same problem. Therefore, in the ideal world, we
would have computer programs that could take raw, unaligned sequences and search for
trees that could simultaneously account for the bases in the sequences and their structural
evolution. A few programs do conduct combined alignment and phylogenetic inference
(e.g., POY; http://research.amnh.org/scicomp/projects/poy.php). However, the problem
is so computationally challenging that it is necessary to make a number of unrealistic
assumptions to make the programs work. Therefore, the vast majority of phylogenetic
analysis separates the two problems: first generating an alignment, and then provisionally
accepting that alignment as the basis for phylogenetic inference.
To get a feel for how sequence alignment can be conducted free of a phylogeny, see if
you can align the following pair of sequences.
A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A
You might conclude that these sequences are already well aligned and that the two
sequences differ by five base substitution events. While this might be the best alignment,
it is worth considering alternatives that can also explain these data through the addition of
insertion/deletion, or indel, events. For example, you could align these same two
sequences by invoking eight indels and no substitutions, or three base substitutions and
two indels. These three alternatives are shown.
Five substitutions:
A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A
Eight indels:
A T G A C - C A G - - T A - C G G C - T T T A
A T G A - T C - - G A T A T - G G C A - T T A
Three substitutions and two indels:

A T G A C C A G - - T A C G G C T T T A
A T G A T C - - G A T A T G G C A T T A
To choose between these we need to ask ourselves whether it is more likely that there
were five substitutions, eight indels, or three substitutions and two indels. Assuming that
the general rate of evolution is low then it is probably reasonable to favor five events over
eight, thereby rejecting the second alignment. Furthermore, most data from molecular
biology would say that base substitutions are more frequent than indels (especially in
coding genes), which means that we would normally favor the first alignment.
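The event counts for the three candidate alignments can be checked mechanically. The following sketch is one way to do so; the function name is my own, a run of consecutive gaps in the same row is counted as a single indel event, and any mismatch between two symbols is counted as a substitution.

```python
def count_events(top, bottom):
    """Return (substitutions, indel_events) for two aligned, space-separated rows."""
    a, b = top.split(), bottom.split()
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    substitutions = 0
    indels = 0
    previous_gap_row = None            # row that held a gap in the previous column
    for x, y in zip(a, b):
        gap_row = "top" if x == "-" else "bottom" if y == "-" else None
        if gap_row is not None:
            if gap_row != previous_gap_row:
                indels += 1            # a new run of gaps starts here
        elif x != y:
            substitutions += 1
        previous_gap_row = gap_row
    return substitutions, indels


alignments = {
    "five substitutions": (
        "A T G A C C A G T A C G G C T T T A",
        "A T G A T C G A T A T G G C A T T A"),
    "eight indels": (
        "A T G A C - C A G - - T A - C G G C - T T T A",
        "A T G A - T C - - G A T A T - G G C A - T T A"),
    "three substitutions and two indels": (
        "A T G A C C A G - - T A C G G C T T T A",
        "A T G A T C - - G A T A T G G C A T T A"),
}

for name, (top, bottom) in alignments.items():
    print(name, count_events(top, bottom))
```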
This example allows us to state a rule that is applied in almost all sequence alignment
programs: only invoke an indel if at least one base substitution is avoided. Indeed it is
normal to set the gap penalty, the threshold for the number of base substitutions avoided
before an indel is inferred, higher still: gap penalties of three to twenty are common.
Additionally, most computer programs impose an extra cost for longer gaps or gaps at the
ends to avoid alignments such as the following, which avoid all base substitutions at the
“cost” of two indels.
- - - - - - - - - - - - - - - - - - A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A - - - - - - - - - - - - - - - - - -
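One common way to impose an extra cost on longer gaps is to charge a fixed amount for opening a gap plus a further amount for every position the gap covers. The sketch below scores alignments this way. The function name and all penalty values are assumptions chosen for illustration, and real programs differ in the details (for instance, in whether the first gap position pays the extension cost).

```python
def alignment_cost(top, bottom, substitution=1.0, gap_open=2.0, gap_extend=0.0):
    """Total cost of an alignment given as two space-separated rows of equal length."""
    a, b = top.split(), bottom.split()
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same number of columns")
    cost = 0.0
    previous_gap_row = None
    for x, y in zip(a, b):
        gap_row = "top" if x == "-" else "bottom" if y == "-" else None
        if gap_row is not None:
            if gap_row != previous_gap_row:
                cost += gap_open       # charged once per run of gaps
            cost += gap_extend         # charged for every gap position
        elif x != y:
            cost += substitution
        previous_gap_row = gap_row
    return cost


seq1 = "A T G A C C A G T A C G G C T T T A"
seq2 = "A T G A T C G A T A T G G C A T T A"
pad = " ".join("-" * 18)               # eighteen gap symbols

ungapped = (seq1, seq2)
shifted = (pad + " " + seq1, seq2 + " " + pad)

print("ungapped:", alignment_cost(*ungapped))
print("shifted, no length cost:", alignment_cost(*shifted))
print("shifted, length cost of 1:", alignment_cost(*shifted, gap_extend=1.0))
```

With a gap penalty of 2 and no cost for length, the shifted alignment (cost 4) would beat the ungapped one (cost 5). Charging 1 per gap position raises its cost to 40, so the ungapped alignment is preferred.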
That being said, a smart alignment program will permit gaps to be inserted at the end in
cases such as the following.
A G C A T G G C T A T A G A T A C C
A T G A C C A G T A C G G C T T T A - - - - - -
- - - - - - A G C A T G G C T A T A G A T A C C
Here the “gaps” probably do not represent indel events but, rather, sequencing reactions
that started and/or ended at different positions in the sequence. A purist might therefore
change the gaps to uncertainty codes (below), although this will not affect phylogeny
reconstruction programs.
A T G A C C A G T A C G G C T T T A ? ? ? ? ? ?
? ? ? ? ? ? A G C A T G G C T A T A G A T A C C
It is probably clear that alignment is easiest and most certain when both base substitutions
and indels are rare. This is because matched parts of the sequence provide a framework
for identifying the position and size of indel events. For example, below are two true
alignments. Which do you think would yield data that were easier to align?
A T G A - - - T G C A G C T T T A G G T A
? C A A C A G T A C G A - - C T A C - C A
A T G A C C A G T A C A G - T T T A ? ? ?
A C G T C C - - T A C G G C T T C A G T A
The answer is the second one. While the number of indels is similar, the many extra base
substitutions in the top case would make it very hard to identify the true alignment with
confidence.
Computer programs can often squeeze more information out of sequences by taking into
account not just the number of base substitutions but the kind of substitution. For good
molecular reasons (associated with repair mechanisms), purine-to-purine (A ⇔ G) and
pyrimidine-to-pyrimidine (C ⇔ T) substitutions, collectively called transitions, are more
frequent than changes from a purine to a pyrimidine or vice versa (A ⇔ C, A ⇔ T, G ⇔
C, G ⇔ T), collectively called transversions. Computer programs can use differential
penalties on transitions and transversions so as to pick alignments that more closely
match the underlying molecular processes. Likewise, functional constraints on proteins
mean that certain amino acid substitutions, ones entailing amino acids with similar
chemical properties, are more frequent than others. These too can be taken into account
by computational algorithms.
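The distinction between transitions and transversions is simple to compute. As a sketch (the function name is my own), the code below classifies each aligned position in the ungapped pair of sequences used earlier in this chapter.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Classify a pair of bases as 'identical', 'transition', or 'transversion'."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not an unambiguous base: {base!r}")
    if x == y:
        return "identical"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


seq1 = "ATGACCAGTACGGCTTTA"
seq2 = "ATGATCGATATGGCATTA"
counts = Counter(classify_substitution(x, y) for x, y in zip(seq1, seq2))
print(counts)
```

A program that penalizes transversions more heavily than transitions would use these two counts, rather than their sum, when comparing alternative alignments.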
Pairwise alignment considers just two sequences at a time whereas multiple alignments
include sequences from many taxa so as to obtain an entire aligned data matrix. A
pairwise alignment is relatively simple for a computer to determine, even when a
complex set of penalties is implemented. Multiple alignments are, however,
disproportionately more difficult. As the number of sequences being aligned increases,
the number of possible alignments goes up exponentially.
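To show why the pairwise case is tractable, here is a minimal sketch of global pairwise alignment by dynamic programming, the approach usually credited to Needleman and Wunsch. It is a simplification of what alignment programs do: every gap position costs the same amount (no separate cost for opening a gap), all substitutions cost the same, and the penalty values are assumptions. Costs should be integers so that the traceback comparisons are exact.

```python
def global_align(seq_a, seq_b, substitution=1, gap=3):
    """Return (cost, aligned_a, aligned_b) for a minimum-cost global alignment."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the minimum cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if seq_a[i - 1] == seq_b[j - 1] else substitution
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,   # pair the two bases
                             cost[i - 1][j] + gap,            # gap in seq_b
                             cost[i][j - 1] + gap)            # gap in seq_a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if seq_a[i - 1] == seq_b[j - 1] else substitution
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                top.append(seq_a[i - 1])
                bottom.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_cost, row_a, row_b = global_align("ATGACCAGTACGGCTTTA", "ATGATCGATATGGCATTA")
print(best_cost)
print(row_a)
print(row_b)
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. Extending the same table to many sequences at once adds a dimension per sequence, which is why programs resort to the stepwise strategies described next.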
Multiple alignment algorithms should allow the placement of gaps in one sequence to
influence the placement of gaps in other sequences. This is because, when gaps in two
species are in the same position they can be attributed to a single indel occurring
somewhere on the gene tree. Nonetheless, most multiple alignment programs start by
making a pairwise alignment and then gradually align additional sequences to the earlier
alignment. In this procedure, a gap introduced early is retained even if the addition of
later sequences might suggest an alternative position for the gap.
Given the difficulties faced by alignment programs, humans can often visually identify
the more egregious mistakes. Thus, while computer multiple alignment programs (e.g.,
CLUSTAL) provide a good starting point, they usually need to be examined and adjusted
by eye. To illustrate this, here is a problematic portion of an alignment that was actually
returned by CLUSTAL and an eyeball-edited version of the same. You might notice that
not only does the second alignment imply a simpler mutational history, but also the
human editor could take account of the codon structure of this gene (something that few
computer programs keep track of) so as to place both gaps in the same reading frame.
Given that sequence alignments will vary depending on how they were generated (what
algorithm, what penalties, and whether they were manually edited), you might worry that
phylogenetic analysis of aligned sequences is invalid. Actually, the problems are less
than they may seem.
Usually, even if some part of a sequence is hard to align unambiguously, many regions
can be aligned confidently. It is common practice to either exclude regions of ambiguous
alignment from the phylogenetic analysis or to repeat the analysis using a range of
alternative alignments to see if the inferred trees vary significantly. And even in the
worst-case scenario where the entire gene is hard to align, the data matrix is likely to
show a lack of clear phylogenetic signal rather than a strong, misleading signal. Thus,
even if a suspect alignment is used, statistical analysis of the phylogenetic conclusions
(Chaps X-XX) will probably show that they are weak.
Generating a morphological data matrix
Although the field of phylogenetics was founded on morphological data, most modern
research uses molecular data, especially DNA sequences, instead. Phylogenetic analysis
of morphological data turns out to be more challenging than the analysis of DNA
sequences, and it is more difficult to obtain statistically significant support for
phylogenetic conclusions. However, fossils can only be analyzed through the use of
morphological data (scored for them and related living species). Additionally, even when
trees come from different kinds of data it often becomes necessary to build a
morphological data matrix as a means to use the tree to reconstruct the evolution of
morphological traits. Therefore, despite the preeminence of molecular phylogenetics, it
is important to know how morphological data are scored and assembled into data
matrices.
Two steps can be recognized in the building of a morphological matrix, which I will call
character encoding and character scoring. Character encoding involves deciding on the
limits of characters and on the alternative states that are recognized for each character.
For example, when you decide to score fur color using two states, brown/black and white,
then you have encoded one character (fur color). Character scoring involves looking at
each taxon and assigning it a state for each encoded character. For example, you could
score an otter as having brown/black fur. In practice, observations made while scoring
taxa often result in changes being made to character encoding, but it is still useful to
distinguish these two steps.
Character encoding involves defining the characters and character-states. Character
encoding is trivial for DNA sequence data because they are inherently divisible into
characters (positions) each of which can adopt one of four states (A, C, G, or T). In
contrast, for morphological data, the recognition of characters and character-states needs
to be determined by consideration of the patterns of variation among taxa.
Once a set of taxa has been selected for study a systematist generally starts by looking for
characteristics that appear to vary among them. Notice that whereas many characters in a
DNA sequence data matrix may be invariant, the way that morphological characters are
selected means that constant characters are usually not included.
Once some variation has been noted the next challenge is to clearly encode the
characters. This is not straightforward. For example, imagine that you observed that leaf
shape and size differed among a set of eight plant species, as shown in the figure. How
would you capture this variation?
[Figure: leaf outlines for the eight plant species, labeled A to H.]

Consider two of the numerous possible ways to encode this variation. (1) You recognize
two basic leaf shapes, cordate (heart-shaped) and obcordate (with the widest point near
the top), and two size classes. (2) You encode leaf length, leaf width, and the height of
the widest point. The scoring that might result from these two encoding schemes is shown
in the two matrices below.
Taxon   Leaf shape                      Leaf size
        (0 = cordate; 1 = obcordate)    (0 = small; 1 = large)
A       0                               1
B       0                               0
C       0                               1
D       0                               1
E       1                               0
F       1                               1
G       1                               1
H       1                               01
Taxon   Leaf length              Leaf width                Height of widest point
        (0 = short; 1 = long)    (0 = narrow; 1 = wide)    (0 = below middle; 1 = above middle)
A       1                        0                         0
B       0                        0                         0
C       1                        0                         0
D       1                        0                         0
E       0                        1                         1
F       1                        0                         1
G       1                        0                         1
H       1                        1                         1
The decision among alternative encoding schemes is guided by a few, potentially
conflicting considerations. You want to capture as much of the variation as possible
without “double counting.” Scoring the same basic variation multiple times results in
overweighting that variation to the point where it will dominate the phylogenetic results.
For example, you might be concerned that by measuring both length and width of these
leaves you might score one basic trait, leaf size, twice.
Another important consideration is that the character states recognized
should really be versions of the same character. This is not always easy to
decide. Suppose that close relatives of these plants have compound leaves.
Should their “leaf shape” be encoded based on the individual leaflets or the
outline of the whole compound leaf?
Once the characters are defined, the next question is how to delimit
character states. If the variation is rather discrete between taxa and with
little variation within taxa, as illustrated by leaf shape in the example
above, then it may be easy to encode the categories. However, most
morphological traits are inherently continuous and variation within taxa is
usual. Thus, it can be difficult to divide continuous variation into the discrete states
needed for phylogenetic analysis. The graph below shows hypothetical data on leaf
length in ten species.
[Figure: hypothetical leaf length (cm) for each of ten species, labeled A to J.]

You might see this as three “clusters” of taxa corresponding to three states: small (A, E),
medium (D, F, H), and large (B, G, I). In that case, you might score C as polymorphic for
small and medium and J as polymorphic for medium and large. Or you could recognize
two classes (small and large) or even five (A, E; C; D, F, H; J; B, G, I). And, sadly, there is
no well-grounded theory to tell you which of these encoding schemes will yield the
best estimate of the phylogeny of these species.
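The effect of the choice of boundaries can be sketched in code. Every number below is invented for illustration: the per-species ranges and the class boundaries are my own assumptions, chosen only to be consistent with the clusters just described. A taxon whose range crosses a boundary is scored as polymorphic.

```python
import bisect

# Invented (minimum, maximum) leaf lengths in cm for species A to J.
LEAF_LENGTH_RANGES = {
    "A": (1.0, 2.5), "E": (1.5, 3.0),
    "C": (3.0, 5.0),
    "D": (5.0, 6.5), "F": (5.5, 7.0), "H": (4.5, 6.0),
    "J": (7.0, 9.5),
    "B": (10.0, 12.0), "G": (9.0, 11.0), "I": (10.5, 12.5),
}


def score_states(low, high, boundaries):
    """Return the set of states whose size class overlaps the range low..high.

    boundaries must be sorted; state 0 lies below the first boundary,
    state 1 between the first and second, and so on.
    """
    first = bisect.bisect_right(boundaries, low)
    last = bisect.bisect_right(boundaries, high)
    return set(range(first, last + 1))


def score_all(ranges, boundaries):
    return {taxon: score_states(low, high, boundaries)
            for taxon, (low, high) in ranges.items()}


three_state = score_all(LEAF_LENGTH_RANGES, [4.0, 8.0])   # small, medium, large
two_state = score_all(LEAF_LENGTH_RANGES, [8.0])          # small, large

for taxon in sorted(LEAF_LENGTH_RANGES):
    print(taxon, sorted(three_state[taxon]), sorted(two_state[taxon]))
```

Under the three-state scheme C and J come out polymorphic, whereas under the two-state scheme only J does. The same measurements therefore yield different data matrices depending on how the states are delimited.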
Taken together, you can probably see that there are many somewhat subjective decisions
that must be made in encoding morphological data and that these are likely to be adjusted
by observations made while scoring individual taxa. Nonetheless, the aim is to
eventually define the encoding scheme so clearly that any researcher would score the
same taxa and arrive at the same morphological data matrix. However, even though
scoring can be objective once encoding is completed, morphological data matrices are
always somewhat subjective because data encoding cannot be rendered fully objective.
While subjectivity is something that makes scientists uncomfortable, the fact that there is
some subjectivity does not invalidate morphology as a source of phylogenetic data. So
long as different encoding schemes capture the actual variation among taxa, then they
should yield similar estimates of the tree. However, since different encoding schemes
can result in different trees being chosen, as with sequence alignment, it is considered
good practice to try a few different schemes and see if the phylogenetic conclusions
remain the same.
Classes of data used for phylogenetic analysis
Information for phylogenetic inference can theoretically come from any traits that we
believe evolved within the constraints of the underlying tree. A trait that varies among
tips and shows some degree of heritability (ancestors having the trait tend to give
descendants that have it too), thus, has the potential to provide phylogenetic information.
Prior to the advent of modern molecular methods, phylogenetic analysis was conducted
primarily on morphological variation. Nowadays, the great majority of phylogenetic
analyses are conducted based on DNA sequences obtained from representative
individuals. In the following chapters we will examine methods that have been
developed for phylogenetic inference from DNA sequence data and morphological data.
Although I will focus on DNA sequence and morphological data, there are many other
kinds of data that can be used for phylogenetic analysis. In most cases these other kinds
of data can be analyzed using methods developed for DNA sequences or morphology but
in some cases specialized methods need to be used. It is beyond the scope of this book to
explore all these methods. However, below I review most of the kinds of data that are
used for phylogenetic inference and provide some brief discussion as to how they are
typically analyzed.
i. Molecular Sequences
Sequences of peptides or small proteins were the first kind of molecular sequences to be
used for phylogenetics. Originally amino acid sequences were determined chemically.
Now it is much more common to infer an amino acid sequence from either the mRNA or
DNA sequence that encodes the protein.
Protein sequences are still used for phylogeny reconstruction, especially to study ancient
relationships. This is because protein sequences are often subject to functional
constraints that slow down the rate of evolution, reducing the frequency with which a
single position undergoes multiple evolutionary changes of state (multiple hits).
Nonetheless, the general principles of phylogenetic analysis of protein sequences are very
similar to those applied to DNA sequences.
The second kind of molecular sequence to be collected widely was ribosomal RNA
(rRNA). This was because ribosomes could be purified and their RNA could
be sequenced with reverse transcriptase. This technology has now been superseded and
rRNA or messenger RNA (mRNA) sequences are nowadays obtained by sequencing the
encoding DNA (or in the case of mRNA, by copying it back into DNA). Because there is
usually a one to one correspondence between RNA and the encoding DNA, phylogenetic
analysis of RNA sequences is basically identical to that of DNA sequences. The only
complication that arises sometimes is that some RNA molecules adopt folded structures
due to bonding between bases at different positions in the molecule. When this happens,
there can be non-independence between positions in the sequence that interact,
potentially complicating phylogenetic analysis.
ii. Molecular presence/absence data
Over the years a number of different molecular techniques have been developed that
allow researchers to score organisms for the presence/absence of particular molecular
markers. Appendix 7.1 provides a brief description of some of these methods.
The first two methods developed, RFLPs (Appendix 7.1.ii) and isozymes (Appendix
7.1.iii), date back to a time when molecular sequencing was not feasible. Isozyme data are
best suited to population genetic studies and are not well suited to phylogenetic analysis.
RFLP data, in contrast, provided a lot of robust phylogenetic information and played a
major role in the development of molecular phylogenetics. However, for all its historical
importance, RFLP has been superseded in phylogenetics with the advent of inexpensive
and efficient DNA sequencing methods.
RFLP data can be analyzed by treating them as presence/absence data using methods
appropriate to morphological characters. However, in so doing efforts should be made to
take account of an inherent asymmetry in RFLP data: shared sites are more likely to be
lost in parallel than gained in parallel. This is because an entire restriction site (typically
six base pairs long) must match for an enzyme to cut, whereas any one of many possible
mismatches will result in a failure to cut. Parsimony can be modified using character-
state weighting (Chap X) to partially account for this phenomenon. More specialized
methods have been developed that use explicit models of RFLP evolution to obtain better
phylogenetic estimates.
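The asymmetry between site loss and site gain can be illustrated by simple counting. The sketch below uses the EcoRI recognition sequence GAATTC as an example six-base site and a hypothetical near-miss sequence that differs from it at one position. It counts single-base substitutions only and ignores differences in mutation rate, so it shows the direction of the bias rather than its actual size.

```python
BASES = "ACGT"

def single_substitutions(seq):
    """Return every sequence that differs from seq at exactly one position."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in BASES
        if base != seq[i]
    ]

site = "GAATTC"        # EcoRI recognition sequence, used here as an example
near_miss = "GAATTA"   # hypothetical sequence one mismatch away from the site

# Loss: any single substitution within an intact site stops the enzyme cutting.
losses = [s for s in single_substitutions(site) if s != site]

# Gain: only the one substitution that repairs the mismatch creates the site.
gains = [s for s in single_substitutions(near_miss) if s == site]

print(len(losses), "of", len(single_substitutions(site)), "substitutions destroy the site")
print(len(gains), "of", len(single_substitutions(near_miss)), "substitutions create the site")
```

All 18 possible single substitutions destroy an existing site, while only 1 of the 18 substitutions available to the near-miss sequence creates one, which is why parallel losses are expected more often than parallel gains.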
RAPDs (7.1.iv), ISSRs (7.1.v), AFLPs (7.1.vi), and microsatellites (7.1.vii), collectively
called distributed molecular markers, were originally developed for studying variation within
populations. Because the members of a population are closely related, their DNA
sequences are usually very similar. Distributed molecular marker methods quickly scan a
set of taxa for molecular variation distributed anywhere in the whole genome. For this
reason, these markers (of which AFLP is currently the most widely used) have been particularly
popular for phylogenetic studies of closely related organisms, where it can be difficult to
find sufficiently variable gene regions to sequence.
In addition to certain technical problems that are mentioned in the appendix, distributed
molecular markers have one significant drawback for studying phylogenies among
closely related species. The available methods of analysis estimate a common
phylogenetic tree under the assumption that all parts of the genome have tracked the
same history. This is fine if all the genes have indeed tracked the same history. But phenomena
such as incomplete lineage sorting mean that different parts of the genome can have
different phylogenetic trees (chap. 5), and these phenomena will be particularly prevalent
at the low phylogenetic scales that are generally studied with distributed molecular
markers.
You might hope that computer programs for phylogenetic analysis would have a
safeguard built in to detect that the assumptions of the method have been violated. But
unfortunately it is not easy to tell from distributed marker data whether there is a single
shared history. You will usually obtain a well-resolved tree whether there is a single
history of the whole genome or not. The tree may not correspond to the history of the
whole genome but may be some kind of average across the conflicting histories from
different parts of the genome. Therefore, users of distributed marker data need to be
extra vigilant to avoid reading too much significance into the trees they obtain in cases
where discordance among gene trees is suspected.
iii. Molecular structural data
Whereas molecular marker methods have generally been motivated by the search for
variation at low taxonomic scales, structural molecular characters have mainly been of
interest because they tend to evolve very slowly. A rare structural molecular feature
shared by a group of taxa can provide compelling evidence that the taxa form a clade.
Structural molecular characters include insertions and deletions, inversions, and
duplicative or non-duplicative translocations. For each kind of structural mutation, there
is a range of scales from very local mutations (e.g., single base-pair deletions; six base
pair inversions) to large scale ones (e.g., deletion of a whole gene; insertion of an intron;
translocation of an entire chromosome). As a general rule, finding that two taxa share a
small/local structural mutation is considered weaker evidence of a close relationship than
sharing a larger structural characteristic. This is because the probability of homoplasy is
higher in the former case. It is more likely that two taxa independently underwent a
deletion of the same AT than that they independently experienced an insertion of the same
500 base-pair sequence at the same point in the genome.
Structural characters are usually identified by gene or genome sequencing, restriction
mapping, or by microscopic observation of chromosomes. In some cases the kind of
structural mutation involved can be unambiguously inferred. For example, when a large
region of otherwise quite similar sequence is inverted in some taxa relative to others, a
molecular inversion is clearly implied. In other cases a diversity of different mutational
processes could have contributed to an observed pattern. For example, if tips differ in
their gene order along a chromosome, different combinations of duplications,
translocations, inversions, and deletions might provide competing explanations for the
same basic data.
The use of molecular structural characters for phylogenetic inference generally follows
one of two approaches. The first involves scoring tips for the presence or absence of a
number of structural characters and then using parsimony or other standard phylogenetic
methods to infer the tree that best explains the full set of characters. This approach
allows that any of the characters could have been subject to homoplasy.
The alternative approach is invoked when a particular structural character is considered to
have resulted from such an improbable event that homoplasy is ruled out. In that case the
Hennigian logic is invoked. This means that once the structural character is polarized
(e.g., by looking in outgroups), a clade is inferred.
A simple example of the latter approach is provided by a study of land plants conducted
in 1992 by Linda Raubeson and Robert Jansen, then at the University of Connecticut.
They used RFLP mapping approaches to show that the clubmosses and their allies (the
lycophytes) have a major inversion in their plastid genome relative to all other vascular
land plants. Furthermore, the condition found in lycophytes resembled the outgroups
(mosses and liverworts). From this they concluded that the lycophytes must be outside a
clade that includes all other living vascular plants.
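[Figure: tree of the outgroup (gene order A), lycophytes (gene order A), and other vascular
plants (gene order B), with the inversion from A to B marked on the branch leading to the
other vascular plants.]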
Abundant subsequent research on land plant phylogeny has confirmed the conclusion
reached based on this inversion. It seems that this inversion really did occur just once in
a common ancestor of all living vascular plants except lycophytes.
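The Hennigian reasoning used in this example can be written out as a short Python sketch. The function name is hypothetical, the gene-order labels "A" and "B" are arbitrary, and "ferns" and "seed plants" are stand-ins chosen here for the other vascular plants. The sketch assumes a two-state character, outgroups that all agree, and no homoplasy, which is exactly the assumption that makes this logic risky in general.

```python
def hennigian_clade(states, outgroups):
    """Polarize a two-state character using outgroups and return the taxa
    that share the derived state.

    Assumes the character changed only once (no homoplasy) and that all
    outgroups show the same state, which is taken to be ancestral.
    """
    outgroup_states = {states[taxon] for taxon in outgroups}
    if len(outgroup_states) != 1:
        raise ValueError("outgroups disagree; the character cannot be polarized this simply")
    ancestral = outgroup_states.pop()
    return sorted(taxon for taxon, state in states.items() if state != ancestral)

# Plastid gene order scored as an arbitrary label for each tip.
plastid_gene_order = {
    "mosses": "A", "liverworts": "A",   # outgroups
    "lycophytes": "A",
    "ferns": "B", "seed plants": "B",   # illustrative "other vascular plants"
}

clade = hennigian_clade(plastid_gene_order, outgroups=["mosses", "liverworts"])
print(clade)
```

Because the lycophytes share the outgroup condition, they fall outside the inferred clade, which contains only the taxa carrying the inverted gene order.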
Another classic example that fits this basic principle is provided by chromosomal inversion
phylogenies, which were reconstructed based on a logical analysis of a series of nested
chromosomal inversions in Hawaiian fruit flies. Fruit flies have polytene chromosomes
in the salivary glands that can be stained to reveal banding patterns, which may be
observed under the microscope. In 1982, Carson summarized more than a decade of such
data and was able to provide a detailed phylogenetic tree for approximately 103 fly
species, a tree that has largely been validated based on DNA sequence data (O’Grady et
al. 2001; BMC Evo Bio).
Nonetheless, while structural molecular characters have provided definitive data in many
cases, this does not mean that the Hennigian logic will always succeed. Our knowledge
of molecular processes is not good enough to definitively rule out independent origins of
the same structural mutations. Therefore, even when clades are supported by supposedly
rare structural mutations, biologists still hope to corroborate those clades through the use
of other kinds of data.
iv. Morphology and other morphotypic data
As discussed earlier in this chapter, morphological features that reflect some underlying
aspects of the genotype, making them heritable, can be used for inferring phylogenetic
relationships. The basic approach with such data is to examine the variation among the
taxa and develop an encoding scheme for summarizing that variation with a number of
discrete states. As discussed earlier the delimitation of characters and character states can
be tricky, depending on a series of somewhat ambiguous judgment calls.
There are a number of kinds of data that do not involve morphology in the strict sense
(gross physical features of organisms), but nonetheless contend with similar problems of
character encoding. These are sometimes called “phenotypic data,” but I don’t find that
term appropriate given that all features of organisms we observe are technically
phenotypic. Rather, I will use the term morphotypic to refer to those kinds of data for
which character encoding is not defined a priori, but is guided by the observed variation
among the taxa combined with insights into the homology and independence of the traits.
Below is a list of some kinds of morphotypic data that have been used for phylogenetic
analysis. It is common for a data matrix to include more than one kind of morphotypic
data.
a) Morphology (sensu stricto): Gross physical features of organisms
b) Anatomy: Microscopic features of organisms
c) Molecular morphology: The shape of particular molecules or molecular
complexes, for example ribosomes. Sometimes this variation can be captured
effectively using sequence data, but occasionally the shape of subcellular
structures has been treated as morphotypic data.
d) Development: How traits change during the lifetime of an organism
e) Behavior: How the organisms tend to behave. Mating behaviors have been
particularly widely used.
f) Secondary biochemistry: Compounds that organisms accumulate and chemical
reactions that their cells can conduct. The production of defensive secondary
chemicals has proved important in plants, whereas fungi are often scored for the
ability to cause a color reaction given a particular substrate. In some cases the
biochemical trait corresponds to the presence or absence of a particular gene, in
which case this kind of data blends into structural molecular or molecular
presence/absence data.
g) Biogeography: The geographic distribution of organisms can theoretically be
scored like other morphotypic traits. More commonly geographic history is
mapped onto phylogenies once they are determined.
h) Ecology: The preferred habit or way of life of the organisms, which is to say
aspects of their ecological niche. Examples include pollinator identity, prey type
or preferred habitat type.
v. Frequency data
If different individual organisms scored from a given taxon vary in a trait, one strategy
(the usual one nowadays) is to score that taxon as polymorphic, i.e., containing multiple
states, but not to worry about the frequency of each variant within the taxon. However,
in earlier times the frequency of a polymorphic character, usually an isozyme or RFLP
allele, was considered to provide evidence of relatedness. Therefore, a data matrix could
be constructed in which the entries are not the presence or absence of a state, but the
frequency of different alleles in different taxa, as illustrated below. You will see that for
each taxon, the allele frequencies at each locus sum to 1.0.
         Locus 1          Locus 2     Locus 3
Taxon    1a   1b   1c     2a   2b     3a   3b   3c   3d   3e
A        0.0  0.9  0.1    0.0  1.0    1.0  0.0  0.0  0.0  0.0
B        0.0  0.3  0.7    0.2  0.8    0.4  0.6  0.0  0.0  0.0
C        0.1  0.4  0.5    0.1  0.9    0.1  0.8  0.1  0.0  0.0
D        0.9  0.1  0.0    0.8  0.2    0.0  0.1  0.0  0.9  0.0
E        0.7  0.3  0.0    1.0  0.0    0.0  0.0  0.0  0.7  0.3
F        1.0  0.0  0.0    0.9  0.1    0.0  0.1  0.1  0.4  0.4
Methods have been developed for analyzing frequency data, most commonly involving
first converting the frequency matrix into a distance matrix. However, nowadays
frequency data are rarely if ever used for phylogeny reconstruction. When taxa share
polymorphisms (e.g., each contains both alleles a and b of the same locus) then it is
questionable whether the taxa are monophyletic entities that can meaningfully be
assigned to a single tip of the tree of life. This is, I think, why frequency data have so
completely fallen out of favor.
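The conversion from a frequency matrix to a distance matrix mentioned above can be sketched as follows, using the allele frequencies from the table. The distance formula here (for each locus, half the summed absolute difference in allele frequencies, averaged over loci) is an assumption chosen for simplicity; the text does not say which measure was used, and several alternatives exist.

```python
# Allele frequencies from the table: taxon -> one list of frequencies per locus.
freqs = {
    "A": [[0.0, 0.9, 0.1], [0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]],
    "B": [[0.0, 0.3, 0.7], [0.2, 0.8], [0.4, 0.6, 0.0, 0.0, 0.0]],
    "C": [[0.1, 0.4, 0.5], [0.1, 0.9], [0.1, 0.8, 0.1, 0.0, 0.0]],
    "D": [[0.9, 0.1, 0.0], [0.8, 0.2], [0.0, 0.1, 0.0, 0.9, 0.0]],
    "E": [[0.7, 0.3, 0.0], [1.0, 0.0], [0.0, 0.0, 0.0, 0.7, 0.3]],
    "F": [[1.0, 0.0, 0.0], [0.9, 0.1], [0.0, 0.1, 0.1, 0.4, 0.4]],
}

# Sanity check: within each taxon, the frequencies at every locus sum to 1.0.
for taxon, loci in freqs.items():
    for locus in loci:
        assert abs(sum(locus) - 1.0) < 1e-9, taxon

def frequency_distance(x, y):
    """Average over loci of half the summed absolute frequency differences.

    The result is 0 when two taxa have identical frequencies at every locus
    and 1 when they share no alleles at any locus. This measure is an
    assumed, illustrative choice.
    """
    per_locus = [
        sum(abs(p - q) for p, q in zip(locus_x, locus_y)) / 2
        for locus_x, locus_y in zip(x, y)
    ]
    return sum(per_locus) / len(per_locus)

taxa = sorted(freqs)
distances = {(s, t): frequency_distance(freqs[s], freqs[t]) for s in taxa for t in taxa}

# Print the full pairwise distance matrix, one row per taxon.
for s in taxa:
    print(s, " ".join(f"{distances[(s, t)]:.2f}" for t in taxa))
```

The resulting square matrix could then be passed to any distance-based tree-building method.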
vi. Distance data
All the other kinds of data discussed are initially scored in the form of a character-state
data matrix. Although they can be converted into a distance matrix, they start out as a list
of discrete states scored for each taxon. However, two kinds of data are collected in the
form of a distance between a pair of taxa: immunological distances and DNA-DNA
similarity.
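As a minimal sketch of how a character-state matrix can be converted into a distance matrix, the Python below computes the proportion of characters at which each pair of taxa differs. The three taxa and their states are hypothetical, and simple proportional difference is only one of many possible distance measures.

```python
from itertools import combinations

# Hypothetical character-state matrix: one string of states per taxon,
# with one character per column.
char_matrix = {
    "taxon1": "00110",
    "taxon2": "01110",
    "taxon3": "11001",
}

def proportional_distance(a, b):
    """Proportion of characters at which two taxa have different states."""
    if len(a) != len(b):
        raise ValueError("taxa must be scored for the same characters")
    differing = sum(x != y for x, y in zip(a, b))
    return differing / len(a)

pairwise = {
    (s, t): proportional_distance(char_matrix[s], char_matrix[t])
    for s, t in combinations(sorted(char_matrix), 2)
}

for (s, t), d in pairwise.items():
    # Similarity is simply one minus the distance.
    print(s, t, "distance", d, "similarity", 1 - d)
```

The conversion discards information: a distance records how many characters differ but not which ones, so many different character-state matrices give the same distance matrix and the original matrix cannot be recovered from the distances.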
Immunological methods were popular in animals in the 1970’s and 1980’s. The intensity
of the reaction between the immune serum of one animal and antigenic proteins from
another animal was scored quantitatively in the laboratory. The underlying logic was that
the greater the time since common ancestry the greater the protein differences, which
should in turn lead to a more intense immune reaction.
Similarly, mainly in the 1980’s, many systematists put great stock in DNA-DNA
hybridization data as a measure of the overall sequence similarity of a pair of genomes.
A typical experiment would involve single stranded DNA from one taxon being attached
to a column and then being allowed to hybridize with single stranded (and radioactively
labeled) DNA from another species. The greater the sequence similarity of the two
genomes, the more tightly the complementary sequences would bind. Overall genomic
similarity could thus be measured by looking at the release of radioactivity when the
column was gradually heated up to melt apart the two strands.
Immunological and DNA-DNA distance measures can only be analyzed using distance
methods, which is somewhat limiting. Also, whole genome distance data do not allow
you to detect the existence of different gene trees for different parts of the genome.
However, the main reason that these data are no longer used for phylogenetic research is
that compared to DNA sequencing the data are harder to generate and less readily
repeatable. Nonetheless, a number of conclusions reached using immunological or DNA-
DNA hybridization data have since been validated using DNA sequence data.
Major Points
Phylogenetic inference is based on variable traits that have been scored for a set of taxa
and have been entered into a data matrix. While there are many kinds of data that can be
used, the most important kinds to know about are DNA sequences and morphology.
DNA sequence data are the most widely used kind of data for studying phylogeny, being easy to
collect in large quantities and relatively simple to analyze using statistical methods. The
most common approaches involve first aligning the DNA sequences to one another by
adding gaps that represent insertion/deletion (indel) events. Morphological data are
needed to add fossils to the tree of life and morphology is often scored as a first step in
using phylogenies to study evolution (Chap. XX). Morphological data do not need
alignment, but the delimitation of characters and character states can be quite difficult
and can introduce an element of subjectivity.
Learning objectives
• Understand the difference between a character-state matrix and a distance matrix.
o Be able to identify the state of a taxon for a particular character based on a
data matrix
o Be able to determine the pairwise distance or similarity of a pair of taxa
given a distance matrix
o Be able to convert a pairwise distance into a pairwise similarity, and vice
versa
o Be able to convert a simple character-state matrix into a distance matrix
based on simple proportional similarity
o Be able to explain why a character-state matrix can be converted into a
distance matrix, but the reverse is not possible
• Understand the nature and challenge of DNA alignment.
o Be able to identify sequence alignment as a problem of homology
assessment of the nucleotide positions, analogous to homology assessment
in morphology
o Be able to explain why alignment and phylogeny reconstruction are two
parts of the same historical inference problem
o Be able to align similar sequences without invoking indels
o Be able to determine the number of indels and the number of substitutions
implied by a pairwise alignment
o Be able to use an argument based on probabilities of indels versus
substitutions to favor one pairwise alignment over another
o Be able to explain why alignment becomes more difficult in more
divergent sequences
• Understand the issues that arise in building a morphological data matrix
o Be able to generate two or more alternative encoding strategies given
information on trait variation
o Be able to give examples of cases where different views on homology
alter the way that characters are encoded
o Be able to recognize as misguided character-state encoding that obviously
over-weights some characters (e.g., by repeating the same character)
o Be able to correctly score taxa for some non-technical traits given a
character encoding scheme
o Be able to distinguish the objectivity of character-state scoring from the
subjectivity of character encoding
• Understand that many kinds of data can be used for phylogenetic inference
o Be able to explain why molecular data are currently the most widely used
tool for inferring phylogenies (abundant, easy to collect, easy to analyze
because there are good models of sequence evolution)
o Be able to defend the use of morphological data for phylogenetic analysis
o Be able to list at least one kind of molecular data besides DNA sequences
that has been used for phylogenetic analysis
o Be able to explain why DNA-DNA hybridization or immunological
distance data cannot be represented in a character-state matrix
o Be able to articulate the difference between character-state scoring and
frequency scoring
