The data that are used to estimate the phylogeny of a set of tips are the characteristics of
those tips. Therefore the success of phylogenetic inference depends in large measure on
the choice of trait data, the accuracy of those data, and the quantity of data obtained.
Whether or not you plan to carry out phylogenetic analyses yourself, it is important to
know the kinds of data that are typically used in phylogenetic analysis and to understand
how those data are organized to permit phylogenetic analysis.
In this chapter I introduce the concept of a data matrix, and focus initially on the two
kinds of data that are most important to know about: DNA sequence data and
morphological data. I end by surveying the full range of data types that have at one time
or another been used for phylogenetic analysis.
The first step in a phylogenetic analysis is to decide on the organisms or species that will
serve as tips. The tips scored for traits in a phylogenetic analysis are usually called taxa
(singular: taxon). [This usage of the term taxon/taxa is similar but not identical to the
“formally named groups or clades” used in the context of biological classification.]
Enough taxa need to be included in a phylogenetic analysis so that the tree will resolve
the phylogenetic questions that motivate the study. This will almost always mean
including multiple taxa from within the group of interest (the ingroup) and at least one,
but perhaps many, outgroups: taxa that are known not to be within the ingroup. The
reason for including such outgroups was discussed briefly in chapter 6, and will be
covered in more detail in chapter 9. The choice of taxa is also heavily influenced by
practical issues, such as the availability of material for study.
Imagine for example that you are studying the phylogeny of the Carnivora, a group of
mammals, which share a number of dental and skeletal traits that support their
monophyly. Your ingroup would likely include at least one representative of each of the
previously identified clades from within Carnivora: dogs (Canidae), cats (Felidae),
hyaenas (Hyaenidae), otters (Mustelidae), raccoons (Procyonidae), bears (Ursidae), civets
(Viverridae), and seals (Pinnipedia). Additionally, individual taxa whose relationships are
uncertain, in this case the giant and red pandas, might be included in the study.
As outgroups, you could theoretically include any taxon that is not a member of
Carnivora. However, the best outgroups are usually ones that are reasonably closely
related to the ingroup so that they can be compared directly for many traits. For example,
you might use non-Carnivora mammals that have previously been found to be closely
related to Carnivora, such as members of the horse/rhino/tapir clade.
Having selected the set of taxa to include in a phylogenetic study, the next step is to
collect information on the traits of those taxa. In order to keep track of the similarity or
differences of the taxa, the data are organized into a data matrix. Two kinds of data
matrix are important to know about.
A character-state matrix is a list of taxa and the state that they manifest for each of a set
of characters. A character-state matrix has one entry for each taxon for each character
scored. Thus, for T taxa and C characters, the total number of data points is T x C.
A distance matrix is a listing of the overall dissimilarity (or more rarely similarity) of
each pair of taxa. A distance matrix has (T x (T-1))/2 entries. Given that we typically
score many more characters than taxa, distance matrices usually include less total
information than character-state matrices.
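To make the comparison of matrix sizes concrete, here is a minimal sketch in Python. The function names and the study size (11 taxa, 500 characters) are assumptions chosen only for illustration.

```python
def character_matrix_entries(num_taxa, num_characters):
    """Entries in a character-state matrix: one per taxon per character (T x C)."""
    return num_taxa * num_characters


def distance_matrix_entries(num_taxa):
    """Entries in a distance matrix: one per unordered pair of taxa, T(T-1)/2."""
    return num_taxa * (num_taxa - 1) // 2


# Hypothetical study size, chosen only for illustration.
T, C = 11, 500
print(character_matrix_entries(T, C))  # 5500
print(distance_matrix_entries(T))      # 55
```

Adding characters enlarges the character-state matrix but leaves the number of pairwise distances unchanged, which is why the distance matrix is usually the smaller summary.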
A character-state matrix lists the state of each taxon for a number of characters. In the
case of morphological data, a character is a measurable attribute of an organism that has
the potential to be present in distinct states. Some examples of characters and character
states are given in Table 1. Note that the characters include hard and soft body parts and
behavioral traits. Further some characters are presence/absence, some show discrete
variation (e.g., number of teeth), others show truly continuous variation (e.g., color, size)
that may be broken-up into discrete states based on patterns of variation seen among the
study taxa. Character scoring issues will be discussed further below.
Having decided on a list of characters and character states you can then score them for
each taxon. The scoring is usually conducted on one or a limited number of individual
organisms that are considered to represent the taxon of interest (e.g., a species). In the
case that all the individuals examined for a taxon have the same state, that state is scored
unambiguously. If there is variation among the individuals of one taxon for a trait, the
trait will be scored so as to keep track of the fact that there is variation, or polymorphism,
within this taxon.
An example of a part of a typical morphological data matrix is given. To the left the taxa
are listed. Following each taxon are the values for each of the ten characters in this small
data matrix. Notice that some taxa may be scored as unknown for certain characters
(conventionally represented with ‘?’), either because we are ignorant as to the proper
scoring or because it is impossible to score (e.g., toe number in snakes). Also, a taxon
can be scored as polymorphic by listing multiple states within a cell.
Taxon list Character-state scoring
1 2 3 4 5 6 7 8 9 10 Character number
The four states correspond to the four different bases that can occupy each position. As
with morphological data one can be ignorant as to the identity of a base in a particular
position (‘?’) or you can find multiple alternative states within the taxon. Additionally,
you might find that some taxa have a portion of the sequence missing entirely due to
insertions or deletions of DNA in the course of evolution. These are generally indicated
with a gap code, ‘-’, but are most commonly treated as equivalent to missing data. The
phylogenetic analysis of gaps and the determination of where insertion/deletion events
have occurred raises a number of complex issues that will be touched on when we cover
sequence alignment.
Taxon list        Sequence data
                  1  2  3  4  5  6  7  8  9  10   Sequence position
Horse (outgroup)  A  T  G  G  A  T  C  A  A  C
Cat               A  T  ?  G  A  C  C  A  A  C
Dog               A  T  G  G  A  C  C  A  T  CT
Bear              A  T  G  A  A  C  C  A  T  C
Otter             A  T  G  G  A  C  C  A  T  C
Seal              A  T  G  A  A  C  -  -  -  C
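A sequence matrix like this one maps naturally onto a simple data structure. The sketch below, in Python, stores the six rows as lists of cells and reports which positions vary among taxa. It rests on two assumptions that are not from the text: gaps and ‘?’ are both ignored as missing data, and polymorphic cells (such as ‘CT’ for the dog) are skipped rather than counted as variation.

```python
# Rows transcribed from the sequence matrix; each cell is one alignment position.
matrix = {
    "Horse": "A T G G A T C A A C".split(),
    "Cat":   "A T ? G A C C A A C".split(),
    "Dog":   "A T G G A C C A T CT".split(),
    "Bear":  "A T G A A C C A T C".split(),
    "Otter": "A T G G A C C A T C".split(),
    "Seal":  "A T G A A C - - - C".split(),
}

MISSING = {"?", "-"}  # assumption: gaps are treated like missing data


def variable_positions(matrix):
    """Return the 1-based positions at which two or more unambiguous states occur."""
    num_positions = len(next(iter(matrix.values())))
    variable = []
    for i in range(num_positions):
        states = {
            row[i]
            for row in matrix.values()
            if row[i] not in MISSING and len(row[i]) == 1  # skip polymorphic cells
        }
        if len(states) > 1:
            variable.append(i + 1)
    return variable


print(variable_positions(matrix))
```

With these rules only positions 4, 6, and 9 are reported as variable; a different treatment of the polymorphic cell at position 10 would add that position as well.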
A distance matrix
A distance matrix differs from a character-state data matrix in that instead of including
entries for T taxa and C characters, it includes a single value that summarizes the
evolutionary difference (dissimilarity), or its converse (similarity), of each pair of taxa.
For T taxa, there are (T x (T-1))/2 pairwise distances. An example is shown:
Rows and columns are taxa; entries are pairwise distance measures.
          sealion  walrus   seal     bear     raccoon  weasel   dog      civet    hyaena   cat
horse     0.290    0.289    0.297    0.270    0.293    0.299    0.288    0.250    0.274    0.250
sealion   -        0.028    0.058    0.134    0.156    0.148    0.187    0.196    0.207    0.202
walrus             -        0.055    0.135    0.155    0.147    0.188    0.194    0.209    0.197
seal                        -        0.134    0.161    0.154    0.181    0.198    0.205    0.199
bear                                 -        0.139    0.156    0.185    0.179    0.193    0.181
raccoon                                       -        0.130    0.205    0.214    0.221    0.214
weasel                                                 -        0.205    0.208    0.219    0.213
dog                                                             -        0.210    0.217    0.202
civet                                                                    -        0.092    0.081
hyaena                                                                            -        0.092
In this case the matrix lists the difference (dissimilarity) of each pair of taxa, meaning
that the higher the number the greater the distance between two taxa. Sometimes a
matrix will show pairwise similarity, in which case the higher the number the shorter the
distance between the two taxa.
Immunological cross-reactivity and DNA-DNA hybridization, two kinds of data that are
rarely collected nowadays, are obtained directly in the form of pairwise similarity or
difference. More commonly, a distance matrix is derived from a character-state matrix.
The simplest way to convert a character-state matrix into a distance matrix is to calculate
the proportion of characters for which a pair of taxa differ in state (and repeat this for all
pairs of taxa). For example, as shown below, the dog and cat in the example matrix differ at two
out of 10 morphological characters, representing a dissimilarity of 0.2 (or a similarity of
0.8). There are many complications in calculating distances, a few of which will be
discussed in Chapter 11. It is worth noting that because many character state matrices
can yield the same distance matrix you cannot infer a character state matrix from a
distance matrix.
Cat 0 1 1 1 1 1 0 1 1 1
Dog 0 0 1 1 0 1 0 1 1 1
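The proportion-of-differences calculation can be sketched in a few lines of Python. The function names are illustrative, and skipping characters scored ‘?’ in either taxon is an assumption of this sketch rather than a rule stated in the text.

```python
from itertools import combinations

MISSING = "?"


def p_distance(states_a, states_b):
    """Proportion of characters, scored in both taxa, at which the states differ."""
    if len(states_a) != len(states_b):
        raise ValueError("taxa must be scored for the same characters")
    compared = differing = 0
    for a, b in zip(states_a, states_b):
        if a == MISSING or b == MISSING:
            continue  # assumption: unknown scores are ignored
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no characters scored in both taxa")
    return differing / compared


def distance_matrix(matrix):
    """One distance per unordered pair of taxa, T(T-1)/2 entries in all."""
    return {(x, y): p_distance(matrix[x], matrix[y]) for x, y in combinations(matrix, 2)}


# Cat and dog rows from the example above.
matrix = {
    "Cat": "0 1 1 1 1 1 0 1 1 1".split(),
    "Dog": "0 0 1 1 0 1 0 1 1 1".split(),
}
print(distance_matrix(matrix))
```

For the cat and dog rows this gives a dissimilarity of 0.2, matching the hand calculation. Note that the conversion is one-way: the two differing characters cannot be recovered from the value 0.2.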
Most published phylogenetic results are derived from analysis of DNA sequence data.
To provide some context, it will be useful to briefly review the experimental methods that
are used to collect DNA sequence data.
Appendix 7.1.i provides a brief summary of the molecular methods used to obtain DNA
sequences for each taxon in a data matrix. The sequences are composed of the letters, A,
C, G, and T corresponding to the four bases, adenine, cytosine, guanine, and thymine.
There may also be additional letters such as R, Y, S, W, M, and K, which are used to
indicate uncertainty in the identity of a base at a certain position (for example, R means
that either an A or G is present, Y refers to C or T, etc.). ‘N’ indicates that an unknown
nucleotide is present, whereas ‘?’ is used to indicate a position where the editor was
uncertain if there was or was not a base at all, for example if the data are ambiguous as to
whether there are 9 or 10 A’s in a row in the sequence.
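The ambiguity letters can be handled in code with a lookup table. The sketch below, in Python, uses the standard IUPAC meanings for the letters mentioned above (the text spells out only R and Y; the meanings of S, W, M, and K are filled in from the IUPAC convention). The function names are illustrative, and ‘?’ is simply reported rather than interpreted.

```python
# Standard IUPAC nucleotide codes for the letters mentioned in the text.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "S": {"C", "G"}, "W": {"A", "T"},
    "M": {"A", "C"}, "K": {"G", "T"},
    "N": {"A", "C", "G", "T"},
}


def compatible(code_a, code_b):
    """True if the two codes could represent the same underlying base."""
    return bool(IUPAC[code_a] & IUPAC[code_b])


def ambiguous_positions(sequence):
    """List (1-based position, symbol) for every symbol that is not A, C, G, or T."""
    return [(i + 1, s) for i, s in enumerate(sequence) if s not in "ACGT"]


# First 20 symbols of a line from the first example sequence below.
print(ambiguous_positions("TAC?TTTTGTCTTTAAASAC"))
print(compatible("R", "G"), compatible("R", "Y"))
```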
CGTTTATGGTGACGGAGCCGGGGGAGGTAGCACGTGGCAAAAAGAACGGCCTCGATTATCTCTTCCATCTT
TACGAACAGTGCCGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAGCGCGGCGAAAAATGCCCCAC
CAAGGTAACAATAGAAACAAATCTATTTTTAATGTTTCTTAAGTAAAATTTTGAATTCAAGCTCCGTAAAT
GAATGAAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGATAGTTTT
TGCATGGTGCACGAGGTTTGACACGGGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCTTAGAACT
GGACCAGCCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGGGAGGAAT
GGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAAAAGGTTG
TAC?TTTTGTCTTTAAASACTTTACTGTCTTCCTTTCTGAAGCCTCGTTTTCCCTGTCCGGTTTAGCTGAG
GTGGCGTGACCCTAATACGACAGCTCCACCAYTTTTGGATCCTAATCTTATTGCTTATACAGGTGACCAAC
CAAGTTTTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAARCCCAAAATGMGCCATTACGTCGG
CAGGA
TCACCCACGACCGTTCATGGTGACGGAGCCGGGGGAGGTAGCAAGTGGCAAAAAGAACGGCCTCGATTATC
TCTTCCATCTTTACGAGCAGTGCAGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAACGCGGCGAA
AAATGCCCCACGAAGGTAACAATAGAAACAAATCTATTTTTAATGATTCTTAAGTAAAATTTTGAATTCAA
GCGTAAATGAATGAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGA
TAGTTTTTGCATGGTGCACGAGGTTTGACACGTGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCT
TAGAACTGGACCAACCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGG
GAGGAATGGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAA
AAGGTTGTACTTTTTGTCTTTAAACACTTTACTGTCCTCCTTTCTGAAGCCTCGTTTTCCCCTGTCCGGTT
GAGCTGAGGTGGCGTGACCCTAATACGACAGCTCCATTGGATCCTAACCTTGTTACTTATACAGGTGACCA
ACCAAGTCCTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAAACCCAAAATGCGCCACTATGTC
You will see that the sequences differ slightly in length, that is, in the number of bases.
In order to proceed to use these sequences for phylogenetic analysis the first step is to
align these sequences to one another so that a nucleotide position in one is matched-up
with the homologous position in all the other sequences.
A data matrix is composed of characters that are shared by the taxa but which potentially
differ in state: for example, hair (the character) may be white, brown, or black (the
character states). For DNA sequences, the character is the nucleotide position (numbered
1, 2, 3 etc.) and the states are the nucleotides (A, C, G, and T). It is critical, therefore,
that nucleotides in each taxon be assigned to the correct positions. This process, called
sequence alignment, involves sliding the sequences over one another and inserting gaps,
guided by the sequences themselves. Sequence alignment, done properly, poses severe
computational challenges and has become a very technical subject. Here, I will just
summarize the underlying issues and point to some additional resources.
A DNA strand is a physical structure with nucleotides in a specific linear order. In the
simple case where the only kind of mutation is base substitution, each nucleotide
position in one taxon would be homologous to a nucleotide position at the same place in
the sequence of another taxon: position 1 in taxon A will be homologous to position 1 in
taxon B, position 2 to position 2, and so on. If we write out the two sequences the
homologous positions are aligned above one another.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T A A C C A T T G T C T A G C A A
If the only kind of mutation were base substitution, then having found the homologous genes
you would merely need to line up one homologous position and the rest of the
alignment would be trivial. However, sequences are subject to additional kinds of mutation:
deletions, insertions, inversions and translocations.
A deletion involves the removal of one or more contiguous bases. Deletions may be
due to errors during DNA replication, but can also happen during the “life” of a DNA
molecule due to imperfect DNA repair following chemical or radiation induced damage,
unequal crossing-over during recombination, or due to the actions of mobile genetic
elements. Deletions can be very short (even one base pair) or very long (entire genes).
When deletions happen, nucleotide positions in the parent strand lack homologs in the
daughter strand: they have gone extinct.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T - - - - - T T G T C T A G C A A
The same mechanisms that cause deletions (errors during replication and recombination,
DNA damage, and mobile genetic elements) can cause the insertion of DNA sequences
into a strand. In some cases inserted sequences are duplicates of nucleotide positions in
the parent strand. In that case, bases in the insertion have homologous positions in the
parent strand (but one parental position may be homologous to two daughter positions).
In other cases the inserted sequence will be novel and will lack identifiable homologous
positions – in effect a new nucleotide position has been created from scratch.
Parent:    G T A T T G A C C - - - A C T G A C T A G C A T
           | | | | | | | | | | | | | | | | | | | | | | | |
Offspring: G C A T T A A C C A C C A T T G T C T A G C A A
Inversion events occur when a piece of DNA is effectively cut out and then replaced in
the opposite orientation.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | | | |               | | | | | | | |
Offspring: G C A T T A G T C A C C A T C T A G C A A
Translocation events occur when a piece of DNA is cut out and reinserted at a different
position in the sequence.
Parent:    G T A T T G A C C A C T G A C T A G C A T
           | | | |                 | | | | | | | | |
Offspring: G C A T C A T T T A A C G T C T A G C A A
Inversions and translocations both yield cases in which there is a single nucleotide
position in a daughter strand that is homologous to each position in a parent strand, but
the sequences are not in the same linear order. While inversions and translocations are
commonly identified, the major focus of most alignment programs is on inserting gaps in
sequences to capture the history of insertions and deletions, collectively indels.
The process of sequence alignment aims to align homologous positions based on the true
history of sequence evolution. Alignment is, thus, properly viewed as a problem of
historical inference. Furthermore, because base substitution, indels, and other structural
mutations occurred along the branches of the true gene tree, sequence alignment and tree
inference are really two aspects of the same problem. Therefore, in the ideal world, we
would have computer programs that could take raw, unaligned sequences and search for
trees that could simultaneously account for the bases in the sequences and their structural
evolution. A few programs do conduct combined alignment and phylogenetic inference
(e.g., POY; http://research.amnh.org/scicomp/projects/poy.php). However, the problem
is so computationally challenging, that it is necessary to make a number of unrealistic
assumptions to make the programs work. Therefore, the vast majority of phylogenetic
analysis separates the two problems: first generating an alignment, and then provisionally
accepting that alignment as the basis for phylogenetic inference.
To get a feel for how sequence alignment can be conducted free of a phylogeny, see if
you can align the following pair of sequences.
A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A
You might conclude that these sequences are already well aligned and that the two
sequences differ by five base substitution events. While this might be the best alignment,
it is worth considering alternatives that can also explain these data through the addition of
insertion/deletion, or indel, events. For example, you could align these same two
sequences by invoking eight indels and no substitutions, or three base substitutions and
two indels. These three alternatives are shown.
Five substitutions:
A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A
Eight indels:
A T G A C - C A G - - T A - C G G C - T T T A
A T G A - T C - - G A T A T - G G C A - T T A
To choose between these we need to ask ourselves whether it is more likely that there
were five substitutions, eight indels, or three substitutions and two indels. Assuming that
the general rate of evolution is low then it is probably reasonable to favor five events over
eight, thereby rejecting the second alignment. Furthermore, most data from molecular
biology would say that base substitutions are more frequent than indels (especially in
coding genes), which means that we would normally favor the first alignment.
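Counting the events implied by an alignment is easy to automate. The Python sketch below tallies substitutions and indels for a pairwise alignment written as space-separated cells. It assumes that a run of consecutive gaps in one sequence counts as a single indel event; the function name is illustrative.

```python
def count_events(top, bottom, gap="-"):
    """Return (substitutions, indels) implied by a pairwise alignment.

    Assumption: each unbroken run of gaps in one row is a single indel event.
    """
    top, bottom = top.split(), bottom.split()
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same number of columns")
    substitutions = indels = 0
    previous_gap_row = None  # which row, if any, had a gap in the previous column
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column cannot be a gap in both rows")
        gap_row = "top" if a == gap else "bottom" if b == gap else None
        if gap_row is None:
            if a != b:
                substitutions += 1
        elif gap_row != previous_gap_row:
            indels += 1  # a new run of gaps starts here
        previous_gap_row = gap_row
    return substitutions, indels


ungapped = count_events("A T G A C C A G T A C G G C T T T A",
                        "A T G A T C G A T A T G G C A T T A")
gapped = count_events("A T G A C - C A G - - T A - C G G C - T T T A",
                      "A T G A - T C - - G A T A T - G G C A - T T A")
print(ungapped, gapped)
```

Applied to the first two alignments shown above, the function recovers five substitutions with no indels, and no substitutions with eight indels.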
This example allows us to state a rule that is applied in almost all sequence alignment
programs: only invoke an indel if at least one base substitution is avoided. Indeed it is
normal to set the gap penalty, the threshold for the number of base substitutions avoided
before an indel is inferred, higher still: gap penalties of three to twenty are common.
Additionally, most computer programs impose an extra cost for longer gaps or gaps at the
ends to avoid alignments such as the following, which avoid all base substitutions at the
“cost” of two indels.
- - - - - - - - - - - - - - - - - - A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A - - - - - - - - - - - - - - - - - -
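The role of the gap penalty can be illustrated with a small dynamic-programming sketch in Python (in the style of the Needleman-Wunsch global alignment algorithm). It is a simplification under stated assumptions: every mismatch costs 1, every gapped column costs the same fixed penalty, and there is no extra charge for long or terminal gaps, so it does not reproduce the additional costs that real programs impose. The function and parameter names are illustrative.

```python
def alignment_cost(seq_a, seq_b, mismatch=1, gap=3):
    """Minimum total cost of globally aligning two sequences.

    Assumption: linear gap cost, i.e. each gapped column costs `gap`.
    """
    # previous[j] = best cost of aligning the first i-1 bases of seq_a with the first j of seq_b
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]
        for j in range(1, len(seq_b) + 1):
            pair = previous[j - 1] + (0 if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            gap_in_b = previous[j] + gap      # base of seq_a aligned against a gap
            gap_in_a = current[j - 1] + gap   # base of seq_b aligned against a gap
            current.append(min(pair, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


a = "ATGACCAGTACGGCTTTA"
b = "ATGATCGATATGGCATTA"
print(alignment_cost(a, b, gap=3))
```

With a gap penalty of 3 the cheapest alignment of the two example sequences costs 5, which is the ungapped alignment with five substitutions: any gapped alignment of two equal-length sequences needs at least two gapped columns and so costs at least 6.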
That being said, a smart alignment program will permit gaps to be inserted at the end in
cases such as the following.
A T G A C C A G T A C G G C T T T A
A G C A T G G C T A T A G A T A C C
A T G A C C A G T A C G G C T T T A - - - - - -
- - - - - - A G C A T G G C T A T A G A T A C C
Here the “gaps” probably do not represent indel events but, rather, sequencing reactions
that started and/or ended at different positions in the sequence. A purist might therefore
change the gaps to uncertainty codes (below), although this will not affect phylogeny
reconstruction programs.
A T G A C C A G T A C G G C T T T A ? ? ? ? ? ?
? ? ? ? ? ? A G C A T G G C T A T A G A T A C C
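Recoding terminal gaps as missing data is a mechanical step. A minimal Python sketch, assuming rows are written as space-separated cells and that only gaps before the first base or after the last base are recoded (the function name is illustrative):

```python
def mask_terminal_gaps(row, gap="-", missing="?"):
    """Replace leading and trailing gaps with the missing-data code; keep internal gaps."""
    cells = row.split()
    scored = [i for i, cell in enumerate(cells) if cell != gap]
    if not scored:  # a row made up entirely of gaps
        return " ".join(missing for _ in cells)
    first, last = scored[0], scored[-1]
    return " ".join(
        missing if (i < first or i > last) else cell
        for i, cell in enumerate(cells)
    )


print(mask_terminal_gaps("A T G A C C A G T A C G G C T T T A - - - - - -"))
print(mask_terminal_gaps("- - - - - - A G C A T G G C T A T A G A T A C C"))
```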
It is probably clear that alignment is easiest and most certain when both base substitutions
and indels are rare. This is because matched parts of the sequence provide a framework
for identifying the position and size of indel events. For example, below are two true
alignments. Which do you think would yield data that were easier to align?
A T G A - - - T G C A G C T T T A G G T A
? C A A C A G T A C G A - - C T A C - C A
A T G A C C A G T A C A G - T T T A ? ? ?
A C G T C C - - T A C G G C T T C A G T A
The answer is the second one. While the number of indels is similar, the many extra base
substitutions in the top case would make it very hard to identify the true alignment with
confidence.
Computer programs can often squeeze more information out of sequences by taking into
account not just the number of base substitutions but the kind of substitution. For good
molecular reasons (associated with repair mechanisms), purine-to-purine (A ⇔ G) and
pyrimidine-to-pyrimidine (C ⇔ T ) substitutions, collectively called transitions, are more
frequent than changes from a purine to a pyrimidine or vice versa (A ⇔ C, A ⇔ T, G ⇔
C, G ⇔ T), collectively called transversions. Computer programs can use differential
penalties on transitions and transversions so as to pick alignments that more closely
match the underlying molecular processes. Likewise, functional constraints on proteins
mean that certain amino acid substitutions, ones entailing amino acids with similar
chemical properties, are more frequent than others. These too can be taken into account
by computational algorithms.
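The distinction between transitions and transversions is simple to express in code. The Python sketch below classifies each difference between two aligned, ungapped sequences and computes a weighted cost. The weights (1 for a transition, 2 for a transversion) are hypothetical values chosen for illustration, not penalties recommended in the text.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify(base_a, base_b):
    """Return 'identical', 'transition', or 'transversion' for a pair of bases."""
    if base_a not in BASES or base_b not in BASES:
        raise ValueError("only A, C, G, and T are handled in this sketch")
    if base_a == base_b:
        return "identical"
    pair = {base_a, base_b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    kinds = [classify(a, b) for a, b in zip(seq_a, seq_b)]
    return kinds.count("transition"), kinds.count("transversion")


def weighted_cost(seq_a, seq_b, transition=1, transversion=2):
    """Total substitution cost under hypothetical differential penalties."""
    ts, tv = count_ts_tv(seq_a, seq_b)
    return ts * transition + tv * transversion


print(count_ts_tv("ATGACCAGTACGGCTTTA", "ATGATCGATATGGCATTA"))
```

For the ungapped alignment of the two example sequences used earlier, four of the five differences are transitions and one is a transversion.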
Pairwise alignment considers just two sequences at a time whereas multiple alignments
include sequences from many taxa so as to obtain an entire aligned data matrix. A
pairwise alignment is relatively simple for a computer to determine, even when a
complex set of penalties is implemented. Multiple alignments are, however,
disproportionately more difficult. As the number of sequences being aligned increases,
the number of possible alignments goes up exponentially.
Multiple alignment algorithms should allow the placement of gaps in one sequence to
influence the placement of gaps in other sequences. This is because, when gaps in two
species are in the same position they can be attributed to a single indel occurring
somewhere on the gene tree. Nonetheless, most multiple alignment programs start by
making a pairwise alignment and then gradually align additional sequences to the earlier
alignment. In this procedure, a gap introduced early is retained even if the addition of
later sequences might suggest an alternative position for the gap.
Given the difficulties faced by alignment programs, humans can often visually identify
the more egregious mistakes. Thus, while computer multiple alignment programs (e.g.,
CLUSTAL) provide a good starting point, they usually need to be examined and adjusted
by eye. To illustrate this, here is a problematic portion of an alignment that was actually
returned by CLUSTAL and an eyeball-edited version of the same. You might notice that
not only does the second alignment imply a simpler mutational history, but also the
human editor could take account of the codon structure of this gene (something that few
computer programs keep track of) so as to place both gaps in the same reading frame.
Given that sequence alignments will vary depending on how they were generated (what
algorithm, what penalties, and whether they were manually edited), you might worry that
phylogenetic analysis of aligned sequences is invalid. Actually, the problems are less
than they may seem.
Usually, even if some part of a sequence is hard to align unambiguously, many regions
can be aligned confidently. It is common practice to either exclude regions of ambiguous
alignment from the phylogenetic analysis or to repeat the analysis using a range of
alternative alignments to see if the inferred trees vary significantly. And even in the
worst-case scenario where the entire gene is hard to align, the data matrix is likely to
show a lack of clear phylogenetic signal rather than a strong, misleading signal. Thus,
even if a suspect alignment is used, statistical analysis of the phylogenetic conclusions
(Chaps X-XX) will probably show that they are weak.
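Excluding a region of ambiguous alignment amounts to dropping the same columns from every row of the matrix. A minimal Python sketch follows; the taxon names and the excluded range are hypothetical, and the function name is illustrative.

```python
def exclude_columns(alignment, excluded_ranges):
    """Drop alignment columns; ranges are 1-based and inclusive, e.g. [(7, 8)]."""
    num_columns = len(next(iter(alignment.values())))
    if any(len(seq) != num_columns for seq in alignment.values()):
        raise ValueError("all rows must have the same number of columns")
    dropped = set()
    for start, end in excluded_ranges:
        dropped.update(range(start - 1, end))
    kept = [i for i in range(num_columns) if i not in dropped]
    return {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}


# Hypothetical two-taxon alignment; columns 7-8 are treated as ambiguously aligned.
alignment = {
    "taxon1": "ATGACCAGTACAG-TTTA???",
    "taxon2": "ACGTCC--TACGGCTTCAGTA",
}
trimmed = exclude_columns(alignment, [(7, 8)])
print(trimmed)
```

Running the phylogenetic analysis on both the full and the trimmed matrix is one way to check whether the conclusions depend on the doubtful region.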
Generating a morphological data matrix
Although the field of phylogenetics was founded on morphological data, most modern
research uses molecular data, especially DNA sequences, instead. Phylogenetic analysis
of morphological data turns out to be more challenging than the analysis of DNA
sequences, and it is more difficult to obtain statistically significant support for
phylogenetic conclusions. However, fossils can only be analyzed through the use of
morphological data (scored for them and related living species). Additionally, even when
trees come from different kinds of data it often becomes necessary to build a
morphological data matrix as a means to use the tree to reconstruct the evolution of
morphological traits. Therefore, despite the preeminence of molecular phylogenetics, it
is important to know how morphological data are scored and assembled into data
matrices.
Two steps can be recognized in the building of a morphological matrix, which I will call
character encoding and character scoring. Character encoding involves deciding on the
limits of characters and on the alternative states that are recognized for each character.
For example, when you decide to score fur color using two states, brown/black and white,
then you have encoded one character (fur color). Character scoring involves looking at
each taxon and assigning it a state for each encoded character. For example, you could
score an otter as having brown/black fur. In practice, observations made while scoring
taxa often result in changes being made to character encoding, but it is still useful to
distinguish these two steps.
Once a set of taxa has been selected for study a systematist generally starts by looking for
characteristics that appear to vary among them. Notice that whereas many characters in a
DNA sequence data matrix may be invariant, the way that morphological characters are
selected means that constant characters are usually not included.
Once some variation has been noted the next challenge is to clearly encode the
characters. This is not straightforward. For example, imagine that you observed that leaf
shape and size differed among a set of eight plant species, as shown in the figure. How
would you capture this variation?
[Figure: leaves of the eight plant species, labeled A through H.]
Consider two of the numerous possible ways to encode this variation. (1) You recognize
two basic leaf shapes, cordate (heart-shaped) and obcordate (with the widest point near
the top), and two size classes. (2) You encode leaf length, leaf width, and the height of
the widest point. The scoring that might result from these two encoding schemes is shown
in the two matrices below.
Once the characters are defined, the next question is how to delimit
character states. If the variation is rather discrete between taxa and with
little variation within taxa, as illustrated by leaf shape in the example
above, then it may be easy to encode the categories. However, most
morphological traits are inherently continuous and variation within taxa is
usual. Thus, it can be difficult to divide continuous variation into the discrete states
needed for phylogenetic analysis. The graph below shows hypothetical data on leaf
length in ten species.
[Figure: hypothetical leaf-length data for ten species, labeled A through J.]
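One simple way to turn a continuous measurement into discrete states is to sort the taxon means and start a new state wherever two neighboring values are separated by more than a chosen gap. The Python sketch below illustrates that idea only; the leaf lengths and the threshold are invented for illustration and are not the values plotted in the figure, and the procedure is one of several possible conventions rather than the method used in this chapter.

```python
def gap_code(means, min_gap):
    """Assign states 0, 1, 2, ... by splitting sorted values at gaps larger than min_gap."""
    ordered = sorted(means.items(), key=lambda item: item[1])
    states = {}
    state = 0
    previous_value = None
    for taxon, value in ordered:
        if previous_value is not None and value - previous_value > min_gap:
            state += 1  # a large enough gap starts a new character state
        states[taxon] = state
        previous_value = value
    return states


# Hypothetical mean leaf lengths (cm) for ten species; illustrative values only.
leaf_length = {
    "A": 2.1, "B": 2.3, "C": 2.2, "D": 5.0, "E": 5.4,
    "F": 5.1, "G": 9.8, "H": 9.5, "I": 5.3, "J": 2.4,
}
print(gap_code(leaf_length, min_gap=1.0))
```

Changing the threshold changes the number of states recognized, which is one concrete form of the subjectivity discussed in the text.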
Taken together, you can probably see that there are many somewhat subjective decisions
that must be made in encoding morphological data and that these are likely to be adjusted
by observations made while scoring individual taxa. Nonetheless, the aim is to
eventually define the encoding scheme so clearly that any researcher would score the
same taxa and arrive at the same morphological data matrix. However, even though
scoring can be objective once encoding is completed, morphological data matrices are
always somewhat subjective because character encoding cannot be rendered fully
objective.
While subjectivity is something that makes scientists uncomfortable, the fact that there is
some subjectivity does not invalidate morphology as a source of phylogenetic data. So
long as different encoding schemes capture the actual variation among taxa, then they
should yield similar estimates of the tree. However, since different encoding schemes
can result in different trees being chosen, as with sequence alignment, it is considered
good practice to try a few different schemes and see if the phylogenetic conclusions
remain the same.
Information for phylogenetic inference can theoretically come from any traits that we
believe evolved within the constraints of the underlying tree. A trait that varies among
tips and shows some degree of heritability (ancestors having the trait tend to give
descendants that have it too), thus, has the potential to provide phylogenetic information.
Prior to the advent of modern molecular methods, phylogenetic analysis was conducted
primarily on morphological variation. Nowadays, the great majority of phylogenetic
analyses are conducted based on DNA sequences obtained from representative
individuals. In the following chapters we will examine methods that have been
developed for phylogenetic inference from DNA sequence data and morphological data.
Although I will focus on DNA sequence and morphological data, there are many other
kinds of data that can be used for phylogenetic analysis. In most cases these other kinds
of data can be analyzed using methods developed for DNA sequences or morphology but
in some cases specialized methods need to be used. It is beyond the scope of this book to
explore all these methods. However, below I review most of the kinds of data that are
used for phylogenetic inference and provide some brief discussion as to how they are
typically analyzed.
i. Molecular Sequences
Sequences of peptides or small proteins were the first kind of molecular sequences to be
used for phylogenetics. Originally amino acid sequences were determined chemically.
Now it is much more common to infer an amino acid sequence from either the mRNA or
DNA sequence that encodes the protein.
Protein sequences are still used for phylogeny reconstruction, especially to study ancient
relationships. This is because protein sequences are often subject to functional
constraints that slow down the rate of evolution, reducing the frequency with which a
single position undergoes multiple evolutionary changes of state (multiple hits).
Nonetheless, the general principles of phylogenetic analysis of protein sequences are very
similar to those applied to DNA sequences.
The second kind of molecular sequence to be collected widely was ribosomal RNA
(rRNA). This was because ribosomes could be purified and their RNA could
be sequenced with reverse transcriptase. This technology has now been superseded and
rRNA or messenger RNA (mRNA) sequences are nowadays obtained by sequencing the
encoding DNA (or in the case of mRNA, by copying it back into DNA). Because there is
usually a one to one correspondence between RNA and the encoding DNA, phylogenetic
analysis of RNA sequences is basically identical to that of DNA sequences. The only
complication that arises sometimes is that some RNA molecules adopt folded structures
due to bonding between bases at different positions in the molecule. When this happens,
there can be non-independence between positions in the sequence that interact,
potentially complicating phylogenetic analysis.
Over the years a number of different molecular techniques have been developed that
allow researchers to score organisms for the presence/absence of particular molecular
markers. Appendix 7.1 provides a brief description of some of these methods.
The first two methods developed, RFLPs (Appendix 7.1.ii) and isozymes (Appendix
7.1.iii), date back to a time when molecular sequencing was not feasible. Isozyme data are
optimized for population genetic studies and are not well suited to phylogenetic analysis.
RFLP data, in contrast, provided a lot of robust phylogenetic information and played a
major role in the development of molecular phylogenetics. However, for all its historical
importance, RFLP has been superseded in phylogenetics with the advent of inexpensive
and efficient DNA sequencing methods.
RAPDs (7.1.iv), ISSRs (7.1.v), AFLPs (7.1.vi), and microsatellites (7.1.vii), collectively
called distributed molecular markers, were originally developed for studying variation within
populations. Because the members of a population are closely related, their DNA
sequences are usually very similar. Distributed molecular marker methods quickly scan a
set of taxa for molecular variation distributed anywhere in the whole genome. For this
reason, these markers (of which AFLP is currently the most widely used) have been particularly
popular for phylogenetic studies of closely related organisms, where it can be difficult to
find sufficiently variable gene regions to sequence.
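To make the idea of scoring taxa for presence/absence markers concrete, here is a small sketch of such a data matrix. The taxon names and band scores are invented for illustration, and classify_markers is a hypothetical helper. It labels each marker as constant, variable, or parsimony-informative, using the usual rule that an informative binary character has each state in at least two taxa.

```python
# Hypothetical band scores: 1 = marker band present, 0 = absent.
marker_matrix = {
    "taxon_A": [1, 0, 1, 1, 0],
    "taxon_B": [1, 0, 1, 0, 0],
    "taxon_C": [1, 1, 0, 0, 0],
    "taxon_D": [1, 1, 0, 0, 1],
}


def classify_markers(matrix):
    """Label each marker column as constant, variable, or informative.

    A column is 'informative' (in the parsimony sense) when both
    states occur in at least two taxa each; a column that varies
    without meeting that rule is just 'variable'.
    """
    labels = []
    for column in zip(*matrix.values()):
        present = sum(column)
        absent = len(column) - present
        if present == 0 or absent == 0:
            labels.append("constant")
        elif present >= 2 and absent >= 2:
            labels.append("informative")
        else:
            labels.append("variable")
    return labels


marker_labels = classify_markers(marker_matrix)
```

In this invented matrix only the second and third markers group taxa, which is why closely related organisms often need many scanned markers before enough informative ones turn up.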
In addition to certain technical problems that are mentioned in the appendix, distributed
molecular markers have one significant drawback for studying phylogenies among
closely related species. The available methods of analysis estimate a common
phylogenetic tree under the assumption that all parts of the genome have tracked the
same history. This is fine if all the genes have tracked the same history. But phenomena
such as incomplete lineage sorting mean that different parts of the genome can have
different phylogenetic trees (chap. 5), and these phenomena will be particularly prevalent
at the low phylogenetic scales that are generally studied with distributed molecular
markers.
You might hope that computer programs for phylogenetic analysis would have a
safeguard built in to detect that the assumptions of the method have been violated. But
unfortunately it is not easy to tell from distributed marker data whether there is a single
shared history. You will usually obtain a well-resolved tree whether there is a single
history of the whole genome or not. The tree may not correspond to the history of the
whole genome but may be some kind of average across the conflicting histories from
different parts of the genome. Therefore, users of distributed marker data need to be
extra vigilant to avoid reading too much significance into the trees they obtain in cases
where discordance among gene trees is suspected.
Whereas molecular marker methods have generally been motivated by the search for
variation at low taxonomic scales, structural molecular characters have mainly been of
interest because they tend to evolve very slowly. A rare structural molecular feature
shared by a group of taxa can provide compelling evidence that the taxa form a clade.
The use of molecular structural characters for phylogenetic inference generally follows
one of two approaches. The first involves scoring tips for the presence or absence of a
number of structural characters and then using parsimony or other standard phylogenetic
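methods to infer the tree that best explains the full set of characters. This approach
allows that any of the characters could have been subject to homoplasy. The second
applies Hennigian logic to a single structural change that is assumed to be rare enough to
have arisen only once, so that the taxa sharing it are inferred to form a clade.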
A simple example of the latter approach is provided by a study of land plants conducted
in 1992 by Linda Raubeson and Robert Jansen, then at the University of Connecticut.
They used RFLP mapping approaches to show that the clubmosses and their allies (the
lycophytes) have a major inversion in their plastid genome relative to all other vascular
land plants. Furthermore, the condition found in lycophytes resembled that of the outgroups
(mosses and liverworts). From this they concluded that the lycophytes must be outside a
clade that includes all other living vascular plants.
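[Figure: tree in which the outgroup and the lycophytes share plastid gene arrangement A and the other vascular plants have arrangement B. The inversion (A to B) is placed on the branch leading to the other vascular plants.]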
Abundant subsequent research on land plant phylogeny has confirmed the conclusion
reached based on this inversion. It seems that this inversion really did occur just once in
a common ancestor of all living vascular plants except lycophytes.
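The claim that the inversion occurred just once can be checked against any candidate tree by counting the minimum number of changes the character requires. The sketch below does this with Fitch parsimony. It is an illustration only, not the procedure Raubeson and Jansen used, and the four tips and two tree shapes are a hypothetical, simplified sampling. Arrangement "A" is the uninverted gene order and "B" the inverted one.

```python
def fitch_changes(tree, states):
    """Return (state_set, changes) for `tree` using Fitch parsimony.

    `tree` is a tip name (str) or a 2-tuple of subtrees; `states`
    maps each tip name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_changes(tree[0], states)
    right_set, right_changes = fitch_changes(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # No state in common: one extra change is needed below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tips standing in for the groups named in the text.
gene_order = {"outgroup": "A", "lycophyte": "A", "fern": "B", "seed_plant": "B"}

# Lycophytes outside all other vascular plants (the supported result).
tree_supported = ("outgroup", ("lycophyte", ("fern", "seed_plant")))
# An alternative with lycophytes nested among the other vascular plants.
tree_alternative = ("outgroup", ("fern", ("lycophyte", "seed_plant")))

changes_supported = fitch_changes(tree_supported, gene_order)[1]
changes_alternative = fitch_changes(tree_alternative, gene_order)[1]
```

On the first tree a single inversion explains the data. The alternative needs two events (a second inversion or a reversal), which is what makes a rare structural change such persuasive evidence.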
Another classic example that fits this basic principle is provided by chromosomal inversion
phylogenies, which were reconstructed based on a logical analysis of a series of nested
chromosomal inversions in Hawaiian fruit flies. Fruit flies have polytene chromosomes
in the salivary glands that can be stained to reveal banding patterns, which may be
observed under the microscope. In 1982, Carson summarized more than a decade of such
data and was able to provide a detailed phylogenetic tree for approximately 103 fly
species, a tree that has largely been validated based on DNA sequence data (O’Grady et
al. 2001; BMC Evo Bio).
Nonetheless, while structural molecular characters have provided definitive data in many
cases, this does not mean that the Hennigian logic will always succeed. Our knowledge
of molecular processes is not good enough to definitively rule out independent origins of
the same structural mutations. Therefore, even when clades are supported by supposedly
rare structural mutations, biologists still hope to corroborate those clades through the use
of other kinds of data.
As discussed earlier in this chapter, morphological features that reflect some underlying
aspects of the genotype, making them heritable, can be used for inferring phylogenetic
relationships. The basic approach with such data is to examine the variation among the
taxa and develop an encoding scheme for summarizing that variation with a number of
discrete states. As discussed earlier, the delimitation of characters and character states can
be tricky, depending on a series of somewhat ambiguous judgment calls.
There are a number of kinds of data that do not involve morphology in the strict sense
(gross physical features of organisms), but nonetheless contend with similar problems of
character encoding. These are sometimes called “phenotypic data,” but I don’t find that
term appropriate given that all features of organisms we observe are technically
phenotypic. Rather, I will use the term morphotypic to refer to those kinds of data for
which character encoding is not defined a priori, but is guided by the observed variation
among the taxa combined with insights into the homology and independence of the traits.
Below is a list of some kinds of morphotypic data that have been used for phylogenetic
analysis. It is common for a data matrix to include more than one kind of morphotypic
data.
If different individual organisms scored from a given taxon vary in a trait, one strategy
(the usual one nowadays) is to score that taxon as polymorphic, i.e., containing multiple
states, but not to worry about the frequency of each variant within the taxon. However,
in earlier times the frequency of a polymorphic character, usually an isozyme or RFLP
allele, was considered to provide evidence of relatedness. Therefore, a data matrix could
be constructed in which the entries are not the presence or absence of a state, but the
frequency of different alleles in different taxa, as illustrated below. You will see that for
each taxon, the allele frequencies at each locus add up to 1.0.
Taxon   Locus 1          Locus 2     Locus 3
        1a   1b   1c     2a   2b     3a   3b   3c   3d   3e    (alleles)
A       0.0  0.9  0.1    0.0  1.0    1.0  0.0  0.0  0.0  0.0
B       0.0  0.3  0.7    0.2  0.8    0.4  0.6  0.0  0.0  0.0
C       0.1  0.4  0.5    0.1  0.9    0.1  0.8  0.1  0.0  0.0
D       0.9  0.1  0.0    0.8  0.2    0.0  0.1  0.0  0.9  0.0
E       0.7  0.3  0.0    1.0  0.0    0.0  0.0  0.0  0.7  0.3
F       1.0  0.0  0.0    0.9  0.1    0.0  0.1  0.1  0.4  0.4
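The constraint that frequencies at a locus total 1.0 is easy to verify mechanically. The sketch below stores the matrix above as nested lists, one list per locus, and reports any taxon and locus that breaks the rule. The layout and the function name loci_not_summing_to_one are illustrative choices, not a standard file format.

```python
import math

# Allele frequencies copied from the table above, grouped by locus.
frequencies = {
    "A": [[0.0, 0.9, 0.1], [0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]],
    "B": [[0.0, 0.3, 0.7], [0.2, 0.8], [0.4, 0.6, 0.0, 0.0, 0.0]],
    "C": [[0.1, 0.4, 0.5], [0.1, 0.9], [0.1, 0.8, 0.1, 0.0, 0.0]],
    "D": [[0.9, 0.1, 0.0], [0.8, 0.2], [0.0, 0.1, 0.0, 0.9, 0.0]],
    "E": [[0.7, 0.3, 0.0], [1.0, 0.0], [0.0, 0.0, 0.0, 0.7, 0.3]],
    "F": [[1.0, 0.0, 0.0], [0.9, 0.1], [0.0, 0.1, 0.1, 0.4, 0.4]],
}


def loci_not_summing_to_one(matrix, tolerance=1e-9):
    """Return (taxon, locus_number) pairs whose frequencies do not total 1."""
    problems = []
    for taxon, loci in matrix.items():
        for locus_number, alleles in enumerate(loci, start=1):
            # Compare with a tolerance because decimals are inexact floats.
            if not math.isclose(sum(alleles), 1.0, abs_tol=tolerance):
                problems.append((taxon, locus_number))
    return problems


scoring_errors = loci_not_summing_to_one(frequencies)
```

An empty result means every taxon passes. A check like this is a cheap way to catch transcription errors before any analysis is run.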
Methods have been developed for analyzing frequency data, most commonly involving
first converting the frequency matrix into a distance matrix. However, nowadays
frequency data are rarely if ever used for phylogeny reconstruction. When taxa share
polymorphisms (e.g., each contains both alleles a and b of the same locus) then it is
questionable whether the taxa are monophyletic entities that can meaningfully be
assigned to a single tip of the tree of life. This is, I think, why frequency data have so
completely fallen out of favor.
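For readers curious what converting frequencies into distances looks like, here is a minimal sketch. The measure used, half the summed absolute difference in allele frequencies averaged over loci, is just one simple choice among several that have been proposed, and the text does not single out any particular one. The function name mean_frequency_distance is hypothetical.

```python
def mean_frequency_distance(loci_x, loci_y):
    """Average, over loci, of half the summed absolute frequency differences.

    Each argument is a list of loci; each locus is a list of allele
    frequencies in the same allele order for both taxa. The result is
    0.0 for identical frequencies and 1.0 when no alleles are shared.
    """
    per_locus = [
        0.5 * sum(abs(p - q) for p, q in zip(alleles_x, alleles_y))
        for alleles_x, alleles_y in zip(loci_x, loci_y)
    ]
    return sum(per_locus) / len(per_locus)


# Rows for taxa A, B and F copied from the frequency table.
taxon_a = [[0.0, 0.9, 0.1], [0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]]
taxon_b = [[0.0, 0.3, 0.7], [0.2, 0.8], [0.4, 0.6, 0.0, 0.0, 0.0]]
taxon_f = [[1.0, 0.0, 0.0], [0.9, 0.1], [0.0, 0.1, 0.1, 0.4, 0.4]]

distance_ab = mean_frequency_distance(taxon_a, taxon_b)  # about 0.47
distance_af = mean_frequency_distance(taxon_a, taxon_f)  # about 0.97
```

Taxa A and F share almost no alleles, so their distance is close to the maximum, whereas A and B overlap at every locus and come out much closer.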
All the other kinds of data discussed are initially scored in the form of a character-state
data matrix. Although they can be converted into a distance matrix, they start out as a list
of discrete states scored for each taxon. However, two kinds of data are collected in the
form of a distance between a pair of taxa: immunological distances and DNA-DNA
similarity.
Immunological methods were popular in animal systematics in the 1970s and 1980s. The intensity
of the reaction between the immune serum of one animal and antigenic proteins from
another animal was scored quantitatively in the laboratory. The underlying logic was that
the greater the time since common ancestry, the greater the protein differences, which
should in turn lead to a weaker cross-reaction.
Similarly, mainly in the 1980s, many systematists put great stock in DNA-DNA
hybridization data as a measure of the overall sequence similarity of a pair of genomes.
A typical experiment would involve single stranded DNA from one taxon being attached
to a column and then being allowed to hybridize with single stranded (and radioactively
labeled) DNA from another species. The greater the sequence similarity of the two
genomes, the more tightly the complementary sequences would bind. Overall genomic
similarity could thus be measured by looking at the release of radioactivity when the
column was gradually heated up to melt apart the two strands.
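A melting experiment of this kind has to be reduced to a single number before it can enter a distance matrix. The sketch below assumes the summary used is the temperature at which half of the radioactive label has been released, found by linear interpolation, and that the distance is the drop in that temperature relative to a same-species control. Both the summary and all of the readings are assumptions made for illustration.

```python
def median_melting_temperature(temperatures, fraction_released):
    """Interpolate the temperature at which half the label is released.

    `fraction_released` is the cumulative fraction (0 to 1) of
    radioactivity eluted by the time each temperature is reached.
    """
    points = list(zip(temperatures, fraction_released))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 <= 0.5 <= f1 and f1 > f0:
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("curve never crosses 50% release")


# Hypothetical readings (degrees Celsius, cumulative fraction released).
temperatures = [60, 65, 70, 75, 80, 85, 90]
same_species = [0.00, 0.02, 0.10, 0.30, 0.70, 0.95, 1.00]
different_species = [0.00, 0.10, 0.40, 0.80, 0.95, 1.00, 1.00]

tm_control = median_melting_temperature(temperatures, same_species)
tm_hybrid = median_melting_temperature(temperatures, different_species)
delta_tm = tm_control - tm_hybrid  # larger value = less similar genomes
```

The less similar the two genomes, the earlier the hybrid strands separate, so a larger temperature drop is read as a greater distance.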
Immunological and DNA-DNA distance measures can only be analyzed using distance
methods, which is somewhat limiting. Also, whole-genome distance data do not allow
you to detect the existence of different gene trees for different parts of the genome.
However, the main reason that these data are no longer used for phylogenetic research is
that, compared to DNA sequencing, the data are harder to generate and less readily
repeatable. Nonetheless, a number of conclusions reached using immunological or DNA-
DNA hybridization data have since been validated using DNA sequence data.
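To give a feel for what a distance method does with data of this kind, here is a compact sketch of UPGMA clustering, which repeatedly joins the two closest clusters and averages their distances to everything else. UPGMA is only one of several distance methods and assumes roughly constant rates of change. The four taxa and their distances are invented for illustration.

```python
from itertools import combinations


def upgma(names, distance):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    `distance` maps frozenset({x, y}) to the measured distance
    between taxa x and y.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of tips
    dist = dict(distance)                   # copy; extended as we merge
    while len(clusters) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical pairwise distances among four taxa.
taxa = ["W", "X", "Y", "Z"]
measured = {
    frozenset(("W", "X")): 2.0,
    frozenset(("Y", "Z")): 4.0,
    frozenset(("W", "Y")): 8.0,
    frozenset(("W", "Z")): 8.0,
    frozenset(("X", "Y")): 8.0,
    frozenset(("X", "Z")): 8.0,
}

tree = upgma(taxa, measured)
```

Notice that the only input is one number per pair of taxa. Nothing in it records which characters differ, which is the limitation the paragraph above describes.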
Major Points
Phylogenetic inference is based on variable traits that have been scored for a set of taxa
and have been entered into a data matrix. While there are many kinds of data that can be
used, the most important kinds to know about are DNA sequences and morphology.
DNA sequence data are the most widely used tool for studying phylogeny, being easy to
collect in large quantities and relatively simple to analyze using statistical methods. The
most common approaches involve first aligning the DNA sequences to one another by
adding gaps that represent insertion/deletion (indel) events. Morphological data are
needed to add fossils to the tree of life and morphology is often scored as a first step in
using phylogenies to study evolution (Chap. XX). Morphological data do not need
alignment, but the delimitation of characters and character states can be quite difficult
and can introduce an element of subjectivity.
Learning objectives