Final 2

Phylogenet
ics
Course Bio310: Bioinformatics-I

Waseem Haider Alvi
Lecturer,
Dept. of Biosciences,
COMSATS Institute of Information Technology Islamabad
Phylogenetic Analysis
Overview
Insight into evolution
Evolutionary relationships shown as branches
Length and nesting reflects degree of

similarity between any two items
Relationship among species
IMAGE: http://www.howe.k12.ok.us/~jimaskew/bclasfy.htm
Relationship within a species (HIV-1 vs HIV-2)
IMAGE: http://www-hto.usc.edu/~cbmp/2002/PhylogeneticAnalysis/Distance%20Methods.html
Overview
C-Terminal Motor Kinesin sequences
 http://www.proweb.org/kinesin/BE4_Cterm.html
Overview
Objective: determine branch length and figure
out how the tree should be drawn
Sequences most closely related drawn as

neighboring branches
Overview
Dependent upon good multiple sequence
alignment programs
Group sequences with similar patterns of

substitutions
Overview
Consider two related sequences:
ancestoral sequence can be (partially) derived
With additional sequences, more information

can be gathered
Uses of Phylogenetic
Analysis
Given a set of genes, determine genes
likely to have equivalent functions
Follow changes occurring in a rapidly

changing species
Example: influenza
Study rapidly changing genes
Next year’s strain can be predicted
Flu vaccination can be developed
Tree of Life
Phylogenies study the evolution of a species
 Image: http://microbialgenome.org/primer/tree.html
Tree of Life
Traditionally, morphological (visible features)
characters used to classify organisms
Living organisms
Fossil records
Sequence data now can be used

Tree of Life
Many different resources including:
 NCBI taxonomy web sites
University of Arizona’s tree of life project
NCBI Taxonomy Web Site
 http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html
/
Tree of Life
http://tolweb.org/tree/phylogeny.html
Evolutionary Trees
Two dimensional graph showing evolutionary
relationship among a set of items
organisms, genes, or sequences
Each unit defined by a distinct branch

Evolutionary Trees
leaves represent the units(taxa) being studied
nodes and branches represent the

relationships among the taxa
Two taxa derived from the same common

ancestor share a node
Evolutionary Trees
Two types:
Evolutionary Time
 Assumes molecular clock hypothesis
Edit Distance (Phylogram)

 Number of sequence level changes occurring
 Distance not in direct relation to age
Cladograms Phylograms
measure time. measure change.
sharks
seahorses seahorses
sharks frogs
owls
frogs
Root
Root owls
crocodiles
crocodiles armadillos
armadillos 5% change
50 million years bats
bats
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Ultrametricity Additivity
All tips are an equal Distance between any two tips
distance from the root. equals the total branch length
between them.
X
a a X
b Y
b Root e
e Y c d
Root c d
a=b+c+d+e XY = a + b + c + d + e
In simple scenarios, evolutionary trees are ultrametric and
phylograms are additive.
Homology: identical character due to shared ancestry
(evolutionary signal)
Homoplasy: identical character due to evolutionary

convergence or reversal (evolutionary noise)
lizards +flight
birds worms
snakes
snakes
rodents lizards
primates rodents
+hair snakes
bats +legs
–legs
+flight
Homology Homoplasy
Homoplasy
(Reversal)
(Convergence)
Rooted Trees
One sequence (root) defined to be
common ancestor of all other sequences
Unique path leads from root node to any

other node
Direction of path indicates evolutionary

time
Root chosen as a sequence thought to

have branched off earliest
Rooted Trees
If molecular clock hypothesis holds, it is
possible to predict a root
As the number of sequences increase, the

number of possible rooted trees increases
very rapidly
bifurcating binary tree is usually the best

model to simulate evolutionary events
Example Rooted Tree
 Image source:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Unrooted Tree (Star)
Indicates evolutionary relationship without
revealing location of oldest ancestry
Fewer possible unrooted trees than a rooted

tree
Example Unrooted Tree
Image source:
http://www.shef.ac.uk/english/language/quantling/images/quantling1.jp
 Image:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Number of Trees
Unrooted trees Rooted trees
# # pairwise # branches # branches
sequences distances # trees /tree # trees /tree
3 3 1 3 3 4
4 6 3 5 15 6
5 10 15 7 105 8
6 15 105 9 945 10
10 45 2,027,025 17 34,459,425 18
30 435 8.69 × 1036 57 4.95 × 1038 58
N N (N - 1) (2N - 5)! 2N - 3 (2N - 3)! 2N - 2
2 2N - 3 (N - 3)! 2N - 2 (N - 2)!
Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Methods for Determining
Trees
Three main methods:
maximum parsimony
Distance
maximum likelihood
Maximum Parsimony
Predicts evolutionary tree minimizing number
of steps required to generate observed
variation
Multiple sequence alignment must first be

obtained
Maximum Parsimony
For each position, phylogenetic trees
requiring smallest number of evolutionary
changes to produce observed sequence
changes are identified
Trees producing smallest number of

changes for all sequence positions are
identified
Maximum Parsimony
Time consuming algorithm
Only works well if the sequences have a

strong sequence similarity
Maximum Parsimony
Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
four sequences, three possible unrooted trees

Maximum Parsimony
Example
Possible Trees:
3 2 3
1 1 1
2 4 3 4 4 2
Maximum Parsimony
Example
Some sites are informative, others are not
Informative site has same sequence character

in at least two different sequences
Only informative sites are considered

Maximum Parsimony
Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Three informative columns

Maximum Parsimony
Example
1 3 1 2 1 3
1 G G A 2 4 3 4 4 2
2 G G G Column 1
3 A C A 1 3 1 2 1 3
4 A C G 2 4 3 4 4 2
Column 2
Tree 1: 4 1 3 1 2 1 3
Tree 2: 5 2 4 3 4 4 2
Tree 3: 6 Column 3
Is a substitution
Maximum Parsimony
Columns representing greater variation
dominate the analysis
To overcome this look only at transversion

events
Most significant base changes (i.e. changes a
purine to a pyrimidine or vice versa)
Lake’s method of invariants

Distance Method
Looks at number of changes between each
pair in a group of sequences
Goal: identify tree positioning neighbors

correctly that has branch lengths reproducing
original data as closely as possible
Distance Method
CLUSTALW: neighbor-joining method as a
guide to multiple sequence alignments
PHYLIP suite of programs employ

neighbor-joining methods
 http://evolution.genetics.washington.edu/phylip.html
Distance Programs in
Phylip
FITCH: estimates phylogenetic tree assuming
additivity of branch lengths using the Fitch-
Margoliash method
KITSH: same as FITCH, but under the

assumption of a molecular clock
Distance Programs in
Phylip
NEIGHBOR: estimates phylogenies using
either:
neighbor-joining (no molecular clock assumed)

unweighted pair group method with arithmetic
mean (UPGMA) (molecular clock assumed)
Distance Analysis
distance score counted as:
number of mismatched positions in alignment
number of sequence positions changed to
generate the second sequence
Success depends on degree the distances

are additive on a predicted evolutionary
tree
Example of Distance
Analysis
Consider the alignment:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
Example of Distance
Analysis
Distances can be shown as a table
Example of Distance
Analysis
Using this information, a tree can be drawn:
A C
2 1
4
1 2
B D
Fitch and Margoliash
Algorithm (3 sequences)
Distance table used
Sequences combined in threes

define branches of predicted tree
calculate branch lengths of the tree
1) Draw unrooted tree with three branches
originating from common node:
A a
C
c
B b
1) Calculate lengths of tree branches
algebraically:
 distance from A to B = a + b = 22 (1)
 distance from A to C = a + c = 39 (2)
 distance from B to C = b + c = 41 (3)

 subtracting (3) from (2) yields:

 b + c = 41
 -a – c = -39
 __________
 b – a = 2 (4)

 adding (1) and (4) yields 2b = 24; b = 12
 so a + 12 = 22; a = 10
 10 + c = 39; c = 29
3) Resulting tree:
A 10
C
29
12
B
Algorithm can be extended to more
sequences. Consider the distances:
C
A c
a f
D
d
b g
B
e E
Locate most closely related sequences
create a new table by combining
remaining sequences
Distance from D to A,B,C: average distance of

each of these to D ( (39 + 41 + 18) / 3 = 32.7)
Distance from E to A,B,C: average distance of

each of these to E ((41+43+20)/3 = 34.7).
average distance from D to ABC and E to ABC
can be found by averaging the sum of the
appropriate branch lengths:
D to E: d + e = 10 (1)
D to ABC: d + m = 32.7
where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)

By subtracting the third equation from
the second equation we get: C
A c
d – e = -2 a f
Adding this result to (1) we get: 2d = 8; d = 4 D
d
Substitute back in to get e = 6 b g
B
e E
Treat DE as a single sequence
Create new distance table
Distance from A to DE is average of A to D

and A to E (similar for other distances as
well)
C
A c
a f 9
D
d
b g 5 4
B 6
e E
By algebra, c = 9; g = 5
Continue process until all lengths are found:
C
A c
a 10 f 9
D
20 d
b g5 4
B 12 6
e E
Summary of Fitch-
Margoliash
1) Find the mostly closely related pairs of
sequences (A, B).
2) Treat the rest of the sequences as a

composite. Calculate the average distance
from A to all others; and from B to all
others.
3) Use these values to calculate the

length of the edges a and b.
Summary of Fitch-
Margoliash
4) Treat A and B as a composite.
Calculate the average distances between
AB and each of the other sequences.
Create a new distance table.
5) Identify next pair of related sequences

and begin as with step 1.
6) Subtract extended branch lengths to

calculate lengths of intermediate
branches.
Summary of Fitch-
Margoliash
7) Repeat entire process with all possible
pairs of sequences.
8) Calculate predicted distances between

each pair of sequences for each tree to find
the best tree.
Neighbor Joining
Similar to Fitch-Margoliash
Sequences chosen to give best least-squares

estimate of branch length
Neighbor Joining
Begin with star topology – no neighbors have
been joined
B C
A
E
Neighbor Joining
Tree modified by joining pairs of
sequences
Pair is chosen by calculating sum of

branch lengths for the corresponding tree:
∑ d im d mn ∑ d ij
+ d in
S mn = + +
2( N − 2) 2 N −2
Neighbor Joining
If A and B are joined:
B C
A
E
∑ d im d mn ∑ d ij
+ d in
S mn = + +
2( N − 2) 2 N −2
Neighbor Joining
Pair with smallest branch length chosen to
be joined
Fitch-Margoliash algorithm used to

compute actual branch lengths
a new distance table is created with joined

sequences entered as a composite
Neighbor Joining
Repeat process to select next pair to join
Process continues until correctly branched

tree and distances identified
UPGMA
“Unweighted Pair Group Method Using
Arithmetic Averages”
Works by clustering sequences
Produces Rooted Trees by assembling the

tree upwards beginning with most similar
sequences
UPGMA
distance dij between clusters Ci and Cj is
average distance between pairs of
sequences from each cluster:

1
d ij = ∑ d pq
| Ci || C j | pinCi , qinC j
|Ci| and |Cj| are the number of sequences

in clusters i and j
UPGMA
 INITIALIZATION STEPS
Assign each sequence i to its own cluster Ci
Define one leaf of the tree T for each

sequence, and place it at height 0.
UPGMA
Iteration steps
3. Determine the two clusters, i and j for

which dij is minimal
4. Define a new cluster k by Ck = Ci ∪ Cj, and

define dkl for all l
UPGMA
Iteration steps
5. Define a node k with daughter nodes i

and j, and place it at height dij/2.
6. Add k to the current clusters and
remove i and j.

7. Continue steps 3-6 until only two
clusters i and j remain, and place the root
of the tree at height dij/2
UPGMA
Example using 5 Sequences:

1 2 6
8
7
3
4
1 2 4 5 3
5
UPGMA
Rooted trees provide distance measures that
can be used with constant molecular clock
See Durbin, et al. 166-167

See Mount, 262-268
Maximum Likelihood
Calculates likelihood of a tree given an
alignment
Trees with least number of changes will be
most likely
Incorporates Jukes-Cantor and Kimura models

of evolution
Maximum Likelihood
Probability of each tree is product of mutation
rates in each branch
Likelihoods given by each column multiplied

to give the likelihood of the tree
Maximum Likelihood
PHYLIP Programs:
DNAML: Maximum likelihood, allows for
different mutation models, unequal rates of
mutation
DNAMLK: Same as DNAML, but assuming

molecular clock
Maximum Likelihood
Disadvantages:
Computationally intensive
Can only be done for a handful of sequences
Assessing significance of
Tree
Bootstrapping methods are used to assess
significance
Given an alignment, randomly pick columns

from alignment with replacement
Apply tree building to new data set

Assessing significance of
Tree
Repeat selection a number of times
Frequency at which phylogenetic feature

appears is measure of confidence for this
feature
Which Method to
Choose?
depends upon the sequences that are being
compared
strong sequence similarity:
 maximum parsimony
clearly recognizable sequence similarity
 distance methods
All others:
 maximum likelihood
Which Method to
Choose?
Best to choose at least two approaches
Compare the results – if they are similar, you

can have more confidence
Neighbor-joining Maximum parsimony Maximum likelihood
Uses only pairwise Uses only shared Uses all data

distances derived characters
Minimizes distance Minimizes total Maximizes tree likelihood

between nearest distance given specific parameter
neighbors values
Very fast Slow Very slow
Easily trapped in local Assumptions fail when Highly dependent on

optima evolution is rapid assumed evolution model
Good for generating Best option when Good for very small data
tentative tree, or choosing tractable (<30 taxa, sets and for testing trees
among multiple trees homoplasy rare) built using other methods

Building Alignments and Trees
Simultaneously
Sankoff & Cedergren gap substitution (using
dynamic programming)
Hein’s affine cost algorithm
Covered in detail in Durbin, et al. (not in this

class)
Difficulties With
Horizontal or lateral transfer of genetic
material (for instance through viruses)
makes it difficult to determine
phylogenetic origin of some evolutionary
events
Genes selective pressure can be rapidly

evolving, masking earlier changes that
had occurred phylogenetically
Difficulties With
two sites within comparative sequences
may be evolving at different rates
Rearrangements of genetic material can

lead to false conclusions

duplicated genes can evolve along
separate pathways, leading to different
functions
HOMEWORK
Read Mount, Chapter 7 (Phylogenetic
Analysis)
Read Durbin, et al., chapters 7

Final 2

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Final 2

Hochgeladen von

Copyright:

Verfügbare Formate

Phylogenet

Course Bio310: Bioinformatics-I

Evolutionary relationships shown as branches

Length and nesting reflects degree of

Sequences most closely related drawn as

Group sequences with similar patterns of

With additional sequences, more information

Follow changes occurring in a rapidly

Sequence data now can be used

organisms, genes, or sequences

Each unit defined by a distinct branch

nodes and branches represent the

Two taxa derived from the same common

Edit Distance (Phylogram)

Homoplasy: identical character due to evolutionary

Unique path leads from root node to any

Direction of path indicates evolutionary

Root chosen as a sequence thought to

As the number of sequences increase, the

bifurcating binary tree is usually the best

Fewer possible unrooted trees than a rooted

Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Multiple sequence alignment must first be

Trees producing smallest number of

Only works well if the sequences have a

four sequences, three possible unrooted trees

Informative site has same sequence character

Only informative sites are considered

Three informative columns

To overcome this look only at transversion

Lake’s method of invariants

Goal: identify tree positioning neighbors

PHYLIP suite of programs employ

KITSH: same as FITCH, but under the

neighbor-joining (no molecular clock assumed)

Success depends on degree the distances

Sequences combined in threes

Distance from D to A,B,C: average distance of

Distance from E to A,B,C: average distance of

Create new distance table

Distance from A to DE is average of A to D

2) Treat the rest of the sequences as a

3) Use these values to calculate the

5) Identify next pair of related sequences

6) Subtract extended branch lengths to

8) Calculate predicted distances between

Sequences chosen to give best least-squares

Pair is chosen by calculating sum of

Fitch-Margoliash algorithm used to

a new distance table is created with joined

Process continues until correctly branched

Works by clustering sequences

Produces Rooted Trees by assembling the

|Ci| and |Cj| are the number of sequences

Assign each sequence i to its own cluster Ci

Define one leaf of the tree T for each

3. Determine the two clusters, i and j for

4. Define a new cluster k by Ck = Ci ∪ Cj, and

5. Define a node k with daughter nodes i

See Durbin, et al. 166-167

Incorporates Jukes-Cantor and Kimura models

Likelihoods given by each column multiplied

DNAMLK: Same as DNAML, but assuming