Sie sind auf Seite 1von 85

Phylogenet

ics

Course Bio310: Bioinformatics-I


Waseem Haider Alvi
Lecturer,
Dept. of Biosciences,
COMSATS Institute of Information Technology Islamabad
Phylogenetic Analysis
Overview
Insight into evolution

Evolutionary relationships shown as branches

Length and nesting reflects degree of


similarity between any two items
Phylogenetic Analysis
Relationship among species

IMAGE: http://www.howe.k12.ok.us/~jimaskew/bclasfy.htm
Phylogenetic Analysis
Relationship within a species (HIV-1 vs HIV-2)

IMAGE: http://www-hto.usc.edu/~cbmp/2002/PhylogeneticAnalysis/Distance%20Methods.html
Phylogenetic Analysis
Overview
C-Terminal Motor Kinesin sequences
 http://www.proweb.org/kinesin/BE4_Cterm.html
Phylogenetic Analysis
Overview
Objective: determine branch length and figure
out how the tree should be drawn

Sequences most closely related drawn as


neighboring branches
Phylogenetic Analysis
Overview
Dependent upon good multiple sequence
alignment programs

Group sequences with similar patterns of


substitutions
Phylogenetic Analysis
Overview
Consider two related sequences:
ancestoral sequence can be (partially) derived

With additional sequences, more information


can be gathered
Uses of Phylogenetic
Analysis
Given a set of genes, determine genes
likely to have equivalent functions

Follow changes occurring in a rapidly


changing species

Example: influenza
Study rapidly changing genes
Next year’s strain can be predicted
Flu vaccination can be developed
Tree of Life
Phylogenies study the evolution of a species

 Image: http://microbialgenome.org/primer/tree.html
Tree of Life
Traditionally, morphological (visible features)
characters used to classify organisms
Living organisms
Fossil records

Sequence data now can be used


Tree of Life
Many different resources including:
 NCBI taxonomy web sites
University of Arizona’s tree of life project
NCBI Taxonomy Web Site

 http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html
/
Tree of Life
http://tolweb.org/tree/phylogeny.html
Evolutionary Trees
Two dimensional graph showing evolutionary
relationship among a set of items

organisms, genes, or sequences

Each unit defined by a distinct branch


Evolutionary Trees
leaves represent the units(taxa) being studied

nodes and branches represent the


relationships among the taxa

Two taxa derived from the same common


ancestor share a node
Evolutionary Trees
Two types:
Evolutionary Time
 Assumes molecular clock hypothesis

Edit Distance (Phylogram)


 Number of sequence level changes occurring
 Distance not in direct relation to age
Cladograms Phylograms
measure time. measure change.
sharks

seahorses seahorses
sharks frogs
owls
frogs
Root
Root owls
crocodiles
crocodiles armadillos
armadillos 5% change
50 million years bats
bats
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Ultrametricity Additivity
All tips are an equal Distance between any two tips
distance from the root. equals the total branch length
between them.
X

a a X
b Y
b Root e
e Y c d
Root c d

a=b+c+d+e XY = a + b + c + d + e
In simple scenarios, evolutionary trees are ultrametric and
phylograms are additive.
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Homology: identical character due to shared ancestry
(evolutionary signal)

Homoplasy: identical character due to evolutionary


convergence or reversal (evolutionary noise)

lizards +flight
birds worms
snakes
snakes
rodents lizards
primates rodents
+hair snakes
bats +legs
–legs
+flight
Homology Homoplasy
Homoplasy
(Reversal)
(Convergence)
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Rooted Trees
One sequence (root) defined to be
common ancestor of all other sequences

Unique path leads from root node to any


other node

Direction of path indicates evolutionary


time

Root chosen as a sequence thought to


have branched off earliest
Rooted Trees
If molecular clock hypothesis holds, it is
possible to predict a root

As the number of sequences increase, the


number of possible rooted trees increases
very rapidly

bifurcating binary tree is usually the best


model to simulate evolutionary events
Example Rooted Tree

 Image source:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Unrooted Tree (Star)
Indicates evolutionary relationship without
revealing location of oldest ancestry

Fewer possible unrooted trees than a rooted


tree
Example Unrooted Tree

Image source:
http://www.shef.ac.uk/english/language/quantling/images/quantling1.jp
 Image:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Number of Trees
Unrooted trees Rooted trees
# # pairwise # branches # branches
sequences distances # trees /tree # trees /tree
3 3 1 3 3 4
4 6 3 5 15 6
5 10 15 7 105 8
6 15 105 9 945 10
10 45 2,027,025 17 34,459,425 18
30 435 8.69 × 1036 57 4.95 × 1038 58
N N (N - 1) (2N - 5)! 2N - 3 (2N - 3)! 2N - 2
2 2N - 3 (N - 3)! 2N - 2 (N - 2)!

Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt


Methods for Determining
Trees
Three main methods:
maximum parsimony
Distance
maximum likelihood
Maximum Parsimony
Predicts evolutionary tree minimizing number
of steps required to generate observed
variation

Multiple sequence alignment must first be


obtained
Maximum Parsimony
For each position, phylogenetic trees
requiring smallest number of evolutionary
changes to produce observed sequence
changes are identified

Trees producing smallest number of


changes for all sequence positions are
identified
Maximum Parsimony
Time consuming algorithm

Only works well if the sequences have a


strong sequence similarity
Maximum Parsimony
Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

four sequences, three possible unrooted trees


Maximum Parsimony
Example
Possible Trees:

3 2 3
1 1 1

2 4 3 4 4 2
Maximum Parsimony
Example
Some sites are informative, others are not

Informative site has same sequence character


in at least two different sequences

Only informative sites are considered


Maximum Parsimony
Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

Three informative columns


Maximum Parsimony
Example
1 3 1 2 1 3
1 G G A 2 4 3 4 4 2
2 G G G Column 1
3 A C A 1 3 1 2 1 3

4 A C G 2 4 3 4 4 2

Column 2
Tree 1: 4 1 3 1 2 1 3

Tree 2: 5 2 4 3 4 4 2

Tree 3: 6 Column 3
Is a substitution
Maximum Parsimony
Columns representing greater variation
dominate the analysis

To overcome this look only at transversion


events
Most significant base changes (i.e. changes a
purine to a pyrimidine or vice versa)

Lake’s method of invariants


Distance Method
Looks at number of changes between each
pair in a group of sequences

Goal: identify tree positioning neighbors


correctly that has branch lengths reproducing
original data as closely as possible
Distance Method
CLUSTALW: neighbor-joining method as a
guide to multiple sequence alignments

PHYLIP suite of programs employ


neighbor-joining methods

 http://evolution.genetics.washington.edu/phylip.html
Distance Programs in
Phylip
FITCH: estimates phylogenetic tree assuming
additivity of branch lengths using the Fitch-
Margoliash method

KITSH: same as FITCH, but under the


assumption of a molecular clock
Distance Programs in
Phylip
NEIGHBOR: estimates phylogenies using
either:

neighbor-joining (no molecular clock assumed)


unweighted pair group method with arithmetic
mean (UPGMA) (molecular clock assumed)
Distance Analysis
distance score counted as:
number of mismatched positions in alignment
number of sequence positions changed to
generate the second sequence

Success depends on degree the distances


are additive on a predicted evolutionary
tree
Example of Distance
Analysis
Consider the alignment:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
Example of Distance
Analysis
Distances can be shown as a table

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
Example of Distance
Analysis
Using this information, a tree can be drawn:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

A C
2 1
4

1 2
B D
Fitch and Margoliash
Algorithm (3 sequences)
Distance table used

Sequences combined in threes


define branches of predicted tree
calculate branch lengths of the tree
Fitch and Margoliash
Algorithm (3 sequences)
1) Draw unrooted tree with three branches
originating from common node:

A a
C
c
B b
Fitch and Margoliash
Algorithm (3 sequences)
1) Calculate lengths of tree branches
algebraically:
 distance from A to B = a + b = 22 (1)
 distance from A to C = a + c = 39 (2)
 distance from B to C = b + c = 41 (3)
  
 subtracting (3) from (2) yields:
  
 b + c = 41
 -a – c = -39
 __________
 b – a = 2 (4)
  
 adding (1) and (4) yields 2b = 24; b = 12
 so a + 12 = 22; a = 10
 10 + c = 39; c = 29
Fitch and Margoliash
Algorithm (3 sequences)
3) Resulting tree:

A 10

C
29
12
B
Fitch and Margoliash
Algorithm (5 sequences)
Algorithm can be extended to more
sequences. Consider the distances:

C
A c
a f
D
d
b g
B
e E
Fitch and Margoliash
Algorithm (5 sequences)
Locate most closely related sequences
Fitch and Margoliash
Algorithm (5 sequences)
create a new table by combining
remaining sequences

Distance from D to A,B,C: average distance of


each of these to D ( (39 + 41 + 18) / 3 = 32.7)

Distance from E to A,B,C: average distance of


each of these to E ((41+43+20)/3 = 34.7).
Fitch and Margoliash
Algorithm (5 sequences)
average distance from D to ABC and E to ABC
can be found by averaging the sum of the
appropriate branch lengths:
D to E: d + e = 10 (1)
D to ABC: d + m = 32.7
where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)
 
By subtracting the third equation from
the second equation we get: C
A c
d – e = -2 a f
Adding this result to (1) we get: 2d = 8; d = 4 D
d
Substitute back in to get e = 6 b g

B
e E
Fitch and Margoliash
Algorithm (5 sequences)
Treat DE as a single sequence

Create new distance table

Distance from A to DE is average of A to D


and A to E (similar for other distances as
well)
Fitch and Margoliash
Algorithm (5 sequences)
C
A c
a f 9
D
d
b g 5 4
B 6
e E

By algebra, c = 9; g = 5
Fitch and Margoliash
Algorithm (5 sequences)
Continue process until all lengths are found:

C
A c
a 10 f 9
D
20 d
b g5 4
B 12 6
e E
Summary of Fitch-
Margoliash
1) Find the mostly closely related pairs of
sequences (A, B).

2) Treat the rest of the sequences as a


composite. Calculate the average distance
from A to all others; and from B to all
others.

3) Use these values to calculate the


length of the edges a and b.
Summary of Fitch-
Margoliash
4) Treat A and B as a composite.
Calculate the average distances between
AB and each of the other sequences.
Create a new distance table.

5) Identify next pair of related sequences


and begin as with step 1.

6) Subtract extended branch lengths to


calculate lengths of intermediate
branches.
Summary of Fitch-
Margoliash
7) Repeat entire process with all possible
pairs of sequences.

8) Calculate predicted distances between


each pair of sequences for each tree to find
the best tree.
Neighbor Joining
Similar to Fitch-Margoliash

Sequences chosen to give best least-squares


estimate of branch length
Neighbor Joining
Begin with star topology – no neighbors have
been joined

B C

A
E
Neighbor Joining
Tree modified by joining pairs of
sequences

Pair is chosen by calculating sum of


branch lengths for the corresponding tree:

∑ d im d mn ∑ d ij
+ d in
S mn = + +
2( N − 2) 2 N −2
Neighbor Joining
If A and B are joined:
B C

A
E

∑ d im d mn ∑ d ij
+ d in
S mn = + +
2( N − 2) 2 N −2
Neighbor Joining
Pair with smallest branch length chosen to
be joined

Fitch-Margoliash algorithm used to


compute actual branch lengths

a new distance table is created with joined


sequences entered as a composite
Neighbor Joining
Repeat process to select next pair to join

Process continues until correctly branched


tree and distances identified
UPGMA
“Unweighted Pair Group Method Using
Arithmetic Averages”

Works by clustering sequences

Produces Rooted Trees by assembling the


tree upwards beginning with most similar
sequences
UPGMA
distance dij between clusters Ci and Cj is
average distance between pairs of
sequences from each cluster:
 
1
d ij = ∑ d pq
| Ci || C j | pinCi , qinC j

|Ci| and |Cj| are the number of sequences


in clusters i and j
UPGMA
 INITIALIZATION STEPS

Assign each sequence i to its own cluster Ci

Define one leaf of the tree T for each


sequence, and place it at height 0.
UPGMA
Iteration steps

3. Determine the two clusters, i and j for


which dij is minimal

4. Define a new cluster k by Ck = Ci ∪ Cj, and


define dkl for all l
UPGMA
Iteration steps

5. Define a node k with daughter nodes i


and j, and place it at height dij/2.
6. Add k to the current clusters and
remove i and j.
 
7. Continue steps 3-6 until only two
clusters i and j remain, and place the root
of the tree at height dij/2
UPGMA
Example using 5 Sequences:

1 2 6
8
7
3
4
1 2 4 5 3

5
UPGMA
Rooted trees provide distance measures that
can be used with constant molecular clock

See Durbin, et al. 166-167


See Mount, 262-268
Maximum Likelihood
Calculates likelihood of a tree given an
alignment
Trees with least number of changes will be
most likely

Incorporates Jukes-Cantor and Kimura models


of evolution
Maximum Likelihood
Probability of each tree is product of mutation
rates in each branch

Likelihoods given by each column multiplied


to give the likelihood of the tree
Maximum Likelihood
PHYLIP Programs:
DNAML: Maximum likelihood, allows for
different mutation models, unequal rates of
mutation

DNAMLK: Same as DNAML, but assuming


molecular clock
Maximum Likelihood
Disadvantages:
Computationally intensive
Can only be done for a handful of sequences
Assessing significance of
Tree
Bootstrapping methods are used to assess
significance

Given an alignment, randomly pick columns


from alignment with replacement

Apply tree building to new data set


Assessing significance of
Tree
Repeat selection a number of times

Frequency at which phylogenetic feature


appears is measure of confidence for this
feature
Which Method to
Choose?
depends upon the sequences that are being
compared
strong sequence similarity:
 maximum parsimony
clearly recognizable sequence similarity
 distance methods
All others:
 maximum likelihood
Which Method to
Choose?
Best to choose at least two approaches

Compare the results – if they are similar, you


can have more confidence
Neighbor-joining Maximum parsimony Maximum likelihood

Uses only pairwise Uses only shared Uses all data


distances derived characters

Minimizes distance Minimizes total Maximizes tree likelihood


between nearest distance given specific parameter
neighbors values

Very fast Slow Very slow

Easily trapped in local Assumptions fail when Highly dependent on


optima evolution is rapid assumed evolution model

Good for generating Best option when Good for very small data
tentative tree, or choosing tractable (<30 taxa, sets and for testing trees
among multiple trees homoplasy rare) built using other methods

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt


Building Alignments and Trees
Simultaneously
Sankoff & Cedergren gap substitution (using
dynamic programming)
Hein’s affine cost algorithm

Covered in detail in Durbin, et al. (not in this


class)
Difficulties With
Phylogenetic Analysis
Horizontal or lateral transfer of genetic
material (for instance through viruses)
makes it difficult to determine
phylogenetic origin of some evolutionary
events

Genes selective pressure can be rapidly


evolving, masking earlier changes that
had occurred phylogenetically
Difficulties With
Phylogenetic Analysis
two sites within comparative sequences
may be evolving at different rates

Rearrangements of genetic material can


lead to false conclusions
 
duplicated genes can evolve along
separate pathways, leading to different
functions
HOMEWORK
Read Mount, Chapter 7 (Phylogenetic
Analysis)
Read Durbin, et al., chapters 7

Das könnte Ihnen auch gefallen