
Seminar

 A biological sequence is a single continuous molecule of nucleic acid or protein
 Sequences are compared to:
 Find patterns of variability or conservation
 Find common motifs
 Find similarity with sequences in a database
 Infer evolution from the same sequence
 Sequence alignment in bioinformatics means arranging DNA, RNA or protein sequences to identify regions of similarity
 Uses of finding similarity in sequences
 assess gene and protein homology
 predict biological function and secondary and tertiary protein structure
 detect point mutations
 construct evolutionary trees
 classify genes and proteins, etc.
 q = GLISVT and d = GIVT are two sequences that evolved from an ancestor h = GLVST
h: GLVST
 V→I gives GLIST; then inserting V gives q: GLISVT
 deleting S gives GLVT; then L→I gives d: GIVT
 h = GLVST, q = GLISVT and d = GIVT
q’: GLISVT
d’: GIV--T
‘-’ means a deletion or insertion (one or more consecutive blanks form a gap)
Because h is known, this is the ‘true’ alignment
Alignment becomes important when the evolutionary history (here the sequence h) is not known
 What is an Alignment?
 An alignment of two sequences q and d must satisfy the following constraints
 • All symbols (residues) in q and d have to be in the alignment, and in the same order as they appear in q and d.
 • We can align one symbol from q with one from d.
 • A symbol can be aligned with a blank, written as ‘-’.
 • Two blanks cannot be aligned.
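The constraints above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_valid_alignment` and its parameters are hypothetical and chosen only for illustration.

```python
def is_valid_alignment(q, d, q_aln, d_aln, gap="-"):
    """Return True if (q_aln, d_aln) is a valid alignment of q and d."""
    # Both aligned rows must have the same number of columns.
    if len(q_aln) != len(d_aln):
        return False
    # Every residue must appear, in its original order: removing the
    # blanks has to give back the input sequences exactly.
    if q_aln.replace(gap, "") != q or d_aln.replace(gap, "") != d:
        return False
    # No column may pair a blank with a blank.
    return all(not (a == gap and b == gap) for a, b in zip(q_aln, d_aln))


ok = is_valid_alignment("GLISVT", "GIVT", "GLISVT", "GIV--T")
bad = is_valid_alignment("GLISVT", "GIVT", "GLISVT-", "GIV--T-")
```

With the example sequences, `ok` is true, while `bad` is false because its last column aligns two blanks.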
 Dot plots
 Direct qualitative method

 Sequence alignment
 Exact quantitative method
 Involves
 Construction of best alignment between sequences
 Assessment of similarity from the alignment

 Focuses mainly on pairwise alignment


 Mismatches in a sequence alignment correspond to point mutations

 Gaps correspond to indels (insertions or deletions)

 E.g. h = GLVST, q = GLISVT and d = GIVT
q’: GLISVT
d’: GIV--T
 Long sequences cannot be aligned by hand

 Computational approaches
 Global alignment
 Local alignment
 Global alignment
 Forced to span the entire length of all sequences
 Tries to align every residue in every sequence
 Useful if sequences are of similar length
 Needs a suitable degree of similarity throughout

 E.g. Needleman-Wunsch algorithm (NW algorithm)
 Local alignment
 Identifies regions of similarity within long sequences that may be widely divergent overall
 More overhead required
 Often preferable, but more difficult to calculate
 Suited to sequences of different lengths and lower similarity

 E.g. Smith-Waterman algorithm (SW algorithm)


Wikipedia.org
 Global and local alignment of two
sequences (DNA or amino acid sequences)

 Methods used
 Dot matrix
 Dynamic programming
 Word methods
 Global alignment via Needleman-Wunsch
algorithm (NW algorithm)
 Local alignment via the Smith-Waterman
algorithm (SW algorithm)
 Mathematical framework
 Substitution Matrix
 Gap penalty
 Alignment score
 Alignment score: goodness of alignment
 Substitution score + gap penalty
 E.g. +1 is used for a match and -1 as the penalty for a mismatch. Gaps are ignored. Alignment of two sequences ATGGCGT and ATGAGT:
 Best alignment “by eye”
ATGGCGT
ATG-AGT score: +1 +1 +1 +0 -1 +1 +1 = 4
 An alternative alignment
ATGGCGT
A-TGAGT score: +1 +0 -1 +1 -1 +1 +1 = 2
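The arithmetic of this example can be reproduced with a short sketch. The function name `simple_score` is hypothetical; it assumes the scheme stated on the slide (+1 match, -1 mismatch, gap columns contribute 0).

```python
def simple_score(top, bottom, match=1, mismatch=-1, gap="-"):
    """Score two aligned rows column by column; gap columns are ignored."""
    score = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            continue                      # gaps are ignored in this scheme
        score += match if a == b else mismatch
    return score


by_eye = simple_score("ATGGCGT", "ATG-AGT")       # 4
alternative = simple_score("ATGGCGT", "A-TGAGT")  # 2
```

The "by eye" alignment scores higher, which is why it is preferred under this scheme.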
 Substitution matrix and substitution score
 4×4 matrix for DNA, 20×20 matrix for proteins

       C    T    A    G
  C    2    1   -1   -1
  T    1    2   -1   -1
  A   -1   -1    2    1
  G   -1   -1    1    2

 Gap penalties account for the introduction of a gap (insertion or deletion mutation)
 BLOSUM ("Blocks substitution matrix") for proteins
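As a sketch of how a substitution matrix and a gap penalty combine into an alignment score, the DNA matrix above can be stored as a lookup table. The gap penalty of -2 is an assumed value for illustration only (the slide does not give one), and the names `SUB` and `alignment_score` are hypothetical.

```python
ORDER = "CTAG"
ROWS = [
    [2, 1, -1, -1],    # C
    [1, 2, -1, -1],    # T
    [-1, -1, 2, 1],    # A
    [-1, -1, 1, 2],    # G
]
# Lookup table keyed by (row residue, column residue).
SUB = {(r, c): ROWS[i][j]
       for i, r in enumerate(ORDER)
       for j, c in enumerate(ORDER)}


def alignment_score(top, bottom, sub=SUB, gap_penalty=-2):
    """Sum substitution scores, charging gap_penalty (assumed) per gap column."""
    total = 0
    for a, b in zip(top, bottom):
        total += gap_penalty if "-" in (a, b) else sub[(a, b)]
    return total


score_1 = alignment_score("ATGGCGT", "ATG-AGT")  # 2+2+2-2-1+2+2 = 7
score_2 = alignment_score("ATGGCGT", "A-TGAGT")  # 2-2-1+2-1+2+2 = 4
```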
 Needleman-Wunsch algorithm: uses dynamic programming
 Two sequences, say X = x1 x2 x3 ... and Y = y1 y2 y3 ...
 The alignment is based on finding the elements of a matrix (say matrix D)
 D(i,j) is the optimal alignment score for the sequence (x1, x2, x3, ..., xi) with (y1, y2, y3, ..., yj)
 Two matrices are used: the score matrix and the traceback matrix

 The algorithm consists of three steps
 1) Initialization of the score matrix
 2) Calculation of scores and filling the traceback matrix
 3) Deducing the alignment from the traceback matrix
 The alignment of two sequences SEND and AND with the BLOSUM62 substitution matrix and a gap opening penalty of -10 (no gap extension)
SEND
 -AND score: -10 + -1 + 6 + 6 = +1
 A-ND score: 1 + -10 + 6 + 6 = +3 (the best)
 AN-D score: 1 + 0 + -10 + 6 = -3
 AND- score: 1 + 0 + 1 + -10 = -8
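The four candidate scores can be reproduced by brute force. This sketch uses only the BLOSUM62 entries that appear in the sums on the slide. The dictionary and function names are hypothetical.

```python
# BLOSUM62 entries used above, keyed (residue in SEND, residue in AND).
B62_SUBSET = {("S", "A"): 1, ("E", "A"): -1, ("E", "N"): 0,
              ("N", "N"): 6, ("N", "D"): 1, ("D", "D"): 6}
GAP = -10


def score(top, bottom):
    """Score an alignment column by column with a flat gap penalty."""
    total = 0
    for a, b in zip(top, bottom):
        total += GAP if "-" in (a, b) else B62_SUBSET[(a, b)]
    return total


# Place the single gap at every possible position in AND.
candidates = ["AND"[:k] + "-" + "AND"[k:] for k in range(4)]
scores = {c: score("SEND", c) for c in candidates}
best = max(scores, key=scores.get)
```

Here `scores` maps `-AND`, `A-ND`, `AN-D` and `AND-` to 1, 3, -3 and -8, and `best` is `A-ND`. Enumerating every placement is only feasible for tiny examples, which is the motivation for dynamic programming.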
 The score and traceback matrices. The cells of the score matrix are labeled C(i, j) where i = 1, 2, ..., N and j = 1, 2, ..., M

        -        S        E        N        D
  -   C(1,1)   C(1,2)   C(1,3)   C(1,4)   C(1,5)
  A   C(2,1)   C(2,2)   C(2,3)   C(2,4)   C(2,5)
  N   C(3,1)   C(3,2)   C(3,3)   C(3,4)   C(3,5)
  D   C(4,1)   C(4,2)   C(4,3)   C(4,4)   C(4,5)
 Initialization (gap penalty = -10)

SCORE MATRIX
        -     S     E     N     D
  -     0   -10   -20   -30   -40
  A   -10
  N   -20
  D   -30

TRACEBACK MATRIX
        -     S     E     N     D
  -   DONE  LEFT  LEFT  LEFT  LEFT
  A   UP
  N   UP
  D   UP
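A minimal sketch of the initialization step, assuming a flat gap penalty as on the slide. The function name `init_matrices` is hypothetical, and the lists are 0-indexed, so the slide's C(1,1) corresponds to `score[0][0]`.

```python
def init_matrices(top, left, gap=-10):
    """Create score and traceback matrices with only the borders filled."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[None] * cols for _ in range(rows)]
    trace = [[None] * cols for _ in range(rows)]
    score[0][0], trace[0][0] = 0, "DONE"
    for j in range(1, cols):      # first row: each step left costs one gap
        score[0][j], trace[0][j] = j * gap, "LEFT"
    for i in range(1, rows):      # first column: each step up costs one gap
        score[i][0], trace[i][0] = i * gap, "UP"
    return score, trace


score, trace = init_matrices("SEND", "AND")
```

Interior cells are left as `None` until the scoring step fills them.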


 D(i,j) = max { D(i-1,j-1) + s(xi, yj),
                 D(i-1,j) + g,
                 D(i,j-1) + g }

 subject to boundary conditions; s(xi, yj) is the substitution score for residues xi and yj, and g is the gap penalty
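Written out in full, the recurrence and one common choice of boundary conditions look as follows. The boundary terms are an assumption here: they correspond to a flat gap penalty g per gap position, which is consistent with an initial row and column of 0, -10, -20, ... when g = -10.

```latex
\begin{aligned}
D(0,0) &= 0, \qquad D(i,0) = i\,g, \qquad D(0,j) = j\,g, \\
D(i,j) &= \max
\begin{cases}
D(i-1,\,j-1) + s(x_i, y_j) \\
D(i-1,\,j) + g \\
D(i,\,j-1) + g
\end{cases}
\qquad i, j \ge 1 .
\end{aligned}
```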
 Scoring starts from cell C(2,2)

The score of any cell C(i,j) is the maximum of:
qdiag = C(i-1, j-1) + S(i,j)
qup = C(i-1, j) + g
qleft = C(i, j-1) + g

 where S(i,j) is the substitution score for the letters at row i and column j (taken from the substitution matrix), and g is the gap penalty

For C(2,2):
qdiag = C(1,1) + S(S,A) = 0 + 1 = 1
qup = C(1,2) + g = -10 + -10 = -20
qleft = C(2,1) + g = -10 + -10 = -20
Score of C(2,2) = qdiag = 1
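The update of a single cell can be sketched as a small helper. The name `cell_update` is hypothetical. On ties it prefers DIAG, then UP, then LEFT, which is one possible convention and an assumption of this sketch.

```python
def cell_update(diag, up, left, s, g):
    """Return (score, direction) for one cell from its three neighbours.

    diag, up, left are the scores of the neighbouring cells, s is the
    substitution score for the two letters, g is the gap penalty.
    """
    candidates = [(diag + s, "DIAG"), (up + g, "UP"), (left + g, "LEFT")]
    # max() keeps the first maximal entry, so ties resolve DIAG > UP > LEFT.
    return max(candidates, key=lambda c: c[0])


c22 = cell_update(diag=0, up=-10, left=-10, s=1, g=-10)    # S vs A
c23 = cell_update(diag=-10, up=-20, left=1, s=-1, g=-10)   # E vs A
```

For the SEND/AND example, `c22` is `(1, "DIAG")` and `c23` is `(-9, "LEFT")`.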
 The value of the cell C(i,j) depends only on the
values of the immediately adjacent northwest
diagonal, up, and left cells

C(i-1,j-1) C(i-1,j)

C(i,j-1) C(i,j)
 C(2,2)

SCORE MATRIX
        -     S     E     N     D
  -     0   -10   -20   -30   -40
  A   -10     1
  N   -20
  D   -30

TRACEBACK MATRIX
        -     S     E     N     D
  -   DONE  LEFT  LEFT  LEFT  LEFT
  A   UP    DIAG
  N   UP
  D   UP

 C(2,3)

SCORE MATRIX
        -     S     E     N     D
  -     0   -10   -20   -30   -40
  A   -10     1    -9
  N   -20
  D   -30

TRACEBACK MATRIX
        -     S     E     N     D
  -   DONE  LEFT  LEFT  LEFT  LEFT
  A   UP    DIAG  LEFT
  N   UP
  D   UP


 Final Matrices

SCORE MATRIX
        -     S     E     N     D
  -     0   -10   -20   -30   -40
  A   -10     1    -9   -19   -29
  N   -20    -9     1    -3   -13
  D   -30   -19    -7     2     3

TRACEBACK MATRIX
        -     S     E     N     D
  -   DONE  LEFT  LEFT  LEFT  LEFT
  A   UP    DIAG  LEFT  LEFT  LEFT
  N   UP    DIAG  DIAG  DIAG  LEFT
  D   UP    UP    DIAG  DIAG  DIAG
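The whole fill step can be sketched in a few lines. The names `B62` and `nw_fill` are hypothetical. The dictionary holds only the standard BLOSUM62 entries needed for SEND versus AND, and ties are resolved in the order DIAG, UP, LEFT, which is an assumption of this sketch.

```python
# BLOSUM62 entries for (residue of AND, residue of SEND).
B62 = {("A", "S"): 1, ("A", "E"): -1, ("A", "N"): -2, ("A", "D"): -2,
       ("N", "S"): 1, ("N", "E"): 0, ("N", "N"): 6, ("N", "D"): 1,
       ("D", "S"): 0, ("D", "E"): 2, ("D", "N"): 1, ("D", "D"): 6}


def nw_fill(top, left, sub, gap):
    """Fill the Needleman-Wunsch score and traceback matrices."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[0] * cols for _ in range(rows)]
    trace = [["DONE"] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j], trace[0][j] = j * gap, "LEFT"
    for i in range(1, rows):
        score[i][0], trace[i][0] = i * gap, "UP"
    for i in range(1, rows):
        for j in range(1, cols):
            s = sub[(left[i - 1], top[j - 1])]
            options = [(score[i - 1][j - 1] + s, "DIAG"),
                       (score[i - 1][j] + gap, "UP"),
                       (score[i][j - 1] + gap, "LEFT")]
            # First maximal option wins, so ties resolve DIAG > UP > LEFT.
            score[i][j], trace[i][j] = max(options, key=lambda o: o[0])
    return score, trace


score, trace = nw_fill("SEND", "AND", B62, gap=-10)
final_score = score[-1][-1]   # optimal global alignment score
```

The bottom-right cell holds the optimal global score, 3, the same value obtained earlier for the alignment A-ND.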


 Traceback: the process of deducing the best alignment from the traceback matrix
 Starts from the bottom-right cell of the traceback matrix
 Follow the traceback direction written in each cell

TRACEBACK MATRIX
        -     S     E     N     D
  -   DONE  LEFT  LEFT  LEFT  LEFT
  A   UP    DIAG  LEFT  LEFT  LEFT
  N   UP    DIAG  DIAG  DIAG  LEFT
  D   UP    UP    DIAG  DIAG  DIAG
 The best alignment: the alignment is deduced from the cells along the traceback path, by taking into account the value of each cell in the traceback matrix:
 diag: the letters from the two sequences are aligned;
 left: a gap is introduced in the left sequence;
 up: a gap is introduced in the top sequence
Sequences are aligned backwards.
 Thus the best alignment:
SEND
A-ND
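The traceback walk can be sketched directly on the traceback matrix of the SEND/AND example, entered here by hand. The names `TRACE` and `traceback` are hypothetical.

```python
TRACE = [
    ["DONE", "LEFT", "LEFT", "LEFT", "LEFT"],   # -
    ["UP",   "DIAG", "LEFT", "LEFT", "LEFT"],   # A
    ["UP",   "DIAG", "DIAG", "DIAG", "LEFT"],   # N
    ["UP",   "UP",   "DIAG", "DIAG", "DIAG"],   # D
]


def traceback(trace, top, left):
    """Follow the directions from the bottom-right cell back to DONE."""
    i, j = len(left), len(top)
    top_aln, left_aln = [], []
    while trace[i][j] != "DONE":
        move = trace[i][j]
        if move == "DIAG":            # align the two letters
            top_aln.append(top[j - 1])
            left_aln.append(left[i - 1])
            i, j = i - 1, j - 1
        elif move == "LEFT":          # gap in the left sequence
            top_aln.append(top[j - 1])
            left_aln.append("-")
            j -= 1
        else:                         # "UP": gap in the top sequence
            top_aln.append("-")
            left_aln.append(left[i - 1])
            i -= 1
    # The alignment was built backwards, so reverse both rows.
    return "".join(reversed(top_aln)), "".join(reversed(left_aln))


aligned_top, aligned_left = traceback(TRACE, "SEND", "AND")
```

The path visited is DIAG, DIAG, LEFT, DIAG, giving `SEND` over `A-ND`.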
 Smith-Waterman algorithm: based on dynamic programming
 Modifications to the boundary conditions
 Negative-scoring matrix cells are set to 0
 Only positive scores remain, rendering local alignments visible
 Backtracking starts from the highest-scoring matrix cell and continues until a zero score is encountered
 Gap penalties needed
 Algorithm expansion
 Based on finding matrix H, built as follows
 H(i,0) = 0, 0 ≤ i ≤ m
 H(0,j) = 0, 0 ≤ j ≤ n
 H(i,j) = max { 0,
                 H(i-1,j-1) + w(ai,bj) (match/mismatch),
                 H(i-1,j) + w(ai,-) (deletion),
                 H(i,j-1) + w(-,bj) (insertion)
               }, 1 ≤ i ≤ m, 1 ≤ j ≤ n
where a, b = sequences; m = length(a); n = length(b); w(c,d) is the scoring scheme, with c and d each either an element of the sequences or the gap symbol ‘-’.
SW algorithm for the local alignment of ACACACTA and AGCACACA
w(match) = +2
w(a,-) = w(-,b) = w(mismatch) = -1

       -    A    C    A    C    A    C    T    A
  -    0    0    0    0    0    0    0    0    0
  A    0    2    1    2    1    2    1    0    2
  G    0    1    1    1    1    1    1    0    1
  C    0    0    3    2    3    2    3    2    1
  A    0    2    2    5    4    5    4    3    4
  C    0    1    4    4    7    6    7    6    5
  A    0    2    3    6    6    9    8    7    8
  C    0    1    4    5    8    8   11   10    9
  A    0    2    3    6    7   10   10   10   12
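The matrix for this example can be recomputed with a short sketch. The function name `sw_matrix` is hypothetical. Rows follow AGCACACA and columns follow ACACACTA, as in the table above.

```python
def sw_matrix(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix H (rows: b, columns: a)."""
    H = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    for i in range(1, len(b) + 1):
        for j in range(1, len(a) + 1):
            s = match if b[i - 1] == a[j - 1] else mismatch
            H[i][j] = max(0,                      # never go below zero
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # step down
                          H[i][j - 1] + gap)      # step right
    return H


H = sw_matrix("ACACACTA", "AGCACACA")
highest = max(max(row) for row in H)
```

The highest score is 12, in the bottom-right cell, which is where backtracking starts.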
 The best local alignment: the traceback path is found in the same way as in the NW algorithm. A diagonal jump implies an aligned pair (either a match or a mismatch). A top-down jump implies a deletion. A left-right jump implies an insertion. Thus for the example, we get:
 Sequence 1 = A-CACACTA
 Sequence 2 = AGCACAC-A
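Putting the fill and the backtracking together gives a compact sketch of the whole local alignment. The function name `sw_align` is hypothetical. It starts from the first highest-scoring cell it finds and checks the diagonal move before the vertical and horizontal ones, both of which are assumptions of this sketch.

```python
def sw_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for one local alignment."""
    rows, cols = len(b) + 1, len(a) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrack from the highest cell until a zero score is reached.
    i, j = best_pos
    a_aln, b_aln = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if b[i - 1] == a[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal jump
            a_aln.append(a[j - 1])
            b_aln.append(b[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:      # top-down jump
            a_aln.append("-")
            b_aln.append(b[i - 1])
            i -= 1
        else:                                   # left-right jump
            a_aln.append(a[j - 1])
            b_aln.append("-")
            j -= 1
    return best, "".join(reversed(a_aln)), "".join(reversed(b_aln))


best, seq1, seq2 = sw_align("ACACACTA", "AGCACACA")
```

For the example this yields a score of 12 with `A-CACACTA` aligned to `AGCACAC-A`.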
 Global alignments (NW algorithm)
 Requires the alignment score for a pair of residues to be ≥ 0
 No gap penalty required
 Score cannot decrease between two cells of a pathway
 Local alignments (SW algorithm)
 Residue alignment score may be positive or negative
 Requires a gap penalty to work effectively
 Score can increase, decrease or stay level between two cells of a pathway
 [1] Technology Blog, http://technology66.blogspot.com/2008/08/sequence-alignment-techniques.html
 [2] Saul B. Needleman and Christian D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Dept. of Biochemistry, Northwestern University, July 1969
 [3] Wikipedia, http://en.wikipedia.org/wiki/File:Zinc-finger-seq-alignment2.png
 [4] Felix Autenrieth, Barry Isralewitz, Zaida Luthey-Schulten, Anurag Sethi, Taras Pogorelov, Bioinformatics and Sequence Alignment, University of Illinois at Urbana-Champaign, June 2005
 [5] T. F. Smith, M. S. Waterman, Identification of Common Molecular Subsequences, July 1980
