Example: q = GLISVT, d = GIVT, h = GLVST
q': GLISVT
d': GIV--T
'-' marks a deletion or insertion (a gap of one or more blanks)
h is the known 'true' alignment
Alignment matters most when the evolutionary history
(here sequence h) is not known
What is an Alignment?
An alignment of two sequences q and d must satisfy the
following constraints:
• All symbols (residues) in q and d must appear in the alignment,
in the same order as they appear in q and d.
• One symbol from q can be aligned with one symbol from d.
• A symbol can be aligned with a blank, written as '-'.
• Two blanks cannot be aligned.
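The constraints above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_valid_alignment` is hypothetical, and the two aligned rows are assumed to be plain strings that use '-' for a blank.

```python
def is_valid_alignment(q, d, q_aligned, d_aligned, gap="-"):
    """Return True if the two gapped rows form a valid alignment of q and d."""
    # Both rows must have the same number of columns.
    if len(q_aligned) != len(d_aligned):
        return False
    # Removing the blanks must give back q and d, symbols in original order.
    if q_aligned.replace(gap, "") != q or d_aligned.replace(gap, "") != d:
        return False
    # No column may pair a blank with a blank.
    return all(not (a == gap and b == gap)
               for a, b in zip(q_aligned, d_aligned))


# The example alignment of q = GLISVT and d = GIVT.
print(is_valid_alignment("GLISVT", "GIVT", "GLISVT", "GIV--T"))
```

A row that reorders symbols, or a column made of two blanks, makes the check fail.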
Dot plots
• Direct qualitative method
Sequence alignment
• Exact quantitative method
• Involves construction of the best alignment between sequences
and assessment of similarity from the alignment
• Gaps are treated as indels
Computational approaches
• Global alignment
• Local alignment
Global alignment
• Forced to span the entire length of all sequences
• Tries to align every residue in every sequence
• Useful if the sequences are of similar length
• Needs a suitable degree of similarity throughout
Methods used
• Dot matrix
• Dynamic programming
• Word methods
Global alignment via the Needleman-Wunsch algorithm (NW algorithm)
Local alignment via the Smith-Waterman algorithm (SW algorithm)
Mathematical framework
• Substitution matrix
• Gap penalty
• Alignment score
Alignment score: measures the goodness of an alignment
Alignment score = sum of substitution scores + gap penalties
E.g. +1 is used for a match and -1 as the penalty for
a mismatch. Gaps are ignored (they score 0). Alignment of two
sequences ATGGCGT and ATGAGT:
Best alignment "by eye":
ATGGCGT
ATG-AGT    score: +1 +1 +1 +0 -1 +1 +1 = 4
An alternative alignment:
ATGGCGT
A-TGAGT    score: +1 +0 -1 +1 -1 +1 +1 = 2
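The column-by-column arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch under the scoring stated above (+1 match, -1 mismatch, gaps contribute 0); the function name `alignment_score` and its parameter names are hypothetical.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap_score=0, gap="-"):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            total += gap_score      # gaps are ignored in this example
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("ATGGCGT", "ATG-AGT"))  # 4
print(alignment_score("ATGGCGT", "A-TGAGT"))  # 2
```

Passing a negative `gap_score` turns the same function into a scorer with a linear gap penalty.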
Substitution matrix and substitution score
4×4 matrix for DNA, 20×20 matrix for proteins

      C    T    A    G
C     2    1   -1   -1
T     1    2   -1   -1
A    -1   -1    2    1
G    -1   -1    1    2
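A substitution matrix is simply a lookup table indexed by a pair of residues. The sketch below stores the DNA matrix above as a dictionary and sums the substitution scores of two ungapped sequences; the names `SUBST` and `substitution_score` are hypothetical.

```python
BASES = "CTAG"
ROWS = [
    [2, 1, -1, -1],   # C
    [1, 2, -1, -1],   # T
    [-1, -1, 2, 1],   # A
    [-1, -1, 1, 2],   # G
]
# Map every ordered pair of bases to its score, e.g. SUBST["C", "T"] == 1.
SUBST = {(a, b): ROWS[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def substitution_score(seq_a, seq_b):
    """Sum the substitution scores of two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(SUBST[a, b] for a, b in zip(seq_a, seq_b))


print(substitution_score("ACGT", "ACGT"))  # 8: four identical pairs
print(substitution_score("ACGT", "GTAC"))  # 4: A/G, C/T, G/A, T/C score 1 each
```

In this matrix, the pairs C/T and A/G score higher than the other mismatches, and the table is symmetric.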
Alignment matrix C for SEND (top sequence) and AND (left sequence)

        -        S        E        N        D
-    C(1,1)   C(1,2)   C(1,3)   C(1,4)   C(1,5)
A    C(2,1)   C(2,2)   C(2,3)   C(2,4)   C(2,5)
N    C(3,1)   C(3,2)   C(3,3)   C(3,4)   C(3,5)
D    C(4,1)   C(4,2)   C(4,3)   C(4,4)   C(4,5)

INITIALIZATION (gap penalty = -10)

Score matrix
       -     S     E     N     D
-      0   -10   -20   -30   -40
A    -10
N    -20
D    -30

Traceback matrix
       -      S      E      N      D
-    DONE   LEFT   LEFT   LEFT   LEFT
A    UP
N    UP
D    UP
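The initialization step can be sketched as follows. The function name `init_matrices` is hypothetical, and unfilled cells are represented by `None`, which is a choice made for this sketch only.

```python
def init_matrices(top, left, gap=-10):
    """Build the score and traceback matrices with first row and column set."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[None] * cols for _ in range(rows)]
    trace = [[None] * cols for _ in range(rows)]
    score[0][0], trace[0][0] = 0, "DONE"
    for j in range(1, cols):          # first row: only gaps, reached from the left
        score[0][j], trace[0][j] = j * gap, "LEFT"
    for i in range(1, rows):          # first column: only gaps, reached from above
        score[i][0], trace[i][0] = i * gap, "UP"
    return score, trace


score, trace = init_matrices("SEND", "AND")
print(score[0])                  # [0, -10, -20, -30, -40]
print([row[0] for row in score]) # [0, -10, -20, -30]
```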
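Each cell C(i,j) is computed from its three neighbours
C(i-1,j-1), C(i-1,j) and C(i,j-1):

C(i-1,j-1)   C(i-1,j)
C(i,j-1)     C(i,j)

The first cell to fill is C(2,2).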
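The value of a cell is the best of three candidates: the diagonal neighbour plus the substitution score, or the left or upper neighbour plus the gap penalty. This is the standard Needleman-Wunsch recurrence; the sketch below applies it to one cell. The function name `fill_cell` is hypothetical, ties are resolved in the order DIAG, LEFT, UP (an assumption), and the substitution scores passed in the example are assumed values, since the slides do not name the matrix they use.

```python
def fill_cell(diag, up, left, subst, gap=-10):
    """Return (score, direction) for one cell of the alignment matrix."""
    candidates = [
        (diag + subst, "DIAG"),   # align the two residues
        (left + gap, "LEFT"),     # gap in the left sequence
        (up + gap, "UP"),         # gap in the top sequence
    ]
    # max() keeps the first candidate on ties, so DIAG wins over LEFT over UP.
    return max(candidates, key=lambda c: c[0])


# C(2,2): neighbours 0 (diag), -10 (up), -10 (left); assumed score(A, S) = 1.
print(fill_cell(diag=0, up=-10, left=-10, subst=1))    # (1, 'DIAG')
# C(2,3): neighbours -10 (diag), -20 (up), 1 (left); assumed score(A, E) = -1.
print(fill_cell(diag=-10, up=-20, left=1, subst=-1))   # (-9, 'LEFT')
```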
Filling the matrix

Step 1: C(2,2) = 1, traceback DIAG
Step 2: C(2,3) = -9, traceback LEFT

Completed matrix (score and traceback direction for each cell)

          -             S             E             N             D
-     0 (DONE)    -10 (LEFT)    -20 (LEFT)    -30 (LEFT)    -40 (LEFT)
A   -10 (UP)        1 (DIAG)     -9 (LEFT)    -19 (LEFT)    -29 (LEFT)
N   -20 (UP)       -9 (DIAG)     -1 (DIAG)     -3 (DIAG)    -13 (LEFT)
D   -30 (UP)      -19 (UP)      -11 (DIAG)      2 (DIAG)      3 (DIAG)
Traceback matrix

       -      S      E      N      D
-    DONE   LEFT   LEFT   LEFT   LEFT
A    UP     DIAG   LEFT   LEFT   LEFT
N    UP     DIAG   DIAG   DIAG   LEFT
D    UP     UP     DIAG   DIAG   DIAG

Follow the traceback written in the cells, starting from the
bottom-right cell.
The best alignment: the alignment is deduced from the cells
along the traceback path, by taking into account the value of
each cell in the traceback matrix:
• diag - the letters from the two sequences are aligned;
• left - a gap is introduced in the left sequence;
• up - a gap is introduced in the top sequence.
Sequences are aligned backwards.
Thus the best alignment is:
SEND
A-ND
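Initialization, filling and traceback can be put together in one short function. This is a minimal sketch of the NW procedure with a linear gap penalty, not a reference implementation. The substitution scores in `PAIR_SCORES` are an assumption (BLOSUM62 values for the residue pairs needed here), because the slides do not say which matrix they use, so individual intermediate cells may differ from the slides. The function name and the tie-breaking order (DIAG, LEFT, UP) are also assumptions.

```python
# Assumed substitution scores, keyed as (left residue, top residue).
PAIR_SCORES = {
    ("A", "S"): 1, ("A", "E"): -1, ("A", "N"): -2, ("A", "D"): -2,
    ("N", "S"): 1, ("N", "E"): 0, ("N", "N"): 6, ("N", "D"): 1,
    ("D", "S"): 0, ("D", "E"): 2, ("D", "N"): 1, ("D", "D"): 6,
}


def needleman_wunsch(top, left, subst, gap=-10):
    """Return (score, aligned top row, aligned left row)."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[0] * cols for _ in range(rows)]
    trace = [["DONE"] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j], trace[0][j] = j * gap, "LEFT"
    for i in range(1, rows):
        score[i][0], trace[i][0] = i * gap, "UP"
    # Fill every remaining cell from its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            options = [
                (score[i - 1][j - 1] + subst[left[i - 1], top[j - 1]], "DIAG"),
                (score[i][j - 1] + gap, "LEFT"),
                (score[i - 1][j] + gap, "UP"),
            ]
            score[i][j], trace[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right cell; rows are built backwards.
    i, j = rows - 1, cols - 1
    top_row, left_row = [], []
    while trace[i][j] != "DONE":
        move = trace[i][j]
        if move == "DIAG":            # residues aligned
            top_row.append(top[j - 1])
            left_row.append(left[i - 1])
            i, j = i - 1, j - 1
        elif move == "LEFT":          # gap in the left sequence
            top_row.append(top[j - 1])
            left_row.append("-")
            j -= 1
        else:                         # "UP": gap in the top sequence
            top_row.append("-")
            left_row.append(left[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(top_row)), "".join(reversed(left_row))


print(needleman_wunsch("SEND", "AND", PAIR_SCORES))  # (3, 'SEND', 'A-ND')
```

With these assumed scores the sketch reaches the same final score (3) and the same traceback path as the matrix above.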
Score matrix for local alignment of ACACACTA (top) and AGCACACA (left)

      -    A    C    A    C    A    C    T    A
-     0    0    0    0    0    0    0    0    0
A     0    2    1    2    1    2    1    0    2
G     0    1    1    1    1    1    1    0    1
C     0    0    3    2    3    2    3    2    1
A     0    2    2    5    4    5    4    3    4
C     0    1    4    4    7    6    7    6    5
A     0    2    3    6    6    9    8    7    8
C     0    1    4    5    8    8   11   10    9
A     0    2    3    6    7   10   10   10   12
The best local alignment: the traceback path is found in the
same way as in the NW algorithm. A diagonal jump implies an
alignment (either a match or a mismatch). A top-down jump
implies a deletion. A left-right jump implies an insertion.
Thus for the example, we get:
Sequence 1 = A-CACACTA
Sequence 2 = AGCACAC-A
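The local alignment above can be reproduced with a short SW sketch. The scoring values (match = +2, mismatch = -1, gap = -1) are an assumption: the slides do not state them, but these values reproduce the score matrix shown. The function name `smith_waterman` is hypothetical, and the traceback simply prefers a diagonal move, then a top-down move, then a left-right move.

```python
def smith_waterman(top, left, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned top row, aligned left row, score matrix)."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if left[i - 1] == top[j - 1] else mismatch
            # Unlike NW, a cell never drops below zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top_row, left_row = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if left[i - 1] == top[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:    # diagonal jump
            top_row.append(top[j - 1])
            left_row.append(left[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:    # top-down jump
            top_row.append("-")
            left_row.append(left[i - 1])
            i -= 1
        else:                                         # left-right jump
            top_row.append(top[j - 1])
            left_row.append("-")
            j -= 1
    return best, "".join(reversed(top_row)), "".join(reversed(left_row)), score


best, seq1, seq2, matrix = smith_waterman("ACACACTA", "AGCACACA")
print(best)   # 12
print(seq1)   # A-CACACTA
print(seq2)   # AGCACAC-A
```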
Global alignments (NW algorithm)
• Requires the alignment score for a pair of residues to be >= 0
• No gap penalty required
• Score cannot decrease between two cells of a pathway
Local alignments (SW algorithm)
• Residue alignment score may be positive or negative
• Requires a gap penalty to work effectively
• Score can increase, decrease or stay level between two cells
of a pathway