1) Create a 2-D matrix and populate it with scores representing the similarities of the compared sequences.
2) Accumulate the scores in the matrix and penalize insertions and deletions.
3)
Rows: SEQ1. Columns: SEQ2.
  A I C I N R C K C R H P
A 1 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 0 0 1 0
C 0 0 1 0 0 0 1 0 1 0 0 0
N 0 0 0 0 1 0 0 0 0 0 0 0
I 0 1 0 1 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 0 0 0 1 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 1 0 1 0 0 0
L 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 1 0 1 0 0 0
R 0 0 0 0 0 0 0 0 0 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1
M 0 0 0 0 0 0 0 0 0 0 0 0
The matrix is accumulated moving from the bottom-right corner to the top-left corner.
The maximum previous score, the one added to the value of the current cell (highlighted in red), is the highest score in the row below or the column to the right (highlighted in blue).
A I C I N R C K C R H P
A 8 7 6 6 5 4 3 3 2 2 1 0
H 7 7 6 6 5 4 3 3 2 1 2 0
C 6 6 7 6 5 4 4 3 3 1 1 0
N 6 6 6 5 6 4 3 3 2 1 1 0
I 5 6 5 6 5 4 3 3 2 1 1 0
R 4 4 4 4 4 5 3 3 2 2 1 0
V 4 4 4 4 4 4 3 3 2 1 1 0
S 4 4 4 4 4 4 3 3 2 1 1 0
G 4 4 4 4 4 4 3 3 2 1 1 0
V 4 4 4 4 4 4 3 3 2 1 1 0
C 3 3 4 3 3 3 4 3 3 1 1 0
L 3 3 3 3 3 3 3 3 2 1 1 0
C 2 2 3 2 2 2 3 2 3 1 1 0
R 1 1 1 1 1 1 1 1 1 2 1 0
P 0 0 0 0 0 0 0 0 0 0 0 1
M 0 0 0 0 0 0 0 0 0 0 0 0
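The accumulation rule described above can be sketched in Python. This is a minimal sketch, assuming identity scoring (1 for identical residues, 0 otherwise) and no gap penalty, which is what the numbers in the accumulated matrix suggest. The function names `match_matrix` and `accumulate` are hypothetical, chosen only for this illustration.

```python
def match_matrix(seq1, seq2):
    """Score 1 where residues are identical, 0 elsewhere (rows: seq1, columns: seq2)."""
    return [[1 if a == b else 0 for b in seq2] for a in seq1]


def accumulate(scores):
    """Accumulate scores from the bottom-right corner to the top-left corner.

    Each cell receives its own score plus the highest accumulated score found
    in the row below (columns to its right) or in the column to its right
    (rows below it). Assumption: no gap penalty is applied.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    acc = [row[:] for row in scores]  # last row and last column keep their own scores
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            next_row = acc[i + 1][j + 1:]
            next_col = [acc[k][j + 1] for k in range(i + 1, n_rows)]
            acc[i][j] += max(next_row + next_col)
    return acc


seq1 = "AHCNIRVSGVCLCRPM"
seq2 = "AICINRCKCRHP"
acc = accumulate(match_matrix(seq1, seq2))
print(acc[0][0])  # best overall score for these two sequences: 8
```

The top-left cell ends up holding the best overall score, matching the 8 shown in the accumulated matrix.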
[Figure: amino acids grouped by property (tiny, polar, charged, hydrophobic, aromatic).]
Matrix representing probabilities of amino acid substitutions. This and other existing matrices can be used to build more accurate alignments of two sequences.
Similarity-based database searches generate local alignments to find, within a sequence database, sequences related to the query sequence.
Given a query sequence, local alignments of the query are generated against every sequence in the database. The alignment scores are used to identify sequences that are related to the query sequence.
BLAST is the most common heuristic algorithm used to search sequence databases.
BLAST
The Basic Local Alignment Search Tool
BLAST is a family of related programs that perform a variety of database comparisons.
For example:
Objective: To find high-scoring ungapped alignments between a query sequence and the sequences in a database.
These are called high-scoring segment pairs (HSPs).
The existence of such segments above a given similarity threshold indicates pairwise similarity beyond random chance.
This is used to distinguish related from unrelated sequences in a database.
The Algorithm
FIRST STEP - SEEDING. Generate all words of length k (e.g. k = 2) in the query sequence. Words in our example: QL; NF; SA; GW; LN; FS; AG.
SECOND STEP. Identify all words of the same length in the sequences in the database.
THIRD STEP. Align every seed against every word generated from the database. Calculate (using BLOSUM62 or another matrix) the score of every ungapped two-letter alignment generated in this way. An alignment is considered a MATCH if its score is above a certain threshold (default = 8 for amino acids).
FOURTH STEP. Only matches are extended to generate longer alignments. If no match is found between two sequences, that pair is not considered any further, which saves time. If multiple matches are found between two sequences, all of them are extended. The extension of a match continues until mismatches cause the alignment score to drop below a given threshold (22 for proteins, 20 for DNA).
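The first three steps can be sketched as follows. This is only an illustration under stated assumptions: the query `QLNFSAGW` is inferred from the seven example words, the subject sequence is made up, and `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62 (its values are not BLOSUM62 values). The function names are hypothetical as well.

```python
def words(sequence, k=2):
    """Return every overlapping word of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def find_matches(query, subject, k=2, threshold=8, score=toy_score):
    """Return (query_pos, subject_pos) for word pairs scoring at or above the threshold."""
    hits = []
    for i, q_word in enumerate(words(query, k)):
        for j, s_word in enumerate(words(subject, k)):
            word_score = sum(score(a, b) for a, b in zip(q_word, s_word))
            if word_score >= threshold:
                hits.append((i, j))
    return hits


query = "QLNFSAGW"   # assumed query; it yields the seven example words
subject = "AANFSA"   # made-up database sequence
print(words(query))  # ['QL', 'LN', 'NF', 'FS', 'SA', 'AG', 'GW']
print(find_matches(query, subject))  # [(2, 2), (3, 3), (4, 4)]
```

With the toy scores only exact two-letter matches (score 10) pass the threshold of 8; a real substitution matrix would also let similar, non-identical words count as matches.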
Extending a match
Every MATCH (an alignment with a minimal score of 8) is extended until the best extension, the alignment of maximal score, is found.
Initial Match (or Hit)
AGT PYN NGT NNT LTW HKR RRR K
TAG PYN NGT NNT LTW KHK KKK R
Extend while the score of the alignment increases. Keep extending until the score drops below 22.
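The extension of a match can be sketched as an ungapped extension in both directions. Two assumptions are made here: the threshold is read as a drop-off, meaning extension stops once the running score falls more than `drop_off` below the best score seen so far, and `toy_score` is a hypothetical stand-in for a substitution matrix. The function names and example sequences are also hypothetical.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def extend_right(query, subject, q_pos, s_pos, score, drop_off=22):
    """Extend without gaps to the right of q_pos / s_pos.

    Returns (gain, length): the extra score of the best extension and the
    number of residue pairs needed to reach it.
    """
    best_gain = gain = 0
    best_len = 0
    steps = min(len(query) - q_pos, len(subject) - s_pos)
    for step in range(steps):
        gain += score(query[q_pos + step], subject[s_pos + step])
        if gain > best_gain:
            best_gain, best_len = gain, step + 1
        elif best_gain - gain > drop_off:
            break  # score fell too far below the best seen: stop extending
    return best_gain, best_len


def extend_match(query, subject, q_start, s_start, k, score, drop_off=22):
    """Extend a word match of length k in both directions.

    Returns the total score and the (start, end) span in the query.
    """
    seed = sum(score(a, b) for a, b in
               zip(query[q_start:q_start + k], subject[s_start:s_start + k]))
    right_gain, right_len = extend_right(
        query, subject, q_start + k, s_start + k, score, drop_off)
    # Extending to the left is extending to the right on the reversed prefixes.
    left_gain, left_len = extend_right(
        query[:q_start][::-1], subject[:s_start][::-1], 0, 0, score, drop_off)
    total = seed + left_gain + right_gain
    return total, (q_start - left_len, q_start + k + right_len)


total, (start, end) = extend_match("QLNFSAGW", "KKNFSAKK", 2, 2, 2, toy_score)
print(total, "QLNFSAGW"[start:end])  # 20 NFSA
```

Keeping only the best-scoring extension, rather than everything scanned before the stop, is what yields the alignment of maximal score.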
Interpreting BLAST
The output of BLAST provides a list of pairwise sequence matches ranked by the statistical significance of the scores of their HSPs.
In BLAST the statistical indicator is the E-value (NOT to be confused with a P-value, see below).
E-values (expectation values) express how many HSPs with at least a given score are expected to be observed by chance alone in a database of a given size.
E = m * n * P
m = total number of residues in the database
n = number of residues in the query sequence
P = the probability that an HSP alignment is a result of random chance (this is the probability of the alignment)
Interpretations of E-values
E <= 1e-50: Extremely high sequence similarity. Very close homologs.
1e-50 < E < 1e-8: Significantly high similarity. Surely homologous.
1e-7 < E < 1e-2 (0.01): Sequences similar but not necessarily homologous. If they are homologous, they are distant homologs.
0.01 < E < 10: Match not significant.
As a rule of thumb: E <= 1e-8 is significant.
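As a worked example, the formula E = m * n * P and the interpretation ranges can be combined in a few lines. This is a sketch, not a BLAST implementation. The database size, query length and probability are made-up numbers, and the ranges are closed at their upper bounds (an assumption, since the ranges as stated leave the boundaries open). The function names are hypothetical.

```python
def e_value(m, n, p):
    """E = m * n * P, with m residues in the database and n in the query."""
    return m * n * p


def interpret(e):
    """Map an E-value to the rule-of-thumb categories (upper bounds inclusive)."""
    if e <= 1e-50:
        return "extremely high similarity, very close homologs"
    if e <= 1e-8:
        return "significantly high similarity, homologous"
    if e <= 1e-2:
        return "similar, but not necessarily homologous"
    return "match not significant"


# Made-up values: 1,000,000 residues in the database, a 250-residue query.
e = e_value(m=1_000_000, n=250, p=1e-12)
print(e, interpret(e))   # roughly 2.5e-4: similar, but not necessarily homologous
print(interpret(4.2))    # match not significant
```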
E = 4.2
If we can identify a conserved motif, we have learned something useful about the protein family under consideration.
Motifs generally have functional and/or structural relevance.
GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****
TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****
Easy
Sometimes used to illustrate the dissimilarity or similarity between a group of sequences.
Alignments can be treated as models that can be used to test hypotheses.
1) Given a set of sequences, the first step of a multiple sequence alignment is calculating the pairwise distances between the sequences.
2) The pairwise distances are used to build a guide tree, which is used as a guide to perform the multiple sequence alignment.
3) Using the guide tree, sequences are aligned starting from the two most similar. More distantly related sequences are progressively added.
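The first step, computing pairwise distances, can be illustrated with a deliberately simple distance measure. This sketch assumes the sequences already have equal length and compares them position by position (the fraction of differing positions); an actual aligner derives distances from pairwise alignments instead. The names `p_distance` and `distance_matrix` and the sequence labels are hypothetical.

```python
def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Return {(name_a, name_b): distance} for every unordered pair."""
    names = list(seqs)
    return {
        (p, q): p_distance(seqs[p], seqs[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


seqs = {"s1": "GCGGCCCA", "s2": "GCGTTCCA", "s3": "GCGGCGCA"}
dist = distance_matrix(seqs)
print(dist)
# {('s1', 's2'): 0.25, ('s1', 's3'): 0.125, ('s2', 's3'): 0.375}
```

Here s1 and s3 are the closest pair, so they would be the first two sequences to be aligned.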
This guide tree gives the order in which the progressive alignment will be carried out.
This alignment is then fixed and will never change. If a gap is introduced subsequently, it is introduced at the same place in both sequences, so their relative alignment remains unchanged.
CLUSTAL W
Quick pairwise alignment: calculate distance matrix
               1    2    3    4    5
Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -
Neighbor-joining tree (guide tree)
[Figure: guide tree joining sequences 1 to 5.]
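The way a distance matrix determines the order of the progressive alignment can be sketched with the five distances above. Note the swap: this sketch uses simple average-linkage clustering as a stand-in for neighbor-joining, because it is shorter to write down; it only illustrates the "closest first" ordering. The function name `merge_order` is hypothetical.

```python
from itertools import combinations

names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
# Pairwise distances from the matrix above, read column by column.
values = [0.17, 0.59, 0.59, 0.77, 0.60, 0.59, 0.77, 0.13, 0.75, 0.75]
dist = {frozenset(pair): d for pair, d in zip(combinations(names, 2), values)}


def cluster_distance(a, b):
    """Average distance between the members of two clusters."""
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def merge_order(names):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for a, b in merge_order(names):
    print(a, "+", b)
```

With these distances the two alpha globins (distance .13) are joined first, then the two beta globins (.17), then the two globin pairs, and finally myoglobin.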
Can be a very good estimate
Can be an impossibly poor estimate
Requires user input and skill