
So you have a sequence. What now?

The simplest bioinformatic problem:


Let us assume you have an as yet uncharacterised nucleotide sequence that you obtained from a PCR experiment.

Question: How do you characterise (validate) your PCR product?

Answer:
(1) You interrogate a PRIMARY database (e.g. GenBank) and retrieve all the sequences that are significantly similar (i.e. presumed HOMOLOGOUS) to your query. This is done using the BLAST software.
(2) You generate a multiple sequence alignment of the retrieved (homologous) sequences. This is done using ClustalW.

1) Create a 2-D matrix and populate it with scores representing the similarities of the compared sequences.

2) Accumulate the scores in the matrix and penalize insertions and deletions.

3) Identify the highest scoring path in the matrix.
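Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name `similarity_matrix` is made up for this example, and the scoring (1 for identical residues, 0 otherwise) is assumed from the example matrix in these slides. The two sequences are the row and column labels of that matrix.

```python
def similarity_matrix(seq1, seq2):
    """Return a len(seq1) x len(seq2) matrix: 1 for identical residues, else 0."""
    return [[1 if a == b else 0 for b in seq2] for a in seq1]


SEQ1 = "AHCNIRVSGVCLCRPM"  # rows of the matrix
SEQ2 = "AICINRCKCRHP"      # columns of the matrix

matrix = similarity_matrix(SEQ1, SEQ2)

# Print one labelled row per residue of SEQ1.
print("  " + " ".join(SEQ2))
for residue, row in zip(SEQ1, matrix):
    print(residue, " ".join(str(value) for value in row))
```

A substitution matrix such as BLOSUM62 could replace the 1/0 rule without changing the structure of the function.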

Similarity matrix (SEQ1 in rows, SEQ2 in columns):

   A I C I N R C K C R H P
A  1 0 0 0 0 0 0 0 0 0 0 0
H  0 0 0 0 0 0 0 0 0 0 1 0
C  0 0 1 0 0 0 1 0 1 0 0 0
N  0 0 0 0 1 0 0 0 0 0 0 0
I  0 1 0 1 0 0 0 0 0 0 0 0
R  0 0 0 0 0 1 0 0 0 1 0 0
V  0 0 0 0 0 0 0 0 0 0 0 0
S  0 0 0 0 0 0 0 0 0 0 0 0
G  0 0 0 0 0 0 0 0 0 0 0 0
V  0 0 0 0 0 0 0 0 0 0 0 0
C  0 0 1 0 0 0 1 0 1 0 0 0
L  0 0 0 0 0 0 0 0 0 0 0 0
C  0 0 1 0 0 0 1 0 1 0 0 0
R  0 0 0 0 0 0 0 0 0 1 0 0
P  0 0 0 0 0 0 0 0 0 0 0 1
M  0 0 0 0 0 0 0 0 0 0 0 0

The matrix is accumulated moving from the bottom right corner to the top left corner.

The MAX previous score, the one that has to be added to the value of the current cell (red on the slide), is the highest value in the next row or the next column, below and to the right of the current cell (blue on the slide).
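The accumulation rule can be sketched as follows. This is a minimal illustration, not a full alignment program: the function names are invented for this example, scoring is assumed to be 1 for identical residues and 0 otherwise, and no gap penalty is applied (which is what the numbers in the accumulated matrix of these slides imply). For each cell, the "previous" cells are taken to be those in the next row and the next column that lie below and to the right of it.

```python
def similarity_matrix(seq1, seq2):
    """1 for identical residues, 0 otherwise (assumed scoring)."""
    return [[1 if a == b else 0 for b in seq2] for a in seq1]


def accumulate(sim):
    """Accumulate scores from the bottom right corner to the top left corner."""
    n_rows, n_cols = len(sim), len(sim[0])
    acc = [row[:] for row in sim]  # copy, so the input is left untouched
    # The last row and last column have no cells below/right: they keep their scores.
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            next_row = acc[i + 1][j + 1:]                              # row below, to the right
            next_col = [acc[k][j + 1] for k in range(i + 1, n_rows)]  # column to the right, below
            acc[i][j] += max(next_row + next_col)
    return acc


SEQ1 = "AHCNIRVSGVCLCRPM"  # rows
SEQ2 = "AICINRCKCRHP"      # columns

acc = accumulate(similarity_matrix(SEQ1, SEQ2))
print("best total score:", acc[0][0])
```

With strict identity scoring the second R of SEQ1 also matches the first R of SEQ2, a cell that the slide's matrix shows as 0; this changes that single cell only and leaves the top left total unaffected.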

Accumulated matrix:

   A I C I N R C K C R H P
A  8 7 6 6 5 4 3 3 2 2 1 0
H  7 7 6 6 5 4 3 3 2 1 2 0
C  6 6 7 6 5 4 4 3 3 1 1 0
N  6 6 6 5 6 4 3 3 2 1 1 0
I  5 6 5 6 5 4 3 3 2 1 1 0
R  4 4 4 4 4 5 3 3 2 2 1 0
V  4 4 4 4 4 4 3 3 2 1 1 0
S  4 4 4 4 4 4 3 3 2 1 1 0
G  4 4 4 4 4 4 3 3 2 1 1 0
V  4 4 4 4 4 4 3 3 2 1 1 0
C  3 3 4 3 3 3 4 3 3 1 1 0
L  3 3 3 3 3 3 3 3 2 1 1 0
C  2 2 3 2 2 2 3 2 3 1 1 0
R  1 1 1 1 1 1 1 1 1 2 1 0
P  0 0 0 0 0 0 0 0 0 0 0 1
M  0 0 0 0 0 0 0 0 0 0 0 0

[Figure: Venn diagram of amino acid properties. Overlapping sets: Small, Tiny, Aliphatic, Aromatic, Hydrophobic, Polar, Charged.]

Matrix representing probabilities of amino acid substitutions. This and other existing matrices can be used to build more accurate alignments of two sequences.

To search databases we use heuristic, similarity-based algorithms.

Similarity-based database searches generate local alignments to find (within a sequence database) sequences related to the query sequence.

Given a query sequence, local alignments of the query sequence are generated against every sequence in the database. The scores of the alignments are used to identify sequences that are related to the query sequence.

BLAST is the most common heuristic algorithm used to search sequence databases.

BLAST
The Basic Local Alignment Search Tool

BLAST is the standard database search tool.

Developed by Stephen Altschul and colleagues in 1990.

BLAST is a family of related programs that perform a variety of database comparisons, for example blastn (nucleotide query against a nucleotide database) and blastp (protein query against a protein database).

Objective: To find high-scoring ungapped alignments between a query sequence and the sequences in a database.
These are called High-scoring Segment Pairs (HSPs).
The existence of such segments above a given similarity threshold indicates pairwise similarity beyond random chance.

This is used to distinguish related from unrelated sequences in a database.

The Algorithm

Given a query sequence (e.g. QLNFSAGW):

FIRST STEP - SEEDING. Generate all words of length k (e.g. k = 2) in the query sequence. Words in our example: QL, LN, NF, FS, SA, AG, GW.

SECOND STEP. Identify all words in the sequences in the database.

THIRD STEP. Align every seed against every word generated from the database. Calculate (using BLOSUM62 or another matrix) the score of every ungapped two-letter alignment generated in this way. An alignment is considered a MATCH if its score is at least a certain threshold (default = 8 for amino acids).

FOURTH STEP. Only matches are extended to generate longer alignments. If no match is found between two sequences, that pair is not considered any further, which saves time. If multiple matches are found for two sequences, all of them are extended. The extension of a match continues until mismatches cause the alignment score to drop below a given threshold (22 for proteins, 20 for DNA).

The resulting ungapped alignments are the HSPs.
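The seeding and word-scoring steps can be illustrated with a short Python sketch. It is not how the BLAST programs are implemented: the function names are invented for this example, the dictionary holds only the four BLOSUM62 entries needed here (not the full matrix), and a word pair is assumed to be a match when its score is at least the threshold of 8 quoted above.

```python
# A few BLOSUM62 entries, just enough for this example (not the full matrix).
BLOSUM62_SUBSET = {
    ("Q", "Q"): 5,
    ("L", "L"): 4,
    ("Q", "E"): 2,
    ("L", "I"): 2,
}


def words(sequence, k=2):
    """All overlapping words of length k in the sequence (seeding)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pair_score(a, b):
    """Score of aligning residues a and b; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # raises KeyError for pairs outside the subset


def word_score(word1, word2):
    """Score of the ungapped alignment of two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def is_match(word1, word2, threshold=8):
    """A word pair counts as a MATCH if it reaches the threshold (assumption: >=)."""
    return word_score(word1, word2) >= threshold


print(words("QLNFSAGW"))        # seeds of the example query
print(word_score("QL", "QL"))   # identical words
print(is_match("QL", "EI"))     # similar but lower-scoring words
```

In this sketch the seed QL aligned to the database word QL scores 5 + 4 = 9 and would be extended, while QL against EI scores 2 + 2 = 4 and would be discarded.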

Extending a match
Every MATCH (alignment with a minimal score of 8) is extended until we find the best extension (the alignment of maximal score).

Initial match (or hit):

AGT PYN NGT NNT LTW HKR RRR K
TAG PYN NGT NNT LTW KHK KKK R

Extend while the score of the alignment increases. Keep extending until the score drops below 22.

Stop when: score of the current extension < 22.

Interpreting BLAST

The output of BLAST provides a list of pairwise sequence matches ranked by the statistical significance of the scores of their HSPs.

In BLAST the statistical indicator is the E-value (NOT to be confused with a P value, see below).

E-values (expectation values) express how many HSPs of at least a certain score are expected to be observed by chance alone in a database of a given size.

E = m * n * P
m = total number of residues in the database
n = number of residues in the query sequence
P = the probability that an HSP alignment is a result of random chance (THIS IS THE PROBABILITY OF THE ALIGNMENT!)

Calculating E-values: an example

Given:
A query sequence 100 residues long
A database containing 10^12 residues
P = 1 * 10^-20 (of the HSP between 2 sequences)

E-value = 100 * 10^12 * 10^-20 = 10^-6
This will be expressed as 1e-6 in the BLAST output.
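The same arithmetic, written as a small Python function. The function and parameter names are chosen for this example only; the formula is the E = m * n * P given in these slides.

```python
import math


def e_value(db_residues, query_residues, p_alignment):
    """E = m * n * P, with m = database size, n = query length, P = alignment probability."""
    return db_residues * query_residues * p_alignment


e = e_value(db_residues=1e12, query_residues=100, p_alignment=1e-20)

# Floating-point arithmetic may not give exactly 1e-06, so compare with a tolerance.
print(f"E = {e:.0e}")
print(math.isclose(e, 1e-6))
```

Note that E grows linearly with the database size: the same HSP searched against a database ten times larger gets an E-value ten times higher.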

Interpretations of E-values
E <= 1e-50: Extremely high sequence similarity. Very close homologs.
1e-50 < E <= 1e-8: Significantly high similarity. Almost certainly homologous.
1e-8 < E <= 1e-2 (0.01): Sequences similar but not necessarily homologous. If they are homologous, they are distant homologs.
0.01 < E < 10: Match not significant.
Generally speaking, as a rule of thumb: E <= 1e-8 is significant.

Example: E = 4.2 (match not significant)
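The rule of thumb above can be written as a simple lookup. This is only a sketch: the function name and the returned labels are invented for this example, and the boundaries between categories are assumed to be contiguous (each range closed at its upper end).

```python
def interpret_e_value(e):
    """Map an E-value to the rule-of-thumb categories from the slide."""
    if e <= 1e-50:
        return "extremely high similarity, very close homologs"
    if e <= 1e-8:
        return "significantly high similarity, almost certainly homologous"
    if e <= 1e-2:
        return "similar, possibly distant homologs"
    return "not significant"


for value in (1e-60, 1e-20, 1e-6, 4.2):
    print(value, "->", interpret_e_value(value))
```

These cut-offs are guidelines rather than hard limits; borderline hits are usually judged together with alignment length and coverage.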

Proteins can be classified in families

Members of a family generally perform similar (or related) tasks and have specific signatures. They are identified using BLAST.
If we can identify a protein as a member of a well-characterised family, we can generally predict its function.

Signatures of a protein family are referred to as Conserved Motifs.

Conserved motifs can only be identified by building a multiple sequence alignment.

If we can identify a conserved motif, we have learned something useful about the protein family under consideration.
Motifs generally have functional and/or structural relevance.

Understanding motifs is useful for biotechnological purposes:

Proteins with specific functions can be engineered.

Clues about the causes of diseases can be revealed.

Building a multiple sequence alignment can be easy or difficult

Easy:

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****

Difficult due to insertions or deletions (indels):

TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****

Multiple Sequence Alignment - Goals

To generate a concise, information-rich summary of sequence data.

Sometimes used to illustrate the dissimilarity or similarity between a group of sequences.

Alignments can be treated as models that can be used to test hypotheses.

Does this model of events accurately reflect known biological evidence?

Multiple Sequence Alignment with Clustal (Thompson 1996): The Principle

1) Given a set of sequences, the first step of a multiple sequence alignment is calculating the pairwise distances between the sequences.
2) The pairwise distances are used to build a guide tree, which determines the order in which the sequences are aligned.
3) Following the guide tree, sequences are aligned starting from the two most similar. More distantly related sequences are progressively added.
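Step 1 can be illustrated with a very simple distance measure. This is a sketch under stated assumptions: it uses the fraction of differing positions between two sequences of equal length (the p-distance), which is a simplification and not the distance ClustalW itself derives from its pairwise alignments. The function names are invented for this example, and the four short sequences and their labels are used purely as toy input.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def distance_matrix(sequences):
    """Pairwise distances for a {name: sequence} mapping, keyed by name pairs."""
    return {
        (name_a, name_b): p_distance(seq_a, seq_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2)
    }


# Toy input (hypothetical labels).
sequences = {
    "Seq_a": "GCGGCCCA",
    "Seq_b": "GCGTTCCA",
    "Seq_c": "GCGTCCCA",
    "Seq_d": "GCGGCGCA",
}

distances = distance_matrix(sequences)
for pair, distance in distances.items():
    print(pair, distance)
```

The resulting dictionary of distances is the kind of input from which a guide tree would be built in step 2.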

[Figure: four example sequences, Seq_a, Seq_b, Seq_c and Seq_d]

ClustalW - Guide Tree

Generate a Neighbor-Joining guide tree from these pairwise distances.

This guide tree gives the order in which the progressive alignment will be carried out.

ClustalW - First pair

Align the two most closely related sequences first.

This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.

CLUSTAL W

Quick pairwise alignment: calculate distance matrix

                  1     2     3     4
1  Hbb_Human
2  Hbb_Horse    .17
3  Hba_Human    .59   .60
4  Hba_Horse    .59   .59   .13
5  Myg_Whale    .77   .77   .75   .75

Neighbor-joining tree (guide tree)

[Figure: guide tree joining Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]

Progressive alignment following guide tree

1  Hbb_Human   PEEKSAVTALWGKVN--VDEVGG
2  Hbb_Horse   GEEKAAVLALWDKVN--EEEVGG
3  Hba_Human   PADKTNVKAAWGKVGAHAGEYGA
4  Hba_Horse   AADKTNVKAAWSKVGGHAGEYGA
5  Myg_Whale   EHEWQLVLHVWAKVEADVAGHGQ

[Figure annotation: alpha-helices]
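Picking the first pair to align from the distance matrix can be sketched as below. The assignment of each distance to a pair of sequences is read off the flattened slide figure and is therefore an assumption (in particular that .13 belongs to Hba_Human and Hba_Horse); the function name is invented for this example. The sketch only finds the closest pair and does not build the Neighbor-Joining tree itself.

```python
# Pairwise distances as read from the slide (pair assignment is an assumption).
DISTANCES = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}


def closest_pair(distances):
    """Return the pair of sequence names with the smallest distance."""
    return min(distances, key=distances.get)


first_pair = closest_pair(DISTANCES)
print("align first:", first_pair, "distance", DISTANCES[first_pair])
```

With these values the two alpha-globins are aligned first, the two beta-globins form the next closest pair, and myoglobin, the most distant sequence, is added last.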

Advice on progressive alignment

Progressive alignment is a mathematical process that is completely independent of biological reality.

Can be a very good estimate.
Can be an impossibly poor estimate.
Requires user input and skill.
