
Scoring Matrices

Scoring matrices, PSSMs


Position Specific Score Matrix
(PSSM)
A position specific scoring matrix (PSSM) is a
matrix based on the amino acid frequencies
(or nucleic acid frequencies) at every position
of a multiple alignment.
The PSSM calculated from these frequencies
assigns higher scores to residues that appear
at a given position more often than expected
by chance.
Creating a PSSM: Example
NTEGEWI
NITRGEW
NIAGECC
Amino acid frequencies at every
position of the alignment:
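As a rough sketch, the per-column counts for the three aligned sequences above can be tallied as follows. The function name `column_counts` is illustrative and does not come from the slides.

```python
from collections import Counter

# The three aligned sequences from the example.
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]


def column_counts(seqs):
    """Return one Counter per alignment column, mapping residue -> count."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    return [Counter(column) for column in zip(*seqs)]


counts = column_counts(alignment)
for position, counter in enumerate(counts, start=1):
    print(position, dict(counter))
```

Position 1 contains only N (count 3), while residues such as I have a count of 0 there, which is what motivates the pseudocounts introduced next.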
Creating a PSSM: Example
Amino acids that do not appear at a specific position of a multiple
alignment must also be considered in order to model every possible
sequence and have calculable log-odds scores. A simple procedure
called pseudo-counts assigns minimal scores to residues that do not
appear at a certain position of the alignment according to the
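following equation:

Score(i, j) = (Frequency(i, j) + pseudocount) / (N + 20 × pseudocount)

Where
20 is the number of amino acid types.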
Frequency is the number of occurrences of residue i in column j
(a raw count, not a proportion).
pseudocount is a number greater than or equal to 1.
N is the number of sequences in the multiple alignment.
Creating a PSSM: Example
In this example, N = 3; let's use pseudocount = 1:

Score(N) at position 1 = 3/3 = 1.


Score(I) at position 1 = 0/3 = 0.
Readjusted with pseudocounts:
Score(I) at position 1 -> (0+1) / (3+20) = 1/23 = 0.043.
Score(N) at position 1 -> (3+1) / (3+20) = 4/23 = 0.174.
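The readjustment can be checked with a minimal sketch. The function name `adjusted_score` is illustrative. Writing the denominator as N + 20 × pseudocount is an assumption: it matches the (3+20) used above when pseudocount = 1 and makes each column sum to 1.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def adjusted_score(count, n_sequences, pseudocount=1.0,
                   alphabet_size=len(AMINO_ACIDS)):
    """Pseudocount-adjusted frequency of one residue in one column.

    Assumption: every residue type receives the same pseudocount, so the
    denominator grows by alphabet_size * pseudocount.
    """
    return (count + pseudocount) / (n_sequences + alphabet_size * pseudocount)


score_I = adjusted_score(0, 3)  # I never appears at position 1 -> 1/23
score_N = adjusted_score(3, 3)  # N appears in all 3 sequences  -> 4/23
print(score_I, score_N)
```

With one N (count 3) and nineteen absent residues (count 0), the adjusted values for position 1 add up to (4 + 19) / 23 = 1.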

The PSSM is obtained by taking the logarithm of (the values obtained above
divided by the background frequency of the residues).

To simplify this example, we'll assume that every amino acid appears equally often in
protein sequences (i.e. fi = 0.05 for every i) and use base-10 logarithms:

PSSM Score(I) at position 1 = log10((1/23) / 0.05) = -0.061.

PSSM Score(N) at position 1 = log10((4/23) / 0.05) = 0.541.
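Putting the steps together, a small sketch can build the whole log-odds matrix for the example alignment. It assumes a uniform background of 0.05, a pseudocount of 1 and base-10 logarithms, as in the example. The name `build_pssm` and the list-of-dicts layout are illustrative choices.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudocount=1.0, background=None):
    """Return a list (one entry per column) of dicts: residue -> log-odds score."""
    n = len(seqs)
    k = len(AMINO_ACIDS)
    if background is None:
        # Simplifying assumption from the example: uniform background (0.05 each).
        background = {aa: 1.0 / k for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            adjusted = (counts[aa] + pseudocount) / (n + k * pseudocount)
            scores[aa] = math.log10(adjusted / background[aa])
        pssm.append(scores)
    return pssm


pssm = build_pssm(["NTEGEWI", "NITRGEW", "NIAGECC"])
print(round(pssm[0]["I"], 3), round(pssm[0]["N"], 3))  # -0.061 0.541
```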
Creating a PSSM: Example

The matrix assigns positive scores to residues
that appear more often than expected by
chance and negative scores to residues that
appear less often than expected by chance.
Using a PSSM
To search for matches to a PSSM, scan along the
sequence using a window the length (L) of the
PSSM.
The matrix is slid along the sequence one residue at a
time, and the scores of the residues in every region of
length L are added.
Scores that are higher than an empirically
predetermined threshold are reported.

NTEGEWI
NITRGEW
NIAGECC
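The sliding-window search can be sketched as below. The query sequence and the threshold of 1.0 are made up for illustration only. The PSSM is rebuilt with the uniform-background, pseudocount = 1, base-10 assumptions of the worked example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds PSSM with a uniform background (assumption from the example)."""
    n, k = len(seqs), len(AMINO_ACIDS)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log10(((counts[aa] + pseudocount) / (n + k * pseudocount)) * k)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm, threshold):
    """Slide a window of length L = len(pssm) along the sequence.

    Returns (start, window, score) for every window scoring above threshold.
    Residues outside the 20-letter alphabet would raise KeyError here.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[offset][aa] for offset, aa in enumerate(window))
        if score > threshold:
            hits.append((start, window, score))
    return hits


pssm = build_pssm(["NTEGEWI", "NITRGEW", "NIAGECC"])
# Hypothetical query containing two of the aligned sequences.
hits = scan("MKNIAGECCLLNTEGEWIK", pssm, threshold=1.0)
for start, window, score in hits:
    print(start, window, round(score, 3))
```

In practice the threshold is chosen empirically, for example by scoring sequences known to belong or not to belong to the family.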
Advantages of PSSM

Weights sequence according to
observed diversity specific to the family
of interest
Minimal assumptions
Easy to compute
Can be used in comprehensive
evaluations.
More sophisticated PSSMs
From less to more complicated

1. PSSM with pseudocounts.


2. Giving pseudocounts less weight when more
alignment data is available.
3. Weight pseudocount amino acids by their
frequency of occurrence in proteins.
4. Instead of giving pseudocounts all the same
value, weight them by their similarity to the
consensus (like BLOSUM62 does) at each
position. (PSI-BLAST method).
5. Combine 2 & 4 (Dirichlet mixture method).
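Methods 2 and 3 can be illustrated with one hedged sketch. The formula (count + B × f) / (N + B), with B the total pseudocount weight and f the background frequency, is a common way to combine them. Setting B = sqrt(N) is just one heuristic assumed here, not something stated in the slides. The function name is illustrative.

```python
import math


def weighted_probability(count, n_sequences, background_freq,
                         total_pseudocount=None):
    """Estimate a residue probability with background-weighted pseudocounts.

    Method 3: the pseudocount mass is split according to background_freq.
    Method 2: with B = sqrt(N) (an assumed heuristic), pseudocounts carry
    relatively less weight as the number of sequences N grows.
    """
    if total_pseudocount is None:
        total_pseudocount = math.sqrt(n_sequences)
    return (count + total_pseudocount * background_freq) / (
        n_sequences + total_pseudocount
    )


# Probability assigned to a residue never observed in the column,
# for a small and a large alignment (background frequency 0.05).
p_small = weighted_probability(0, 3, 0.05)
p_large = weighted_probability(0, 300, 0.05)
print(p_small, p_large)
```

The unobserved residue keeps a noticeable probability when only 3 sequences are available and a much smaller one with 300, which is the behaviour described in method 2.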
[Figure: a PSSM column with a perfectly conserved isoleucine, scored with different methods. Panels: Method 1 and standard BLOSUM62 matrix; Method 5.]
Using Hidden Markov models to
describe sequence alignment profiles

A profile HMM can represent a sequence
alignment profile similar to how a PSSM does.

A profile HMM includes information on the
amino acid consensus at each position in the
alignment like a PSSM.

A profile HMM also has position-specific
scores for gap insertion and extensions.
Background: Creating HMMs

To create an HMM to model data we need to
determine two things:

The structure/topology of the HMM: states
and transitions.
The values of the parameters: emission and
transition probabilities.

Determining the parameters is called
training.
An HMM structure/topology

M = match state (score the aa in the sequence at this position in the
profile)
I = insertion (w.r.t. profile: insert gap characters in profile)
D = deletion (w.r.t. sequence: insert gap characters in sequence)

M1 is the first aa in the profile, M2 is the second, etc.
Example HMMER parameters
NULE 595 -1558 85 338 -294 453 -1158 (...) -21 -313 45 531 201 384
HMM A C D E F G H (...) m->m m->i m->d i->m i->i d->m d->d b->m m->e
1 -1084 390 -8597 -8255 -5793 -8424 -8268 (...) 1
- -149 -500 233 43 -381 399 106 (...)
C -1 -11642 -12684 -894 -1115 -701 -1378 -16 *
2 -2140 -3785 -6293 -2251 3226 -2495 -727 (...) 2
- -149 -500 233 43 -381 399 106 (...)
C -1 -11642 -12684 -894 -1115 -701 -1378 * * (...)
76 -2255 -5128 -302 363 -784 -2353 1398 (...) 103
- -149 -500 233 43 -381 399 106 (...)
E -1 -11642 -12684 -894 -1115 -701 -1378 * *
77 -633 879 -2198 -5620 -1457 -5498 -4367 (...) 104
- * * * * * * * (...)
C * * * * * * * * 0
//
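The integers in this listing are scaled log-odds scores. The sketch below converts one back to a probability. It assumes the HMMER2 save-file convention, which the slides do not state: score = round(1000 × log2(prob / null_prob)), NULE scores expressed relative to 1/20, and `*` meaning a probability of zero. The function name is illustrative.

```python
def hmmer2_score_to_prob(score, null_prob, intscale=1000):
    """Convert a scaled log2-odds score to a probability.

    Assumption: prob = null_prob * 2 ** (score / intscale), and "*" stands
    for minus infinity (probability 0), as in HMMER2 save files.
    """
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (score / intscale)


# F is the fifth residue column (A C D E F ...).
# NULE entry for F is -294, taken relative to a uniform 1/20.
null_F = hmmer2_score_to_prob(-294, 1 / 20)
# Match emission score for F at profile position 2 is 3226.
p_F = hmmer2_score_to_prob(3226, null_F)
print(round(null_F, 4), round(p_F, 3))
```

Under these assumptions a large positive score such as 3226 corresponds to a residue that is strongly preferred at that position, while scores near -8000 correspond to probabilities close to zero.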
A profile HMM with match state
probabilities shown

AAs PATH is the consensus sequence.


Building a profile HMM

Pick an HMM structure/topology.


Estimate initial parameters.
Train the HMM by running sequences
through it.
Transitions that get used are given
higher probabilities, those rarely used
are given lower probabilities.
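The idea that frequently used transitions receive higher probabilities can be illustrated with a deliberately simplified sketch. It assumes the state paths (strings over M, I, D) are already known, pools transitions over all positions and adds a pseudocount. Real profile HMMs keep position-specific transition parameters and restrict some transitions, so this is not how any particular tool trains. The paths and the function name are invented for illustration.

```python
from collections import Counter, defaultdict


def estimate_transitions(state_paths, states="MID", pseudocount=1.0):
    """Estimate transition probabilities by counting transitions in known paths."""
    counts = defaultdict(Counter)
    for path in state_paths:
        # Count each consecutive pair of states, e.g. "MMI" -> M->M, M->I.
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current in states:
        total = sum(counts[current][s] for s in states) + pseudocount * len(states)
        probs[current] = {
            s: (counts[current][s] + pseudocount) / total for s in states
        }
    return probs


# Hypothetical state paths for three training sequences.
paths = ["MMMMM", "MMIMM", "MMDMM"]
probs = estimate_transitions(paths)
print(probs["M"])
```

Here M->M is observed 8 times and M->I and M->D once each, so after adding the pseudocount M->M gets 9/13 while the two rarer transitions get 2/13 each.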
Protein profile HMMs
Better (in theory) representations than PSSMs.
More complicated.
Not hand-tuned by curators.
Used in some protein profile databases:
Pfam (http://pfam.sanger.ac.uk/)
SMART (http://smart.embl-heidelberg.de/)
Difficult to describe in human readable formats.

Schuster-Böckler et al., 2004 (http://www.biomedcentral.com/1471-2105/5/7)
