
Computational Biology, Part 8

Representing and Finding Sequence Features
Robert F. Murphy
Copyright 1996-2006.
All rights reserved.

Sequence Analysis Tasks
Representing sequence features, and finding sequence features using consensus sequences and frequency matrices

Definition

A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function

Sequence features

Features following an exact pattern
  restriction enzyme recognition sites

Features with approximate patterns
  promoters
  transcription initiation sites
  transcription termination sites
  polyadenylation sites
  ribosome binding sites
  protein features

Consensus sequences
A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature
Consensus sequences are regular expressions

Finding occurrences of consensus sequences

Example: recognition site for a restriction enzyme

EcoRI recognizes GAATTC
AccI recognizes GTMKAC

Basic Algorithm

Start with first character of sequence to be searched
See if enzyme site matches starting at that position
Advance to next character of sequence to be searched
Repeat previous two steps until all positions have been tested
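
The basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the function name find_sites and the test sequence are made up for the example, and the table holds the standard IUB nucleotide codes (for instance M = A or C, K = G or T).

```python
# IUB nucleotide codes mapped to the bases each code allows.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(sequence, consensus):
    """Return the 0-based start positions where consensus matches sequence."""
    hits = []
    width = len(consensus)
    # Try every start position at which the whole site still fits.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The site matches if every base is allowed by the code at its position.
        if all(base in IUB[code] for base, code in zip(window, consensus)):
            hits.append(start)
    return hits


target = "TTGAATTCAAGTCTACGG"  # hypothetical sequence to be searched
print(find_sites(target, "GAATTC"))  # EcoRI site
print(find_sites(target, "GTMKAC"))  # AccI site
```

Positions are reported 0-based here; adding 1 gives the 1-based coordinates more common in sequence annotation.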

Interactive Demonstration

(A1 Pattern matching demo)

Block Diagram for Search with a Consensus Sequence
[Diagram. Inputs: consensus sequence (in IUB codes); sequence to be searched -> Search Engine -> Output: list of positions where matches occur]

Describing features using frequency matrices
Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices

Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
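
As a concrete illustration of the definition, the sketch below builds a frequency matrix from a small aligned set. The function name frequency_matrix and the four example sites are invented for this illustration; the matrix is stored as a dictionary with one row per alphabet character and one entry per position.

```python
def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at position 0, 1, ...]} for aligned sites."""
    length = len(aligned[0])
    if any(len(site) != length for site in aligned):
        raise ValueError("all aligned sequences must have the same length")
    total = len(aligned)
    # One row per character (n rows), one column per position (m columns).
    return {
        ch: [sum(site[i] == ch for site in aligned) / total for i in range(length)]
        for ch in alphabet
    }


sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]  # made-up aligned examples
freq = frequency_matrix(sites)
for ch, row in freq.items():
    print(ch, row)
```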

Frequency matrices (continued)

Three uses of frequency matrices
  Describe a sequence feature
  Calculate probability of occurrence of feature in a random sequence
  Calculate degree of match between a new sequence and a feature

Interactive Demonstration

(A4 Frequency matrix demo)

Frequency Matrices, PSSMs, and Profiles
A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores
PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Methods for converting frequency matrices to PSSMs

Using log ratio of observed to expected

$$ \mathrm{score}(j,i) = \log \frac{m(j,i)}{f(j)} $$

where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)

Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]

Pseudocounts
How do we get a score for a position with zero counts for a particular character? Can't take log(0).
Solution: add a small number to all positions with zero frequency
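
A minimal sketch of the log-ratio conversion with the zero-frequency fix described above. Several choices here are assumptions made for the example rather than anything stated on the slides: the function name to_pssm, the use of log base 2, the pseudo value of 0.01, and the uniform background. Note that replacing only the zero entries means a column no longer sums exactly to 1; adding a pseudocount to every count and renormalizing is a common alternative.

```python
import math


def to_pssm(freq, background, pseudo=0.01):
    """Convert a frequency matrix to log2(observed / expected) scores."""
    pssm = {}
    for ch, row in freq.items():
        # Replace zero frequencies by a small number so the log is defined.
        pssm[ch] = [
            math.log2((m if m > 0 else pseudo) / background[ch]) for m in row
        ]
    return pssm


# Hypothetical two-position frequency matrix and uniform background.
freq = {"A": [0.0, 1.0], "C": [0.0, 0.0], "G": [0.25, 0.0], "T": [0.75, 0.0]}
background = {ch: 0.25 for ch in "ACGT"}
pssm = to_pssm(freq, background)
for ch, row in pssm.items():
    print(ch, [round(score, 3) for score in row])
```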

Finding occurrences of a sequence feature using a Profile
As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
For each position, we calculate a score by looking up the value corresponding to the base at that position
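
The scan described above might look like the following sketch. It assumes the score of a window is the sum of the per-position lookups (natural for log-ratio scores), and that hits are windows scoring at or above a threshold; the function name scan, the toy PSSM values, and the threshold are all invented for illustration.

```python
def scan(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        # Look up the score of each base at its position within the window.
        score = sum(pssm[sequence[start + i]][i] for i in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-position PSSM that favors the motif TAT.
pssm = {
    "A": [-2.0, 2.0, -2.0],
    "C": [-2.0, -2.0, -2.0],
    "G": [-2.0, -2.0, -2.0],
    "T": [1.5, -2.0, 1.5],
}
hits = scan("GGTATCC", pssm, threshold=4.0)
print(hits)
```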

Interactive Demonstration

(A5 Searching with Profile demo)

Block Diagram for Building a PSSM
[Diagram. Inputs: set of aligned sequence features; expected frequencies of each sequence element -> PSSM builder -> Output: PSSM]

Block Diagram for Searching with a PSSM
[Diagram. Inputs: PSSM; threshold; set of sequences to search -> PSSM search -> Outputs: sequences that match above threshold; positions and scores of matches]

Block Diagram for Searching for sequences related to a family with a PSSM
[Diagram. Inputs: set of aligned sequence features; expected frequencies of each sequence element -> PSSM builder -> PSSM. Inputs: PSSM; threshold; set of sequences to search -> PSSM search -> Outputs: sequences that match above threshold; positions and scores of matches]

Consensus sequences vs. frequency matrices

Should I use a consensus sequence or a frequency matrix to describe my site?

If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence
  Example: Restriction enzyme recognition sites

If some allowed characters are "better" than others, use frequency matrix
  Example: Promoter sequences

Consensus sequences vs. frequency matrices
Advantages of consensus sequences: smaller description, quicker comparison
Disadvantage: lose quantitative information on preferences at certain locations

Sequence Analysis Tasks
Representing and finding sequence features using hidden Markov models

Markov chains
If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
A Markov chain is defined as a sequence of states in which each state depends only on the previous state

Formalism for Markov chains

M = (Q, π, P) is a Markov chain, where

Q = vector (1,..,n) is the list of states
  Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA

π = vector (p1,..,pn) is the initial probability of each state
  π(i) = p_Q(i) (e.g., π(1) = p_A for DNA)

P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (dinucleotide probabilities)
  P(i,j) = p*_Q(i)Q(j) (e.g., P(1,2) = p*_AC for DNA)

Generating Markov chains

Given Q, π, P (and a random number generator), we can generate sequences that are members of the Markov chain M
If π, P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others
If π, P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set has been sampled
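
Generation from (Q, π, P) can be sketched as follows: draw the first state from π, then draw each later state from the row of P that belongs to the previous state. The function name generate and the example parameters are assumptions for illustration; the example matrix is a deliberately extreme one in which every transition is certain, so the output is predictable.

```python
import random


def generate(states, pi, P, length, rng=None):
    """Generate a sequence of the given length from the Markov chain (Q, pi, P)."""
    rng = rng or random.Random()
    if length <= 0:
        return ""
    indices = range(len(states))
    # Draw the first state from the initial probabilities pi.
    current = rng.choices(indices, weights=pi)[0]
    out = [states[current]]
    for _ in range(length - 1):
        # Draw the next state from the row of P for the current state.
        current = rng.choices(indices, weights=P[current])[0]
        out.append(states[current])
    return "".join(out)


# Hypothetical chain: always start at A, then cycle A -> C -> G -> T -> A.
Q = "ACGT"
pi = [1.0, 0.0, 0.0, 0.0]
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
print(generate(Q, pi, P, 8))
```

With realistic parameters (for example, rows estimated from dinucleotide counts), passing a seeded random.Random instance as rng makes runs reproducible.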

Interactive Demonstration

(A11 Markov chains)

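Discriminating between two states with Markov chains

To determine which of two states a sequence is more likely to have resulted from, we calculate

$$ S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i} $$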

State probabilities for + and − models

Given example sequences that are from either the + model (CpG island) or the − model (not CpG island), can calculate the probability that each nucleotide will follow each other nucleotide for each model (the a values for each model)

Rows: preceding nucleotide; columns: following nucleotide

+    A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

−    A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  0.292

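Transition probabilities converted to log likelihood ratios

(log base 2 of the + value over the − value; rows: preceding nucleotide, columns: following nucleotide)

β    A       C      G      T
A    −0.740  0.419  0.580  −0.803
C    −0.913  0.302  1.812  −0.685
G    −0.624  0.461  0.331  −0.730
T    −1.169  0.573  0.393  −0.679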

Example
What is the relative probability of C+G+C+ compared with C−G−C−?
First calculate log-odds ratio:
S(CGC) = β(CG) + β(GC) = 1.812 + 0.461 = 2.273
Convert to relative probability:
2^2.273 = 4.833
Relative probability is ratio of (+) to (−)
P(+) = 4.833 P(−)

Example
Convert to percentage
P(+) + P(−) = 1
4.833 P(−) + P(−) = 1
P(−) = 1/5.833 = 17%
Conclusion
P(+) = 83%, P(−) = 17%
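
The worked example can be checked numerically from the two transition tables. The sketch below follows the example's own conventions: it sums log base 2 ratios over adjacent pairs only (no term for the first nucleotide) and converts to a percentage by assuming P(+) + P(−) = 1. The names PLUS, MINUS and log_odds are chosen for this illustration. Because the ratios are recomputed from the rounded table entries, the result can differ from the slide's figures in the third decimal place.

```python
import math

BASES = "ACGT"
# Transition probabilities (row: preceding base, column: following base).
PLUS = [
    [0.180, 0.274, 0.426, 0.120],
    [0.171, 0.368, 0.274, 0.188],
    [0.161, 0.339, 0.375, 0.125],
    [0.079, 0.355, 0.384, 0.182],
]
MINUS = [
    [0.300, 0.205, 0.285, 0.210],
    [0.322, 0.298, 0.078, 0.302],
    [0.248, 0.246, 0.298, 0.208],
    [0.177, 0.239, 0.292, 0.292],
]


def log_odds(seq):
    """Sum log2(a+ / a-) over every adjacent pair of bases in seq."""
    idx = [BASES.index(base) for base in seq]
    return sum(math.log2(PLUS[a][b] / MINUS[a][b]) for a, b in zip(idx, idx[1:]))


s = log_odds("CGC")
ratio = 2 ** s                 # relative probability P(+) / P(-)
p_plus = ratio / (1 + ratio)   # uses P(+) + P(-) = 1
print(round(s, 3), round(ratio, 2), round(p_plus, 2))
```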

Block Diagram for Generating Sequences with a Markov Model
[Diagram. Inputs: alphabet; initial probabilities; transition probabilities; number of characters to generate -> Markov Model Sequence Generator -> Output: sequence]

Hidden Markov models

"Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices

More definitions
π_i = state at position i in a path
a_kl = P(π_i = l | π_{i−1} = k)
  probability of going from one state to another
  "transition probability"

e_k(b) = P(x_i = b | π_i = k)
  probability of emitting a b when in state k
  "emission probability"

Decoding
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence
This is called "decoding" in the jargon of speech recognition

More definitions

Can calculate the joint probability of a sequence x and a state sequence π

$$ P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} $$

requiring π_{L+1} = 0
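
The joint probability is just a product along the path, which the sketch below makes explicit. The two-state model (H for GC-rich, L for AT-rich) and all of its numbers are invented for illustration, as is the function name joint_probability. The transition back to state 0 at the end is taken to be 1 unless end probabilities are supplied, which is an assumption of this sketch.

```python
def joint_probability(x, path, start, trans, emit, end=None):
    """Return P(x, path): begin transition, then emissions and transitions."""
    p = start[path[0]]  # a_0,pi_1
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= emit[state][symbol]  # e_pi_i(x_i)
        if i + 1 < len(path):
            p *= trans[state][path[i + 1]]  # a_pi_i,pi_i+1
        elif end is not None:
            p *= end[state]  # a_pi_L,0 (treated as 1 when end is None)
    return p


# Hypothetical two-state model.
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print(joint_probability("CG", "HH", start, trans, emit))
```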

Determining the optimal path: the Viterbi algorithm
Viterbi algorithm is a form of dynamic programming
Definition: Let v_k(i) be the probability of the most probable path ending in state k with observation i

Determining the optimal path: the Viterbi algorithm
Initialisation (i = 0):
  v_0(0) = 1, v_k(0) = 0 for k > 0

Recursion (i = 1..L):
  v_l(i) = e_l(x_i) max_k (v_k(i−1) a_kl)
  ptr_i(l) = argmax_k (v_k(i−1) a_kl)

Termination:
  P(x, π*) = max_k (v_k(L) a_k0)
  π_L* = argmax_k (v_k(L) a_k0)

Traceback (i = L..1):
  π_{i−1}* = ptr_i(π_i*)
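
A compact sketch of the recursion, termination, and traceback steps listed above. It works with raw probabilities to stay close to the slide notation; for long sequences, log probabilities would be needed to avoid underflow. The begin state is folded into a start dictionary (the a_0k values), end transitions default to 1, and the two-state model is made up for the example; none of these details come from the slides.

```python
def viterbi(x, states, start, trans, emit, end=None):
    """Return (most probable state path, its joint probability with x)."""
    end = end or {k: 1.0 for k in states}
    # First column: begin transition times emission of the first symbol.
    v = [{k: start[k] * emit[k][x[0]] for k in states}]
    pointers = []
    for symbol in x[1:]:
        column, back = {}, {}
        for l in states:
            # Best predecessor k for state l (the argmax in the recursion).
            best = max(states, key=lambda k: v[-1][k] * trans[k][l])
            back[l] = best
            column[l] = emit[l][symbol] * v[-1][best] * trans[best][l]
        v.append(column)
        pointers.append(back)
    # Termination: pick the best final state, including the end transition.
    last = max(states, key=lambda k: v[-1][k] * end[k])
    prob = v[-1][last] * end[last]
    # Traceback: follow the stored pointers from the last position to the first.
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, prob


# Hypothetical two-state model: H emits mostly C/G, L emits mostly A/T.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
path, prob = viterbi("CGCGATAT", states, start, trans, emit)
print("".join(path), prob)
```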

Block Diagram for Viterbi Algorithm
[Diagram. Inputs: alphabet; initial probabilities; transition probabilities; sequence; position i; state k -> Viterbi Algorithm -> Output: probability sequence was generated with position i being in state k]

Multiple paths can give the same sequence
The Viterbi algorithm finds the most likely path given a sequence
Other paths could also give rise to the same sequence
How do we calculate the probability of a sequence given an HMM?

Probability of a sequence
Sum the probabilities of all possible paths that give that sequence
Let P(x) be the probability of observing sequence x given an HMM

$$ P(x) = \sum_{\pi} P(x, \pi) $$

Probability of a sequence
Can find P(x) using a variation on the Viterbi algorithm, using sum instead of max
This is called the forward algorithm
Replace v_k(i) with f_k(i) = P(x_1…x_i, π_i = k)

Forward algorithm
Initialisation (i = 0):
  f_0(0) = 1, f_k(0) = 0 for k > 0
Recursion (i = 1..L):

$$ f_l(i) = e_l(x_i) \sum_{k} f_k(i-1)\, a_{kl} $$

Termination:

$$ P(x) = \sum_{k} f_k(L)\, a_{k0} $$
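
The forward recursion can be sketched in the same style as a Viterbi implementation, with a sum in place of the max. As before, this is an illustration under assumptions: the begin state is represented by a start dictionary, end transitions default to 1, raw probabilities are used (so long sequences would underflow), and the two-state model is invented.

```python
def forward(x, states, start, trans, emit, end=None):
    """Return (forward table f, P(x)); f[i][k] covers symbols x[0..i]."""
    end = end or {k: 1.0 for k in states}
    # First column: begin transition times emission of the first symbol.
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        # Sum over all predecessor states instead of taking the maximum.
        f.append({
            l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
            for l in states
        })
    # Termination: sum over final states, including the end transition.
    p_x = sum(f[-1][k] * end[k] for k in states)
    return f, p_x


# Hypothetical two-state model: H emits mostly C/G, L emits mostly A/T.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
table, p_x = forward("CGCGATAT", states, start, trans, emit)
print(p_x)
```

One sanity check on such a sketch: with end transitions equal to 1, the values of P(x) over all sequences of a fixed length should sum to 1.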

Backward algorithm
We may need to know the probability that a particular observation x_i came from a particular state k given a sequence x, P(π_i = k | x)
Use algorithm analogous to forward algorithm but starting from the end

Backward algorithm
Initialisation (i = L):
  b_k(L) = a_k0 for all k
Recursion (i = L−1, …, 1):

$$ b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1) $$

Termination:

$$ P(x) = \sum_{l} a_{0l}\, e_l(x_1)\, b_l(1) $$

Estimating probability of state at particular position

Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position

$$ P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)} $$
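
Putting the pieces together, the sketch below computes forward and backward tables and combines them into posterior state probabilities. It is self-contained and makes the same assumptions as any small teaching example: a start dictionary stands in for the begin state, end transitions default to 1, raw probabilities are used without scaling, and the two-state model and function names are hypothetical.

```python
def forward(x, states, start, trans, emit, end):
    """Forward table f and P(x); f[i][k] covers symbols x[0..i]."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    return f, sum(f[-1][k] * end[k] for k in states)


def backward(x, states, trans, emit, end):
    """Backward table b; b[i][k] is the probability of x[i+1:] given state k at i."""
    b = [dict(end)]  # initialisation: b_k(L) = a_k0
    for symbol in reversed(x[1:]):
        nxt = b[0]
        # b_k(i) = sum over l of a_kl * e_l(x_{i+1}) * b_l(i+1)
        b.insert(0, {k: sum(trans[k][l] * emit[l][symbol] * nxt[l] for l in states)
                     for k in states})
    return b


def posterior(x, states, start, trans, emit, end=None):
    """Return a list of {state: P(pi_i = state | x)}, one entry per position."""
    end = end or {k: 1.0 for k in states}
    f, p_x = forward(x, states, start, trans, emit, end)
    b = backward(x, states, trans, emit, end)
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(x))]


# Hypothetical two-state model: H emits mostly C/G, L emits mostly A/T.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
post = posterior("CGCGATAT", states, start, trans, emit)
for i, column in enumerate(post, start=1):
    print(i, {k: round(p, 3) for k, p in column.items()})
```

Unlike Viterbi decoding, which commits to a single best path, the posterior gives a per-position confidence that sums to 1 across states.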
