Representing and Finding Sequence Features
Robert F. Murphy
Copyright 1996-2006.
All rights reserved.
Sequence Analysis Tasks
Representing sequence features, and finding sequence features using consensus sequences and frequency matrices
Definition
A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function
Sequence features
Features following an exact pattern
  restriction enzyme recognition sites
Features with approximate patterns
  promoters
  transcription initiation sites
  transcription termination sites
  polyadenylation sites
  ribosome binding sites
  protein features
Consensus sequences
A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature
Consensus sequences are regular expressions
Finding occurrences of consensus sequences
Example: recognition site for a restriction enzyme
  EcoRI recognizes GAATTC
  AccI recognizes GTMKAC (in IUB codes, M = A or C and K = G or T)
Basic Algorithm
  Start with the first character of the sequence to be searched
  See if the enzyme site matches starting at that position
  Advance to the next character of the sequence to be searched
  Repeat the previous two steps until all positions have been tested
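The basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the demo program used in the course: the function name `find_consensus` and the `IUB` lookup table (only a subset of the IUB codes) are assumptions made for the example.

```python
# Subset of the IUB nucleotide codes: each code maps to the bases it allows.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "N": "ACGT",
}


def find_consensus(consensus, sequence):
    """Return the 0-based start positions where the consensus matches."""
    m = len(consensus)
    hits = []
    # Test every position at which a full-length site could start.
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if all(base in IUB[code] for code, base in zip(consensus, window)):
            hits.append(start)
    return hits


print(find_consensus("GAATTC", "TTGAATTCGG"))        # EcoRI site
print(find_consensus("GTMKAC", "AAGTCGACTTGTATAC"))  # AccI site
```

Positions are reported 0-based here; add 1 for the 1-based coordinates usually used for sequences.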
Interactive Demonstration
(A1 Pattern matching demo)
Block Diagram for Search with a Consensus Sequence
Inputs: consensus sequence (in IUB codes); sequence to be searched
Process: search engine
Output: list of positions where matches occur
Describing features using frequency matrices
Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
Need to describe how often particular bases are found in particular positions in a sequence feature
Describing features using frequency matrices
Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
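As a rough sketch of the definition above, the following Python builds the n by m frequency matrix from a set of aligned sites. The function name `frequency_matrix` and the example sites are made up for illustration; the matrix is stored as a dictionary with one row per alphabet character.

```python
def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at each of the m positions]}."""
    m = len(aligned[0])
    if any(len(s) != m for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(aligned)
    # One row per alphabet character (n rows), one column per position (m).
    return {
        ch: [sum(s[i] == ch for s in aligned) / n_seqs for i in range(m)]
        for ch in alphabet
    }


# Hypothetical aligned sites, used only to exercise the function.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
fm = frequency_matrix(sites)
for ch, row in fm.items():
    print(ch, row)
```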
Frequency matrices (continued)
Three uses of frequency matrices
  Describe a sequence feature
  Calculate probability of occurrence of feature in a random sequence
  Calculate degree of match between a new sequence and a feature
Interactive Demonstration
(A4 Frequency matrix demo)
Frequency Matrices, PSSMs, and Profiles
A frequency matrix can be converted to a Position Specific Scoring Matrix (PSSM) by converting frequencies to scores
PSSMs are also called Position Weight Matrices (PWMs) or Profiles
Methods for converting frequency matrices to PSSMs
Using the log ratio of observed to expected frequencies:
  score for character j at position i = $\log \frac{m(j,i)}{f(j)}$
  where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
Using an amino acid substitution matrix (Dayhoff similarity matrix) [see later]
Pseudocounts
How do we get a score for a position with zero counts for a particular character? Can't take log(0).
Solution: add a small number to all positions with zero frequency
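A minimal sketch of the log-ratio conversion with pseudocounts, in Python. The function name `pssm_from_frequencies`, the pseudocount value 0.01, the base-2 logarithm, and the uniform background are all assumptions chosen for the example. Following the slide, the small number is added only where the observed frequency is zero; the columns are not renormalized afterwards.

```python
import math


def pssm_from_frequencies(freq, background, pseudo=0.01):
    """Convert {char: [frequency per position]} to log2(observed/expected)."""
    scores = {}
    for ch, row in freq.items():
        # Replace zero frequencies with a small pseudocount to avoid log(0).
        scores[ch] = [
            math.log2((f if f > 0 else pseudo) / background[ch]) for f in row
        ]
    return scores


# Hypothetical two-position frequency matrix and uniform background.
freq = {
    "A": [0.00, 1.00],
    "C": [0.00, 0.00],
    "G": [0.25, 0.00],
    "T": [0.75, 0.00],
}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = pssm_from_frequencies(freq, background)
for ch, row in pssm.items():
    print(ch, [round(s, 2) for s in row])
```

A common variant adds the pseudocount to every entry and renormalizes each column; which choice is appropriate depends on how many sequences were used to build the matrix.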
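Finding occurrences of a sequence feature using a Profile
As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
For each candidate position, we calculate a score by looking up, in each column of the profile, the value corresponding to the base at that position and summing these values over the length of the feature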
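The profile search described above can be sketched as a sliding window that sums one score per column. The function name `scan_with_pssm`, the toy scoring matrix, and the threshold are hypothetical values chosen only to make the example runnable.

```python
def scan_with_pssm(pssm, target, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    m = len(next(iter(pssm.values())))  # feature length = number of columns
    hits = []
    for start in range(len(target) - m + 1):
        # Look up the score of the base found at each column and add them up.
        score = sum(pssm[target[start + i]][i] for i in range(m))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy PSSM (made up) that favours the three-base site ATA.
toy_pssm = {
    "A": [2.0, -2.0, 2.0],
    "C": [-2.0, -2.0, -2.0],
    "G": [-2.0, -2.0, -2.0],
    "T": [-2.0, 2.0, -2.0],
}
print(scan_with_pssm(toy_pssm, "CCATAGG", threshold=4.0))
```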
Interactive Demonstration
(A5 Searching with Profile demo)
Block Diagram for Building a PSSM
Inputs: set of aligned sequence features; expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM
Block Diagram for Searching with a PSSM
Inputs: PSSM; threshold; set of sequences to search
Process: PSSM search
Outputs: sequences that match above threshold; positions and scores of matches
Block Diagram for Searching for sequences related to a family with a PSSM
Inputs to PSSM builder: set of aligned sequence features; expected frequencies of each sequence element
Output of PSSM builder: PSSM
Inputs to PSSM search: PSSM; threshold; set of sequences to search
Outputs of PSSM search: sequences that match above threshold; positions and scores of matches
Consensus sequences vs. frequency matrices
Should I use a consensus sequence or a frequency matrix to describe my site?
  If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence
    Example: Restriction enzyme recognition sites
  If some allowed characters are "better" than others, use a frequency matrix
    Example: Promoter sequences
Consensus sequences vs. frequency matrices
Advantages of consensus sequences: smaller description, quicker comparison
Disadvantage: lose quantitative information on preferences at certain locations
Sequence Analysis Tasks
Representing and finding sequence features using hidden Markov models
Markov chains
If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
A Markov chain is defined as a sequence of states in which each state depends only on the previous state
Formalism for Markov chains
$M = (Q, \pi, P)$ is a Markov chain, where
  $Q$ = vector (1,..,n) is the list of states
    Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
  $\pi$ = vector ($p_1$,..,$p_n$) is the initial probability of each state
    $\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)
  $P$ = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (dinucleotide probabilities)
    $P(i,j) = p^*_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^*_{AC}$ for DNA)
Generating Markov chains
Given $Q$, $\pi$, $P$ (and a random number generator), we can generate sequences that are members of the Markov chain $M$
If $\pi$, $P$ are derived from a single sequence, the family of sequences generated by $M$ will include that sequence as well as many others
If $\pi$, $P$ are derived from a sampled set of sequences, the family of sequences generated by $M$ will be the population from which that set has been sampled
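A minimal sketch of sequence generation from a Markov chain, assuming Python's standard `random` module as the random number generator. The function name `generate_markov_sequence` and the values in `INITIAL` and `TRANSITION` are made up for illustration; only the roles of Q, π, and P come from the formalism above.

```python
import random


def generate_markov_sequence(states, initial, transition, length, rng=None):
    """Draw one sequence of the given length from the chain (Q, pi, P)."""
    if length <= 0:
        return ""
    rng = rng or random.Random()
    # The first state is drawn from the initial probabilities pi.
    seq = rng.choices(states, weights=initial, k=1)
    while len(seq) < length:
        # Each later state depends only on the previous one (one row of P).
        row = transition[states.index(seq[-1])]
        seq.append(rng.choices(states, weights=row, k=1)[0])
    return "".join(seq)


STATES = ["A", "C", "G", "T"]            # Q
INITIAL = [0.25, 0.25, 0.25, 0.25]       # pi (hypothetical values)
TRANSITION = [                           # P (hypothetical, rows sum to 1)
    [0.4, 0.2, 0.2, 0.2],
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.4, 0.4, 0.1],
    [0.2, 0.2, 0.2, 0.4],
]
print(generate_markov_sequence(STATES, INITIAL, TRANSITION, 30,
                               random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing generated sequences.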
Interactive Demonstration
(A11 Markov chains)
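Discriminating between two states with Markov chains
To determine which of two states a sequence is more likely to have resulted from, we calculate
$$S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}$$
$$S(x) = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$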
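State probabilities for + and - models
Given example sequences that are from either the + model (CpG island) or the - model (not CpG island), we can calculate the probability that each nucleotide will follow each other nucleotide under each model (the a values for each model). Rows give the previous nucleotide, columns the next nucleotide.

+ model
      A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

- model
      A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292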
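Transition probabilities converted to log likelihood ratios
Each entry is the base-2 logarithm of the ratio of the + model probability to the - model probability. Rows give the previous nucleotide, columns the next nucleotide.

       A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679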
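Example
What is the relative probability of C+G+C+ compared with C-G-C-?
First calculate the log-odds ratio:
  $S(CGC) = \beta_{CG} + \beta_{GC} = 1.812 + 0.461 = 2.273$
Convert to relative probability (the scores are base-2 logarithms):
  $2^{2.273} = 4.833$
Relative probability is the ratio of (+) to (-)
  P(+) = 4.833 P(-)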
Example
Convert to percentage
  P(+) + P(-) = 1
  4.833 P(-) + P(-) = 1
  P(-) = 1/5.833 = 17%
Conclusion
  P(+) = 83%, P(-) = 17%
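The worked example can be checked with a short Python sketch that recomputes the log-odds score directly from the two transition tables. The function names `log_odds` and `posterior_plus` are made up for the example. As in the slide, the two models are treated as the only alternatives, with P(+) + P(-) = 1. Small differences from the slide values come from rounding in the tabulated probabilities.

```python
import math

BASES = "ACGT"
# Transition probabilities from the tables (row = previous, column = next).
PLUS = [
    [0.180, 0.274, 0.426, 0.120],
    [0.171, 0.368, 0.274, 0.188],
    [0.161, 0.339, 0.375, 0.125],
    [0.079, 0.355, 0.384, 0.182],
]
MINUS = [
    [0.300, 0.205, 0.285, 0.210],
    [0.322, 0.298, 0.078, 0.302],
    [0.248, 0.246, 0.298, 0.208],
    [0.177, 0.239, 0.292, 0.292],
]


def log_odds(seq):
    """Sum of log2(a+ / a-) over all adjacent pairs in seq."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        i, j = BASES.index(prev), BASES.index(cur)
        score += math.log2(PLUS[i][j] / MINUS[i][j])
    return score


def posterior_plus(seq):
    """P(+) when P(+) + P(-) = 1 and P(+) / P(-) = 2 ** S(seq)."""
    ratio = 2 ** log_odds(seq)
    return ratio / (1 + ratio)


print(round(log_odds("CGC"), 3), round(posterior_plus("CGC"), 2))
```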
Block Diagram for Generating Sequences with a Markov Model
Inputs: alphabet; initial probabilities; transition probabilities; number of characters to generate
Process: Markov Model Sequence Generator
Output: sequence
Hidden Markov models
"Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices
More definitions
$\pi_i$ = state at position i in a path
$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
  probability of going from one state to another
  "transition probability"
$e_k(b) = P(x_i = b \mid \pi_i = k)$
  probability of emitting a b when in state k
  "emission probability"
Decoding
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence
This is called "decoding" in the jargon of speech recognition
More definitions
Can calculate the joint probability of a sequence x and a state sequence $\pi$
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$$
requiring $\pi_{L+1} = 0$
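The joint probability formula can be sketched directly in Python. The two-state HMM below (states "+" and "-", with `START` as a_0k, `END` as a_k0, and the `TRANS` and `EMIT` tables) is a hypothetical toy model invented for the example, not a model from the course; the function name `joint_probability` is also an assumption.

```python
# Hypothetical toy HMM. State 0 (begin/end) is represented by START and END.
START = {"+": 0.5, "-": 0.5}                      # a_0k
END = {"+": 0.1, "-": 0.1}                        # a_k0
TRANS = {                                         # a_kl (row + END sums to 1)
    "+": {"+": 0.7, "-": 0.2},
    "-": {"+": 0.2, "-": 0.7},
}
EMIT = {                                          # e_k(b)
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_probability(x, path):
    """P(x, pi) = a_0pi1 * prod_i e_pi_i(x_i) * a_pi_i,pi_i+1, with pi_L+1 = 0."""
    if len(x) != len(path) or not x:
        raise ValueError("x and path must be non-empty and of equal length")
    p = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= EMIT[state][symbol]
        # The last transition goes to the end state (pi_{L+1} = 0).
        p *= TRANS[state][path[i + 1]] if i + 1 < len(path) else END[state]
    return p


print(joint_probability("CG", "++"))
```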
Determining the optimal path: the Viterbi algorithm
The Viterbi algorithm is a form of dynamic programming
Definition: Let $v_k(i)$ be the probability of the most probable path ending in state k with observation i
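Determining the optimal path: the Viterbi algorithm
Initialisation (i=0):
  $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
Recursion (i=1..L):
  $v_l(i) = e_l(x_i) \max_k (v_k(i-1) \, a_{kl})$
  $\mathrm{ptr}_i(l) = \operatorname{argmax}_k (v_k(i-1) \, a_{kl})$
Termination:
  $P(x, \pi^*) = \max_k (v_k(L) \, a_{k0})$
  $\pi^*_L = \operatorname{argmax}_k (v_k(L) \, a_{k0})$
Traceback (i=L..1):
  $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$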
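The recursion, termination, and traceback steps above can be sketched in Python as follows. The two-state HMM is a hypothetical toy model invented for the example, and the function name `viterbi` and its argument layout are assumptions. The sketch works with plain probabilities, which is adequate for short sequences; long sequences would normally call for log probabilities to avoid underflow.

```python
# Hypothetical toy HMM: START holds a_0k, END holds a_k0.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
END = {"+": 0.1, "-": 0.1}
TRANS = {"+": {"+": 0.7, "-": 0.2}, "-": {"+": 0.2, "-": 0.7}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(x, states, start, trans, end, emit):
    """Return (P(x, pi*), pi*) for a non-empty observed sequence x."""
    # With v_0(0) = 1, the first recursion step reduces to a_0k * e_k(x_1).
    v = [{k: start[k] * emit[k][x[0]] for k in states}]
    pointers = []
    for symbol in x[1:]:
        v_i, ptr_i = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[-1][k] * trans[k][l])
            ptr_i[l] = best_k
            v_i[l] = emit[l][symbol] * v[-1][best_k] * trans[best_k][l]
        v.append(v_i)
        pointers.append(ptr_i)
    # Termination: include the transition into the end state.
    last = max(states, key=lambda k: v[-1][k] * end[k])
    best_prob = v[-1][last] * end[last]
    # Traceback: follow the stored pointers from the last position backwards.
    path = [last]
    for ptr_i in reversed(pointers):
        path.append(ptr_i[path[-1]])
    return best_prob, "".join(reversed(path))


print(viterbi("CGCG", STATES, START, TRANS, END, EMIT))
```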
Block Diagram for Viterbi Algorithm
Inputs: alphabet; initial probabilities; transition probabilities; sequence; position i; state k
Process: Viterbi Algorithm
Output: probability sequence was generated with position i being in state k
Multiple paths can give the same sequence
The Viterbi algorithm finds the most likely path given a sequence
Other paths could also give rise to the same sequence
How do we calculate the probability of a sequence given an HMM?
Probability of a sequence
Sum the probabilities of all possible paths that give that sequence
Let P(x) be the probability of observing sequence x given an HMM
$$P(x) = \sum_{\pi} P(x, \pi)$$
Probability of a sequence
Can find P(x) using a variation on the Viterbi algorithm, using sum instead of max
This is called the forward algorithm
Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$
Forward algorithm
Initialisation (i=0):
  $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$
Recursion (i=1..L):
  $f_l(i) = e_l(x_i) \sum_k f_k(i-1) \, a_{kl}$
Termination:
  $P(x) = \sum_k f_k(L) \, a_{k0}$
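A minimal Python sketch of the forward algorithm, using the same structure as the Viterbi recursion with the maximum replaced by a sum. The toy HMM and the function name `forward` are hypothetical and exist only to make the example runnable.

```python
# Hypothetical toy HMM: START holds a_0k, END holds a_k0.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
END = {"+": 0.1, "-": 0.1}
TRANS = {"+": {"+": 0.7, "-": 0.2}, "-": {"+": 0.2, "-": 0.7}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(x, states, start, trans, end, emit):
    """Return (P(x), f) where f[i][k] is f_k(i+1) for a non-empty x."""
    # With f_0(0) = 1, the first recursion step reduces to a_0k * e_k(x_1).
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({
            l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
            for l in states
        })
    # Termination: sum over final states, including the move to the end state.
    p_x = sum(f[-1][k] * end[k] for k in states)
    return p_x, f


p_x, f = forward("CG", STATES, START, TRANS, END, EMIT)
print(p_x)
```

For the two-symbol sequence used here, the result equals the sum of the joint probabilities of the four possible state paths.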
Backward algorithm
We may need to know the probability that a particular observation $x_i$ came from a particular state k given a sequence x, $P(\pi_i = k \mid x)$
Use an algorithm analogous to the forward algorithm but starting from the end
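Backward algorithm
Initialisation (i=L):
  $b_k(L) = a_{k0}$ for all k
Recursion (i=L-1,...,1):
  $b_k(i) = \sum_l a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)$
Termination:
  $P(x) = \sum_l a_{0l} \, e_l(x_1) \, b_l(1)$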
Estimating probability of state at particular position
Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position
$$P(\pi_i = k \mid x) = \frac{f_k(i) \, b_k(i)}{P(x)}$$
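The posterior calculation can be sketched by running a forward pass and a backward pass and combining them position by position. The toy HMM and the function names `forward_table`, `backward_table`, and `posterior` are hypothetical; the backward recursion used is the standard one, $b_k(i) = \sum_l a_{kl} e_l(x_{i+1}) b_l(i+1)$, started from $b_k(L) = a_{k0}$.

```python
# Hypothetical toy HMM: START holds a_0k, END holds a_k0.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
END = {"+": 0.1, "-": 0.1}
TRANS = {"+": {"+": 0.7, "-": 0.2}, "-": {"+": 0.2, "-": 0.7}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_table(x):
    """f[i][k] = f_k(i+1), computed from the start of x."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][symbol] * sum(prev[k] * TRANS[k][l]
                                           for k in STATES) for l in STATES})
    return f


def backward_table(x):
    """b[i][k] = b_k(i+1), computed from the end of x."""
    b = [{k: END[k] for k in STATES}]          # initialisation at i = L
    for symbol in reversed(x[1:]):             # symbol plays the role of x_{i+1}
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][symbol] * nxt[l]
                            for l in STATES) for k in STATES})
    return b


def posterior(x):
    """Return a list of {state: P(pi_i = state | x)}, one per position."""
    f, b = forward_table(x), backward_table(x)
    p_x = sum(f[-1][k] * END[k] for k in STATES)
    return [{k: f[i][k] * b[i][k] / p_x for k in STATES}
            for i in range(len(x))]


for i, dist in enumerate(posterior("CGAT"), start=1):
    print(i, {k: round(p, 3) for k, p in dist.items()})
```

At every position the posterior probabilities over the states sum to 1, which is a quick sanity check on both passes.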