
Hidden Markov Models

Richard Golden
(following approach of Chapter 9 of Manning and Schutze, 2000)
REVISION DATE: April 15 (Tuesday), 2003
VMM (Visible Markov Model)

[Figure: two-state visible Markov model. Start state S0 (with initial probabilities π1, π2) feeds states S1 and S2; transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5.]
HMM Notation
• State Sequence Variables: X1, …, XT+1
• Output Sequence Variables: O1, …, OT
• Set of Hidden States (S1, …, SN)
• Output Alphabet (K1, …, KM)
• Initial State Probabilities (π1, …, πN)
  πi = p(X1 = Si), i = 1, …, N
• State Transition Probabilities (aij), i, j ∈ {1, …, N}
  aij = p(Xt+1 = Sj | Xt = Si), t = 1, …, T
• Emission Probabilities (bij), i ∈ {1, …, N}, j ∈ {1, …, M}
  bij = p(Ot = Kj | Xt = Si), t = 1, …, T
HMM State-Emission Representation

[Figure: state-emission HMM. Start state S0 with initial probabilities π1=1, π2=0; hidden states S1 and S2 with transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5; emission probabilities b11=0.6, b12=0.1, b13=0.3 from S1 and b21=0.1, b22=0.7, b23=0.2 from S2, over the output alphabet K1, K2, K3.]
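For readers who want to follow the numbers, here is a minimal sketch (assuming NumPy; the names pi, A, B, and obs are illustrative choices, not from the slides) encoding the two-state example above and the three-word observation sequence used in the later examples:

```python
import numpy as np

# Two-state example HMM from the diagram above (states S1, S2; output alphabet K1, K2, K3).
pi = np.array([1.0, 0.0])             # initial state probabilities (pi_1, pi_2)
A  = np.array([[0.7, 0.3],            # a11, a12
               [0.5, 0.5]])           # a21, a22
B  = np.array([[0.6, 0.1, 0.3],       # b11, b12, b13  (emissions from S1)
               [0.1, 0.7, 0.2]])      # b21, b22, b23  (emissions from S2)

# Observation sequence used in the later examples ("Dog"/K3, "is"/K2, "good"/K1),
# encoded as 0-based column indices into B.
obs = [2, 1, 0]
```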
Arc-Emission Representation
• Note that sometimes a Hidden Markov Model is represented by having the
  emission arrows come off the arcs.
• In this situation you would have many more emission arrows, because there
  are many more arcs…
• But the transition and emission probabilities are the same…it just takes
  longer to draw on your PowerPoint presentation (self-conscious presentation).
Fundamental Questions for HMMs

• MODEL FIT
  – How can we compute the likelihood of the observations and hidden states
    given known emission and transition probabilities?
    Compute: p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm})
  – How can we compute the likelihood of the observations given known
    emission and transition probabilities?
    Compute: p("Dog", "is", "Good" | {aij}, {bkm})
Fundamental Questions for HMMs

• INFERENCE
  – How can we infer the sequence of hidden states given the observations and
    the known emission and transition probabilities?
  – Maximize p("Dog"/?, "is"/?, "Good"/? | {aij}, {bkm})
    with respect to the unknown labels.
Fundamental Questions for HMMs

• LEARNING
  – How can we estimate the emission and transition probabilities given the
    observations, assuming that the hidden states are observable during the
    learning process?
  – How can we estimate the emission and transition probabilities given the
    observations only?
Direct Calculation of Model Fit
(note use of "Markov" Assumptions)
Part 1

p(o_1, \dots, o_T, x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})
  = p(o_1, \dots, o_T \mid x_1, \dots, x_T, \{a_{ij}\}, \{b_{km}\}) \; p(x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})

This follows directly from the definition of a conditional probability: p(o, x) = p(o | x) p(x).

EXAMPLE:
p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm}) =
  p("Dog", "is", "Good" | NOUN, VERB, ADJ, {aij}, {bkm}) ×
  p(NOUN, VERB, ADJ | {aij}, {bkm})
Direct Calculation of Likelihood of Labeled Observations
(note use of "Markov" Assumptions)
Part 2

EXAMPLE:
Compute p("Dog"/NOUN, "is"/VERB, "good"/ADJ | {aij}, {bkm})

p(o_1, \dots, o_T, x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})
  = p(o_1, \dots, o_T \mid x_1, \dots, x_T, \{a_{ij}\}, \{b_{km}\}) \; p(x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})

p(o_1, \dots, o_T \mid x_1, \dots, x_T, \{a_{ij}\}, \{b_{km}\}) = \prod_{t=1}^{T} p(o_t \mid x_t, \{b_{km}\})

p(x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\}) = p(x_1 \mid \{\pi_i\}) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, \{a_{ij}\})
Graphical Algorithm Representation of Direct Calculation of
Likelihood of Observations and Hidden States (not hard!)

[Figure: the state-emission HMM diagram from before (π1=1, π2=0; a11=0.7, a12=0.3, a21=0.5, a22=0.5; b11=0.6, b12=0.1, b13=0.3, b21=0.1, b22=0.7, b23=0.2), annotated with the note that "good" is the name of the dog here, so it is a NOUN.]

The likelihood of a particular "labeled" sequence of observations
(e.g., p("Dog"/NOUN, "is"/VERB, "Good"/NOUN | {aij}, {bkm})) may be computed
with the "direct calculation" method using the following simple graphical algorithm.

Specifically, p(K3/S1, K2/S2, K1/S1 | {aij}, {bkm}) = π1 b13 a12 b22 a21 b11
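Plugging the numbers from the diagram into this product gives:

p(K_3/S_1, K_2/S_2, K_1/S_1 \mid \{a_{ij}\}, \{b_{km}\}) = \pi_1 b_{13}\, a_{12} b_{22}\, a_{21} b_{11} = (1)(0.3)(0.3)(0.7)(0.5)(0.6) = 0.0189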


Extension to case where the likelihood of the
observations given parameters is needed
(e.g., p("Dog", "is", "good" | {aij}, {bkm}))

p(o_1, \dots, o_T, x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})
  = p(o_1, \dots, o_T \mid x_1, \dots, x_T, \{a_{ij}\}, \{b_{km}\}) \; p(x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})

p(o_1, \dots, o_T \mid x_1, \dots, x_T, \{a_{ij}\}, \{b_{km}\}) = \prod_{t=1}^{T} p(o_t \mid x_t, \{b_{km}\})

p(x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\}) = p(x_1 \mid \{\pi_i\}) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, \{a_{ij}\})

p(o_1, \dots, o_T \mid \{a_{ij}\}, \{b_{km}\}) = \sum_{x_1, \dots, x_T} p(o_1, \dots, o_T, x_1, \dots, x_T \mid \{a_{ij}\}, \{b_{km}\})

KILLER EQUATION!!!!!
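To see why this is a "killer" equation, here is a brute-force sketch that sums the direct calculation over every one of the N^T possible labelings, reusing the hypothetical labeled_likelihood helper and the illustrative pi, A, B, obs from the earlier sketches. It is fine for T=3 but hopeless for long sentences:

```python
from itertools import product

def brute_force_likelihood(pi, A, B, obs):
    """p(o_1..o_T | {a_ij}, {b_km}) by summing the direct calculation over all N^T labelings."""
    N = len(pi)
    return sum(labeled_likelihood(pi, A, B, list(states), obs)
               for states in product(range(N), repeat=len(obs)))

print(brute_force_likelihood(pi, A, B, obs))   # 0.0315 for the 3-word running example
```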
Efficiency of Calculations
is Important (e.g., Model-Fit)

• Assume 1 multiplication per microsecond.
• Assume an N=1000 word vocabulary and a T=7 word sentence.
• (2T+1)N^(T+1) multiplications by "direct calculation" yields (2(7)+1)(1000)^(7+1) multiplications, which is about
  475,000 million years of computer time!!!
• 2N^2 T multiplications using the "forward method"
  is about 14 seconds of computer time!!!
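Checking the arithmetic at the stated rate of one multiplication per microsecond:

(2T+1)N^{T+1} = 15 \times 1000^{8} = 1.5 \times 10^{25} multiplications \approx 1.5 \times 10^{19} seconds \approx 4.8 \times 10^{11} years (about 475,000 million years)

2N^{2}T = 2 \times 1000^{2} \times 7 = 1.4 \times 10^{7} multiplications = 14 seconds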
Forward, Backward, and Viterbi
Calculations
• Forward calculation methods are thus very
useful.
• Forward, Backward, and Viterbi Calculations
will now be discussed.
Forward Calculations – Overview

[Figure: the two-state HMM unrolled into a trellis over TIME 2, TIME 3, and TIME 4, with the transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5 on the state-to-state arcs and the emission probabilities on the arcs to the output symbols K1, K2, K3 at each time slice.]
Forward Calculations – Time 2 (1 word example)

[Figure: trellis segment from S0 to the TIME 2 states S1 and S2.]

\alpha_1(1) = \pi_1 = 1, \quad \alpha_2(1) = \pi_2 = 0

\alpha_1(2) = \alpha_1(1) b_{13} a_{11} + \alpha_2(1) b_{23} a_{21} = 0.21
\alpha_2(2) = \alpha_1(1) b_{13} a_{12} + \alpha_2(1) b_{23} a_{22} = 0.09

NOTE: α1(2) + α2(2) = 0.3 is the likelihood of the observation/word "K3" in this "1 word example".
Forward Calculations – Time 3 (2 word example)

[Figure: the same trellis extended to TIME 3; α1(3) and α2(3) are accumulated at the TIME 3 states by summing over the TIME 2 states.]
Forward Calculations – Time 4 (3 word example)

[Figure: the same trellis extended to TIME 4; the α values at TIME 4 are accumulated from the TIME 3 states.]
Forward Calculation of
Likelihood Function ("emit and jump")

                       t=1 (0-word)   t=2 (1-word)   t=3 (2-word)   t=4 (3-word)
α1(t)                  1.0            0.21           0.0462         0.021294
α2(t)                  0.0            0.09           0.0378         0.010206
L(t) = p(K1,…,Kt-1)    1.0            0.3            0.084          0.0315

where α1(1) = π1 = 1 and α2(1) = π2 = 0,
α1(2) = α1(1)b13a11 + α2(1)b23a21,    α2(2) = α1(1)b13a12 + α2(1)b23a22,
α1(3) = α1(2)b12a11 + α2(2)b22a21,    α2(3) = α1(2)b12a12 + α2(2)b22a22,
and L(t) = α1(t) + α2(t).
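A minimal forward-pass sketch in the same "emit and jump" convention, reusing the illustrative pi, A, B, obs arrays from earlier; it reproduces the α columns and the 0.0315 total in the table above:

```python
def forward(pi, A, B, obs):
    """Emit-and-jump forward pass: alpha[i, t] = p(o_1..o_t, X_{t+1} = S_i).

    Returns the alpha trellis (columns t = 1..T+1 of the slides) and the total
    likelihood p(o_1..o_T), which is the sum of the last column.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T + 1))
    alpha[:, 0] = pi                                  # alpha_i(1) = pi_i
    for t, o in enumerate(obs):
        # emit o_t from each state i, then jump i -> j
        alpha[:, t + 1] = (alpha[:, t] * B[:, o]) @ A
    return alpha, alpha[:, -1].sum()

alpha, L = forward(pi, A, B, obs)
print(alpha)   # columns: [1, 0], [0.21, 0.09], [0.0462, 0.0378], [0.021294, 0.010206]
print(L)       # 0.0315
```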
Backward Calculations – Overview

[Figure: the same trellis over TIME 2, TIME 3, and TIME 4, now traversed from right to left.]
Backward Calculations – Time 4

[Figure: TIME 4 slice of the trellis, showing the emission probabilities b11=0.6 and b21=0.1 for the final observation K1.]
Backward Calculations – Time 3

[Figure: TIME 3 slice of the trellis; β1(3) and β2(3) are computed from the TIME 4 values using the emission probabilities b11=0.6 and b21=0.1.]
Backward Calculations – Time 2

[Figure: trellis over TIME 2, TIME 3, and TIME 4, traversed from right to left.]

\beta_1(4) = 1, \quad \beta_2(4) = 1
\beta_1(3) = 0.6, \quad \beta_2(3) = 0.1

\beta_1(2) = \beta_1(3) a_{11} b_{12} + \beta_2(3) a_{12} b_{12} = 0.045
\beta_2(2) = \beta_1(3) a_{21} b_{22} + \beta_2(3) a_{22} b_{22} = 0.245

NOTE: β1(2) and β2(2) are the likelihoods of the remaining observation/word sequence "K2, K1" given the state at time 2 (this is the "2 word example").
Backward Calculations – Time 1

[Figure: the full trellis; β1(1) and β2(1) are computed from the TIME 2 values.]
Backward Calculation of
Likelihood Function ("EMIT AND JUMP")

                     t=1        t=2       t=3     t=4
β1(t)                0.0315     0.045     0.6     1
β2(t)                0.029      0.245     0.1     1
L(t)                 0.0315     0.290     0.7     1

where β1(4) = β2(4) = 1,
β1(3) = b11(a11β1(4) + a12β2(4)),    β2(3) = b21(a21β1(4) + a22β2(4)),
β1(2) = b12(a11β1(3) + a12β2(3)),    β2(2) = b22(a21β1(3) + a22β2(3)),
β1(1) = b13(a11β1(2) + a12β2(2)),    β2(1) = b23(a21β1(2) + a22β2(2)),
L(1) = π1β1(1) + π2β2(1) = p(K1,…,KT), and L(t) = β1(t) + β2(t) for t > 1.
You get the same answer going forward or backward!!

Forward:
                       t=1        t=2       t=3       t=4
α1(t)                  1.0        0.21      0.0462    0.021294
α2(t)                  0.0        0.09      0.0378    0.010206
L(t) = p(K1,…,Kt-1)    1.0        0.3       0.084     0.0315

Backward:
                       t=1        t=2       t=3       t=4
β1(t)                  0.0315     0.045     0.6       1
β2(t)                  0.029      0.245     0.1       1
L(t)                   0.0315     0.290     0.7       1
The Forward-Backward Method

• Note the forward method computes:

  p(K_1, \dots, K_{t-1}) = \sum_{i=1}^{N} \alpha_i(t)

• Note the backward method computes:

  \beta_i(t) = p(K_t, \dots, K_T \mid X_t = S_i)

  so the full likelihood is also recovered as p(K_1, \dots, K_T) = \sum_{i=1}^{N} \pi_i \beta_i(1).

• We can also use the forward-backward method, which computes p(K1,…,KT) using the
  following formula (for any choice of t = 1, …, T+1!):

  L = p(K_1, \dots, K_T) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)
Example Forward-Backward Calculation!

(Using the forward α table and backward β table from the previous slide:)

L = p(K_1, \dots, K_T) = \alpha_1(t)\beta_1(t) + \alpha_2(t)\beta_2(t) = 0.0315 \quad \text{for every } t = 1, \dots, T+1
Solution to Problem 1
• The “hard part” of the 1st Problem
was to find the likelihood of the
observations for an HMM
• We can now do this using either the
forward, backward, or forward-backward
method.
Solution to Problem 2: Viterbi Algorithm
(Computing “Most Probable” Labeling)
• Consider direct calculation of labeled
observations
EXAMPLE:
Compute p(“DOG”/NOUN,”is”/VERB,”good”/ADJ|{aij},{bkm})
• Previously we summed these likelihoods together
across all possible labelings to solve the first problem
which was to compute the likelihood of the observations
given the parameters (Hard part of HMM Question 1!).
– We solved this problem using the forward or backward method.
• Now we want to consider all possible labelings and their respective
likelihoods, and pick the labeling whose likelihood is largest!
Efficiency of Calculations
is Important (e.g., Most Likely Labeling Problem)

• Just as in the forward-backward calculations, we can solve the problem of
  finding the best of the N^T possible labelings efficiently.
• Instead of millions of years of computing time we can solve the problem in
  several seconds!!
Viterbi Algorithm – Overview (same setup as forward algorithm)

[Figure: the same trellis over TIME 2, TIME 3, and TIME 4 used for the forward calculations.]
Viterbi Forward Calculations – Time 2 (1 word example)

[Figure: TIME 2 slice of the trellis, from S0 to states S1 and S2.]

\delta_1(1) = \pi_1 = 1, \quad \delta_2(1) = \pi_2 = 0

\delta_1(2) = \max\{\delta_1(1) b_{13} a_{11},\ \delta_2(1) b_{23} a_{21}\} = 0.21
\delta_2(2) = \max\{\delta_1(1) b_{13} a_{12},\ \delta_2(1) b_{23} a_{22}\} = 0.09

\psi_1(2) = 1, \quad \psi_2(2) = 1
Backtracking – Time 2 (1 word example)

[Figure: the same TIME 2 slice, with the backtracking pointers ψ1(2) = 1 and ψ2(2) = 1 highlighted.]
Viterbi Forward Calculations – (2 word example)

[Figure: the trellis extended to TIME 3; δ1(3) and δ2(3) are computed by maximizing over the TIME 2 states.]
BACKTRACKING – (2 word example)

[Figure: the same trellis with the maximizing predecessors of the TIME 3 states marked.]
Formal Analysis of 2 word case

\delta_1(3) = \max\{\delta_1(2) b_{12} a_{11},\ \delta_2(2) b_{22} a_{21}\}
            = \max\{(0.21)(0.1)(0.7),\ (0.09)(0.7)(0.5)\}
            = \max\{0.0147,\ 0.0315\} = 0.0315

\delta_2(3) = \max\{\delta_1(2) b_{12} a_{12},\ \delta_2(2) b_{22} a_{22}\}
            = \max\{(0.21)(0.1)(0.3),\ (0.09)(0.7)(0.5)\}
            = \max\{0.0063,\ 0.0315\} = 0.0315

\psi_1(3) = 2, \quad \psi_2(3) = 2   (the maximizing predecessor is S2 in both cases)
Viterbi Forward Calculations – Time 4 (3 word example)

[Figure: the trellis extended to TIME 4; δ1(4) and δ2(4) are computed by maximizing over the TIME 3 states.]
Backtracking to Obtain Labeling for 3 word case

[Figure: the full trellis with the best path traced back from the highest-δ state at TIME 4.]
Formal Analysis of 3 word case

\delta_1(4) = \max\{\delta_1(3) b_{11} a_{11},\ \delta_2(3) b_{21} a_{21}\}
            = \max\{(0.0315)(0.6)(0.7),\ (0.0315)(0.1)(0.5)\} = 0.01323

\delta_2(4) = \max\{\delta_1(3) b_{11} a_{12},\ \delta_2(3) b_{21} a_{22}\}
            = \max\{(0.0315)(0.6)(0.3),\ (0.0315)(0.1)(0.5)\} = 0.00567

\psi_1(4) = 1, \quad \psi_2(4) = 1   (the maximizing predecessor is S1 in both cases)

Backtracking from the larger value δ1(4) = 0.01323 recovers the labeling S1, S2, S1, i.e., K3/S1, K2/S2, K1/S1.
Third Fundamental Question:
Parameter Estimation
• Make an initial guess for {aij} and {bkm}.
• Compute the probability that one hidden state follows another, given {aij},
  {bkm}, and the sequence of observations (computed using the forward-backward
  algorithm).
• Compute the probability of an observed symbol given a hidden state, given
  {aij}, {bkm}, and the sequence of observations (computed using the
  forward-backward algorithm).
• Use these computed probabilities to make an improved guess for {aij} and {bkm}.
• Repeat this process until convergence (a sketch of one re-estimation pass
  follows below).
• It can be shown that this algorithm (Baum-Welch, an instance of EM) does in
  fact converge to a correct choice for {aij} and {bkm}, assuming that the
  initial guess was close enough.
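As referenced above, here is a minimal sketch of one re-estimation pass for a single observation sequence, reusing the forward and backward sketches and the illustrative pi, A, B, obs arrays from earlier. A real trainer would loop this update over a corpus until the likelihood stops improving:

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch (EM) re-estimation pass for a single observation sequence.

    A sketch only: uses the emit-and-jump forward/backward passes defined earlier
    and returns updated (pi, A, B) estimates.
    """
    N, T = len(pi), len(obs)
    alpha, L = forward(pi, A, B, obs)            # alpha[:, t] holds alpha(t+1)
    beta, _  = backward(pi, A, B, obs)           # beta[:, t]  holds beta(t+1)

    # gamma[i, t] = p(X_{t+1} = S_i | O) for the T emitting time steps
    gamma = alpha[:, :T] * beta[:, :T] / L

    # xi[t, i, j] = p(X_{t+1} = S_i, X_{t+2} = S_j | O): be in i, emit o_{t+1}, jump to j
    xi = np.zeros((T, N, N))
    for t, o in enumerate(obs):
        xi[t] = (alpha[:, t] * B[:, o])[:, None] * A * beta[:, t + 1][None, :] / L

    expected_visits = gamma.sum(axis=1)          # expected number of uses of each state
    new_pi = gamma[:, 0]
    new_A  = xi.sum(axis=0) / expected_visits[:, None]
    new_B  = np.zeros_like(B)
    obs_arr = np.asarray(obs)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[:, obs_arr == k].sum(axis=1) / expected_visits
    return new_pi, new_A, new_B

print(baum_welch_step(pi, A, B, obs))   # improved guesses for pi, A, B after one pass
```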
