Chapter III
Hidden Markov Model [2]

This section will describe a method to train and recognize speech utterances from
given observations O_t ∈ R^D, where t is a time index and D is the vector
dimension. A complete sequence of observations used to describe the utterance will
be denoted O = (O_1, O_2, …, O_T). The utterance may be a word, a phoneme, a complete
sentence or a paragraph. The method described here is the Hidden Markov Model
(HMM). The HMM is a stochastic approach which models the given problem as a
doubly stochastic process in which the observed data are thought to be the result
of having passed the true (hidden) process through a second process. Both
processes are to be characterized using only the one that can be observed. The
problem with this approach is that one does not know anything about the Markov
chains that generate the speech. The number of states in the model is unknown, the
probabilistic functions are unknown and one cannot tell from which state an
observation was produced. These properties are hidden, hence the name
Hidden Markov Model.

III.1. Discrete Markov Process
Consider a system which may be described at any time as being in one of a set of N
distinct states, S_1, S_2, …, S_N, as illustrated in figure III.1. At regularly spaced
discrete times, the system undergoes a change of state (possibly back to the same
state) according to a set of probabilities associated with that state. Denote the time
instants associated with state changes as t = 1, 2, …, and denote the actual state at
time t as q_t. A full probabilistic description of the above system would, in general,
require specification of the current state (at time t), as well as all the predecessor
states. For the special case of a discrete, first order, Markov chain, this probabilistic
description is truncated to just the current and the predecessor state, i.e.,



Figure III.1. A Markov chain with 5 states with selected state transitions.

P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …] = P[q_t = S_j | q_{t-1} = S_i]. (III.1)

Furthermore we only consider those processes in which the right-hand side of (III.1)
is independent of time, thereby leading to the set of state transition probabilities a_ij
of the form

a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N (III.2)

with the state transition coefficients having the properties

a_ij ≥ 0 (III.3a)

Σ_{j=1}^{N} a_ij = 1 (III.3b)

since they obey standard stochastic constraints.

The above stochastic process could be called an observable Markov model, since the
output of the process is the set of states at each instant of time, where each state
corresponds to a physical (observable) event. To fix ideas, consider a simple 3-
state Markov model of the weather. We assume that once a day (e.g., at noon), the
weather is observed as being one of the following:

State 1: rain (or snow)
State 2: cloudy
State 3: sunny.

We postulate that the weather on day t is characterized by a single one of the three
states above, and that the matrix A of state transition probabilities is

A = {a_ij} =
    | 0.4  0.3  0.3 |
    | 0.2  0.6  0.2 |
    | 0.1  0.1  0.8 |

Given that the weather on day 1 (t=1) is sunny (state 3), we can ask the question:
What is the probability (according to the model) that the weather for the next 7 days
will be sun-sun-rain-rain-sun-cloudy-sun? Stated more formally, we define the
observation sequence O as O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3} corresponding to t = 1, 2, …,
8, and we wish to determine the probability of O, given the model. This probability
can be expressed (and evaluated) as
P(O|Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
           = P[S_3] · P[S_3|S_3] · P[S_3|S_3] · P[S_1|S_3] · P[S_1|S_1] · P[S_3|S_1] · P[S_2|S_3] · P[S_3|S_2]
           = π_3 · a_33 · a_33 · a_31 · a_11 · a_13 · a_32 · a_23
           = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
           = 1.536 × 10^-4

where we use the notation

π_i = P[q_1 = S_i], 1 ≤ i ≤ N (III.4)

to denote the initial state probabilities.
Another interesting question we can ask (and answer using the model) is: Given
that the model is in a known state, what is the probability it stays in that state for
exactly d days? This probability can be evaluated as the probability of the
observation sequence
O = { S_i (day 1), S_i (day 2), S_i (day 3), …, S_i (day d), S_j ≠ S_i (day d+1) },

given the model, which is

P(O | Model, q_1 = S_i) = (a_ii)^{d-1} (1 - a_ii) = p_i(d). (III.5)
The quantity p_i(d) is the (discrete) probability density function of duration d in state
i. The exponential duration density is characteristic of the state duration in a
Markov chain. Based on p_i(d), we can readily calculate the expected number of
observations in a state, conditioned on starting in that state, as

d̄_i = Σ_{d=1}^{∞} d p_i(d) (III.6a)

    = Σ_{d=1}^{∞} d (a_ii)^{d-1} (1 - a_ii) = 1/(1 - a_ii). (III.6b)
Thus the expected number of consecutive days of sunny weather, according to the
model, is 1/(0.2)=5; for cloudy it is 2.5; for rain it is 1.67.
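
These numbers are easy to verify numerically. The following sketch (a minimal check, assuming Python with NumPy; the variable names are illustrative, not part of the model definition) reproduces the sequence probability and the expected state durations:

import numpy as np

# Transition matrix of the 3-state weather model:
# states 0 = rain, 1 = cloudy, 2 = sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# O = {S3, S3, S3, S1, S1, S3, S2, S3} as zero-based state indices.
O = [2, 2, 2, 0, 0, 2, 1, 2]

# P(O|Model) = pi_{q1} * product of a_{q_t q_{t+1}}; here pi_3 = 1.
p = 1.0
for t in range(len(O) - 1):
    p *= A[O[t], O[t + 1]]
print(p)                            # 1.536e-04

# Expected duration in each state, 1/(1 - a_ii) per (III.6b).
print(1.0 / (1.0 - np.diag(A)))     # [1.667  2.5  5.0]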

III.2. Hidden Markov Models [1]
So far we have considered Markov models in which each state corresponded to an
observable (physical) event. This model is too restrictive to be applicable to many
problems. In this section we extend the concept of Markov models to include the
case where the observation is a probabilistic function of the state, i.e., the resulting
model (which is called a hidden Markov model) is a doubly embedded stochastic
process with an underlying stochastic process that is not observable (it is hidden),
but can only be observed through another set of stochastic processes that produce
the sequence of observations. To fix ideas, consider the following model of some
simple coin tossing experiments.
Coin Toss Models: Assume the following scenario. You are in a room with a barrier
(e.g., a curtain) through which you cannot see what is happening. On the other side
of the barrier is another person who is performing a coin (or multiple coin) tossing
experiment. The other person will not tell you anything about what he is doing
exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden
coin tossing experiments is performed, with the observation sequence consisting of
a series of heads and tails; e.g., a typical observation sequence would be

O = O_1 O_2 O_3 … O_T
  = H H T T T H T T H … H

where H stands for heads and T stands for tails.
Given the above scenario, the problem of interest is how we build an HMM to
explain (model) the observed sequence of heads and tails. The first problem one
faces is deciding what the states in the model correspond to, and then deciding how
many states should be in the model. One possible choice would be to assume that
only a single biased coin was being tossed. In this case we could model the situation
with a 2-state model where each state corresponds to a side of the coin (i.e., heads or
tails). This model is depicted in figure III.2a. In this case the Markov model is
observable, and the only issue for complete specification of the model would be to
decide on the best value for the bias (i.e., the probability of, say, heads).
Interestingly, an equivalent HMM to that of figure III.2a would be a degenerate 1-
state model, where the state corresponds to the single biased coin, and the unknown
parameter is the bias of the coin.
A second form of HMM for explaining the observed sequence of coin toss outcomes
is given in figure III.2(b). In this case there are 2 states in the model and each state
corresponds to a different, biased, coin being tossed. Each state is characterized by a
probability distribution of heads and tails, and transitions between states are
characterized by a state transition matrix. The physical mechanism which accounts
for how state transitions are selected could itself be a set of independent coin tosses,
or some other probabilistic event.
A third form of HMM for explaining the observed sequence of coin toss outcomes is
given in figure III.2(c). This model corresponds to using 3 biased coins, and
choosing from among the three, based on some probabilistic event.

Figure III.2. Three possible Markov models which can account for the results of hidden coin
tossing experiments. (a) 1-coin model. (b) 2-coin model. (c) 3-coin model.
Given the choice among the three models shown in figure III.2 for explaining the
observed sequence of heads and tails, a natural question would be which model best
matches the actual observations. It should be clear that the simple 1-coin model of
figure III.2a has only 1 unknown parameter; the 2-coin model of figure III.2b has 4
unknown parameters; and the 3-coin model of figure III.2c has 9 unknown
parameters. Thus, with the greater degrees of freedom, the larger HMMs would
seem inherently more capable of modeling a series of coin tossing experiments
than would equivalently smaller models. Although this is theoretically true, we will
see later that practical considerations impose some strong limitations on the size of
models that we can consider. Furthermore, it might just be the case that only a
single coin is being tossed. Then using the 3-coin model of figure III.2c would be
inappropriate, since the actual physical event would not correspond to the model
being used, i.e., we would be using an underspecified system.
The Urn and Ball Model: To extend the ideas of the HMM to a somewhat more
complicated situation, consider the urn and ball system of figure III.3. We assume
that there are N (large) glass urns in a room. Within each urn there are a large
number of colored balls. We assume there are K distinct colors of the balls. The
physical process for obtaining observations is as follows. A genie is in the room,
and according to some random process, he (or she) chooses an initial urn. From this
urn, a ball is chosen at random, and its color is recorded as the observation. The ball
is then replaced in the urn from which it was selected. A new urn is then selected
according to the random selection process associated with the current urn, and the
ball selection process is repeated. This entire process generates a finite observation
sequence of colors, which we would like to model as the observable output of an
HMM.
It should be obvious that the simplest HMM that corresponds to the urn and ball
process is one in which each state corresponds to a specific urn, and for which a
(ball) color probability is defined for each state. The choice of urns is dictated by
the state transition matrix of the HMM.





Figure III.3. An N-state urn and ball model which illustrates the general case of a
discrete symbol HMM.
III.2.1. Discrete Observation Densities
The urn and ball example described in the previous section is an example of a discrete
observation density HMM. This is because there are K distinct colors. In general,
discrete observation density HMMs are based on partitioning the probability density
function (pdf) of observations into a discrete set of small cells, with symbols v_1, v_2,
…, v_K, one symbol representing each cell. This partitioning process is usually called
vector quantization. After the vector quantization is performed, a codebook is created
of the mean vectors for every cluster.

The corresponding symbol for an observation is determined by the nearest neighbor
rule, i.e., select the symbol of the cell with the nearest codebook vector. To make a
parallel to the urn and ball model, this means that if a dark gray ball is observed,
it will probably be closest to the black color. In this case the symbols v_1, v_2, …, v_K
are represented by one color each (e.g. v_1 = RED). The observation symbol
probability distribution, B = { b_j(o_t) }, j = 1, …, N, will now have the symbol distribution at
state j, b_j(o_t), defined as:

b_j(o_t) = b_j(k) = P(o_t = v_k | q_t = j), 1 ≤ k ≤ K (III.7)
The estimation of the probabilities b_j(k) is normally accomplished in two steps: first
the determination of the codebook, and then the estimation of the sets of observation
probabilities for each codebook vector in each state.
In this project, the codebook will be determined by the K-means algorithm.

The K-Means Algorithm
1. Initialization
Choose K vectors from the training vectors, here denoted x, at random. These
vectors will be the initial centroids μ_k, which are to be refined.
2. Recursion
For each vector in the training set, let the vector belong to a cluster k. This is
done by choosing the cluster closest to the vector:

k* = arg min_k d(x, μ_k) (III.8)

where d(x, μ_k) is a distance measure; here the Euclidean distance measure is used:

d(x, μ_k) = (x - μ_k)^T (x - μ_k) (III.9)

3. Test
Recompute the centroids μ_k by taking the mean of the vectors that belong to each
centroid. This is done for every μ_k. If no vector belongs to some μ_k for some value
k, create a new μ_k by choosing a random vector from x. If there has been no
change of the centroids from the previous step, terminate; otherwise go back
to step 2.
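
As a concrete illustration, the steps above might be sketched as follows (Python with NumPy is assumed; the function name and the maximum iteration count are illustrative choices, not part of the algorithm as stated):

import numpy as np

def k_means(x, K, max_iter=100, seed=0):
    # x: training vectors, shape [n, D]; returns the K-vector codebook.
    rng = np.random.default_rng(seed)
    # 1. Initialization: choose K training vectors at random as centroids.
    mu = x[rng.choice(len(x), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # 2. Recursion: assign each vector to the nearest centroid (III.8),
        #    using the squared Euclidean distance (III.9).
        d = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        k_star = d.argmin(axis=1)
        # 3. Test: recompute each centroid as the mean of its cluster;
        #    re-seed an empty cluster with a random training vector.
        new_mu = mu.copy()
        for k in range(K):
            members = x[k_star == k]
            new_mu[k] = members.mean(axis=0) if len(members) else x[rng.integers(len(x))]
        if np.allclose(new_mu, mu):
            break                   # termination: centroids unchanged
        mu = new_mu
    return mu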



III.2.2. Continuous Observation Densities
To create continuous observation density HMMs, the b_j(o_t) are modeled as
parametric probability density functions (pdfs) or mixtures of them. The most
general representation of the pdf, for which a reestimation procedure has been
formulated, is a finite mixture of the form:

b_j(o_t) = Σ_{k=1}^{K} c_jk b_jk(o_t), j = 1, 2, …, N (III.10)
where K is the number of mixtures and the following stochastic constraints for the
mixture weights c_jk hold:

Σ_{k=1}^{K} c_jk = 1, j = 1, 2, …, N (III.11)
c_jk ≥ 0, j = 1, 2, …, N; k = 1, 2, …, K
and b_jk(o_t) is a D-dimensional log-concave or elliptically symmetric density with
mean vector μ_jk and covariance matrix Σ_jk:

b_jk(o_t) = N(o_t, μ_jk, Σ_jk) (III.12)

The most commonly used D-dimensional log-concave or elliptically symmetric density is
the Gaussian density. The Gaussian density can be written as:

b_jk(o_t) = N(o_t, μ_jk, Σ_jk)
          = 1 / ( (2π)^{D/2} |Σ_jk|^{1/2} ) · exp( -(1/2) (o_t - μ_jk)^T Σ_jk^{-1} (o_t - μ_jk) ) (III.13)
To approximate simple observation sources, mixture Gaussians provide an easy
way to gain considerable accuracy, due to the flexibility and convenient estimation
of the pdfs. If the observation source generates a complicated high dimensional pdf,
the mixture Gaussians become computationally difficult to treat, due to the excessive
number of parameters and large covariance matrices.

As the length of the feature vectors is increased, the size of the covariance
matrices grows with the square of the vector dimension. If the feature vectors are
designed to avoid redundant components, the off-diagonal elements of the
covariance matrices are usually small. This suggests approximating the covariance
matrices by diagonal matrices. The diagonality also provides a simpler and faster
implementation:

b_jk(o_t) = N(o_t, μ_jk, Σ_jk)
          = Π_{l=1}^{D} 1 / sqrt(2π σ_jkl²) · exp( -(o_tl - μ_jkl)² / (2 σ_jkl²) ) (III.14)

where σ_jk1², σ_jk2², …, σ_jkD² are the diagonal elements of the covariance matrix Σ_jk.
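
As an illustration, one state's mixture density (III.10) with diagonal covariances (III.14) might be evaluated as sketched below (Python with NumPy assumed; summing in the log domain before the final exponential is a common practical safeguard against underflow, not something the derivation above requires):

import numpy as np

def mixture_density(o, c, mu, var):
    # b_j(o) per (III.10)/(III.14) for one state j:
    # c is the [K] vector of mixture weights c_jk, mu and var are [K, D]
    # mean vectors and diagonal covariance elements.
    D = mu.shape[1]
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.log(var).sum(axis=1))
    log_expo = -0.5 * (((o - mu) ** 2) / var).sum(axis=1)
    return float(np.exp(np.logaddexp.reduce(np.log(c) + log_norm + log_expo)))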

III.2.3. Elements of an HMM
The above examples give us a pretty good idea of what an HMM is and how it can
be applied to some simple scenarios. We now formally define the elements of an
HMM, and explain how the model generates observation sequences.
An HMM is characterized by the following:
1) N, the number of states in the model. Although the states are hidden, for
many practical applications there is often some physical significance attached
to the states or to sets of states of the model. Hence, in the coin tossing
experiments, each state corresponded to a distinct biased coin. In the urn and
ball model, the states corresponded to the urns. Generally the states are
interconnected in such a way that any state can be reached from any other
state (e.g., an ergodic model); however, we will see later in this chapter that
other possible interconnections of states are often of interest. We denote the
individual states as S = {S_1, S_2, …, S_N}, and the state at time t as q_t.


2) K, the number of distinct observation symbols per state, i.e., the discrete
alphabet size. The observation symbols correspond to the physical output of
the system being modeled. For the coin toss experiments the observation
symbols were simply heads or tails; for the urn and ball model they were the
colors of the balls selected from the urns. We denote the individual symbols as
V = {v_1, v_2, …, v_K}.


3) The state transition probability distribution A = {a_ij} where

a_ij = P[q_{t+1} = S_j | q_t = S_i], 1 ≤ i, j ≤ N (III.15)

For the special case where any state can reach any other state in a single step,
we have a_ij > 0 for all i, j. For other types of HMMs, we would have a_ij = 0 for
one or more (i, j) pairs.

4) The observation symbol probability distribution in state j, B = {b_j(k)}, where

b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N; 1 ≤ k ≤ K (III.16)

5) The initial state distribution π = {π_i} where

π_i = P[q_1 = S_i], 1 ≤ i ≤ N (III.17)

Given appropriate values of N, K, A, B, and π, the HMM can be used as a generator
to give an observation sequence

O = O_1 O_2 … O_T

(where each observation O_t is one of the symbols from V, and T is the number of
observations in the sequence) as follows:

a. Choose an initial state q_1 = S_i according to the initial state distribution π.
b. Set t = 1.
c. Choose O_t = v_k according to the symbol probability distribution in state S_i, i.e., b_i(k).
d. Transit to a new state q_{t+1} = S_j according to the state transition probability
distribution for state S_i, i.e., a_ij.
e. Set t = t+1; return to step c if t < T; otherwise terminate the procedure.
The above procedure can be used both as a generator of observations, and as a model
for how a given observation sequence was generated by an appropriate HMM.
It can be seen from the above discussion that a complete specification of an HMM
requires specification of two model parameters (N and K), specification of the
observation symbols, and specification of the three probability measures A, B,
and π. For convenience, we use the compact notation

λ = (A, B, π)

to indicate the complete parameter set of the model.
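
A minimal generator following steps a-e might look as follows (a sketch in Python with NumPy; zero-based indices replace the one-based notation of the text, and B[j, k] stands for b_j(k)):

import numpy as np

def generate(pi, A, B, T, seed=0):
    # Sample an observation sequence of length T from lambda = (A, B, pi).
    rng = np.random.default_rng(seed)
    q = rng.choice(len(pi), p=pi)                 # step a: initial state
    O, Q = [], []
    for t in range(T):                            # steps b-e
        Q.append(q)
        O.append(rng.choice(B.shape[1], p=B[q]))  # step c: emit O_t ~ b_q(.)
        q = rng.choice(len(pi), p=A[q])           # step d: transit via a_q.
    return np.array(O), np.array(Q)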

III.3. The Three Basic Problems for HMMs [2]
Given the form of HMM of the previous section, there are three basic problems of
interest that must be solved for the model to be useful in real world applications.
These problems are the following:
Problem 1: Given the observation sequence O = O_1 O_2 … O_T, and a model λ = (A, B,
π), how do we efficiently compute P(O|λ), the probability of the observation
sequence, given the model?
Problem 2: Given the observation sequence O = O_1 O_2 … O_T, and the model λ, how
do we choose a corresponding state sequence Q = q_1 q_2 … q_T which is optimal in some
meaningful sense (i.e., best explains the observations)?
Problem 3: How do we adjust the model parameters λ = (A, B, π) to maximize
P(O|λ)?

III.3.1. Solution to Problem 1
We wish to calculate the probability of the observation sequence O = O_1 O_2 … O_T,
given the model λ, i.e., P(O|λ). The most straightforward way of doing this is
through enumerating every possible state sequence of length T (the number of
observations). Consider one such fixed state sequence
Q = q_1 q_2 … q_T (III.18)

where q_1 is the initial state. The probability of the observation sequence O for the
state sequence Q is

P(O | Q, λ) = Π_{t=1}^{T} P(O_t | q_t, λ) = b_{q1}(O_1) · b_{q2}(O_2) ⋯ b_{qT}(O_T) (III.19)

where we have assumed statistical independence of observations. The probability of
such a state sequence Q can be written as

P(Q | λ) = π_{q1} a_{q1 q2} a_{q2 q3} ⋯ a_{q(T-1) qT} (III.20)

The joint probability of O and Q, i.e., the probability that O and Q occur
simultaneously, is simply the product of the above two terms, i.e.,

P(O, Q | λ) = P(O | Q, λ) P(Q | λ). (III.21)

The probability of O (given the model) is obtained by summing this joint
probability over all possible state sequences Q, giving

P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)
         = Σ_{q1, q2, …, qT} π_{q1} b_{q1}(O_1) a_{q1 q2} b_{q2}(O_2) ⋯ a_{q(T-1) qT} b_{qT}(O_T) (III.22)

The interpretation of the computation in the above equation is the following.
Initially (at time t=1) we are in state q_1 with probability π_{q1}, and we generate the
symbol O_1 (in this state) with probability b_{q1}(O_1). The clock changes from time t to
t+1 (t=2) and we make a transition to state q_2 from state q_1 with probability a_{q1 q2},
and generate symbol O_2 with probability b_{q2}(O_2). This process continues in this
manner until we make the last transition (at time T) from state q_{T-1} to state q_T with
probability a_{q(T-1) qT} and generate symbol O_T with probability b_{qT}(O_T).
A little thought should convince the reader that the calculation of P(O|λ), according
to its direct definition (III.22), involves on the order of 2T·N^T calculations, since at
every t = 1, 2, …, T, there are N possible states which can be reached (i.e., there are
N^T possible state sequences), and for each such state sequence about 2T calculations
are required for each term in the sum of (III.22). (To be precise, we need (2T-1)N^T
multiplications, and N^T - 1 additions.) This calculation is computationally unfeasible,
even for small values of N and T; e.g., for N=5, T=100, there are on the order of
2·100·5^100 ≈ 10^72 computations! Clearly a more efficient procedure is required to
solve Problem 1. Fortunately such a procedure exists and is called the forward-
backward procedure.
The Forward-Backward Procedure: Consider the forward variable α_t(i) defined as

α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ) (III.23)

i.e., the probability of the partial observation sequence O_1 O_2 … O_t (until time t) and
state S_i at time t, given the model λ. We can solve for α_t(i) inductively, as follows:

1) Initialization:

α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N. (III.24)

2) Induction:

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N. (III.25)

3) Termination:

P(O | λ) = Σ_{i=1}^{N} α_T(i). (III.26)
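
A direct transcription of (III.24)-(III.26) is sketched below (Python with NumPy assumed; note that this unscaled recursion underflows for long sequences, which is exactly what the scaling of section III.3.4 addresses):

import numpy as np

def forward(pi, A, B, O):
    # Returns alpha with shape [T, N] and P(O|lambda); B[j, k] = b_j(k).
    O = np.asarray(O)
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialization (III.24)
    for t in range(T - 1):                          # induction (III.25)
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha, alpha[-1].sum()                   # termination (III.26)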

Step 1) initializes the forward probabilities as the joint probability of state S_i and
initial observation O_1. The induction step, which is the heart of the forward
calculation, is illustrated in figure III.4(a). This figure shows how state S_j can be
reached at time t+1 from the N possible states S_i, 1 ≤ i ≤ N, at time t. Since α_t(i) is
the probability of the joint event that O_1 O_2 … O_t are observed and the state at time t
is S_i, the product α_t(i) a_ij is then the probability of the joint event that O_1 O_2 … O_t are
observed and state S_j is reached at time t+1 via state S_i at time t. Summing this
product over all the N possible states S_i, 1 ≤ i ≤ N, at time t results in the probability
of S_j at time t+1 with all the accompanying previous partial observations. Once this
is done and S_j is known, it is easy to see that α_{t+1}(j) is obtained by accounting for
observation O_{t+1} in state j, i.e., by multiplying the summed quantity by the
probability b_j(O_{t+1}). The computation of (III.25) is performed for all states j, 1 ≤ j ≤ N,
for a given t; the computation is then iterated for t = 1, 2, …, T-1. Finally, step 3)
gives the desired calculation of P(O|λ) as the sum of the terminal forward variables
α_T(i). This is the case since, by definition,

α_T(i) = P(O_1 O_2 … O_T, q_T = S_i | λ) (III.27)

and hence P(O|λ) is just the sum of the α_T(i)'s.
Figure III.4. (a) Illustration of the sequence of operations required for the computation of the
forward variable α_{t+1}(j). (b) Implementation of the computation of α_t(i) in terms of a lattice of
observations t, and states i.

If we examine the computation involved in the calculation of α_t(j), 1 ≤ t ≤ T, 1 ≤ j ≤
N, we see that it requires on the order of N²T calculations, rather than 2TN^T as
required by the direct calculation. (Again, to be precise, we need N(N+1)(T-1)+N
multiplications and N(N-1)(T-1) additions.) For N=5, T=100, we need about 3000
computations for the forward method, versus 10^72 computations for the direct
calculation, a savings of about 69 orders of magnitude.
The forward probability calculation is, in effect, based upon the lattice (or trellis)
structure shown in Figure III.4 (b). The key is that since there are only N states
(nodes at each time slot in that lattice), all the possible state sequences will remerge
into these N nodes, no matter how long the observation sequence. At time t=1, we

need to calculate values of α_1(i), 1 ≤ i ≤ N. At times t = 2, 3, …, T, we only need to
calculate values of α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N
previous values of α_{t-1}(i), because each of the N grid points is reached from the
same N grid points at the previous time slot.
In a similar way, we can consider a backward variable β_t(i) defined as

β_t(i) = P(O_{t+1} O_{t+2} … O_T | q_t = S_i, λ) (III.28)

i.e., the probability of the partial observation sequence from t+1 to the end, given
state S_i at time t and the model λ. Again we can solve for β_t(i) inductively, as
follows:
1) Initialization:

β_T(i) = 1, 1 ≤ i ≤ N. (III.29)

2) Induction:

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j), t = T-1, T-2, …, 1, 1 ≤ i ≤ N. (III.30)
The initialization step 1) arbitrarily defines β_T(i) to be 1 for all i. Step 2), which is
illustrated in figure III.5, shows that in order to have been in state S_i at time t, and
to account for the observation sequence from time t+1 on, you have to consider all
possible states S_j at time t+1, accounting for the transition from S_i to S_j (the a_ij term),
as well as the observation O_{t+1} in state j (the b_j(O_{t+1}) term), and then account for the
remaining partial observation sequence from state j (the β_{t+1}(j) term). We will see
later how the backward as well as the forward calculations are used extensively to
help solve fundamental Problems 2 and 3 of HMMs.
Again, the computation of β_t(i), 1 ≤ t ≤ T, 1 ≤ i ≤ N, requires on the order of N²T
calculations, and can be computed in a lattice structure similar to that of figure III.4(b).
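
The backward recursion (III.29)-(III.30) transcribes just as directly (same assumptions as the forward sketch above):

import numpy as np

def backward(A, B, O):
    # Returns beta with shape [T, N]; unscaled, like forward() above.
    O = np.asarray(O)
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                  # initialization (III.29)
    for t in range(T - 2, -1, -1):                  # induction (III.30)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta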



III.3.2. Solution to Problem 2
Unlike Problem 1, for which an exact solution can be given, there are several
possible ways of solving Problem 2, namely finding the optimal state sequence
associated with the given observation sequence. The difficulty lies with the
definition of the optimal state sequence; i.e., there are several possible optimality
criteria. For example, one possible optimality criterion is to choose the states q_t
which are individually most likely. This optimality criterion maximizes the
expected number of correct individual states. To implement this solution to Problem
2, we define the variable

γ_t(i) = P(q_t = S_i | O, λ) (III.31)
i.e., the probability of being in state S_i at time t, given the observation sequence O
and the model λ. Equation (III.31) can be expressed simply in terms of the forward-
backward variables, i.e.,

γ_t(i) = α_t(i) β_t(i) / P(O|λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i) (III.32)


Figure III.5. Illustration of the sequence of operations required for the computation of
the backward variable β_t(i).
γ_t(i) accounts for the partial observation sequence O_1 O_2 … O_t and state S_i at t,
while β_t(i) accounts for the remainder of the observation sequence O_{t+1} O_{t+2} … O_T,
given state S_i at t. The normalization factor P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i) makes
γ_t(i) a probability measure so that

Σ_{i=1}^{N} γ_t(i) = 1 (III.33)
Using γ_t(i), we can solve for the individually most likely state q_t at time t, as

q_t = arg max_{1 ≤ i ≤ N} [ γ_t(i) ], 1 ≤ t ≤ T (III.34)
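
In terms of the forward and backward sketches above, (III.32)-(III.34) amount to a few array operations (illustrative Python):

import numpy as np

def gamma_posteriors(alpha, beta):
    # gamma_t(i) per (III.32); each row sums to 1, as required by (III.33).
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

# Individually most likely states (III.34):
# q = gamma_posteriors(alpha, beta).argmax(axis=1)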
Although (III.34) maximizes the expected number of correct states (by choosing the
most likely state for each t), there could be some problem with the resulting state
sequence. For example, when the HMM has state transitions which have zero
probability (a_ij = 0 for some i and j), the optimal state sequence may, in fact, not
even be a valid state sequence. This is due to the fact that the solution of (III.34)
simply determines the most likely state at every instant, without regard to the
probability of occurrence of sequences of states.
One possible solution to the above problem is to modify the optimality criterion.
For example, one could solve for the state sequence that maximizes the expected
number of correct pairs of states (q_t, q_{t+1}), or triples of states (q_t, q_{t+1}, q_{t+2}), etc.
Although these criteria might be reasonable for some applications, the most widely
used criterion is to find the single best state sequence (path), i.e., to maximize
P(Q|O, λ), which is equivalent to maximizing P(Q, O|λ). A formal technique for
finding this single best state sequence exists, based on dynamic programming
methods, and is called the Viterbi algorithm.
Viterbi Algorithm: To find the single best state sequence, Q = {q_1 q_2 … q_T}, for the
given observation sequence O = {O_1 O_2 … O_T}, we need to define the quantity

δ_t(i) = max_{q1, q2, …, q(t-1)} P[ q_1 q_2 … q_t = i, O_1 O_2 … O_t | λ ] (III.35)

i.e., δ_t(i) is the best score (highest probability) along a single path, at time t, which
accounts for the first t observations and ends in state S_i. By induction we have

δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] · b_j(O_{t+1}) (III.36)

To actually retrieve the state sequence, we need to keep track of the argument which
maximized (III.36), for each t and j. We do this via the array ψ_t(j). The complete
procedure for finding the best state sequence can now be stated as follows:

1) Initialization:

δ_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N (III.37)

ψ_1(i) = 0 (III.38)

2) Recursion:

δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ] b_j(O_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N (III.39)

ψ_t(j) = arg max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ], 2 ≤ t ≤ T, 1 ≤ j ≤ N (III.40)

3) Termination:

P* = max_{1 ≤ i ≤ N} [ δ_T(i) ] (III.41)

q_T* = arg max_{1 ≤ i ≤ N} [ δ_T(i) ] (III.42)

4) Path (state sequence) backtracking:

q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, …, 1 (III.43)
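
A direct transcription of steps 1)-4) follows (Python with NumPy assumed; a practical implementation would usually work with logarithms of the δ values to avoid underflow, which the text does not discuss here):

import numpy as np

def viterbi(pi, A, B, O):
    # Single best state sequence per (III.37)-(III.43).
    O = np.asarray(O)
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]              # (III.37); psi_1(i) = 0 (III.38)
    for t in range(1, T):                   # recursion (III.39)-(III.40)
        trans = delta[t - 1][:, None] * A   # delta_{t-1}(i) * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()              # termination (III.41)-(III.42)
    for t in range(T - 2, -1, -1):          # backtracking (III.43)
        q[t] = psi[t + 1][q[t + 1]]
    return q, delta[-1].max()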


III.3.3. Solution to Problem 3
The third, and by far the most difficult, problem of HMMs is to determine a method
to adjust the model parameters (A, B, π) to maximize the probability of the
observation sequence given the model. There is no known way to analytically solve
for the model which maximizes the probability of the observation sequence. In fact,
given any finite observation sequence as training data, there is no optimal way of
estimating the model parameters. We can, however, choose λ = (A, B, π) such that
P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch
method.

In order to describe the procedure for reestimation (iterative update and
improvement) of HMM parameters, we first define ξ_t(i, j), the probability of being
in state S_i at time t, and state S_j at time t+1, given the model and the observation
sequence, i.e.

ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ) (III.44)
Figure III.6. Illustration of the sequence of operations required for the computation of
the joint event that the system is in state S_i at time t and state S_j at time t+1.

The sequence of events leading to the conditions required by (III.44) is illustrated in
figure III.6. It should be clear, from the definitions of the forward and backward
variables, that we can write ξ_t(i, j) in the form





ξ_t(i, j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O|λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) (III.45)

where the numerator term is just P(q_t = S_i, q_{t+1} = S_j, O | λ) and the division by P(O|λ)
gives the desired probability measure.
We have previously defined γ_t(i) as the probability of being in state S_i at time t,
given the observation sequence and the model; hence we can relate γ_t(i) to ξ_t(i, j) by
summing over j, giving

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) (III.46)

If we sum γ_t(i) over the time index t, we get a quantity which can be interpreted as
the expected (over time) number of times that state S_i is visited, or equivalently, the
expected number of transitions made from state S_i (if we exclude the time slot t = T
from the summation). Similarly, summation of ξ_t(i, j) over t (from t=1 to t=T-1) can
be interpreted as the expected number of transitions from state S_i to state S_j. That is

Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i (III.47)

Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from S_i to S_j (III.48)
Using the above formulas (and the concept of counting event occurrences) we can
give a method for reestimation of the parameters of an HMM. A set of reasonable
reestimation formulas for π, A, and B is
π̄_i = expected frequency (number of times) in state S_i at time t=1 = γ_1(i) (III.49a)

ā_ij = (expected number of transitions from state S_i to state S_j)
       / (expected number of transitions from state S_i)
     = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i) (III.49b)

b̄_j(k) = (expected number of times in state j and observing symbol v_k)
         / (expected number of times in state j)
       = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j) (III.49c)
If we define the current model as λ = (A, B, π), and use it to compute the right-
hand sides of (III.49a)-(III.49c), and we define the reestimated model as λ̄ = (Ā, B̄, π̄),
as determined from the left-hand sides of (III.49a)-(III.49c), then it has been
proven by Baum and his colleagues that either 1) the initial model λ defines a
critical point of the likelihood function, in which case λ̄ = λ; or 2) model λ̄ is more
likely than model λ in the sense that P(O|λ̄) > P(O|λ), i.e., we have found a new
model λ̄ from which the observation sequence is more likely to have been
produced.
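
One reestimation iteration (III.49a)-(III.49c) can be sketched as follows, reusing the forward() and backward() sketches from section III.3.1 (illustrative Python; unscaled, so only suitable for short sequences):

import numpy as np

def baum_welch_step(pi, A, B, O):
    O = np.asarray(O)
    alpha, P = forward(pi, A, B, O)
    beta = backward(A, B, O)
    gamma = alpha * beta / P                        # (III.32)
    # xi_t(i, j) per (III.45), for t = 1, ..., T-1:
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, O[1:]].T * beta[1:])[:, None, :] / P)
    new_pi = gamma[0]                               # (III.49a)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # (III.49b)
    new_B = np.zeros_like(B)                        # (III.49c)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[O == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B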

Based on the above procedure, if we iteratively use λ̄ in place of λ and repeat the
reestimation calculation, we can improve the probability of O being observed
from the model until some limiting point is reached. The final result of this
reestimation procedure is called a maximum likelihood estimate of the HMM. It
should be pointed out that the forward-backward algorithm leads to local maxima
only, and that in most problems of interest the optimization surface is very complex
and has many local maxima.
The reestimation formulas of (III.49a)-(III.49c) can be derived directly by
maximizing (using standard constrained optimization techniques) Baum's auxiliary
function

Q(λ, λ̄) = Σ_Q P(Q | O, λ) log P(O, Q | λ̄) (III.50)

over λ̄. It has been proven by Baum that maximization of Q(λ, λ̄) leads to increased
likelihood, i.e.

max_{λ̄} [ Q(λ, λ̄) ] ⇒ P(O|λ̄) ≥ P(O|λ). (III.51)

Eventually the likelihood function converges to a critical point.
Notes on the Reestimation Procedure: The reestimation formulas can readily be
interpreted as an implementation of the EM algorithm of statistics, in which the E
(expectation) step is the calculation of the auxiliary function Q(λ, λ̄), and the M
(modification) step is the maximization over λ̄. Thus the Baum-Welch reestimation
equations are essentially identical to the EM steps for this particular problem.
An important aspect of the reestimation procedure is that the stochastic constraints
of the HMM parameters, namely

Σ_{i=1}^{N} π̄_i = 1 (III.52)

Σ_{j=1}^{N} ā_ij = 1, 1 ≤ i ≤ N (III.53)

Σ_{k=1}^{K} b̄_j(k) = 1, 1 ≤ j ≤ N (III.54)

are automatically satisfied at each iteration.
III.3.4. Reestimation for Multiple Observation Sequences
If only one observation sequence is used to train the model, the model will perform
good recognition on this particular sample, but might give a low recognition rate
when testing other utterances of the same word. Good training therefore needs
multiple observation sequences from different speakers for the same word.
Let O^(r) denote the r-th observation sequence, of length T_r, let the superscript r
indicate results for this sequence, and let R be the number of sequences; then the
forward-backward reestimation algorithm must be modified as:

π̄_i = Σ_{r=1}^{R} α̂_1^(r)(i) β̂_1^(r)(i) / Σ_{r=1}^{R} Σ_{i=1}^{N} α̂_{T_r}^(r)(i) (III.55)

ā_ij = Σ_{r=1}^{R} Σ_{t=1}^{T_r - 1} α̂_t^(r)(i) a_ij b_j(O_{t+1}^(r)) β̂_{t+1}^(r)(j)
       / Σ_{r=1}^{R} Σ_{t=1}^{T_r - 1} α̂_t^(r)(i) β̂_t^(r)(i) (III.56)

b̄_j(k) = Σ_{r=1}^{R} Σ_{t=1, O_t^(r) = v_k}^{T_r} α̂_t^(r)(j) β̂_t^(r)(j)
         / Σ_{r=1}^{R} Σ_{t=1}^{T_r} α̂_t^(r)(j) β̂_t^(r)(j) (III.57)

where:

α̂_t(i) = c_t α_t(i) (III.58)

β̂_t(i) = c_t β_t(i) (III.59)

c_t = 1 / Σ_{j=1}^{N} α_t(j) : scale factor (III.60)
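
Pooling the per-sequence statistics can be sketched as follows (illustrative Python reusing the earlier forward/backward sketches; for brevity it divides by P(O^(r)|lambda) per sequence instead of using the scaled variables (III.58)-(III.60), which a practical implementation needs in order to avoid underflow on long sequences):

import numpy as np

def baum_welch_multi(pi, A, B, sequences):
    # Accumulate the numerators and denominators of (III.55)-(III.57)
    # over R training sequences, then divide once at the end.
    num_pi = np.zeros(len(pi))
    num_A = np.zeros_like(A); den_A = np.zeros(len(pi))
    num_B = np.zeros_like(B); den_B = np.zeros(len(pi))
    for O in sequences:
        O = np.asarray(O)
        alpha, P = forward(pi, A, B, O)
        beta = backward(A, B, O)
        gamma = alpha * beta / P
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, O[1:]].T * beta[1:])[:, None, :] / P)
        num_pi += gamma[0]
        num_A += xi.sum(axis=0); den_A += gamma[:-1].sum(axis=0)
        for k in range(B.shape[1]):
            num_B[:, k] += gamma[O == k].sum(axis=0)
        den_B += gamma.sum(axis=0)
    # For left-right models pi is usually kept fixed instead (section III.6).
    return num_pi / len(sequences), num_A / den_A[:, None], num_B / den_B[:, None]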
III.4. Types of HMM
Different kinds of structures for HMMs can be used. The structure is defined by the
transition matrix A. The most general structure is the ergodic or fully connected
HMM. In this model, every state can be reached from every other state of the model.
As shown in figure III.7(a), for an N=4 state model, this model has the property 0 <
a_ij < 1 (zero and one have to be excluded, otherwise the ergodic property is not
fulfilled). The state transition matrix A for an ergodic model can be described by:

A = | a_11  a_12  a_13  a_14 |
    | a_21  a_22  a_23  a_24 |
    | a_31  a_32  a_33  a_34 |
    | a_41  a_42  a_43  a_44 |
Figure III.7. Illustration of 3 distinct types of HMMs. (a) A 4-state ergodic model.
(b) A 4-state left-right model. (c) A 6-state parallel path left-right model.
In speech recognition, it is desirable to use a model which models the observations
in a successive manner, since this is a property of speech. The model that
fulfills this requirement is the left-right model, or the parallel path left-right
model; see figure III.7(b), (c). The property of a left-right model is:

a_ij = 0, j < i (III.61)

That is, no jumps can be made to previous states. The lengths of the jumps are
usually restricted to some maximum length, typically two or three:

a_ij = 0, j > i + Δ (III.62)

Note that, for a left-right model, the state transition coefficients for the last state
have the following property:

a_NN = 1 (III.63)

a_Nj = 0, j < N (III.64)

In figure III.7(b) and (c) two left-right models are presented. In figure III.7(b),
Δ = 2 and the state transition matrix A will be:

A = | a_11  a_12  a_13  0    |
    | 0     a_22  a_23  a_24 |
    | 0     0     a_33  a_34 |
    | 0     0     0     a_44 |
(III.65)
It should be clear that the imposition of the constraints of the left-right model, or
those of the constrained jump model, essentially has no effect on the reestimation
procedure. This is the case because any HMM parameter set to zero initially will
remain at zero throughout the reestimation procedure.

III.5. Choice of Model Parameters
Size of codebook
For the case in which we wish to use an HMM with a discrete observation symbol
density, rather than a continuous one, a vector quantizer (VQ) is required to map
each continuous observation vector into a discrete codebook index. Once the
codebook of vectors has been obtained, the mapping between continuous vectors
and codebook indices becomes a simple nearest neighbor computation: the
continuous vector is assigned the index of the nearest codebook vector. Thus the
major issue in VQ is the design of an appropriate codebook for quantization.
A great deal of work has gone into devising an excellent iterative procedure for
designing codebooks based on having a representative training sequence of vectors.
The procedure basically partitions the training vectors into K disjoint sets (where K
is the size of the codebook), represents each such set by a single vector (v_k, 1 ≤ k ≤
K), which is generally the centroid of the vectors in the training set assigned to the
k-th region, and then iteratively optimizes the partition and the codebook. Associated
with VQ is a distortion penalty, since we are representing an entire region of the
vector space by a single vector. Clearly it is advantageous to keep the distortion
penalty as small as possible. However, this implies a large codebook, and that
leads to problems in implementing HMMs with a large number of parameters.
Figure III.8 illustrates the tradeoff of quantization distortion versus K (on a log
scale). Although the distortion steadily decreases as K increases, it can be seen from
figure III.8 that only small decreases in distortion accrue beyond a value of K=32.
Hence HMMs with codebook sizes from K=32 to 256 vectors have been used in
speech recognition experiments using HMMs.



Figure III.8. Curve showing the tradeoff of VQ average distortion as a function of
the size of the VQ codebook, K, on a log scale.
Type of model
How do we select the type of model, and how do we choose the parameters of the
selected model? For isolated word recognition with a distinct HMM designed for
each word in the vocabulary, it should be clear that a left-right model is more
appropriate than an ergodic model, since we can then associate time with model states
in a fairly straightforward manner. Furthermore, we can envision the physical meaning
of the model states as distinct sounds of the word being modeled.

Number of states
The issue of the number of states to use in each word model leads to two schools of
thought. One idea is to let the number of states correspond roughly to the number of
sounds (phonemes) within the word; hence models with from 2 to 10 states would be
appropriate. The other idea is to let the number of states correspond roughly to
the average number of observations in a spoken version of the word. In this manner,
each state corresponds to an observation interval. Each word model then has the same
number of states; this implies that the models will work best when they represent
words with the same number of sounds. In this project, I have chosen the second
approach.


Figure III.9. Average word error rate versus the number of states N in the HMM
To illustrate the effect of varying the number of states in a word model, figure
III.9 shows a plot of average word error rate versus N, for the case of recognition of
isolated digits. It can be seen that the error is somewhat insensitive to N, achieving a
local minimum at N=6; however, the differences in error rate for values of N close to 6
(e.g., N=5) are small.

III.6. Initial HMM Parameters
Before the reestimation formulas can be applied for training, it is important to get
good initial parameters, so that the reestimation leads to the global maximum, or as
close as possible to it. An adequate choice for π and A is the uniform distribution.
But since left-right models are used, π will have probability one for the first state
and zero otherwise. For example, the left-right model in figure III.7(b) will have
the following initial π and A:

π = | 1 |
    | 0 |
    | 0 |
    | 0 |
(III.66)

A = | 0.5  0.5  0    0   |
    | 0    0.5  0.5  0   |
    | 0    0    0.5  0.5 |
    | 0    0    0    1   |
(III.67)

The parameters of the emission distributions need good initial estimates, to get
rapid and proper convergence.
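
Such an initialization is mechanical to construct (a sketch in Python with NumPy; the function name is illustrative, and the uniform-over-allowed-jumps rule follows the example above):

import numpy as np

def left_right_init(N, delta=1):
    # Initial pi and A for an N-state left-right model with maximum
    # jump length delta; each row is uniform over its allowed successors.
    pi = np.zeros(N); pi[0] = 1.0                  # start in state 1 (III.66)
    A = np.zeros((N, N))
    for i in range(N):
        j_max = min(i + delta, N - 1)
        A[i, i:j_max + 1] = 1.0 / (j_max - i + 1)
    return pi, A

# For N=4 and delta=1 this reproduces (III.66)-(III.67); the last row becomes
# [0, 0, 0, 1], consistent with a_NN = 1 in (III.63).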
