A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition

LAWRENCE R. RABINER, FELLOW, IEEE

Although initially introduced and studied in the late 1960s and early 1970s, statistical methods of Markov source or hidden Markov modeling have become increasingly popular in the last several years. There are two strong reasons why this has occurred. First, the models are very rich in mathematical structure and hence can form the theoretical basis for use in a wide range of applications. Second, the models, when applied properly, work very well in practice for several important applications. In this paper we attempt to carefully and methodically review the theoretical aspects of this type of statistical modeling and show how they have been applied to selected problems in machine recognition of speech.

I. Introduction

Real world processes generally produce observable outputs which can be characterized as signals. The signals can be discrete in nature (e.g., characters from a finite alphabet, quantized vectors from a codebook, etc.), or continuous in nature (e.g., speech samples, temperature measurements, music, etc.). The signal source can be stationary (i.e., its statistical properties do not vary with time), or nonstationary (i.e., the signal properties vary over time). The signals can be pure (i.e., coming strictly from a single source), or can be corrupted by other signal sources (e.g., noise) or by transmission distortions, reverberation, etc.

A problem of fundamental interest is characterizing such real-world signals in terms of signal models. There are several reasons why one is interested in applying signal models. First of all, a signal model can provide the basis for a theoretical description of a signal processing system which can be used to process the signal so as to provide a desired output.
For example, if we are interested in enhancing a speech signal corrupted by noise and transmission distortion, we can use the signal model to design a system which will optimally remove the noise and undo the transmission distortion. A second reason why signal models are important is that they are potentially capable of letting us learn a great deal about the signal source (i.e., the real-world process which produced the signal) without having to have the source available. This property is especially important when the cost of getting signals from the actual source is high. In this case, with a good signal model, we can simulate the source and learn as much as possible via simulations. Finally, the most important reason why signal models are important is that they often work extremely well in practice, and enable us to realize important practical systems, e.g., prediction systems, recognition systems, identification systems, etc., in a very efficient manner.

Manuscript received January 15, 1988; revised Oct. The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. IEEE Log Number 8825949.

There are several possible choices for what type of signal model is used for characterizing the properties of a given signal. Broadly one can dichotomize the types of signal models into the class of deterministic models, and the class of statistical models. Deterministic models generally exploit some known specific properties of the signal, e.g., that the signal is a sine wave, or a sum of exponentials, etc. In these cases, specification of the signal model is generally straightforward; all that is required is to determine (estimate) values of the parameters of the signal model (e.g., amplitude, frequency, and phase of a sine wave, amplitudes and rates of exponentials, etc.). The second broad class of signal models is the set of statistical models, in which one tries to characterize only the statistical properties of the signal.
Examples of such statistical models include Gaussian processes, Poisson processes, Markov processes, and hidden Markov processes, among others. The underlying assumption of the statistical model is that the signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimated) in a precise, well-defined manner.

For the applications of interest, namely speech processing, both deterministic and stochastic signal models have had good success. In this paper we will concern ourselves strictly with one type of stochastic signal model, namely the hidden Markov model (HMM). (These models are referred to as Markov sources or probabilistic functions of Markov chains in the communications literature.) We will first review the theory of Markov chains and then extend the ideas to the class of hidden Markov models using several simple examples. We will then focus our attention on the three fundamental problems¹ for HMM design, namely: the evaluation of the probability (or likelihood) of a sequence of observations given a specific HMM; the determination of a best sequence of model states; and the adjustment of model parameters so as to best account for the observed signal. We will show that once these three fundamental problems are solved, we can apply HMMs to selected problems in speech recognition.

¹The idea of characterizing the theoretical aspects of hidden Markov modeling in terms of solving three fundamental problems is due to Jack Ferguson of IDA (Institute for Defense Analyses), who introduced it in lectures and writing.

Neither the theory of hidden Markov models nor its applications to speech recognition is new. The basic theory was published in a series of classic papers by Baum and his colleagues [1]-[5] in the late 1960s and early 1970s, and was implemented for speech processing applications by Baker [6] at CMU, and by Jelinek and his colleagues at IBM [7]-[13] in the 1970s.
However, widespread understanding and application of the theory of HMMs to speech processing has occurred only within the past several years. There are several reasons why this has been the case. First, the basic theory of hidden Markov models was published in mathematical journals which were not generally read by engineers working on problems in speech processing. The second reason was that the original applications of the theory to speech processing did not provide sufficient tutorial material for most readers to understand the theory and to be able to apply it to their own research. As a result, several tutorial papers were written which provided a sufficient level of detail for a number of research labs to begin work using HMMs in individual speech processing applications [14]-[19]. This tutorial is intended to provide an overview of the basic theory of HMMs (as originated by Baum and his colleagues), provide practical details on methods of implementation of the theory, and describe a couple of selected applications of the theory to distinct problems in speech recognition. The paper combines results from a number of original sources and hopefully provides a single source for acquiring the background required to pursue further this fascinating area of research.

The organization of this paper is as follows. In Section II we review the theory of discrete Markov chains and show how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. We illustrate the theory with two simple examples, namely coin-tossing, and the classic balls-in-urns system. In Section III we discuss the three fundamental problems of HMMs, and give several practical techniques for solving these problems.
In Section IV we discuss the various types of HMMs that have been studied, including ergodic as well as left-right models. In this section we also discuss the various model features, including the form of the observation density function, the state duration density, and the optimization criterion for choosing optimal HMM parameter values. In Section V we discuss the issues that arise in implementing HMMs, including the topics of scaling, initial parameter estimates, model size, model form, missing data, and multiple observation sequences. In Section VI we describe an isolated word speech recognizer, implemented with HMM ideas, and show how it performs as compared to alternative implementations. In Section VII we extend the ideas presented in Section VI to the problem of recognizing a string of spoken words based on concatenating individual HMMs of each word in the vocabulary. In Section VIII we briefly outline how the ideas of HMM have been applied to a large vocabulary speech recognizer, and in Section IX we summarize the ideas discussed throughout the paper.

II. Discrete Markov Processes²

Consider a system which may be described at any time as being in one of a set of N distinct states, S_1, S_2, · · ·, S_N, as illustrated in Fig. 1 (where N = 5 for simplicity).

Fig. 1. A Markov chain with 5 states (labeled S_1 to S_5) with selected state transitions.

At regularly spaced discrete times, the system undergoes a change of state (possibly back to the same state) according to a set of probabilities associated with the state. We denote the time instants associated with state changes as t = 1, 2, · · ·, and we denote the actual state at time t as q_t. A full probabilistic description of the above system would, in general, require specification of the current state (at time t), as well as all the predecessor states.
For the special case of a discrete, first order, Markov chain, this probabilistic description is truncated to just the current and the predecessor state, i.e.,

P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, · · ·] = P[q_t = S_j | q_{t-1} = S_i].   (1)

Furthermore, we only consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities a_ij of the form

a_ij = P[q_t = S_j | q_{t-1} = S_i],   1 <= i, j <= N   (2)

with the state transition coefficients having the properties

a_ij >= 0   (3a)

Σ_{j=1}^{N} a_ij = 1   (3b)

since they obey standard stochastic constraints.

²A good overview of discrete Markov processes is in [20, ch. 5].

The above stochastic process could be called an observable Markov model, since the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event. To set ideas, consider a simple 3-state Markov model of the weather. We assume that once a day (e.g., at noon), the weather is observed as being one of the following:

State 1: rain (or snow)
State 2: cloudy
State 3: sunny.

We postulate that the weather on day t is characterized by a single one of the three states above, and that the matrix A of state transition probabilities is

A = {a_ij} = | 0.4  0.3  0.3 |
             | 0.2  0.6  0.2 |
             | 0.1  0.1  0.8 |.

Given that the weather on day 1 (t = 1) is sunny (state 3), we can ask the question: What is the probability (according to the model) that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"? Stated more formally, we define the observation sequence O as O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3}, corresponding to t = 1, 2, · · ·, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
             = P[S_3] · P[S_3|S_3] · P[S_3|S_3] · P[S_1|S_3] · P[S_1|S_1] · P[S_3|S_1] · P[S_2|S_3] · P[S_3|S_2]
             = π_3 · a_33 · a_33 · a_31 · a_11 · a_13 · a_32 · a_23
             = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
             = 1.536 × 10^-4   (4)

where we use the notation

π_i = P[q_1 = S_i],   1 <= i <= N

to denote the initial state probabilities.

Another interesting question we can ask (and answer using the model) is: Given that the model is in a known state, what is the probability that it stays in that state for exactly d days? This probability can be evaluated as the probability of the observation sequence

O = {S_i, S_i, S_i, · · ·, S_i, S_j ≠ S_i}, corresponding to t = 1, 2, · · ·, d, d + 1,

given the model, which is

P(O | Model, q_1 = S_i) = (a_ii)^{d-1}(1 - a_ii) = p_i(d).   (5)

The quantity p_i(d) is the (discrete) probability density function of duration d in state i. This exponential duration density is characteristic of the state duration in a Markov chain. Based on p_i(d), we can readily calculate the expected number of observations (duration) in a state, conditioned on starting in that state, as

d̄_i = Σ_{d=1}^{∞} d · p_i(d)   (6a)
    = Σ_{d=1}^{∞} d (a_ii)^{d-1}(1 - a_ii) = 1 / (1 - a_ii).   (6b)

Thus the expected number of consecutive days of sunny weather, according to the model, is 1/(0.2) = 5; for cloudy it is 2.5; for rain it is 1.67.

A. Extension to Hidden Markov Models

So far we have considered Markov models in which each state corresponded to an observable (physical) event. This model is too restrictive to be applicable to many problems of interest. In this section we extend the concept of Markov models to include the case where the observation is a probabilistic function of the state, i.e., the resulting model (which is called a hidden Markov model) is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations. To fix ideas, consider the following model of some simple coin tossing experiments.

Coin Toss Models: Assume the following scenario. You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment.
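The two calculations above, the sequence probability of (4) and the expected durations of (6b), are easy to verify numerically. The following Python sketch is not part of the original paper; only the matrix A and the 8-day sequence come from the text, and the 0-based state indexing is a coding convenience:

```python
# Numerical check of the weather example. States are 0-based here:
# 0 = rain, 1 = cloudy, 2 = sunny, so the matrix below is the matrix A above.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, A, pi):
    """P(O | Model) for a fully observed state sequence."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]
    return p

pi = [0.0, 0.0, 1.0]           # the weather on day 1 is known to be sunny
O = [2, 2, 2, 0, 0, 2, 1, 2]   # sun, sun, sun, rain, rain, sun, cloudy, sun
p = sequence_probability(O, A, pi)   # ≈ 1.536e-4, matching (4)

# Expected state durations, d_i = 1 / (1 - a_ii), per (6b):
durations = [1.0 / (1.0 - A[i][i]) for i in range(3)]
# rain ≈ 1.67 days, cloudy = 2.5 days, sunny = 5 days
```

Up to floating-point rounding, this reproduces 1.536 × 10^-4 and the three durations quoted above.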
The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, with the observation sequence consisting of a series of heads and tails; e.g., a typical observation sequence would be

O = O_1 O_2 O_3 · · · O_T = H H T T H T H H · · ·

where H stands for heads and T stands for tails.

Given the above scenario, the problem of interest is how do we build an HMM to explain (model) the observed sequence of heads and tails. The first problem one faces is deciding what the states in the model correspond to, and then deciding how many states should be in the model. One possible choice would be to assume that only a single biased coin was being tossed. In this case we could model the situation with a 2-state model where each state corresponds to a side of the coin (i.e., heads or tails). This model is depicted in Fig. 2(a).³ In this case the Markov model is observable, and the only issue for complete specification of the model would be to decide on the best value for the bias (i.e., the probability of, say, heads). Interestingly, an equivalent HMM to that of Fig. 2(a) would be a degenerate 1-state model, where the state corresponds to the single biased coin, and the unknown parameter is the bias of the coin.

A second form of HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 2(b). In this case there are 2 states in the model, and each state corresponds to a different, biased, coin being tossed. Each state is characterized by a probability distribution of heads and tails, and transitions between states are characterized by a state transition matrix. The physical mechanism which accounts for how state transitions are selected could itself be a set of independent coin tosses, or some other probabilistic event.

A third form of HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 2(c).
This model corresponds to using 3 biased coins, and choosing from among the three, based on some probabilistic event.

³The model of Fig. 2(a) is a memoryless process and thus is a degenerate case of a Markov model.

Fig. 2. Three possible Markov models which can account for the results of hidden coin tossing experiments. (a) 1-coin model. (b) 2-coins model. (c) 3-coins model.

Given the choice among the three models shown in Fig. 2 for explaining the observed sequence of heads and tails, a natural question would be which model best matches the actual observations. It should be clear that the simple 1-coin model of Fig. 2(a) has only 1 unknown parameter; the 2-coin model of Fig. 2(b) has 4 unknown parameters; and the 3-coin model of Fig. 2(c) has 9 unknown parameters. Thus, with the greater degrees of freedom, the larger HMMs would seem inherently more capable of modeling a series of coin tossing experiments than would equivalently smaller models. Although this is theoretically true, we will see later in this paper that practical considerations impose some strong limitations on the size of models that we can consider. Furthermore, it might just be the case that only a single coin is being tossed. Then using the 3-coin model of Fig. 2(c) would be inappropriate, since the actual physical event would not correspond to the model being used, i.e., we would be using an underspecified system.

The Urn and Ball Model:⁴ To extend the ideas of the HMM to a somewhat more complicated situation, consider the urn and ball system of Fig. 3. We assume that there are N (large) glass urns in a room. Within each urn there are a large number of colored balls. We assume there are M distinct colors of the balls. The physical process for obtaining observations is as follows. A genie is in the room, and according to some random process, he (or she) chooses an initial urn. From this urn, a ball is chosen at random, and its color is recorded as the observation. The ball is then replaced in the urn from which it was selected.
A new urn is then selected according to the random selection process associated with the current urn, and the ball selection process is repeated. This entire process generates a finite observation sequence of colors, which we would like to model as the observable output of an HMM.

⁴The urn and ball model was introduced by Jack Ferguson, and his colleagues, in lectures on HMM theory.

Fig. 3. An N-state urn and ball model which illustrates the general case of a discrete symbol HMM. (A typical observation sequence: O = {GREEN, GREEN, BLUE, RED, YELLOW, RED, · · ·, BLUE}.)

It should be obvious that the simplest HMM that corresponds to the urn and ball process is one in which each state corresponds to a specific urn, and for which a (ball) color probability is defined for each state. The choice of urns is dictated by the state transition matrix of the HMM.

B. Elements of an HMM

The above examples give us a pretty good idea of what an HMM is and how it can be applied to some simple scenarios. We now formally define the elements of an HMM, and explain how the model generates observation sequences.

An HMM is characterized by the following:

1) N, the number of states in the model. Although the states are hidden, for many practical applications there is often some physical significance attached to the states or to sets of states of the model. Hence, in the coin tossing experiments, each state corresponded to a distinct biased coin. In the urn and ball model, the states corresponded to the urns. Generally the states are interconnected in such a way that any state can be reached from any other state (e.g., an ergodic model); however, we will see later in this paper that other possible interconnections of states are often of interest. We denote the individual states as S = {S_1, S_2, · · ·, S_N}, and the state at time t as q_t.

2) M, the number of distinct observation symbols per state, i.e., the discrete alphabet size. The observation symbols correspond to the physical output of the system being modeled. For the coin toss experiments the observation symbols were simply heads or tails; for the ball and urn model they were the colors of the balls selected from the urns. We denote the individual symbols as V = {v_1, v_2, · · ·, v_M}.

3) The state transition probability distribution A = {a_ij}, where

a_ij = P[q_{t+1} = S_j | q_t = S_i],   1 <= i, j <= N.   (7)

For the special case where any state can reach any other state in a single step, we have a_ij > 0 for all i, j. For other types of HMMs, we would have a_ij = 0 for one or more (i, j) pairs.

4) The observation symbol probability distribution in state j, B = {b_j(k)}, where

b_j(k) = P[v_k at t | q_t = S_j],   1 <= j <= N, 1 <= k <= M.   (8)

5) The initial state distribution π = {π_i}, where

π_i = P[q_1 = S_i],   1 <= i <= N.   (9)

Given appropriate values of N, M, A, B, and π, the HMM can be used as a generator to give an observation sequence

O = O_1 O_2 · · · O_T   (10)

(where each observation O_t is one of the symbols from V, and T is the number of observations in the sequence) as follows:

1) Choose an initial state q_1 = S_i according to the initial state distribution π.
2) Set t = 1.
3) Choose O_t = v_k according to the symbol probability distribution in state S_i, i.e., b_i(k).
4) Transit to a new state q_{t+1} = S_j according to the state transition probability distribution for state S_i, i.e., a_ij.
5) Set t = t + 1; return to step 3) if t < T; otherwise terminate the procedure.

The above procedure can be used as both a generator of observations, and as a model for how a given observation sequence was generated by an appropriate HMM.

It can be seen from the above discussion that a complete specification of an HMM requires specification of two model parameters (N and M), specification of the observation symbols, and the specification of the three probability measures A, B, and π. For convenience, we use the compact notation

λ = (A, B, π)   (11)

to indicate the complete parameter set of the model.

C. The Three Basic Problems for HMMs⁵
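The five-step generation procedure maps directly onto code. Below is a minimal Python sketch (not from the paper); the two-coin parameter values are invented for illustration, in the spirit of the model of Fig. 2(b):

```python
import random

def generate(A, B, pi, T, seed=0):
    """Generate O = O_1 O_2 ... O_T from lambda = (A, B, pi), following
    steps 1)-5) above. States and symbols are 0-based indices."""
    rng = random.Random(seed)
    N = len(pi)
    obs, states = [], []
    q = rng.choices(range(N), weights=pi)[0]        # step 1): q_1 drawn from pi
    for _ in range(T):                              # step 2): t = 1
        states.append(q)
        # step 3): O_t = v_k drawn according to b_q(k)
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])
        # step 4): transit to a new state according to a_qj
        q = rng.choices(range(N), weights=A[q])[0]
    return obs, states                              # step 5): loop until t = T

# Invented two-coin model: state 0 favors heads (symbol 0),
# state 1 favors tails (symbol 1).
A  = [[0.9, 0.1],
      [0.1, 0.9]]
B  = [[0.75, 0.25],
      [0.25, 0.75]]
pi = [0.5, 0.5]
obs, states = generate(A, B, pi, T=10)
```

Run as a generator, the sketch produces an observation sequence together with the (hidden) state sequence that produced it; a recognizer sees only `obs`.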
Given the form of HMM of the previous section, there are three basic problems of interest that must be solved for the model to be useful in real-world applications. These problems are the following:

Problem 1: Given the observation sequence O = O_1 O_2 · · · O_T, and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence, given the model?

Problem 2: Given the observation sequence O = O_1 O_2 · · · O_T, and the model λ, how do we choose a corresponding state sequence Q = q_1 q_2 · · · q_T which is optimal in some meaningful sense (i.e., best "explains" the observations)?

Problem 3: How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

⁵The material in this section and in Section III is based on the ideas presented by Jack Ferguson of IDA in lectures at Bell Laboratories.

Problem 1 is the evaluation problem: namely, given a model and a sequence of observations, how do we compute the probability that the observed sequence was produced by the model? We can also view the problem as one of scoring how well a given model matches a given observation sequence. The latter viewpoint is extremely useful. For example, if we consider the case in which we are trying to choose among several competing models, the solution to Problem 1 allows us to choose the model which best matches the observations.

Problem 2 is the one in which we attempt to uncover the hidden part of the model, i.e., to find the "correct" state sequence. It should be clear that for all but the case of degenerate models, there is no "correct" state sequence to be found. Hence for practical situations, we usually use an optimality criterion to solve this problem as best as possible. Unfortunately, as we will see, there are several reasonable optimality criteria that can be imposed, and hence the choice of criterion is a strong function of the intended use for the uncovered state sequence.
Typical uses might be to learn about the structure of the model, to find optimal state sequences for continuous speech recognition, or to get average statistics of individual states, etc.

Problem 3 is the one in which we attempt to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence, since it is used to "train" the HMM. The training problem is the crucial one for most applications of HMMs, since it allows us to optimally adapt model parameters to observed training data, i.e., to create best models for real phenomena.

To fix ideas, consider the following simple isolated word speech recognizer. For each word of a W word vocabulary, we want to design a separate N-state HMM. We represent the speech signal of a given word as a time sequence of coded spectral vectors. We assume that the coding is done using a spectral codebook with M unique spectral vectors; hence each observation is the index of the spectral vector.

The key to the efficient solution of Problem 1 is the forward variable α_t(i) = P(O_1 O_2 · · · O_t, q_t = S_i | λ), i.e., the probability of the partial observation sequence O_1 O_2 · · · O_t (until time t) and state S_i at time t, given the model λ, which can be solved for inductively. The resulting forward procedure requires on the order of N²T calculations, rather than the exponential number required by direct summation over all state sequences (roughly N²T multiplications and N(N − 1)(T − 1) additions). For N = 5, T = 100, we need about 3000 computations for the forward method, versus 10^72 computations for the direct calculation, a savings of about 69 orders of magnitude.

The forward probability calculation is, in effect, based upon the lattice (or trellis) structure shown in Fig. 4(b). The key is that since there are only N states (nodes at each time slot in the lattice), all the possible state sequences will remerge into these N nodes, no matter how long the observation sequence. At time t = 1 (the first time slot in the lattice), we need to calculate values of α_1(i), 1 <= i <= N. At times t = 2, 3, · · ·, T, we only need to calculate values of α_t(j), 1 <= j <= N, where each calculation involves only the N previous values of α_{t-1}(i), because each of the N grid points is reached from the same N grid points at the previous time slot.
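The savings of the forward method over the direct calculation can be checked on a toy example. The Python sketch below (not from the paper; the two-state model values are invented) implements the standard forward recursion alongside the exponential-cost direct summation over all N^T state sequences, and the two answers agree:

```python
import itertools

def forward(A, B, pi, O):
    """Forward procedure: alpha_1(i) = pi_i * b_i(O_1), then
    alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1});
    finally P(O|lambda) = sum_i alpha_T(i). Cost is on the order of N^2 T."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]        # initialization
    for o in O[1:]:                                       # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                     # termination

def direct(A, B, pi, O):
    """Direct evaluation: sum P(O, Q | lambda) over all N^T state
    sequences Q. Exponential cost; feasible only for tiny T."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

# Invented two-state discrete-symbol model, for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0, 1]
p_fwd, p_direct = forward(A, B, pi, O), direct(A, B, pi, O)  # equal up to rounding
```

Here T = 5, so the direct sum visits only 2^5 = 32 sequences; at T = 100 it would be hopeless, while the forward recursion remains a few hundred operations.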
In a similar manner, we can consider a backward variable β_t(i) defined as

β_t(i) = P(O_{t+1} O_{t+2} · · · O_T | q_t = S_i, λ)   (23)

i.e., the probability of the partial observation sequence from t + 1 to the end, given state S_i at time t and the model λ. Again we can solve for β_t(i) inductively, as follows:

1) Initialization:

β_T(i) = 1,   1 <= i <= N.   (24)

2) Induction:

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),   t = T − 1, T − 2, · · ·, 1,   1 <= i <= N.   (25)

The initialization step 1) arbitrarily defines β_T(i) to be 1 for all i. Step 2), which is illustrated in Fig. 5, shows that in order to have been in state S_i at time t, and to account for the observation sequence from time t + 1 on, you have to consider all possible states S_j at time t + 1, accounting for the transition from S_i to S_j (the a_ij term), as well as the observation O_{t+1} in state j (the b_j(O_{t+1}) term), and then account for the remaining partial observation sequence from state j (the β_{t+1}(j) term). We will see later how the backward, as well as the forward, calculations are used extensively to help solve fundamental Problems 2 and 3 of HMMs.

⁶Again we remind the reader that the backward procedure will be used in the solution to Problem 3, and is not required for the solution of Problem 1.

Fig. 5. Illustration of the sequence of operations required for the computation of the backward variable β_t(i).

Again, the computation of β_t(i), 1 <= t <= T, 1 <= i <= N, requires on the order of N²T calculations, and can be computed in a lattice structure similar to that of Fig. 4(b).

B. Solution to Problem 2

Unlike Problem 1, for which an exact solution can be given, there are several possible ways of solving Problem 2, namely finding the "optimal" state sequence associated with the given observation sequence. The difficulty lies with the definition of the optimal state sequence; i.e., there are several possible optimality criteria. For example, one possible optimality criterion is to choose the states q_t which are individually most likely. This optimality criterion maximizes the expected number of correct individual states.
To implement this solution to Problem 2, we define the variable

γ_t(i) = P(q_t = S_i | O, λ)   (26)

i.e., the probability of being in state S_i at time t, given the observation sequence O, and the model λ. Equation (26) can be expressed simply in terms of the forward-backward variables, i.e.,

γ_t(i) = α_t(i) β_t(i) / P(O|λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)   (27)

since α_t(i) accounts for the partial observation sequence O_1 O_2 · · · O_t and state S_i at t, while β_t(i) accounts for the remainder of the observation sequence O_{t+1} O_{t+2} · · · O_T, given state S_i at t. The normalization factor P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i) makes γ_t(i) a probability measure, so that

Σ_{i=1}^{N} γ_t(i) = 1.   (28)

Using γ_t(i), we can solve for the individually most likely state q_t* at time t, as

q_t* = argmax_{1<=i<=N} [γ_t(i)],   1 <= t <= T.   (29)

Although (29) maximizes the expected number of correct states (by choosing the most likely state for each t), there could be some problems with the resulting state sequence. For example, when the HMM has state transitions which have zero probability (a_ij = 0 for some i and j), the "optimal" state sequence may, in fact, not even be a valid state sequence. This is due to the fact that the solution of (29) simply determines the most likely state at every instant, without regard to the probability of occurrence of sequences of states.

One possible solution to the above problem is to modify the optimality criterion. For example, one could solve for the state sequence that maximizes the expected number of correct pairs of states (q_t, q_{t+1}), or triples of states (q_t, q_{t+1}, q_{t+2}), etc. Although these criteria might be reasonable for some applications, the most widely used criterion is to find the single best state sequence (path), i.e., to maximize P(Q|O, λ), which is equivalent to maximizing P(Q, O|λ).
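The two criteria can be compared directly in code. The Python sketch below (illustrative only; the model values are invented) computes γ_t(i) from the forward and backward recursions per (27) and the per-frame argmax of (29), alongside the single best path produced by the Viterbi algorithm that is formalized next; the two state sequences need not coincide, and when some a_ij = 0 the per-frame argmax sequence may not even be a valid path:

```python
def forward_backward_gamma(A, B, pi, O):
    """gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda), per (27)."""
    N, T = len(pi), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    beta  = [[1.0] * N for _ in range(T)]        # beta_T(i) = 1, per (24)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
    for t in range(T - 2, -1, -1):               # backward induction, per (25)
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    P = sum(alpha[T-1])                          # P(O|lambda)
    return [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]

def viterbi(A, B, pi, O):
    """Single best state sequence, maximizing P(Q, O | lambda)."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    psi = []                                     # backpointer arrays psi_t(j)
    for t in range(1, T):
        new, back = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(i_best)
            new.append(delta[i_best] * A[i_best][j] * B[j][O[t]])
        delta, psi = new, psi + [back]
    q = [max(range(N), key=lambda i: delta[i])]  # best final state
    for back in reversed(psi):                   # backtrack
        q.append(back[q[-1]])
    return list(reversed(q))

# Invented two-state model, for illustration only.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
O  = [0, 1, 1, 0]
gamma = forward_backward_gamma(A, B, pi, O)
pointwise = [max(range(len(pi)), key=lambda i: g[i]) for g in gamma]  # per (29)
path = viterbi(A, B, pi, O)                       # single best path
```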
A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm.

Viterbi Algorithm [21], [22]: To find the single best state sequence, Q = {q_1 q_2 · · · q_T}, for the given observation sequence O = {O_1 O_2 · · · O_T}, we need to define the quantity

δ_t(i) = max_{q_1, q_2, · · ·, q_{t-1}} P[q_1 q_2 · · · q_t = S_i, O_1 O_2 · · · O_t | λ]   (30)

i.e., δ_t(i) is the best score (highest probability) along a single path, at time t, which accounts for the first t observations and ends in state S_i. By induction we have

δ_{t+1}(j) = [max_i δ_t(i) a_ij] · b_j(O_{t+1}).   (31)

To actually retrieve the state sequence, we need to keep track of the argument which maximized (31), for each t and j. We do this via the array ψ_t(j). The complete procedure for finding the best state sequence can now be stated as follows:

1) Initialization:

δ_1(i) = π_i b_i(O_1),   1 <= i <= N   (32a)
ψ_1(i) = 0.   (32b)

2) Recursion:

δ_t(j) = max_{1<=i<=N} [δ_{t-1}(i) a_ij] · b_j(O_t),   2 <= t <= T, 1 <= j <= N   (33a)
ψ_t(j) = argmax_{1<=i<=N} [δ_{t-1}(i) a_ij],   2 <= t <= T, 1 <= j <= N.   (33b)

For Problem 3, the Baum-Welch reestimation procedure computes an updated model λ̄ from the current model λ. It can be shown that either the initial model λ defines a critical point of the likelihood function, in which case λ̄ = λ, or P(O|λ̄) > P(O|λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced. Based on the above procedure, if we iteratively use λ̄ in place of λ and repeat the reestimation calculation, we can improve the probability of O being observed from the model until some limiting point is reached. The final result of this reestimation procedure is called a maximum likelihood estimate of the HMM.

It should be noted that the stochastic constraints of the HMM parameters, namely

Σ_{i=1}^{N} π̄_i = 1,   Σ_{j=1}^{N} ā_ij = 1 (1 <= i <= N),   Σ_{k=1}^{M} b̄_j(k) = 1 (1 <= j <= N)   (43)

are automatically satisfied at each iteration. By looking at the parameter estimation problem as a constrained optimization of P(O|λ) (subject to the constraints of (43)), the techniques of Lagrange multipliers can be used to find the values of π_i, a_ij, and b_j(k) which maximize P (we use the notation P = P(O|λ) as shorthand in this section). Based on setting up a standard Lagrange optimization using Lagrange multipliers, it can readily be shown that P is maximized when the following conditions are met:

π_i = π_i (∂P/∂π_i) / Σ_{k=1}^{N} π_k (∂P/∂π_k)   (44a)

a_ij = a_ij (∂P/∂a_ij) / Σ_{k=1}^{N} a_ik (∂P/∂a_ik)   (44b)

b_j(k) = b_j(k) (∂P/∂b_j(k)) / Σ_{l=1}^{M} b_j(l) (∂P/∂b_j(l)).   (44c)

By appropriate manipulation of (44), the right-hand sides of each equation can be readily converted to be identical to the right-hand sides of each part of (40a)-(40c), thereby showing that the reestimation formulas are indeed exactly correct at critical points of P. In fact, the form of (44) is essentially that of a reestimation formula in which the left-hand side is the reestimate and the right-hand side is computed using the current values of the variables.

Finally, we note that since the entire problem can be set up as an optimization problem, standard gradient techniques can be used to solve for "optimal" values of the model parameters [14]. Such procedures have been tried and have been shown to yield solutions comparable to those of the standard reestimation procedures.

IV. Types of HMMs

Until now, we have only considered the special case of ergodic or fully connected HMMs, in which every state of the model could be reached (in a single step) from every other state of the model. (Strictly speaking, an ergodic model has the property that every state can be reached from every other state in a finite number of steps.) As shown in Fig. 7(a), for an N = 4 state model, this type of model has the property that every a_ij coefficient is positive. Hence for the example of Fig. 7(a) we have

A = {a_ij} = | a_11  a_12  a_13  a_14 |
             | a_21  a_22  a_23  a_24 |
             | a_31  a_32  a_33  a_34 |
             | a_41  a_42  a_43  a_44 |,   a_ij > 0.

For some applications, in particular those to be discussed later in this paper, other types of HMMs have been found to account for observed properties of the signal being modeled better than the standard ergodic model. One such model is shown in Fig. 7(b). This model is called a left-right model or a Bakis model [11], [10], because the underlying state sequence associated with the model has the property that as time increases the state index increases (or stays the same), i.e., the states proceed from left to right.
Clearly the left-right type of HMMhas the desirable property thatitcan readily mode signals whose properties change over time— feg., speech, The fundamental property ofall left-right Fig 7 llustration of distinct ypes of HMMs, () AAstate rgodic model (byAsstateletight model () Aestate par SHES path etl mode. HMMsisthatthe state transition coetficients have the prop- ery ano f