A Seminar Report On
Automatic Speech Recognition
Department of Computer Science
JAWAHARLAL NEHRU INSTITUTE OF ADVANCED STUDIES (JNIAS), HYDERABAD - 500 003, A.P.
2011-2012
ABSTRACT
Automatic speech recognition (ASR) is a computerized speech-to-text process, in which speech is usually recorded with acoustic microphones by capturing air pressure changes. This kind of air-transmitted speech signal is prone to two kinds of problems, related to noise robustness and applicability. The former means that the mixing of the speech signal with ambient noise usually deteriorates ASR performance. The latter means that speech can be overheard easily on the air-transmission channel, which often results in privacy loss or annoyance to other people. Automatic speech recognition systems are trained using human supervision to provide transcriptions of speech utterances. The goal is to minimize the human supervision needed for training acoustic and language models and to maximize performance given the transcribed and untranscribed data. This aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function for a human to label.
Table of Contents

1. Introduction to Speech Technology
2. Basics of Speech Recognition
3. Performance of Speech Recognition
4. Architecture of Speech Recognition
5. Algorithms Used
   5.1 Hidden Markov Models (HMM)
   5.2 Dynamic Time Warping (DTW)
   5.3 Viterbi
6. Challenges for Speech Recognition
7. Approaches for Speech Recognition
   7.1 Template based approach
   7.2 Knowledge or rule based approach
   7.3 Statistical based approach
8. Machine Learning
9. Language Model
10. Applications
11. Advantages
12. Disadvantages
13. Conclusion
14. References
Playing back simple information: Customers need fast access to information, and in many circumstances they do not actually need or want to speak to a live operator. For example, if they have little time or only require basic information, speech recognition can be used to cut waiting times and provide customers with the information they want.
Automated identification: Where one needs to authenticate one's identity on the phone without using risky personal data. Some advanced speech recognition systems provide an answer to this problem using voice biometrics. This technology is now accepted as a major tool in combating telephone-based crime. On average it takes less than two minutes to create a voiceprint based on specific text, such as a name and account number. This is then stored against the individual's record, so when they next call, they can simply say their name, and if the voiceprint matches what is stored, the person is put straight through to a customer service representative.

Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to speech recognition where the recognition system is trained to a particular speaker, as is the case for most desktop recognition software; hence there is an element of speaker recognition, which attempts to identify the person speaking, to better recognize what is being said. Speech recognition is the broader term, meaning the system can recognize almost anybody's speech, whereas voice recognition is a system trained to a particular user, recognizing their speech based on their unique vocal sound.
Speaker dependence: Speaker-dependent systems are designed around a specific speaker. They are generally more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker-independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.
Speaking mode: The recognition systems can be either isolated-word processors or continuous-speech processors. Some systems process isolated utterances, which may include single words or even sentences, while others process continuous speech, in which continuously uttered speech is recognized; the latter is what most real-time applications implement.

Speaking style: It can either be speaker dependent or speaker independent.
Vocabularies: They are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word; entries can be as long as a sentence or two. Smaller vocabularies can have as few as one or two recognized utterances, while very large vocabularies can have a hundred thousand or more.

Accuracy: The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more. The acceptable accuracy of a system really depends on the application.
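Accuracy of this kind is commonly reported via word error rate (WER): the minimum number of substitutions, insertions and deletions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch (WER is standard practice, though the report does not name it; the example sentences are invented):

```python
def wer(reference, hypothesis):
    """Word error rate between a reference transcript and a hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A system quoted as "98% accurate" roughly corresponds to a WER of about 2% on the relevant task.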
Training the acoustic models: Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.
This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. Limited-vocabulary systems, which require no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.
4. Architecture of Speech Recognition
Speech recognition is getting a computer to understand spoken language. By "understand" we might mean:
- React appropriately
- Convert the input speech into another medium, e.g. text

It is done by:
- Digitization
- Acoustic analysis of the speech signal
- Linguistic interpretation

The speech signal given as input is converted from an analog signal into a digital representation, where the speech signal is segmented depending on the language. The architecture of speech recognition is shown below.
The first step in almost all speech recognition systems is the extraction of features from the acoustic data. Most systems make use of Mel Frequency Cepstral Coefficients (MFCC) to describe the speech signal. First, the input signal is windowed. Typically, this is done with Hamming windows 30 ms long and with a 20 ms overlap. Next, the spectrum is computed by taking the Fourier transform.
These coefficients are then mapped onto the Mel scale using a set of triangular-shaped filters. After taking the log of the powers (phase information is omitted because it contains no useful information; moreover, the human ear is also largely phase-deaf), the resulting coefficients are treated as a signal and the inverse discrete cosine transform is taken. The resulting spectrum is called the Mel Frequency Spectrum and the resulting coefficients are called Mel Frequency Cepstral Coefficients. Usually the first 12 coefficients are used to describe the part of the speech signal under the Hamming window, forming a feature vector. Next, the energy of the signal, which also contains useful information, is added to the feature vector. The window then shifts (by 10 ms) and a new feature vector is calculated. This procedure creates a time series of feature vectors from the continuous speech signal.
Because speech is transient in nature, first- and second-order time derivatives of the MFCC features are also added to every feature vector. From the MFCC features we obtain the phonemes, which are compared with the database. Another important part of typical speech recognition systems is the lexicon (also called the dictionary). The lexicon describes how to combine acoustic models (phonemes) to form words. It contains all words known to the ASR system and the series of phonemes that must be encountered to form each word. The language model combines words to form sentences. Finally, we get the recognized words.
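The MFCC pipeline described above (window, Fourier transform, mel filterbank, log, cosine transform) can be sketched as follows. This is a minimal illustration, not production feature extraction: the 30 ms window and 10 ms shift follow the text, while the FFT size, the number of mel filters and the synthetic sine-wave test signal are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centers are equally spaced on the mel scale.
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, n_filters=26, n_ceps=12):
    frame_len = int(0.030 * sample_rate)    # 30 ms Hamming window (per the text)
    frame_shift = int(0.010 * sample_rate)  # 10 ms shift, i.e. 20 ms overlap
    n_fft = 512
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # phase is discarded
        log_energies = np.log(fbank @ power + 1e-10)
        # DCT of the log filterbank energies; keep the first 12 coefficients.
        n = np.arange(n_filters)
        ceps = np.array([np.sum(log_energies *
                                np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                         for k in range(n_ceps)])
        features.append(ceps)
    return np.array(features)

# One second of a 440 Hz tone at 16 kHz gives 98 frames of 12 coefficients each.
t = np.linspace(0, 1, 16000, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)
```

In a real system the frame energy and the delta/delta-delta coefficients mentioned above would be appended to each 12-dimensional vector.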
5. Algorithms Used
5.1 Hidden Markov model (HMM)
Modern general-purpose speech recognition systems are generally based on HMMs. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal; that is, one can assume that over a short time, on the order of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes.
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the HMM would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution. An HMM for a sequence of words or phonemes is made by concatenating the individually trained HMMs for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use Vocal Tract Length Normalization (VTLN) for male-female normalization and Maximum Likelihood Linear Regression (MLLR) for more general speaker adaptation.
The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use Heteroscedastic Linear Discriminant Analysis (HLDA); or they might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as Maximum Likelihood Linear Transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are Maximum Mutual Information (MMI), Minimum Classification Error (MCE) and Minimum Phone Error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand.
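The per-state emission distribution described above (a mixture of diagonal-covariance Gaussians giving a likelihood for each observed feature vector) can be sketched as follows. All numbers in the example mixture are invented for illustration; real component weights, means and variances come from training.

```python
import math

def diag_gaussian_pdf(x, means, variances):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    log_p = 0.0
    for xi, mu, var in zip(x, means, variances):
        log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
    return math.exp(log_p)

def state_likelihood(x, mixture):
    """Emission likelihood of one HMM state: a weighted sum of components.

    mixture is a list of (weight, means, variances) tuples; weights sum to 1.
    """
    return sum(w * diag_gaussian_pdf(x, mu, var) for w, mu, var in mixture)

# A two-component mixture over 2-dimensional features (hypothetical numbers).
mixture = [
    (0.6, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [3.0, -1.0], [2.0, 0.5]),
]
likelihood = state_likelihood([0.1, 0.2], mixture)
print(likelihood)
```

During decoding, the Viterbi algorithm (next section) multiplies these per-state likelihoods with transition probabilities to score candidate state sequences.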
5.3 Viterbi
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, HMMs. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. These algorithms belong to the realm of information theory.

The algorithm makes a number of assumptions. First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time. Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t, and the most likely sequence at point t-1. These assumptions are all satisfied in a first-order hidden Markov model.

The terms "Viterbi path" and "Viterbi algorithm" are applied to related dynamic programming algorithms that discover the single most likely explanation for an observation. In statistical parsing, a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is sometimes called the "Viterbi parse".

Dynamic programming usually takes one of two approaches:
- Top-down approach: The problem is broken into subproblems, and these subproblems are solved and the solutions remembered, in case they need to be solved again. This is recursion and memoization combined together.
- Bottom-up approach: All subproblems that might be needed are solved in advance and then used to build up solutions to larger problems. This approach is slightly better in stack space and number of function calls, but it is sometimes not intuitive to figure out all the subproblems needed for solving the given problem.
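A minimal Viterbi implementation over a toy first-order HMM illustrates the dynamic programming described above. The states, probabilities and observation sequence are the classic rainy/sunny textbook example, invented for illustration, not data from this report.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # V[t][s] = (probability of the best path ending in state s at time t,
    #            the predecessor state on that path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for t in range(1, len(observations)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            prob = (V[t - 1][best_prev][0] * trans_p[best_prev][s]
                    * emit_p[s][observations[t]])
            V[t][s] = (prob, best_prev)
    # Backtrack from the best final state to recover the Viterbi path.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ("Rainy", "Sunny")
obs = ("walk", "shop", "clean")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(obs, states, start_p, trans_p, emit_p))  # ['Sunny', 'Rainy', 'Rainy']
```

In an ASR decoder the hidden states would be HMM states of phoneme models, the emission probabilities would come from the Gaussian mixtures discussed earlier, and probabilities would be kept in log space to avoid underflow.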
6. Challenges for Speech Recognition
- Inter-speaker variability: vocal tract, gender, dialects
- Language variability: from isolated words to continuous speech; out-of-vocabulary words
- Vocabulary size and domain: from just a few words to large-vocabulary speech recognition; the domain that is being recognized
7. Approaches for Speech Recognition

7.2 Knowledge or rule based approach
This approach is based on a blackboard architecture:
- At each decision point, lay out the possibilities
- Apply rules to determine which sequences are permitted

Performance is poor due to:
- Difficulty expressing the rules
- Difficulty making the rules interact
- Difficulty knowing how to improve the system
9. Language Model
While grammar-based language models are easy to understand, they are not generally useful for large-vocabulary applications, simply because it is so difficult to write a grammar with sufficient coverage of the language. The most common kind of language model in use today is based on estimates of word-string probabilities from large collections of text or transcribed speech. In order to make these estimates tractable, the probability of a word given the preceding sequence is approximated by the probability given the preceding one (bigram) or two (trigram) words (in general, these are called n-gram models).
For a bigram language model:

P(w_n | w_1, w_2, w_3, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
To estimate the bigram probabilities from a text we must count the number of occurrences of the word pair (w_{n-1}, w_n) and divide that by the number of occurrences of the preceding word w_{n-1}. This is a relatively easy computation, and accurate estimates can be obtained from transcriptions of language similar to that expected as input to the system.

For example, if we are recognising news stories, text such as the Wall Street Journal corpus can be used to estimate bigram probabilities for the language model. This model is unlikely to transfer very well to another domain such as train timetable enquiries; in general, each application requires the language model to be fine-tuned to the language input expected.

The bigram language model gives the simplest measure of word transition probability but ignores most of the preceding context. It is easy to come up with examples of word sequences which will be improperly estimated by a bigram model (for example, in "The dog on the hill barked", the probability of "barked" following "hill" is likely to be underestimated). The more context a language model can use, the more likely it is to be able to capture longer-range dependencies.
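The counting procedure above can be sketched in a few lines. The tiny training text is a made-up stand-in for a real corpus such as the Wall Street Journal; it also ignores sentence boundaries, which a real model would handle.

```python
from collections import Counter

def bigram_probs(tokens):
    """Estimate P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

text = "the dog barked the dog ran the cat ran".split()
probs = bigram_probs(text)
# "the dog" occurs twice and "the" occurs three times, so P(dog | the) = 2/3.
print(probs[("the", "dog")])
```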
In a trigram language model the probability of a word given its predecessors is estimated by the probability given the previous two words:

P(w_n | w_1, w_2, w_3, ..., w_{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
Many word triples occur rarely or never, even in a large training text, so raw trigram estimates are unreliable. To overcome this paucity of data, the technique of language model smoothing is used. Here the overall trigram probability is derived by interpolating trigram, bigram and unigram probabilities:

P(w_n | w_{n-2}, w_{n-1}) = k1 * f(w_n | w_{n-2}, w_{n-1}) + k2 * f(w_n | w_{n-1}) + k3 * f(w_n)

where the functions f() are the unsmoothed estimates of the trigram, bigram and unigram probabilities. This means that for a triple which does not occur in the training text, the estimated probability will be derived from the bigram model and the unigram model; the estimate will be non-zero for every word included in the lexicon (i.e. every word for which there is an estimate of P(w)). The choice of the parameters k1..k3 is another optimisation problem.
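The interpolation formula above can be sketched directly. The weights k1..k3 and the tiny corpus are invented for illustration; in practice the weights would be optimised on held-out data, as the text notes.

```python
from collections import Counter

def smoothed_trigram(corpus, w1, w2, w3, k1=0.6, k2=0.3, k3=0.1):
    """Interpolated trigram probability: k1*f(tri) + k2*f(bi) + k3*f(uni)."""
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
    # Unsmoothed relative-frequency estimates f(); zero when the history is unseen.
    f_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    f_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f_uni = uni[w3] / len(corpus)
    return k1 * f_tri + k2 * f_bi + k3 * f_uni

corpus = "the dog on the hill barked at the dog on the road".split()
p = smoothed_trigram(corpus, "dog", "on", "the")
print(p)
```

For a triple absent from the corpus the first term is zero and the estimate falls back on the bigram and unigram terms, so it stays non-zero for any word in the lexicon.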
10. Applications
Health care
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. The services provided may be redistributed rather than replaced. Speech recognition can be implemented in the front end or back end of the medical documentation process.
Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone.
Management
Battle management command centres generally require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited.
Telephony and other domains

ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. However, despite the high level of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increases in use.
11. Advantages
- It enables increased efficiency in the workplace when hands are busy
- Quicker input of data for processing
- Data entry with no need to type: just speak what you want typed
- Easy for people who are physically challenged
12. Disadvantages
- Robustness: graceful degradation, not catastrophic failure
- Portability: independence of computing platform
- Adaptability: to changing conditions (different microphone, background noise, new speaker, new task domain, even a new language)
- Language modelling: is there a role for linguistics in improving the language models?
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem
- Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
- Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace
13. Conclusion
ASR is becoming a sophisticated technology and will grow in popularity; its success will bring revolutionary changes to the computer industry. This will occur in the business world as well as in our personal lives.
14. References
1. L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Pearson Education, first edition, 2003.
2. L. R. Rabiner, R. W. Schafer, Digital Processing of Speech Signals, Pearson Education.
3. http://en.wikipedia.org/wiki/Speech_Recognition
4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1318444
5. http://sound2sense.eu/images/uploads/DengStrik2007.pdf
6. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4156191