Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech. Speech recognition is a broader solution that refers to technology that can recognize speech without being targeted at a single speaker, such as a call system that can recognize arbitrary voices. Speech recognition applications include voice user interfaces such as voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control, search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft control (usually termed Direct Voice Input).
An ADC translates the analog waves of your voice into digital data by sampling the sound. The higher the sampling and precision rates, the higher the quality.
Speech to Data

To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes to separate it into different bands of frequency (frequency, the rate at which a sound wave oscillates, is heard by humans as differences in pitch). It also normalizes the sound, or adjusts it to a constant volume level. The sound may also have to be temporally aligned: people don't always speak at the same speed, so the sound must be adjusted to match the speed of the template sound samples already stored in the system's memory.

Next, the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds (consonant stops produced by obstructing airflow in the vocal tract, like "p" or "t"). The program then matches these segments to known phonemes in the appropriate language. A phoneme is the smallest element of a language, a representation of the sounds we make and put together to form meaningful expressions. There are roughly 40 phonemes in the English language (different linguists have different opinions on the exact number), while other languages have more or fewer phonemes.
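The front-end steps above (normalize to a constant volume level, then divide the signal into small segments) can be sketched in a few lines. The 25 ms frame length and the list-of-samples representation are illustrative assumptions, not details from the text:

```python
def preprocess(samples, rate, frame_ms=25):
    """Normalize a digitized waveform to a constant peak volume and split
    it into short frames. A 25 ms frame is an assumed, typical choice."""
    peak = max(abs(s) for s in samples) or 1.0
    normalized = [s / peak for s in samples]      # constant volume level
    frame_len = int(rate * frame_ms / 1000)       # samples per segment
    # Divide the signal into small, fixed-length segments.
    return [normalized[i:i + frame_len]
            for i in range(0, len(normalized) - frame_len + 1, frame_len)]

# 4,000 samples at 16 kHz -> 10 frames of 400 samples each.
frames = preprocess([0.0, 0.5, -1.0, 0.25] * 1000, rate=16000)
```

A real front end would follow this with noise filtering and the feature extraction described later in the article.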
The next step seems simple, but it is actually the most difficult to accomplish, and it is the focus of most speech recognition research. The program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was probably saying and either outputs it as text or issues a computer command.
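A toy version of that statistical comparison can be shown with a bigram model. The tiny count table below is entirely made up and stands in for the "large library of known words, phrases and sentences"; real systems use counts from corpora of millions of words:

```python
# Hypothetical bigram counts; "<s>" marks the start of an utterance.
bigram = {("<s>", "call"): 30, ("call", "home"): 50, ("call", "hone"): 1}

def score(words, counts, alpha=1.0):
    """Add-one-smoothed bigram score: the product of (count + alpha)
    over each adjacent word pair. Higher means more probable."""
    s = 1.0
    for prev, cur in zip(["<s>"] + words[:-1], words):
        s *= counts.get((prev, cur), 0) + alpha
    return s

# Two acoustically similar hypotheses; word context picks the likelier one.
best = max([["call", "home"], ["call", "hone"]],
           key=lambda w: score(w, bigram))
```

Here the context statistics resolve an ambiguity that sound alone cannot, which is exactly the role the statistical model plays in the recognizer.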
History

The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken digits. [1] Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair.

One of the most notable domains for the commercial application of speech recognition in the United States has been health care, and in particular the work of the medical transcriptionist (MT). [citation needed] According to industry experts, at its inception, speech recognition (SR) was sold as a way to completely eliminate transcription rather than make the transcription process more efficient, and hence it was not accepted. It was also the case that SR at that time was often technically deficient. Additionally, to be used effectively, it required changes to the ways physicians worked and documented clinical encounters, which many if not all were reluctant to do. The biggest limitation to speech recognition automating transcription, however, is seen as the software. The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. Another limitation has been the extensive amount of time required by the user and/or system provider to train the software.

A distinction in ASR is often made between "artificial syntax systems," which are usually domain-specific, and "natural language processing," which is usually language-specific. Each of these types of application presents its own particular goals and challenges.
Algorithms

Hidden Markov models
Main article: Hidden Markov model

Modern general-purpose speech recognition systems are based on hidden Markov models (HMMs). These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: on short time scales (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic purposes. Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use.

In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal-covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.
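The cepstral computation described above (Fourier transform of a short windowed frame, log magnitude spectrum, cosine transform to decorrelate, keep the first coefficients) can be sketched directly. This is a bare-bones illustration, not a production front end: real systems (e.g., MFCC pipelines) add pre-emphasis, a mel filter bank, and FFT-based transforms rather than the plain DFT used here:

```python
import cmath
import math

def cepstral_coeffs(frame, n_coeffs=10):
    """Rough cepstral features for one short frame of speech samples."""
    n = len(frame)
    # Apply a Hamming window to the frame before the Fourier transform.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Magnitude spectrum via a plain (O(n^2)) discrete Fourier transform.
    spectrum = [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n)))
                for k in range(n // 2 + 1)]
    log_spec = [math.log(s + 1e-10) for s in spectrum]
    m = len(log_spec)
    # Type-II DCT: the cosine transform that decorrelates the spectrum;
    # keep only the first (most significant) coefficients.
    return [sum(log_spec[j] * math.cos(math.pi * i * (2 * j + 1) / (2 * m))
                for j in range(m))
            for i in range(n_coeffs)]

# One 10 ms frame (80 samples at 8 kHz) of a 440 Hz tone -> 10-dim vector.
vec = cepstral_coeffs([math.sin(2 * math.pi * 440 * t / 8000)
                       for t in range(80)])
```

Emitting one such vector every 10 milliseconds yields exactly the observation sequence the HMM models.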
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. Here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
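A minimal sketch of the Viterbi algorithm over a toy HMM follows. The two states, transition matrix, and per-frame observation likelihoods below are invented for illustration; real decoders work in log space over very large state graphs:

```python
def viterbi(obs_lik, trans, init):
    """Most likely hidden-state path for a sequence of observations.

    obs_lik[t][s]: likelihood of frame t under state s
    trans[p][s]:   probability of moving from state p to state s
    init[s]:       initial probability of state s
    """
    n_states = len(init)
    prob = [init[s] * obs_lik[0][s] for s in range(n_states)]
    back = []                       # backpointers, one list per frame
    for t in range(1, len(obs_lik)):
        new, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prob[p] * trans[p][s])
            new.append(prob[best] * trans[best][s] * obs_lik[t][s])
            ptr.append(best)
        prob, back = new, back + [ptr]
    # Trace the best path backwards through the stored pointers.
    state = max(range(n_states), key=lambda s: prob[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Three frames: the first two favour state 0, the last favours state 1.
path = viterbi([[0.9, 0.2], [0.8, 0.1], [0.1, 0.8]],
               [[0.7, 0.3], [0.4, 0.6]],
               [0.6, 0.4])
```

With word-level HMMs concatenated as described above, the best path through the combined model directly yields the recognized word sequence.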
Dynamic time warping (DTW)-based speech recognition
Main article: Dynamic time warping

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another he or she were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
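The classic DTW dynamic program is short enough to show in full; this is a generic sketch over arbitrary numeric sequences, with the absolute difference as an assumed local distance:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two sequences that may vary
    in time or speed; classic O(len(a) * len(b)) dynamic program."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # stretch a
                                 cost[i][j - 1],       # stretch b
                                 cost[i - 1][j - 1])   # step both
    return cost[n][m]

# A slowed-down copy of a pattern still matches it perfectly (distance 0),
# which is exactly the speed-invariance DTW provides.
slow_match = dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])
```

In DTW-based recognition, the same comparison is run between an incoming utterance's feature sequence and each stored template, and the closest template wins.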
Applications

Health care

In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. The services provided may be redistributed rather than replaced. Speech recognition can be implemented in the front end or back end of the medical documentation process. Front-end SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document; it never goes through an MT/editor. Back-end SR, or deferred SR, is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. Deferred SR is widely used in the industry currently.

Many Electronic Medical Record (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.

One of the major issues relating to the use of speech recognition in health care is that the American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an Electronic Health Record or EHR). Unfortunately, in many instances, the use of speech recognition within an EHR will not lead to data maintained within a database, but rather to narrative text. For this reason, substantial resources are being expended to allow for the use of front-end SR while capturing data within the EHR.
Military

High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), a program in France installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found that recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases, and introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all a proper syntax, could thus be expected to improve recognition accuracy substantially. [2]
The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands, or to any of his wingmen with only five commands. [3]
Speaker-independent systems are also being developed and are in testing for the F-35 Lightning II (JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer. These systems have produced word accuracies in excess of 98%. [citation needed]
Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the jet fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot, in general, does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade on speech recognition system applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included control of communication radios, setting of navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done, both in speech recognition and in overall speech technology, in order to consistently achieve performance improvements in operational settings.
Battle management
In general, battle management command centres require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited.

Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interfaces. Speech recognition efforts have focused on a database of large-vocabulary continuous speech recognition (CSR) designed to be representative of the naval resource management task. Significant advances in the state of the art in CSR have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.

Training air traffic controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog that the controller would have to conduct with pilots in a real ATC situation.
Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. In theory, air controller tasks are also characterized by highly structured speech as the primary output of the controller, so reducing the difficulty of the speech recognition task should be possible. In practice, this is rarely the case. FAA document 7110.65 details the phrases that should be used by air traffic controllers. While this document gives fewer than 150 examples of such phrases, the number of phrases supported by one simulation vendor's speech recognition system is in excess of 500,000. The USAF, USMC, US Army, US Navy, and FAA, as well as a number of international ATC training organizations such as the Royal Australian Air Force and the Civil Aviation Authorities in Italy, Brazil, and Canada, are currently using ATC simulators with speech recognition from a number of different vendors.
Telephony and other domains

ASR is now commonplace in the field of telephony and is becoming more widespread in the fields of computer gaming and simulation. Despite its high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use. The improvement of mobile processor speeds has made feasible the speech-enabled Symbian and Windows Mobile smartphones. Speech is used mostly as a part of the user interface, for creating predefined or custom speech commands. Leading software vendors in this field are Microsoft Corporation (Microsoft Voice Command), Digital Syphon (Sonic Extractor), Nuance Communications (Nuance Voice Control), Speech Technology Center, Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator), and SVOX.
Further applications

- Automatic translation
- Automotive speech recognition (e.g., OnStar, Ford Sync)
- Court reporting (realtime voice writing)
- Hands-free computing: voice command recognition computer user interface
- Home automation
- Interactive voice response
- Mobile telephony, including mobile email
- Multimodal interaction
- Pronunciation evaluation in computer-aided language learning applications
- Robotics
- Speech-to-text (transcription of speech into mobile text messages)
- Telematics (e.g., vehicle navigation systems)
- Transcription (digital speech-to-text)
- Video games, with Tom Clancy's EndWar and Lifeline as working examples
Performance

The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real-time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).

In 1982, Kurzweil Applied Intelligence and Dragon Systems released speech recognition products. By 1985, Kurzweil's software had a vocabulary of 1,000 words, if uttered one word at a time. Two years later, in 1987, its lexicon reached 20,000 words, entering the realm of human vocabularies, which range from 10,000 to 150,000 words. But recognition accuracy was only 10% in 1993. Two years later, the error rate crossed below 50%. Dragon Systems released "Naturally Speaking" in 1997, which recognized normal human speech. Progress mainly came from improved computer performance and larger source text databases. The Brown Corpus was the first major database available, containing several million words. In 2006, Google published a trillion-word corpus, while Carnegie Mellon University researchers found no significant increase in recognition accuracy. [4]
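Word error rate is just a word-level Levenshtein (edit) distance divided by the length of the reference transcript; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) divided
    by the number of reference words, via Levenshtein distance on words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of three: WER = 1/3.
rate = wer("call home now", "call home")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.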
Further information
- Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Eurospeech/ICSLP (now named Interspeech), and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" (1993) by Lawrence Rabiner can be useful for acquiring basic knowledge but may not be fully up to date. Other good sources are "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing" (2001) by Xuedong Huang et al. More up to date is "Computer Speech" by Manfred R. Schroeder, second edition published in 2004. The recently updated textbook "Speech and Language Processing" (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest ongoing speech recognition-related project as of 2007 is the GALE project, which involves both speech recognition and translation components).
- In terms of freely available resources, Carnegie Mellon University's SPHINX toolkit is one place to start, both to learn about speech recognition and to begin experimenting. Another resource (free as in free beer, not as in free speech) is the HTK book (and the accompanying HTK toolkit). The AT&T GRM and DCD libraries are also general software libraries for large-vocabulary speech recognition.
- For more software resources, see List of speech recognition software.
Speech Recognition: Weaknesses and Flaws

No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some of these factors are issues that continue to improve as the technology improves. Others can be lessened, if not completely corrected, by the user.

Low signal-to-noise ratio

The program needs to "hear" the words spoken distinctly, and any extra noise introduced into the sound will interfere with this. The noise can come from a number of sources, including loud background noise in an office environment. Users should work in a quiet room with a quality microphone positioned as close to their mouths as possible. Low-quality sound cards, which provide the input for the microphone to send the signal to the computer, often do not have enough shielding from the electrical signals produced by other computer components. They can introduce hum or hiss into the signal.

Overlapping speech

Current systems have difficulty separating simultaneous speech from multiple users. "If you try to employ recognition technology in conversations or meetings where people frequently interrupt each other or talk over one another, you're likely to get extremely poor results," says John Garofolo.

Intensive use of computer power

Running the statistical models needed for speech recognition requires the computer's processor to do a lot of heavy work. One reason for this is the need to remember each stage of the word-recognition search in case the system needs to backtrack to come up with the right word. The fastest personal computers in use today can still have difficulties with complicated commands or phrases, slowing down the response time significantly. The vocabularies needed by the programs also take up a large amount of hard drive space. Fortunately, disk storage and processor speed are areas of rapid advancement; the computers in use 10 years from now will benefit from an exponential increase in both factors.
Homonyms

Homonyms are two words that are spelled differently and have different meanings but sound the same. "There" and "their," "air" and "heir," "be" and "bee" are all examples. There is no way for a speech recognition program to tell the difference between these words based on sound alone. However, extensive training of systems, and statistical models that take word context into account, have greatly improved their performance.