
INTRODUCTION

Speech recognition (also known as automatic speech recognition or computer speech
recognition) converts spoken words to text. The term "voice recognition" is sometimes used to
refer to recognition systems that must be trained to a particular speaker, as is the case for most
desktop recognition software. Recognizing the speaker can simplify the task of translating
speech.
Speech recognition is a broader solution that refers to technology that can recognize speech
without being targeted at a single speaker, such as a call system that can recognize arbitrary
voices.
Speech recognition applications include voice user interfaces such as voice dialing (e.g., "Call
home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control,
search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering
a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-
text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice
Input).

An ADC translates the analog waves of your voice into digital data by sampling the sound.
The higher the sampling and precision rates, the higher the quality.





Speech to Data
To convert speech to on-screen text or a computer command, a computer has to go through
several complex steps. When you speak, you create vibrations in the air. The analog-to-digital
converter (ADC) translates this analog wave into digital data that the computer can understand.
To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at
frequent intervals. The system filters the digitized sound to remove unwanted noise, and
sometimes to separate it into different bands of frequency (frequency is the rate at which the
sound wave oscillates, heard by humans as differences in pitch). It also normalizes the sound, or
adjusts it to a constant volume level. It may also have to be temporally aligned: people don't
always speak at the same speed, so the sound must be adjusted to match the speed of the template
sound samples already stored in the system's memory.
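To make the sampling, quantization, and normalization steps concrete, here is a minimal Python sketch. The 16 kHz sample rate, 16-bit depth, and the helper names (sample_signal, quantize, normalize) are illustrative assumptions, not the behaviour of any particular recognition product.

```python
# Minimal sketch of digitizing and normalizing a signal, assuming illustrative
# values for the sample rate and bit depth.
import numpy as np

def sample_signal(analog, duration_s, rate_hz=16000):
    """Evaluate a continuous signal analog(t) at regular intervals (sampling)."""
    t = np.arange(0, duration_s, 1.0 / rate_hz)
    return analog(t)

def quantize(samples, bits=16):
    """Round each sample to one of 2**bits levels (the 'precision' of the ADC)."""
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(samples, -1.0, 1.0) * levels) / levels

def normalize(samples, peak=0.9):
    """Scale the recording so its loudest sample sits at a constant level."""
    return samples * (peak / (np.max(np.abs(samples)) + 1e-12))

# Example: a 440 Hz tone standing in for speech reaching the hypothetical ADC.
signal = sample_signal(lambda t: 0.3 * np.sin(2 * np.pi * 440 * t), duration_s=1.0)
digital = normalize(quantize(signal))
print(digital.shape, digital.max())
```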
Next the signal is divided into small segments as short as a few hundredths of a second, or even
thousandths in the case of plosive consonant sounds -- consonant stops produced by obstructing
airflow in the vocal tract -- like "p" or "t." The program then matches these segments to known
phonemes in the appropriate language. A phoneme is the smallest element of a language -- a
representation of the sounds we make and put together to form meaningful expressions. There
are roughly 40 phonemes in the English language (different linguists have different opinions on
the exact number), while other languages have more or fewer phonemes.
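The segmentation step can be illustrated with a short framing routine. This is only a sketch: the 25 ms window and 10 ms hop are common textbook defaults assumed for the example, not values given in the text.

```python
# Hedged illustration of dividing digitized speech into very short,
# overlapping segments (frames).
import numpy as np

def frame_signal(samples, rate_hz=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D sample array into overlapping frames a few hundredths of a second long."""
    frame_len = int(rate_hz * frame_ms / 1000)
    hop_len = int(rate_hz * hop_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    return np.stack([samples[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))   # one second of dummy audio
print(frames.shape)                             # (n_frames, samples_per_frame)
```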

The next step seems simple, but it is actually the most difficult to accomplish and is the focus
of most speech recognition research. The program examines phonemes in the context of the other
phonemes around them. It runs the contextual phoneme plot through a complex statistical model
and compares them to a large library of known words, phrases and sentences. The program then
determines what the user was probably saying and either outputs it as text or issues a computer
command.
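As a toy illustration of this matching step, the sketch below compares a recognized phoneme sequence against a small pronunciation lexicon and weights the acoustic match by a word prior. The lexicon entries, the priors, and the use of a simple sequence-similarity score are all invented for the example; real systems use far larger libraries and much richer statistical models over whole phrases and sentences.

```python
# Toy word lookup: score each lexicon entry by phoneme similarity times a
# hypothetical word prior, then keep the best candidate.
from difflib import SequenceMatcher

LEXICON = {                       # word -> ARPAbet-style phonemes (toy entries)
    "cat":  ["K", "AE", "T"],
    "cad":  ["K", "AE", "D"],
    "bat":  ["B", "AE", "T"],
}
PRIOR = {"cat": 0.6, "cad": 0.1, "bat": 0.3}   # made-up word frequencies

def best_word(observed_phonemes):
    def score(word):
        sim = SequenceMatcher(None, LEXICON[word], observed_phonemes).ratio()
        return sim * PRIOR[word]            # acoustic match weighted by prior
    return max(LEXICON, key=score)

print(best_word(["K", "AE", "T"]))          # -> "cat"
```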



History
The first speech recognizer appeared in 1952 and consisted of a device for the recognition of
single spoken digits.[1] Another early device was the IBM Shoebox, exhibited at the 1964 New
York World's Fair.
One of the most notable domains for the commercial application of speech recognition in the
United States has been health care, and in particular the work of the medical transcriptionist
(MT). According to industry experts, at its inception, speech recognition (SR) was
sold as a way to completely eliminate transcription rather than make the transcription process
more efficient, hence it was not accepted. It was also the case that SR at that time was often
technically deficient. Additionally, to be used effectively, it required changes to the ways
physicians worked and documented clinical encounters, which many if not all were reluctant to
do. The biggest limitation to speech recognition automating transcription, however, is seen as the
software. The nature of narrative dictation is highly interpretive and often requires judgment that
may be provided by a real human but not yet by an automated system. Another limitation has
been the extensive amount of time required by the user and/or system provider to train the
software.
A distinction in ASR is often made between "artificial syntax systems," which are usually
domain-specific, and "natural language processing," which is usually language-specific. Each of
these types of application presents its own particular goals and challenges.

ALGORITHMS
Hidden Markov models
Main article: Hidden Markov model
Modern general-purpose speech recognition systems are based on Hidden Markov Models.
These are statistical models that output a sequence of symbols or quantities. HMMs are used in
speech recognition because a speech signal can be viewed as a piecewise stationary signal or a
short-time stationary signal. On short time scales (e.g., 10 milliseconds), speech can be
approximated as a stationary process. Speech can be thought of as a Markov model for many
stochastic purposes.
Another reason why HMMs are popular is because they can be trained automatically and are
simple and computationally feasible to use. In speech recognition, the hidden Markov model
would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such
as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral
coefficients, which are obtained by taking a Fourier transform of a short time window of speech
and decorrelating the spectrum using a cosine transform, then taking the first (most significant)
coefficients. The hidden Markov model will tend to have in each state a statistical distribution
that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each
observed vector. Each word, or (for more general speech recognition systems) each phoneme,
will have a different output distribution; a hidden Markov model for a sequence of words or
phonemes is made by concatenating the individual trained hidden Markov models for the
separate words and phonemes.
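The feature extraction described here (Fourier transform of a short window, cosine-transform decorrelation, keep the first coefficients) can be sketched in a few lines of Python. The frame length, coefficient count, and the omission of a mel filterbank are simplifying assumptions made only for illustration.

```python
# Minimal cepstral-coefficient sketch: windowed FFT, log spectrum, DCT,
# keep the first few coefficients.
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(frame, n_coeffs=13):
    windowed = frame * np.hamming(len(frame))          # taper the short frame
    power = np.abs(np.fft.rfft(windowed)) ** 2         # short-time power spectrum
    log_spectrum = np.log(power + 1e-10)               # compress dynamic range
    return dct(log_spectrum, norm='ortho')[:n_coeffs]  # decorrelate, keep first coeffs

frame = np.random.randn(400)            # ~25 ms of audio at 16 kHz (dummy data)
vector = cepstral_coefficients(frame)   # one n-dimensional vector per 10 ms hop
print(vector.shape)
```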
Decoding of the speech (the term for what happens when the system is presented with a new
utterance and must compute the most likely source sentence) would probably use the Viterbi
algorithm to find the best path, and here there is a choice between dynamically creating a
combination hidden Markov model, which includes both the acoustic and language model
information, and combining it statically beforehand (the finite state transducer, or FST,
approach).
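A compact, illustrative Viterbi decoder over a toy two-state HMM is shown below, working in log space. The transition and emission tables are invented; in a real recognizer the emission scores would come from the Gaussian mixtures described above and the transitions from the concatenated word or phoneme models.

```python
# Illustrative Viterbi decoding: find the most likely hidden state path.
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Return the most likely state path for a sequence of observation indices."""
    n_states = len(log_init)
    T = len(observations)
    delta = np.full((T, n_states), -np.inf)     # best log-score ending in each state
    back = np.zeros((T, n_states), dtype=int)   # backpointers
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] + log_emit[j, observations[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_init = np.log([0.6, 0.4])                       # toy model parameters
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(log_init, log_trans, log_emit, [0, 1, 2]))
```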

Dynamic time warping (DTW)-based speech recognition
Main article: Dynamic time warping
Dynamic time warping is an approach that was historically used for speech recognition but has
now largely been displaced by the more successful HMM-based approach. Dynamic time
warping is an algorithm for measuring similarity between two sequences that may vary in time or
speed. For instance, similarities in walking patterns would be detected, even if in one video the
person was walking slowly and if in another he or she were walking more quickly, or even if
there were accelerations and decelerations during the course of one observation. DTW has been
applied to video, audio, and graphics; indeed, any data that can be turned into a linear
representation can be analyzed with DTW.
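A minimal DTW sketch follows: it computes the warping distance between two scalar sequences that differ in speed and length. It is purely illustrative; practical DTW recognizers typically add path constraints (such as a Sakoe-Chiba band) and operate on feature vectors rather than scalars.

```python
# Minimal dynamic time warping distance between two 1-D sequences.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

slow = [0, 0, 1, 1, 2, 2, 3, 3]          # the same pattern produced slowly...
fast = [0, 1, 2, 3]                      # ...and quickly
print(dtw_distance(slow, fast))          # small distance despite different lengths
```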


Applications
Healthcare
In the health care domain, even in the wake of improving speech recognition technologies,
medical transcriptionists (MTs) have not yet become obsolete. The services provided may be
redistributed rather than replaced.
Speech recognition can be implemented at the front end or back end of the medical documentation
process.
Front-end SR is where the provider dictates into a speech-recognition engine, the recognized
words are displayed right after they are spoken, and the dictator is responsible for editing and
signing off on the document. It never goes through an MT/editor.
Back-end SR, or deferred SR, is where the provider dictates into a digital dictation system, the
voice is routed through a speech-recognition machine, and the recognized draft document is
routed along with the original voice file to the MT/editor, who edits the draft and finalizes the
report. Deferred SR is widely used in the industry currently.
Many Electronic Medical Records (EMR) applications can be more effective and may be
performed more easily when deployed in conjunction with a speech-recognition engine.
Searches, queries, and form filling may all be faster to perform by voice than by using a
keyboard.
One of the major issues relating to the use of speech recognition in healthcare is that the
American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial
benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These
standards require that a substantial amount of data be maintained by the EMR (now more
commonly referred to as an Electronic Health Record or EHR). Unfortunately, in many
instances, the use of speech recognition within an EHR will not lead to data maintained within a
database, but rather to narrative text. For this reason, substantial resources are being expended to
allow for the use of front-end SR while capturing data within the EHR.
Military
High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of speech
recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for
the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), a program in
France installing speech recognition systems on Mirage aircraft, and programs in the UK
dealing with a variety of aircraft platforms. In these programs, speech recognizers have been
operated successfully in fighter aircraft, with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight displays.
Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found
recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly
improved the results in all cases, and that introducing models for breathing improved
recognition scores significantly. Contrary to what might be expected, no effects of the broken
English of the speakers were found. It was evident that spontaneous speech caused problems for
the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax,
could thus be expected to improve recognition accuracy substantially.[2]

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent
system, i.e. it requires each pilot to create a template. The system is not used for any safety
critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is
used for a wide range of other cockpit functions. Voice commands are confirmed by visual
and/or aural feedback. The system is seen as a major design feature in the reduction of pilot
workload, and even allows the pilot to assign targets to himself with two simple voice commands
or to any of his wingmen with only five commands.[3]

Speaker-independent systems are also being developed and are in testing for the F-35 Lightning II
(JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer. These systems have
produced word accuracies in excess of 98%.


Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain strongly to
the helicopter environment as well as to the jet fighter environment. The acoustic noise problem
is actually more severe in the helicopter environment, not only because of the high noise levels
but also because the helicopter pilot, in general, does not wear a facemask, which would reduce
acoustic noise in the microphone. Substantial test and evaluation programs have been carried out
in the past decade in speech recognition systems applications in helicopters, notably by the U.S.
Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace
Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma
helicopter. There has also been much useful work in Canada. Results have been encouraging,
and voice applications have included: control of communication radios, setting of navigation
systems, and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot
effectiveness. Encouraging results are reported for the AVRADA tests, although these represent
only a feasibility demonstration in a test environment. Much remains to be done both in speech
recognition and in overall speech recognition technology in order to consistently achieve
performance improvements in operational settings.


Battle management

In general, battle management command centres require rapid access to and control of large,
rapidly changing information databases. Commanders and system operators need to query these
databases as conveniently as possible, in an eyes-busy environment where much of the
information is presented in a display format. Human-machine interaction by voice has the
potential to be very useful in these environments. A number of efforts have been undertaken to
interface commercially available isolated-word recognizers into battle management
environments. In one feasibility study, speech recognition equipment was tested in conjunction
with an integrated information display for naval battle management applications. Users were
very optimistic about the potential of the system, although capabilities were limited.
Speech understanding programs sponsored by the Defense Advanced Research Projects Agency
(DARPA) in the U.S. have focused on this problem of natural speech interface. Speech
recognition efforts have focused on a database of continuous speech recognition (CSR), large-
vocabulary speech designed to be representative of the naval resource management task.
Significant advances in the state of the art in CSR have been achieved, and current efforts are
focused on integrating speech recognition and natural language processing to allow spoken
language interaction with a naval resource management system.
Training air traffic controllers
Training for air traffic controllers (ATC) represents an excellent application for speech
recognition systems. Many ATC training systems currently require a person to act as a "pseudo-
pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog that the
controller would have to conduct with pilots in a real ATC situation. Speech recognition and
synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot,
thus reducing training and support personnel. In theory, air controller tasks are also
characterized by highly structured speech as the primary output of the controller, hence reducing
the difficulty of the speech recognition task should be possible. In practice, this is rarely the case.
The FAA document 7110.65 details the phrases that should be used by air traffic controllers.
While this document gives fewer than 150 examples of such phrases, the number of phrases
supported by one of the simulation vendors' speech recognition systems is in excess of 500,000.
The USAF, USMC, US Army, US Navy, and FAA, as well as a number of international ATC
training organizations such as the Royal Australian Air Force and Civil Aviation Authorities in
Italy, Brazil, and Canada, are currently using ATC simulators with speech recognition from a
number of different vendors.


Telephony and other domains
ASR in the field of telephony is now commonplace, and in the field of computer gaming and
simulation it is becoming more widespread. Despite its high level of integration with word
processing in general personal computing, however, ASR in the field of document production
has not seen the expected increases in use.
The improvement of mobile processor speeds made feasible the speech-enabled Symbian and
Windows Mobile smartphones. Speech is used mostly as a part of the user interface, for creating
pre-defined or custom speech commands. Leading software vendors in this field are: Microsoft
Corporation (Microsoft Voice Command), Digital Syphon (Sonic Extractor), Nuance
Communications (Nuance Voice Control), Speech Technology Center, Vito Technology (VITO
Voice2Go), Speereo Software (Speereo Voice Translator), and SVOX.

Further applications
- Automatic translation
- Automotive speech recognition (e.g., OnStar, Ford Sync)
- Court reporting (Realtime Voice Writing)
- Hands-free computing: voice command recognition computer user interface
- Home automation
- Interactive voice response
- Mobile telephony, including mobile email
- Multimodal interaction
- Pronunciation evaluation in computer-aided language learning applications
- Robotics
- Speech-to-text (transcription of speech into mobile text messages)
- Telematics (e.g., vehicle navigation systems)
- Transcription (digital speech-to-text)
- Video games, with Tom Clancy's EndWar and Lifeline as working examples

Performance
- The performance of speech recognition systems is usually evaluated in terms of accuracy
and speed. Accuracy is usually rated with word error rate (WER), whereas speed is
measured with the real-time factor. Other measures of accuracy include Single Word
Error Rate (SWER) and Command Success Rate (CSR). A minimal WER computation is
sketched after this list.
- In 1982, Kurzweil Applied Intelligence and Dragon Systems released speech recognition
products. By 1985, Kurzweil's software had a vocabulary of 1,000 words -- if uttered one
word at a time. Two years later, in 1987, its lexicon reached 20,000 words, entering the
realm of human vocabularies, which range from 10,000 to 150,000 words. But
recognition accuracy was only 10% in 1993. Two years later, the error rate crossed below
50%. Dragon Systems released "Naturally Speaking" in 1997, which recognized normal
human speech. Progress mainly came from improved computer performance and larger
source text databases. The Brown Corpus was the first major database available,
containing several million words. In 2006, Google published a trillion-word corpus, while
Carnegie Mellon University researchers found no significant increase in recognition
accuracy.[4]
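Below is the minimal word error rate computation referred to in the list above: the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The example sentences are invented for illustration.

```python
# Word error rate via word-level edit distance (illustrative sketch).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("call home now", "call now"))   # one deletion over three words -> ~0.33
```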



Further information

- Popular speech recognition conferences held each year or two include SpeechTEK and
SpeechTEK Europe, ICASSP, Eurospeech/ICSLP (now named Interspeech) and the
IEEE ASRU. Conferences in the field of natural language processing, such as ACL,
NAACL, EMNLP, and HLT, are beginning to include papers on speech processing.
Important journals include the IEEE Transactions on Speech and Audio Processing (now
named IEEE Transactions on Audio, Speech and Language Processing), Computer
Speech and Language, and Speech Communication. Books like "Fundamentals of Speech
Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not
be fully up to date (1993). Other good sources are "Statistical Methods for Speech
Recognition" by Frederick Jelinek and "Spoken Language Processing (2001)" by
Xuedong Huang et al. More up to date is "Computer Speech" by Manfred R. Schroeder,
second edition published in 2004. The recently updated textbook "Speech and
Language Processing (2008)" by Jurafsky and Martin presents the basics and the state of
the art for ASR. A good insight into the techniques used in the best modern systems can
be gained by paying attention to government-sponsored evaluations such as those
organised by DARPA (the largest speech recognition-related project ongoing as of 2007
is the GALE project, which involves both speech recognition and translation
components).
- In terms of freely available resources, Carnegie Mellon University's SPHINX toolkit is
one place to start both to learn about speech recognition and to start experimenting.
Another resource (free as in free beer, not as in free speech) is the HTK book (and the
accompanying HTK toolkit). The AT&T GRM library and DCD library are also
general software libraries for large-vocabulary speech recognition.
- For more software resources, see List of speech recognition software.


Speech Recognition: Weaknesses and Flaws
No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some
of these factors are issues that continue to improve as the technology improves. Others can be
lessened -- if not completely corrected -- by the user.
Low signal-to-noise ratio
The program needs to "hear" the words spoken distinctly, and any extra noise introduced into the
sound will interfere with this. The noise can come from a number of sources, including loud
background noise in an office environment. Users should work in a quiet room with a quality
microphone positioned as close to their mouths as possible. Low-quality sound cards, which
provide the input for the microphone to send the signal to the computer, often do not have
enough shielding from the electrical signals produced by other computer components. They can
introduce hum or hiss into the signal.
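Signal-to-noise ratio itself is a simple quantity: the power of the speech compared with the power of the noise, expressed in decibels. The sketch below uses synthetic stand-ins for a recording and for sound-card hum; the waveforms and the 16 kHz rate are assumptions made only for the example.

```python
# Illustrative signal-to-noise ratio calculation in decibels.
import numpy as np

def snr_db(signal, noise):
    p_signal = np.mean(np.square(signal))
    p_noise = np.mean(np.square(noise)) + 1e-12
    return 10.0 * np.log10(p_signal / p_noise)

t = np.linspace(0, 1, 16000, endpoint=False)
speech_like = 0.5 * np.sin(2 * np.pi * 200 * t)       # stand-in for speech
hum = 0.05 * np.sin(2 * np.pi * 60 * t)               # stand-in for mains hum
print(f"SNR: {snr_db(speech_like, hum):.1f} dB")      # higher is better for recognition
```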
Overlapping speech
Current systems have difficulty separating simultaneous speech from multiple users. "If you try
to employ recognition technology in conversations or meetings where people frequently interrupt
each other or talk over one another, you're likely to get extremely poor results," says John
Garofolo.
Intensive use of computer power
Running the statistical models needed for speech recognition requires the computer's processor
to do a lot of heavy work. One reason for this is the need to remember each stage of the word-
recognition search in case the system needs to backtrack to come up with the right word. The
fastest personal computers in use today can still have difficulties with complicated commands or
phrases, slowing down the response time significantly. The vocabularies needed by the programs
also take up a large amount of hard drive space. Fortunately, disk storage and processor speed
are areas of rapid advancement -- the computers in use 10 years from now will benefit from an
exponential increase in both factors.
Homonyms
Homonyms are two words that are spelled differently and have different meanings but sound the
same. "There" and "their," "air" and "heir," "be" and "bee" are all examples. There is no way for
a speech recognition program to tell the difference between these words based on sound alone.
However, extensive training of systems and statistical models that take into account word context
have greatly improved their performance.
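A toy illustration of how word context can separate homonyms is given below: each candidate spelling is scored with a tiny bigram table and the more probable one is kept. The counts are invented; real systems estimate such statistics from very large text corpora.

```python
# Toy homonym disambiguation using invented bigram counts as the "context model".
BIGRAM_COUNTS = {
    ("over", "there"): 120, ("over", "their"): 2,
    ("lost", "their"): 90,  ("lost", "there"): 1,
}

def pick_homonym(previous_word, candidates):
    return max(candidates,
               key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))

print(pick_homonym("over", ["there", "their"]))   # -> "there"
print(pick_homonym("lost", ["there", "their"]))   # -> "their"
```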
We'll look at the future of speech recognition programs next.
