
Speech recognition

From Wikipedia, the free encyclopedia


Speech recognition (in many contexts also known as automatic speech recognition,
computer speech recognition or erroneously as voice recognition) is the process of
converting a speech signal to a sequence of words, by means of an algorithm
implemented as a computer program.

Speech recognition applications that have emerged over the last few years include voice
dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"),
simple data entry (e.g., entering a credit card number), preparation of structured
documents (e.g., a radiology report), domotic appliance control and content-based
spoken audio search (e.g., finding a podcast where particular words were spoken).

Voice recognition or speaker recognition is a related process that attempts to identify the
person speaking, as opposed to what is being said.

Contents

• 1 Speech recognition technology
• 2 Performance of speech recognition systems
  o 2.1 Hidden Markov model (HMM)-based speech recognition
  o 2.2 Neural network-based speech recognition
  o 2.3 Dynamic time warping (DTW)-based speech recognition
• 3 Speech recognition patents and patent disputes
• 4 For further information
• 5 Applications of speech recognition
• 6 Microphone
• 7 See also
• 8 References
• 9 Books
• 10 External links

Speech recognition technology

Most technical textbooks today emphasize the hidden Markov model as the underlying
technology. The dynamic programming approach, the neural network-based approach and the
knowledge-based learning approach were studied intensively in the 1980s and 1990s.

Performance of speech recognition systems

The performance of speech recognition systems is usually specified in terms of
accuracy and speed. Accuracy is measured with the word error rate, whereas speed is
measured with the real-time factor.
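
As a rough illustration of the speed metric (not taken from this article), the real-time
factor simply divides processing time by audio duration; the helper name below is
hypothetical:

def real_time_factor(processing_seconds, audio_seconds):
    # Values below 1.0 mean the recognizer runs faster than real time.
    return processing_seconds / audio_seconds

# 60 seconds of audio decoded in 30 seconds gives a real-time factor of 0.5.
print(real_time_factor(30.0, 60.0))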

Most speech recognition users would agree that dictation machines can achieve very
high performance under controlled conditions. Much of the confusion about reported
accuracy comes from the mixed usage of the terms "speech recognition" and "dictation".

Speaker-dependent dictation systems requiring a short period of training can capture
continuous speech with a large vocabulary at a normal pace with very high accuracy.
Most commercial companies claim that recognition software can achieve between 98% and
99% accuracy (getting one to two words out of one hundred wrong) if operated under
optimal conditions. These optimal conditions usually mean that the test subjects have 1)
speaker characteristics that match the training data, 2) proper speaker adaptation, and
3) a clean environment (e.g., office space). (This explains why some users, especially
those whose speech is heavily accented, might actually perceive the recognition rate to
be much lower than the expected 98% to 99%.)

Limited-vocabulary systems, requiring no training, can recognize a small number of
words (for instance, the ten digits) as spoken by most speakers. Such systems are popular
for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important parts of modern
statistical speech recognition. This entry focuses on the hidden Markov model (HMM)
because it is very widely used in many systems. (Language modeling has many other
applications, such as smart keyboards and document classification; see the
corresponding entries.)

Carnegie Mellon University has made good progress in increasing the speed of speech
recognition chips by using ASICs (application-specific integrated circuits) and
reconfigurable chips called FPGAs (field-programmable gate arrays). [1]

Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden
Markov models (HMMs). These are statistical models which output a sequence of symbols
or quantities. One possible reason why HMMs are used in speech recognition is that a
speech signal can be viewed as a piecewise stationary signal or a short-time stationary
signal. That is, one could assume that over a short time scale, on the order of 10
milliseconds, speech can be approximated as a stationary process. Speech can thus be
thought of as a Markov model over many stochastic processes (known as states).
Another reason why HMMs are popular is because they can be trained automatically and
are simple and computationally feasible to use. In speech recognition, to give the very
simplest setup possible, the hidden Markov model would output a sequence of n-
dimensional real-valued vectors with n around, say, 13, outputting one of these every 10
milliseconds. The vectors, again in the very simplest case, would consist of cepstral
coefficients, which are obtained by taking a Fourier transform of a short-time window of
speech and decorrelating the spectrum using a cosine transform, then taking the first
(most significant) coefficients. The hidden Markov model will tend to have, in each state,
a statistical distribution called a mixture of diagonal covariance Gaussians which will
give a likelihood for each observed vector. Each word, or (for more general speech
recognition systems), each phoneme, will have a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individual trained hidden Markov models for the separate words and phonemes.
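
As a rough sketch of the front end just described (the frame length, Hamming window and
single Gaussian below are illustrative assumptions, not any particular system; real
front ends add mel filterbanks and mixtures of many Gaussians per state), the example
computes simple cepstral coefficients and scores them with one diagonal-covariance
Gaussian:

import numpy as np
from scipy.fft import dct

def cepstral_features(frame, n_coeffs=13):
    # Fourier transform of a windowed short-time frame, then a cosine transform
    # of the log spectrum to decorrelate it; keep the first (most significant)
    # coefficients.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    return dct(np.log(spectrum + 1e-10), norm="ortho")[:n_coeffs]

def diag_gaussian_log_likelihood(x, mean, var):
    # Log-likelihood of one observation vector under a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

frame = np.random.randn(400)                 # fake 25 ms frame at 16 kHz
obs = cepstral_features(frame)
print(diag_gaussian_log_likelihood(obs, np.zeros(13), np.ones(13)))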

Described above are the core elements of the most common, HMM-based approach to
speech recognition. Modern speech recognition systems use various combinations of a
number of standard techniques in order to improve results over the basic approach
described above. A typical large-vocabulary system would need context dependency for
the phones (so phones with different left and right context have different realizations as
HMM states); it would use cepstral normalization to normalize for different speaker and
recording conditions; for further speaker normalization it might use vocal tract length
normalization (VTLN) for male-female normalization and maximum likelihood linear
regression (MLLR) for more general speaker adaptation. The features would have so-
called delta and delta-delta coefficients to capture speech dynamics and in addition might
use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and
delta-delta coefficients and use splicing and an LDA-based projection followed perhaps
by heteroscedastic linear discriminant analysis or a global semitied covariance transform
(also known as maximum likelihood linear transform, or MLLT). Many systems use so-
called discriminative training techniques which dispense with a purely statistical
approach to HMM parameter estimation and instead optimize some classification-related
measure of the training data. Examples are maximum mutual information (MMI),
minimum classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a
new utterance and must compute the most likely source sentence) would probably use the
Viterbi algorithm to find the best path, and here there is a choice between dynamically
creating a combination hidden Markov model which includes both the acoustic and
language model information, or combining it statically beforehand (the finite state
transducer, or FST, approach).
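
A minimal sketch of Viterbi decoding over such a combined HMM, in the log domain
(illustrative only, with made-up toy numbers; real decoders add beam pruning and the
lexicon and language-model structure described above):

import numpy as np

def viterbi(log_init, log_trans, log_obs):
    # log_init: (S,) initial state log-probabilities
    # log_trans: (S, S) transition log-probabilities
    # log_obs: (T, S) per-frame observation log-likelihoods
    T, S = log_obs.shape
    delta = log_init + log_obs[0]                 # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # rows: previous state, cols: next state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # trace the best path backwards
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy example: two states, three frames.
print(viterbi(np.log([0.6, 0.4]),
              np.log([[0.7, 0.3], [0.4, 0.6]]),
              np.log([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])))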

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable
of solving much more complicated recognition tasks, but do not scale as well as HMMs
when it comes to large vocabularies. Rather than being used in general-purpose speech
recognition applications, they are typically applied where they excel: handling
low-quality, noisy data and achieving speaker independence. Such systems can achieve
greater accuracy than HMM-based systems, as long as there is sufficient training data
and the vocabulary is limited. A more general approach using neural networks is phoneme
recognition. This is an active field of research, and generally the results are better
than for HMMs. There are also NN-HMM hybrid systems that use the neural network part
for phoneme recognition and the hidden Markov model part for language modeling.
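
A minimal sketch of the neural network half of such a hybrid (an assumed, untrained toy
network, not a specific system): a small feed-forward net maps each acoustic feature
vector to phoneme posteriors, which an HMM decoder could then consume as observation
scores.

import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_HIDDEN, N_PHONEMES = 13, 32, 40    # illustrative sizes
W1 = rng.normal(size=(N_FEATURES, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(size=(N_HIDDEN, N_PHONEMES))
b2 = np.zeros(N_PHONEMES)

def phoneme_posteriors(feature_vector):
    # One hidden layer followed by a softmax over phoneme classes.
    hidden = np.tanh(feature_vector @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(phoneme_posteriors(rng.normal(size=N_FEATURES)).argmax())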

Dynamic time warping (DTW)-based speech recognition

Main article: Dynamic time warping

Dynamic time warping is an approach that was historically used for speech recognition
but has now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns would be
detected, even if in one video the person was walking slowly and if in another they were
walking more quickly, or even if there were accelerations and decelerations during the
course of one observation. DTW has been applied to video, audio, and graphics -- indeed,
any data which can be turned into a linear representation can be analyzed with DTW.

A well known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal
match between two given sequences (e.g. time series) with certain restrictions, i.e. the
sequences are "warped" non-linearly to match each other. This sequence alignment
method is often used in the context of hidden Markov models.
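
A compact, illustrative dynamic time warping sketch (one-dimensional features and an
absolute-difference cost, both chosen for brevity):

import numpy as np

def dtw_distance(x, y):
    # D[i, j] is the cheapest alignment cost of the first i frames of x
    # with the first j frames of y.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])

# The same "word" spoken at two speeds still aligns with zero cost.
slow = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
fast = np.array([1.0, 2.0, 3.0])
print(dtw_distance(slow, fast))   # 0.0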

Speech recognition patents and patent disputes

Microsoft and Alcatel-Lucent hold patents in speech recognition, and are in dispute as of
March 2, 2007.[2]

For further information

Popular speech recognition conferences held each year or two include ICASSP,
Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the
field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are
beginning to include papers on speech processing. Important journals include the IEEE
Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio,
Speech and Language Processing), Computer Speech and Language, and Speech
Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner
can be useful for acquiring basic knowledge, but may not be fully up to date (1993).
Another good source is "Statistical Methods for Speech Recognition" by Frederick
Jelinek, a more recent book (1998). A good insight into the techniques used in the best
modern systems can be gained by paying attention to government-sponsored competitions
such as those organised by DARPA (the largest speech recognition-related project
ongoing as of 2007 is the GALE project, which involves both speech recognition and
translation components).

In terms of freely available resources, the HTK book (and the accompanying HTK
toolkit) is one place to start to both learn about speech recognition and to start
experimenting. Another such resource is Carnegie Mellon University's SPHINX toolkit.

Applications of speech recognition

• Automatic translation
• Automotive speech recognition
• Dictation
• Hands-free computing: voice command recognition computer user interface
• Home automation
• Interactive voice response
• Medical transcription
• Mobile telephony
• Pronunciation evaluation in computer-aided language learning applications[1]
• Robotics

Microphone

The microphone type recommended for speech recognition is the array microphone.

See also

• Audio visual speech recognition
• Cockpit (aviation) (also termed Direct Voice Input)
• Keyword spotting
• List of speech recognition projects
• Microphone
• Speech Analytics
• Speaker identification
• Speech processing
• Speech synthesis
• Speech verification
• Text-to-speech (TTS)
• VoiceXML
• Acoustic Model
• Speech corpus
References

• Ron Cole et al., "Survey of the State of the Art in Human Language Technology" (1997).

1. ^ Dennis van der Heijden. "Computer Chips to Enhance Speech Recognition",
   Axistive.com, 2003-10-06.
2. ^ Roger Cheng and Carmen Fleetwood. "Judge dismisses Lucent patent suit against
   Microsoft", Wall Street Journal, 2007-03-02.

Books
• Multilingual Speech Processing, edited by Tanja Schultz and Katrin Kirchhoff, April
  2006. Researchers and developers in industry and academia with different backgrounds
  but a common interest in multilingual speech processing will find an excellent
  overview of research problems and solutions, detailed from theoretical and practical
  perspectives. Chapters: 1 Introduction; 2 Language Characteristics; 3 Linguistic Data
  Resources; 4 Multilingual Acoustic Modeling; 5 Multilingual Dictionaries; 6
  Multilingual Language Modeling; 7 Multilingual Speech Synthesis; 8 Automatic Language
  Identification; 9 Other Challenges.

External links

• NIST Speech Group
• How to install and configure speech recognition in Windows.
• Entropic/Cambridge Hidden Markov Model Toolkit
• Open CV library, especially the multi-stream speech and vision combination
programs
• LT-World: Portal to information and resources on the internet
• LDC – The Linguistic Data Consortium
• Evaluations and Language resources Distribution Agency
• OLAC – Open Language Archives Community
• BAS – Bavarian Archive for Speech Signals
• Think-A-Move – Speech and Tongue Control of Robots and Wheelchairs

Audio-visual speech recognition


From Wikipedia, the free encyclopedia

Audio-visual speech recognition (AVSR) is a technique that uses image processing
capabilities in lip reading to aid speech recognition systems in recognizing ambiguous
phones or in choosing among hypotheses with near-equal probabilities.

The lip reading and speech recognition systems each work separately, and their results
are then combined at the feature fusion stage.

External links

• IBM Research - Audio Visual Speech Technologies

1.2: Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Defining the Problem


Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be the final
results, as for applications such as command and control, data entry, and document
preparation. They can also serve as the input to further linguistic processing in order
to achieve speech understanding, a subject covered in section .

Speech recognition systems can be characterized by many parameters, some of the more
important of which are shown in Figure . An isolated-word speech recognition system
requires that the speaker pause briefly between words, whereas a continuous speech
recognition system does not. Spontaneous, or extemporaneously generated, speech
contains disfluencies, and is much more difficult to recognize than speech read from
script. Some systems require speaker enrollment---a user must provide samples of his or
her speech before using the system---whereas other systems are said to be speaker-independent,
in that no enrollment is necessary. Some of the other parameters depend on the specific
task. Recognition is generally more difficult when vocabularies are large or have many
similar-sounding words. When speech is produced in a sequence of words, language
models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the
permissible words following each word are given explicitly. More general language
models approximating natural language are specified in terms of a context-sensitive
grammar.
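
As a toy illustration of such a finite-state network (the vocabulary and transitions
below are invented for the example):

# Each word lists the words that may follow it explicitly; <s> and </s> mark
# the start and end of a sentence.
successors = {
    "<s>": ["call", "show"],
    "call": ["home", "office"],
    "show": ["flights"],
    "home": ["</s>"],
    "office": ["</s>"],
    "flights": ["</s>"],
}

def is_permissible(sentence):
    words = ["<s>"] + sentence + ["</s>"]
    return all(nxt in successors.get(cur, []) for cur, nxt in zip(words, words[1:]))

print(is_permissible(["call", "home"]))      # True
print(is_permissible(["call", "flights"]))   # False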

One popular measure of the difficulty of the task, combining the vocabulary size and the
language model, is perplexity, loosely defined as the geometric mean of the number of
words that can follow a word after the language model has been applied (see section for
a discussion of language modeling in general and perplexity in particular). Finally, there
are some external parameters that can affect speech recognition system performance,
including the characteristics of the environmental noise and the type and the placement of
the microphone.
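
A small worked example of that definition (the per-word probabilities are made up):
perplexity is the geometric mean of the inverse probabilities that the language model
assigns to each word of a test sequence.

import math

def perplexity(word_probs):
    # Geometric mean of the inverse per-word probabilities,
    # computed in the log domain for numerical stability.
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# If the model assigns every word probability 1/11, the perplexity is about 11.
print(perplexity([1 / 11] * 10))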
Table: Typical parameters used to characterize the capability of speech recognition
systems

Speech recognition is a difficult problem, largely because of the many sources of
variability associated with the signal. First, the acoustic realizations of phonemes, the
smallest sound units of which words are composed, are highly dependent on the context
in which they appear. These phonetic variabilities are exemplified by the acoustic
differences of the phoneme /t/ in two, true, and butter in American English. At word
boundaries, contextual variations can be quite dramatic---making gas shortage sound like
gash shortage in American English, and devo andare sound like devandare in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in
the position and characteristics of the transducer. Third, within-speaker variabilities can
result from changes in the speaker's physical and emotional state, speaking rate, or voice
quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size
and shape can contribute to across-speaker variabilities.

Figure shows the major components of a typical speech recognition system. The
digitized speech signal is first transformed into a set of useful measurements or features
at a fixed rate, typically once every 10--20 msec (see sections and 11.3 for signal
representation and digital signal processing, respectively). These measurements are then
used to search for the most likely word candidate, making use of constraints imposed by
the acoustic, lexical, and language models. Throughout this process, training data are
used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.


Speech recognition systems attempt to model the sources of variability described above
in several ways. At the level of signal representation, researchers have developed
representations that emphasize perceptually important speaker-independent features of
the signal, and de-emphasize speaker-dependent characteristics [Her90]. At the acoustic
phonetic level, speaker variability is typically modeled using statistical techniques
applied to large amounts of data. Speaker adaptation algorithms have also been
developed that adapt speaker-independent acoustic models to those of the current speaker
during system use (see section ). Effects of linguistic context at the acoustic phonetic
level are typically handled by training separate models for phonemes in different
contexts; this is called context dependent acoustic modeling.

Word-level variability can be handled by allowing alternate pronunciations of words in
representations known as pronunciation networks. Common alternate pronunciations of
words, as well as effects of dialect and accent, are handled by allowing search algorithms
to find alternate paths of phonemes through these networks. Statistical language models,
based on estimates of the frequency of occurrence of word sequences, are often used to
guide the search through the most probable sequence of words.

The dominant recognition paradigm of the past fifteen years is known as hidden Markov
models (HMM). An HMM is a doubly stochastic model, in which the generation of the
underlying phoneme string and the frame-by-frame, surface acoustic realizations are both
represented probabilistically as Markov processes, as discussed in sections and 11.2.
Neural networks have also been used to estimate the frame-based scores; these scores are
then integrated into HMM-based system architectures, in what has come to be known as
hybrid systems, as described in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are
identified during the search process, rather than being identified explicitly beforehand.
An alternate approach is to first identify speech segments, then classify the segments and
use the segment scores to recognize words. This approach has produced competitive
recognition performance in several tasks [ZGPS90,FBC95].

1.2.2 State of the Art


Comments about the state-of-the-art need to be made in the context of specific
applications which reflect the constraints on the task. Moreover, different technologies
are sometimes appropriate for different tasks. For example, when the vocabulary is small,
the entire word can be modeled as a single unit. Such an approach is not practical for
large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error
rate, E, defined as:

E = (S + I + D) / N × 100%

where N is the total number of words in the test set, and S, I, and D are the total number
of substitutions, insertions, and deletions, respectively.
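
A short, illustrative way to obtain S + I + D in practice is a minimum edit-distance
alignment of the hypothesis against the reference transcript; the sketch below (with
hypothetical transcripts) returns E directly:

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion and one insertion over five reference words: E = 0.4.
print(word_error_rate("she had your dark suit", "she had dark suit in"))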

The past decade has witnessed significant progress in speech recognition technology.
Word error rates continue to drop by a factor of 2 every two years. Substantial progress
has been made in the basic technology, leading to the lowering of barriers to speaker
independence, continuous speech, and large vocabularies. There are several factors that
have contributed to this rapid progress. First, there is the coming of age of the HMM.
HMM is powerful in that, with the availability of training data, the parameters of the
model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system
development, training, and testing. Some of these corpora are designed for acoustic
phonetic research, while others are highly task specific. Nowadays, it is not uncommon to
have tens of thousands of sentences available for system training and testing. These
corpora permit researchers to quantify the acoustic cues important for phonetic contrasts
and to determine parameters of the recognizers in a statistically meaningful way. While
many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were
originally collected under the sponsorship of the U.S. Defense Advanced Research
Projects Agency (ARPA) to spur human language technology development among its
contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada,
France, Germany, Japan, and the U.K.) as standards on which to evaluate speech
recognition.

Third, progress has been brought about by the establishment of standards for performance
evaluation. Only a decade ago, researchers trained and tested their systems using locally
collected data, and had not been very careful in delineating training and testing sets. As a
result, it was very difficult to compare performance across systems, and a system's
performance typically degraded when it was presented with previously unseen data. The
recent availability of a large body of data in the public domain, coupled with the
specification of evaluation standards, has resulted in uniform documentation of test
results, thus contributing to greater reliability in monitoring progress (corpus
development activities and evaluation methodologies are summarized in chapters 12 and
13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress.
The availability of fast computers with inexpensive mass storage capabilities has enabled
researchers to run many large scale experiments in a short amount of time. This means
that the elapsed time between an idea and its implementation and evaluation is greatly
reduced. In fact, speech recognition systems with reasonable performance can now run in
real time using high-end workstations without additional hardware---a feat unimaginable
only a few years ago.

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11)
is the recognition of digits. For American English, speaker-independent recognition of
digit strings spoken continuously and restricted to telephone bandwidth can achieve an
error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource
Management (RM) task, in which inquiries can be made concerning various naval vessels
in the Pacific ocean. The best speaker-independent performance on the RM task is less
than 4%, using a word-pair language model that constrains the possible words following
a given word (PP=60). More recently, researchers have begun to address the issue of
recognizing spontaneously generated speech. For example, in the Air Travel Information
Service (ATIS) domain, word error rates of less than 3% have been reported for a
vocabulary of nearly 2,000 words and a bigram language model with a perplexity of
around 15.

High-perplexity tasks with a vocabulary of thousands of words are intended primarily for
the dictation application. After working on isolated-word, speaker-dependent systems for
many years, the community has since 1992 moved towards very-large-vocabulary
(20,000 words and more), high-perplexity, speaker-independent, continuous speech
recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences
drawn from North American business news [PFF 94].

With the steady improvements in speech recognition performance, systems are now being
deployed within telephone and cellular networks in many countries. Within the next few
years, speech recognition will be pervasive in telephone networks around the world.
There are tremendous forces driving the development of the technology; in many
countries, touch tone penetration is low, and voice is the only option for controlling
automated services. In voice dialing, for example, users can dial 10--20 telephone
numbers by voice (e.g., call home) after having enrolled their voices by saying the words
associated with telephone numbers. AT&T, on the other hand, has installed a call routing
system using speaker-independent word-spotting technology that can detect a few key
phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to
my calling card.

At present, several very large vocabulary dictation systems are available for document
generation. These systems generally require speakers to pause between words. Their
performance can be further enhanced if one can apply constraints of the specific domain
such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing
conversational speech. Word recognition rates on telephone conversations in the
Switchboard corpus are around 50% [CGF94]. It will be many years before unlimited
vocabulary, speaker-independent continuous dictation capability is realized.

1.2.3 Future Directions


In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key
research challenges in the area of human language technology, and the infrastructure
needed to support the work. The key research challenges are summarized in [CH 92].
The following research areas for speech recognition were identified:

Robustness:
In a robust system, performance degrades gracefully (rather than catastrophically) as
conditions become more different from those under which it was trained. Differences in
channel characteristics and acoustic environment should receive particular attention.

Portability:
Portability refers to the goal of rapidly designing, developing and deploying systems for
new applications. At present, systems tend to suffer significant degradation when moved
to a new task. In order to return to peak performance, they must be trained on examples
specific to the new task, which is time consuming and expensive.

Adaptation:
How can systems continuously adapt to changing conditions (new speakers, microphone,
task, etc) and improve through use? Such adaptation can occur at many levels in systems,
subword models, word pronunciations, language models, etc.

Language Modeling:
Current systems use statistical language models to help reduce the search space and
resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to
create more habitable systems, it will be increasingly important to get as much constraint
as possible from language models; perhaps incorporating syntactic and semantic
constraints that cannot be captured by purely statistical models.

Confidence Measures:
Most speech recognition systems assign scores to hypotheses for the purpose of rank
ordering them. These scores do not provide a good indication of whether a hypothesis is
correct or not, just that it is better than the other hypotheses. As we move to tasks that
require actions, we need better methods to evaluate the absolute correctness of
hypotheses.

Out-of-Vocabulary Words:
Systems are designed for use with a particular set of words, but system users may not
know exactly which words are in the system vocabulary. This leads to a certain
percentage of out-of-vocabulary words in natural conditions. Systems must have some
method of detecting such out-of-vocabulary words, or they will map the unknown word onto
a word from the vocabulary, causing an error.

Spontaneous Speech:
Systems that are deployed for real use must deal with a variety of spontaneous speech
phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions
and other common behaviors not found in read speech. Development on the ATIS task
has resulted in progress in this area, but much work remains to be done.

Prosody:

Prosody refers to acoustic structure that extends over several segments or words. Stress,
intonation, and rhythm convey important information for word recognition and the user's
intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How
to integrate prosodic information into the recognition architecture is a critical question
that has not yet been answered.

Modeling Dynamics:
Systems assume a sequence of input frames which are treated as if they were
independent. But it is known that perceptual cues for words and phonemes require the
integration of features that reflect the movements of the articulators, which are dynamic
in nature. How to model dynamics and incorporate this information into recognition
systems is an unsolved problem.

Voice recognition

Voice or speech recognition is the ability of a machine or program to receive and
interpret dictation, or to understand and carry out spoken commands.

For use with computers, analog audio must be converted into digital signals. This
requires analog-to-digital conversion. For a computer to decipher the signal, it must have
a digital database, or vocabulary, of words or syllables, and a speedy means of comparing
this data with signals. The speech patterns are stored on the hard drive and loaded into
memory when the program is run. A comparator checks these stored patterns against the
output of the A/D converter.
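
A toy sketch of that comparator idea (the word templates, feature values, and Euclidean
distance below are invented for illustration, not taken from any actual product):

import numpy as np

# Stored feature templates for each vocabulary word; a real system would keep
# many templates per word, derived from training recordings.
templates = {
    "yes": np.array([0.2, 0.9, 0.1]),
    "no":  np.array([0.8, 0.1, 0.3]),
}

def recognize(input_features):
    # Compare the digitized input's features against every stored pattern
    # and return the closest vocabulary word.
    return min(templates, key=lambda w: np.linalg.norm(templates[w] - input_features))

print(recognize(np.array([0.25, 0.85, 0.15])))   # "yes"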

In practice, the size of a voice-recognition program's effective vocabulary is directly
related to the random access memory capacity of the computer in which it is installed. A
voice-recognition program runs many times faster if the entire vocabulary can be loaded
into RAM, as compared with searching the hard drive for some of the matches.
Processing speed is critical as well, because it affects how fast the computer can search
the RAM for matches.

All voice-recognition systems or programs make errors. Screaming children, barking
dogs, and loud external conversations can produce false input. Much of this can be
avoided only by using the system in a quiet room. There is also a problem with words
that sound alike but are spelled differently and have different meanings -- for example,
"hear" and "here." This problem might someday be largely overcome using stored
contextual information. However, this will require more RAM and faster processors than
are currently available in personal computers.

Though a number of voice recognition systems are available on the market, the industry
leaders are IBM and Dragon Systems.

LAST UPDATED: 05 Mar 2007


QUESTION POSED ON: 07 October 2002
There has been some consideration for using voice recognition with contact centers
to deflect queuing calls. Recently there have been some nice implementations from
both Nuance and Speechworks.

How do you see this market segment developing, and especially how would you advise
someone interested in this technology to ensure they leverage existing investments in
either outbound scripts (Siebel Smartscripts) or knowledge bases (Primus and
eGain)?

EXPERT RESPONSE
I believe that the successful use of speech recognition in contact centers hinges on two
critical factors:

1. Humanization
2. Application

Let's take these two factors and explore them further.

1. Humanization

People do not like talking with a computer. Most interactions involving speech
recognition use either text-to-speech or cold, robotic-sounding prompts to interact with
the customer. Neither of these works toward building a relationship with the customer. I
know it sounds odd to think about a computer building a relationship with a customer,
but that is at the heart of real communication.

If the computer sounds like a person and responds as a person would, then your ability to
engage a customer and keep them engaged for an automated session increases
significantly. As an example, compare the following:

"Please state your full name" (stated in a monotone)

"Would you please say your first and last name" (stated with full dynamics)

Clearly the second interaction would be preferred. Achieving this is the first success
factor.

2. Application

Certain types of applications lend themselves well toward an automated interaction with a
customer. A good example would be calling in a prescription refill to a pharmacy or
checking to see when an order shipped and when it is expected to be delivered.

These types of applications don't require the skills of a highly trained agent but can be
very time consuming in terms of personnel cost. Imagine the value of reducing your
headcount of less skilled agents while not wasting the time of your highly trained and well
compensated agents.

Summary

It is these types of applications where the largest values can be gained. Don't try to replace
your entire agent population. That is not going to happen. Be realistic. Focus on
applications where the form of the transaction is fairly consistent.

Voice recognition is the field of computer science that deals with designing computer
systems that can recognize spoken words. Note that voice recognition implies only that
the computer can take dictation, not that it understands what is being said.
Comprehending human languages falls under a different field of computer science called
natural language processing.

A number of voice recognition systems are available on the market. The most powerful
can recognize thousands of words. However, they generally require an extended training
session during which the computer system becomes accustomed to a particular voice and
accent. Such systems are said to be speaker dependent.

Many systems also require that the speaker speak slowly and distinctly and separate each
word with a short pause. These systems are called discrete speech systems. Recently,
great strides have been made in continuous speech systems -- voice recognition systems
that allow you to speak naturally. There are now several continuous-speech systems
available for personal computers.

Because of their limitations and high cost, voice recognition systems have traditionally
been used only in a few specialized situations. For example, such systems are useful in
instances when the user is unable to use a keyboard to enter data because his or her hands
are occupied or disabled. Instead of typing commands, the user can simply speak into a
headset. Increasingly, however, as the cost decreases and performance improves, speech
recognition systems are entering the mainstream and are being used as an alternative to
keyboards.
