Speech recognition (in many contexts also known as automatic speech recognition,
computer speech recognition or erroneously as voice recognition) is the process of
converting a speech signal to a sequence of words, by means of an algorithm
implemented as a computer program.
Speech recognition applications that have emerged over the last few years include voice
dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"),
simple data entry (e.g., entering a credit card number), preparation of structured
documents (e.g., a radiology report), domotic (home automation) appliance control, and
content-based spoken audio search (e.g., finding a podcast where particular words were spoken).
Voice recognition or speaker recognition is a related process that attempts to identify the
person speaking, as opposed to what is being said.
Most speech recognition users would agree that dictation machines can achieve
very high performance in controlled conditions. Much of the confusion about reported
performance comes from the mixed usage of the terms "speech recognition" and "dictation".
Both acoustic modeling and language modeling are important parts of modern
statistical speech recognition. In this entry, we will focus on the hidden Markov model
(HMM), because it is very widely used in many systems. (Language modeling has
many other applications, such as smart keyboards and document classification; see the
corresponding entries.)
Carnegie Mellon University has made good progress in increasing the speed of
speech recognition chips by using ASICs (application-specific integrated circuits) and
reconfigurable chips called FPGAs (field-programmable gate arrays). [1]
Described above are the core elements of the most common, HMM-based approach to
speech recognition. Modern speech recognition systems use various combinations of a
number of standard techniques in order to improve results over the basic approach
described above. A typical large-vocabulary system would need context dependency for
the phones (so phones with different left and right context have different realizations as
HMM states); it would use cepstral normalization to normalize for different speaker and
recording conditions; for further speaker normalization it might use vocal tract length
normalization (VTLN) for male-female normalization and maximum likelihood linear
regression (MLLR) for more general speaker adaptation. The features would have so-
called delta and delta-delta coefficients to capture speech dynamics and in addition might
use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and
delta-delta coefficients and use splicing and an LDA-based projection followed perhaps
by heteroscedastic linear discriminant analysis or a global semitied covariance transform
(also known as maximum likelihood linear transform, or MLLT). Many systems use so-
called discriminative training techniques which dispense with a purely statistical
approach to HMM parameter estimation and instead optimize some classification-related
measure of the training data. Examples are maximum mutual information (MMI),
minimum classification error (MCE) and minimum phone error (MPE).
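As an illustration of the delta and delta-delta coefficients mentioned above, here is a minimal sketch in Python/NumPy of the standard regression formula for computing them from a matrix of cepstral features. The window half-width N=2 and the 13-coefficient input are typical choices assumed for the example, not values taken from this text.

```python
import numpy as np

def delta(feats, N=2):
    """Delta coefficients via the common regression formula:
    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    d = np.zeros_like(feats)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + len(feats)] - padded[N - n:N - n + len(feats)])
    return d / denom

# Hypothetical input: 13 cepstral coefficients for each of 100 frames.
cepstra = np.random.randn(100, 13)
d1 = delta(cepstra)                      # delta ("velocity") coefficients
d2 = delta(d1)                           # delta-delta ("acceleration") coefficients
features = np.hstack([cepstra, d1, d2])  # 39-dimensional feature vector per frame
```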
Decoding of the speech (the term for what happens when the system is presented with a
new utterance and must compute the most likely source sentence) would probably use the
Viterbi algorithm to find the best path, and here there is a choice between dynamically
creating a combination hidden Markov model which includes both the acoustic and
language model information, or combining it statically beforehand (the finite state
transducer, or FST, approach).
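A minimal sketch of Viterbi decoding over log-probabilities follows. The two-state toy model at the end is invented purely for illustration; a real recognizer would decode over a composed acoustic/language-model state graph as described above.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely HMM state path for a sequence of frames.
    log_init:  (S,) log initial-state probabilities.
    log_trans: (S, S) log transition probabilities, log_trans[i, j] = log P(j | i).
    log_emit:  (T, S) per-frame log emission scores for each state."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # score of every prev -> next move
        back[t] = np.argmax(cand, axis=0)        # best predecessor of each state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]               # best final state...
    for t in range(T - 1, 0, -1):                # ...then trace back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))

# Toy 2-state, 3-frame example (numbers are illustrative only).
path, logp = viterbi(np.log([0.6, 0.4]),
                     np.log([[0.7, 0.3], [0.4, 0.6]]),
                     np.log([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))
```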
Another approach in acoustic modeling is the use of neural networks. They are capable
of solving much more complicated recognition tasks, but do not scale as well as HMMs
when it comes to large vocabularies. Rather than being used in general-purpose speech
recognition applications, they are typically applied where there is low-quality, noisy data
or a need for speaker independence. Such systems can achieve greater accuracy than
HMM-based systems, as long as there is sufficient training data and the vocabulary is
limited. A more general approach using neural networks is phoneme recognition. This is
an active field of research, and the results are generally better than for HMMs. There are
also NN-HMM hybrid systems that use the neural network part for phoneme recognition
and the hidden Markov model part for language modeling.
Dynamic time warping is an approach that was historically used for speech recognition
but has now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns would be
detected, even if in one video the person was walking slowly and if in another they were
walking more quickly, or even if there were accelerations and decelerations during the
course of one observation. DTW has been applied to video, audio, and graphics -- indeed,
any data which can be turned into a linear representation can be analyzed with DTW.
A well known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal
match between two given sequences (e.g. time series) with certain restrictions, i.e. the
sequences are "warped" non-linearly to match each other. This sequence alignment
method is often used in the context of hidden Markov models.
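To make the idea concrete, here is a minimal DTW sketch in Python/NumPy using a Euclidean local distance. The sine-wave "utterances" at the end are stand-ins for feature sequences of the same word at different speaking speeds.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two feature sequences a (n, d) and b (m, d),
    using the classic match/insertion/deletion recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]

# The same "word" at two speaking rates: b is a time-stretched version of a.
a = np.sin(2 * np.pi * np.linspace(0, 1, 50))[:, None]
b = np.sin(2 * np.pi * np.linspace(0, 1, 80))[:, None]
print(dtw_distance(a, b))   # small, despite the different lengths
```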
In terms of freely available resources, the HTK book (and the accompanying HTK
toolkit) is one place to start to both learn about speech recognition and to start
experimenting. Another such resource is Carnegie Mellon University's SPHINX toolkit.
Microphone
The microphone type recommended for speech recognition is the array microphone.
Books
• Multilingual Speech Processing, edited by Tanja Schultz and Katrin Kirchhoff,
April 2006. Researchers and developers in industry and academia with different
backgrounds but a common interest in multilingual speech processing will find an
excellent overview of research problems and solutions, detailed from theoretical
and practical perspectives. Contents: Ch. 1: Introduction; Ch. 2: Language
Characteristics; Ch. 3: Linguistic Data Resources; Ch. 4: Multilingual Acoustic
Modeling; Ch. 5: Multilingual Dictionaries; Ch. 6: Multilingual Language Modeling;
Ch. 7: Multilingual Speech Synthesis; Ch. 8: Automatic Language Identification;
Ch. 9: Other Challenges.
In audio-visual speech recognition, the lip-reading and speech recognition systems work
separately, and their results are then combined at the feature-fusion stage.
Speech recognition systems can be characterized by many parameters, some of the more
important of which are shown in the table below. An isolated-word speech recognition system
requires that the speaker pause briefly between words, whereas a continuous speech
recognition system does not. Spontaneous, or extemporaneously generated, speech
contains disfluencies, and is much more difficult to recognize than speech read from
script. Some systems require speaker enrollment---a user must provide samples of his or
her speech before using them, whereas other systems are said to be speaker-independent,
in that no enrollment is necessary. Some of the other parameters depend on the specific
task. Recognition is generally more difficult when vocabularies are large or have many
similar-sounding words. When speech is produced in a sequence of words, language
models or artificial grammars are used to restrict the combination of words.
The simplest language model can be specified as a finite-state network, where the
permissible words following each word are given explicitly. More general language
models approximating natural language are specified in terms of a context-sensitive
grammar.
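A minimal sketch of such a finite-state word network, with the permissible successors of each word listed explicitly; the toy flight-inquiry vocabulary is invented for illustration.

```python
# Hypothetical word network: each word lists the words allowed to follow it.
network = {
    "<s>":     ["show", "list"],
    "show":    ["flights"],
    "list":    ["flights"],
    "flights": ["to", "from", "</s>"],
    "to":      ["boston", "denver"],
    "from":    ["boston", "denver"],
    "boston":  ["to", "from", "</s>"],
    "denver":  ["to", "from", "</s>"],
}

def is_permissible(sentence):
    """Accept a word string iff every word-to-word transition is in the network."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return all(nxt in network.get(cur, []) for cur, nxt in zip(words, words[1:]))

print(is_permissible("show flights to boston"))   # True
print(is_permissible("boston show flights"))      # False
```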
One popular measure of the difficulty of the task, combining the vocabulary size and the
language model, is perplexity, loosely defined as the geometric mean of the number of
words that can follow a word after the language model has been applied (the standard
definition is given after this paragraph; see the section on language modeling for a fuller
discussion of perplexity). Finally, there
are some external parameters that can affect speech recognition system performance,
including the characteristics of the environmental noise and the type and the placement of
the microphone.
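The loose description of perplexity above corresponds to the standard definition. For a test string of N words scored by the language model P:

```latex
\[
  PP \;=\; P(w_1 w_2 \ldots w_N)^{-1/N} \;=\; 2^{H},
  \qquad
  H \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log_2 P\bigl(w_i \mid w_1 \ldots w_{i-1}\bigr),
\]
```

so a model that is, on average, as uncertain as a uniform choice among 11 equally likely words has PP = 11, consistent with the digit-recognition figure quoted below.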
[Table: Typical parameters used to characterize the capability of speech recognition systems]
Speech recognition is a difficult problem, largely because of the many sources of
variability associated with the signal. First, the acoustic realizations of phonemes, the
smallest sound units of which words are composed, are highly dependent on the context
in which they appear. Second, acoustic variabilities can result from changes in the environment as well as in
the position and characteristics of the transducer. Third, within-speaker variabilities can
result from changes in the speaker's physical and emotional state, speaking rate, or voice
quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size
and shape can contribute to across-speaker variabilities.
A typical speech recognition system comprises several major components. The
digitized speech signal is first transformed into a set of useful measurements or features
at a fixed rate, typically once every 10--20 msec (see the sections on signal
representation and digital signal processing). These measurements are then
used to search for the most likely word candidate, making use of constraints imposed by
the acoustic, lexical, and language models. Throughout this process, training data are
used to determine the values of the model parameters.
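A minimal sketch of the fixed-rate front end just described: slicing the digitized signal into overlapping, windowed analysis frames. The 16 kHz sampling rate, 25 ms window, and 10 ms hop are typical values assumed for the example, not prescribed by this text.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a digitized speech signal into overlapping analysis frames,
    one every hop_ms (the fixed 10--20 msec rate mentioned above); each
    frame would then be reduced to a feature vector (e.g., cepstra)."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    window = np.hamming(win)                  # taper frame edges
    return np.stack([signal[i * hop:i * hop + win] * window
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))  # 1 second of (random) "audio"
print(frames.shape)                            # (98, 400): ~100 frames/sec
```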
The dominant recognition paradigm of the past fifteen years has been the hidden Markov
model (HMM). An HMM is a doubly stochastic model, in which the generation of the
underlying phoneme string and the frame-by-frame, surface acoustic realizations are both
represented probabilistically as Markov processes, as discussed in section 11.2.
Neural networks have also been used to estimate the frame based scores; these scores are
then integrated into HMM-based system architectures, in what has come to be known as
hybrid systems, as described in section 11.5.
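In such hybrid systems, the network's per-frame state posteriors are commonly converted into "scaled likelihoods" by dividing by the state priors before being used as HMM emission scores. A minimal sketch of that conversion, with toy numbers invented for illustration:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert per-frame state posteriors P(s | x) from a neural network
    into scaled likelihoods P(x | s) proportional to P(s | x) / P(s),
    usable as HMM emission scores (e.g., by a Viterbi decoder like the
    sketch earlier). log_posteriors: (T, S); log_priors: (S,), estimated
    from the training alignment."""
    return log_posteriors - log_priors

post = np.log([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])   # 3 frames, 2 states
priors = np.log([0.7, 0.3])
emission_scores = scaled_log_likelihoods(post, priors)
```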
The past decade has witnessed significant progress in speech recognition technology.
Word error rates continue to drop by a factor of 2 every two years. Substantial progress
has been made in the basic technology, leading to the lowering of barriers to speaker
independence, continuous speech, and large vocabularies. There are several factors that
have contributed to this rapid progress. First, there is the coming of age of the HMM.
The HMM is powerful in that, with the availability of training data, the parameters of the
model can be trained automatically to give optimal performance.
Second, much effort has gone into the development of large speech corpora for system
development, training, and testing. Some of these corpora are designed for acoustic
phonetic research, while others are highly task specific. Nowadays, it is not uncommon to
have tens of thousands of sentences available for system training and testing. These
corpora permit researchers to quantify the acoustic cues important for phonetic contrasts
and to determine parameters of the recognizers in a statistically meaningful way. While
many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were
originally collected under the sponsorship of the U.S. Defense Advanced Research
Projects Agency (ARPA) to spur human language technology development among its
contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada,
France, Germany, Japan, and the U.K.) as standards on which to evaluate speech
recognition.
Third, progress has been brought about by the establishment of standards for performance
evaluation. Only a decade ago, researchers trained and tested their systems using locally
collected data, and had not been very careful in delineating training and testing sets. As a
result, it was very difficult to compare performance across systems, and a system's
performance typically degraded when it was presented with previously unseen data. The
recent availability of a large body of data in the public domain, coupled with the
specification of evaluation standards, has resulted in uniform documentation of test
results, thus contributing to greater reliability in monitoring progress (corpus
development activities and evaluation methodologies are summarized in chapters 12 and
13 respectively).
Finally, advances in computer technology have also indirectly influenced our progress.
The availability of fast computers with inexpensive mass storage capabilities has enabled
researchers to run many large scale experiments in a short amount of time. This means
that the elapsed time between an idea and its implementation and evaluation is greatly
reduced. In fact, speech recognition systems with reasonable performance can now run in
real time using high-end workstations without additional hardware---a feat unimaginable
only a few years ago.
One of the most popular, and potentially most useful tasks with low perplexity (PP=11)
is the recognition of digits. For American English, speaker-independent recognition of
digit strings spoken continuously and restricted to telephone bandwidth can achieve an
error rate of 0.3% when the string length is known.
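For reference, the word error rates quoted throughout are computed with the standard edit-distance dynamic program; a minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / number of reference
    words, found by the standard edit-distance dynamic program."""
    r, h = ref.split(), hyp.split()
    D = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        D[i][0] = i                       # delete all of r[:i]
    for j in range(len(h) + 1):
        D[0][j] = j                       # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[len(r)][len(h)] / len(r)

print(word_error_rate("three five seven", "three nine seven"))   # 0.33...
```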
One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource
Management (RM) task, in which inquiries can be made concerning various naval vessels
in the Pacific Ocean. The best speaker-independent performance on the RM task is a
word error rate of less than 4%, using a word-pair language model that constrains the
possible words following a given word (PP=60). More recently, researchers have begun to address the issue of
recognizing spontaneously generated speech. For example, in the Air Travel Information
Service (ATIS) domain, word error rates of less than 3% have been reported for a
vocabulary of nearly 2,000 words and a bigram language model with a perplexity of
around 15.
High-perplexity tasks with a vocabulary of thousands of words are intended primarily for
the dictation application. After working on isolated-word, speaker-dependent systems for
many years, the community has since 1992 moved towards very-large-vocabulary
(20,000 words and more), high-perplexity, speaker-independent, continuous speech
recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences
drawn from North American business news [PFF 94].
With the steady improvements in speech recognition performance, systems are now being
deployed within telephone and cellular networks in many countries. Within the next few
years, speech recognition will be pervasive in telephone networks around the world.
There are tremendous forces driving the development of the technology; in many
countries, touch tone penetration is low, and voice is the only option for controlling
automated services. In voice dialing, for example, users can dial 10--20 telephone
numbers by voice (e.g., call home) after having enrolled their voices by saying the words
associated with telephone numbers. AT&T, on the other hand, has installed a call routing
system using speaker-independent word-spotting technology that can detect a few key
phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to
my calling card.
At present, several very large vocabulary dictation systems are available for document
generation. These systems generally require speakers to pause between words. Their
performance can be further enhanced if one can apply constraints of the specific domain
such as dictating medical reports.
Even though much progress is being made, machines are a long way from recognizing
conversational speech. Word recognition rates on telephone conversations in the
Switchboard corpus are around 50% [CGF94]. It will be many years before unlimited
vocabulary, speaker-independent continuous dictation capability is realized.
Robustness:
In a robust system, performance degrades gracefully (rather than catastrophically) as
conditions become more different from those under which it was trained. Differences in
channel characteristics and acoustic environment should receive particular attention.
Portability:
Portability refers to the goal of rapidly designing, developing and deploying systems for
new applications. At present, systems tend to suffer significant degradation when moved
to a new task. In order to return to peak performance, they must be trained on examples
specific to the new task, which is time consuming and expensive.
Adaptation:
How can systems continuously adapt to changing conditions (new speakers, microphone,
task, etc.) and improve through use? Such adaptation can occur at many levels in systems:
subword models, word pronunciations, language models, etc.
Language Modeling:
Current systems use statistical language models to help reduce the search space and
resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to
create more habitable systems, it will be increasingly important to get as much constraint
as possible from language models; perhaps incorporating syntactic and semantic
constraints that cannot be captured by purely statistical models.
Confidence Measures:
Most speech recognition systems assign scores to hypotheses for the purpose of rank
ordering them. These scores do not provide a good indication of whether a hypothesis is
correct or not, just that it is better than the other hypotheses. As we move to tasks that
require actions, we need better methods to evaluate the absolute correctness of
hypotheses.
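One common heuristic (an assumption here, not something this text prescribes) is to normalize the scores of an N-best list into pseudo-posteriors and use the winner's share as a crude confidence estimate:

```python
import numpy as np

def nbest_confidence(log_scores):
    """Normalize N-best hypothesis log scores into pseudo-posteriors
    (a softmax); the top hypothesis' share indicates, in absolute terms,
    how much better it is than its competitors."""
    s = np.exp(log_scores - np.max(log_scores))   # subtract max for stability
    return s / s.sum()

posteriors = nbest_confidence(np.array([-10.0, -12.0, -15.0]))
print(posteriors[0])   # ~0.87: confidence assigned to the top hypothesis
```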
Out-of-Vocabulary Words:
Systems are designed for use with a particular set of words, but system users may not
know exactly which words are in the system vocabulary. This leads to a certain
percentage of out-of-vocabulary words in natural conditions. Systems must have some
method of detecting such out-of-vocabulary words, or they will end up mapping a word
from the vocabulary onto the unknown word, causing an error.
Spontaneous Speech:
Systems that are deployed for real use must deal with a variety of spontaneous speech
phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions
and other common behaviors not found in read speech. Development on the ATIS task
has resulted in progress in this area, but much work remains to be done.
Prosody:
Prosody refers to acoustic structure that extends over several segments or words. Stress,
intonation, and rhythm convey important information for word recognition and the user's
intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How
to integrate prosodic information into the recognition architecture is a critical question
that has not yet been answered.
Modeling Dynamics:
Systems assume a sequence of input frames which are treated as if they were
independent. But it is known that perceptual cues for words and phonemes require the
integration of features that reflect the movements of the articulators, which are dynamic
in nature. How to model dynamics and incorporate this information into recognition
systems is an unsolved problem.
Voice recognition
For use with computers, analog audio must be converted into digital signals. This
requires analog-to-digital conversion. For a computer to decipher the signal, it must have
a digital database, or vocabulary, of words or syllables, and a speedy means of comparing
this data with signals. The speech patterns are stored on the hard drive and loaded into
memory when the program is run. A comparator checks these stored patterns against the
output of the A/D converter.
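A minimal sketch of the comparator idea just described: matching the converter's output pattern against stored vocabulary patterns by distance. A real system would first align patterns of different lengths (e.g., with DTW, as sketched earlier); the shapes and words here are invented for illustration.

```python
import numpy as np

def recognize(pattern, vocabulary):
    """Return the word whose stored pattern is nearest to the input pattern.
    pattern: (T, d) feature sequence from the A/D front end.
    vocabulary: dict mapping words to stored (T, d) reference patterns."""
    return min(vocabulary, key=lambda w: np.linalg.norm(pattern - vocabulary[w]))

vocab = {"yes": np.random.randn(20, 13), "no": np.random.randn(20, 13)}
noisy = vocab["yes"] + 0.01 * np.random.randn(20, 13)  # slightly perturbed input
print(recognize(noisy, vocab))                         # "yes"
```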
Though a number of voice recognition systems are available on the market, the industry
leaders are IBM and Dragon Systems.
How do you see this market segment developing, and how would you advise
someone interested in this technology to ensure they leverage existing investments in
either outbound scripts (Siebel Smartscripts) or knowledge bases (Primus and
eGain)?
EXPERT RESPONSE
I believe that the successful use of speech recognition in contact centers hinges on two
critical factors:
1. Humanization
2. Application
1. Humanization

People do not like talking with a computer. Most interactions involving
speech recognition use either text-to-speech or cold, robotic-sounding prompts to
interact with the customer. Neither of these works toward building a relationship with the
customer. I know it sounds odd to think about a computer building a relationship with a
customer, but that is at the heart of real communication.
If the computer sounds like a person and responds as a person would, then your ability to
engage a customer and keep them engaged for an automated session increases
significantly.
2. Application
Certain types of applications lend themselves well toward an automated interaction with a
customer. A good example would be calling in a prescription refill to a pharmacy or
checking to see when an order shipped and when it is expected to be delivered.
These types of applications don't require the skills of a highly trained agent but can be
very time consuming in terms of personnel cost. Imagine the value of reducing your
headcount of less skilled agents while not wasting the time of your highly trained and well
compensated agents.
Summary
It is in these types of applications that the largest gains can be made. Don't try to replace
your entire agent population. That is not going to happen. Be realistic. Focus on
applications where the form of the transaction is fairly consistent.
Voice recognition is the field of computer science that deals with designing computer
systems that can recognize spoken words. Note that voice recognition implies only that the computer can
take dictation, not that it understands what is being said. Comprehending human
languages falls under a different field of computer science called natural language
processing.
A number of voice recognition systems are available on the market. The most powerful
can recognize thousands of words. However, they generally require an extended training
session during which the computer system becomes accustomed to a particular voice and
accent. Such systems are said to be speaker dependent.
Many systems also require that the speaker speak slowly and distinctly and separate each
word with a short pause. These systems are called discrete speech systems. Recently,
great strides have been made in continuous speech systems -- voice recognition systems
that allow you to speak naturally. There are now several continuous-speech systems
available for personal computers.
Because of their limitations and high cost, voice recognition systems have traditionally
been used only in a few specialized situations. For example, such systems are useful in
instances when the user is unable to use a keyboard to enter data because his or her hands
are occupied or disabled. Instead of typing commands, the user can simply speak into a
headset. Increasingly, however, as the cost decreases and performance improves, speech
recognition systems are entering the mainstream and are being used as an alternative to
keyboards.