Charles Corfield has been in the technology sector for the last 25 years in both operational and investment roles. He has served at organizations ranging from start-ups to public companies, including stints on audit and compensation committees. His interest in voice automation to speed desktop workflows grew out of his experiences with his portfolio companies BeVocal (now Nuance) and iBasis (now KPN). He served as cofounder and chief technology officer of Frame Technology, which was acquired by Adobe. He is also an early investor in Silver Lake Partners.
How voice-to-text really works

After you have used computer-based speech recognition for a while, it is natural to wonder, "How does it work?" and, when it does something unexpected, to ask, "Where did that come from?" Let's take some of the mystery out of speech recognition by stepping through the process by which a computer turns your voice into text. There are four steps in the conversion of voice to text:

1. Audio to phonemes
2. Phonemes to words
3. Words to phrases
4. Raw transcribed text to formatted text

The first step turns a continuous audio stream into the basic sounds of English, which are called phonemes. US English has forty different sounds from which all the native words are built (see "US English Phonemes"). I say "native," because English has adopted words from other languages, which may contain non-native sounds. For example, the "ch" in "chutzpah" is a phoneme well known to New Yorkers, but it is not native. Similarly, the umlauted vowels ü and ö are common phonemes in German, but do not exist in US English.

In order to identify the sequence of phonemes in continuous speech, the computer divides the incoming audio into short time-slices called frames. For each frame, the computer measures the strength of pre-defined frequency bands within the overall range of speech (approximately 60 Hz to 7 kHz; see "Human Voice"). Thus, each frame is converted into a set of numbers (one number per frequency band). The recognition engine uses a reference table to find the best-matched phoneme for a given frame. This table contains representative frequency-band strengths for each of US English's forty phonemes. This is why you are asked to record a profile when using SayIt for the first time: the engine needs to know how you pronounce each of the forty phonemes, so as to get the best possible entries in the look-up table.
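To make the table look-up concrete, here is a minimal sketch in Python. The phoneme labels, the four-band layout, and the reference values are all invented for illustration; a real engine uses many more bands and far richer acoustic models.

```python
# Toy illustration of step 1: matching audio frames to phonemes.
# The phonemes, four-band layout, and reference values are invented;
# a real engine uses many more bands and richer acoustic models.
import math

# Representative frequency-band strengths for each phoneme (the look-up
# table that recording a profile personalizes to your voice).
REFERENCE_TABLE = {
    "k":  [0.1, 0.7, 0.9, 0.2],
    "ae": [0.8, 0.6, 0.3, 0.1],
    "t":  [0.1, 0.5, 0.8, 0.4],
}

def best_phoneme(frame):
    """Return the phoneme whose reference bands are closest to this frame."""
    def distance(bands):
        return math.sqrt(sum((f - b) ** 2 for f, b in zip(frame, bands)))
    return min(REFERENCE_TABLE, key=lambda p: distance(REFERENCE_TABLE[p]))

# Three incoming frames (one set of band strengths per time-slice):
frames = [[0.12, 0.68, 0.88, 0.20],
          [0.79, 0.62, 0.30, 0.12],
          [0.12, 0.50, 0.79, 0.41]]
print([best_phoneme(f) for f in frames])  # ['k', 'ae', 't']
```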
Note that the audio for one spoken phoneme may span multiple frames; thus, when we replace frames with their corresponding phonemes, we will end up with duplicates. The engine cleans up the sequence of converted phonemes by eliminating the duplicates.

The second step is to translate phonemes into words. The recognizer uses a lexicon, which contains all the words it knows about together with their pronunciations; each pronunciation is described using phonemes. For example, the pronunciation of the word "cat" has three phonemes: k, ae, t (see "US English Phonemes"). Some words have multiple pronunciations, and the lexicon has these too; for example, "either" has two common pronunciations, "ay dh ax" (eye-th-er) and "iy dh ax" (ee-th-er). Note that US English and UK English have similar lexicons, where the few differences consist of variations in pronunciation and spelling (e.g., "neighbor" vs. "neighbour").

As the recognizer moves along the sequence of phonemes, it looks for words hidden in the sequence. This is similar (in spirit) to the newspaper puzzle in which you must find words hidden within a grid of scrambled letters. The puzzle makes your life a little harder by allowing different directions (up, down, left, right, etc.), whereas the recognizer proceeds in one direction, left to right, but it is allowed to create overlapping sequences of words. Why? The answer is that different words and phrases may share the same pronunciations. A simple example is "there" and "their"; a more complex example comes from the campfire song "life is butter melon cauliflower," which sounds the same as "life is but a melancholy flower." After the engine has identified its candidate word sequences, we have to sort out which is the correct one.

The third step identifies the best sequence of words by using language modeling. A language model describes speech patterns in terms of words which are likely to be seen together. The conventional way of representing speech patterns is to create lists of two- and three-word sequences (bigrams and trigrams) together with their relative frequencies. For example, if the context is soccer, then the sequences "off side" and "half-time score" will be relatively common, while, conversely, "checkmate" (from chess) is very unlikely. Another way to think of this is that the language model helps you predict the most likely next word or words. In the soccer example, if you see the word "center," you will not be surprised if the next word is "forward" or "field."

All language models start from collections of the things people say or write in a particular context. For example, if you want to create a language model for the NY Times, you might compile a year's worth of editions and generate the relative counts for all the two- and three-word sequences you find in that collection. Similarly, if you want to create a language model for a medical specialty, you might gather transcripts of reports in that specialty and compile the relative frequencies of all the two- and three-word sequences in those transcripts. Models constructed in this manner represent a kind of average, since they reflect the combined usage of many users within a given field.

The language model helps us sort between the competing sequences of words produced by the conversion of phonemes into possible words and phrases. For example, suppose the recognition, thus far, has yielded two possible fragments: "over there" and "over their." If the next word identified is "heads," then the language model would help the engine choose "over their heads" as opposed to "over there heads."
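Taken together, steps two and three can be sketched in a few lines of Python: collapse duplicate phonemes, tile the sequence with lexicon pronunciations, and let bigram frequencies pick between homophones. The lexicon entries and bigram frequencies below are invented for illustration; real lexicons and language models are vastly larger.

```python
# Toy illustration of steps 2 and 3: phonemes -> words -> best phrase.
# Lexicon entries and bigram frequencies are invented for illustration.
from itertools import groupby

# Lexicon: spelling -> pronunciation (a sequence of phonemes).
LEXICON = {
    "over":  ["ow", "v", "er"],
    "there": ["dh", "eh", "r"],
    "their": ["dh", "eh", "r"],   # homophone of "there"
    "heads": ["hh", "eh", "d", "z"],
}

# Language model: relative frequencies of two-word sequences (bigrams).
BIGRAMS = {("over", "there"): 0.6, ("over", "their"): 0.4,
           ("there", "heads"): 0.01, ("their", "heads"): 0.7}

def dedupe(phonemes):
    """Collapse runs of repeats left over from per-frame labeling."""
    return [p for p, _ in groupby(phonemes)]

def candidates(phonemes, prefix=()):
    """Yield every word sequence whose pronunciations tile the input."""
    if not phonemes:
        yield prefix
    for word, pron in LEXICON.items():
        if phonemes[:len(pron)] == pron:
            yield from candidates(phonemes[len(pron):], prefix + (word,))

def score(sentence):
    """Multiply bigram frequencies; unseen pairs get a tiny penalty."""
    total = 1.0
    for pair in zip(sentence, sentence[1:]):
        total *= BIGRAMS.get(pair, 1e-6)
    return total

audio = dedupe(["ow", "ow", "v", "er", "dh", "eh", "eh", "r",
                "hh", "eh", "d", "z"])
print(" ".join(max(candidates(audio), key=score)))  # over their heads
```

Real engines search enormous word lattices with far more sophisticated scoring, but the principle is the same: the acoustics propose, and the language model disposes.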
After the language model has done its job, the engine has produced the longhand version of the transcript, where each word is spelled out fully in plain text. You can think of this as the raw text of the transcription, and, if you read it out loud, it will sound correct, but it (typically) will not look correct as printed text. For example, the raw transcript of the audio might be "one dollar and thirty three cents paid on March third two thousand and twelve."
However, most of us would expect the printed version to be "$1.33 paid on 03/03/2012" or "$1.33 paid on March 3rd, 2012." This brings us to the fourth and final step in the recognition process: formatting, or normalization. This is where the clean-up happens: substitutions that make the text appear in a form that is most comfortable to read, including punctuation; capital letters at the beginning of sentences; formatted dates, times, and monetary amounts; standard abbreviations; common acronyms; and so forth. For the most part, the formatting is handled by simple substitutions, which work like search-and-replace in a word processor. One of SayIt's features is that you can add your own formatting rules and/or substitutions. For example, in customer care, some clients replace "customer called in" with "CCI," and in health care, some users replace "alert and oriented times three" with "A&Ox3."

These four steps are a high-level summary of how speech recognition works in an ideal world. However, they also provide clues about what goes wrong in real-world situations and what you can do to remedy misfires. Building on this, let's now talk about how to troubleshoot the most common problems encountered with speech recognition.

Troubleshooting recognition errors

One common issue is poor-quality audio, typically due to one of the following: poor articulation, a poor-quality microphone, a poorly placed microphone boom (on headsets), background noise, electrical interference, speaking too loudly, or speaking too quietly. When the audio quality is poor, it is hard for the recognition engine to identify phonemes, since the frequency spectrum of the audio frames has been polluted and the calculated frequency-band strengths no longer correspond closely to entries in the look-up table. As the audio deteriorates, the phoneme sequence generated from the audio contains more errors, and the more errors there are, the harder it is for the engine to find the right words in the lexicon. This is why the first hurdle to be cleared in speech recognition is audio quality; it must be fixed before tackling anything else.

The quality of the audio also depends on the user's speech habits. If the user's audio contains lots of artifacts which are not phonetic, such as slurring, coughing, "ers," "ums," eating, drinking, and noisy breathing, then it is no surprise that the recognition engine will have difficulty interpreting the user's speech. The computer expects to hear phonemes, so there is no point making other sounds if you want good recognition.

The next problems are lexical: what happens when users say words which are not in the lexicon? This can occur when a word is completely missing, or when the word is present but the user's particular pronunciation is not. Since the recognizer is limited to just the words in its lexicon, it will find entries that are the closest match to what is being said. If a word you want is completely missing, then, no matter how you pronounce it, it will never be returned. On the other hand, if the desired word appears only sporadically, the issue may be a missing or inaccurate pronunciation. This is why SayIt has a feature which allows users to add words and their pronunciations to its lexicon.
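As a toy illustration of the pronunciation problem, and of why adding a pronunciation fixes it, consider a lexicon entry with only one listed pronunciation. The word and phonemes below are invented, and SayIt's add-word feature is, of course, a user interface rather than code; this sketch shows only the underlying idea.

```python
# Toy illustration: the word is in the lexicon, but the user's
# pronunciation is not. The word and phonemes are invented.
LEXICON = {
    "tomato": [["t", "ax", "m", "ey", "t", "ow"]],   # "tom-AY-to" only
}

def recognize(pronunciation):
    """Return words whose listed pronunciations include this one."""
    return [w for w, prons in LEXICON.items() if pronunciation in prons]

said = ["t", "ax", "m", "aa", "t", "ow"]             # "tom-AH-to"
print(recognize(said))                               # [] -- never matched

# Adding the user's pronunciation makes the word reachable:
LEXICON["tomato"].append(said)
print(recognize(said))                               # ['tomato']
```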
Perhaps the most perplexing errors in computer-based speech recognition are due to problems with the language model. Language models are difficult for users to diagnose and remedy, since users are unable to see what is inside the model. As defined previously, a language model is a collection of short sequences of words together with their relative frequencies. The models that are supplied out-of-the-box with recognizers are a kind of average, since they are compiled from numerous individuals (in a given subject-matter area). Inevitably, an average model is not a precise model for a given user, and it will occasionally push the transcription in a direction that makes sense for the average user, but not for a particular user. Thus, there needs to be a way for individual users to true up the supplied average model. SayIt allows users (or their administrators) to upload examples of their work, and it modifies the installed language model accordingly.

Another set of recognition errors results from combining two or more language models. The rationale for combining models is that users may switch between different types of work, for example, documenting office work (technical language) and e-mail (colloquial English). The least disruptive way to accommodate this back-and-forth in the workflow is to combine the language model for the technical work with the one for the e-mail correspondence. Most of the time this works well, but occasionally the wrong model will bleed through, and the transcription goes off the rails until the right language model reasserts itself. This can happen at a fork in the road, where the most recent words occur naturally in both language models, but the wrong model has higher relative frequencies, so the engine throws in its lot with that model and proceeds down the wrong fork. Sometimes this can be fixed by altering the blend of the two models, as sketched below. However, if the errors are persistent (and annoying), then the user will need two accounts, each configured for the relevant stand-alone model.
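To make the idea of a blend concrete, here is a toy sketch that linearly interpolates two bigram models with a mixing weight. Linear interpolation is one common way to combine models, though not necessarily the one SayIt uses, and the frequencies and weights below are invented for illustration.

```python
# Toy sketch of blending two bigram language models with a mixing weight.
# The bigram frequencies and weights are invented for illustration.
TECHNICAL  = {("central", "line"): 0.80, ("central", "park"): 0.01}
COLLOQUIAL = {("central", "line"): 0.05, ("central", "park"): 0.60}

def blended(bigram, weight):
    """Linear interpolation: 'weight' on the technical model."""
    return (weight * TECHNICAL.get(bigram, 0.0)
            + (1 - weight) * COLLOQUIAL.get(bigram, 0.0))

# At the "fork" after the word "central", the blend decides what follows:
for w in (0.5, 0.1):
    line = blended(("central", "line"), w)
    park = blended(("central", "park"), w)
    print(w, "line" if line > park else "park")
# 0.5 -> line; 0.1 -> park: altering the blend changes which model wins.
```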
A related issue occurs when the body of material used to prepare a language model is too broad and there are too many choices for the recognition engine to follow. This is the challenge faced by consumer-oriented services, such as Apple's Siri: the transcription frequently wanders off in undesired directions, which are often quite entertaining. Conversely, if the usage context allows for a tight language model, where the vocabulary and phraseology are narrowly defined, then the engine can yield good transcriptions even in difficult audio environments, simply because its choices are limited.

The last class of problems we consider, formatting or normalization, is usually the easiest to fix. These are errors in presentation, rather than recognition. In other words, if you read the transcribed text out loud, it will match what you said; the problem is that the text does not look right. A few examples will illustrate the point. You may be using a term which is spelled differently from its more familiar homophone: if you want to talk about a "Joule case," it is quite likely that the recognition engine will return "jewel case" (as in a CD container). Or, you might be talking about a product where you say "seventh heaven," but you would like it to appear as "Heaven7." These are straightforward substitutions, just like a search-and-replace in a word processor. In SayIt, you can provide the target and replacement strings (including special characters).

These forms of search-and-replace can be used to re-order words, make abbreviations, expand abbreviations, insert special characters, insert text-formatting effects (bold, italic, etc.), and so on. Recognition engines have built-in formatting rules for common things such as punctuation, currencies, dates, and times, which are hard to do with simple search-and-replace strings. More complex formatting effects, or changes to dates and currencies, may require scripting, which is beyond the scope of what most users are willing to do. The key to spotting normalization issues is to read the problematic transcription aloud: if it sounds right but looks wrong, the recognition worked, but the formatting needs help.
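Here is a minimal sketch of such a substitution pass, written as ordered regular-expression rules in Python. The rules are invented, mirroring the examples above, and SayIt's actual rule format may differ; a real engine's built-in rules also handle dates and currencies generically, rather than via hard-coded strings like these.

```python
# Toy normalization pass: ordered search-and-replace rules applied to the
# raw transcript. The rules are invented, mirroring the examples above.
import re

RULES = [
    (r"\bcustomer called in\b", "CCI"),
    (r"\balert and oriented times three\b", "A&Ox3"),
    (r"\bone dollar and thirty three cents\b", "$1.33"),
    (r"\bMarch third two thousand and twelve\b", "03/03/2012"),
]

def normalize(raw):
    for pattern, replacement in RULES:
        raw = re.sub(pattern, replacement, raw, flags=re.IGNORECASE)
    return raw

raw = "one dollar and thirty three cents paid on March third two thousand and twelve"
print(normalize(raw))  # $1.33 paid on 03/03/2012
```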
This concludes our quick tour through the mechanics of speech recognition. The key to understanding computer-based speech recognition is to remember the four steps described above and to use them as a guide to why recognition works when it does, and to potential fixes when it does not. For more information, contact info@nvoq.com or call 866-383-4500.