Charles Corfield has been in the technology sector for the last 25 years in both operational and investment roles. He has served at organizations ranging from start-ups to public companies, including stints on audit and compensation committees. His interest in voice automation to speed desktop workflows grew out of his experiences with his portfolio companies BeVocal (now Nuance) and iBasis (now KPN). He served as cofounder and chief technology officer of Frame Technology, which was acquired by Adobe. He is also an early investor in Silver Lake Partners.
How voice-to-text really works

After you have used computer-based speech recognition for a while, it is natural to wonder, "How does it work?" and, when it does something unexpected, to ask, "Where did that come from?" Let's take some of the mystery out of speech recognition by stepping through the process by which a computer turns your voice into text. There are four steps in the conversion of voice to text:

1. Audio to phonemes
2. Phonemes to words
3. Words to phrases
4. Raw transcribed text to formatted text

The first step turns a continuous audio stream into the basic sounds of English, which are called phonemes. US English has forty different sounds from which all the native words are built (see "US English Phonemes"). I say "native," because English has adopted words from other languages, which may contain non-native sounds. For example, the "ch" in "chutzpah" is a phoneme well known to New Yorkers, but it is not native. Similarly, the umlauted vowels ü and ö are common phonemes in German, but do not exist in US English.

In order to identify the sequence of phonemes in continuous speech, the computer divides the incoming audio into short time-slices called frames. For each frame, the computer measures the strength of pre-defined frequency bands within the overall range of speech (approximately 60 Hz to 7 kHz; see "Human Voice"). Thus, each frame is converted into a set of numbers (one number per frequency band). The recognition engine uses a reference table to find the best-matched phoneme for a given frame. This table contains representative frequency-band strengths for each of US English's forty phonemes. This is why you are asked to record a profile when using SayIt for the first time: the engine needs to know how you pronounce each of the forty phonemes, so as to get the best possible entries in the look-up table.
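To make the table look-up concrete, here is a minimal sketch in Python. The phoneme labels, the four-band layout, and the reference values are all invented for illustration; a real engine uses many more bands and far richer acoustic models.

```python
# Toy illustration of step 1: matching audio frames to phonemes.
# The phonemes, four-band layout, and reference values are invented;
# a real engine uses many more bands and richer acoustic models.
import math

# Representative frequency-band strengths for each phoneme (the look-up
# table that recording a profile personalizes to your voice).
REFERENCE_TABLE = {
    "k":  [0.1, 0.7, 0.9, 0.2],
    "ae": [0.8, 0.6, 0.3, 0.1],
    "t":  [0.1, 0.5, 0.8, 0.4],
}

def best_phoneme(frame):
    """Return the phoneme whose reference bands are closest to this frame."""
    def distance(bands):
        return math.sqrt(sum((f - b) ** 2 for f, b in zip(frame, bands)))
    return min(REFERENCE_TABLE, key=lambda p: distance(REFERENCE_TABLE[p]))

# Three incoming frames (one set of band strengths per time-slice):
frames = [[0.12, 0.68, 0.88, 0.20],
          [0.79, 0.62, 0.30, 0.12],
          [0.12, 0.50, 0.79, 0.41]]
print([best_phoneme(f) for f in frames])  # ['k', 'ae', 't']
```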
Note that the audio for one spoken phoneme may span multiple frames; thus, when we replace frames with their corresponding phonemes, we will end up with duplicates. The engine cleans up the sequence of converted phonemes by eliminating the duplicates.

The second step is to translate phonemes into words. The recognizer uses a lexicon, which contains all the words it knows about together with their pronunciations; each pronunciation is described using phonemes. For example, the pronunciation of the word "cat" has three phonemes: k, ae, t (see "US English Phonemes"). Some words have multiple pronunciations, and the lexicon has these too; for example, "either" has two common pronunciations, "ay dh ax" (eye-th-er) and "iy dh ax" (ee-th-er). Note that US English and UK English have similar lexicons, where the few differences consist of variations in pronunciation and spelling (e.g., "neighbor" vs. "neighbour").

As the recognizer moves along the sequence of phonemes, it looks for words hidden in the sequence. This is similar (in spirit) to the newspaper puzzle in which you must find words hidden within a grid of scrambled letters. The puzzle makes your life a little harder by allowing different directions (up, down, left, right, etc.), whereas the recognizer proceeds in one direction, left to right, but it is allowed to create overlapping sequences of words. Why? The answer is that different words and phrases may share the same pronunciations. A simple example is "there" and "their"; a more complex example comes from the campfire song "life is butter melon cauliflower," which sounds the same as "life is but a melancholy flower." After the engine has identified its candidate word sequences, we have to sort out which is the correct one.

The third step identifies the best sequence of words by using language modeling. A language model describes speech patterns in terms of words which are likely to be seen together. The conventional way of representing speech patterns is to create lists of two- and three-word sequences (bigrams and trigrams) together with their relative frequencies. For example, if the context is soccer, then the sequences "off side" and "half-time score" will be relatively common, while, conversely, "checkmate" (from chess) is very unlikely. Another way to think of this is that the language model helps you predict the most likely next word or words. In the soccer example, if you see the word "center," you will not be surprised if the next word is "forward" or "field."

All language models start from collections of the things people say or write in a particular context. For example, if you want to create a language model for the NY Times, you might compile a year's worth of editions and generate the relative counts for all the two- and three-word sequences you find in that collection. Similarly, if you want to create a language model for a medical specialty, you might gather transcripts of reports in that specialty and compile the relative frequencies of all the two- and three-word sequences in those transcripts. Models constructed in this manner represent a kind of average, since they reflect the combined usage of many users within a given field.

The language model helps us sort between the competing sequences of words produced by the conversion of phonemes into possible words and phrases. For example, suppose the recognition, thus far, has yielded two possible fragments: "over there" and "over their." If the next word identified is "heads," then the language model would help the engine choose "over their heads" as opposed to "over there heads."
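Taken together, steps two and three can be sketched in a few lines of Python: collapse duplicate phonemes, tile the sequence with lexicon pronunciations, and let bigram frequencies pick between homophones. The lexicon entries and bigram frequencies below are invented for illustration; real lexicons and language models are vastly larger.

```python
# Toy illustration of steps 2 and 3: phonemes -> words -> best phrase.
# Lexicon entries and bigram frequencies are invented for illustration.
from itertools import groupby

# Lexicon: spelling -> pronunciation (a sequence of phonemes).
LEXICON = {
    "over":  ["ow", "v", "er"],
    "there": ["dh", "eh", "r"],
    "their": ["dh", "eh", "r"],   # homophone of "there"
    "heads": ["hh", "eh", "d", "z"],
}

# Language model: relative frequencies of two-word sequences (bigrams).
BIGRAMS = {("over", "there"): 0.6, ("over", "their"): 0.4,
           ("there", "heads"): 0.01, ("their", "heads"): 0.7}

def dedupe(phonemes):
    """Collapse runs of repeats left over from per-frame labeling."""
    return [p for p, _ in groupby(phonemes)]

def candidates(phonemes, prefix=()):
    """Yield every word sequence whose pronunciations tile the input."""
    if not phonemes:
        yield prefix
    for word, pron in LEXICON.items():
        if phonemes[:len(pron)] == pron:
            yield from candidates(phonemes[len(pron):], prefix + (word,))

def score(sentence):
    """Multiply bigram frequencies; unseen pairs get a tiny penalty."""
    total = 1.0
    for pair in zip(sentence, sentence[1:]):
        total *= BIGRAMS.get(pair, 1e-6)
    return total

audio = dedupe(["ow", "ow", "v", "er", "dh", "eh", "eh", "r",
                "hh", "eh", "d", "z"])
print(" ".join(max(candidates(audio), key=score)))  # over their heads
```

Real engines search enormous word lattices with far more sophisticated scoring, but the principle is the same: the acoustics propose, and the language model disposes.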
After the language model has done its job, the engine has produced the longhand version of the transcript, where each word is spelled out fully in plain text. You can think of this as the raw text of the transcription, and, if you read it out loud, it will sound correct, but it (typically) will not look correct as printed text. For example, the raw transcript of the audio might be "one dollar and thirty three cents paid on March third two thousand and twelve."
However, most of us would expect the printed version to be "$1.33 paid on 03/03/2012" or "$1.33 paid on March 3rd, 2012." This brings us to the fourth and final step in the recognition process: formatting, or normalization. This is where the clean-up happens: substitutions that make the text appear in a form that is most comfortable to read, including punctuation; capital letters at the beginning of sentences; formatted dates, times, and monetary amounts; standard abbreviations; common acronyms; and so forth. For the most part, the formatting is handled by simple substitutions, which work like search-and-replace in a word processor. One of SayIt's features is that you can add your own formatting rules and/or substitutions. For example, in customer care, some clients replace "customer called in" with "CCI," and in health care, some users replace "alert and oriented times three" with "A&Ox3."

These four steps are a high-level summary of how speech recognition works in an ideal world. However, they also provide clues about what goes wrong in real-world situations and what you can do to remedy misfires. Building on this, let's now talk about how to troubleshoot the most common problems encountered with speech recognition.

Troubleshooting recognition errors

One common issue is poor-quality audio, typically due to one of the following: poor articulation, a poor-quality microphone, a poorly placed microphone boom (on headsets), background noise, electrical interference, speaking too loudly, or speaking too quietly. When the audio quality is poor, it is hard for the recognition engine to identify phonemes, since the frequency spectrum of the audio frames has been polluted and the calculated frequency-band strengths no longer correspond closely to entries in the look-up table. As the audio deteriorates, the phoneme sequence generated from the audio contains more errors, and the more errors there are, the harder it is for the engine to find the right words in the lexicon. This is why the first hurdle to be cleared in speech recognition is audio quality; it must be fixed before tackling anything else.

The quality of the audio also depends on the user's speech habits. If the user's audio contains lots of artifacts which are not phonetic, such as slurring, coughing, "ers," "ums," eating, drinking, and noisy breathing, then it is no surprise that the recognition engine will have difficulty interpreting the user's speech. The computer expects to hear phonemes, so there is no point making other sounds if you want good recognition.

The next problems are lexical: what happens when users say words which are not in the lexicon? This can occur when a word is completely missing, or when the word is present but the user's particular pronunciation is not. Since the recognizer is limited to just the words in its lexicon, it will find entries that are the closest match to what is being said. If a word you want is completely missing, then, no matter how you pronounce it, it will never be returned. On the other hand, if the desired word appears only sporadically, the issue may be a missing or inaccurate pronunciation. This is why SayIt has a feature which allows users to add words and their pronunciations to its lexicon.
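As a toy illustration of the pronunciation problem, and of why adding a pronunciation fixes it, consider a lexicon entry with only one listed pronunciation. The word and phonemes below are invented, and SayIt's add-word feature is, of course, a user interface rather than code; this sketch shows only the underlying idea.

```python
# Toy illustration: the word is in the lexicon, but the user's
# pronunciation is not. The word and phonemes are invented.
LEXICON = {
    "tomato": [["t", "ax", "m", "ey", "t", "ow"]],   # "tom-AY-to" only
}

def recognize(pronunciation):
    """Return words whose listed pronunciations include this one."""
    return [w for w, prons in LEXICON.items() if pronunciation in prons]

said = ["t", "ax", "m", "aa", "t", "ow"]             # "tom-AH-to"
print(recognize(said))                               # [] -- never matched

# Adding the user's pronunciation makes the word reachable:
LEXICON["tomato"].append(said)
print(recognize(said))                               # ['tomato']
```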
Perhaps the most perplexing errors in computer-based speech recognition are due to problems with the language model. Language models are difficult for users to diagnose and remedy, since users are unable to see what is inside the model. As defined previously, a language model is a collection of short sequences of words together with their relative frequencies. The models that are supplied out-of-the-box with recognizers are a kind of average, since they are compiled from numerous individuals (in a given subject-matter area). Inevitably, an average model is not a precise model for a given user, and it will occasionally push the transcription in a direction that makes sense for the average user, but not for a particular user. Thus, there needs to be a way for individual users to true up the supplied average model. SayIt allows users (or their administrators) to upload examples of their work, and it modifies the installed language model accordingly.

Another set of recognition errors results from combining two or more language models. The rationale for combining models is that users may switch between different types of work, for example, documenting office work (technical language) and e-mail (colloquial English). The least disruptive way to accommodate this back-and-forth in the workflow is to combine the language model for the technical work with the one for the e-mail correspondence. Most of the time this works well, but occasionally the wrong model will bleed through, and the transcription goes off the rails until the right language model reasserts itself. This can happen at a fork in the road, where the most recent words occur naturally in both language models, but the wrong model has higher relative frequencies, so the engine throws in its lot with that model and proceeds down the wrong fork. Sometimes this can be fixed by altering the blend of the two models, as sketched below. However, if the errors are persistent (and annoying), then the user will need two accounts, each configured for the relevant stand-alone model.
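To make the idea of a blend concrete, here is a toy sketch that linearly interpolates two bigram models with a mixing weight. Linear interpolation is one common way to combine models, though not necessarily the one SayIt uses, and the frequencies and weights below are invented for illustration.

```python
# Toy sketch of blending two bigram language models with a mixing weight.
# The bigram frequencies and weights are invented for illustration.
TECHNICAL  = {("central", "line"): 0.80, ("central", "park"): 0.01}
COLLOQUIAL = {("central", "line"): 0.05, ("central", "park"): 0.60}

def blended(bigram, weight):
    """Linear interpolation: 'weight' on the technical model."""
    return (weight * TECHNICAL.get(bigram, 0.0)
            + (1 - weight) * COLLOQUIAL.get(bigram, 0.0))

# At the "fork" after the word "central", the blend decides what follows:
for w in (0.5, 0.1):
    line = blended(("central", "line"), w)
    park = blended(("central", "park"), w)
    print(w, "line" if line > park else "park")
# 0.5 -> line; 0.1 -> park: altering the blend changes which model wins.
```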
A related issue occurs when the body of material used to prepare a language model is too broad and there are too many choices for the recognition engine to follow. This is the challenge faced by consumer-oriented services, such as Apple's Siri: the transcription frequently wanders off in undesired directions, which are often quite entertaining. Conversely, if the usage context allows for a tight language model, where the vocabulary and phraseology are narrowly defined, then the engine can yield good transcriptions even in difficult audio environments, simply because its choices are limited.

The last class of problems we consider, formatting or normalization, is usually the easiest to fix. These are errors in presentation, rather than recognition. In other words, if you read the transcribed text out loud, it will match what you said; the problem is that the text does not look right. A few examples will illustrate the point. You may be using a term which is spelled differently from its more familiar homophone: if you want to talk about a "Joule case," it is quite likely that the recognition engine will return "jewel case" (as in a CD container). Or, you might be talking about a product where you say "seventh heaven," but you would like it to appear as "Heaven7." These are straightforward substitutions, just like a search-and-replace in a word processor. In SayIt, you can provide the target and replacement strings (including special characters).

These forms of search-and-replace can be used to re-order words, make abbreviations, expand abbreviations, insert special characters, insert text-formatting effects (bold, italic, etc.), and so on. Recognition engines have built-in formatting rules for common things such as punctuation, currencies, dates, and times, which are hard to do with simple search-and-replace strings. More complex formatting effects, or changes to dates and currencies, may require scripting, which is beyond the scope of what most users are willing to do. The key to spotting normalization issues is to read the problematic transcription aloud: if it sounds right but looks wrong, the recognition worked, but the formatting needs help.
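Here is a minimal sketch of such a substitution pass, written as ordered regular-expression rules in Python. The rules are invented, mirroring the examples above, and SayIt's actual rule format may differ; a real engine's built-in rules also handle dates and currencies generically, rather than via hard-coded strings like these.

```python
# Toy normalization pass: ordered search-and-replace rules applied to the
# raw transcript. The rules are invented, mirroring the examples above.
import re

RULES = [
    (r"\bcustomer called in\b", "CCI"),
    (r"\balert and oriented times three\b", "A&Ox3"),
    (r"\bone dollar and thirty three cents\b", "$1.33"),
    (r"\bMarch third two thousand and twelve\b", "03/03/2012"),
]

def normalize(raw):
    for pattern, replacement in RULES:
        raw = re.sub(pattern, replacement, raw, flags=re.IGNORECASE)
    return raw

raw = "one dollar and thirty three cents paid on March third two thousand and twelve"
print(normalize(raw))  # $1.33 paid on 03/03/2012
```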
This concludes our quick tour through the mechanics of speech recognition. The key to understanding computer-based speech recognition is to remember the four steps described above and to use them as a guide to why recognition works when it does, and to potential fixes when it does not. For more information, contact info@nvoq.com or call 866-383-4500.