
BOB: A LEXICON AND PRONUNCIATION DICTIONARY GENERATOR

Vincent Wan*, John Dines†, Asmaa El Hannani*, Thomas Hain*

*Dept. Computer Science, University of Sheffield, UK
{v.wan, a.elhannani, t.hain}@dcs.shef.ac.uk

†IDIAP Research Institute, Switzerland
j.dines@idiap.ch


ABSTRACT

This paper presents Bob, a tool for managing lexicons and generating pronunciation dictionaries for automatic speech recognition systems. It aims to maintain a high level of consistency between lexicons and language modelling corpora by managing the text normalisation and lexicon generation processes in a single dedicated package. It also aims to maintain consistent pronunciation dictionaries by generating pronunciation hypotheses automatically and aiding their verification. The tool's design and functionality are described. Two case studies highlighting the importance of consistency and illustrating the use of the tool are also reported.

Index Terms: Speech recognition, text processing, software packages

1. INTRODUCTION

In speech and language research there are many freely available packages, such as HTK [1] for large vocabulary speech recognition, the SRI-LM toolkit [2] for language modelling and Festival [3] for speech synthesis. This paper presents Bob, a software tool for managing and generating lexicons and pronunciation dictionaries to aid the development of automatic speech recognition (ASR) systems. The tool is written in Java and is freely available1 for non-commercial use under a Creative Commons licence. Specifically, it targets lexicon and pronunciation dictionary generation in a way that maintains: consistency between spellings in the lexicon and spellings in text corpora used for language modelling; and consistency in the use, by non-phoneticians, of a set of phones to describe the pronunciation of words in a language.

To understand the difficulty and importance of achieving consistency, let us examine the lexicon and pronunciation dictionary generation process, shown in the leftmost path of figure 1. ASR lexicons are often limited in size: shorter word lists reduce the size of language models and speed up decoding by reducing the search space. Therefore, lexicons are usually tailored to a specific domain. Specialist vocabulary from the target domain is captured by taking a sample of development text D. To make best use of any existing (background) language model corpora, the words in D must be normalised. This means that the spelling of words in D must be made the same as those in the existing corpora: for example, if all of the background corpora use the UK spelling "colour" then any occurrences of the US spelling "color" in D should be mapped to the UK version.
This work was partly supported by the European IST Programme Project AMIDA (Augmented Multi-party Interaction with Distance Access) FP6033812. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.
1 Available for download at www.webasr.com.


[Figure 1: flow chart with the steps: start; development text D; text normalisation; lexicon generation (using words & attributes, background unigram counts and the OOV list); pad lexicon to desired size; new word pronunciation generation (producing verified pronunciations); pronunciation dictionary generation; finish.]

Fig. 1. Flow chart showing Bob's processing steps.

ASR systems usually impose a limit on the number of words in the lexicon. If text normalisation were not applied then the lexicon would contain alternative spellings of the same word, leading to an effective reduction in the potential lexicon size and an unnecessary increase in the size of any language model (LM), which would need to store duplicate statistics for the alternative spellings. On the other hand, if the correct version of a word were not included in the lexicon then the LM would be weaker, as the statistics of the incorrect version are poor.

However, producing a normalised text corpus, be it a development text of 100K words or a large collection of 100M words, is tedious and error prone. Spelling mistakes can be difficult to correct (even when the word is highlighted) and may be replaced with other mistakes. There is often little or no consideration of the content of existing corpora and how they were normalised previously. If several people work on the task then more inconsistencies are likely to be introduced as each person uses their own strategy: this is a problem that is compounded over time as different people work on different parts of an ever increasing set of corpora.

Once a lexicon is produced, the corresponding pronunciation dictionary must be generated. Although pronunciations need only be generated for new words that have not been phonetically transcribed already, consistency problems may still arise. The people employed to do this are rarely expert phoneticians. To them, whether ABOUT should be phonetically transcribed as "ax b ow t" or "ax b aw t" is not immediately obvious. The decision is made by examining other similar sounding words that have already been transcribed; however, a person's accent can affect this. Deriving a pronunciation from a word's spelling automatically is also not 100% accurate, so such pronunciations need to be verified manually as well.


Again, if several people work on the task then there is a greater chance that inconsistencies arise. Inconsistencies in the phonetic transcriptions, introduced either through automation or human error, may affect acoustic modelling adversely by polluting the samples of one phone with those of other phones.

The tool attempts to address the above issues by putting all of the processing steps into a single framework that is built upon knowledge and experience of the procedure. It constrains users to adopt the procedure shown in figure 1, which has been used successfully in the NIST Rich Transcription (RT) evaluations [4]. Constraining the way people work is likely to increase the consistency of their output. The tool also aims to both simplify and speed up the text normalisation, lexicon and dictionary generation processes by automating many of the things that a person would otherwise have to do manually using many different tools. Combined with many different types of error checking, the tool reduces the chances that an individual has to introduce human errors.

This paper now describes the tool's design and the framework used to obtain better consistency from non-experts. Descriptions of experiments showing the importance of pronunciation (section 3) and lexicon (section 4) consistency follow thereafter.

2. DESIGN

The tool is designed around a number of processes described in sections 2.1 to 2.4. Figure 1 shows the flow of information between each process.

2.1. Word attributes

At the heart of the tool is a database of words. Attached to each word are its pronunciations and two additional attributes2. The attributes are category and quality. They are optional abstract quantities that users can assign to every word. The category is one of a closed set of predefined strings (which can be chosen arbitrarily) while quality is an integer from 0 to 5. These attributes provide a simple way to filter words when producing a lexicon. For example, to prevent alternative word spellings entering the lexicon, simply place such words in a "multiple spellings" category and assign different quality values to each of the alternatives, with the preferred spelling given the highest value.

2.2. Lexicon generation

Generating consistent lexicons for ASR is one of Bob's primary functions. The user must first select the word attributes that can be used to create a lexicon of some specified size. The tool will then consider every unique word in D, starting with the most frequently occurring word. This ensures that the most important keywords from the domain are present in the list. If the word exists in the database then it is included in the lexicon provided the attributes match those chosen by the user (only words that have verified pronunciations are stored in the tool's database). Any word that was not included in the lexicon is placed into an out of vocabulary (OOV) word list, which is handled by the text normalisation process (section 2.3).

Typically the number of unique words in D is not sufficiently large to fill the lexicon, so the remaining words are taken from existing background language modelling corpora. This is the lexicon padding process shown in figure 1. An up-to-date unigram count of the words available in the background corpora is required. These counts are over all of the words in the background corpora, not just those stored in the tool's database, and are easily created using tools such as the SRI-LM toolkit [2]. The words included in the lexicon are those that have the highest occurrence count in the background corpora and have the highest quality attributes from the set of attributes chosen by the user. (A sketch of this selection and padding logic is given below.)
2 The pronunciations are stored separately from the word attributes in order to comply with licensing agreements that prevent the redistribution of dictionaries.
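The following is a minimal sketch of the lexicon generation and padding step described in section 2.2. It is illustrative only: the Word record, attribute names and function signatures are assumptions for this sketch, not Bob's actual Java implementation.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Word:
    spelling: str
    category: str        # e.g. "normal", "multiple spellings", "expletive"
    quality: int         # 0 (worst) to 5 (best)
    pronunciations: list = field(default_factory=list)  # verified prons only

def build_lexicon(dev_text, database, background_counts, allowed_categories,
                  min_quality, target_size):
    """Select lexicon words from development text D, then pad from the
    background corpora. Returns (lexicon, oov_list)."""
    lexicon, oov = [], []
    dev_counts = Counter(dev_text.split())

    # Pass 1: words from D, most frequent first.
    for spelling, _ in dev_counts.most_common():
        entry = database.get(spelling)
        if entry and entry.category in allowed_categories and entry.quality >= min_quality:
            lexicon.append(spelling)
        else:
            oov.append(spelling)   # handled by text normalisation (section 2.3)
        if len(lexicon) >= target_size:
            return lexicon, oov

    # Pass 2: pad with background words, ranked by quality then unigram count.
    candidates = [w for w in database.values()
                  if w.spelling not in lexicon
                  and w.category in allowed_categories
                  and w.quality >= min_quality]
    candidates.sort(key=lambda w: (w.quality, background_counts.get(w.spelling, 0)),
                    reverse=True)
    for w in candidates:
        lexicon.append(w.spelling)
        if len(lexicon) >= target_size:
            break
    return lexicon, oov
```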

2.3. Text normalisation

The next issue that Bob addresses is spelling consistency. Text normalisation aims to make the spelling of words consistent across all normalised corpora. In English, spelling variations may be due, for example, to UK versus US differences (colour/color), inconsistent use of -ise and -ize suffixes in British English, or simply spelling mistakes. If variations such as these are frequent then they can corrupt any statistics derived for language modelling, particularly if the original text corpus is small.

If the generated lexicon is consistent then text normalisation of D is best performed by examining the OOV list, as it will contain all the words that were excluded. Words may be excluded because they were filtered by the word attributes or because they do not exist in the tool's database. The latter case corresponds to words that either have normalisation errors, which need fixing, or are legitimate words that should be added to the database. The most frequent OOV words are presented to the user for correction first. The tool helps the user to determine the most appropriate course of action for each OOV word by presenting the following information:

1. If the OOV word has word attribute assignments then they are displayed: if the word did not meet the word attribute criteria for inclusion in the lexicon then the user will be able to see that this was the case and act accordingly.

2. The Levenshtein (or edit) distance between each OOV word and every other word in the database is computed automatically. A list of the most similar words is shown to assist the user in detecting spelling mistakes. The correction can be made with a single click. This approach also prevents the user from inadvertently introducing more errors at this stage. (A sketch of this ranking is given at the end of this section.)

3. The occurrence count in the background corpora (mentioned in section 2.2) of each OOV word can help the user to choose between different spelling variations. Choosing the variation that is more common in existing corpora means that the word's statistics will be more reliable in any resulting LM.

The tool provides an interface through which all of the above pieces of information are shown to the user simultaneously and his/her decisions can be recorded or applied with a minimum of effort. Text normalisation fixes can be saved separately in case they need to be applied to the background corpora or other future corpora.
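As a rough illustration of item 2 above, the snippet below ranks in-database words by edit distance to an OOV word and breaks ties using the background unigram counts of item 3. The function names and ranking heuristic are assumptions for illustration, not the tool's actual code.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest_corrections(oov_word, database_words, background_counts, n=5):
    """Return the n most plausible replacements for an OOV word:
    smallest edit distance first, most frequent in background corpora second."""
    ranked = sorted(database_words,
                    key=lambda w: (levenshtein(oov_word, w),
                                   -background_counts.get(w, 0)))
    return ranked[:n]

# Example: suggest_corrections("color", db_words, counts) would typically rank
# "colour" near the top if the background corpora use UK spellings.
```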

2.4. Pronunciation generation

The final function of Bob is to aid the generation of pronunciations for new words while maintaining a high level of consistency in the use of each phone in the phonetic transcriptions. To achieve this efficiently, an automatic pronunciation hypothesis generator is incorporated that uses classification and regression trees (CARTs) [5] (see section 3 for more details).

The CART predictor can generate pronunciation hypotheses for any word. Additionally, pronunciations for partial words are automatically derived from existing entries in the dictionary by examining the phone sequences of matching complete words. Regular expression and Levenshtein distance based search facilities are available to allow the user to quickly cross check hypotheses against verified entries in the pronunciation dictionary. By simplifying the search process the user is able to cross check against many more entries for greater consistency. Once a new pronunciation has been verified it can be saved permanently for future use. The tool also performs a number of basic checks automatically, such as verifying that every phone entered is valid and highlighting different words that share the same pronunciation; a sketch of such checks follows below. A screen shot of the pronunciation generation interface is shown in figure 2.

Fig. 2. Screen shot of the pronunciation generation interface.
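Below is a minimal sketch of the kind of automatic checks just described: validating that every phone in a new pronunciation belongs to the phone set and flagging words that share a pronunciation. The phone set and function names are assumptions for illustration; Bob's actual phone inventory depends on the dictionary in use.

```python
# Hypothetical phone subset for illustration only.
PHONE_SET = {"ax", "b", "aw", "ow", "t", "k", "s", "ih", "n"}

def check_pronunciation(word, phones, dictionary, phone_set=PHONE_SET):
    """Return a list of warnings for a candidate pronunciation.
    `dictionary` maps existing words to lists of pronunciations (phone tuples)."""
    warnings = []

    # 1. Every phone must be a member of the phone set.
    invalid = [p for p in phones if p not in phone_set]
    if invalid:
        warnings.append(f"{word}: invalid phones {invalid}")

    # 2. Highlight different words that share the same pronunciation.
    for other, prons in dictionary.items():
        if other != word and tuple(phones) in map(tuple, prons):
            warnings.append(f"{word}: same pronunciation as '{other}'")

    return warnings

# Example usage with a toy dictionary:
toy_dict = {"ABOUT": [("ax", "b", "aw", "t")]}
print(check_pronunciation("ABOUT'S", ["ax", "b", "aw", "t"], toy_dict))
```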

3. PRONUNCIATION CONSISTENCY

Intuitively, it is important to obtain consistent pronunciation dictionaries in order to obtain good ASR performance. It is possible to have the pronunciations of all words generated automatically using the CART based letter-to-sound predictor. Normally, a user would perform manual corrections on the predicted pronunciations to improve consistency. The following experiment measures the loss in ASR accuracy due to inconsistencies when the dictionary is not manually corrected.

A base dictionary of 115K words was generated from the UNISYN pronunciation dictionary [6] with pronunciations mapped to the General American accent. The CART was trained on the base dictionary using tools provided with the Festival speech synthesis software [3], with a left and right context of five letters and a left context of two phones (a sketch of this feature layout is given after table 1). This gave 98% phone accuracy and 89% word accuracy on the base dictionary. On a held-out set of 11.5K manually generated and checked pronunciations the accuracies of the CART predictor were 89% and 51% respectively. Although the word accuracy is quite low on new words (many of which were proper names, partial words, etc.), the phone accuracy remains relatively high.

The base dictionary was compared with the CART-generated dictionary on a conversational telephone speech (CTS) task. CTS models were trained using a standard basic configuration. The front-end extracted 39-dimensional MF-PLP vectors which were cepstral mean and variance normalised. For each dictionary, state clustered triphone models were trained using the maximum likelihood criterion on approximately 300 hours of CTS training data comprising the Switchboard-I, Switchboard Cellular and Callhome English corpora. LMs were built using a 40K word lexicon on a combination of broadcast news texts and the CTS transcripts. Word error rate (WER) results on the 2001 NIST CTS data (6 hours of speech) are shown in table 1. The WER using the inconsistent dictionary is 4.2% (absolute) worse than when using the consistent dictionary. This highlights the importance of manually checking the pronunciation dictionaries if better ASR is desired. However, even without manual correction, the CART letter-to-sound prediction is still usable.

  Dictionary            WER (%)
  base (consistent)     35.3
  CART (inconsistent)   39.5

Table 1. Comparing the effect of consistent vs. inconsistent pronunciation dictionaries.
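To make the CART context configuration concrete, the sketch below shows how training examples for a letter-to-sound classifier could be constructed with five letters of left and right context and two preceding phones. It is a simplified illustration assuming a pre-computed one-to-one letter-to-phone alignment (with an epsilon symbol for silent letters); it is not the Festival/CART implementation itself.

```python
# A toy aligned example: each letter is paired with a phone or "_" (epsilon).
# Real training data requires an automatic letter-to-phone alignment step.
PAD = "#"

def lts_features(letters, phones, letter_ctx=5, phone_ctx=2):
    """Yield (features, target_phone) pairs for training a letter-to-sound
    classifier, using 5 letters of left/right context and 2 phones of left
    context, as in the CART configuration described above."""
    padded = [PAD] * letter_ctx + list(letters) + [PAD] * letter_ctx
    phone_hist = [PAD] * phone_ctx
    for i, target in enumerate(phones):
        window = padded[i:i + 2 * letter_ctx + 1]   # letter context window
        features = window + list(phone_hist)        # plus the preceding phones
        yield features, target
        phone_hist = (phone_hist + [target])[-phone_ctx:]

# Example: the word "about" aligned to phones (epsilon marks a silent letter).
for feats, phone in lts_features("about", ["ax", "b", "aw", "_", "t"]):
    print(phone, feats)
```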

4. UPDATING LANGUAGE MODELS

This section describes a scenario in which Bob's functionality was exploited fully. The application was to automatically transcribe a section of an internal AMIDA project meeting. The ASR transcription was fed into a set of demonstrations of AMIDA technology (e.g. summarisation) to be presented later that day.
The tool was used to produce an updated lexicon and pronunciation dictionary and to normalise two new corpora so that better language models could be used in the application.

A one hour ten minute recording was made. The recording consisted of two presentations, each followed by a heated discussion. There were 10 active speakers in total. Speakers had Dutch, French, German and various regional English accents. A microphone array was used to record the audience while a lapel microphone was used to record the presenters. The recording was transcribed using a modified first pass of the AMIDA RT07s ASR system [4]: this consists of bigram lattice generation using maximum likelihood models without speaker adaptation, followed by a four-gram lattice expansion.

4.1. The task

The two original LMs from RT07s were trained on a variety of texts totalling over 2 billion words and were tuned specifically for boardroom (conf) and lecture theatre (lect) style meetings [4]. However, the topics and style of the project meeting were known to be different, so it was decided to build new LMs instead of reusing the old ones. Additionally, two new text corpora were available that could improve language modelling: the project outputs (ten PDF documents, 305K words) and the AMIDA Wiki pages (384K words).

In this task the tool was used to clean up the new corpora quickly. It is particularly important to normalise texts extracted from PDF documents as the extraction can be an errorful process. One common error is due to ligatures in certain fonts (e.g. the letters "fi" are sometimes converted to a single character where the top of the f is joined with the dot over the i). Thus a naïve conversion to plain text may introduce a large number of spelling errors; a sketch of a simple repair heuristic is given below. Since these documents are so closely related to the meeting, such errors could reduce the effectiveness of the LM. After normalisation, a new lexicon and pronunciation dictionary containing over 60 new words was produced. The text D used to derive the lexicon and estimate LM interpolation weights is a combination of the two RT07s conf and lect development texts plus a 30 minute transcription of another project meeting made several months earlier.
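The following is a minimal sketch of how such ligature-loss errors might be repaired during normalisation, assuming a list of known-good spellings. The word list and candidate ligatures are illustrative; Bob's own normalisation works through the OOV interface described in section 2.3.

```python
# Ligature strings that are often dropped or mangled during PDF-to-text conversion.
LIGATURES = ["ff", "fi", "fl", "ffi", "ffl"]

def repair_ligatures(token, known_words):
    """If a token is not a known word, try re-inserting a lost ligature at each
    position and return the first candidate that is a known word."""
    if token.lower() in known_words:
        return token
    for i in range(len(token) + 1):
        for lig in LIGATURES:
            candidate = token[:i] + lig + token[i:]
            if candidate.lower() in known_words:
                return candidate
    return token  # leave it for manual correction via the OOV list

known = {"definition", "workflow", "specification", "office"}
print(repair_ligatures("denition", known))   # -> "definition"
print(repair_ligatures("workow", known))     # -> "workflow"
```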


                       Using D for interpolation     Using E for interpolation
                          (with text norm)           (with text norm)          (no text norm)
  LM                   PPL on D  weight  WER        PPL on E  weight  WER     PPL on E  weight  WER
  AMIDA PDFs only        631      0.01    -           603      0.07    -        715      0.04    -
  AMIDA Wiki only        355      0.02    -           357      0.08    -        377      0.07    -
  AMIDA full             102      -       46.1        129      -       45.8     130      -       46.1

Table 3. Perplexities and interpolation weights of the LMs derived from the two new corpora, with and without text normalisation, and their impact on the final interpolated LM. Interpolating on the manual transcript E gives the theoretical limit achievable.

  LM           Lexicon   LM intrp. data   New data     PPL on E   WER (%)
  RT07sconf    RT07s     conf             not incl'd   143        48.8
  RT07slect    RT07s     lect             not incl'd   138        48.4
  AMIDA full   RT07s     D                not incl'd   134        48.2
  AMIDA full   AMIDA     D                not incl'd   142        46.9
  AMIDA full   AMIDA     D                unnorm'd     136        46.4
  AMIDA full   AMIDA     D                norm'd       134        46.1

Table 2. Comparing the various stages of the LM update. Note that LM perplexities are not comparable when the lexicons are different.
4.2. Results

Let E denote the reference transcript of the AMIDA meeting being recognised. Table 2 shows the perplexities (PPLs) and word error rates (WERs) of the LM at various stages of the update. All language models were built using the SRI-LM toolkit [2] with 50,000 word lexicons and Kneser-Ney discounting [7]. The first two rows show the results of the original, unmodified RT07s LMs optimised for conf and lect data. The third row shows the improvement obtained by reoptimising the LM, tuning the interpolation weights for the D data. The fourth row further incorporates the updated lexicon, which has a 0.2% lower OOV rate than the RT07s lexicon. The fifth row adds the new AMIDA PDF and Wiki corpora without text normalisation and the last row adds them with normalisation.

The fully updated AMIDA full LM yielded a 2.3% absolute reduction in WER compared to the RT07slect LM. The improvement can be broken down as follows: 0.2% comes from reoptimising the interpolation weights, 1.3% from updating the lexicon and 0.8% from the addition of the two new language modelling corpora (of which 0.3% would be lost without text normalisation).

Table 3 analyses the effectiveness of the new LM corpora. The LMs are reinterpolated on the reference transcription E to observe their full impact. Although this is an unrealistic test, it tells us the theoretical WER that could have been achieved and gives additional insight into the effectiveness of aspects of the LM build. The first block in table 3, using D for interpolation with text normalisation applied to the new corpora, corresponds to the AMIDA full LM. The middle block, using E for the interpolation instead, indicates that the weights for the new AMIDA PDF and Wiki LMs could have been much larger than was initially estimated and that this would have led to a slightly lower WER: a further 0.3% could have been gained through LM interpolation. The third block also uses E for the interpolation but without applying any text normalisation to the new corpora. The PPLs, interpolation weights and WER highlight the importance of text normalisation. In agreement with table 2, 0.3% of the WER reduction is attributable to text normalisation. A sketch of the interpolation itself is given below.
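As background for the interpolation weights reported in tables 2 and 3, linear LM interpolation mixes component models with weights that sum to one, usually estimated to minimise perplexity on a held-out text (here D or E). A minimal sketch using unigram components for brevity; in the actual system the components are n-gram LMs built with SRI-LM, and the toy probabilities below are purely illustrative.

```python
import math

def interpolate(lms, weights):
    """Return a function giving the linearly interpolated probability of a word.
    `lms` is a list of dicts mapping word -> probability; weights sum to 1."""
    def prob(word):
        return sum(w * lm.get(word, 1e-10) for lm, w in zip(lms, weights))
    return prob

def perplexity(prob, text):
    """Perplexity of the interpolated model on a list of words."""
    log_sum = sum(math.log(prob(w)) for w in text)
    return math.exp(-log_sum / len(text))

# Toy example: a background LM mixed with a small in-domain LM.
background = {"the": 0.05, "meeting": 0.001, "project": 0.002}
in_domain  = {"the": 0.04, "meeting": 0.01,  "project": 0.02}
mixed = interpolate([background, in_domain], [0.9, 0.1])
print(perplexity(mixed, ["the", "project", "meeting"]))
```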

5. CONCLUSION

Bob is a tool that manages lexicons and pronunciation dictionaries for automatic speech recognition systems. It helps to maintain lexicons that are consistent with the words in language modelling corpora by managing the text normalisation and lexicon generation processes in a single dedicated package. For maintaining consistent pronunciation dictionaries it includes a letter-to-sound pronunciation predictor and it facilitates manual pronunciation validation. The two case studies described in this paper highlight the importance of text normalisation in language modelling and of consistency in the use of words and phones in lexicons and dictionaries.

Bob is built from the experience, knowledge and procedures that have been applied successfully in the NIST RT evaluations. It is flexible enough to enable users to generate lexicons tailored for specific tasks, including the ability to exclude classes of words (e.g. removing expletives). The word attribute labels may be customised and extended to suit the user's needs. Finally, it is portable across languages: the interface is not specific to English and the letter-to-sound pronunciation predictor can be retrained for any language.

6. REFERENCES

[1] S. J. Young, "The HTK HMM toolkit: Design and philosophy," Tech. Rep., Univ. Cambridge, Dept. Eng., 1993.
[2] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proc. ICSLP, 2002.
[3] A. W. Black, P. Taylor, and R. Caley, "The Festival speech synthesis system, version 1.95beta," CSTR, University of Edinburgh, Edinburgh, 2004.
[4] T. Hain, L. Burget, M. Karafiat, J. Dines, D. van Leeuwen, G. Garau, M. Lincoln, and V. Wan, "The 2007 AMI(DA) system for meeting transcription," in Proc. RT07 Workshop, 2007.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1984.
[6] S. Fitt, "Documentation and user guide to UNISYN lexicon and post-lexical rules," Tech. Rep., Centre for Speech Technology Research, Edinburgh, 2000.
[7] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in Proc. ICASSP, 1995.
