DOI 10.1007/s10772-008-9009-1
Received: 1 October 2008 / Accepted: 9 October 2008 / Published online: 28 October 2008
© Springer Science+Business Media, LLC 2008
Int J Speech Technol (2006) 9: 133–150

Abstract  Although the Arab world has an estimated 250 million Arabic speakers, there has been little research on Arabic speech recognition compared to other languages of similar importance (e.g. Mandarin). Due to the lack of diacritized Arabic text and the lack of a Pronunciation Dictionary (PD), most previous work on Arabic Automatic Speech Recognition has concentrated on developing recognizers using Romanized characters, i.e. the system recognizes the Arabic word as an English one and then maps it back to the Arabic word through a lookup table relating each Arabic word to its Romanized pronunciation. In this work, we introduce the first SPHINX-IV-based Arabic recognizer and propose an automatic toolkit capable of producing a PD for both the Holy Quran and standard Arabic. Three corpora were developed entirely in this work: the Holy Quran Corpus HQC-1 (about 18.5 hours), the command-and-control corpus CAC-1 (about 1.5 hours) and the Arabic digits corpus ADC (less than one hour of speech). The building process is completely described. Fully diacritized Arabic transcriptions were developed for all three corpora as well. The SPHINX-IV engine was customized and trained for both the language model and the lexicon modules shown in the framework architecture block diagram. Using the three corpora, the PD produced by our automatic tool, and the transcripts, the SPHINX-IV engine was trained and tuned to develop three acoustic models, one per corpus. Training is based on an HMM whose parameters are built from statistical information and random-variable distributions extracted from the training data itself. A new algorithm is proposed to add unlabeled data to the training corpus in order to increase its size. This algorithm is based on a neural-network confidence scorer and is used to annotate the decoded speech in order to decide whether a proposed transcript is accepted and can be added to the seed corpus. The model parameters were fine-tuned using a simulated annealing algorithm; optimum values were tested and reported. Our major contribution is using the open-source SPHINX-IV model for Arabic speech recognition by building our own language and acoustic models without Romanization of the Arabic speech. The system is fine-tuned and the data are refined for training and validation. Optimum values for the number of Gaussian mixture distributions and the number of states in the HMMs were found according to specified performance measures. Optimum values for the confidence scores were found for the training data. Although much more work needs to be done, we consider the corpora used in our system sufficient to validate our approach. SPHINX has never been used before in this manner for Arabic speech recognition. This work is an invitation for all open-source speech recognition developers and groups to take over and capitalize on what we have started.

Keywords  SPHINX engine · Pronunciation Dictionary · Diacritic Arabic

H. Hyassat
Arab Academy of Business and Financial Sciences, Amman, Jordan

R. Abu Zitar (✉)
School of Computing and Engineering, New York Institute of Technology, Amman, Jordan
e-mail: rzitar@nyit.edu

1 Introduction

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors. Along with this increased availability comes the demand for recognizers in many languages that have often not been the focus of speech recognition research; Arabic is, so far, one of these languages. With the increasing role of computers in our lives, there is a desire to communicate with them naturally, and speech processing by computer provides one vehicle for natural communication between man and machine. Interactive networks provide easy access to a wealth of information and services that will fundamentally affect how people work, play and conduct their daily affairs.

The average citizen needs to communicate with these networks using natural communication skills and everyday devices, such as telephones (mobile or fixed) and televisions. Without fundamental advances in user-centered interfaces, a large portion of society will be prevented from participating in the information era, resulting in further stratification of society and a tragic loss in human potential. Automatic Speech Recognition (ASR) is one of these interfaces; it has witnessed enormous progress over the last decade and can now be performed reliably on large vocabularies, on continuous speech and speaker-independently. The word error rate of these recognizers under special conditions is often below 10 percent (Pallet et al. 1999), while for general-purpose Large Vocabulary Continuous Speech Recognizers (LVCSR) the best word error rates were as high as 23.9% (Rosti 2004; Hain et al. 2003) for the English language.

With an estimated 250 million native speakers, Arabic is the sixth most widely spoken language in the world, yet research on ASR for Arabic is very limited compared to other languages of similar importance, such as Mandarin (Kirchhoff et al. 2002).

Most previous work on Arabic ASR aims at developing recognizers for either Modern Standard Arabic (MSA) or Egyptian Colloquial Arabic (ECA). Some Word Error Rate (WER) results obtained for both MSA and ECA are shown in Table 1. From Table 1 we see that the performance of Arabic ASR for ECA is very poor compared to ASR for other languages such as English; this result is another motivation for this research.

Most previous work on Arabic ASR trained the system using one of two formats: the Romanized format or standard Arabic script without a Romanized transcript. Arabic ASR has concentrated on developing recognizers either for Modern Standard Arabic (MSA), a formal linguistic standard used throughout the Arabic-speaking world and employed in the media (e.g. broadcast news), lectures, courtrooms, etc., or for colloquial Arabic.
Table 1  WER (%) obtained for both MSA and ECA

Arabic language type   Year    Word error rate (WER)   Reference
MSA                    1997    15–20%                  Billa et al. (2002a, 2002b)
ECA                    96/97   61–56%                  Kirchhoff et al. (2002), Zavagliakos et al. (1998)
ECA                    2002    55.1–54.9%              Kirchhoff et al. (2002)
The SPHINX-IV engine will be customized in this research. SPHINX-IV is an open-source speech recognition engine built for research purposes by the speech research group at Carnegie Mellon University (CMU) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003). Many theses around the world have tackled SPHINX for speech recognition (Rosti 2004; Nedel 2004; Doh 2000; Ohshima 1993; Raj 2000; Huerta 2000; Rozzi 1991; Liu 1994; Gouva 1996; Seltzer 2000; Siegler 1999), but not for Arabic. Reasons for selecting this engine will be presented later. The SPHINX-IV architecture consists of a series of processes independent of each other, as will be shown in the next sections; each block in its architecture diagram represents one such process.

2 Review of speech recognition engines

In this section some of the well-known speech recognition engines are reviewed: SPHINX, the Hidden Markov Model Toolkit (HTK) and the Center for Spoken Language Understanding Toolkit (CSLU).

2.1 SPHINX engine

SPHINX is a large-vocabulary, speaker-independent, Hidden Markov Model (HMM)-based continuous speech recognition system. SPHINX was developed at CMU in 1988 (Russell et al. 1995; Christensen 1996; Rabiner and Juang 1993) and was one of the first systems to demonstrate the feasibility of accurate, speaker-independent, large-vocabulary continuous speech recognition. SPHINX-II (Russell et al. 1995) was one of the first systems to employ semi-continuous HMMs.

SPHINX is a collection of several ASR engines; it was created in collaboration between the SPHINX group at CMU, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL) and Hewlett-Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT). The current working engines of SPHINX are SPHINX I, II, III, IV and PocketSphinx. In addition to these engines, SPHINX has one trainer, which is capable of producing an acoustic model usable in all SPHINX versions except SPHINX-I; every SPHINX engine has its own characteristics and usage.

2.2 Hidden Markov Model Toolkit (HTK)

S. Young presented a framework for the HTK toolkit (Hermansky 1990). He stated that the Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications. It was originally developed at the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED), where it has been used to build large-vocabulary speech recognition systems. It consists of a set of library modules and a set of more than 20 tools. An HTK-based recognizer was included in both the ARPA September 1992 Resource Management Evaluation and the November 1993 Wall Street Journal CSR Evaluation, and in both cases performance was comparable with the systems developed by the main ARPA contractors.

In 1999 the current version of HTK was V2.2 and all rights to HTK rested with Entropic. At that time Entropic's major business focus was voice-enabling the Web, and Microsoft purchased Entropic in November 1999. Microsoft later decided to make the core HTK toolkit available again and licensed the software back for research and academic usage, so that it could distribute and develop the software for these purposes.

2.3 Hybrid systems

A. Ganapathiraju et al. described the use of a powerful machine learning scheme, Support Vector Machines (SVM) (Lee et al. 1990), within the framework of Hidden Markov Model (HMM) based speech recognition. They developed a hybrid SVM/HMM system based on their public-domain toolkit. The hybrid system was evaluated on the OGI Alphadigits corpus and performs at 11.6% WER, compared to 12.7% for a triphone mixture-Gaussian HMM system, while using only a fifth of the training data used by the triphone system. Several important issues that arise out of the nature of SVM classifiers were addressed.

3 Arabic language speech recognition research

Katrin Kirchhoff et al. worked on a project at the 2002 Johns Hopkins Summer Workshop (Kirchhoff et al. 2002) which focused on the recognition of dialectal Arabic. Three problems were addressed:

1. The lack of short vowels and other pronunciation information in Arabic texts.
2. The morphological complexity of Arabic.
3. The discrepancies between dialectal and formal Arabic.

They used the only standardized corpus of dialectal Arabic available at the time (2002), the LDC CallHome (CH) corpus of ECA. The corpus is accompanied by transcriptions in two formats: standard Arabic script without diacritics and a Romanized version, which is close to a phonemic transcription. An example of the Romanized form used in their experiments is shown in Table 2. They stated that Romanized Arabic is unnatural and difficult to read for native speakers; moreover, script-based recognizers (where acoustic models are trained on graphemes rather than phonemes) have performed well on Arabic ASR tasks in the past.

Table 2  ECA transliterated and Romanized sentence representations (Kirchhoff et al. 2002)

ECA
Transliterated script    AlHmd llh kwlsB w Antl Azlk
Romanized word forms     llHamdulillA kuwayyisaB wi inti izzayik

3.1 Automatic Romanizing Tool (ART)

When Katrin Kirchhoff et al. evaluated their system, WERs of 59.9% and 55.8% were obtained (evaluated against the script and Romanized transcriptions, respectively). They concluded that it would be advantageous to have a large amount of Romanized training data for the development of future Arabic ASR systems, and focused on building an ART rather than trying to explore the reasons behind these results.

In my opinion, the real reason for this result is that it is unfair to compare these two systems, as they are totally different. The first was trained on standard Arabic script without diacritics, while the other was trained using Romanized transcription, which includes vowel information; one thereby fools the system by hiding important information, such as the short vowels, in the former, while this information is present in the latter. For this reason, together with the fact that Romanized Arabic is unnatural and difficult to read for native speakers and the failure of out-of-corpus data that has proved successful in other languages (according to Katrin Kirchhoff et al.), we think that research on Arabic ASR should be done on original, fully or partially diacritized Arabic corpora, not Romanized ones, and should start by developing an APDT rather than an ART; as stated by Sir Thomas Elliot: "If physicians be angry, that I have written physics in English, let them remember that the Greeks wrote in Greek, the Romans in Latin, Avicenna and the other in Arabic, which were their own and proper maternal tongues" (CMU SPHINX trainer 2008).

Modular recurrent Elman neural networks (MRENN) for Arabic isolated speech recognition have been implemented (Young 1994). This is a special kind of recurrent network. The Elman network, originally developed for speech recognition, is a two-layer network in which the hidden layer is recurrent. The inputs to the hidden layer are the present inputs, and the outputs of the hidden layer from the previous time-step are saved in buffers called context units. Their work is a duplicate of previous work done by Ganapathiraju et al. (2000), but for the English language. They described a novel method of using recurrent neural networks (RNN) for isolated word recognition. Each word in the target vocabulary is modeled by a fully connected recurrent network. To recognize an input utterance, the best-matching word is determined based on its temporal output response. The system is trained in two stages. First, the RNN speech models are trained independently to capture the essential static and temporal characteristics of individual words. This is performed
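The Elman architecture described above, a two-layer network whose hidden layer feeds back through context units, can be sketched as follows. This is a generic illustration with made-up layer sizes, not the MRENN system itself:

```python
import numpy as np

class ElmanNetwork:
    """Minimal Elman recurrent network: the hidden layer receives the
    present input plus its own previous output, saved in context units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))       # input -> hidden
        self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
        self.context = np.zeros(n_hidden)                      # context units

    def step(self, x):
        # Hidden activation depends on the present input and the saved context.
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h          # save hidden output for the next time-step
        return self.W_out @ h     # output response at this time-step

    def respond(self, frames):
        """Accumulate the temporal output response over an utterance."""
        self.context[:] = 0.0
        return sum(self.step(f) for f in frames)
```

In the word-modeling scheme described above, one such network would be trained per vocabulary word, and the best-matching word chosen from the accumulated temporal response.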
State Transducers. They investigated both the use of knowledge-based and data-driven approaches.

4.2 Arabic speech sounds and properties

Arabic is a Semitic language and one of the oldest languages in the world today; it is the fifth most widely used language nowadays. The Arabic alphabet is used in several languages, such as Persian and Urdu (Hiyassat et al. 2005). Arabic linguistics came into being in the eighth century with the beginning of the expansion of Islam. This early start can be explained by the tremendous need felt by the members of the new community to know the language of the Holy Quran, which had become the official language of the young Islamic state (Al-Zabibi 1990).

Arabic linguists exerted huge effort explaining linguistic rules and Arabic grammar; however, this work did not last long, especially into the information era (Alghamdi et al. 2004).

The relative regularity of the syntax presents some advantages for its formalization. In addition, the Arabic language has the following characteristic: from one root, the derivational and inflectional systems are able to produce a large number of words, or lexical forms, each of which has specific patterns and semantics. In a certain sense, the Arabic language seems better suited for computers than English or French (Hadj-Salah 1983).

Contemporary Standard Arabic, a modernized version of classical Arabic, is the language commonly in use in all Arabic-speaking lands today. It is the language of science and learning, of literature and the theater, and of the press, radio and television. Notwithstanding the unanimous acceptability of Contemporary Standard Arabic and its general adoption as the common medium of communication throughout the Arab world, it is not the everyday speech of the people (Alghamdi et al. 2004).

4.3 Grapheme-based Pronunciation Dictionary for Arabic

Grapheme-to-phoneme conversion is an important prerequisite for many applications involving speech synthesis and recognition (Lee et al. 1998). For ASR this process is important in developing the PD; normally, as mentioned earlier, the PD is hand-crafted. In this section, a thorough description of the grapheme-based PD will be presented: first the importance of the PD for ASR will be described, then the rules of the Arabic phonological system, followed by a description of orthographic-to-phonetic transcription. The section will conclude with a description of the generation of the PD for both MSA and the Holy Quran.

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors, and with this availability comes the demand for recognizers in many languages that have so far received little speech recognition research. It is estimated that as many as four to six thousand different languages exist today (Alghamdi 2001). Therefore, increasing thought has been given to creating methods for automating the design of speech recognition systems for new languages while making use of the knowledge gathered from already-studied languages.

One of the core components of a speech recognition system is the PD. Its main purpose is to map the orthographic representation of a word to its pronunciation; the search space of the recognizer is the PD (Andersen et al. 1996). The performance of a recognition system depends on the choice of subunits and the accuracy of the PD. An accurate mapping of the orthographic representation of a word onto a subunit sequence is important to ensure recognition quality; otherwise the acoustic models are trained with the wrong data, or during decoding the calculation of the scores for a hypothesis is falsified by applying the wrong models (Schultz 2002; Schultz et al. 2004).

The PD lists the most likely pronunciation, or citation form, of all words that are contained in the speech corpus. Producing these pronunciations can range from very simple and achievable with automatic procedures to very complex and time-consuming (Fukada et al. 1999).

The creation of a PD is not a trivial task, as mentioned earlier, and the process has to be at least in part automated. With sufficient knowledge of the target language, one can try to build a set of rules that map the orthography of a word to its pronunciation. For some languages this might work very well; for others it might be almost impossible. Arabic is an example of a language with a very close grapheme-to-phoneme relation (Hadj-Salah 1983). Thus comparatively few rules suffice to build a PD containing the canonical information.
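Because diacritized Arabic has such a close grapheme-to-phoneme relation, a rule-based converter needs comparatively few rules. The sketch below illustrates the idea on romanized placeholder symbols; the mapping table, phoneme names and word list are invented for illustration, and a real tool would operate on Arabic script with diacritics:

```python
# Hypothetical romanized grapheme-to-phoneme table; real Arabic rules
# (sokon, shadda, pharyngealization, etc.) would be encoded the same way.
G2P_RULES = {
    "aa": "AA",   # long vowel
    "b": "B",
    "t": "T",
    "n": "N",
    "a": "AH",    # short vowel (fatha diacritic)
    "i": "IH",    # short vowel (kasra diacritic)
    "u": "UH",    # short vowel (damma diacritic)
}

def to_phonemes(word: str) -> str:
    """Greedy longest-match conversion of a romanized word to phonemes."""
    phones, i = [], 0
    while i < len(word):
        # Try the longest grapheme first so "aa" wins over "a".
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in G2P_RULES:
                phones.append(G2P_RULES[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return " ".join(phones)

def build_pd(words):
    """Emit one dictionary line per unique word: WORD <tab> PH1 PH2 ..."""
    return [f"{w}\t{to_phonemes(w)}" for w in sorted(set(words))]

# Illustrative transcript vocabulary (invented romanized words)
print("\n".join(build_pd(["baab", "bint", "tab"])))
```

The same lookup structure extends to statistical approaches by replacing the fixed table with learned grapheme-to-phoneme alignments.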
4.4 Automatic versus hand-crafted Pronunciation Dictionaries

Recognition quality is maintained by maintaining the quality of the PD, with which the orthography of a word is mapped to the way it is pronounced by the speakers. The best dictionaries, such as CMUdict (the pronunciation dictionary created by Carnegie Mellon University), are usually hand-crafted (Fukada et al. 1999; Killer et al. 2003). However, manually created dictionaries require an expert in the target language (Killer et al. 2004), and this is a time-consuming and costly approach, especially for large-vocabulary speech recognition. If no language-expert knowledge is available or affordable, methods are needed to automate the PD creation process. Several different methods have been introduced over time; most of them are based on the conversion of the orthographic transcription to a phonetic one, using either rule-based (Killer et al. 2003) or statistical approaches (Killer et al. 2004).

In order to reduce both the cost and the time required to develop LVCSR systems, the problem of creating the PD must be solved. In the following sections the development of an automatic PD tool for Standard Arabic will be described.

4.5 Segmenting Arabic utterances

The first basic rule, which operates in the phonological system of Arabic without exception, is that the number of syllables in an utterance is equal to the number of vowels. The issue, then, is not the number of syllables in an utterance, since this is automatic, but rather the boundaries, which are signaled by zero, one or two consonants (Alghamdi et al. 2004).

The second basic rule of Arabic phonology is that the onset of the syllable equals the beginning of an utterance; thus, both can begin with a single consonant. For example, in ( ) the first phoneme is the consonant n and the second is the short vowel ( ) (Alghamdi et al. 2004).

The third rule is that the coda of the syllable is identical with the end of an utterance, coinciding with the codas of the six syllable types previously discussed. Accordingly, syllables in Arabic can be either open or closed, i.e., they can end in one or two consonants, respectively.

Clearly, then, one should use the three rules just discussed. When properly applied, these rules enable one to segment almost any utterance in Arabic correctly and easily, for they make the division between the coda and the onset of nearly all contiguous syllables clear-cut.

4.6 Orthographic to phonetic transcription

Conversion of Arabic phonetic script into rules is one of the major obstacles facing researchers on Arabic text-to-speech systems and speech recognition. Although Arabic is one of the oldest languages whose sounds and phonological rules were extensively studied and documented (more than 12 centuries ago) (Alghamdi et al. 2004), these valuable studies need to be compiled from the scattered literature and formulated in a modern mathematical framework. The aim of this section is to formulate the grapheme-to-phoneme relationship for Arabic.

Arabic is an algorithmic language, at least from the phonology, writing and derivation points of view. For example, no rule can explain the pronunciation of "g" in English in the words "laugh", "through", "good" and "geography", while Arabic has a direct grapheme-to-phoneme mapping for most graphemes. In general, Arabic text with diacritics is pronounced as it is written, following certain rules. Contrary to English, Arabic does not have words with different orthographic forms and the same pronunciation.

There are sixteen essential rules in orthographic-to-phonetic transcription (Al-Zabibi 1990; Hadj-Salah 1983).1 These rules are:

1. The sokon sign ( ) is not a symbol of any phoneme; its meaning is that the consonant it marks is followed by another consonant without an intermediate vowel (i.e. whether it is present or not, it does not affect the pronunciation of the consonant itself). For example, ( ) means that ( ) will be pronounced as is, without introducing any vowels.
2. The ( ) after the group waw, as in ( ), is not pronounced.
3. Pharyngealization (emphasis): there are pharyngealized consonants in standard Arabic, where the consonant is stressed when pronounced. An example is the word ( ); the sign here is used as stress when the ( ) is pronounced ( ).

1 URL: http://www.phonetik.unimuenchen.de/Forschung/BITS/
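The first segmentation rule of Sect. 4.5, that the number of syllables in an utterance equals the number of vowels, is directly checkable in code. The sketch below counts vowels in a romanized, fully diacritized transcription; the romanization scheme and example words are assumptions for illustration:

```python
# Short vowels a/i/u (diacritics) and long vowels aa/ii/uu in a
# hypothetical romanization of fully diacritized Arabic.
LONG_VOWELS = ("aa", "ii", "uu")
SHORT_VOWELS = ("a", "i", "u")

def count_syllables(utterance: str) -> int:
    """Rule 1: syllable count == vowel count (a long vowel counts once)."""
    count, i = 0, 0
    while i < len(utterance):
        if utterance[i:i + 2] in LONG_VOWELS:
            count += 1
            i += 2
        elif utterance[i] in SHORT_VOWELS:
            count += 1
            i += 1
        else:
            i += 1   # consonants only mark syllable boundaries
    return count

print(count_syllables("kataba"))   # three short vowels -> 3 syllables
```

Locating the boundaries themselves would then apply the second and third rules to decide how the zero, one or two consonants between vowels split into coda and onset.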
phonemes, or do they perform comparably well? How do we cluster graphemes into poly-graphemes? How do we generate the questions to build up the decision tree?

Apart from dialectal Arabic, there are two kinds of pronunciation for the Arabic language: the MSA pronunciation and the Holy Quran pronunciation (Baugh and Cable 1978). The standard Arabic pronunciation is governed by the rules mentioned earlier in Sect. 4.5, while the Holy Quran pronunciation is governed by the so-called Tajweed rules, which will be described in Sect. 5.3. The proposed PD deals with both of these pronunciations.

5 Experimental environment

In this section the development of the Arabic corpora and the baseline system used in experimenting with the developed PD will be presented, namely the Holy Quran Corpus (HQC-1), the Command and Control Corpus (CAC-1) and the Arabic Digits Corpus (ADC). The focus of our research is on developing these corpora in order to facilitate testing the PD already developed. Selecting the SPHINX-IV engine, which is built on an open architecture, makes the results we present independent of the specific recognition engine used. The particular aspects of the speech databases are presented to provide the reader with useful context for interpreting our results and to give other researchers enough information to repeat and validate our experiments.

5.1 Arabic corpus and baseline system

Most of the research done on SPHINX-IV used either the Wall Street Journal corpora (s3-94, s0-94) and/or the Resource Management (RM) corpus (Rosti 2004; CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Nedel 2004; Ohshima 1993; CMU SPHINX trainer 2008; Young 1994; Hiyassat et al. 2005; Al-Zabibi 1990). These corpora are also used for other Latin-script languages, such as French or Italian, due to the similarities between these languages and English from the phoneme point of view.

Unfortunately, there are great differences between English and Arabic from the phoneme point of view, due to the existence of special phonemes such as Dhad, Dh, Tah, aeen, ghaeen, haa, ssad, KHaa and Qaaf. Although some researchers have already used English corpora for Arabic speech recognition purposes, most of these approaches did not offer good performance (according to Kirchhoff et al. (2002), the WER obtained is 59.9% for Romanized Arabic, which is not comparable to English ASR WER). To this effect, we decided to build a pure formal Arabic corpus to be used in testing our algorithm. This corpus may also serve as a benchmark for future research.

In building a corpus for any language, a certain domain should be selected and a domain-dependent transcription obtained. Recording this transcription is done using different speakers in a sound-isolated booth, sampled at different sampling rates (Rosti 2004; Huang et al. 2003; Raj 2000; Alghamdi et al. 2004; Killer et al. 2003).

Of course, such tasks are exhaustive in both time and cost and beyond an individual's capabilities. Usually such tasks are done through bodies such as the Defense Advanced Research Projects Agency (DARPA), Johns Hopkins University (JHU), Carnegie Mellon University (CMU), the Cambridge HTK group (HTK) and the Network for Euro-Mediterranean Language Resources (NEMLAR) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Fukada et al. 1999; Mimer et al. 2004; Black et al. 1998).

As mentioned earlier, the Arabic alphabet only contains letters for long vowels and consonants. Short vowels and other pronunciation phenomena, such as consonant doubling, can be indicated by diacritics (short strokes placed above or below the preceding consonant). However, Arabic texts are almost never fully diacritized and are thus potentially unsuitable for recognizer training, except the Holy Quran and a few school textbooks. The Holy Quran is considered the most important reference for the Arabic language.

5.2 Corpus design criteria

Developing a speech corpus is not a trivial task and needs resources to be allocated; unfortunately, such resources do not exist for the purposes of this research. In some research efforts, a figure of hundreds of thousands of dollars is considered a limited budget. Some researchers consider a size of 40 hours of broadcast news enough, given the high cost of the resources (Mimer et al. 2004).
to 16 kHz, divided into small utterances and then transcribed, resulting in a total of 59,428 words and 25,740 unique words, with about 18.35 hours of recording. It took a total of about 732 working hours to build this corpus.

5.4.4 Feature extraction

For every recording in the training corpus, a set of feature files is computed from the audio training data. Each recording can be transformed into a sequence of feature vectors using the front-end executable provided with the SPHINX-IV training package, as explained earlier.

The process starts with pre-emphasis, applied first in order to reduce noise by applying a high-pass filter. Then a Hamming window is applied in order to slice the data into a number of overlapping windows (usually referred to as frames in the speech world). After applying the Hamming window, the FFT is applied to compute the discrete Fourier transform of the input sequence; this analyzes the signal into its frequency components. A Mel filter bank (MFFB) is applied to the output of the FFT; the output is an array of filtered values, typically called the Mel spectrum, each corresponding to the result of filtering the input spectrum through an individual filter. Therefore, the length of the output array is equal to the number of filters created. In order to obtain the MFCCs, the DCT is applied, and the mean of the MFCCs is computed to perform cepstral mean normalization (CMN); this is done in order to reduce the distortion caused by the transmission channel. After computing the CMN, its first and second derivatives are computed in order to model the speech signal dynamics (all as explained earlier).

5.4.5 Transcription file

The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); the utterances consist of command and control words, as shown in Table 6.

The baseline system is trained using about 2 hours of speech in CAC-1, including all conditions together. About 10 minutes of evaluation data are used for the test in this research.

6 Pronunciation Dictionary creation

In order to create the PD, the APDT described earlier is invoked; the APDT needs a transcription file and produces the PD based on it. Once the APDT is invoked, two files are created: one is the PD and the other is a file containing the transcription with pronunciation alignment, so that each word in the transcription is mapped to its pronunciation in the PD file. The PD file has all acoustic events and words in the transcripts mapped onto the acoustic units we want to train. Redundancy in the form of extra words is permitted. The dictionary must have all alternate pronunciations marked with parenthesized serial numbers starting from (2) for the second pronunciation; the marker (1) is omitted. Each word in the dictionary is followed by its pronunciation, as shown in Fig. 3.

6.1 Filler dictionary

The filler dictionary lists the non-speech events as words and maps them to user-defined phones. This dictionary must at least have the entries shown in Fig. 4.

The dictionaries, pronunciation and filler, are developed in a similar way as for HQC-1. Since the CAC-1 corpus is a planned one, transcription was done easily: each speaker is asked to say exactly the same words in the same order, so only mapping of the recordings to the control file is needed. Of course, making sure that the recording reflects exactly the transcription is essential; otherwise we fool the system in the training phase.
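The front-end steps of Sect. 5.4.4 (pre-emphasis, Hamming windowing, FFT, Mel filter bank, DCT, cepstral mean normalization) can be sketched with NumPy as below. This is a minimal illustration of the pipeline, not the SPHINX-IV front-end itself; the frame size, hop, filter count and cepstral order are assumed values:

```python
import numpy as np

def mfcc_frames(signal, rate=16000, frame_len=400, hop=160,
                n_filters=26, n_ceps=13):
    # 1. Pre-emphasis: simple high-pass filter y[t] = x[t] - 0.97 x[t-1].
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i*hop:i*hop+frame_len] * window
                       for i in range(n_frames)])

    # 3. FFT -> power spectrum (frequency components of each frame).
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 4. Mel filter bank: triangular filters spaced evenly on the Mel scale.
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595) - 1)
    mel_points = np.linspace(0, hz_to_mel(rate / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_points) / rate).astype(int)
    fbank = np.zeros((n_filters, spectrum.shape[1]))
    for f in range(1, n_filters + 1):
        lo, mid, hi = bins[f - 1], bins[f], bins[f + 1]
        fbank[f - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[f - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    mel_spectrum = np.log(spectrum @ fbank.T + 1e-10)  # one value per filter

    # 5. DCT-II of the log Mel spectrum -> MFCCs (keep first n_ceps).
    n_mel = mel_spectrum.shape[1]
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mel)[None, :]
    dct_mat = np.cos(np.pi * k * (2 * m + 1) / (2 * n_mel))
    ceps = mel_spectrum @ dct_mat.T

    # 6. Cepstral mean normalization: subtract the per-utterance mean
    #    to reduce transmission-channel distortion.
    return ceps - ceps.mean(axis=0)
```

A full front-end would append the first and second time derivatives of these coefficients to model the speech dynamics, as the section describes.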
7 Arabic Digits Corpus (ADC)

The third corpus, the ADC, is also developed entirely in this research. This corpus is built for developing an Arabic digit recognition model for the digits zero, one, …, nine. The ADC is developed using recordings of 142 speakers; Table 7 shows the details of those speakers. This corpus was developed in exactly the same manner and the same environment as the CAC-1 corpus. The ADC consists of two disjoint sets of utterances: 1213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers; details are shown in Table 7. The total length of the training utterances is about 0.67 hr. The baseline system is trained using about 35 minutes of speech, including all conditions together. About 7 minutes of evaluation data are used for testing in this research.

7.1 Training and evaluation

Once the model definition file is ready, training is started by initializing the model parameters and running the Baum-Welch algorithm described in the next sections.

7.2 Effect of the number of Gaussians on the ADC

Table 8 through Table 14 show the different performance measures for the ADC; from these tables it is …

Table 5 Details of speakers participating in the CAC-1 corpus

  Gender    Total
  Male      118
  Female     82

Table 6 Words used in the CAC-1 corpus

Table 8 Number of Gaussians versus word accuracy

  Number of Gaussians    Accuracy (%)
    1     80.159
    2     84.127
    4     88.889
    8     89.683
   16     88.889
   32     77.778
   64     69.048
  128     65.079
  256     65.079
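The training step described in Sect. 7.1 (initialize the model parameters, then re-estimate them with Baum-Welch) can be illustrated for a discrete-emission HMM. This is only a minimal sketch of the re-estimation idea: SPHINX-IV trains continuous-density HMMs with Gaussian-mixture emissions over many utterances, so none of this code reflects the engine's actual implementation.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Scaled forward pass; returns scaled alphas and per-frame scales c[t]."""
    T, N = len(obs), len(pi)
    alpha, c = np.empty((T, N)), np.empty(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    return alpha, c

def backward(A, B, obs, c):
    """Scaled backward pass matching the forward scaling."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    return beta

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete-emission HMM."""
    T, N = len(obs), len(pi)
    alpha, c = forward(A, B, pi, obs)
    beta = backward(A, B, obs, c)
    gamma = alpha * beta                       # state posteriors; rows sum to 1
    xi = np.zeros((N, N))                      # expected transition counts
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A *
               (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    pi_new = gamma[0]
    A_new = xi / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    log_likelihood = np.log(c).sum()           # log P(obs | old parameters)
    return A_new, B_new, pi_new, log_likelihood
```

Iterating `baum_welch_step` from an initial guess gives the EM training loop of Sect. 7.1: each step returns re-normalized transition, emission and initial-state parameters, and the per-utterance log-likelihood is guaranteed not to decrease between iterations.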
Table 9 Number of Gaussians versus number of errors

  Number of Gaussians    Sub    Ins    Del
    1     24     0      1
    2     18     0      2
    4     12     0      2
    8      9     0      4
   16      9     0      5
   32      9     0     19
   64      9     0     30
  128      9     0     35
  256      9     0     35

Table 7 Details of speakers participating in the ADC corpus

  Gender    Total
  Male       85
  Female     57

Int J Speech Technol (2006) 9: 133–150
Table 10 Number of Gaussians versus WER

  Number of Gaussians    WER (%)
    1     19.841
    2     15.873
    4     11.111
    8     10.317
   16     11.111
   32     22.222
   64     30.952
  128     34.921
  256     34.921

Table 13 Number of Gaussians versus speed as a ratio of real-time audio

  Number of Gaussians    Speed (× real time)
    1      0.06
    2      0.06
    4      0.05
    8      0.07
   16      0.08
   32      0.09
   64      0.11
  128      0.13
  256      0.15

Table 11 (caption lost)

  Number of Gaussians    Correct words (of 126)
    1     101
    2     106
    4     112
    8     113
   16     112
   32      98
   64      87
  128      82
  256      82

Table 14 (caption lost)

  Number of Gaussians
    1      7.40
    2      7.53
    4      8.44
    8      9.40
   16     11.39
   32     16.08
   64     25.17
  128     43.13
  256     79.02

Table 12 (caption lost)

  Number of Gaussians    Correct words    Accuracy (%)
    1     101    80.159
    2     106    84.127
    4     112    88.889
    8     113    89.683
   16     112    88.889
   32      98    77.778
   64      87    69.048
  128      82    65.079
  256      82    65.079
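The figures in Tables 8 through 12 are mutually consistent under the standard error-counting definitions, which can be checked directly. Here N = 126 test words is inferred from the tables themselves (e.g. 101/126 = 80.159%); it is not stated explicitly in this excerpt:

```python
def wer(sub, ins, dele, n_ref):
    """Word error rate in percent: (substitutions + insertions + deletions) / N."""
    return 100.0 * (sub + ins + dele) / n_ref

def word_accuracy(sub, ins, dele, n_ref):
    """Word accuracy in percent, the complement of WER."""
    return 100.0 - wer(sub, ins, dele, n_ref)

# Reproduce the tabulated figures for 8 Gaussians (S=9, I=0, D=4; N=126
# test words is an inference from the tables, not a stated value):
print(round(wer(9, 0, 4, 126), 3))            # 10.317, as in Table 10
print(round(word_accuracy(9, 0, 4, 126), 3))  # 89.683, as in Table 8
```

The same check reproduces every row: for 1 Gaussian, (24+0+1)/126 gives the 19.841% WER of Table 10 and the 80.159% accuracy of Table 8, matching the 101 correctly recognized words of Table 11.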
8 Summary and conclusions

In this research we have used SPHINX-IV for Arabic speech recognition, built speech recognition resources for Arabic, and built new tools suitable for Arabic recognition that do not originally exist in SPHINX-IV, such as the APDT and the linguistic questions; we also investigated fine-tuning the SPHINX-IV parameters for this purpose. In this section, we present a summary of this research and the relevant observations that we have drawn from our investigation in training, fine-tuning and testing using the three acoustic models (HQC-1, CAC-1 and ADC) based on our proposed dictionary. Some comments on future research directions and unresolved questions are presented too. The section closes with a final summary and conclusions.

What is most unique about this research is the APDT algorithm we developed and tested. Three Arabic corpora, namely HQC-1, CAC-1 and ADC, were created to provide an acceptable level of training and testing of our system. The recognition performance obtained using these corpora and the dictionary produced by our APDT for Arabic is very successful. To the best of our knowledge, neither this tool nor the HQC-1 corpus existed prior to this research. The SPHINX-IV parameters were tuned using a global search algorithm, the training data were extended with the help of a neural network, and a features-based (not Romanized) system for Arabic speech recognition based on SPHINX-IV technology finally resulted. Our system could be the basis for any future open-source research on Arabic speech recognition, and we intend to keep it open for the research community.

An automatic toolkit for generating the pronunciation dictionary (PD) is fully developed and tested in this work. This toolkit is a rule-based pronunciation tool. The PD (HUSDICT60) is produced for both formal Arabic and the Holly Quraan. HUSDICT60 contains 59,424 words; this dictionary will be made freely available. Three corpora are entirely developed by the author of this work. For the Holly Quraan Corpus (HQC-1), about 7,742 recordings were processed and then transcribed, which results in a total of 59,428 words (25,740 of them unique) and about 18.35 hours of recordings; this process consumed about 432 working hours. Note that one research effort at Carnegie Mellon University (CMU) used about 1,400 hours of speech for training one system (CMU SPHINX Open Source Speech Recognition Engines 2007). Results are shown in Fig. 7 and Fig. 8.

The CAC-1 corpus consists of two disjoint sets of utterances: 5628 training utterances collected from 103 male and 74 female Arabic native speakers, and 372 testing utterances from 15 male and 8 female speakers. The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); final results for this corpus are shown in Fig. 9.

The ADC corpus is developed using recordings of 142 Arabic native speakers. This corpus is concerned with developing an Arabic digit recognition model for the digits zero, one, …, nine. The ADC consists of two disjoint sets of utterances: 1213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers. The total length of the training utterances is about 2431 seconds (Fig. 10).

From the results obtained throughout this work, many suggestions for future work are recommended, as shown in the coming subsections. One major weakness of conventional HMMs is that they do not provide an adequate representation of the temporal structure of speech, because the probability of state occupancy decreases exponentially with time. This issue is a promising area to investigate, and many questions about HMM modeling and temporal structuring can be studied. Another issue is the training of the HMM: although Ant Colony Optimization is a stochastic and discrete optimization algorithm, we believe that it could be promising if adapted for training speech recognition models, or at least could be used to optimize the training process. As a final word, it should be noted that most Arabic texts are almost never fully diacritized, and are thus potentially unsuitable for recognizer training, except for the Holly Quraan, a few other textbooks, and some old religious books. In addition, electronic versions of such texts are not always available. There should be an Arabic effort to create diacritized corpora for both speech recognition and text-to-speech research. During this research, about 200,000 unique diacritized words were collected and are now available on our free website corpus, as mentioned earlier.

References

Al-Zabibi, M. (1990). An acoustic-phonetic approach in automatic Arabic speech recognition. The British Library in Association with UMI.

Alghamdi, M. (2001). Arabic phonetics. Riyadh: Altawbah Printing.

Alghamdi, M., Al-Muhtaseb, H., & Elshafei, M. (2004). Arabic phonological rules. Journal of King Saud University: Computer Sciences and Information, 16, 1–25 (in Arabic).

Andersen, O., Kuhn, R., et al. (1996). Comparison of two tree-structured approaches for grapheme-to-phoneme conversion. In ICSLP 96 (Vol. 3, pp. 1700–1703), Oct. 1996.

Baugh, A. C., & Cable, T. (1978). A history of the English language. Oxon: Redwood Burn Ltd.

Billa, J., et al. (2002a). Arabic speech and text in Tides On Tap. In Proceedings of HLT, 2002.

Billa, J., et al. (2002b). Audio indexing of broadcast news. In Proceedings of ICASSP, 2002.

Black, A., Lenzo, K., & Pagel, V. (1998). Issues in building general letter to sound rules. In Proceedings of the ESCA workshop on speech synthesis, Australia (pp. 77–80), 1998.

Christensen, H. (1996). Speaker adaptation of hidden Markov models using maximum likelihood linear regression. Ph.D. Thesis, Institute of Electronic Systems, Department of Communication Technology, Aalborg University.

CMU SPHINX Open Source Speech Recognition Engines. URL: http://www.speech.cs.cmu.edu/ (2007).

CMU SPHINX trainer Open Source Speech Recognition Engines. URL: http://www.cmusphinx.org/trainer (2008).

Doh, S.-J. (2000). Enhancements to transformation-based speaker adaptation: principal component and inter-class maximum likelihood linear regression. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

El Choubassi, M. M., El Khoury, H. E., Jabra Alagha, C. E., Skaf, J. A., & Al-Alaoui, M. A. (2003). Arabic speech recognition using recurrent neural networks. Electrical and Computer Engineering Department, Faculty of Engineering and Architecture, American University of Beirut.

Essa, O. (1998). Using prosody in automatic segmentation of speech. In Proceedings of the ACM 36th annual southeast conference (pp. 44–49), Apr. 1998.

Fukada, T., Yoshimura, T., & Sagisaka, Y. (1999). Automatic generation of multiple pronunciations based on neural networks. Speech Communication, 27, 63–73.

Ganapathiraju, A., Hamaker, J., & Picone, J. (2000). Hybrid SVM/HMM architectures for speech recognition. In Proceedings of the international conference on spoken language processing (Vol. 4, pp. 504–507), November 2000.

Gouvêa, E. B. (1996). Acoustic-feature-based frequency warping for speaker normalization. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Hadj-Salah, A. (1983). A description of the characteristics of the Arabic language. In Applied Arabic linguistics, signal & information processing, Rabat, Morocco, 26 September–5 October 1983.
Hain, T., et al. (2003). Automatic transcription of conversational telephone speech: development of the CU-HTK 2002 system (Technical Report CUED/F-INFENG/TR.465). Cambridge University Engineering Department. Available at http://mi.eng.cam.ac.uk/reports/.

Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87, 1738–1752.

Hiyassat, H., Nedhal, Y., & Asem, E. (2005). Automatic speech recognition system requirement using Z notation. In Proceedings of AMSE 05, Roan, France, 2005.

Huang, X., Alleva, F., Wuen, H., Hwang, M.-Y., & Rosenfeld, R. (2003). The SPHINX-II speech recognition system: an overview. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 2003.

Huerta, J. M. (2000). Robust speech recognition in GSM codec environments. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Killer, M., Stüker, S., & Schultz, T. (2003). Grapheme based speech recognition. In Eurospeech, Geneva, Switzerland, September 2003.

Killer, M., Stüker, S., & Schultz, T. (2004). A grapheme based speech recognition system for Russian. In SPECOM 2004: 9th conference, speech and computer, St. Petersburg, Russia, September 20–22.

Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R., & Vergyri, D. (2002). Novel approaches to Arabic speech recognition. The 2002 Johns-Hopkins summer workshop, 2002.

Lee, K., Hon, H., & Reddy, R. (1990). An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), 35–45.

Lee, T., Ching, P. C., & Chan, L. W. (1998). Isolated word recognition using modular recurrent neural networks. Pattern Recognition, 31(6), 751–760.

Liu, F.-H. (1994). Environmental adaptation for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA.

Mimer, B., Stüker, S., & Schultz, T. (2004). Flexible decision trees for grapheme based speech recognition. In Proceedings of the 15th conference Elektronische Sprachsignalverarbeitung (ESSV), Cottbus, Germany, 2004.

Nedel, J. P. (2004). Duration normalization for robust recognition of spontaneous speech via missing feature methods. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Ohshima, Y. (1993). Environmental robustness in speech recognition using physiologically-motivated signal processing. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Pallet, D. S., et al. (1999). 1998 Broadcast news benchmark test results. In Proceedings of the DARPA broadcast news workshop, Herndon, Virginia, February 28–March 3, 1999.

Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice-Hall.

Raj, B. (2000). Reconstruction of incomplete spectrograms for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Rosti, A.-V. I. (2004). Linear Gaussian models for speech recognition. Ph.D. Thesis, Wolfson College, University of Cambridge.

Rozzi, W. A. (1991). Speaker adaptation in continuous speech recognition via estimation of correlated mean vectors. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proceedings of IJCAI.

Schultz, T. (2002). GlobalPhone: a multilingual speech and text database developed at Karlsruhe University. In Proceedings of the ICSLP, Denver, CO, 2002.

Schultz, T., Alexander, D., Black, A., Peterson, K., Suebvisai, S., & Waibel, A. (2004). A Thai speech translation system for medical dialogs. In Proceedings of the human language technologies (HLT), Boston, MA, May 2004.

Seltzer, M. L. (2000). Automatic detection of corrupt spectrographic features for robust speech recognition. Master's Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Siegler, M. A. (1999). Integration of continuous speech recognition and information retrieval for mutually optimal performance. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Young, S. J. (1994). The HTK hidden Markov model toolkit: design and philosophy (CUED/F-INFENG/TR.152). Engineering Department, University of Cambridge.

Zavagliakos, G., et al. (1998). The BBN Byblos 1997 large vocabulary conversational speech recognition system. In Proceedings of ICASSP, 1998.