A Weighted Combination of Speech with Text-based Models for Arabic

Diacritization

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain

Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain

Abstract
Predicting the diacritics in Arabic text is a
necessary step in most Arabic NLP
applications, but is challenging due to the
language's complex syntax and morphology.
The diacritization of case endings has
particularly suffered from a higher error rate
than the rest of the text. Current research
focuses on solving the problem using textually
inferred information alone. This paper proposes a novel approach: it explores the effect of combining speech input with a text-based model, allowing the linguistically insensitive information from speech to correct and complement the text model's predictions. The acoustic model is
based on Hidden Markov Models and the
textual model on Conditional Random Fields.
We demonstrate that introducing speech to
diacritization significantly reduces error rates
across all metrics, especially case endings. We
make a comparison between SVMs and CRFs
for diacritization, and between two widely used
tools in the industry, BAMA and MADA, in the
context of our system. The results in this paper
are the most accurate reported to date, with
diacritic and word error rates of 1.6 and 5.2
inclusive of case endings, and 1.0 and 3.0
exclusive of them.

1 Introduction

The diacritization of raw Arabic text is a necessary step for most applications. Because Arabic uses an abjad script, words are written as sequences of consonants, without vowels or other pronunciation cues. Given the language's highly inflective nature and morphological complexity, this means that a
single sequence of three consonants could easily
represent over fifty unique words. While this is not
problematic for native readers, it poses serious
problems in textual disambiguation for automated
systems, as well as for learners of the language.
Zitouni et al. (2006) describe the eight diacritics in detail. Of these eight, we make special mention of the three tanweens, or nunation diacritics, as they may appear only as case endings. Also, the shadda, or gemination diacritic, may appear in combination with other diacritics.
Diacritization can be divided into two subproblems. Lexemic diacritization disambiguates the different lexemes that may arise when a single sequence of consonants is marked with different combinations of diacritics. Inflectional diacritization disambiguates the different syntactic roles that a specific lexeme may assume in a sentence, and is typically expressed on the final consonant of a word as a case ending (Habash and Rambow, 2007). An example of lexemic diacritization is presented in Table 1. Considering the third meaning (books), different inflectional diacritics applied on the final letter of the word could represent different syntactic roles of the books in the sentence, such as whether they are the subject of a verb, or an object, and so on.
Predicting inflectional diacritics is a complex grammatical problem (Habash et al., 2007). Error rates for inflectional diacritization are therefore higher than other error rates.
Word consonants            Pronunciation   Meaning
ktb (without diacritics)   /kataba/        he wrote
                           /kattaba/       he made someone write
                           /kutubun/       books

Table 1. Three of several valid diacritizations of the Arabic consonants that represent /k/, /t/ and /b/.
Many approaches have been proposed to deal
with diacritization, but mostly using cues from
textual input alone. Acoustic input has only been
studied in the context of automatic speech
recognition (ASR).
In this paper we make three main contributions:
(1) we propose and describe a system that
combines speech input with a textual resource to
correct the errors induced by text-only
diacritization, especially in case endings; (2) we
compare the effects of using SVMs versus CRFs in
our text model; and (3) we compare the use of two
widely used tools, BAMA and MADA, in a
speech-and-text combined diacritizer. Our results
show higher accuracy rates than the current state-of-the-art systems, without the need for explicitly
learning complex linguistic information. The most
important outcome of this work is bringing the
accuracy of inflectional diacritization to a level
comparable to that of lexemic diacritization. This
work is conducted in Modern Standard Arabic, but it requires minimal linguistic features and could be applied to Arabic dialects for which less sophisticated lexicons are available.
The rest of this paper is organized as follows. In
Section 2 we discuss related work. We present the
methodology of the system and a description of the
individual text-based and speech-based diacritizers
in Section 3. Their combination, evaluation and
experiments are described in Section 4. Finally, we
conclude the paper in Section 5.

2 Related Work

Few notable efforts in Arabic diacritization that use cues from acoustic signals have been published. Most work has been text-based, facilitated by the growth of large annotated text corpora in Modern Standard Arabic (MSA) and the development of tools such as the Buckwalter Morphological Analyzer (BAMA). Given raw input text, BAMA generates a list of possible diacritized solutions for each word. Hence predicting the correct diacritics for a word may be treated as a selection of the best generated solution.
Habash and Rambow (2007) follow this
approach, selecting the most likely solution
generated by their own morphological analyzer.
They train SVM-based taggers for several
linguistic features, and combined with a lexeme
language model, they compute collective scores to
rank each solution. Their system, MADA, achieves a DER of 2.2 and a WER of 5.5,
excluding case endings. With case endings, the
DER and WER are 5.0 and 15.1.
Nelken and Shieber (2005) use a cascade of
weighted finite-state transducers that integrate
morphology with word and letter-based language
models to predict diacritics. Schlippe (2008) makes
a comparison between CRF-based diacritization
and statistical machine translation of character and
word n-grams. Zitouni et al. (2006) use Maximum
Entropy classifiers to predict the diacritics for a
given sequence of consonants, using POS, lexical
and segment-based features.
The most accurate diacritization achieved to date is that of Rashwan et al. (2011). They build a
dual-mode stochastic system. In the first mode, it
selects the most likely full-form diacritized words
using maximum marginal probability from a lattice
search. The second mode backs off to selecting the
most likely diacritization from a lattice of
morphemes. They report 3.1 DER and 12.5 WER
with case endings, and 1.2 DER and 3.2 WER
without.
All of the above studies use only textual input.
Vergyri and Kirchhoff (2004) presented one of the
few diacritization methods that incorporate speech,
albeit specifically in the context of ASR. They
tagged solutions from BAMA with the help of an
acoustic signal. Their results showed that speech-based diacritization could be improved significantly by incorporating textually-extracted contextual and morphological information.
This paper takes essentially the opposite view. It demonstrates the effectiveness of speech in correcting errors induced by text-based methods, exploring the use of acoustic input on its own merits for diacritization rather than for ASR.

3 System Description

3.1 Overview

Text-based diacritization methods are not yet able to predict inflectional diacritics as accurately as lexemic diacritics. However, the human mind is certainly capable of doing so. Our system makes use of human intuition via speech data and combines it with text-based diacritization to smooth the prediction accuracy across all of the text, including case endings.
The system accepts two streams of data: raw Arabic text, and an acoustic signal of the correctly vocalized speech corresponding to that text. Two independent diacritizers are employed: one that is text-based and modeled by Conditional Random Fields (CRFs), and one that is speech-based and modeled by Hidden Markov Models (HMMs). Figure 1 summarizes the system's architecture and process, given an input word w.
Let W be the set of raw input words. Then for each word w ∈ W, we have a set of potential diacritized solutions, Dw. These solutions are generated by BAMA 2.0. Since our task is diacritization, we are only concerned with the unique sequence of diacritics on a string of consonants, and not with its morphological analyses. Therefore, if BAMA produces several different morphological possibilities for a word, all with the same sequence of diacritics, then the word is counted as having a single solution. If a word w has n unique solutions, then Dw is expressed as:

Dw = { dw,1, dw,2, …, dw,n }    (1)

This set of solutions is taken as input to both diacritizers. Let x be a tuple (dw,i, sw,i) that relates a potential solution dw,i ∈ Dw with its likelihood score, sw,i. The acoustic and text models are used to independently generate each solution's score, producing a set of n tuples for each word.
We denote the speech-based diacritizer's scored set of solutions for a word w by Sw, and the text-based diacritizer's by Tw. Sw can be described as:

Sw = { x | x = (dw,i, sw,i), dw,i ∈ Dw, w ∈ W }    (2)

Similarly, Tw, with likelihood score tw,i, is:

Tw = { x | x = (dw,i, tw,i), dw,i ∈ Dw, w ∈ W }    (3)

Let the score of a tuple in these sets be denoted by Sw[dw,i] and Tw[dw,i]. A weighted combination is applied to the tuples in sets Sw and Tw, and the final diacritized solution dw* is that which maximizes the combination. This is expressed below:

dw* = argmax (α·Sw[dw,i] + β·Tw[dw,i])    (4)
      dw,i ∈ Dw

where α + β = 1.

Figure 1. Combined diacritization architecture: for an input word w, raw text is analyzed by BAMA into Dw = {dw,1, …, dw,n}; the speech model scores each solution as (dw,i, si) and the text model as (dw,i, ti); a weighted interpolation of the scores selects the best diacritized solution dw*.

We discuss the details of the scoring algorithms and the weighted interpolation in the following sections. Both models use supervised training from the same data source, enabling us to fairly compare their individual and combined performances.
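To make the interpolation concrete, here is a minimal Python sketch of Equation (4); the candidate words, score values, and the 0.7/0.3 weighting (the optimum found later, in Section 4.2) are illustrative rather than taken from the system's code.

```python
def combine(S_w, T_w, alpha=0.7, beta=0.3):
    """Equation (4): return the solution d_w* that maximizes
    alpha * S_w[d] + beta * T_w[d], with alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return max(S_w, key=lambda d: alpha * S_w[d] + beta * T_w[d])

# Illustrative log-domain scores for two candidates of one word: the text
# model alone prefers "wahmi", but the acoustic evidence overrides it.
S_w = {"wahum": -19.9, "wahmi": -52.5}   # acoustic log likelihoods
T_w = {"wahum": -8.3,  "wahmi": -6.1}    # logs of text-model probabilities
best = combine(S_w, T_w)                 # -> "wahum"
```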

3.2 Speech-based Diacritization

The acoustic model was trained on approximately 3 hours of spoken utterances, read out from selections of the 510 Al-Nahr articles in the training set described by Zitouni et al. (2006) and used by subsequent researchers (Habash and Rambow, 2007; Schlippe, 2008; Rashwan et al., 2011). The speech was recorded using native knowledge of correct diacritization.
The model was built using flat-start, 3-state, 24-component HMMs, based on the MFCC features extracted from the speech signals. 62 non-silent phones were modeled, comprising:
- The 28 Arabic consonants.
- 28 geminated variations of the above.
- The three short vowels (/a/, /i/, /u/).
- The two diphthongs (/ay/ and /aw/).
- An additional phone for the emphasized L. This is a pharyngealized form of the alveodental /l/, and is used uniquely in the pronunciation of the Arabic name for God, Allah. It was modeled as a separate phone since its pronunciation is notably different from the regular vocalization of /l/.
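A quick sanity check confirms that these classes account for all 62 modeled phones:

```python
consonants = 28   # the Arabic consonants
geminates  = 28   # one geminated variant per consonant
vowels     = 3    # /a/, /i/, /u/
diphthongs = 2    # /ay/, /aw/
emphatic_l = 1    # the pharyngealized /l/ of "Allah"
assert consonants + geminates + vowels + diphthongs + emphatic_l == 62
```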
The phones were converted into tied-state
context-dependent triphones, using decision tree
clustering based on the classification of Arabic
consonants by Elshafei (1991).
Once the acoustic model is trained, speech-based diacritization is a three-step process:
(1) Identifying w in the speech input. Given an undiacritized input word w, its corresponding utterance is identified in the speech input by extracting its word boundaries in time. We used HTK as our speech recognition toolkit, and extracted word boundaries by allowing the speech recognizer to decode the input speech and generate its results inclusive of segment times. This was done using the HVite command, by removing 'T' from the -o option so that segment times are included in the output.
(2) G2P. We build a rule-based grapheme-to-phoneme (G2P) layer to convert the speech input and each potential solution dw,i ∈ Dw into phonetic transcriptions compatible with the speech recognizer.
(3) Forced alignment. The acoustic signal corresponding to w is force-aligned against the phonetic transcription of each solution dw,i generated in Step (2). This is done in HTK by using the -a option of the HVite command, and setting the alignment to be between the transcription of dw,i and the timeframe of w, as identified by its boundaries in Step (1). From this alignment, the speech recognizer assigns an acoustic log likelihood score to dw,i using the Viterbi algorithm.
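A minimal sketch of Step (3) follows, driving HVite in forced-alignment mode from Python and totaling the per-segment log likelihoods from the output label file. The file names and the output-parsing details are illustrative assumptions, not our exact scripts.

```python
import re
import subprocess

def force_align_score(scp_file, words_mlf, dictionary, phone_list):
    """Force-align one candidate transcription against the utterance's
    features and return the total acoustic log likelihood (Step (3))."""
    subprocess.run(
        ["HVite",
         "-a",                 # align against the supplied transcription
         "-C", "config",       # front-end configuration (MFCC extraction)
         "-H", "hmmdefs",      # trained HMM definitions
         "-i", "aligned.mlf",  # time-aligned output labels with scores
         "-I", words_mlf,      # transcription of the solution d_w,i
         "-S", scp_file,       # feature file(s) for the word's timeframe
         dictionary, phone_list],
        check=True)
    # Aligned MLF lines look like "<start> <end> <label> <log likelihood>"
    # (scores appear when not suppressed via -o); sum the final field.
    total = 0.0
    with open("aligned.mlf") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 4 and re.fullmatch(r"-?\d+(\.\d+)?", fields[3]):
                total += float(fields[3])
    return total
```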
Grapheme-to-Phoneme Rules
For Step (2) in speech-based diacritization, a G2P layer is required prior to alignment. As BAMA solutions are encoded in Buckwalter transliteration, which is a one-to-one mapping from Arabic orthography to ASCII characters, the G2P layer applies orthographic rules and text normalization to convert the Buckwalter transliteration to phones. These rules include the following (a code sketch of a few of them appears after the list):
- Geminating solar letters that follow the definite article (/al/), and removing the /l/ consonant of the article.
- Appending the consonant /n/ for nunation.
- Converting the taa marbootah to /t/ if it has a diacritic, and to /h/ if not.
- Normalizing alef maksoorah and other forms of the long vowel alef (/aa/).
- Mapping the various orthographic forms of the glottal stop, hamza, to one phone.
- Eliminating alef if it is part of the article (/al/) and preceded by a prefix, or if it is the final letter of a word and preceded by an undiacritized /w/.
- Removing tatweel, the text-elongating glyph.
Additionally, we add the following constraints:
- Two consonants diacritized with sukoon may not follow one another.
- No diacritic may appear without a consonant.
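A minimal sketch of three of these rules, operating on Buckwalter-encoded strings, is shown below; the rule coverage and phone notation are simplified assumptions rather than the full G2P layer.

```python
import re

# Sun (solar) letters in Buckwalter transliteration.
SOLAR = set("tvd*rzs$SDTZln")

def g2p_fragment(bw):
    """Apply a few of the rules above to one Buckwalter-encoded word."""
    # Definite article + sun letter: drop the article's /l/ and geminate
    # the following consonant (gemination written here by doubling).
    m = re.match(r"Al(.)", bw)
    if m and m.group(1) in SOLAR:
        bw = "A" + bw[2:3] + bw[2:]        # e.g. 'Al$ams' -> 'A$$ams'
    # Nunation: the tanween diacritics surface as a short vowel plus /n/.
    bw = re.sub(r"[FNK]$",
                lambda t: {"F": "an", "N": "un", "K": "in"}[t.group()], bw)
    # Taa marbootah ('p'): /t/ when it carries a diacritic, /h/ otherwise.
    bw = re.sub(r"p(?=[aiuoFNK~])", "t", bw)
    bw = re.sub(r"p$", "h", bw)
    return bw
```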
Lexical Modeling
The alignment of each solution dw,i in Step (3) of speech-based diacritization is constrained by the lexical model. A phone-level dictionary is preferred over a word-level lexicon because of the agglutinative nature of Arabic: since numerous words may be composed out of flexible combinations of morphemes, the size of a word-level dictionary quickly becomes unmanageable.
With the above system in place, we may score the solutions in Dw to generate Sw. The acoustic scores from the forced alignment are normalized over the utterance of each word w, and the solution picked by the speech-based diacritizer, dsw*, is the one with the maximum log likelihood:

dsw* = argmax Sw[dw,i]    (5)
       dw,i ∈ Dw

3.3 Text-based Diacritization

The text-based diacritizer uses a sequence labeling approach and is modeled on Conditional Random Fields (CRFs), a stochastic method of segmenting and labeling data. Rather than describing joint probabilities, CRFs have the advantage of defining conditional probabilities between labels and observations (Lafferty et al., 2001). This relaxes the independence assumption inherent in HMMs, making CRFs more powerful at capturing context.
If we let X be the sequence of observed consonants and Y the sequence of diacritic labels, then the parameters λ of the model that maximize likelihood for the given training data T can be learnt as follows:

λ* = argmax_λ Σ_{(X,Y)∈T} log p(Y | X, λ)    (6)

where log p(Y | X, λ) = Σ_i λ_i f_i(y, x), and f_i is a feature function between y ∈ Y and x ∈ X.
In this framework, the length of the vowel and consonant sequences must be equal. We ensure that every consonant has a one-to-one mapping with a diacritic, as described by Schlippe (2008), which introduces a "no diacritic" label. Following his approach, our base model is trained on the following seven features: the current consonant; the current, previous and next words; and the part-of-speech (POS) tags for the current, previous and next words. The model also incorporates a vowel bigram. Additionally, the context window of training is ±4. Hence, if the current consonant is ci, then the above seven features are learnt for each of the 9 consonants in {ci-4, …, ci, …, ci+4}.
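For concreteness, the feature extraction can be sketched as follows; this is a python-crfsuite-style re-expression with assumed key names, since the exact CRF++ template is not reproduced here.

```python
def consonant_features(tokens, i):
    """Feature dict for the consonant at position i, where tokens is a
    list of (consonant, word, pos) triples, one triple per consonant."""
    feats = {}
    for off in range(-4, 5):                  # the +-4 context window
        j = i + off
        if 0 <= j < len(tokens):
            c, word, pos = tokens[j]
            feats[f"c[{off}]={c}"] = 1.0      # consonant identity
            feats[f"w[{off}]={word}"] = 1.0   # containing word
            feats[f"pos[{off}]={pos}"] = 1.0  # POS tag of that word
    return feats

# The vowel bigram corresponds to label-transition features, which CRF
# toolkits model directly (e.g. the 'B' template in CRF++).
```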
The POS tags were generated by the Stanford POS Tagger on a dataset of approximately 471,000 words extracted from Al-Nahr articles, available in the ATB3 corpus and Arabic Gigaword Fourth Edition. This same dataset was used to train the CRF model of the text-based diacritizer.
Once the CRF model is trained, the diacritizer takes in a word w as a sequence of raw Arabic consonants, C. Each consonant c ∈ C may be assigned fifteen possible labels: the 3 short vowels, the 3 tanweens, sukoon, combinations of the gemination diacritic with the short vowels and tanweens, and "no diacritic". Let V be this set of fifteen labels. The marginal probability of each diacritic v ∈ V per consonant c ∈ C may be computed. We refer to this probability as pv. CRF++, which we used to build our text model, calculates pv for each v ∈ V, given dw,i ∈ Dw, during the testing phase.
We now have the per-label marginal probabilities, and are interested in how likely the CRF model considers a solution dw,i to be. Let Vw,i be the sequence of labels drawn from V that dw,i proposes as the solution. We calculate the score of solution dw,i by summing the logs of the marginal probabilities associated with its labels:

Tw[dw,i] = Σ_{v ∈ Vw,i} log pv    (7)

The best diacritization solution in the text-based context, dtw*, is:

dtw* = argmax Tw[dw,i]    (8)
       dw,i ∈ Dw
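A minimal sketch of Equations (7) and (8), assuming the per-consonant marginals pv have already been extracted from CRF++ (the data structures below are illustrative):

```python
import math

def text_score(label_seq, marginals):
    """Equation (7): sum of log marginals over the labels proposed by
    d_w,i; marginals[c][v] is p_v for label v at consonant position c."""
    return sum(math.log(marginals[c][v]) for c, v in enumerate(label_seq))

def best_text_solution(candidates, marginals):
    """Equation (8): candidates maps each solution d_w,i to its label
    sequence V_w,i; return the argmax under the text model."""
    return max(candidates, key=lambda d: text_score(candidates[d], marginals))
```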

4 Weighted Linear Combinations: Evaluation and Experiments

4.1 Basic Combination

Text and speech cannot be directly combined. While the text can be normalized into consonant-vowel pairs for processing, this is not possible with the acoustic model's phonetic transcriptions.
              TEXT   9:1    8:2    7:3    6:4    5:5    4:6    3:7    2:8    1:9    SPEECH
DER CE        4.4    3.9    3.5    3.1    2.6    2.2    1.9    1.6    1.6    2.3    4.7
DERtrue no CE 2.1    2.0    1.8    1.7    1.6    1.5    1.4    1.3    1.4    2.0    4.4
DER no CE     1.7    1.6    1.5    1.4    1.3    1.2    1.1    1.0    1.1    1.6    3.5
WER CE        17.1   14.8   13.1   11.2   9.4    7.5    6.2    5.2    5.2    7.0    12.0
WER no CE     5.2    4.8    4.5    4.2    3.9    3.5    3.2    3.0    3.4    5.0    9.4

Table 2. Weighted interpolations of text and speech, using TEXT:SPEECH ratios. CE and no CE refer to error rates with and without case endings.
Taking an example from English, the word "phone" can be split into the pairs (p, no-vowel), (h, o), (n, e). However, there is no direct way to map these pairs to the word's phonetic transcription, /f oh n/, while maintaining consonant-vowel consistency.
To solve this problem, we select the word-level log likelihoods produced by the speech recognizer, as opposed to diacritic-level scoring by the text-based diacritizer (see Equation (7)). Since the acoustic scores are in the log domain, logs are taken of the text-based marginal probabilities. Tw and Sw are now compatible for interpolation.
Figure 3 presents an example of tuples from the sets T109839 and S109839. The set of solutions for word 109839 is D109839, which consists of 17 solutions, each with an ID, the sequence of consonants of word w (whm in this case), and a sequence of vowels. The vowel sequence for Solution 6 is aoi, shown in the example. A solution is stored with its text-based score to form the top tuple, (d109839,6, t109839,6) ∈ T109839, and with its acoustic score to form the bottom tuple, (d109839,6, s109839,6) ∈ S109839.

109839_6   whm   aoi   -19.86843
109839_6   whm   aoi   -52.45680

Figure 3. Sample tuples from the scored sets.

The scores are combined as in Equation (4) to produce a new score, a109839,6. Likewise, the scores from the other solutions are combined, producing a set of 17 scores, a109839,1, …, a109839,17, from which the best solution may be selected.

4.2 Evaluation

All the experiments in this section are tested using the same dataset proposed by Zitouni et al. (2006).¹ The evaluation metrics are also the same, with two points to note.

¹ 282 sentences were excluded from this set, as they contained one or more words with no solution selected in the corpus.

Firstly, current literature on diacritization uses all tokens from the established datasets in calculating error, but we exclude numbers and punctuation. This is because there is no variation in learning these tokens' diacritics; they all have "no diacritic". Hence including them in evaluation portrays a slightly optimistic measure of the true diacritization accuracy of Arabic words.
Secondly, when computing error without case endings, calculations are usually made after removing the diacritics on the final consonant of a word, but letting the final consonant remain. For the same reason as above, we suggest a new metric, DERtrue no CE, which excludes the final consonant from the calculations. We believe this more accurately brings out the difference between a system's diacritization performance with case endings and without them.
The error rates of the individual text-based and speech-based diacritizers are listed in the TEXT and SPEECH columns of Table 2. For both columns, DER CE is comparable. However, within the TEXT column, DERtrue differs from DER CE by 47.7%, while there is no significant reduction to DERtrue in SPEECH. This confirms the hypothesis that text features contribute to systematic inflectional errors, while acoustic features generate regular errors throughout the text. We further found that while the text-based diacritizer incorrectly predicted 13.6% of all case endings, the speech-based diacritizer only did so for 6.1%. This implies that most acoustically generated errors occur outside of case endings.
We experimentally derived the optimal values of the interpolation weights for TEXT and SPEECH. The results in Table 2 show that the interpolation consistently improves diacritization across all metrics, peaking at a TEXT weight of 0.3 and a SPEECH weight of 0.7 (the 3:7 column). This shows that acoustic information plays a significant role in lowering the error rates. Most importantly, the prediction of case endings improves substantially. Figure 4 shows the different solutions of the word whm chosen by the two diacritizers:
(a) 109839_6   whm   aoi
(b) 109839_9   whm   a u <empty>

Figure 4. Textually scored solution corrected by combination with acoustic score.

The text-based diacritizer initially selected (a), which is the sixth solution, d109839,6 ∈ D109839. Once combined with the acoustic score, the speech-based diacritizer's choice, (b) d109839,9, was finally selected. The correctly diacritized word is in fact wahum (<empty> corresponds to "no diacritic").
We can conclude from our experiments that while text and speech do complement each other, acoustic information is a crucial factor in canceling out the text-based errors.
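To make the metric variants of Section 4.2 precise, the following sketch computes DER under the three definitions, given parallel per-word lists of (consonant, diacritic) pairs; it assumes numbers and punctuation have already been filtered out, as proposed above.

```python
def der(ref, hyp, mode="CE"):
    """Diacritic error rate (%) over parallel reference/hypothesis words.
    mode='CE'        counts every consonant, case endings included;
    mode='noCE'      keeps the final consonant but ignores its diacritic;
    mode='true_noCE' (DERtrue) drops the final consonant from the count."""
    errors = total = 0
    for r_word, h_word in zip(ref, hyp):
        pairs = list(zip(r_word, h_word))
        if mode == "true_noCE":
            pairs = pairs[:-1]                  # exclude the case-ending slot
        for k, ((_, rv), (_, hv)) in enumerate(pairs):
            if mode == "noCE" and k == len(pairs) - 1:
                rv = hv = None                  # final diacritic not scored
            total += 1
            errors += rv != hv
    return 100.0 * errors / total if total else 0.0
```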

4.3 Text-based Model: SVMs versus CRFs

As opposed to the probabilistic framework of CRFs, Support Vector Machines (SVMs) are non-probabilistic linear classifiers (Burges, 1998). We compare the accuracy of SVM-based diacritization with our CRF-based method, using MADA (version 3.1) by Habash et al. (2009).
MADA uses 14 SVMs, each trained on a
different morphological feature, to predict that
specific feature in an input word (Roth et al.,
2008). The SVM predictions are weighted and
combined with a language model to score the
potential solutions provided by the analysis engine.
The final solution is that which scores the highest
on collective feature predictions.
To compare the performance of the SVM-based
approach with CRFs in our system, we used two
models, Tcrf and Tsvm, both of which selected
solutions provided by MADA (rather than
BAMA), for compatibility. We trained CRFs to
build Tcrf, with the same configurations in Section
3.3, on the training data described by Zitouni et al.
(2006). Tsvm is already available in MADA. The
models were used to score the solutions for each w
to produce the sets Twcrf and Twsvm. The best
diacritized solutions were selected according to
Equations (7) and (8). Table 3 lists the results.
The individual scored sets, Twcrf and Twsvm, were
then interpolated with speech for each word w. The
optimal combined values are reported in Table 4.

Both Tables 3 and 4 suggest that CRFs are superior to SVMs at modeling Arabic diacritic patterns.
               SVMs   CRFs
DER CE         5.4    4.8
DERtrue no CE  2.6    2.2
DER no CE      2.1    1.7
WER CE         20.1   18.4
WER no CE      6.4    5.3

Table 3. Text-based diacritization using CRFs vs. SVMs, before combining speech.
This is especially so given that the SVMs were trained on several linguistic features, while the only linguistic feature explicitly learned by the CRFs is POS. This can be understood through the long context and consonant-vowel conditional dependencies across words that are captured by CRFs.
               SVMs   CRFs
DER CE         2.8    2.0
DERtrue no CE  2.2    1.5
DER no CE      1.8    1.2
WER CE         8.7    6.5
WER no CE      5.5    3.6

Table 4. Text-based diacritization using CRFs vs. SVMs, after combining speech.
We further explored the power of CRFs by testing a model that did not explicitly learn any linguistic features. Using a reduced subset of the training data (~200K words), we built two models, Crf_base and Crf_pos, and tested them on the same test dataset used throughout this paper.
The configuration of Crf_pos and Crf_base was identical to our text model in Section 3.3. The one difference was that the training of Crf_base excluded morphological information; the current, previous and next POS features were removed. The two models were allowed to freely diacritize text, without being constrained by BAMA's solutions.
Surprisingly, the results in Table 5 show that, with the exception of WER CE, POS features offer little to no significant reduction in error, confirming the strength of CRFs in modeling diacritics without morphological information.

            Crf_base   Crf_pos
DER CE      5.4        5.4
DER no CE   3.5        3.4
WER CE      21.4       17.1
WER no CE   7.5        7.4

Table 5. CRF-based diacritization with and without learning linguistic features.

4.4 Base Solutions for Combined System: BWT versus MADA

BAMA generates morphological analyses for words using linguistic rules hard-coded in three tables: prefixes, stems and suffixes. Words are segmented into prefix-stem-suffix triples before they are analyzed against the corresponding tables.
In contrast to the surface form of words used above, MADA operates on their functional form, extending BAMA with a lexicon of lexeme and feature keys. When a word is input, it is analyzed not by its triples, but by its lexeme and feature keys (Habash, 2010).
On average, MADA produces two fewer solutions per word than BAMA. As a diacritizer, it produces close to the most accurate results in the literature. We therefore investigate its use in our combined system.
We evaluate our combined diacritizer on three
different sets of base solutions from which we
select dw*: BWT, MADA, ALL.
BWT is the set of solutions generated by BAMA,
MADA is the set generated by MADA, and ALL is
the union of BWT and MADA. The results are
shown in Figure 5.
Despite the superiority of MADA for diacritization and morphological analysis, using its analyses as the base solutions in a combined diacritizer causes a reduction in accuracy. This suggests that BAMA's less rigorous analysis produces solutions that more closely resemble the linguistic insensitivity of speech.
This gives the speech-based diacritizer the flexibility to choose solutions differently from the text-based diacritizer. This disparity in the choices of the two diacritizers is what combines the advantages of the text model's morphological sensitivity with the syntactic accuracy of the speech model. Note that the inclusion of MADA's solutions in ALL slightly confuses the diacritizers' choices as well.

Figure 5. Comparing error from three sets of base analyses in combined diacritization.

5 Conclusions and Future Work

This paper introduced speech as an important component of Arabic diacritization. A weighted combination of speech and text, with attention to case endings, was found to combine the strengths of both mediums and produce results superior to those of current systems. Importantly, inflectional diacritization error was significantly reduced.
Within the combined framework, the use of SVMs and CRFs was compared; CRFs yielded higher accuracy without using morphological information. An important comparison was also made between the solutions provided by the widely used MADA and BAMA analyzers. BAMA's solutions produced better results in our work.
The proposed system is a general framework with useful applications for multi-modal systems, such as the simultaneous production of audio and fully diacritized text books to aid language learners. More sophisticated underlying text or speech models can further reduce the combined error. Future work includes incorporating a factored language model into the interpolation layer, and additional investigations into CRFs for text-based diacritization.

References
Burges, Christopher J.C. 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, pp. 121-167. The Netherlands.
Elshafei, M. 1991. Toward an Arabic Text-to-Speech System. The Arabian Journal for Science and Engineering, vol. 16, pp. 565-583, Dhahran, Kingdom of Saudi Arabia.
Habash, Nizar and Rambow, Owen. 2007. Arabic
Diacritization Through Full Morphological Tagging.
In Proceedings of NAACL HLT 2007. Companion
Volume, Short Papers, New York, USA.
Habash, Nizar; Rambow, Owen; Roth, Ryan. 2009. MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt.
Habash, Nizar; Roth, Ryan; Rambow, Owen; Kulick, Seth; Marcus, Mitch. 2007. Determining Case in Arabic: Learning Complex Linguistic Behavior Requires Complex Linguistic Features. In Proceedings of EMNLP-CoNLL 2007, pp. 1084-1092, Prague.
Habash, Nizar Y. 2010. Introduction to Arabic natural
language processing (Synthesis Lectures on Human
Language Technologies). 1 ed., pp.70-72. Morgan &
Claypool.
Vergyri, Dimitra and Kirchhoff, Katrin. 2004. Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition. In COLING Workshop on Arabic-script Based Languages, Geneva, Switzerland.
Lafferty, John; McCallum, Andrew; Pereira, Fernando. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th ICML, Williamstown, MA.
Nelken, Rani and Shieber, Stuart M. 2005. Arabic Diacritization Using Weighted Finite-State Transducers. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, Michigan, USA.
Rashwan, M.A.A.; Al-Badrashiny, M.A.S.A.A.; Attia, M.; Abdou, S.M.; Rafea, A. 2011. A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Unfactorized Textual Features. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 166-175.
Roth, Ryan; Rambow, Owen; Habash, Nizar; Diab,
Mona; Rudin, Cynthia. 2008. Arabic Morphological
Tagging, Diacritization, and Lemmatization Using
Lexeme Models and Feature Ranking. In
Proceedings of the Association for Computational
Linguistics (ACL), Columbus, Ohio.

Schlippe, Tim; Nguyen, ThuyLinh; Vogel, Stephan. 2008. Diacritization as a Machine Translation Problem and as a Sequence Labeling Problem. In Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA), Hawai'i, USA.
Zitouni, Imed; Sorensen, Jeffrey S.; Sarikaya, Ruhi. 2006. Maximum Entropy Based Restoration of Arabic Diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 577-584, Sydney, Australia.
