Diacritization
First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Abstract
Predicting the diacritics in Arabic text is a necessary step in most Arabic NLP applications, but is challenging due to the language's complex syntax and morphology. The diacritization of case endings has particularly suffered from a higher error rate than the rest of the text. Current research focuses on solving the problem using textually inferred information alone. This paper proposes a novel approach: it explores the effects of combining speech input with a text-based model, allowing the linguistically insensitive information from speech to correct and complement the errors generated by the text model's predictions. The acoustic model is based on Hidden Markov Models and the textual model on Conditional Random Fields. We demonstrate that introducing speech to diacritization significantly reduces error rates across all metrics, especially case endings. We compare SVMs and CRFs for diacritization, and two widely used tools in the industry, BAMA and MADA, in the context of our system. The results in this paper are the most accurate reported to date, with diacritic and word error rates of 1.6% and 5.2% inclusive of case endings, and 1.0% and 3.0% exclusive of them.
Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain

1 Introduction
Consonants: ktb (without diacritics)

Pronunciation   Meaning
/kataba/        he wrote
/kattaba/       he made someone write
/kutubun/       books

Table 1. Possible diacritized readings of the undiacritized consonant string ktb.
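The ambiguity above can be illustrated as a lookup from one consonant skeleton to several readings (a hypothetical toy structure, not the paper's data format):

```python
# Hypothetical lookup from one undiacritized consonant skeleton to its
# possible readings (toy structure, not the paper's data format).
ktb_readings = {
    "kataba": "he wrote",
    "kattaba": "he made someone write",
    "kutubun": "books",
}
```

Without the diacritics, all three readings collapse to the same written form ktb, which is why downstream applications need the diacritics restored.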
2 Related Work
3 System Description

3.1 Overview

S_w = { x | x = (d_{w,i}, s_{w,i}), d_{w,i} ∈ D_w, w ∈ W }    (2)

a_{w,i} = α t_{w,i} + β s_{w,i},  d_{w,i} ∈ D_w,  where α + β = 1    (4)

[Figure 1: human speech and raw text are input; BAMA generates the candidate diacritizations, which the speech side scores as tuples (d_{w,1}, s_1) … (d_{w,n}, s_n) and the text model scores as tuples (d_{w,1}, t_1) … (d_{w,n}, t_n); weighted interpolation combines the paired scores per candidate, and the best diacritized solution d_w* is selected.]
Figure 1. Combined diacritization architecture.
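The interpolation and selection step can be sketched as follows (candidate IDs and scores are invented for illustration; the weighting shown is one possible setting):

```python
# Sketch of the weighted interpolation and selection of the best diacritized
# solution. Candidate IDs and scores are made up for illustration; alpha
# weights the text score, beta the speech score.

def best_solution(candidates, alpha=0.3, beta=0.7):
    """candidates: list of (solution_id, text_score, speech_score), log domain."""
    assert abs(alpha + beta - 1.0) < 1e-9  # the weights must sum to 1
    combined = [(sid, alpha * t + beta * s) for sid, t, s in candidates]
    return max(combined, key=lambda pair: pair[1])  # highest combined log score

cands = [("109839_6", -19.9, -52.5), ("109839_7", -25.0, -40.0)]
winner, score = best_solution(cands)
```

Because both score sets are indexed by the same candidate set D_w, the combination is a simple per-candidate weighted sum followed by an argmax.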
3.2 Speech-based Diacritization
d_{w,i} ∈ D_w    (5)
3.3 Text-based Diacritization
L(θ) = Σ_{(X,Y) ∈ T} log p(Y | X, θ)    (6)

Σ_{v ∈ V_{w,i}} log p(v | X, θ)    (7)

d_{w,i} ∈ D_w    (8)
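As a toy illustration of diacritic-level text scoring in the log domain (the marginal probabilities below are invented, not model outputs):

```python
import math

# Hypothetical diacritic-level text scoring: a word's text score taken as the
# sum of log marginal probabilities of its predicted diacritics. The marginal
# values below are invented for illustration.
def text_log_score(marginals):
    return sum(math.log(p) for p in marginals)

score = text_log_score([0.9, 0.8, 0.95])  # log-domain, comparable with acoustic scores
```

Working in the log domain keeps the text scores on the same scale as the acoustic log likelihoods, which is what makes the later interpolation possible.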
4 Evaluation and Experiments

4.1 Weighted Linear Combinations: Basic Combination
               TEXT   9:1   8:2   7:3   6:4   5:5   4:6   3:7   2:8   1:9  SPEECH
DER CE          4.4   3.9   3.5   3.1   2.6   2.2   1.9   1.6   1.6   2.3    4.7
DERtrue no CE   2.1   2.0   1.8   1.7   1.6   1.5   1.4   1.3   1.4   2.0    4.4
DER no CE       1.7   1.6   1.5   1.4   1.3   1.2   1.1   1.0   1.1   1.6    3.5
WER CE         17.1  14.8  13.1  11.2   9.4   7.5   6.2   5.2   5.2   7.0   12.0
WER no CE       5.2   4.8   4.5   4.2   3.9   3.5   3.2   3.0   3.4   5.0    9.4

Table 2. Weighted interpolations of text and speech, using TEXT:SPEECH ratios. CE and no CE refer to error rates with and without case endings.
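Table 2 can also be queried as data; this sketch finds the TEXT:SPEECH ratio with the lowest WER CE (column labels and values copied from the table):

```python
# Table 2 as data: find the TEXT:SPEECH ratio with the lowest WER CE.
cols = ["TEXT", "9:1", "8:2", "7:3", "6:4", "5:5", "4:6", "3:7", "2:8", "1:9", "SPEECH"]
wer_ce = [17.1, 14.8, 13.1, 11.2, 9.4, 7.5, 6.2, 5.2, 5.2, 7.0, 12.0]
best_ratio = cols[wer_ce.index(min(wer_ce))]  # index() returns the first minimum
```

The first minimum lands at the 3:7 ratio, matching the optimum reported in the text.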
Taking an example from English, the word phone can be split into the pairs (p, no-vowel), (h, o), (n, e). However, there is no direct way to map these pairs to the word's phonetic transcription, /f oh n/, while maintaining consonant-vowel consistency.

To solve this problem, we select the word-level log likelihoods produced by the speech recognizer, as opposed to diacritic-level scoring by the text-based diacritizer (see Equation (7)). Since the acoustic scores are in the log domain, logs are taken of the text-based marginal probabilities. T_w and S_w are now compatible for interpolation.

Figure 3 presents an example of tuples from the sets T_{109839} and S_{109839}. The set of solutions for word 109839 is D_{109839}, which consists of 17 solutions, each with an ID, the sequence of consonants of word w (whm in this case), and a sequence of vowels. The vowel sequence for Solution 6, shown in the example, is aoi. A solution is stored with its text-based score to form the top tuple, (d_{109839,6}, t_{109839,6}) ∈ T_{109839}, and with its acoustic score to form the bottom tuple, (d_{109839,6}, s_{109839,6}) ∈ S_{109839}.

109839_6  whm  aoi  -19.86843
109839_6  whm  aoi  -52.45680

Figure 3. Sample tuples from the scored sets.

The scores are combined as in Equation (4) to produce a new score, a_{109839,6}. Likewise, the scores from the other solutions are combined, producing a set of 17 scores, a_{109839,1}, …, a_{109839,17}, from which the best solution may be selected.

4.2 Evaluation

All the experiments in this section are tested using the same dataset¹ proposed by Zitouni et al. (2006). The evaluation metrics are also the same, with two points to note.

Firstly, current literature on diacritization uses all tokens from the established datasets in calculating error, but we exclude numbers and punctuation. This is because there is no variation in learning these tokens' diacritics; they all have no diacritic. Hence, including them in evaluation portrays a slightly optimistic measure of the true diacritization accuracy of Arabic words.

Secondly, when computing error without case endings, calculations are usually made after removing the diacritics on the final consonant of a word, but letting the final consonant remain. For the same reason as above, we suggest a new metric, DERtrue no CE, which excludes the final consonant from the calculations. We believe this more accurately brings out the difference between a system's diacritization performance with case endings and without them.

The error rates of the individual text-based and speech-based diacritizers are listed in the TEXT and SPEECH columns of Table 2. For both columns, DER CE is comparable. However, within the TEXT column, DERtrue differs from DER CE by 47.7%, while there is no significant reduction to DERtrue in SPEECH. This confirms the hypothesis that text features contribute to systematic inflectional errors, while acoustic features generate regular errors throughout the text. We further found that while the text-based diacritizer incorrectly predicted 13.6% of all case endings, the speech-based diacritizer only did so for 6.1%. This implies that most acoustically generated errors occur outside of case endings.

We experimentally derived the optimal values of the interpolation weights, α for TEXT and β for SPEECH. The results in Table 2 show that the interpolation improves diacritization consistently across all metrics, until it peaks at α = 0.3 and β = 0.7. This shows that acoustic information plays a significant role in lowering the error rates. Most importantly, the prediction of case endings improves substantially. Figure 4 shows the

¹ 282 sentences were excluded from this set, as they contained one or more words with no solution selected in the corpus.
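The three error rates discussed above can be sketched as follows (the alignment data is invented, not drawn from the evaluation set):

```python
# Toy illustration of the three error rates discussed above. Each word is a
# list of (gold_diacritic, predicted_diacritic) pairs, one per consonant;
# the data below is invented, not drawn from the evaluation set.

def der(words, mode="CE"):
    errors = total = 0
    for word in words:
        for i, (gold, pred) in enumerate(word):
            last = i == len(word) - 1
            if last and mode == "true_no_CE":
                continue            # DERtrue no CE: drop the final consonant entirely
            if last and mode == "no_CE":
                total += 1          # no CE: the consonant is kept, but its diacritic
                continue            # is stripped on both sides, so it cannot err
            total += 1
            errors += gold != pred
    return errors / total

words = [
    [("a", "a"), ("o", "o"), ("u", "i")],   # case-ending error on the last consonant
    [("a", "u"), ("i", "i")],               # word-internal error
]
```

On this toy data the conventional "no CE" rate is diluted by the always-correct final positions, while the "true no CE" variant removes them from the denominator, which is the effect the proposed metric is meant to capture.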
4.3
            Crf_base   Crf_pos
DER CE         5.4        5.4
DER no CE      3.5        3.4
WER CE        21.4       17.1
WER no CE      7.5        7.4

Table 5. CRF-based diacritization with and without learning linguistic features.
4.4
References
Burges, Christopher J.C. 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, pp. 121-167. The Netherlands.
Elshfaei, M. 1991. Toward an Arabic Text-to-speech
System. The Arabian Journal for Science and